High Performance Python Through Workload Acceleration with OMR JitBuilder

by

Dayton J. Allen

Bachelor of Science, Computer Science University of the West Indies, 2017

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Computer Science

In the Graduate Academic Unit of Computer Science

Supervisor(s): David Bremner, PhD, Computer Science
Examining Board: Michael Fleming, PhD, Computer Science, Chair; Suprio Ray, PhD, Computer Science
External Examiner: Monica Wachowicz, PhD, Geodesy & Geomatics Engineering

This thesis is accepted by the Dean of Graduate Studies

THE UNIVERSITY OF NEW BRUNSWICK
June, 2020
© Dayton J. Allen, 2020

Abstract

Python remains one of the most popular programming languages in many domains, including scientific computing. Its reference implementation, CPython, is by far the most used version. CPython's runtime is bytecode-interpreted and leaves much to be desired when it comes to performance. Several attempts have been made to improve CPython's performance, such as reimplementing performance-critical code in a higher-performance language (e.g., C, C++, Rust), or transpiling Python source code to a higher-performance language that is then called from within CPython through some form of FFI mechanism. Another approach is to JIT compile performance-critical Python methods, or to utilize alternate implementations that include a JIT compiler. JitBuilder provides a simplified interface to the underlying compiler technology available in Eclipse OMR. We propose using JitBuilder to accelerate performance-critical workloads in Python. By creating Python bindings to JitBuilder's public interface, we can generate native code callable from within CPython without any modifications to its runtime. Results demonstrate that our approach rivals, and in many cases outperforms, state-of-the-art JIT-compiler-based approaches in the current ecosystem, namely Numba and PyPy.

Acknowledgements

First and foremost, I would like to thank my thesis supervisor Dr. David Bremner for his guidance and patience during my thesis project. Many times I have pivoted or ventured off into some alternate project and he would nudge me back on course.

I would also like to thank all my CASA colleagues and IBM contacts that provided valuable help and feedback on my project.

Thanks to all my Canadian friends that helped me to navigate being in a foreign country, thousands of kilometres away from home and family.

Thanks to my family for their endless support.

Last but not least, thanks to everyone I know outside Computer Science that showed interest in my thesis project and kept asking questions. I know this is not easy.

Table of Contents

Abstract ii

Acknowledgments iii

Table of Contents iv

List of Tables vii

List of Figures viii

Abbreviations ix

1 Introduction ...... 1
1.1 Motivation ...... 2
1.2 Contributions ...... 4
1.3 Organization ...... 4

2 Background & Related Work ...... 6
2.0.1 CPython's Performance Woes ...... 6
2.0.1.1 Bytecode Optimizations or the Lack Thereof ...... 6
2.1 Making CPython Faster ...... 7
2.1.1 C Extension Modules ...... 7
2.1.2 Transpilers ...... 7
2.1.3 JIT Compilation & Other Implementations ...... 8
2.1.3.1 Repurposed JIT Compilers ...... 9
2.1.3.2 Bolt-On JIT Compilers ...... 10
2.2 Regular Expressions ...... 12
2.3 JitBuilder ...... 13

3 Python Bindings to JitBuilder ...... 17
3.1 JitBuilder's Polyglot Ambitions ...... 17
3.1.1 Python Language Bindings ...... 20
3.1.1.1 Mapping C & C++ Language Features to Python ...... 22
3.1.2 Autojit ...... 23
3.1.2.1 Workflow ...... 26

4 Regular Expression Engine ...... 30
4.1 Design & Implementation ...... 30
4.1.1 Bindings to Rust's RegEx bytecode compiler ...... 31
4.1.2 RegEx Interpreter ...... 35
4.1.2.1 Regular Expressions & Finite Automata ...... 35
4.1.2.2 Matching Engine ...... 37
4.1.3 Specializing Regex Compiler ...... 39

5 Evaluation ...... 44
5.1 Experimental Setup ...... 44
5.2 Testing our Implementations ...... 45
5.3 Discussion ...... 46
5.3.1 AutoJIT ...... 46
5.3.2 RegEx Engine ...... 52

6 Conclusions & Future Work ...... 58
6.1 Conclusions ...... 58
6.2 Future Work ...... 61

Bibliography ...... 70

A Appendix A ...... 71
A.1 Code Repositories ...... 71

Vita

List of Tables

4.1 Summary of Supported Regex Features ...... 34
4.2 Regex Engine Matching (h|H)ello against "Hello World". See Section 4.1.1 for a discussion of the bytecodes used in the table ...... 39

5.1 Relative Speedups versus CPython on our Iterative Fibonacci Benchmark ...... 47
5.2 Relative Speedups versus CPython on our Recursive Fibonacci Benchmark ...... 47
5.3 Relative Speedups versus CPython on our Dot Product Benchmark ...... 48
5.4 Relative Speedups versus CPython on our Mandelbrot Benchmark ...... 49
5.5 Relative Speedups versus CPython on our Matrix Multiplication Benchmark ...... 50
5.6 Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #1 ...... 54
5.7 Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #2 ...... 56
5.8 Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #3 ...... 56

A.1 Code Repositories ...... 71

List of Figures

3.1 JitBuilder Client versus Implementation API ...... 18
3.2 C API Virtual Function Emulation [5] ...... 21
3.3 Auto JIT Workflow ...... 27
3.4 Lowering 2D Arrays to JitBuilder ...... 29

4.1 Partial NFA for matching a single character: c ...... 35
4.2 Partial NFA that describes the alternation: n1|n2 ...... 35
4.3 Partial NFA that describes matching n once or zero times: n? ...... 36
4.4 Partial NFA that describes matching n zero or more times: n* ...... 36
4.5 Partial NFA that describes matching n one or more times: n+ ...... 36
4.6 Partial NFA that describes the concatenation: n1n2 ...... 36
4.7 (h|H)ello NFA Representation ...... 37
4.8 Regex Matching Engine Flowchart ...... 40

5.1 Iterative Fibonacci Benchmark ...... 46
5.2 Recursive Fibonacci Benchmark ...... 47
5.3 Dot Product Benchmark ...... 48
5.4 Mandelbrot Benchmark ...... 49
5.5 Matrix Multiplication Benchmark ...... 50
5.6 Compilation Times: Python+JitBuilder versus Numba ...... 51
5.7 Regex Benchmark Set #1 ...... 54
5.8 Regex Benchmark Set #2 ...... 55
5.9 Regex Benchmark Set #3 ...... 57

List of Symbols, Nomenclature or Abbreviations

JIT  Just-in-Time Compiler
Regex  Regular Expression
IL  Intermediate Language
IR  Intermediate Representation
DFA  Deterministic Finite Automata
NFA  Non-Deterministic Finite Automata
FFI  Foreign Function Interface
JVM  Java Virtual Machine
CLR  Common Language Runtime
ASDL  Abstract Syntax Description Language
MSIL  Microsoft Intermediate Language
WIP  Work in Progress
CASA  Centre of Advanced Studies - Atlantic
VARARGs  Variable Arguments
IP  Instruction Pointer
SP  String Pointer
PC  Program Counter
LIFO  Last in, First out
STDLIB  Standard Library
VM  Virtual Machine
API  Application Programming Interface
CPU  Central Processing Unit
AST  Abstract Syntax Tree
EOL  End of Life
RRC  Rust's Regex Crate
ε  Epsilon

Chapter 1

Introduction

Python [44] is a high-level, dynamically-typed programming language. It is currently one of the most popular programming languages, a feat that can be attributed to its simplicity and relatively gentle learning curve. These attributes, together with its rich ecosystem of high-quality libraries, make it an attractive option within many domains. Despite its many strengths, one of Python's major weaknesses is the relatively poor performance of its reference implementation (CPython). This lack of performance is mainly due to its runtime being bytecode-interpreted, as well as its dynamic nature. This dynamism renders many compiler optimizations infeasible.

Several attempts have been made to overcome CPython's performance limitations, such as developing native C extension modules to the CPython runtime. C extension modules necessitate a deviation from writing pure Python code by requiring code to be developed in a lower-level programming language (C/C++). This can be both a tedious and error-prone activity, even for experts. A related approach is to implement general-purpose performance-critical code in high-performance languages (e.g., C++, C, Rust) and interface with Python through some form of foreign function interface (FFI) layer. An FFI is an interface that allows one to call out to code developed in one programming language from another, potentially unrelated, programming language. In the end, this also leads to a C extension module, but the difference is that the module is a binding to functionality from the general-purpose library rather than a fully Python-specific module. This layer of indirection somewhat positions Python as a frontend to high-performance libraries rather than a language one can implement high-performance code in. Finally, we can also improve the performance of CPython by employing just-in-time (JIT) compilers. Aycock [4] describes JIT compilation as compilation that occurs after a program begins execution. This allows one to collect information during execution to better guide compilation and optimization decisions. Eclipse OMR [23] provides a rich set of language-agnostic, reusable components for implementing high-performance programming language runtimes. OMR historically stood for "Open Managed Runtimes", but the name is no longer interpreted as an acronym and has persisted despite the project not being a managed runtime. Amongst the components that OMR provides is a compiler based on the IBM J9 Java Virtual Machine (JVM). Despite its JVM roots, the OMR compiler component is also programming-language agnostic. Another component, and perhaps the most pertinent to this thesis, is JitBuilder. JitBuilder is a native acceleration library that provides a simplified interface to the underlying compiler technology available in OMR.
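To make the FFI idea concrete, the sketch below uses Python's built-in ctypes module to call fabs from the C math library. This is an illustrative example, not code from the thesis; the library name is platform-dependent, so it is located via ctypes.util.find_library with a fallback to the process's own symbol table.

```python
import ctypes
import ctypes.util

# Locate the C math library; the name varies by platform.
path = ctypes.util.find_library("m")
libm = ctypes.CDLL(path) if path else ctypes.CDLL(None)

# Declare the C signature so ctypes marshals arguments correctly:
# double fabs(double);
libm.fabs.argtypes = [ctypes.c_double]
libm.fabs.restype = ctypes.c_double

print(libm.fabs(-2.5))  # 2.5
```

The same mechanism is what lets a C extension module, or natively compiled code in general, be invoked from ordinary Python code.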

1.1 Motivation

The current approaches to accelerating Python code leave room for improvement. Approaches that deviate too much from the CPython runtime [37] struggle with adoption due to not supporting CPython's C application programming interface (API), which many popular Python libraries are based on. Approaches based on repurposed virtual machines (VMs) [1, 16] have failed to produce significant performance improvements (e.g., Jython1, IronPython2). Castanos et al. [16] found that the main reason for this is not focusing sufficiently on specialization, but instead relying on traditional optimization techniques, which do not translate as well to dynamically-typed programming languages. Some approaches have been discontinued due to architectural constraints that hinder easy evolution [36]. Lastly, some approaches failed to generate an audience and/or required significant effort to maintain compatibility with CPython and were subsequently discontinued or are no longer actively developed [26, 31].

The research goal is to develop Python bindings to JitBuilder and evaluate whether JitBuilder can be used as a more accessible approach to accelerating workloads in Python. With bindings implemented as a Python C extension module, we are able to utilize JitBuilder from within CPython. Consequently, we are able to bypass the aforementioned limitations of current approaches by allowing us to generate native code, callable from CPython without changes to its runtime. The benefits of this approach are twofold:

1. There is no compatibility breakage with existing libraries that use the CPython C API.

2. A low barrier to entry: users learn a new API rather than a new programming language and build tools.

This is also the approach used by Numba [28] and PyLibJit [6]. The difference, however, is that Numba focuses on accelerating numerical computation in Python and is not a general solution for accelerating all types of workloads. It is also based on the LLVM JIT compiler. PyLibJit utilizes GNU's libjit library to compile Python functions to native code. Before the project was abandoned, it was able to

1https://www.jython.org/ 2https://ironpython.net/

support a large subset of Python. Much of this support was due to PyLibJit's reliance on the CPython interpreter for fallback computation. This reliance often resulted in negligible performance improvements, due to operations being performed on boxed Python objects.

1.2 Contributions

This thesis contributes an investigation of using JitBuilder as a more accessible means of accelerating workloads in Python. To this end, using Python and JitBuilder, we developed and evaluated the performance characteristics of:

1. Several compute-intensive numerical benchmarks.

2. A specializing regular expression to native code compiler.

The first of these (the numerical benchmarks) allows us to evaluate the case of small-scale workloads, on the order of single functions, whereas the latter (the regex compiler) allows us to evaluate the case of a larger-scale workload. The latter workload also presents the opportunity to evaluate the potential performance benefits of specialized, compiled-to-native-code regular expressions.

Results demonstrate that JitBuilder can generate native code that yields performance improvements for various workloads in Python, all while maintaining a low barrier to entry compared to current state-of-the-art approaches in the Python ecosystem. Whether it is beneficial to generate specialized native code from a regex is inconclusive, due to unquantifiable overhead in our experimental setup.

1.3 Organization

The remainder of this thesis is organized as follows:

• Chapter 2: Provides a survey of the literature surrounding accelerating Python code. We discuss current state-of-the-art approaches as well as seminal work in the Python ecosystem. We also provide some background on regular expressions and JitBuilder.

• Chapter 3 : Discusses our approach to implementing Python bindings to the JitBuilder library.

• Chapter 4 : Details the implementation of our specializing regular expression to native code compiler.

• Chapter 5 : Provides a description of our experimental setup as well as an illustration and discussion of our experimental results.

• Chapter 6 : Concludes this study and provides directions for future work.

Chapter 2

Background & Related Work

2.0.1 CPython’s Performance Woes

CPython's performance limitations stem mainly from its dynamism. To illustrate: in CPython, operations are dispatched dynamically and arithmetic operations are performed on boxed objects. Boxing refers to the fact that all values in Python are objects (PyObjects) that do not map directly to machine types, but are instead wrappers stored as pointers to objects on the heap. These PyObjects can be assigned to, inspected and modified, which provides great expressiveness but does so at the expense of performance. This dynamism, coupled with CPython's extensive introspective and reflective capabilities, contributes significant overhead to its runtime [19].
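The cost of boxing is easy to observe from Python itself: even a small integer is a full heap-allocated PyObject carrying a reference count and a type pointer, several times the size of a native machine word. A quick illustration:

```python
import sys

# A native int64 occupies 8 bytes; a Python int is a boxed PyObject.
print(sys.getsizeof(0))        # roughly 24-28 bytes in CPython, not 8
print(sys.getsizeof(10**100))  # arbitrary precision: larger still

# Every value carries an inspectable type object at runtime.
x = 1
print(type(x), isinstance(x, object))
```

Arithmetic on such values requires dereferencing the object, dispatching on its type, and allocating a new boxed result, which is where much of the interpreter overhead comes from.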

2.0.1.1 Bytecode Optimizations or the Lack Thereof

Quoting Guido van Rossum, Python's creator: "Python is about having the simplest, dumbest compiler imaginable." [8] It therefore comes as no surprise that the Python bytecode compiler performs very few optimizations. Simple optimizations, such as removing unreachable code, are performed, as well as more Python-specific optimizations such as tuple rearrangement, which avoids the overhead of building and unpacking a tuple when performing multiple assignments. Montanaro [32] showed that these optimizations yielded no significant improvements to performance and suggested more radical changes to the CPython runtime, such as using a register-based VM instead of a stack-based VM. Python's dynamic nature also renders more powerful optimizations either difficult to implement or flat-out impossible.
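The few optimizations CPython's compiler does perform can be observed with the standard dis module; constant expressions, for example, are folded at compile time, and small multiple assignments avoid tuple construction:

```python
import dis

def six():
    return 2 * 3  # constant-folded to 6 at compile time

def swap(a, b):
    a, b = b, a   # rearranged to avoid building/unpacking a tuple
    return a, b

dis.dis(six)  # disassembly shows the folded constant; no runtime multiply
print(6 in six.__code__.co_consts)
```

Beyond these peephole-level rewrites, the bytecode is a near-literal translation of the source, leaving the heavy lifting to the interpreter loop.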

2.1 Making CPython Faster

2.1.1 C Extension Modules

One approach to accelerating Python code is by writing a CPython C extension module. C extension modules allow developers to extend CPython with new high-performance modules written in C or C++. The CPython C API provides the necessary functionality to access various components of its runtime. The downside to this approach, however, is that one has to resort to a lower-level programming environment to achieve speedups. Another, less obvious, downside is that portability is sacrificed, since C extension modules are coupled to the C API of the CPython implementation. This renders C extension modules impractical if compatibility across Python implementations is desired.

2.1.2 Transpilers

Transpilation is another approach to speeding up Python code. The basic idea behind approaches in this category is to translate Python code to a higher-performance programming language, then compile down to a C extension module in order to achieve speedups. Cython [7], perhaps the most popular in this category, is an optimising static compiler for both Python and the Cython programming language. Cython code is a superset of Python code and includes additions such as type declarations. Cython also supports calling C functions and can be used to embed CPython into existing applications or to wrap external C libraries. Cython code is compiled to C code, which is then compiled to a C extension module available for import into the CPython runtime. Being a superset of normal Python code, Cython eases the learning curve necessary to implement C extension modules. The downside, however, is that Cython adds technical challenges in the form of an additional programming language and a build phase for Cython code.

Pythran [24], another optimising static compiler, focuses on scientific computation and supports many of Python's higher-level constructs such as list, set and dict comprehensions, lambda functions and generator expressions. There is no support for dynamic features such as polymorphic variables. Pythran converts Python source code to the Pythran intermediate representation (IR), which is a subset of the Python abstract syntax tree (AST). Various optimizations are performed on this IR, which is then translated into C++ code. The C++ code is subsequently compiled to a C extension module. Pythran primarily supports Python 2, which reached end-of-life (EOL) on January 1st, 2020, and touts "decent" Python 3 support. Similar to Pythran, Shedskin1 compiles Python code to C++ code, which is then compiled to a C extension module. This project only supports Python 2 and is not being actively developed. Nuitka2 is a Python compiler written in Python. It translates Python code to a C-level program that is linked against libpython and compiled to an executable or a C extension module. HOPE [3] is a method-at-a-time JIT compiler for a subset of Python. It focuses on accelerating Python code for astrophysical computations. The approach taken by HOPE is to translate Python source code into C++ that is then compiled to a C extension module and loaded into the interpreter.
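As an illustration of the kind of annotation these tools rely on, the plain Python function below is valid input to most of them; the commented variant sketches the extra Cython type declarations (cdef/cpdef are Cython keywords, not Python) that let the generated C avoid boxed arithmetic. This is a hedged sketch, not an example from the thesis.

```python
# Plain Python: every value below is a boxed PyObject.
def dot(xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

# An equivalent Cython version (in a .pyx file) would declare C types
# so the generated C code can loop over unboxed doubles, e.g.:
#
#   cpdef double dot(double[:] xs, double[:] ys):
#       cdef double total = 0.0
#       cdef Py_ssize_t i
#       for i in range(xs.shape[0]):
#           total += xs[i] * ys[i]
#       return total

print(dot([1.0, 2.0], [3.0, 4.0]))  # 11.0
```

The speedup in such tools comes almost entirely from those declarations: without them, the generated C still manipulates PyObjects through the C API.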

2.1.3 JIT Compilation & Other Implementations

PyPy [9, 10, 11] is another implementation of Python, one that uses a tracing JIT compiler. The approach used for tracing is unusual, in that the interpreter running the

1https://shedskin.github.io/ 2https://nuitka.net/

user program is traced instead of tracing the user program directly. To facilitate this approach, PyPy uses two interpreters: one for running the user program (the language interpreter) and another used by the JIT compiler for tracing (the tracing interpreter) [12]. The tracing JIT works by compiling the hot paths within commonly executed loops to machine code. Pyston was intended to be another Python implementation but is no longer actively maintained due to the effort required to maintain compatibility with CPython, amongst other reasons [31]. The Pyston developers sought to replicate the improvements achieved in the JavaScript ecosystem in Python through JIT compilation and other techniques. They were able to achieve up to 95% speedups on standard CPython benchmarks. The compilation pipeline consisted of translating Python code to LLVM IR, which was then run through the LLVM JIT to generate native code. Before active development ceased, Pyston supported a subset of Python 2, with Python 3 compatibility in the works. RustPython3 is yet another implementation of Python, written in the Rust programming language [30] and built with memory safety in mind. This project is currently in its infancy and is not suitable for production workloads.

2.1.3.1 Repurposed JIT Compilers

Castanos et al. [16] described the repurposed JIT phenomenon as occurring when the JIT compiler from an existing statically-typed programming language is repurposed to compile code in another language. This is a particularly attractive option given the complexity and engineering effort required to build a JIT compiler from scratch. However, as shown by Castanos et al. and Ishizaki et al. [16, 25], repurposed JIT compilers have failed to achieve significant performance improvements and still lag behind JIT compilers built with the specific language in mind, such as PyPy. Furthermore, Adams et al. [1] found that, for languages that are sufficiently different, a repurposed JIT compiler limits performance relative to a custom JIT, despite saving significant engineering effort. Examples of repurposed JIT compilers for Python include Jython, which is based on the JVM, and IronPython, which is based on the Common Language Runtime (CLR). In the pedantic sense, these can also be thought of as transpilers, since Python code is transpiled to JVM bytecode or Microsoft Intermediate Language (MSIL) and execution is forwarded to the respective VMs.

3https://github.com/RustPython/RustPython

2.1.3.2 Bolt-On JIT Compilers

We define bolt-on JIT compilers as the class of JIT compilers that aim to extend and/or supplement an existing programming language runtime without modifying it. Approaches in this category utilize existing JIT compilers to generate native code, callable from within a target programming language's runtime through some form of FFI layer. Several of these exist in the Python ecosystem. Psyco [36] was a specializing JIT compiler layered on CPython. The specializing comes from the fact that several versions of a compiled code block are written, with each version optimized by being specialized to some types of variables. Listing 2.1 demonstrates a simple example where fn could be specialized for integer or float multiplication depending on the type of argument it was called with. Psyco uses partial evaluation to achieve JIT compilation. The lift operator specifies that a compile-time value should be forgotten, i.e., considered a runtime value. This leads to greater generality by not necessitating constraints on Python's dynamic type system. The specializer performs the actual compilation. It does so in a top-down fashion: compilation starts from general inputs, then the unlift operator is gradually applied to obtain more information and make the compiled code more specialized. Early benchmarks demonstrated that Psyco was able to provide significant speedups over CPython. However, due to tight coupling with the Python interpreter, and the fast pace at which Python was evolving, the project was discontinued, since its architecture made evolution unfeasible. Lam et al. [28] proposed Numba, a Python JIT compiler for optimizing the performance of numerical code written in Python.

def fn(n):
    return n * n

Listing 2.1: Psyco Specialization Example.

Numba supports a subset of Python and employs the LLVM JIT compiler to generate native code. In this approach, Python bytecode is converted to Numba IR. Type inference is then performed on this IR. If all the types can be inferred, the IR is passed to LLVM for compilation to native code. In this case, termed No-Python mode, Numba is able to achieve significant speedups on numerical code. If not all the types are inferable, Numba defaults to its fallback case, which is to assume that all values in the function are Python objects. The Python C API and runtime are then used for execution. PyLibJit [6] is a Python library used for the load-time generation of machine code in Python. PyLibJit builds on top of Python bindings to the GNU libjit JIT compiler library. Before the project was discontinued, it was able to support a large subset of Python due to PyLibJit's reliance on the CPython interpreter. The cost of this, however, is that in many cases speedups are negligible, since operations are performed on boxed objects and execution has to fall back to the Python interpreter. Significant performance improvements were noted in functions involving mostly arithmetic operations that PyLibJit was able to perform on unboxed objects. Similar to Numba, Parakeet [38] is an LLVM-based JIT compiler for Python. It focuses on an array-oriented subset of Python and was able to achieve performance improvements. However, the project was later discontinued because its approach of whole-function type specialization at the AST level was not scalable enough to support a sufficiently large subset of Python. Jax [14] is a JIT compiler for the automatic differentiation of native Python and

NumPy functions. It uses XLA4 to compile and run NumPy programs on GPUs and TPUs. Another project in its infancy worth mentioning is Pyjion5, which is CoreCLR-based and makes use of the new frame evaluation API [15] in CPython. The frame evaluation API allows a per-interpreter function pointer to handle the evaluation of frames; this eases the complexity of introducing a method-based JIT compiler to the CPython runtime, amongst other things.

A common theme amongst the approaches in this category is that the greatest performance improvements are achieved when Python types are specialized to machine types [16]. Even though this limits dynamism and can trigger recompilation overhead, it was found that types rarely change in practice [21, 46]. This shows that specialization remains a powerful tool, an insight that permeates the dynamically-typed programming language ecosystem [33, 35, 40, 41, 47, 49].
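A toy illustration of the idea, in pure Python (helper names are invented for this sketch, not taken from any of the systems above): a dispatcher observes the argument's type, caches one implementation per observed type, and falls back to a generic path for anything else, mimicking how a specializing JIT keys compiled code on runtime types.

```python
def specialize(generic_fn, specialized_impls):
    """Dispatch to a per-type cached implementation, mimicking how
    specializing JITs select compiled code based on observed types."""
    cache = {}

    def dispatcher(x):
        key = type(x)
        if key not in cache:
            # "Compile" (here: merely select) a specialized version once.
            cache[key] = specialized_impls.get(key, generic_fn)
        return cache[key](x)

    dispatcher.cache = cache
    return dispatcher

square = specialize(
    generic_fn=lambda n: n * n,
    specialized_impls={int: lambda n: n * n, float: lambda n: n * n},
)

print(square(3))          # int-specialized path
print(square(2.5))        # float-specialized path
print(len(square.cache))  # one cache entry per observed type
```

A real JIT would emit unboxed machine code for each cache entry; the dispatch-on-observed-type structure is the part this sketch shares with Psyco and Numba.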

2.2 Regular Expressions

A regular expression (regex) is a notation used for describing a regular language. Regexes are powerful string pattern matching constructs used in many areas of Computer Science. For example, in cybersecurity they are used for packet content scanning [48] and input validation. The literature describes two major approaches to regular expression matching: 1) backtracking approaches, and 2) approaches based on finite automata. Both techniques come with performance implications and, in the general case, finite automata techniques are superior performance-wise. In terms of algorithmic runtimes, backtracking implementations have a worst-case exponential runtime O(2^n), whereas techniques based on finite automata have a worst-case super-linear runtime O(mn), where m represents the size of the regular expression and n represents the size of the string being matched against. Techniques based on finite automata exploit the fact that regular expressions and finite automata, whether deterministic or non-deterministic, are of equal expressive power. This argument is formalized by Kleene's theorem [27]. Over time, several new constructs have been added to the main regex standards in use today (POSIX & PCRE). Amongst these features are backreferences. In the theoretical sense, regular expressions that include backreferences are not regular expressions and can match non-regular languages. Consequently, finite automata cannot match regular expressions that include backreferences. In practice, backreferences rarely occur [17], so finite-automata-based matching engines can be used for the majority of regular expressions in use today. Backtracking approaches, however, can match regular expressions that include backreferences; perhaps this is what is preventing their ultimate demise. It is also important to note that there is currently no efficient implementation of backreferences, and the problem has been found to be NP-Hard [22].

4https://www.tensorflow.org/xla 5https://github.com/microsoft/Pyjion
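To make the automata approach concrete, here is a minimal, hedged sketch (not the engine of Chapter 4) of NFA simulation: the matcher tracks the set of states reachable after each input character, so each character is processed once and the runtime stays within O(mn) rather than exponential. The hand-built NFA below recognizes (h|H)ello, the same pattern used in Table 4.2.

```python
# Each state maps an input character to its set of successor states.
# This tiny hand-built NFA matches the full string (h|H)ello.
NFA = {
    0: {"h": {1}, "H": {1}},
    1: {"e": {2}},
    2: {"l": {3}},
    3: {"l": {4}},
    4: {"o": {5}},
    5: {},  # accepting state
}
ACCEPT = {5}

def nfa_match(nfa, accept, text):
    states = {0}
    for ch in text:
        # Advance every live state on ch; dead states simply drop out.
        states = {nxt for s in states for nxt in nfa[s].get(ch, set())}
        if not states:
            return False
    return bool(states & accept)

print(nfa_match(NFA, ACCEPT, "Hello"))  # True
print(nfa_match(NFA, ACCEPT, "hello"))  # True
print(nfa_match(NFA, ACCEPT, "jello"))  # False
```

A backtracking engine would instead explore one alternative at a time and rewind on failure, which is where the exponential worst case comes from.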

2.3 JitBuilder

Eclipse OMR is a collection of reusable, language-agnostic components to aid in the development of new or existing programming language runtimes. It was built on the premise that programming language runtimes share several common denominators that can be abstracted out and reused. To illustrate: Go, C# and Java all have garbage collectors, and both C# and Java have a JIT compiler. Amongst the components that OMR provides is JitBuilder, a native acceleration library with a high-level interface to OMR's underlying compiler technology [29]. JitBuilder exposes an API that encapsulates the details of the underlying intermediate representation used by OMR's compiler component. It can be used to write a JIT compiler or to accelerate any type of workload.

A summarized list of features given by Stoodley [42] is provided below:

• Data Types: Sized integers, floats, doubles and their vector equivalents. Array and pointer types. Support for both signed and unsigned type conversions.

• Arithmetic: Add, Mul, Div, Sub, And, Or, Xor and signed/unsigned shift operations.

• Conditionals: ==, !=, <, >, <=, >=.

• Control flow: IfThen, IfThenElse, Switch, DoWhile, WhileDo, ForLoop (Up/Down), Return, GoTo.

• Memory Operations: Various load and store operations and the ability to create arrays and structs on the stack at compile time.

• Ability to call out to C functions.

• Ability to define bytecodes, their operations and how to sequence them.

• Ability to model Virtual Machine state.

JitBuilder in its current state supports only one optimization strategy, based on the warm optimization level of OMR's compiler component. The optimization levels supported by OMR's compiler component, in ascending order, are: cold, warm, hot, veryHot and scorching. The higher the optimization level, the more optimizations are performed and the more complex and expensive they are. Note that this is not an architectural limitation; JitBuilder can be extended to support additional optimization strategies. JitBuilder currently provides code generation support for X86, Power, IBM Z and AArch64.

Listing 2.2 provides an example using JitBuilder's API to create native code for a function that returns "Hello World". We start by subclassing JitBuilder's MethodBuilder class. The MethodBuilder class encapsulates the details of the function we wish to have compiled by JitBuilder. In the constructor, we initialize the MethodBuilder object with a TypeDictionary parameter, and define our function's name and return type. Our return type here is address, since the string is represented as a char *. The TypeDictionary object contains type information for JitBuilder and can be extended to define additional types.

import jitbuilder
import ctypes

class HelloMethod(jitbuilder.MethodBuilder):
    def __init__(self, td):
        super().__init__(td)
        self.define_name("hello_world")
        self.define_return_type(td.address)

    def buildIL(self):
        self.return_(self.const_string("Hello World"))
        return True

initialized = jitbuilder.initialize_jit()
if not initialized:
    exit()

td = jitbuilder.TypeDictionary()
hello_method = HelloMethod(td)
native_hello_method = jitbuilder.compile_method_builder(
    hello_method
)

CFUNC = ctypes.CFUNCTYPE(ctypes.c_char_p)
ctypes.pythonapi.PyCapsule_GetPointer.restype = \
    ctypes.c_void_p
callable_hello_method = CFUNC(
    ctypes.pythonapi.PyCapsule_GetPointer(
        ctypes.py_object(native_hello_method), 0
    )
)
print(callable_hello_method())
jitbuilder.shutdown_jit()

Listing 2.2: Function that returns Hello World in JitBuilder. N.B. this example uses the Python version of the JitBuilder API.

In this case, however, we simply initialize and use a default TypeDictionary object, since we use only types understood by JitBuilder. The buildIL method is where we describe the functionality of our method; here, we simply return a string containing the text "Hello World". Next, we initialize the JIT using the initialize_jit function, which allocates a code cache, amongst other things. We then instantiate our newly defined HelloMethod and pass it to JitBuilder's compile_method_builder function for compilation. The compiled function is returned as a void * and needs to be appropriately cast before being usable in Python through an FFI; in this case, we use ctypes. Finally, we shut down the JIT. The process is more involved than writing idiomatic Python code, and the complexity is best avoided for simple workloads like this one.
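The ctypes cast at the end of the listing is the generic recipe for turning a raw function address into a typed Python callable. The hedged standalone sketch below (not from the thesis) applies the same recipe to a function whose address we can obtain without JitBuilder: labs from the already-loaded C runtime (POSIX assumed).

```python
import ctypes

# Load the process's already-linked C runtime (POSIX; not Windows).
libc = ctypes.CDLL(None)

# Extract labs' raw address, analogous to the void * JitBuilder hands
# back for compiled code, then rebuild a typed callable from it.
addr = ctypes.cast(libc.labs, ctypes.c_void_p).value
CFUNC = ctypes.CFUNCTYPE(ctypes.c_long, ctypes.c_long)
callable_labs = CFUNC(addr)

print(callable_labs(-42))  # 42
```

Once wrapped this way, the native function is an ordinary Python callable, which is exactly how the compiled hello_world method is invoked in Listing 2.2.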

Chapter 3

Python Bindings to JitBuilder

This chapter describes our approach to creating Python bindings for JitBuilder as well as our approach to automatically JIT compiling a subset of Python code. We detail our technical solution for creating these bindings, discuss the limitations of the technologies involved and explain how we worked around them. We also discuss our autojit system and include a grammar definition of the subset of Python it supports. In particular, we describe our implementation approach and the Python facilities that make it possible. Finally, we discuss the technical challenges faced when developing our autojit system and our solutions to them.

3.1 JitBuilder’s Polyglot Ambitions

JitBuilder is a C++ based library, but one of its overarching goals is to be utilized as a language-agnostic native acceleration library. That is, JitBuilder does not try to bind developers to a particular programming language or development environment. To facilitate this goal, developing bindings to JitBuilder should be a sustainable and relatively uncomplicated task. Towards this, the library was refactored into two separate sets of APIs, the client API and the implementation API. The implementation API contains the actual implementation of the library whereas the client API is the API callable from various other programming languages. The client API's responsibility is to forward calls to the implementation API. An overview of this is illustrated in Figure 3.1. Additionally, a maintainable high-level description of the client API was produced and is currently specified in JSON format. A snippet of the API description is given in Listing 3.1.

Figure 3.1: JitBuilder Client versus Implementation API.

In summary, creating bindings to JitBuilder would require one to write a generator script that consumes the client API description and produces language bindings for a target programming language. At the time of writing, the only set of JitBuilder bindings available is C++ based, unsurprising, given JitBuilder's C++ roots. An early insight, however, was that much of JitBuilder's API could be hidden behind a C interface. The importance of this arises when we think about creating language bindings to additional programming languages. Most, if not all, mainstream programming languages include some form of C FFI, so by exposing a C interface we expect to simplify the path to creating additional language bindings for JitBuilder.

{
  "project": "JitBuilder",
  "version": { "major": 0, "minor": 0, "patch": 0 },
  "namespace": ["OMR", "JitBuilder"],
  "types": [],
  "fields": [],
  "services": [],
  "classes": [
    ...
    {
      "name": "IlBuilder",
      "short-name": "IB",
      "types": [...],
      "fields": [...],
      "constructors": [],
      "callbacks": [...],
      "services": [
        ...
        {
          "name": "IfAnd",
          "overloadsuffix": "WithArgArray",
          "flags": [],
          "return": "none",
          "parms": [
            { "name": "allTrueBuilder", "type": "IlBuilder",
              "attributes": ["in_out"] },
            { "name": "anyFalseBuilder", "type": "IlBuilder",
              "attributes": ["in_out"] },
            { "name": "numTerms", "type": "int32" },
            { "name": "terms", "type": "JBCondition",
              "attributes": ["array", "can_be_vararg"],
              "array-len": "numTerms" }
          ]
        },
        ...
      ],
      ...
    },
    ...
  ],
  ...
}

Listing 3.1: JitBuilder API Description Snippet.

3.1.1 Python Language Bindings

The current language binding generators are implemented as Python scripts. They consume the client API description, then generate bindings for a target programming language. Initial development work focused on creating a C client API. As previously mentioned, this would act as the foundation upon which new language bindings would be based. We map all implementation classes one-to-one to a client API class. Class fields are mapped one-to-one with an additional _impl field on the client API side. This field is an opaque pointer pointing to the corresponding implementation object. Finally, functions are also mapped one-to-one, where the client API's sole purpose is to forward calls to the implementation API. We show code in Listing 3.2 to demonstrate how this is done. We first retrieve the _impl field (a pointer to the corresponding implementation API object) and cast it appropriately before forwarding the call to the implementation API function. Note that there is some setup code before the call to retrieve the arguments' implementation-side objects to pass to the call, as well as after the call to set return values to their corresponding client-side objects.

void IlBuilder::IfAnd(IlBuilder **allTrueBuilder,
                      IlBuilder **anyFalseBuilder,
                      int32_t numTerms,
                      IlBuilder::JBCondition **terms) {
    ...
    static_cast<...>(_impl)->IfAnd(allTrueBuilderArg,
                                   anyFalseBuilderArg,
                                   numTerms, termsArg);
    ...
}

Listing 3.2: Demonstration of Forwarding Calls from Client API to Implementation API. (C++)
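The forwarding pattern itself is simple enough to sketch in plain Python (the class and method names below are illustrative stand-ins, not the generated bindings, which are C++):

```python
class IlBuilderImpl:
    """Implementation API: the actual logic lives here."""
    def if_and(self, num_terms, terms):
        return all(terms[:num_terms])

class IlBuilderClient:
    """Client API: holds an opaque handle (_impl) to the
    implementation object and forwards every call to it."""
    def __init__(self):
        self._impl = IlBuilderImpl()

    def if_and(self, num_terms, terms):
        # Sole responsibility: forward the call to the implementation.
        return self._impl.if_and(num_terms, terms)

client = IlBuilderClient()
print(client.if_and(2, [True, True]))   # True
print(client.if_and(2, [True, False]))  # False
```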

Things become more complicated when we need to map virtual functions from the implementation side to the client side since C includes no notion of overriding functions. These functions are overridable by the caller from the client API side so we cannot rely on simply forwarding calls to the implementation side to get the correct behaviour. Rather, the implementation API should invoke user overrides on the client side if present. To achieve this, we use function pointers in the client API to set callback functions for each virtual function on the implementation side. Virtual functions on the implementation side are also complemented with a non-virtual dispatch function. This dispatch function is only callable from within the implementation API and is responsible for forwarding calls to the callback function set by the client API. The default case of this is illustrated in Figure 3.2. Internally, the implementation API uses foo (the dispatch function), which is responsible for forwarding the call to FooCallback (the callback function set by the client API). In the default case, the client API callback calls the default function implementation on the implementation side, given one exists. In the case that there is an override on the client side, the FooCallback function would be responsible for invoking that override. A more thorough discussion of JitBuilder's approach to creating language bindings is given in Banderali [5].

Figure 3.2: C API Virtual Function Emulation [5].
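The callback scheme can be mimicked in a few lines of Python (a sketch only; the real mechanism registers C function pointers, and the foo/FooCallback naming follows Figure 3.2):

```python
class Impl:
    """Implementation side: foo() is the non-virtual dispatch
    function and always goes through the registered callback."""
    def __init__(self):
        self._foo_callback = None

    def set_foo_callback(self, cb):   # set from the client side
        self._foo_callback = cb

    def foo(self):                    # dispatch function
        return self._foo_callback()

    def default_foo(self):            # default implementation
        return "default"

class Client:
    """Client side: registers a callback that invokes a client
    override when present, else the implementation's default."""
    def __init__(self):
        self.impl = Impl()
        self.impl.set_foo_callback(self.foo_callback)

    def foo_callback(self):
        return self.foo()             # Python dispatch finds overrides

    def foo(self):
        return self.impl.default_foo()

class UserSubclass(Client):
    def foo(self):                    # user override on the client side
        return "override"

print(Client().impl.foo())        # "default"
print(UserSubclass().impl.foo())  # "override"
```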

After completing the C API generator, the next stage of development focused on

creating Python bindings to the exposed C interface. Python provides a few libraries to accomplish this task, namely, ctypes1, cffi2 and Cython [7]. Additionally, at the lowest level, one could write a C extension module using CPython's C API. We decided to use cffi as it provided the best performance/ease-of-use trade-off in our case. In the end, creating bindings atop the C interface did not produce the results we expected and the effort was subsequently discontinued. In particular, creating Pythonic bindings atop the C interface required more manual effort than anticipated since we had to wrap the lower-level C API and additionally create a Pythonic layer on top. In the case of Python, it was possible to create bindings atop the C++ interface directly. This turned out to require significantly less development effort to achieve a Pythonic API. To create the bindings, we used Pybind113, a C++ library. Pybind11 creates appropriate wrappers and exposes the API through a CPython C extension module. Pybind11 requires code to be written in C++ and employs macros, template metaprogramming and lambda functions to provide a relatively simple API.

3.1.1.1 Mapping C & C++ Languages features to Python

Pybind11 was able to provide an automatic wrapping layer for the majority of the exposed JitBuilder API. Classes and their associated methods were mapped one-to-one and inheritance hierarchies were maintained. Problems only arose in cases where JitBuilder functions expected a pointer-to-a-pointer argument. In the majority of these cases in the JitBuilder C++ API, the intention is to potentially modify and return multiple arguments. Creating a wrapper for functions of this style was relatively simple. Functions were made to accept pointer arguments, the addresses of these pointers are passed to the actual JitBuilder function and then the pointers

1https://docs.python.org/3/library/ctypes.html 2https://cffi.readthedocs.io/en/latest/ 3https://github.com/pybind/pybind11

are returned as a tuple after the JitBuilder function call. The other case is when JitBuilder expects a pointer-based array. To wrap functions of this type, we simply create a wrapper function that accepts a vector argument and lowers the data pointer and vector size to JitBuilder. We use a single function to demonstrate both of these cases. In Listing 3.3, we have the function prototype of the IfAnd function call in JitBuilder and in Listing 3.4, we have our wrapper function. The JitBuilder API description indicates whether a pointer-to-pointer argument is intended to be an in-out parameter or an array of pointers. Functions with variable arguments (varargs) can also be automatically mapped to Python functions that accept *args arguments.

void IfAnd(IlBuilder **allTrueBuilder,
           IlBuilder **anyFalseBuilder,
           int32_t numTerms,
           IlBuilder::JBCondition **terms);

Listing 3.3: IfAnd Function Definition.
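The pointer-argument plumbing can be illustrated with ctypes alone (the "C" function below is emulated with a Python callback so the sketch is self-contained; it is not JitBuilder code):

```python
import ctypes

# Emulate a C-style function with an in-out pointer parameter.
@ctypes.CFUNCTYPE(None, ctypes.POINTER(ctypes.c_int))
def double_in_place(p):
    p[0] *= 2  # write the result back through the pointer

def double_wrapper(value):
    """Pythonic wrapper: accept a plain int, pass its address to
    the C-style function, and return the updated value instead of
    mutating an argument, mirroring the tuple-return approach."""
    cell = ctypes.c_int(value)
    double_in_place(ctypes.byref(cell))
    return cell.value

print(double_wrapper(21))  # 42
```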

3.1.2 Autojit

Despite being a high-level API to the underlying compiler technology in OMR, as shown in Figure 2.2, JitBuilder's API is still fairly low-level. This prompted us to experiment, albeit briefly, with the automatic compilation of Python methods. We call our resulting JIT compiler an autojit. Our autojit serves as a proof of concept for a simplified, higher-level interface to JitBuilder. The goal was to support a subset of Python amenable to JIT compilation and sufficient for us to automatically JIT compile our numerical benchmarks. Python provides advanced metaprogramming capabilities that help us achieve this. In particular, we depend on the following features:

• Python’s AST: Python provides an AST module4 that grants us access to the internal representation of its AST. We use this module to inspect the AST of Python methods that we then compile to a typed subset. The grammar of our

4https://docs.python.org/3/library/ast.html

PYBIND11_MODULE(jitbuilder, m) {
    ...
    py::class_<OMR::JitBuilder::IlBuilder>(m, "IlBuilder")
    ...
    .def("if_and",
         [](OMR::JitBuilder::IlBuilder *self,
            OMR::JitBuilder::IlBuilder *allTrueBuilder,
            OMR::JitBuilder::IlBuilder *anyFalseBuilder,
            int32_t numTerms,
            std::vector<OMR::JitBuilder::IlBuilder::JBCondition *> terms) {
             OMR::JitBuilder::IlBuilder **allTrueBuilder_wrapper =
                 &allTrueBuilder;

             OMR::JitBuilder::IlBuilder **anyFalseBuilder_wrapper =
                 &anyFalseBuilder;

             self->IfAnd(allTrueBuilder_wrapper,
                         anyFalseBuilder_wrapper,
                         numTerms, terms.data());

             return py::make_tuple(allTrueBuilder, anyFalseBuilder);
         },
         py::return_value_policy::reference
         ...
    )
    ...
}

Listing 3.4: IfAnd FFI Wrapper Function Snippet (C++).

subset specified in Abstract Syntax Description Language (ASDL) format [45] can be found in Listing 3.5.

• Type Hints: Python’s typing module5 provides support for runtime type hints. Type hints allow us to annotate arbitrary functions and variables with type information. These were added due to the community’s interest in third-party static analyzers for Python and are completely ignored by the CPython run- time. We use these type hints as a means to simplify our implementation as our system performs no type inference.

• Decorators: Not to be confused with the design pattern of the same name, decorators in Python provide us with a simplified syntax for calling higher-order functions. Formally, a decorator is a function that takes another function and augments its behaviour without explicitly modifying it. We use a decorator, @jitbuilder_func, to indicate that a method should be compiled. This decorator also performs all the necessary processing to compile a function as well as to handle dispatching function calls.

• FFI functionality: We depend on Python's FFI capabilities to utilize JitBuilder from within Python.

stmt = FunctionDef(identifier name, arguments args,
                   stmt* body, expr? returns)
     | Return(expr? value)
     | Assign(expr target, expr value, string type)
     | For(expr target, expr begin, expr end,
           expr step, stmt* body)
     | While(expr test, stmt* body)
     | If(expr test, stmt* body, stmt* orelse)
     | Expr(expr value)

expr = BoolOp(boolop op, expr*)
     | Prim(expr l, operator op, expr r)
     | IfExp(expr test, expr body, expr orelse)

5https://docs.python.org/3/library/typing.html

     | Compare(expr l, cmpop* ops, expr* comparators)
     | Call(expr fn, expr* args)
     | Int(constant value)
     | Float(constant value)
     | Index(expr value, expr value, expr_context ctx)
     | Var(id, type)

expr_context = Load | Store

boolop = And | Or

operator = Add | Sub | Mult

cmpop = Eq | NotEq | Lt | LtE | Gt | GtE

arguments = (arg* args)

arg = (identifier arg, expr? annotation)

Listing 3.5: Abstract Grammar of our Typed Subset specified in ASDL [45] format. (N.B. identifier is a builtin type in ASDL.)
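A simplified sketch of the front half of this pipeline is shown below. The decorator name matches the text; the attribute names and the fall-back-to-the-interpreter behaviour are illustrative assumptions, since the real decorator hands the AST and hints to JitBuilder for compilation:

```python
import ast
import inspect
import typing

def jitbuilder_func(fn):
    # Retrieve the AST of the annotated source (the starting point of
    # compilation described above) and the mandatory type hints.
    tree = ast.parse(inspect.getsource(fn))
    hints = typing.get_type_hints(fn)

    def dispatch(*args):
        # A real implementation would dispatch to the natively
        # compiled function; this sketch just runs the original.
        return fn(*args)

    dispatch.ast_tree = tree      # illustrative attributes only
    dispatch.type_hints = hints
    return dispatch

@jitbuilder_func
def add(x: int, y: int) -> int:
    return x + y

print(add(2, 3))  # 5
```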

3.1.2.1 Workflow

Functions to be compiled are annotated with type hints and a @jitbuilder_func decorator. These type hints are mandatory since our system does no form of type inference and simply retrieves these hints from the AST. The compilation process starts by retrieving the AST representation of the annotated source code. We decided to start compilation at the AST level simply because Python exposes type hints at the AST level but not at the bytecode level. We perform a first pass over this AST to compress and translate it to our aforementioned typed subset. A second pass is then performed on the typed subset to progressively build up OMR IL – the intermediate language of the compiler component of OMR. Building up this IL is facilitated by JitBuilder and is encapsulated within a MethodBuilder object. We deviate quite a bit from mapping Python types to JitBuilder types when it comes to dealing with arrays. Despite having access to type hints, we cannot use

Figure 3.3: Auto JIT Workflow.

Python lists due to their heterogeneous nature and the fact that we would have to copy the contents to create a strongly typed list, or, operate on the boxed objects within this list. We resort to using homogeneous NumPy6 arrays. NumPy arrays provide access to their data pointers which we can lower to our compiled function. In the case of unidimensional arrays, lowering is relatively simple and requires us to appropriately cast the pointer and lower it as in Listing 3.6. We decided to cap support at 2D arrays due to not finding a holistic approach to supporting variable-dimensional arrays without introducing the full NumPy C extension module as a dependency. We treat 2D arrays as pointer-to-pointer arguments in JitBuilder since we cannot determine their shape (number of rows and columns) at compile-time. Note, however, that the NumPy arrays we use are C-contiguous (row-major ordered) and it is not straightforward to retrieve these as pointer-to-pointer arguments, nor would it be correct to do so. In particular, NumPy treats n-dimensional arrays as a single contiguous block of memory where each element is separated by a stride. This stride refers to the offset in bytes between elements in each dimension when traversing an array. It is non-constant and depends on how NumPy decides to lay out array elements. Therefore, we instead treat the array's underlying data pointer as an array of uintptr_t values, where each value represents the address of the first element of a row in our 2D array. We use uintptr_t as it is an integer type large enough to hold any data pointer. We now need to calculate and fill in these addresses using the technique shown in Listing 3.7; a breakdown of the steps with an example is given in Figure 3.4. First, we obtain the address of the first element using arr.__array_interface__["data"][0]. Next, we create a new array and use pointer arithmetic to calculate the value for each element.
Recall that each element represents the address of the first element in a row within the NumPy array. Using numpy.arange(arr.shape[0]), we create the new array of size n, where n is equal to

6https://numpy.org/

the number of rows in the NumPy array. We then use arr.strides[0] to determine the byte offset of the first element of each row. Finally, we can now lower this new array and treat it as a pointer-to-pointer argument in JitBuilder.

Figure 3.4: Lowering 2D Arrays to JitBuilder.

data = arr.ctypes.data_as(ctypes.POINTER(ctype))

Listing 3.6: Lowering 1D Arrays to Compiled Function.

data = (arr.__array_interface__["data"][0]
        + numpy.arange(arr.shape[0]) * arr.strides[0]
       ).astype(numpy.uintp)

Listing 3.7: Lowering 2D Arrays to Compiled Function.
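The address arithmetic in Listing 3.7 can be checked with stand-in values (plain Python here; a real NumPy array supplies the base address, shape and strides via __array_interface__, .shape and .strides):

```python
# Stand-in values for a 3x4 C-contiguous array of 8-byte elements.
base = 0x1000                 # arr.__array_interface__["data"][0]
rows, cols, itemsize = 3, 4, 8
row_stride = cols * itemsize  # arr.strides[0]: 32 bytes per row

# One address per row: the address of that row's first element.
row_addresses = [base + i * row_stride for i in range(rows)]
print([hex(a) for a in row_addresses])  # ['0x1000', '0x1020', '0x1040']
```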

After we complete the translation process, the MethodBuilder object is handed off for compilation. Afterwards, the compiled function is returned to Python as a void pointer. Before this function is usable from within Python, we need to perform appropriate casting; for this final stage, we use ctypes. The function is then usable from within Python. This entire process is depicted in Figure 3.3.

Chapter 4

Regular Expression Engine

In this chapter, we provide a complete overview of our pure Python regex interpreter and our specializing regex-to-native-code compiler. We provide a more thorough discussion of regular expressions and their relation to finite automata as well as how we can convert regex constructs to partial NFA states. We also describe how we reused the regex-to-bytecode compiler from Rust's regex crate ("crate" being the Rust community's parlance for a library). Specifically, we detail the available bytecodes and our default regex-to-bytecode compilation options. Additionally, we provide a list of regex features that our engine supports. Finally, we describe the implementation of our pure Python regex interpreter and our specializing regex-to-native-code compiler as well as discuss and provide an example of how the engines operate.

4.1 Design & Implementation

Following the discussion of approaches to regular expression matching in Section 2.2, we have opted to base our regex matching engine implementation on finite automata-based techniques. In particular, we base our implementation on the PikeVM from Rust's regex crate. The crate's PikeVM engine is well documented and is a modern, high-performance NFA-based implementation that is descended from Pike [34].

Development was divided into three phases:

1. Implemented bindings to Rust’s regex crate’s regex-to-bytecode compiler.

2. Implemented an interpreter based on NFA.

3. Used JitBuilder to compile regular expressions to native code.

4.1.1 Bindings to Rust’s RegEx bytecode compiler

To focus on the thesis goal of performance evaluation, we have opted to reuse the regex bytecode compiler from Rust's regex crate. We created a single wrapper function called compile_regex which takes a string representation of a regular expression and compiles it to a regex program. A regex program, in this case, refers to a list of bytecode instructions that represents an NFA and encodes whether the regex is anchored at either end. The regex features that we support are consistent with the supported features of Rust's regex crate given we reuse their regex bytecode compiler. We give a summary of these features in Table 4.1.

Anchors
  ^        Beginning of input or beginning of line in multi-line matching mode
  \A       Beginning of input
  $        End of input or end of line in multi-line matching mode
  \z       End of input
  \b       Word boundary
  \B       Not a word boundary

Quantifiers
  *        0 or more
  +        1 or more
  ?        0 or 1
  {n}      Exactly n
  {n,}     n or more
  {n,m}    At least n but no more than m
  ?        Append a ? to make a quantifier non-greedy (lazy), e.g. *?

Perl Character Classes
  \s       Whitespace
  \S       Not whitespace
  \d       Digit
  \D       Not digit
  \w       Word character
  \W       Not word character

Composites
  a|b      Alternation (a or b)
  ab       Concatenation, a followed by b

ASCII Character Classes
  [:upper:]   Upper case letters
  [:lower:]   Lower case letters
  [:alpha:]   All letters
  [:digit:]   Digits
  [:alnum:]   Digits and letters
  [:xdigit:]  Hexadecimal digits
  [:punct:]   Punctuation
  [:space:]   Whitespace
  [:cntrl:]   Control characters
  [:graph:]   All visible characters (excludes spaces and control characters)
  [:print:]   All visible characters and spaces (excludes control characters)
  [:word:]    Word characters (letters, numbers and underscores)

Escape characters
  \           Escape the following character, i.e. treat it as a literal, e.g. \*
  \t          Horizontal tab
  \v          Vertical tab
  \n          New line
  \r          Carriage return
  \a          Bell
  \f          Form feed
  \123        Octal character code
  \x7F, \u007F or \U0000007F   Hex character code (2, 4 or 8 digits)
  \x{10FFFF}  Any hex character code corresponding to a Unicode code point
  \u{7F}      Any hex character code corresponding to a Unicode code point
  \U{7F}      Any hex character code corresponding to a Unicode code point

Groups and Ranges
  .             Any character except new line (\n)
  (...)         Capture group
  (?:...)       Non-capturing group
  (?flags)      Set flags within current group
  (?flags:...)  Set flags for ... as a non-capturing group
  [abc]         Union (a or b or c)
  [^abc]        Not (a or b or c)
  [a-z]         Lower case letter from a to z
  [A-Z]         Upper case letter from A to Z
  [0-9]         Digit from 0 to 9

Flags
  i   Case-insensitive match
  m   Multi-line mode: ^ and $ match begin/end of line
  s   Allow . to match \n
  U   Swap the meaning of greedy and lazy quantifiers, e.g. * and *?
  u   Unicode support (enabled by default)
  x   Ignore whitespace and allow line comments (starting with #)

Table 4.1: Summary of Supported Regex Features.

A full list of supported features is provided in the Rust regex crate's documentation [39]. It is also important to note that in our case, all regular expressions are UTF-8 byte-based and are compiled with the default options from Rust's regex crate. Namely, regular expressions are case-sensitive, single-line (i.e., ^ and $ refer to the start and end of the input respectively), Unicode-aware, and dots (.) do not match newlines. The following bytecodes are possible:

• Bytes (start, end): indicates a single byte range and is used to match the character at the current location in the input. If the character matches, then we advance the string pointer (SP) and the instruction pointer (IP); otherwise, we stop this thread.

• Match: indicates that the program has reached a match state and we stop the thread.

• Split (goto, goto alternate): indicates that the program should diverge and continue execution at both goto and goto alternate.

• Save (i): indicates that the program should record the current location of the input in the ith slot indicated by the save instruction. It is used to track a sub-match (capture) location.

• EmptyLook: represents a zero-width assertion in the program, i.e., it does not consume any input (e.g. ^ and $).

Figure 4.1: Partial NFA for matching a single character: c.

Figure 4.2: Partial NFA that describes the alternation: n1|n2.

4.1.2 RegEx Interpreter

4.1.2.1 Regular Expressions & Finite Automata

As previously discussed in Section 2.2, regular expressions and finite automata are of equal expressive power. That is, for each regular expression RE, there exists a finite automaton FA that accepts the set of strings generated by RE, and for each finite automaton FA, there exists a regular expression RE that generates the set of strings accepted by FA. Below (Figure 4.1 – Figure 4.6) we illustrate converting regex constructs to partial NFA states as in Thompson [43]. Circles represent a single matching construct that recognizes the enclosed character, rectangles represent partial NFAs and dangling arrows represent input and output paths:

Figure 4.3: Partial NFA that describes matching n once or zero times: n?.

Figure 4.4: Partial NFA that describes matching n zero or more times: n∗.

Figure 4.5: Partial NFA that describes matching n one or more times: n+.

Figure 4.6: Partial NFA that describes the concatenation: n1n2.

Figure 4.7: (h|H)ello NFA Representation.

Tying it all together, dangling output paths are connected to a matching state to represent a complete NFA, as shown in Figure 4.7, where we match the strings “Hello” or “hello”. We include ε transitions in this example. They indicate state transitions that do not consume input and are merely a convenience to simplify the regex to NFA conversion process.

4.1.2.2 Matching Engine

Sharing a common frontend, the matching engine in this thesis closely follows the implementation of the PikeVM1 in Rust's regex crate. The PikeVM is an NFA implementation inspired by the NFA implementation2 in the RE2 regex library. RE2's NFA implementation is descended from Pike [34] which is, in turn, descended from Thompson's implementation [43]. The engine operates by sequentially executing a list of opcodes against each possible byte in the input text. While executing the current list of opcodes on a byte, a list of the next opcodes to be executed is populated. After executing the current list, it is swapped with the next list, the input is advanced and the process continues until we ascertain whether or not there was a match. Cox [18] likened this process to executing threads in a VM. Each thread can be thought of as an encapsulation of a regex program and is responsible for maintaining state information to track sub-match boundaries. Here, a regex program simply refers to a list of instructions to execute. The VM is responsible for maintaining the position of the byte currently

1https://github.com/rust-lang/regex/blob/master/src/pikevm.rs 2https://github.com/google/re2/blob/master/re2/nfa.cc

being operated against in the input text as well as two lists to track the current (clist) and next (nlist) threads to be executed by the VM. The algorithmic runtime becomes apparent from the aforementioned process. We can have at most n instructions on a list, where n represents the total number of instructions. This bounds the time spent processing a byte to O(n); therefore, the time for processing the entire string is O(nm), where m is the byte length of the string.

The process of matching usually commences on the first byte in the input, but the starting byte position can be manipulated by higher-level APIs in the case of finding all matches in a given input. The VM contains two important operations in addition to the main execution loop responsible for advancing the current byte, namely add and step. The add operation is responsible for following ε transitions starting at the indicated IP and populating nlist with the next set of reachable states from IP whereas the step operation is responsible for advancing the current byte and using clist to compute nlist. Describing the process further, the VM starts by preallocating clist and nlist that are swapped after each step to avoid allocating new lists on each iteration of the loop. Next, beginning at the indicated starting byte, we begin looping over the text with an initial call to add to populate the clist with the first set of instructions to be executed. This process can also occur later in the execution, i.e., using the add operation to populate clist, particularly when the clist is empty and the regex is unanchored, but there has not been a match. In this case, the VM indicates that the regex program should be restarted at the current byte in the input and can be thought of as prepending a “.*?” at the start of the regex. Starting at the next byte, we then begin stepping the regex program indicated by clist using it to compute our nlist. The instructions in clist are run in last in, first out (LIFO) order and we start execution at the next byte because the previous call to add inspects the position before the current byte. After computing nlist, the current byte is advanced, the nlist is swapped with the clist and the process

continues until we ascertain whether or not there is a match. This entire process is depicted in Figure 4.8 and we simulate our VM testing the regex (h|H)ello against the string "Hello World" in Table 4.2. Our VM has two matching modes, namely finding the first match and finding all matches in a string. To find all matches, we simply execute our engine to find the first match and then modify the starting byte position in the input string to find subsequent matches.

Thread  IP             SP           Operation
T1      0: Save(0)     Hello World  record capture location at slot 0
T1      1: Save(2)     Hello World  record capture location at slot 2
T1      2: Split(3, 4) Hello World  add states 3 & 4 to current thread: T1
T1      3: Bytes(h, h) Hello World  no match
T1      4: Bytes(H, H) Hello World  byte matches; create T2 with next set of states
T2      5: Save(3)     Hello World  record capture location at slot 3
T2      6: Bytes(e, e) Hello World  byte matches; create T3 with next set of states
T3      7: Bytes(l, l) Hello World  byte matches; create T4 with next set of states
T4      8: Bytes(l, l) Hello World  byte matches; create T5 with next set of states
T5      9: Bytes(o, o) Hello World  byte matches; create T6 with next set of states
T6      10: Save(1)    Hello World  record capture location at slot 1
T6      11: Match(0)   Hello World  found a match

Table 4.2: Regex Engine Matching (h|H)ello against "Hello World". See Section 4.1.1 for a discussion of the bytecodes used in the table above.
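The execution loop can be sketched in a few dozen lines of Python. This is a toy model only: it omits capture slots, the Save and EmptyLook opcodes and the preallocated, swapped lists, and the program below is a hand-compiled stand-in for (h|H)ello rather than output of the real bytecode compiler:

```python
# Hand-compiled toy program for (h|H)ello:
# ("split", a, b) diverges to both targets; ("byte", c, next)
# consumes one matching byte; ("match",) reports success.
PROG = [
    ("split", 1, 2),        # 0
    ("byte", ord("h"), 3),  # 1
    ("byte", ord("H"), 3),  # 2
    ("byte", ord("e"), 4),  # 3
    ("byte", ord("l"), 5),  # 4
    ("byte", ord("l"), 6),  # 5
    ("byte", ord("o"), 7),  # 6
    ("match",),             # 7
]

def pikevm(prog, text):
    def add(threads, pc):
        """Follow epsilon (split) transitions starting at pc."""
        if pc in threads:
            return
        threads[pc] = True
        op = prog[pc]
        if op[0] == "split":
            add(threads, op[1])
            add(threads, op[2])

    clist = {}  # dicts preserve insertion order, acting as ordered sets
    for byte in text.encode():
        add(clist, 0)  # unanchored: implicit ".*?" restart at every byte
        nlist = {}
        for pc in clist:  # step: run clist against byte, filling nlist
            op = prog[pc]
            if op[0] == "byte" and op[1] == byte:
                add(nlist, op[2])
            elif op[0] == "match":
                return True
        clist = nlist
    return any(prog[pc][0] == "match" for pc in clist)

print(pikevm(PROG, "Hello World"))  # True
print(pikevm(PROG, "say hello!"))   # True
print(pikevm(PROG, "HELLO"))        # False
```

The at-most-n-states-per-list bound discussed above is visible here: add deduplicates program counters, so each list holds each instruction at most once.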

4.1.3 Specializing Regex Compiler

Our specializing regex compiler generates native code from a regular expression using JitBuilder. The specialization comes from the fact that each regular expression is compiled to native code that only includes the bytecodes specified in the regex, i.e., we compile and execute the state machine resulting from compiling the regex to bytecode. In particular, the step and add operations are specialized and now operate on a program counter (PC) instead of switching between all opcodes as in the interpreter implementation. Similar to our interpreter implementation, we define four methods that comprise the entire execution of our VM. At a very high level, we

Figure 4.8: Regex Matching Engine Flowchart.

use JitBuilder API calls to build up OMR IL that is then compiled. Digging further into our implementation, we depend on several external utilities implemented as a C library that we need to make available to JitBuilder and Python, namely, the following structures and their associated functionality:

• SparseSet: Tracks the current list of OpCodes for a Thread.

• Matches: Tracks the number of matches found and their capture slots.

• Threads: Tracks the list of threads and their associated capture locations.
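As a hedged illustration, a structure of this kind can be declared on the Python side with ctypes as follows (the field names and types here are invented for the example, not the actual utility-library layout):

```python
import ctypes

class SparseSet(ctypes.Structure):
    # Hypothetical layout; the real library defines its own fields.
    _fields_ = [("size", ctypes.c_int32),
                ("capacity", ctypes.c_int32),
                ("dense", ctypes.POINTER(ctypes.c_int32)),
                ("sparse", ctypes.POINTER(ctypes.c_int32))]

# ctypes reports each field's byte offset; the offsets given to
# JitBuilder's TypeDictionary must agree with these so both sides
# share one machine-level representation.
for name, _ in SparseSet._fields_:
    print(name, getattr(SparseSet, name).offset)
```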

First, we need to make both JitBuilder and Python aware of these types and their layout as the machine-level representation of these types must be consistent. To do this, Python’s ctypes library provides APIs for defining C-style structs and retrieving the offsets of their associated fields, whereas JitBuilder provides a TypeDictionary object to store type information and APIs to define new structure types as well as specify field offsets. As for the functionality associated with these structures, we used cffi to create Python bindings to our C utility library. The bindings allow us to have access to these functions within our Python environment. The necessity of this becomes apparent when we use JitBuilder to compile our code. Since our compiled code depends on functionality from our C utility library, we need to make them available in our Python environment and the aforementioned bindings do just that. JitBuilder provides APIs to call functions compiled outside of its runtime. For this, we need the function’s signature and entry point (function address). This proved to be not so straightforward for two reasons:

1. Pybind11 not supporting passing arbitrary void * arguments.

2. Incompatible FFI layers between cffi and Pybind11.

Recall that we used Pybind11 to create our Python bindings to JitBuilder, whereas we used cffi to create Python bindings to our C utility library. Both of these FFI

libraries perform type checking when calling out to bound functions. Problem 1 arises when we need to use JitBuilder to call out to an externally compiled function. The API call for this expects the entry point of the function to be passed as a void * argument. The problem is that Pybind11 provides a type-safe API and does not support passing arbitrary void * arguments. We could pass a void * argument as a PyCapsule, but this would require us to dig into the Python C API both on the Python side, to create the PyCapsule, and in the wrapper function used for the binding on the C++ side, to unwrap it. The solution we chose was to create a wrapper around the JitBuilder function that accepts a uintptr_t argument and casts it to a void * on the C++ side before lowering it to the actual JitBuilder API call. This workaround exploits the fact that pybind11's type checker will accept a Python integer for a C++ uintptr_t argument. Recall that a uintptr_t is an unsigned integer type capable of storing any valid data pointer.

Problem 2 arises when we try to obtain the addresses of the bound functions from our utility library using cffi. For the most part we can do this, but cffi returns the cast result as a cffi cdata wrapper object, which is not the plain uintptr_t that our function expects. The remedy for this is much simpler, as cffi provides suitable cast facilities to convert a uintptr_t-based cdata wrapper object to a Python int that the function will accept.

Now that JitBuilder and Python are aware of all the external functionality we need, we can proceed with defining our VM functions. Other than being compiled, the VM operates the same, except that the add and step methods are now specialized to a regex. The information for each of our functions is encapsulated within a MethodBuilder object. We define the function name, parameters and return type in the MethodBuilder constructor; this can be thought of as a declaration.
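Before moving on, the Python side of the entry-point workaround can be sketched. We use ctypes and libc here purely for illustration (our system uses cffi and our C utility library); the key point is that a native function's entry point becomes a plain Python int, which a pybind11 binding typed as uintptr_t will accept.

```python
import ctypes
import ctypes.util

# Load libc as a stand-in for our C utility library.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Obtain the entry point of a bound function as a plain Python int.
# With cffi the equivalent would be int(ffi.cast("uintptr_t", lib.func)).
entry_point = ctypes.cast(libc.abs, ctypes.c_void_p).value
assert isinstance(entry_point, int)

# Round-trip: rebuilding a callable from the raw address shows it really is a
# valid function pointer -- exactly what the JitBuilder wrapper receives.
abs_again = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)(entry_point)
print(abs_again(-5))  # prints 5
```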
The actual functionality is implemented in a virtual buildIL method and can be thought of as the function definition. For our add and step methods, we loop over our list of bytecodes and build up OMR IL containing code specialized to our regex.
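The specialization idea can be illustrated with a toy example, using Python source generation as a stand-in for building OMR IL and a deliberately tiny opcode set (the names and opcodes are illustrative, not our actual bytecode):

```python
# Instead of an interpreter switching over every opcode, we emit code that
# contains only the opcodes of one particular regex, one block per PC value.
def specialize(bytecode):
    lines = ["def match(text, pos):"]
    for pc, (op, *args) in enumerate(bytecode):
        if op == "char":
            lines.append(f"    # pc {pc}: consume literal {args[0]!r}")
            lines.append(f"    if pos >= len(text) or text[pos] != {args[0]!r}:")
            lines.append("        return False")
            lines.append("    pos += 1")
        elif op == "match":
            lines.append(f"    # pc {pc}: accept")
            lines.append("    return True")
    ns = {}
    exec("\n".join(lines), ns)
    return ns["match"]

# Bytecode for the regex "ab", compiled to a function with no dispatch loop.
ab = specialize([("char", "a"), ("char", "b"), ("match",)])
print(ab("abc", 0), ab("ba", 0))  # True False
```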

Our find and find_all functionality matches the approach used in our interpreter implementation, i.e., to find all matches, we simply execute our engine to find the first match and then advance the starting byte position in the input string to find subsequent matches.
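The find_all loop described above can be sketched as follows, with a toy literal matcher standing in for our compiled engine (all names here are illustrative):

```python
# find_all strategy: run the engine for the first match, then restart it past
# that match's starting byte until no further match is found.
def find_all(match_first, text):
    """match_first(text, start) -> (match_start, match_end) or None."""
    matches, pos = [], 0
    while pos <= len(text):
        found = match_first(text, pos)
        if found is None:
            break
        start, end = found
        matches.append((start, end))
        # Advance past the current match; step one byte on empty matches.
        pos = end if end > start else start + 1
    return matches

# Toy engine standing in for the compiled matcher: finds the literal "ab".
def toy_match(text, start):
    i = text.find("ab", start)
    return None if i == -1 else (i, i + 2)

print(find_all(toy_match, "ab ab ab"))  # [(0, 2), (3, 5), (6, 8)]
```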

Chapter 5

Evaluation

In this chapter, we evaluate and report on the performance results of our autojit (Section 3.1.2) and our regex engine (Chapter 4). First, we describe our experimental setup and how we verify the correctness of our regex engine. We then discuss the test cases and results for both our autojit and our regex engine.

5.1 Experimental Setup

Below, we detail the specs of our benchmarking machine:

• OS: Ubuntu 19.04 (Disco Dingo)

• CPU: 6x Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz

• RAM: 32 GB DDR4

Numba compiles code using the LLVM JIT at opt-level 3 (O3). We used Python 3.8 with Numba 0.48.0 and cffi 1.13.2. Additionally, PyPy 7.3.1-alpha0 (Python 3.6.9), Rust 1.41.0 nightly, Rust's regex crate 1.3.4 and JitBuilder 0.1.0 were also used. All of the following results exclude compilation times and, with the exception of Figure 5.6, are plotted on a log scale.

We use pyperf (https://github.com/vstinner/pyperf) for our analysis. Pyperf is a statistical benchmarking toolkit for Python that can automatically calibrate benchmarks to produce stable results. No changes were made to pyperf's default configuration for the results reported below.

All test cases that take over 15 minutes to execute on our Matrix Multiplication benchmark are deemed timeouts and are discarded from our autojit results. These timeouts occur only for PyPy and CPython. Due to the statistical nature of the benchmarking process, the time complexity of the function (O(n³)), the input sizes and the performance characteristics of the aforementioned runtimes, it would take an unreasonably long time to run these benchmark cases to completion.

5.2 Testing our Implementations

To test our implementations for finding the first match in an input string, we executed our engine on a modified version of the test suite included in RE2 (https://github.com/google/re2/blob/master/re2/testing/search_test.cc). Most notably, we excluded test cases that use regex features our parser cannot handle; for example, we exclude \C, which matches a single byte. To test our implementations for finding the total number of matches in a given string, we executed our engine on the sherlock-based test cases from Rust's regex crate (https://github.com/rust-lang/regex/blob/master/bench/src/sherlock.rs). We first execute these test cases in Rust, creating a new test script that records the total number of matches found. These counts are taken as the ground truth since this is the implementation upon which ours is based. Our implementations passed all test cases.

Figure 5.1: Iterative Fibonacci Benchmark.

5.3 Discussion

In this section, we describe our experiments accelerating numerical code written in Python.

5.3.1 AutoJIT

Recall from Section 3.1.2 that we briefly experimented with automatically JIT compiling a subset of Python code. To evaluate this, we wrote the following five Python functions: 1) Iterative Fibonacci, 2) Recursive Fibonacci, 3) Dot Product, 4) Matrix Multiplication and 5) Mandelbrot. Below, we report on the results of our autojit approach compared to other state-of-the-art approaches in the Python ecosystem, namely Numba and PyPy.
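For reference, a sketch of how one of these benchmark kernels is written and decorated. The @autojit decorator below is a no-op stand-in (the real one, described in Section 3.1.2, translates the function's AST to OMR IL via JitBuilder); the type hints are what the AST-level translator consumes to assign machine types.

```python
# Stand-in decorator: the real autojit walks the function's AST and emits
# OMR IL via JitBuilder; here we simply return the function unchanged.
def autojit(func):
    return func

@autojit
def fib(n: int) -> int:
    # Iterative Fibonacci, one of the five benchmark kernels.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print(fib(10))  # 55
```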

In our Iterative Fibonacci benchmark (Figure 5.1, Table 5.1), JitBuilder was able to generate native code that outperformed pure Python but performed 2x-7x worse than the best reported times. In this case, the best results were achieved by PyPy for smaller input sizes and Numba for larger input sizes. In the case of our Recursive Fibonacci benchmark (Figure 5.2, Table 5.2), Numba slightly outperformed

Input Size   PyPy    Numba   Python+JitBuilder
10           9.1x    4.0x    1.3x
30           11.2x   8.9x    3.2x
50           12.9x   14.7x   5.2x

Table 5.1: Relative Speedups versus CPython on our Iterative Fibonacci Benchmark.

Figure 5.2: Recursive Fibonacci Benchmark.

Python+JitBuilder for the smallest input size, but Python+JitBuilder performed best for the remaining cases, outperforming the second-best result significantly on the largest input size. In our Dot Product benchmark (Figure 5.3, Table 5.3), Numba performs best in all cases while Python+JitBuilder performs second best. As the size of the vectors grows, the performance gap between the two narrows. Note that our x-axis values are non-linear in this figure. In our Mandelbrot benchmark (Figure 5.4, Table 5.4), Numba performs best on all the cases, significantly outperforming

Input Size   PyPy    Numba   Python+JitBuilder
10           36.6x   26.2x   21.7x
30           11.5x   32.4x   49.2x
50           21.9x   29.2x   44.3x

Table 5.2: Relative Speedups versus CPython on our Recursive Fibonacci Benchmark.

Figure 5.3: Dot Product Benchmark.

Input Size   PyPy   Numba     Python+JitBuilder
10           0.2x   13.7x     5.3x
100          0.2x   116.9x    47.5x
1000         0.2x   47.5x     327.3x
1000000      0.2x   647.2x    577.4x
100000000    2.0x   4450.5x   4366.6x

Table 5.3: Relative Speedups versus CPython on our Dot Product Benchmark.

Figure 5.4: Mandelbrot Benchmark.

Input Size   PyPy   Numba      Python+JitBuilder
10           0.2x   872.51x    628.41x
100          0.2x   1143.30x   747.45x
1000         0.2x   1217.39x   821.67x

Table 5.4: Relative Speedups versus CPython on our Mandelbrot Benchmark.

the others. Python+JitBuilder performs second best, Python third best, while PyPy performs worst on all the test cases. Finally, we introduce an additional point of comparison for our Matrix Multiplication benchmark (Figure 5.5, Table 5.5) because, in its current state, JitBuilder is not able to perform automatic loop vectorization. It does, however, provide API methods to manually define loops that should be vectorized, whereas Numba is able to perform automatic loop vectorization. Results demonstrate that our Python+JitBuilder implementation with vectorized loops performed best for all input sizes, while Python+JitBuilder without vectorized loops performed second best in all cases except for the smallest input size, where Numba performed slightly better. At first glance, it may appear somewhat surprising that pure Python outperforms PyPy on our Matrix Multiplication, Dot Product and Mandelbrot benchmarks despite the latter including a JIT compiler. The reason for this is that these benchmarks

Input Size (NxN)   PyPy   Numba    Python+JitBuilder   Python+JitBuilder (Vectorized)
10                 0.2x   923.3x   37.9x               923.3x
100                0.2x   560.5x   864.8x              1729.5x

Table 5.5: Relative Speedups versus CPython on our Matrix Multiplication Benchmark.

Figure 5.5: Matrix Multiplication Benchmark.

Figure 5.6: Compilation Times: Python+JitBuilder versus Numba.

use NumPy arrays. NumPy provides homogeneous arrays in Python and is implemented as a CPython C extension module. This reliance on CPython's C API makes NumPy non-portable across Python implementations. PyPy, however, includes cpyext, a subsystem which provides a compatibility layer to compile and run CPython C extension modules within its runtime. The problem is that cpyext is infamously slow and can perform orders of magnitude worse than CPython [20]. Another notable trend is that our Python+JitBuilder implementation starts off not performing the best but eventually closes the gap with, or surpasses, the others on our Matrix Multiplication, Dot Product and Recursive Fibonacci benchmarks. We suspect this is due to dispatch overhead in our implementation that is amortised as the workload grows, though we have yet to verify this. In particular, we use ctypes to cast our compiled function and also to lower array types to our compiled function. Ctypes has been known to be one of the slower FFIs in the Python ecosystem due to its dependence on libffi. Numba uses highly optimized C code and the CPython C API directly for its dispatching mechanism. Both CPython and PyPy bypass this additional overhead since they are not executing any code external to their respective runtimes.

In Figure 5.6, we show the compile times for both our Python+JitBuilder approach and Numba. Our compilation times include optimizing and compiling OMR IL to native code, whereas the times reported for Numba are more involved. This is because our system builds up OMR IL directly in Python code, whereas Numba compiles Python bytecode to Numba IR, then to LLVM IR, which is then compiled to native code. The reported times for Numba include the time it takes Numba to translate Python bytecode to Numba IR, optimize its IR, translate it to LLVM IR and hand it off to the LLVM JIT for compilation. Our compilation pipeline is much simpler, and results indicate that our compilation times are consistently lower than those of Numba.

5.3.2 RegEx Engine

In this section, we report on and discuss the results of evaluating our regex engine. For this round of evaluations, we measure the time it takes to find all matches within our input dataset. Our sole dataset is a 5 MB text file that is a concatenation of several Project Gutenberg ebooks (https://www.gutenberg.org). For this set of evaluations, we compare our system to PyPy and Rust's regex crate. Numba's automatic JIT could not handle the complexity of our implementation and has therefore been excluded from this set of results. Numba also has a numerical computation focus and would perhaps not be used to accelerate a workload of this kind in Python. Rust's regex crate includes several matching engines: NFA-, DFA-, one-pass NFA- and backtracking-based matching engines. For this reason, we include several Rust measurements in our results. We explicitly execute its NFA-based matching engine for one set of measurements, then execute its normal mode of operation. In the latter case, some internal analysis is done on the regex and the best engine is chosen automatically. We then repeat the above process with the crate's performance optimizations turned off. We expand on these performance optimizations below during our analysis. For brevity, we will refer to Rust's regex crate as rrc beyond this point.

Our first benchmark (Figure 5.7, Table 5.6) focuses on matching literal words. It comes as no surprise that our pure Python approach performs the worst. The fact that it is written in pure Python puts it at a significant performance disadvantage, especially considering that all the other engines are executing mostly native code; entirely native code in the case of Rust. This trend continues throughout the remaining benchmarks and will not be discussed further. We simply include the results to demonstrate the speedups achievable through compilation with JitBuilder. Another notable trend is that PyPy performs second worst on all the benchmark cases and will not be discussed further. Profiling with vmprof (https://github.com/vmprof/vmprof-python) indicates that PyPy spends between 94% and 99% of its total execution time in jitted code. Despite this, PyPy was not able to perform comparably to our Python+JitBuilder approach nor to Rust. On all three test cases, our Python+JitBuilder implementation performs comparably (typically slightly worse) to both rrc's normal mode of execution and its NFA engine when performance optimizations are off. However, we see a dramatic improvement in rrc's performance once we enable these performance optimizations. To match Gutenberg, both case sensitively and insensitively, rrc is able to completely bypass executing any of its matching engines and instead compile both of these down to a Boyer-Moore string search [13] that uses memchr to quickly search through the input text. For the final case, which includes an alternation, rrc compiles this down to an Aho-Corasick automaton [2] that also uses memchr to quickly search through text.

For our second set of benchmarks (Figure 5.8, Table 5.7), we measure the time it takes to match everything with and without newlines and the time it takes to report no matches in the input string. To report no matches, our engine performed the best of the lot, whereas the other engines performed comparably, with rrc performing slightly

Figure 5.7: Regex Benchmark Set #1.

Method              Gutenberg   (?i)Gutenberg   Gutenberg|Good
PyPy                15.8x       14.6x           15.2x
Python+JitBuilder   176.0x      162.5x          165.9x
Rust NFA            230.5x      303.2x          259.1x
Rust                187.5x      303.3x          259.7x
Rust Perf NFA       141201.0x   111437.9x       172474.2x
Rust Perf           178311.2x   100051.5x       248694.8x

Table 5.6: Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #1.

Figure 5.8: Regex Benchmark Set #2.

better with performance optimizations turned off. Despite the regex essentially being literal text, the presence of Unicode word boundaries renders many of rrc's optimizations useless, including its DFA engine. In this case, execution falls back to the NFA engine. The other two regexes match everything in the input string, including and excluding newlines. Our engine performs about 6-7x worse than rrc with its performance optimizations turned off. With these optimizations turned on, the performance of rrc's NFA engine essentially remains the same, whereas performance is greatly improved in rrc's normal mode of execution. If it is not readily apparent, both of these regexes render rrc's literal optimizations useless; the performance improvement we are seeing is a result of the DFA engine. The main performance benefit of a DFA engine is that it is in exactly one state at any point in time, whereas the NFA engine can be in multiple states at any given time.

The next set of regexes (Figure 5.9, Table 5.8) are modified versions of some of those in rrc's benchmarking suite. They allow us to measure more varied performance and evaluate rrc's additional performance optimizations. The regex [a-zA-Z]+ly searches for words ending in "ly". Here, our engine performs up to 2x worse than rrc's NFA engine with performance optimizations disabled. Both rrc's normal

Method              .*         (?s).*     \bzero matches\b
PyPy                13.4x      13.5x      16.3x
Python+JitBuilder   135.8x     146.9x     172.8x
Rust NFA            1096.5x    1033.9x    141.7x
Rust                1078.9x    1034.3x    140.8x
Rust Perf NFA       1006.4x    947.4x     135.8x
Rust Perf           12406.2x   22568.3x   135.7x

Table 5.7: Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #2.

Method              [a-zA-Z]+ly   pre[a-zA-Z]+   (?m)^hi|bye$
PyPy                13.2x         14.9x          17.2x
Python+JitBuilder   149.3x        165.8x         167.8x
Rust NFA            296.7x        221.6x         240.6x
Rust                298.2x        221.3x         204.4x
Rust Perf NFA       289.5x        10767.9x       220.8x
Rust Perf           7708.1x       9157.6x        6239.4x

Table 5.8: Relative Speedups versus our Pure Python Regex Interpreter on Regex Benchmark Set #3.

mode of execution (without performance optimizations) and its NFA engine with performance optimizations enabled see no significant improvement. We do, however, see significant improvement in the case of rrc's normal mode of execution with performance optimizations enabled. This regex defeats any possible prefix optimizations, but it does trigger rrc's suffix optimizations. Results are similar for our next case, where we search for words beginning with the prefix "pre". Here, our system performs about 1.3x worse. This regex defeats any suffix optimizations but triggers rrc's prefix optimizations. For our final case, we search for the word "hi" occurring at the beginning of a line or the word "bye" occurring at the end of a line. This regex defeats any form of prefix or suffix optimization. Here, performance is comparable across rrc's different modes of operation, except for the speedups in its normal mode of execution with performance optimizations enabled; these speedups are due to the DFA-based engine. Our system performs about 1.4x slower than executing rrc's NFA engine directly.

Figure 5.9: Regex Benchmark Set #3.

Chapter 6

Conclusions & Future Work

6.1 Conclusions

This thesis set out to evaluate the case for high-performance Python using OMR JitBuilder to accelerate compute-intensive workloads, and also to evaluate the case for compiled-to-native-code regular expressions. To this end, we have implemented and evaluated several compute-intensive numerical benchmarks, as well as a specializing regular-expression-to-native-code compiler. Our results on the numerical benchmarks demonstrate that Python+JitBuilder achieves performance that rivals and in some cases outperforms state-of-the-art approaches in the current ecosystem. We also note the following advantages of our system:

1. Within the Python ecosystem, our approach is closest to that of Numba. The main benefit of our system, however, is the generality and ability to accelerate any type of workload whereas Numba is focused on numerical computation.

2. Our system provides a relatively lower barrier to entry compared to current approaches to accelerating Python code:

• installable package.

• new API to learn versus additional build systems and/or programming language.

3. Our system complements and extends the CPython runtime versus modifying it. This enables developers to make use of the rich arsenal of libraries already available that are dependent on the CPython C API.

4. Given JitBuilder’s ability to call out to native code compiled outside of its run- time, our system can be added to projects incrementally and will not invalidate previous efforts in the case of high-performance libraries implemented outside CPython. This includes C extension modules and libraries made to interface with Python through some form of FFI. That is, they can be made to work together instead of “starting over” with JitBuilder.

We explored implementing Numba-esque auto JIT functionality using a Python decorator to indicate which functions should be compiled. We managed to support an extensive enough subset to automatically JIT compile the majority of our numerical benchmark cases. To simplify development efforts, we did our translation at the Python AST level, and our findings are consistent with those of Rubinsteyn et al. [38]: function compilation at the AST level is not a very scalable approach to supporting a large enough subset of Python for automatic JIT compilation.

All things considered, our specializing regex engine performs comparably to the NFA engine in Rust's regex crate, and outperforms it in some cases, while significantly outperforming our pure Python implementation in all cases. From our experiments, whether it is worth generating specialized native code from regular expressions for improved performance remains inconclusive, for several reasons:

• Internally, our system depends on the ctypes package to convert Python types to machine types before lowering them to our compiled regex matcher. Ctypes

has been known to be one of the slower FFI libraries available in the Python ecosystem due to its dependence on libffi. We suspect that there is some, potentially non-negligible, overhead associated with this, but we have been unable to quantify it using current tools.

• As previously discussed in Section 2.3, JitBuilder currently only supports an optimization strategy based on the warm optimization level defined in OMR's compiler component. Furthermore, JitBuilder's optimizer only includes some of the optimizations defined at that level and does not iterate as frequently, in order to prioritize optimization times. There are perhaps further performance benefits achievable through a strategy based on one of the higher optimization levels defined in OMR's compiler component; this would improve our results on all benchmarks. Both Numba and Rust use LLVM as their backend and compile code at the highest optimization level (O3) for release builds. It is therefore somewhat unfair to draw definitive conclusions given this disparity.

Results demonstrate, however, that significant performance gains for regular expressions can be achieved through several different types of algorithmic optimizations, such as literal-, prefix- and suffix-based optimizations. Additionally, low-overhead analysis and switching between different matching engines is also beneficial. Rust's regex crate's DFA-based and NFA-based matching engines present the classical space-versus-time tradeoff, while its bounded backtracking engine offers superior performance on smaller regular expressions and inputs. Finally, another compelling case is to stay out of a matching engine for as long as possible, or to bypass it entirely. In the case of Rust's regex crate, this occurs when the regular expression is a string literal. Regular expressions of this style get compiled down to a Boyer-Moore string search that offers superb performance, as indicated by our benchmarking results.
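The "bypass the engine" optimization can be illustrated in a few lines: for a purely literal pattern, a plain substring scan (CPython's str.find is itself backed by memchr) replaces the matching engine entirely. The function below is a sketch, not any engine's actual implementation.

```python
# For a literal pattern, repeatedly calling the fast substring search replaces
# the whole matching engine.
def literal_find_all(needle, haystack):
    out, i = [], haystack.find(needle)
    while i != -1:
        out.append(i)
        i = haystack.find(needle, i + 1)
    return out

print(literal_find_all("Gutenberg", "Gutenberg ebooks from Project Gutenberg"))  # [0, 30]
```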

6.2 Future Work

As indicated above, further research is necessary to ascertain the performance benefits, or potential lack thereof, of specialized compiled-to-native-code regular expressions. We suggest evaluating within an environment with less variance, i.e., using the same compiler for all test cases. This could be achieved in our current environment by compiling our interpreter, or by rewriting our interpreter using PyPy's RPython toolchain. The RPython toolchain allows us to write an interpreter in RPython (a restricted subset of Python) and then compile it to native code. This would result in a general-case high-performance interpreter, as is the case in Rust's regex crate, and remove the specializing aspect of our system.

In the case of our autojit implementation, we suggest future research into supporting a greater subset of Python. We suggest starting JIT compilation from Python's bytecode rather than its AST and adding some form of type inference algorithm to the system. One implication of making this switch is that we lose access to Python's type hints, which can only be accessed from its AST. With that in mind, we still recommend going this route, as we found type hints to be somewhat cumbersome in their current form. For example, type hints on variables within a function are effectively treated as comments and do not show up in any of the function's attributes, in contrast to type hints specified on the function's arguments and return type. Also, adding type inference to the system would allow us to relax the constraints imposed on Python's dynamic type system through the use of type hints. Instead, type inference should be performed on function invocation, and functions should then be specialized based on the types of the parameters they are called with and cached for reuse.
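The behaviour described above is easy to demonstrate: annotations on parameters and the return type survive as a function attribute, while an annotation on a local variable exists only in the AST.

```python
import ast

src = """
def f(n: int) -> int:
    total: int = 0   # local variable annotation: invisible at runtime
    for i in range(n):
        total += i
    return total
"""

namespace = {}
exec(src, namespace)
f = namespace["f"]

# Argument/return hints survive as a function attribute...
print(f.__annotations__)  # {'n': <class 'int'>, 'return': <class 'int'>}

# ...but the annotation on the local `total` only exists in the AST.
tree = ast.parse(src)
locals_annotated = [n.target.id for n in ast.walk(tree)
                    if isinstance(n, ast.AnnAssign)]
print(locals_annotated)  # ['total']
assert "total" not in f.__annotations__
```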

Another area of improvement would be to look into how one could delegate execution

back to the Python interpreter in the case of unsupported features.

Finally, we recommend research into how to better support Python's already burgeoning arsenal of high-performance libraries. For our evaluations, we managed to work with NumPy arrays by directly accessing their underlying data pointer. This, however, only gives us access to the underlying data in the array, not to the functionality defined on NumPy arrays. We recommend research into a holistic framework enabling JIT interaction with the functionality defined within these libraries.

62 Bibliography

[1] Keith Adams, Jason Evans, Bertrand Maher, Guilherme Ottoni, Andrew Paroski, Brett Simmers, Edwin Smith, and Owen Yamauchi, The Hiphop Virtual Machine, Proceedings of the 2014 ACM International Conference on Object Ori- ented Programming Systems Languages & Applications (New York, NY, USA), OOPSLA ’14, ACM, 2014, pp. 777–790.

[2] Alfred V. Aho and Margaret J. Corasick, Efficient string matching: An aid to bibliographic search, Commun. ACM 18 (1975), no. 6, 333–340.

[3] Jo¨elAkeret, Lukas Gamper, Adam Amara, and Alexandre Refregier, HOPE: A Python Just-in-Time Compiler for Astrophysical Computations, Astronomy and Computing 10 (2015), 1–8.

[4] John Aycock, A Brief History of Just-in-time, ACM Comput. Surv. 35 (2003), no. 2, 97–113.

[5] Leonardo Banderali, Taking Eclipse OMR JitBuilder to a Language Near You, 2nd Workshop on Advances in Open Runtime Technology for Cloud Environ- ments (Co-located with CASCON 2018), Nov 2018.

[6] Gerg¨oBarany, pylibjit: A JIT Compiler Library for Python., Software Engi- neering (Workshops), 2014, pp. 213–224.

63 [7] Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith, Cython: The Best of Both Worlds, Computing in Science and Engg. 13 (2011), no. 2, 31–39.

[8] Federico Biancuzzi and Shane Warden, Masterminds of Programming: Conver- sations with the Creators of Major Programming Languages, ” O’Reilly Media, Inc.”, 2009.

[9] Carl Friedrich Bolz, Antonio Cuni, Maciej Fija lkowski, Michael Leuschel, Samuele Pedroni, and Armin Rigo, Runtime Feedback in a Meta-tracing JIT for Efficient Dynamic Languages, Proceedings of the 6th Workshop on Implementa- tion, Compilation, Optimization of Object-Oriented Languages, Programs and Systems (New York, NY, USA), ICOOOLPS ’11, ACM, 2011, pp. 9:1–9:8.

[10] Carl Friedrich Bolz, Antonio Cuni, Maciej Fijalkowski, and Armin Rigo, Tracing the Meta-level: PyPy’s Tracing JIT Compiler, Proceedings of the 4th Work- shop on the Implementation, Compilation, Optimization of Object-Oriented Languages and Programming Systems (New York, NY, USA), ICOOOLPS ’09, ACM, 2009, pp. 18–25.

[11] Carl Friedrich Bolz and Armin Rigo, How to not write Virtual Machines for Dynamic Languages, 3rd Workshop on Dynamic Languages and Applications, 2007.

[12] Carl Friedrich Bolz and Laurence Tratt, The Impact of Meta-tracing on VM Design and Implementation, Sci. Comput. Program. 98 (2015), no. P3, 408– 421.

[13] Robert S. Boyer and J. Strother Moore, A fast string searching algorithm, Com- mun. ACM 20 (1977), no. 10, 762–772.

64 [14] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, and Skye Wanderman-Milne, JAX: Composable

Transformations of Python+NumPy Programs, http://github.com/google/ jax, 2018, Accessed: 20-12-2019.

[15] Brett Cannon and Dino Viehland, PEP 523 – Adding a Frame Evaluation API

to CPython, https://www.python.org/dev/peps/pep-0523, May 2016, Ac- cessed: 20-12-2019.

[16] Jose Castanos, David Edelsohn, Kazuaki Ishizaki, Priya Nagpurkar, Toshio Nakatani, Takeshi Ogasawara, and Peng Wu, On the Benefits and Pitfalls of Extending a Statically Typed Language JIT Compiler for Dynamic Script- ing Languages, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (New York, NY, USA), OOPSLA ’12, ACM, 2012, pp. 195–212.

[17] Carl Chapman and Kathryn T. Stolee, Exploring Regular Expression Usage and Context in Python, Proceedings of the 25th International Symposium on Software Testing and Analysis (New York, NY, USA), ISSTA 2016, ACM, 2016, pp. 282–293.

[18] Cox, Russ, Regular Expression Matching: the Virtual Machine Approach,

https://swtch.com/~rsc/regexp/regexp2.html, Dec 2009, Accessed: 24-01- 2020.

[19] Cuni, Antonio, High Performance Implementation of Python for CLI/.NET with JIT Compiler Generation for Dynamic Languages, Ph.D. thesis, PhD thesis, Dipartimento di Informatica e Scienze dell’Informazione , 2010.

65 [20] , Inside cpyext: Why Emulating CPython C API is so Hard,

https://morepypy.blogspot.com/2018/09/inside-cpyext-why- emulating--c.html, Sept 2018, Accessed: 24-02-2020.

[21] L. Peter Deutsch and Allan M. Schiffman, Efficient Implementation of the Smalltalk-80 System, Proceedings of the 11th ACM SIGACT-SIGPLAN Sym- posium on Principles of Programming Languages (New York, NY, USA), POPL ’84, ACM, 1984, pp. 297–302.

[22] Mark-Jason Dominus, Perl Regular Expression Matching is NP-Hard, https: //perl.plover.com/NPC/, Accessed: 20-12-2019.

[23] Matthew Gaudet and Mark Stoodley, Rebuilding an Airliner in Flight: A Retro- spective on Refactoring IBM Testarossa Production Compiler for Eclipse OMR, Proceedings of the 8th International Workshop on Virtual Machines and Inter- mediate Languages (New York, NY, USA), VMIL 2016, ACM, 2016, pp. 24–27.

[24] Serge Guelton, Pierrick Brunet, Mehdi Amini, Adrien Merlini, Xavier Corbillon, and Alan Raynaud, Pythran: Enabling Static Optimization of Scientific Python Programs, Computational Science & Discovery 8 (2015), no. 1, 014001.

[25] Kazuaki Ishizaki, Takeshi Ogasawara, Jose Castanos, Priya Nagpurkar, David Edelsohn, and Toshio Nakatani, Adding Dynamically-Typed Language Support to a Statically-Typed Language Compiler: Performance Evaluation, Analysis, and Tradeoffs, ACM SIGPLAN Notices, vol. 47, ACM, 2012, pp. 169–180.

[26] Reid Kleckner, Unladen Swallow Retrospective, https://qinsb.blogspot. com/2011/03/unladen-swallow-retrospective.html, Mar 2011, Accessed: 20-12-2019.

66 [27] Kleene, Stephen Cole, Representation of Events in Nerve Nets and Finite Au- tomata, Tech. report, RAND PROJECT AIR FORCE SANTA MONICA CA, 1951.

[28] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert, Numba: A LLVM-based Python JIT compiler, Proceedings of the Second Workshop on the LLVM Com- piler Infrastructure in HPC (New York, NY, USA), LLVM ’15, ACM, 2015, pp. 7:1–7:6.

[29] Daryl Maier and Xiaoli Liang, Supercharge a Language Runtime!, Proceedings of the 27th Annual International Conference on Computer Science and Software Engineering (Riverton, NJ, USA), CASCON ’17, IBM Corp., 2017, pp. 314–314.

[30] Nicholas D. Matsakis and Felix S. Klock, II, The Rust Language, Proceedings of the 2014 ACM SIGAda Annual Conference on High Integrity Language Tech- nology (New York, NY, USA), HILT ’14, ACM, 2014, pp. 103–104.

[31] Kevin Modzelewski, Pyston 0.6.1 Released, and Future Plans — The Pys-

ton Blog, https://blog.pyston.org/2017/01/31/pyston-0-6-1-released- and-future-plans, Jan 2017, Accessed: 20-12-2019.

[32] Skip Montanaro and NY Rexford, A Peephole Optimizer for Python, Proceed- ings of the 7th International Python Conference, 1998.

[33] Guilherme Ottoni, HHVM JIT: A Profile-guided, Region-based Compiler for PHP and Hack, Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA), PLDI 2018, ACM, 2018, pp. 151–165.

[34] Rob Pike, The Text Editor Sam, Software: Practice and Experience 17 (1987), no. 11, 813–845.

[35] Mohaned Qunaibit, Stefan Brunthaler, Yeoul Na, Stijn Volckaert, and Michael Franz, Accelerating Dynamically-Typed Languages on Heterogeneous Platforms Using Guards Optimization, 32nd European Conference on Object-Oriented Programming (ECOOP 2018) (Dagstuhl, Germany) (Todd Millstein, ed.), Leibniz International Proceedings in Informatics (LIPIcs), vol. 109, Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2018, pp. 16:1–16:29.

[36] Armin Rigo, Representation-based Just-in-time Specialization and the Psyco Prototype for Python, Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation (New York, NY, USA), PEPM ’04, ACM, 2004, pp. 15–26.

[37] Armin Rigo and Samuele Pedroni, PyPy’s Approach to Virtual Machine Construction, Companion to the 21st ACM SIGPLAN Symposium on Object-oriented Programming Systems, Languages, and Applications (New York, NY, USA), OOPSLA ’06, ACM, 2006, pp. 944–953.

[38] Alex Rubinsteyn, Eric Hielscher, Nathaniel Weinman, and Dennis Shasha, Parakeet: A Just-in-time Parallel Accelerator for Python, Proceedings of the 4th USENIX Conference on Hot Topics in Parallelism (USA), HotPar ’12, USENIX Association, 2012, p. 14.

[39] Rust’s Regex Crate Developers, Crate regex, https://docs.rs/regex/1.3.5/regex/, n.d., Accessed: 24-02-2020.

[40] Henrique Nazare Santos, Pericles Alves, Igor Costa, and Fernando Magno Quintao Pereira, Just-in-time Value Specialization, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) (Washington, DC, USA), CGO ’13, IEEE Computer Society, 2013, pp. 1–11.

[41] Lukas Stadler, Adam Welc, Christian Humer, and Mick Jordan, Optimizing R Language Execution via Aggressive Speculation, Proceedings of the 12th Symposium on Dynamic Languages (New York, NY, USA), DLS 2016, ACM, 2016, pp. 84–95.

[42] Mark Stoodley, JitBuilder Status and Directions 2018, https://www.slideshare.net/MarkStoodley/jit-builder-status-and-directions-2018-0328, May 2018, Accessed: 20-12-2019.

[43] Ken Thompson, Programming Techniques: Regular Expression Search Algorithm, Commun. ACM 11 (1968), no. 6, 419–422.

[44] Guido Van Rossum et al., Python Programming Language, USENIX Annual Technical Conference, vol. 41, 2007, p. 36.

[45] Daniel C. Wang, Andrew W. Appel, Jeff L. Korn, and Christopher S. Serra, The Zephyr Abstract Syntax Description Language, Proceedings of the Conference on Domain-Specific Languages on Conference on Domain-Specific Languages (DSL), 1997 (USA), DSL’97, USENIX Association, 1997, p. 17.

[46] Thomas Würthinger, Christian Wimmer, Christian Humer, Andreas Wöß, Lukas Stadler, Chris Seaton, Gilles Duboscq, Doug Simon, and Matthias Grimmer, Practical Partial Evaluation for High-performance Dynamic Language Runtimes, Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA), PLDI 2017, ACM, 2017, pp. 662–676.

[47] Thomas Würthinger, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Doug Simon, and Christian Wimmer, Self-optimizing AST Interpreters, Proceedings of the 8th Symposium on Dynamic Languages (New York, NY, USA), DLS ’12, ACM, 2012, pp. 73–82.

[48] F. Yu, Z. Chen, Y. Diao, T. V. Lakshman, and R. H. Katz, Fast and Memory-Efficient Regular Expression Matching for Deep Packet Inspection, 2006 Symposium on Architecture For Networking And Communications Systems, Dec 2006, pp. 93–102.

[49] Haiping Zhao, Iain Proctor, Minghui Yang, Xin Qi, Mark Williams, Qi Gao, Guilherme Ottoni, Andrew Paroski, Scott MacVicar, Jason Evans, and Stephen Tu, The HipHop Compiler for PHP, Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (New York, NY, USA), OOPSLA ’12, ACM, 2012, pp. 575–586.

Appendix A

A.1 Code Repositories

Project                          URL
Python Bindings to JitBuilder    gitlab.casa.cs.unb.ca/python_proj/py-jitbuilder
Regex Engine & Util Libraries    gitlab.casa.cs.unb.ca/python_proj/regex-jit

Table A.1: Code Repositories.

Vita

Candidate’s full name: Dayton J. Allen

University attended (with dates and degrees obtained): University of the West Indies, BSc. Computer Science, 2017

Publications: [1] D.J. Allen, D. Bremner, M. Stoodley, D. Maier. “Accelerating Python Code Using JitBuilder,” Poster, 29th Annual International Conference on Computer Science and Software Engineering (CASCON x EVOKE 2019), Markham, Canada, November 4-6, 2019.

Conference Presentations: [1] D.J. Allen, D. Bremner, M. Stoodley, D. Maier. “A First Stop in the World of Python on JitBuilder’s Polyglot Journey.” 3rd Workshop on Advances in Open Runtime Technology for Cloud Computing (AORTCC), 29th Annual International Conference on Computer Science and Software Engineering (CASCON x EVOKE 2019), Markham, Canada, November 4-6, 2019.