How Fast Can We Make Interpreted Python?
Russell Power    Alex Rubinsteyn
New York University
{power,alexr}@cs.nyu.edu

arXiv:1306.6047v2 [cs.PL] 13 Aug 2013

Abstract

Python is a popular dynamic language with a large part of its appeal coming from powerful libraries and extension modules. These augment the language and make it a productive environment for a wide variety of tasks, ranging from web development (Django) to numerical analysis (NumPy).

Unfortunately, Python's performance is quite poor when compared to modern implementations of languages such as Lua and JavaScript. Why does Python lag so far behind these other languages? As we show, the very same API and extension libraries that make Python a powerful language also make it very difficult to execute efficiently.

Given that we want to retain access to the great extension libraries that already exist for Python, how fast can we make it? To evaluate this, we designed and implemented Falcon, a high-performance bytecode interpreter fully compatible with the standard CPython interpreter. Falcon applies a number of well-known optimizations and introduces several new techniques to speed up execution of Python bytecode. In our evaluation, we found Falcon an average of 25% faster than the standard Python interpreter on most benchmarks, and in some cases about 2.5X faster.

1. Introduction

Python is a popular programming language, with a long history and an active development community. A major driver of Python's popularity is the diverse ecosystem of libraries and extension modules which make it easy to do almost anything, from writing web servers to numerical computing. But despite significant effort by Python's developers, the performance of Python's interpreter still lags far behind implementations of languages such as Lua and JavaScript.

What differentiates Python from "faster" languages? Obviously, implementation choices (JIT vs. interpreter) can have a dramatic effect on performance. In the case of Python, however, the landscape of choices is severely constrained by the very same API that makes it easy to extend. The standard interpreter for Python (called CPython) exposes a low-level API (the Python C API [4]) which allows for building extension libraries and for embedding the interpreter in other programs. The Python C API allows access to almost every aspect of the interpreter, including inspecting the current interpreter state (what threads are running, function stacks, etc.) and pulling apart the representation of various object types (integer, float, list, dictionary, or user-defined). For performance reasons, many Python libraries are written in C or another compiled language, and interface to Python via the C API.

In an ideal world, any implementation of Python's semi-formal specification [19] would be interchangeable with the CPython implementation. Unfortunately, to avoid breaking libraries, an alternative implementation must also support the full C API. While the size of the C API (∼700 functions) is burdensome, what really makes it problematic is the degree to which it exposes the internal memory layout and behavior of Python objects. As a result, many Python extensions have become intimately coupled to the current implementation of the CPython interpreter. For instance, modifying the layout of the basic object format (for example, to use less memory) breaks even source-level compatibility with existing extensions.

The value of these extensions to Python is hard to overstate. Python already has a fast JIT compiler in the form of PyPy [9], but it has not seen widespread adoption. This is, to a large extent, due to the lack of support for existing CPython extension libraries. Replacing these libraries is not a simple undertaking; NumPy [16] alone consists of almost 100k lines of C source, and the SciPy libraries which build upon it are another 500k lines.

To evaluate how much we can improve over CPython (without breaking existing extension modules), we developed an alternative Python virtual machine called Falcon. Falcon converts CPython's stack-oriented bytecode into a register-based format and then performs optimizations to remove unnecessary operations and occasionally elide type checks. Falcon's register bytecode is then executed by a threaded interpreter, which attempts to accelerate common code patterns using attribute lookup caching and register tagging.

Overall, we found that:

• A combination of register conversion, simple bytecode optimization, and a virtual machine with threaded dispatch results in an average 25% speedup over CPython. On certain benchmarks, Falcon was up to 2.5X faster.

• Just-in-time register conversion and optimization are fast enough to run online for every function. This means that a system like Falcon can accelerate any code without the need for profiling and compilation heuristics.

2. Overview

Falcon does not replace the standard CPython interpreter, but rather runs inside of it. A user-defined function can be marked for execution by Falcon with the decorator @falcon. When a decorated function is called, Falcon translates that function's stack bytecode into Falcon's more compact register code. This register code is then optimized to remove redundant computations and decrease the number of registers needed by a function. The optimized register code is then passed on to Falcon's virtual machine for evaluation.

For example, consider executing the following function, which adds its two inputs, assigns their sum to a local variable, and then returns that variable:

    def add(x, y):
        z = x + y
        return z

Figure 1. Python function that adds two inputs

When Python first encounters this code, it is compiled into the following bytecode:

    LOAD_FAST    (x)
    LOAD_FAST    (y)
    BINARY_ADD
    STORE_FAST   (z)
    LOAD_FAST    (z)
    RETURN_VALUE

Figure 2. Python stack code that adds two inputs

Each operation in Python's bytecode implicitly interacts with a value stack. Python first pushes the values of the local variables x and y onto the stack. The instruction BINARY_ADD then takes these two values, adds them, and pushes this new result value onto the stack. Where did the values of the local variables come from? In addition to the stack, Python's virtual machine maintains a distinct array of values for named local variables. The numerical arguments attached to the LOAD_FAST and STORE_FAST instructions indicate which local variable is being loaded or stored.

Even from this simple example, we can see that Python bytecode burdens the virtual machine with a great deal of wasteful stack manipulation. Could we get better performance if we did away with the stack and instead used only the array of local values? This is the essence of a register bytecode. Every instruction explicitly labels which registers (local variable slots) it reads from and to which register it writes its result.

Translated into register code, the above example would look like:

    r2 = BINARY_ADD(r0, r1)
    RETURN_VALUE r2

Figure 3. Register code for adding two inputs

When converted into register code, the local variables x, y and z are represented by the registers r0, r1 and r2. Since the source and destination registers are part of each instruction, it is possible to express this function using only two instructions.

The downside to Falcon's register code format (like any register code) is that each instruction must be larger to make room for register arguments. There are two advantages to register code which make the space increase worthwhile. The first is a potential for reducing the time spent in virtual machine dispatch by reducing the number of instructions that must be executed. Previous research has verified that switching from a stack-based virtual machine to a register machine can improve performance [12, 18].

Additionally, it is much easier to write optimizations for a register code. The reason for this is that when every instruction implicitly affects a stack, program analyses and optimizations must track these side effects in some form of virtual stack. A register code, on the other hand, admits more compact optimization implementations (Section 3.2), since instructions only depend on each other through explicit flows of data along named registers.

3. Compiler

The Falcon compiler is structured as a series of passes, each of which modifies the register code in some way. To illustrate the behavior of each pass we will use a simple example function called count_threshold, which counts the number of elements in a list below a given threshold:

    def count_threshold(x, t):
        return sum([xi < t for xi in x])

For count_threshold, Python generates the following stack bytecode (Figure 4):

    LOAD_GLOBAL    (sum)
    BUILD_LIST
    LOAD_FAST      (x)
    GET_ITER
    10: FOR_ITER   (to 31)
    STORE_FAST     (xi)
    LOAD_FAST      (xi)
    LOAD_FAST      (t)
    COMPARE_OP     (<)
    LIST_APPEND
    JUMP_ABSOLUTE  10
    31: CALL_FUNCTION
    RETURN_VALUE

Figure 4. Python stack machine bytecode

Before we can do anything else, we need to convert the original stack machine bytecode into equivalent register code.

3.1 Stack-to-register conversion

To convert from Python stack code to register code, Falcon uses abstract interpretation [10]. In Falcon this takes the form of a "virtual stack" which stores register names instead of values. Falcon steps through a function's stack operations, evaluates the effect each has on the virtual stack, and emits an equivalent register machine operation. For example:

    Python          Falcon               Stack
    BUILD_LIST 0    r5 = BUILD_LIST 0    ⟨4⟩ → ⟨5, 4⟩

Rather than translate LOAD_FAST and STORE_FAST instructions directly, we can alias these variables to specially designated register names, which simplifies our code and reduces the number of instructions needed.

    Python             Falcon    Stack
    LOAD_FAST 0 (x)              ⟨5, 4⟩ → ⟨1, 5, 4⟩

Register r1 is aliased to the local variable x. Therefore, for the LOAD_FAST operation here, we don't need to generate a Falcon instruction, and can instead simply push r1 onto our virtual stack.

Handling control flow

For straight-line code this process is fairly easy; most Python
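Stack listings like Figure 2 can be reproduced with CPython's built-in dis module. A small sketch follows; note that exact opcode names vary across interpreter versions (for instance, CPython 3.11 replaced BINARY_ADD with a generic BINARY_OP), so the output will not match Figure 2 byte-for-byte on newer interpreters:

```python
# Inspect CPython's stack bytecode for the add() example (cf. Figures 1 and 2).
# Opcode names are version-dependent: 3.10 and earlier emit BINARY_ADD,
# 3.11+ emit BINARY_OP with an argument selecting '+'.
import dis

def add(x, y):
    z = x + y
    return z

dis.dis(add)  # prints the stack-machine listing for the current interpreter

# The opcode names can also be collected programmatically:
ops = [ins.opname for ins in dis.Bytecode(add)]
print(ops)
```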
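As a quick sanity check on the count_threshold example from Section 3: it works because Python's bool is a subclass of int, so summing the list of comparison results counts the True entries.

```python
# count_threshold from Section 3: counts elements of x strictly below t.
# Relies on bool being an int subclass, so each True contributes 1 to sum().
def count_threshold(x, t):
    return sum([xi < t for xi in x])

print(count_threshold([1, 5, 2, 8], 4))  # → 2  (1 and 2 are below 4)
```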
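The stack-to-register conversion of Section 3.1 can be sketched in a few lines of Python. This is a toy model rather than Falcon's actual implementation (Falcon handles the full opcode set, control flow, and runs inside CPython): opcode coverage here is just enough for the add() example, registers r0..r(n-1) are aliased to the function's locals as described above, and temporaries are allocated from a simple counter.

```python
# Toy sketch of Falcon's stack-to-register conversion (Section 3.1):
# a "virtual stack" holds register NAMES instead of values, and each stack
# instruction is replayed to emit an equivalent register instruction.
def stack_to_register(stack_code, num_locals):
    vstack = []               # virtual stack of register names
    next_reg = num_locals     # r0..r(num_locals-1) are aliased to locals
    out = []
    for op, arg in stack_code:
        if op == "LOAD_FAST":
            # Locals are aliased to registers: emit nothing, just push
            # the aliased register onto the virtual stack.
            vstack.append(f"r{arg}")
        elif op == "STORE_FAST":
            out.append(f"r{arg} = {vstack.pop()}")
        elif op == "BINARY_ADD":
            rhs, lhs = vstack.pop(), vstack.pop()   # top of stack is rhs
            dst = f"r{next_reg}"
            next_reg += 1
            out.append(f"{dst} = BINARY_ADD({lhs}, {rhs})")
            vstack.append(dst)
        elif op == "RETURN_VALUE":
            out.append(f"RETURN_VALUE {vstack.pop()}")
        else:
            raise NotImplementedError(op)
    return out

# The Figure 2 bytecode for add(x, y), with x=r0, y=r1, z=r2:
code = [("LOAD_FAST", 0), ("LOAD_FAST", 1), ("BINARY_ADD", None),
        ("STORE_FAST", 2), ("LOAD_FAST", 2), ("RETURN_VALUE", None)]
for line in stack_to_register(code, num_locals=3):
    print(line)
```

Note that this naive translation emits an extra move (r2 = r3); Falcon's later optimization passes, which remove redundant computation as described in Section 2, are what collapse it into the two-instruction form of Figure 3.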