Tracing the Meta-Level: PyPy’s Tracing JIT

Carl Friedrich Bolz (University of Düsseldorf, STUPS Group, Germany)
Antonio Cuni (University of Genova, DISI, Italy)
Maciej Fijalkowski (merlinux GmbH)
Armin Rigo

ABSTRACT

We attempt to apply the technique of tracing JIT compilers in the context of the PyPy project, i.e., to programs that are interpreters for some dynamic languages, including Python. Tracing JIT compilers can greatly speed up programs that spend most of their time in loops in which they take similar code paths. However, applying an unmodified tracing JIT to a program that is itself a bytecode interpreter results in very limited or no speedup. In this paper we show how to guide tracing JIT compilers to greatly improve the speed of bytecode interpreters. One crucial point is to unroll the bytecode dispatch loop, based on two hints provided by the implementer of the bytecode interpreter. We evaluate our technique by applying it to two PyPy interpreters: one is a small example, and the other one is the full Python interpreter.

1. INTRODUCTION

Dynamic languages have seen a steady rise in popularity in recent years. JavaScript is increasingly being used to implement full-scale applications which run within a browser, whereas other dynamic languages (such as Ruby, Perl, Python, PHP) are used for the server side of many web sites, as well as in areas unrelated to the web.

One of the often-cited drawbacks of dynamic languages is the performance penalty they impose. Typically they are slower than statically typed languages. Even though there has been a lot of research into improving the performance of dynamic languages (in the SELF project, to name just one example [18]), those techniques are not as widely used as one would expect. Many dynamic language implementations use completely straightforward bytecode interpreters without any advanced implementation techniques like just-in-time compilation. There are a number of reasons for this. Most of them boil down to the inherent complexities of using compilation: interpreters are simple to implement, understand, extend and port, whereas writing a just-in-time compiler is an error-prone task that is made even harder by the dynamic features of a language.

A recent approach to getting better performance for dynamic languages is that of tracing JIT compilers [16, 7]. Writing a tracing JIT compiler is relatively simple. It can be added to an existing interpreter for a language; the interpreter takes over some of the functionality of the compiler, and the machine code generation part can be simplified.

The PyPy project is trying to find approaches to generally ease the implementation of dynamic languages. It started as a Python implementation in Python, but has now extended its goals to be generally useful for implementing other dynamic languages as well. The general approach is to implement an interpreter for the language in a subset of Python. This subset is chosen in such a way that programs in it can be compiled into various target environments, such as C/Posix, the CLI or the JVM. The PyPy project is described in more detail in Section 2.

In this paper we discuss ongoing work in the PyPy project to improve the performance of interpreters written with the help of the PyPy toolchain. The approach is that of a tracing JIT compiler. Contrary to the tracing JITs for dynamic languages that currently exist, PyPy's tracing JIT operates "one level down", i.e., it traces the execution of the interpreter, as opposed to the execution of the user program. The fact that the program the tracing JIT compiles is in our case always an interpreter brings its own set of problems. We describe tracing JITs and their application to interpreters in Section 3. By this approach we hope to arrive at a JIT compiler that can be applied to a variety of dynamic languages, given an appropriate interpreter for each of them. The process is not completely automatic but needs a small number of hints from the interpreter author to help the tracing JIT. The details of how the process integrates into the rest of PyPy will be explained in Section 4. This work is not finished, but has already produced some promising results, which we will discuss in Section 5.

The contributions of this paper are:

• Applying a tracing JIT compiler to an interpreter.

• Finding techniques for improving the generated code.

2. THE PYPY PROJECT

The PyPy project [21, 5] (http://codespeak.net/pypy) is an environment where flexible implementations of dynamic languages can be written. To implement a dynamic language with PyPy, an interpreter for that language has to be written in RPython [1]. RPython ("Restricted Python") is a subset of Python chosen in such a way that type inference can be performed on it. The language interpreter can then be translated with the help of PyPy into various target environments, such as C/Posix, the CLI and the JVM. This is done by a component of PyPy called the translation toolchain.

By writing VMs in a high-level language, we keep the implementation of the language free of low-level details such as memory management strategy, threading model or object layout. These features are automatically added during the translation process. The process starts by performing control flow graph construction and type inference, followed by a series of steps, each step transforming the intermediate representation of the program produced by the previous one until we get the final executable. The first transformation step makes details of the Python object model explicit in the intermediate representation, later steps introducing garbage collection and other low-level details. As we will see later, this internal representation of the program is also used as an input for the tracing JIT.

3. TRACING JIT COMPILERS

Tracing optimizations were initially explored by the Dynamo project [2] to dynamically optimize machine code at runtime. Its techniques were then successfully used to implement a JIT compiler for a Java VM [16, 15]. Subsequently these tracing JITs were discovered to be a relatively simple way to implement JIT compilers for dynamic languages [7]. The technique is now being used by Mozilla's TraceMonkey JavaScript VM [14] and has been tried for Adobe's Tamarin ActionScript VM [8].

Tracing JITs are built on the following basic assumptions:

• programs spend most of their runtime in loops

• several iterations of the same loop are likely to take similar code paths

The basic approach of a tracing JIT is to only generate machine code for the hot code paths of commonly executed loops and to interpret the rest of the program. The code for those common loops however is highly optimized, including aggressive inlining.

Typically tracing VMs go through various phases when they execute a program:

• Interpretation/profiling

• Tracing

• Code generation

• Execution of the generated code

At first, when the program starts, everything is interpreted. The interpreter does a small amount of lightweight profiling to establish which loops are run most frequently. This lightweight profiling is usually done by having a counter on each backward jump instruction that counts how often this particular backward jump is executed. Since loops need a backward jump somewhere, this method looks for loops in the user program.

When a hot loop is identified, the interpreter enters a special mode, called tracing mode. During tracing, the interpreter records a history of all the operations it executes. It traces until it has recorded the execution of one iteration of the hot loop. To decide when this is the case, the trace is repeatedly checked during tracing as to whether the interpreter is at a position in the program where it had been earlier.

The history recorded by the tracer is called a trace: it is a sequential list of operations, together with their actual operands and results. Such a trace can be used to generate efficient machine code. This generated machine code is immediately executable, and can be used in the next iteration of the loop.

Being sequential, the trace represents only one of the many possible paths through the code. To ensure correctness, the trace contains a guard at every possible point where the path could have followed another direction, for example at conditions and indirect or virtual calls. When generating the machine code, every guard is turned into a quick check to guarantee that the path we are executing is still valid. If a guard fails, we immediately quit the machine code and continue the execution by falling back to interpretation. (There are more complex mechanisms in place to still produce extra code for the cases of guard failures [15], but they are independent of the issues discussed in this paper.)

It is important to understand how the tracer recognizes that the trace it recorded so far corresponds to a loop. This happens when the position key is the same as at an earlier point. The position key describes the position of the execution of the program, i.e., it usually contains things like the function currently being executed and the program counter position of the tracing interpreter. The tracing interpreter does not need to check all the time whether the position key already occurred earlier, but only at instructions that are able to change the position key to an earlier value, e.g., a backward branch instruction. Note that this is already the second place where backward branches are treated specially: during interpretation they are the place where the profiling is performed and where tracing is started or already existing assembler code executed; during tracing they are the place where the check for a closed loop is performed.

Let's look at a small example. Take the (slightly contrived) RPython code in Figure 1. The tracer interprets these functions in a bytecode format that is an encoding of the intermediate representation of PyPy's translation toolchain after type inference has been performed. When the profiler discovers that the while loop in strange_sum is executed often, the tracing JIT will start to trace the execution of that loop. The trace would look as in the lower half of Figure 1. The operations in this sequence are operations of the above-mentioned intermediate representation (e.g., the generic modulo and equality operations in the function above have been recognized to always take integers as arguments and are thus rendered as int_mod and int_eq). The trace contains all the operations that were executed in SSA form [12] and ends with a jump to its beginning, forming an endless loop that can only be left via a guard failure. The call to f is inlined into the trace. The trace contains only the hot else case of the if test in f, while the other branch is implemented via a guard failure. This trace can then be converted into machine code and executed.

    def f(a, b):
        if b % 46 == 41:
            return a - b
        else:
            return a + b

    def strange_sum(n):
        result = 0
        while n >= 0:
            result = f(result, n)
            n -= 1
        return result

    # corresponding trace:
    loop_header(result0, n0)
    i0 = int_mod(n0, Const(46))
    i1 = int_eq(i0, Const(41))
    guard_false(i1)
    result1 = int_add(result0, n0)
    n1 = int_sub(n0, Const(1))
    i2 = int_ge(n1, Const(0))
    guard_true(i2)
    jump(result1, n1)

Figure 1: A simple Python function and the recorded trace.

3.1 Applying a Tracing JIT to an Interpreter

The tracing JIT of the PyPy project is atypical in that it is not applied to the user program, but to the interpreter running the user program. In this section we will explore what problems this brings, and suggest how to solve them (at least partially). This means that there are two interpreters involved, and we need appropriate terminology to distinguish between them. On the one hand, there is the interpreter that the tracing JIT uses to perform tracing. This we will call the tracing interpreter. On the other hand, there is the interpreter that runs the user's programs, which we will call the language interpreter. In the following, we will assume that the language interpreter is bytecode-based. The program that the language interpreter executes we will call the user program (from the point of view of a VM author, the "user" is a programmer using the VM).

Similarly, we need to distinguish loops at two different levels: interpreter loops are loops inside the language interpreter. On the other hand, user loops are loops in the user program.

A tracing JIT compiler finds the hot loops of the program it is compiling. In our case, this program is the language interpreter. The most important hot interpreter loop is the bytecode dispatch loop (for many simple interpreters it is also the only hot loop). Tracing one iteration of this loop means that the recorded trace corresponds to execution of one opcode. This means that the assumption made by the tracing JIT – that several iterations of a hot loop take the same or similar code paths – is wrong in this case. It is very unlikely that the same particular opcode is executed many times in a row.

An example is given in Figure 2. It shows the code of a very simple bytecode interpreter with 256 registers and an accumulator. The bytecode argument is a string of bytes; all registers and the accumulator are integers. (The chain of if, elif, ... instructions checking the various opcodes is turned into a switch statement by one of PyPy's optimizations; Python itself does not have a switch statement.) A program for this interpreter that computes the square of the accumulator is shown in Figure 3.

    def interpret(bytecode, a):
        regs = [0] * 256
        pc = 0
        while True:
            opcode = ord(bytecode[pc])
            pc += 1
            if opcode == JUMP_IF_A:
                target = ord(bytecode[pc])
                pc += 1
                if a:
                    pc = target
            elif opcode == MOV_A_R:
                n = ord(bytecode[pc])
                pc += 1
                regs[n] = a
            elif opcode == MOV_R_A:
                n = ord(bytecode[pc])
                pc += 1
                a = regs[n]
            elif opcode == ADD_R_TO_A:
                n = ord(bytecode[pc])
                pc += 1
                a += regs[n]
            elif opcode == DECR_A:
                a -= 1
            elif opcode == RETURN_A:
                return a

Figure 2: A very simple bytecode interpreter with registers and an accumulator.

    MOV_A_R    0 # i = a
    MOV_A_R    1 # copy of 'a'

    # 4:
    MOV_R_A    0 # i--
    DECR_A
    MOV_A_R    0

    MOV_R_A    2 # res += a
    ADD_R_TO_A 1
    MOV_A_R    2

    MOV_R_A    0 # if i!=0: goto 4
    JUMP_IF_A  4

    MOV_R_A    2 # return res
    RETURN_A

Figure 3: Example bytecode: Compute the square of the accumulator.
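For readers who want to experiment, the following is a minimal, directly runnable sketch of the interpreter of Figure 2 together with the program of Figure 3 assembled by hand. The opcode values are those implied by the guard constants in the traces shown later (Figures 4 and 6); the value chosen for RETURN_A and the small driver at the end are our own additions for illustration and not part of PyPy.

    # Opcode values as implied by the guard_value constants in Figures 4 and 6;
    # RETURN_A's value is an arbitrary choice for this sketch.
    MOV_A_R, MOV_R_A, JUMP_IF_A, ADD_R_TO_A, DECR_A, RETURN_A = 1, 2, 3, 5, 7, 8

    def interpret(bytecode, a):
        # Same interpreter as in Figure 2, only written more compactly.
        regs = [0] * 256
        pc = 0
        while True:
            opcode = ord(bytecode[pc]); pc += 1
            if opcode == JUMP_IF_A:
                target = ord(bytecode[pc]); pc += 1
                if a:
                    pc = target
            elif opcode == MOV_A_R:
                n = ord(bytecode[pc]); pc += 1
                regs[n] = a
            elif opcode == MOV_R_A:
                n = ord(bytecode[pc]); pc += 1
                a = regs[n]
            elif opcode == ADD_R_TO_A:
                n = ord(bytecode[pc]); pc += 1
                a += regs[n]
            elif opcode == DECR_A:
                a -= 1
            elif opcode == RETURN_A:
                return a

    # The program of Figure 3, encoded as a string; offset 4 is the loop head.
    SQUARE = "".join(chr(c) for c in [
        MOV_A_R, 0, MOV_A_R, 1,
        MOV_R_A, 0, DECR_A, MOV_A_R, 0,
        MOV_R_A, 2, ADD_R_TO_A, 1, MOV_A_R, 2,
        MOV_R_A, 0, JUMP_IF_A, 4,
        MOV_R_A, 2, RETURN_A])

    if __name__ == "__main__":
        print(interpret(SQUARE, 12))   # prints 144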

If the tracing interpreter traces the execution of the DECR_A opcode (whose integer value is 7), the trace would look as in Figure 4. Because of the guard on opcode0, the code compiled from this trace will be useful only when executing a long series of DECR_A opcodes. For all the other operations the guard will fail, which will mean that performance is probably not improved at all.

    loop_start(a0, regs0, bytecode0, pc0)
    opcode0 = strgetitem(bytecode0, pc0)
    pc1 = int_add(pc0, Const(1))
    guard_value(opcode0, Const(7))
    a1 = int_sub(a0, Const(1))
    jump(a1, regs0, bytecode0, pc1)

Figure 4: Trace when executing the DECR_A opcode.

To improve this situation, the tracing JIT could trace the execution of several opcodes, thus effectively unrolling the bytecode dispatch loop. Ideally, the bytecode dispatch loop should be unrolled exactly so much that the unrolled version corresponds to a user loop. User loops occur when the program counter of the language interpreter has the same value several times. This program counter is typically stored in one or several variables in the language interpreter, for example the bytecode object of the currently executed function of the user program and the position of the current bytecode within that. In the example above, the program counter is represented by the bytecode and pc variables.

Since the tracing JIT cannot know which parts of the language interpreter are the program counter, the author of the language interpreter needs to mark the relevant variables of the language interpreter with the help of a hint. The tracing interpreter will then effectively add the values of these variables to the position key. This means that the loop will only be considered to be closed if these variables that are making up the program counter at the language interpreter level are the same a second time. Loops found in this way are, by definition, user loops.

The program counter of the language interpreter can only be the same a second time after an instruction in the user program sets it to an earlier value. This happens only at backward jumps in the language interpreter. That means that the tracing interpreter needs to check for a closed loop only when it encounters a backward jump in the language interpreter. Again the tracing JIT cannot know which part of the language interpreter implements backward jumps, so the author of the language interpreter needs to indicate this with the help of a hint.

The language interpreter uses a similar technique to detect hot user loops: the profiling is done at the backward branches of the user program, using one counter per seen program counter of the language interpreter.

The condition for reusing already existing machine code also needs to be adapted to this new situation. In a classical tracing JIT there is at most one piece of assembler code per loop of the jitted program, which in our case is the language interpreter. When applying the tracing JIT to the language interpreter as described so far, all pieces of assembler code correspond to the bytecode dispatch loop of the language interpreter. However, they correspond to different paths through the loop and different ways to unroll it. To ascertain which of them to use when trying to enter assembler code again, the program counter of the language interpreter needs to be checked. If it corresponds to the position key of one of the pieces of assembler code, then this assembler code can be executed. This check again only needs to be performed at the backward branches of the language interpreter.

Let's look at how hints would need to be applied to the example interpreter from Figure 2. Figure 5 shows the relevant parts of the interpreter with hints applied. One needs to instantiate JitDriver by listing all the variables of the bytecode loop. The variables are classified into two groups, "green" variables and "red" variables. The green variables are those that the tracing JIT should consider to be part of the program counter of the language interpreter. In the case of the example, the pc variable is obviously part of the program counter; however, the bytecode variable is also counted as green, since the pc variable is meaningless without the knowledge of which bytecode string is currently being interpreted. All other variables are red (the fact that red variables need to be listed explicitly too is an implementation detail).

    tlrjitdriver = JitDriver(greens = ['pc', 'bytecode'],
                             reds = ['a', 'regs'])

    def interpret(bytecode, a):
        regs = [0] * 256
        pc = 0
        while True:
            tlrjitdriver.jit_merge_point(
                bytecode=bytecode, pc=pc,
                a=a, regs=regs)
            opcode = ord(bytecode[pc])
            pc += 1
            if opcode == JUMP_IF_A:
                target = ord(bytecode[pc])
                pc += 1
                if a:
                    if target < pc:
                        tlrjitdriver.can_enter_jit(
                            bytecode=bytecode, pc=target,
                            a=a, regs=regs)
                    pc = target
            elif opcode == MOV_A_R:
                ... # rest unmodified

Figure 5: Simple bytecode interpreter with hints applied.

In addition to the classification of the variables, there are two methods of JitDriver that need to be called. Both of them receive as arguments the current values of the variables listed in the definition of the driver. The first one is jit_merge_point, which needs to be put at the beginning of the body of the bytecode dispatch loop. The other, more interesting one, is can_enter_jit. This method needs to be called at the end of any instruction that can set the program counter of the language interpreter to an earlier value. For the example this is only the JUMP_IF_A instruction, and only if it is actually a backward jump. Here is where the language interpreter performs profiling to decide when to start tracing. It is also the place where the tracing JIT checks whether a loop is closed. This is considered to be the case when the values of the "green" variables are the same as at an earlier call to the can_enter_jit method.

For the small example the hints look like a lot of work. However, the number of hints is essentially constant no matter how large the interpreter is, which makes the extra work negligible for larger interpreters.

When executing the square function of Figure 3, the profiling will identify the loop in the square function to be hot, and start tracing. It traces the execution of the interpreter running the loop of the square function for one iteration, thus unrolling the interpreter loop of the example interpreter eight times. The resulting trace can be seen in Figure 6.
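The profiling and loop-closing behaviour at can_enter_jit can be pictured with a small model. The class below is not PyPy's JitDriver but a toy stand-in, written under the assumption that the position key is simply the tuple of green values ('pc' and 'bytecode') and that a loop counts as hot after a fixed threshold of backward jumps:

    class ToyJitDriver:
        """Toy model of the profiling and loop-closing logic at can_enter_jit."""

        def __init__(self, threshold=40):
            self.threshold = threshold
            self.counters = {}        # position key -> number of backward jumps seen
            self.tracing_from = None  # position key the current trace started from

        def can_enter_jit(self, **values):
            # The position key is the tuple of green values.
            key = (values['pc'], values['bytecode'])
            if self.tracing_from is not None:
                if key == self.tracing_from:
                    # The green values repeat: the trace corresponds to one user
                    # loop iteration and could now be turned into machine code.
                    self.tracing_from = None
                return
            self.counters[key] = self.counters.get(key, 0) + 1
            if self.counters[key] >= self.threshold:
                self.tracing_from = key   # the user loop is hot: start tracing here

    # Hypothetical usage, mirroring Figure 5:
    # tlrjitdriver = ToyJitDriver()
    # ... tlrjitdriver.can_enter_jit(bytecode=bytecode, pc=target, a=a, regs=regs)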
3.2 Improving the Result

The critical problem of tracing the execution of just one opcode has been solved: the loop now corresponds exactly to the loop in the square function. However, the resulting trace is not optimized enough. Most of its operations are not actually doing any computation that is part of the square function. Instead, they manipulate the data structures of the language interpreter. While this is to be expected, given that the tracing interpreter looks at the execution of the language interpreter, it would still be an improvement if some of these operations could be removed.

    loop_start(a0, regs0, bytecode0, pc0)
    # MOV_R_A 0
    opcode0 = strgetitem(bytecode0, pc0)
    pc1 = int_add(pc0, Const(1))
    guard_value(opcode0, Const(2))
    n1 = strgetitem(bytecode0, pc1)
    pc2 = int_add(pc1, Const(1))
    a1 = call(Const(<* fn list_getitem>), regs0, n1)
    # DECR_A
    opcode1 = strgetitem(bytecode0, pc2)
    pc3 = int_add(pc2, Const(1))
    guard_value(opcode1, Const(7))
    a2 = int_sub(a1, Const(1))
    # MOV_A_R 0
    opcode2 = strgetitem(bytecode0, pc3)
    pc4 = int_add(pc3, Const(1))
    guard_value(opcode2, Const(1))
    n2 = strgetitem(bytecode0, pc4)
    pc5 = int_add(pc4, Const(1))
    call(Const(<* fn list_setitem>), regs0, n2, a2)
    # MOV_R_A 2
    opcode3 = strgetitem(bytecode0, pc5)
    pc6 = int_add(pc5, Const(1))
    guard_value(opcode3, Const(2))
    n3 = strgetitem(bytecode0, pc6)
    pc7 = int_add(pc6, Const(1))
    a3 = call(Const(<* fn list_getitem>), regs0, n3)
    # ADD_R_TO_A 1
    opcode4 = strgetitem(bytecode0, pc7)
    pc8 = int_add(pc7, Const(1))
    guard_value(opcode4, Const(5))
    n4 = strgetitem(bytecode0, pc8)
    pc9 = int_add(pc8, Const(1))
    i0 = call(Const(<* fn list_getitem>), regs0, n4)
    a4 = int_add(a3, i0)
    # MOV_A_R 2
    opcode5 = strgetitem(bytecode0, pc9)
    pc10 = int_add(pc9, Const(1))
    guard_value(opcode5, Const(1))
    n5 = strgetitem(bytecode0, pc10)
    pc11 = int_add(pc10, Const(1))
    call(Const(<* fn list_setitem>), regs0, n5, a4)
    # MOV_R_A 0
    opcode6 = strgetitem(bytecode0, pc11)
    pc12 = int_add(pc11, Const(1))
    guard_value(opcode6, Const(2))
    n6 = strgetitem(bytecode0, pc12)
    pc13 = int_add(pc12, Const(1))
    a5 = call(Const(<* fn list_getitem>), regs0, n6)
    # JUMP_IF_A 4
    opcode7 = strgetitem(bytecode0, pc13)
    pc14 = int_add(pc13, Const(1))
    guard_value(opcode7, Const(3))
    target0 = strgetitem(bytecode0, pc14)
    pc15 = int_add(pc14, Const(1))
    i1 = int_is_true(a5)
    guard_true(i1)
    jump(a5, regs0, bytecode0, target0)

Figure 6: Trace when executing the square function of Figure 3, with the corresponding opcodes as comments.

The simple insight on how to improve the situation is that most of the operations in the trace are actually concerned with manipulating the bytecode string and the program counter. Those are stored in variables that are "green" (i.e., they are part of the position key). This means that the tracer checks that those variables have some fixed value at the beginning of the loop (they may well change over the course of the loop, though). In the example of Figure 6 the check would be that at the beginning of the trace the pc variable is 4 and the bytecode variable is the bytecode string corresponding to the square function. Therefore it is possible to constant-fold computations on them away, as long as the operations are side-effect free. Since strings are immutable in RPython, it is possible to constant-fold the strgetitem operations. The int_add operations are additions of the green variable pc and a constant number, so they can be folded away as well.

With this optimization enabled, the trace looks as in Figure 7. Now much of the language interpreter is actually gone from the trace and what is left corresponds very closely to the loop of the square function. The only vestige of the language interpreter is the fact that the register list is still used to store the state of the computation. This could be removed by some other optimization, but is maybe not really all that bad anyway (in fact we have an experimental optimization that does exactly that, but it is not yet finished). Once we get this optimized trace, we can pass it to the JIT backend, which generates the corresponding machine code.

    loop_start(a0, regs0)
    # MOV_R_A 0
    a1 = call(Const(<* fn list_getitem>), regs0, Const(0))
    # DECR_A
    a2 = int_sub(a1, Const(1))
    # MOV_A_R 0
    call(Const(<* fn list_setitem>), regs0, Const(0), a2)
    # MOV_R_A 2
    a3 = call(Const(<* fn list_getitem>), regs0, Const(2))
    # ADD_R_TO_A 1
    i0 = call(Const(<* fn list_getitem>), regs0, Const(1))
    a4 = int_add(a3, i0)
    # MOV_A_R 2
    call(Const(<* fn list_setitem>), regs0, Const(2), a4)
    # MOV_R_A 0
    a5 = call(Const(<* fn list_getitem>), regs0, Const(0))
    # JUMP_IF_A 4
    i1 = int_is_true(a5)
    guard_true(i1)
    jump(a5, regs0)

Figure 7: Trace when executing the square function of Figure 3, with the corresponding opcodes as comments. The constant-folding of operations on green variables is enabled.
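To make the constant folding of operations on green variables more concrete, here is a toy folding pass. It is not PyPy's actual optimizer; it assumes a trace represented as (result, opname, args) tuples, an environment giving the known values of the green variables at the loop header, and it only handles the two operation kinds discussed above:

    def fold_green_ops(trace, green_values):
        """Toy constant folding of operations whose arguments are all known.

        trace:        list of (result_name, opname, args) tuples
        green_values: dict mapping names (e.g. 'pc0', 'bytecode0') to known values
        Returns the residual trace; folded results are remembered internally."""
        known = dict(green_values)
        residual = []
        for result, opname, args in trace:
            # Replace argument names by their values where known.
            concrete = [known.get(a, a) for a in args]
            if opname == "int_add" and all(isinstance(a, int) for a in concrete):
                known[result] = concrete[0] + concrete[1]      # fold pc updates away
            elif opname == "strgetitem" and isinstance(concrete[0], str) \
                    and isinstance(concrete[1], int):
                known[result] = ord(concrete[0][concrete[1]])  # string is immutable
            else:
                residual.append((result, opname, concrete))
        return residual

    # Example: the first three operations of Figure 6, with pc0 = 4 and the
    # bytecode string known, fold down to nothing but the guard.
    ops = [("opcode0", "strgetitem", ["bytecode0", "pc0"]),
           ("pc1", "int_add", ["pc0", 1]),
           ("g0", "guard_value", ["opcode0", 2])]
    print(fold_green_ops(ops, {"pc0": 4, "bytecode0": chr(1) * 4 + chr(2)}))

A real optimizer would additionally evaluate the remaining guard at trace-compile time, since after folding both of its arguments are constants.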
4. IMPLEMENTATION ISSUES

In this section we will describe some of the practical issues when implementing the scheme described in the last section in PyPy. In particular we will describe some of the problems of integrating the various parts with each other.

The first integration problem is how to not integrate the tracing JIT at all. It is possible to choose, when the language interpreter is translated to C, whether the JIT should be built in or not. If the JIT is not enabled, all the hints that are possibly in the interpreter source are just ignored by the translation process.

If the JIT is enabled, things are more interesting. At the moment the JIT can only be enabled when translating the interpreter to C, but we hope to lift that restriction in the future. A classical tracing JIT will interpret the program it is running until a hot loop is identified, at which point tracing and ultimately assembler generation starts. The tracing JIT in PyPy is operating on the language interpreter, which is written in RPython. But RPython programs are statically translatable to C anyway. This means that interpreting the language interpreter before a hot loop is found is clearly not desirable, since the overhead of this double interpretation would be significantly too big to be practical.

What is done instead is that the language interpreter keeps running as a C program, until a hot loop in the user program is found. To identify loops, the C version of the language interpreter is generated in such a way that at the place that corresponds to the can_enter_jit hint, profiling is performed using the program counter of the language interpreter. Apart from this bit of profiling, the language interpreter behaves in just the same way as without a JIT.

When a hot user loop is identified, tracing is started. The tracing interpreter is invoked to start tracing the language interpreter that is running the user program. Of course the tracing interpreter cannot actually trace the execution of the C representation of the language interpreter. Instead it takes the state of the execution of the language interpreter and starts tracing using a bytecode representation of the language interpreter. That means there are two "versions" of the language interpreter embedded in the final executable of the VM: on the one hand it is there as executable machine code, on the other hand as bytecode for the tracing interpreter. It also means that tracing is costly as it incurs a double interpretation overhead.

From then on things proceed as described in Section 3. The tracing interpreter tries to find a loop in the user program; if it finds one it will produce machine code for that loop and this machine code will be immediately executed. The machine code is executed until a guard fails. Then the execution should fall back to normal interpretation by the language interpreter. This falling back is possibly a complex process, since the guard failure can have occurred arbitrarily deep in a helper function of the language interpreter, which would make it hard to rebuild the state of the language interpreter and let it run from that point (e.g., this would involve building a potentially deep C stack). Instead the falling back is achieved by a special fallback interpreter which runs the language interpreter and the user program from the point of the guard failure. The fallback interpreter is essentially a variant of the tracing interpreter that does not keep a trace. The fallback interpreter runs until execution reaches a safe point where it is easy to let the C version of the language interpreter resume its operation (such a safe point corresponds to a jit_merge_point, which is in fact the only reason for that hint). This means that the fallback interpreter executes at most one bytecode operation of the language interpreter and then falls back to the C version of the language interpreter. After this, the whole process of profiling may start again.

Machine code production is done via a well-defined interface to an assembler backend. This allows easy porting of the tracing JIT to various architectures (including, we hope, to virtual machines such as the JVM where our backend could generate JVM bytecode at runtime). At the moment the only implemented backend is a 32-bit Intel x86 backend.

5. EVALUATION

In this section we evaluate the work done so far by looking at some benchmarks. Since the work is not finished, these can only be preliminary. Benchmarking was done on an otherwise idle machine with a 1.4 GHz Pentium M processor and 1 GB RAM, using Linux 2.6.27. All benchmarks were run 50 times, each in a newly started process. The first run was ignored. The final numbers were reached by computing the average of all other runs; the confidence intervals were computed using a 95% confidence level. All times include running the tracer and producing machine code.
The first round of benchmarks (Figure 8) are timings of the example interpreter given in Figure 2 computing the square of 10000000 using the bytecode of Figure 3 (the result overflows, but for smaller numbers the running time is not long enough to sensibly measure it). The results for various configurations are as follows:

Benchmark 1: The interpreter translated to C without including a JIT compiler.

Benchmark 2: The tracing JIT is enabled, but no interpreter-specific hints are applied. This corresponds to the trace in Figure 4. The threshold when to consider a loop to be hot is 40 iterations. As expected, this is not faster than the previous number. It is even quite a bit slower, probably due to the overheads, as well as non-optimal generated machine code.

Benchmark 3: The tracing JIT is enabled and hints as in Figure 5 are applied. This means that the interpreter loop is unrolled so that it corresponds to the loop in the square function. Constant folding of green variables is disabled, therefore the resulting machine code corresponds to the trace in Figure 6. This alone brings an improvement over the previous case, but is still slower than pure interpretation.

Benchmark 4: Same as before, but with constant folding enabled. This corresponds to the trace in Figure 7. This speeds up the square function considerably, making it nearly three times faster than the pure interpreter.

Benchmark 5: Same as before, but with the threshold set so high that the tracer is never invoked. In this way the overhead of the profiling is measured. For this interpreter it seems to be rather large, with about 20% slowdown due to profiling. This is because the interpreter is small and the opcodes simple. For larger interpreters (e.g., PyPy's Python interpreter) the overhead will likely be less significant.

                                        Time (ms)       Speedup
    1  Compiled to C, no JIT            442.7 ±  3.4    1.00
    2  Normal trace compilation        1518.7 ±  7.2    0.29
    3  Unrolling of interp. loop        737.6 ±  7.9    0.60
    4  JIT, full optimizations          156.2 ±  3.8    2.83
    5  Profile overhead                 515.0 ±  7.2    0.86

Figure 8: Benchmark results of the example interpreter computing the square of 10000000.

To test the technique on a more realistic example, we did some preliminary benchmarks with PyPy's Python interpreter. The function we benchmarked as well as the results can be seen in Figure 9. While the function may seem a bit arbitrary, executing it is still non-trivial, as a normal Python interpreter needs to dynamically dispatch nearly all of the involved operations, like indexing into the tuple, addition and comparison of i. We benchmarked PyPy's Python interpreter with the JIT disabled, with the JIT enabled, and CPython 2.5.2, the reference implementation of Python (http://python.org). In addition we benchmarked CPython using Psyco 1.6 [20], a specializing JIT compiler for Python.

The results show that the tracing JIT speeds up the execution of this Python function significantly, even outperforming CPython. To achieve this, the tracer traces through the whole Python dispatching machinery, automatically inlining the relevant fast paths. However, the manually tuned Psyco still performs a lot better than our prototype (although it is interesting to note that Psyco improves the speed of CPython by only a factor of 3.29 in this example, while our tracing JIT improves PyPy by a factor of 6.54).

    def f(a):
        t = (1, 2, 3)
        i = 0
        while i < a:
            t = (t[1], t[2], t[0])
            i += t[0]
        return i

                                         Time (s)       Speedup
    1  PyPy compiled to C, no JIT        23.44 ± 0.07    1.00
    2  PyPy compiled to C, with JIT       3.58 ± 0.05    6.54
    3  CPython 2.5.2                      4.96 ± 0.05    4.73
    4  CPython 2.5.2 + Psyco 1.6          1.51 ± 0.05   15.57

Figure 9: Benchmarked function and results for the Python interpreter running f(10000000).
6. RELATED WORK

Applying a trace-based optimizer to an interpreter and adding hints to help the tracer produce better results has been tried before in the context of the DynamoRIO project [24], which has been a great inspiration for our work. They achieve the same unrolling of the interpreter loop so that the unrolled version corresponds to the loops in the user program. However the approach is greatly hindered by the fact that they trace on the machine code level and thus have no high-level information available about the interpreter. This makes it necessary to add quite a large number of hints, because at the assembler level it is not really visible anymore that, e.g., a bytecode string is immutable. Also more advanced optimizations like allocation removal would not be possible with that approach.

The standard approach for automatically producing a compiler for a language, given an interpreter for it, is that of partial evaluation [13, 19]. Conceptually there are some similarities to our work. In partial evaluation some arguments of the interpreter function are known (static) while the rest are unknown (dynamic). This separation of arguments is related to our separation of variables into those that should be part of the position key and the rest. In partial evaluation all parts of the interpreter that rely only on static arguments can be constant-folded so that only operations on the dynamic arguments remain.

Classical partial evaluation has failed to be useful for dynamic languages for much the same reasons why ahead-of-time compilers cannot compile them to efficient code. If the partial evaluator knows only the program it simply does not have enough information to produce good code. Therefore some work has been done to do partial evaluation at runtime. One of the earliest works on runtime specialisation is Tempo for C [10, 9]. However, it is essentially a normal partial evaluator "packaged as a library"; decisions about what can be specialised, and how, are pre-determined. Another work in this direction is DyC [17], another runtime specializer for C. Both of these projects have a problem similar to that of DynamoRIO: targeting the C language makes higher-level specialisation difficult.

There have been some attempts to do dynamic partial evaluation, which is partial evaluation that defers partial evaluation completely to runtime to make partial evaluation more useful for dynamic languages. This concept was introduced by Sullivan [23], who implemented it for a small dynamic language based on lambda calculus. It is also related to Psyco [20], a specializing JIT compiler for Python. There is some work by one of the authors to implement a dynamic partial evaluator for Prolog [3]. There are also experiments within the PyPy project to use dynamic partial evaluation for automatically generating JIT compilers out of interpreters [22, 11]. So far those have not been as successful as we would like and it seems likely that they will be supplanted with the work on tracing JITs described here.

7. CONCLUSION AND NEXT STEPS

We have shown techniques for improving the results when applying a tracing JIT to an interpreter. Our first benchmarks indicate that these techniques work really well on small interpreters, and first experiments with PyPy's Python interpreter make it appear likely that they can be scaled up to realistic examples.

A lot of work remains. We are working on two main optimizations at the moment. Those are:

Allocation Removal: A key optimization for making the approach produce good code for more complex dynamic languages is to perform escape analysis on the loop operations after tracing has been performed. In this way all objects that are allocated during the loop and do not actually escape the loop do not need to be allocated on the heap at all but can be exploded into their respective fields. This is very helpful for dynamic languages where primitive types are often boxed, as the repeated allocation of intermediate results is very costly.
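As a rough illustration of what such an escape analysis has to decide (and not a description of PyPy's actual implementation), the sketch below inspects a trace given as (result, opname, args) tuples and collects the allocations whose objects never leave the loop; the operation names used here are invented for the example:

    def find_removable_allocations(trace):
        """Toy escape analysis over a trace of (result, opname, args) tuples.

        Returns the allocations whose objects never escape the loop and could
        therefore be exploded into their fields.  The operation names ('new',
        'setfield', 'getfield', 'call', 'jump') are illustrative only."""
        allocated = {res for res, op, args in trace if op == "new"}
        escaping = set()
        for res, op, args in trace:
            if op == "setfield":
                # Storing an allocated object *into* another object makes it escape;
                # writing into the object itself (args[0]) does not.
                escaping.update(a for a in args[1:] if a in allocated)
            elif op == "getfield":
                pass                       # reading a field does not cause an escape
            else:
                # calls, guards and the closing jump make their arguments escape
                escaping.update(a for a in args if a in allocated)
        return allocated - escaping

    # A boxed intermediate result that is written and read only inside the loop:
    trace = [("b0", "new", []),
             ("",   "setfield", ["b0", "intval", "i7"]),
             ("i8", "getfield", ["b0", "intval"]),
             ("",   "jump", ["i8"])]
    print(find_removable_allocations(trace))   # prints {'b0'}

A real optimizer would then drop the removable allocations and forward the stored field values directly to the corresponding reads.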
Optimizing Frame Objects: One problem with the removal of allocations is that many dynamic languages are so reflective that they allow the introspection of the frame object that the interpreter uses to store local variables (e.g., Smalltalk, Python). This means that intermediate results always escape because they are stored into the frame object, rendering the allocation removal optimization ineffective. To remedy this problem we make it possible to update the frame object lazily, only when it is actually accessed from outside of the code generated by the JIT.

Furthermore both tracing and leaving machine code are very slow due to a double interpretation overhead, and we might need techniques for improving those.

Eventually we will need to apply the JIT to the various interpreters that are written in RPython to evaluate how widely applicable the described techniques are. Possible targets for such an evaluation would be the SPy-VM, a Smalltalk implementation [4]; a Prolog interpreter; PyGirl, a Gameboy emulator [6]; and also not immediately obvious ones, like Python's regular expression engine.

If these experiments are successful we hope that we can reach a point where it becomes unnecessary to write a language-specific JIT compiler and instead possible to just apply a couple of hints to the interpreter to get reasonably good performance with relatively little effort.

8. REFERENCES

[1] D. Ancona, M. Ancona, A. Cuni, and N. D. Matsakis. RPython: a step towards reconciling dynamically and statically typed OO languages. In Proceedings of the 2007 Symposium on Dynamic Languages, pages 53–64, Montreal, Quebec, Canada, 2007. ACM.
[2] V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimization system. ACM SIGPLAN Notices, 35(5):1–12, 2000.
[3] C. F. Bolz. Automatic JIT Compiler Generation with Runtime Partial Evaluation. Master thesis, Heinrich-Heine-Universität Düsseldorf, 2008.
[4] C. F. Bolz, A. Kuhn, A. Lienhard, N. Matsakis, O. Nierstrasz, L. Renggli, A. Rigo, and T. Verwaest. Back to the Future in One Week — Implementing a Smalltalk VM in PyPy, pages 123–139. 2008.
[5] C. F. Bolz and A. Rigo. How to not write a virtual machine. In Proceedings of the 3rd Workshop on Dynamic Languages and Applications (DYLA 2007), 2007.
[6] C. Bruni and T. Verwaest. PyGirl: generating whole-system VMs from high-level prototypes using PyPy. In Tools, accepted for publication, 2009.
[7] M. Chang, M. Bebenita, A. Yermolovich, A. Gal, and M. Franz. Efficient just-in-time execution of dynamically typed languages via code specialization using precise runtime type inference. Technical Report ICS-TR-07-10, Donald Bren School of Information and Computer Science, University of California, Irvine, 2007.
[8] M. Chang, E. Smith, R. Reitmaier, M. Bebenita, A. Gal, C. Wimmer, B. Eich, and M. Franz. Tracing for web 3.0: Trace compilation for the next generation web applications. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pages 71–80, Washington, DC, USA, 2009. ACM.
[9] C. Consel, L. Hornof, F. Noël, J. Noyé, and N. Volanschi. A uniform approach for compile-time and run-time specialization. Dagstuhl Seminar on Partial Evaluation, pages 54–72, 1996.
[10] C. Consel and F. Noël. A general approach for run-time specialization and its application to C. In Proceedings of the 23rd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 145–156, St. Petersburg Beach, Florida, United States, 1996. ACM.
[11] A. Cuni, D. Ancona, and A. Rigo. Faster than C#: Efficient implementation of dynamic languages on .NET. Submitted to ICOOOLPS'09.
[12] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451–490, Oct. 1991.
[13] Y. Futamura. Partial evaluation of computation process – an approach to a compiler-compiler. Higher-Order and Symbolic Computation, 12(4):381–391, 1999.
[14] A. Gal, B. Eich, M. Shaver, D. Anderson, B. Kaplan, G. Hoare, D. Mandelin, B. Zbarsky, J. Orendorff, M. Bebenita, M. Chang, M. Franz, E. Smith, R. Reitmaier, and M. Haghighat. Trace-based just-in-time type specialization for dynamic languages. In PLDI, 2009.
[15] A. Gal and M. Franz. Incremental dynamic code generation with trace trees. Technical Report ICS-TR-06-16, Donald Bren School of Information and Computer Science, University of California, Irvine, Nov. 2006.
[16] A. Gal, C. W. Probst, and M. Franz. HotpathVM: an effective JIT compiler for resource-constrained devices. In Proceedings of the 2nd International Conference on Virtual Execution Environments, pages 144–153, Ottawa, Ontario, Canada, 2006. ACM.
[17] B. Grant, M. Mock, M. Philipose, C. Chambers, and S. J. Eggers. DyC: an expressive annotation-directed dynamic compiler for C. Theoretical Computer Science, 248(1–2):147–199, 2000.
[18] U. Hölzle. Adaptive optimization for SELF: reconciling high performance with exploratory programming. Technical report, Stanford University, 1994.
[19] N. D. Jones, C. K. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation. Prentice-Hall, Inc., 1993.
[20] A. Rigo. Representation-based just-in-time specialization and the Psyco prototype for Python. In Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation, pages 15–26, Verona, Italy, 2004. ACM.
[21] A. Rigo and S. Pedroni. PyPy's approach to virtual machine construction. In Companion to the 21st ACM SIGPLAN Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 944–953, Portland, Oregon, USA, 2006. ACM.
[22] A. Rigo and S. Pedroni. JIT compiler architecture. Technical Report D08.2, PyPy, May 2007.
[23] G. T. Sullivan. Dynamic partial evaluation. In Proceedings of the Second Symposium on Programs as Data Objects, pages 238–256. Springer-Verlag, 2001.
[24] G. T. Sullivan, D. L. Bruening, I. Baron, T. Garnett, and S. Amarasinghe. Dynamic native optimization of interpreters. In Proceedings of the 2003 Workshop on Interpreters, Virtual Machines and Emulators, pages 50–57, San Diego, California, 2003. ACM.