Pycket: A Tracing JIT For a Functional Language

Spenser Bauman (a), Carl Friedrich Bolz (b), Robert Hirschfeld (c), Vasily Kirilichev (c), Tobias Pape (c), Jeremy G. Siek (a), Sam Tobin-Hochstadt (a)

(a) Indiana University Bloomington, USA; (b) King's College London, UK; (c) Hasso Plattner Institute, University of Potsdam, Germany

Abstract

We present Pycket, a high-performance tracing JIT compiler for Racket. Pycket supports a wide variety of the sophisticated features in Racket such as contracts, continuation marks, classes, structures, dynamic binding, and more. On average, over a standard suite of benchmarks, Pycket outperforms existing compilers, both Racket's JIT and other highly-optimizing Scheme compilers. Further, Pycket provides much better performance for proxies than existing systems, dramatically reducing the overhead of contracts and gradual typing. We validate this claim with performance evaluation on multiple existing benchmark suites.

The Pycket implementation is of independent interest as an application of the RPython meta-tracing framework (originally created for PyPy), which automatically generates tracing JIT compilers from interpreters. Prior work on meta-tracing focuses on bytecode interpreters, whereas Pycket is a high-level interpreter based on the CEK abstract machine and operates directly on abstract syntax trees. Pycket supports proper tail calls and first-class continuations. In the setting of a functional language, where recursion and higher-order functions are more prevalent than explicit loops, the most significant performance challenge for a tracing JIT is identifying which control flows constitute a loop—we discuss two strategies for identifying loops and measure their impact.

1. Introduction

Contemporary high-level languages like Java, JavaScript, Haskell, ML, Lua, Python, and Scheme rely on sophisticated compilers to produce high-performance code. Two broad traditions have emerged in the implementation of such compilers. In functional languages, such as Haskell, Scheme, ML, Lisp, or Racket, ahead-of-time (AOT) compilers perform substantial static analysis, assisted by type information available either from the language's type system or from programmer declarations. In contrast, dynamic, object-oriented languages, such as Lua, JavaScript, and Python, following the tradition of Self, are supported by just-in-time (JIT) compilers which analyze the execution of programs and dynamically compile them to machine code.

While both of these approaches have produced notable progress, the state of the art is not entirely satisfactory. In particular, for dynamically-typed functional languages, high performance traditionally requires the addition of type information, whether in the form of declarations (Common Lisp), type-specific operations (Racket), or an additional static type system (Typed Racket). Furthermore, AOT compilers have so far been unable to remove the overhead associated with highly-dynamic programming patterns, such as the dynamic checks used to implement contracts and gradual types. In these situations, programmers often make software engineering compromises to achieve better performance.

The importance of JavaScript in modern browsers has led to considerable research on JIT compilation, including both method-based (Paleczny et al. 2001) and trace-based approaches (Gal et al. 2009). The current industry trend is toward method-based approaches, as the higher warm-up time of tracing JITs can have a greater impact on short-running JavaScript programs. However, for languages with different workloads, such as Python and Lua, tracing JITs are in widespread use (Bolz et al. 2009).

To address the drawbacks of AOT compilation for functional languages, and to explore a blank spot in the compiler design space, we present Pycket, a tracing JIT compiler for Racket, a dynamically-typed, mostly-functional language descended from Scheme. Pycket is implemented using the RPython meta-tracing framework, which automatically generates a tracing JIT compiler from an interpreter written in RPython ("Restricted Python"), a subset of Python.

To demonstrate the effectiveness of Pycket, consider a function computing the dot product of two vectors in Racket:

    (define (dot u v) (for/sum ([x u] [y v]) (* x y)))

This implementation uses a Racket comprehension, which iterates in lockstep over u and v, binding their elements to x and y, respectively. The for/sum operator performs a summation over the values generated in each iteration, in this case the products of the vector elements. This dot function works over arbitrary sequences (lists, vectors, specialized vectors, etc.) and uses generic arithmetic. Dot product is at the core of many numerical algorithms (Demmel 1997). We use this function as a running example throughout the paper.

In Racket, the generality of dot comes at a cost. If we switch from general to floating-point specific vectors and specialize the iteration and numeric operations, dot runs 6× faster.
On the other hand, if we increase safety by adding contracts, checking that the inputs are vectors of floats, dot runs 2× slower.

In Pycket, the generic version runs at almost the same speed as the specialized version—the overhead of generic sequences, vectors, and arithmetic is eliminated. In fact, the code generated for the inner loop is identical—the performance differs only due to warm-up. Pycket executes the generic version of dot 5 times faster than the straightforwardly specialized version in Racket and 1.5 times faster than the most manually optimized code we wrote for Racket. Furthermore, with the help of accurate loop-finding using dynamic construction of call graphs, Pycket eliminates the contract overhead in dot—the generated inner loop is identical to dot without contracts.

With Pycket, we depart from traditional Lisp and Scheme compilers in several ways. First, we do no AOT optimization. The only transformations statically performed by Pycket are converting from core Racket syntax to A-normal form and converting assignments to mutable variables to heap mutations. We present an overview of Pycket's architecture and key design decisions in section 2.

Second, Pycket performs aggressive run-time optimization by leveraging RPython's trace-based compilation facilities. With trace-based compilation, the runtime system starts by interpreting the program and watching for hot loops. Once a hot loop is detected, the system records the instructions executed in the loop and optimizes the resulting straight-line trace. Subsequent executions of the loop use the optimized trace instead of the interpreter. Thus, Pycket automatically sees through indirections due to pointers, objects, or higher-order functions. We present background on tracing in section 3 and describe optimizations that we applied to the interpreter that leverage tracing in section 5.

Trace-based compilation poses a significant challenge for functional languages such as Racket because the only looping mechanism is the function call, but not all function calls create loops. Tracing JITs have used the following approaches to detect loops, but none of them is entirely adequate for functional languages.

• Backward Branches: Any jump to a smaller address counts as a loop (Bala et al. 2000). This approach is not suitable for an AST interpreter such as Pycket.

• Cyclic Paths: Returning to the same static program location counts as a loop (Hiniker et al. 2005). This approach is used in PyPy (Bolz et al. 2009), but in the context of Pycket, too many non-loop code paths are detected (Section 4).

• Static Detection: Identify loops statically, either from explicit loop constructs (Bebenita et al. 2010) or through static analysis of control flow (Gal and Franz 2006).

• Call Stack Comparison: To detect recursive functions, Hayashizaki et al. (2011) inspect the call stack to determine when the target of the current function call is already on the stack.

Section 4 describes our approach to loop detection, which combines a simplified form of call-stack comparison with an analysis based on a dynamically constructed call graph.

Overall, we make the following contributions.

1. We describe the first high-performance JIT compiler for a dynamically typed functional language.

2. We show that tracing JIT compilation works well for eliminating the overhead of proxies and contracts.

3. We show that our combination of optimizations eliminates the need for manual specialization.

We validate these contributions with an empirical evaluation of each contribution in section 6. Our results show that Pycket is the fastest compiler among several mature, highly-optimizing Scheme systems such as Bigloo, Gambit, and Larceny on their own benchmark suite, and that its performance on contracts is substantially better than Racket, V8, and PyPy. Also, we show that manual specialization is not needed for good performance in Pycket.¹

Pycket is available, together with links to our benchmarks, at https://github.com/samth/pycket

¹ This paper builds on unpublished, preliminary work presented by Bolz et al. (2014). This paper reports major improvements with respect to performance, optimizations, loop-finding, coverage of the Racket language, and breadth of benchmarks, including programs with contracts.

2. Pycket Primer

This section presents the architecture of Pycket but defers the description of the JIT to section 3. Pycket is an implementation of Racket but is built in a way that generalizes to other dynamically typed, functional languages. Fundamentally, Pycket is an implementation of the CEK machine (Felleisen and Friedman 1987), but scaled up from the lambda calculus to most of Racket, including macros, assignment, multiple values, modules, structures, continuation marks and more. While this precise combination of features may not be present in many languages, the need to handle higher-order functions, dynamic data structures, control operators, and dynamic binding is common to languages ranging from OCaml to JavaScript.

Now to describe the implementation of Pycket, consider again the dot product function from the introduction:

    (define (dot u v) (for/sum ([x u] [y v]) (* x y)))

This example presents several challenges. First, for/sum and define are macros, which must be expanded to core syntax before interpretation. Second, these macros rely on runtime support functions from libraries, which must be loaded to run the function. Third, this loop is implemented with a tail-recursive function, which must avoid stack growth. We now describe our solutions to each of these challenges in turn.

Macros & Modules

Pycket uses Racket's macro expander (Flatt 2002) to evaluate macros, thereby reducing Racket programs to just a few core forms implemented by the runtime system (Tobin-Hochstadt et al. 2011).

To run a Racket program,² Pycket uses Racket to macro-expand all the modules used in a program and write the resulting forms and metadata to JSON-encoded files. Pycket then reads the serialized representation, parses it into an AST, and executes it. Adopting this technique enables Pycket to handle most of the Racket language while focusing on the key research contributions.

² Pycket supports programs written as modules but not an interactive REPL.

Assignment conversion and ANF

Once a module is expanded to core Racket and parsed from the serialized representation, Pycket performs two transformations on the AST. First, the program is converted to A-normal form (ANF), ensuring that all non-trivial expressions are named (Danvy 1991; Flanagan et al. 1993). For example, converting the expression (* (+ x 2) (+ y 3)) to ANF yields:

    (let ((t1 (+ x 2))
          (t2 (+ y 3)))
      (* t1 t2))

Strictly speaking, converting to ANF might be characterized as an AOT optimization in Pycket; however, ANF is traditionally used not as an optimization in itself, but as a simplified intermediate representation that enables further analysis and optimization, which Pycket does not do. We discuss below why ANF improves performance, as well as the specific challenges it raises in an interpreter-only context.

Next, we convert all mutable variables (those that are the target of set!) into heap-allocated cells. This is a common technique in Lisp systems, and Racket performs it as well. This approach allows environments to be immutable mappings from variables to values, and localizes mutation in the heap. Additionally, each AST node stores its static environment; see section 5.1 for how this is used.
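To make the second transformation concrete, the heap cell can be pictured as a one-field box. The class below is a minimal sketch with names of our own choosing, not Pycket's actual code:

    class W_Cell(object):
        # heap-allocated box introduced for each variable that is the
        # target of set!; the environment maps the variable to the cell
        def __init__(self, value):
            self.value = value

        def get(self):
            return self.value

        def set(self, new_value):
            # (set! x e) becomes a write to the cell, so all mutation
            # lives in the heap and environments stay immutable
            self.value = new_value

After conversion, reading such a variable is an ordinary environment lookup followed by a cell read; section 5.3 describes how a strategy later stores fixnum and flonum cell contents in unboxed form.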

CEK states and representation

With our program in its final form, we now execute it using the CEK machine. To review, the CEK machine is described by the four transition rules in Figure 1. The CEK state has the form ⟨e, ρ, κ⟩, where e is the AST for the program (the control), ρ is the environment (a mapping of variables to values), and κ is the continuation. A continuation is a sequence of frames, and there are two kinds of frames: arg(e, ρ) represents the argument of a function application that is waiting to be evaluated, and fun(v, ρ) represents a function that is waiting for its argument.

    e ::= x | λx.e | e e
    κ ::= [] | arg(e, ρ) :: κ | fun(v, ρ) :: κ

    ⟨x, ρ, κ⟩ ⟼ ⟨ρ(x), ρ, κ⟩
    ⟨(e1 e2), ρ, κ⟩ ⟼ ⟨e1, ρ, arg(e2, ρ) :: κ⟩
    ⟨v, ρ, arg(e, ρ′) :: κ⟩ ⟼ ⟨e, ρ′, fun(v, ρ) :: κ⟩
    ⟨v, ρ, fun(λx.e, ρ′) :: κ⟩ ⟼ ⟨e, ρ′[x ↦ v], κ⟩

    Figure 1. The CEK machine for the lambda calculus.

The first transition evaluates a variable by looking it up in the environment. The second and third transitions concern function applications; they reduce the function and argument expressions to values. The fourth transition performs the actual function call. Because no continuations are created when entering a function, tail calls are space efficient.

Initial execution of a program such as dot begins by injecting the expanded, assignment-converted, and A-normalized program into a CEK machine triple with an empty continuation and environment. The following is the (slightly simplified) main loop of the interpreter.

    try:
        while True:
            ast, env, cont = ast.interpret(env, cont)
    except Done, e:
        return e.values

This RPython code continuously transforms an (ast, env, cont) CEK triple into a new one, by calling the interpret method of the ast, with the current environment env and continuation cont as arguments. This process goes on until the continuation is the empty continuation, in which case a Done exception is raised, which stores the return value.

Environments and continuations are straightforward linked lists of frames, although environments store only values, not variable names, using the static environment information to aid lookup. Continuation frames are the frames from the CEK machine, extended to handle Racket's additional core forms such as begin and letrec. Each continuation also contains information for storing continuation marks (Clements 2006), a Racket feature supporting stack inspection and dynamic binding. Because each continuation frame is represented explicitly and heap-allocated in the interpreter, rather than using the conventional procedure call stack, first-class continuations, as created by call/cc, are straightforward to implement and carry very little run-time penalty, as the results of section 6 show. In this, our runtime representation resembles that of Standard ML of New Jersey (Appel and MacQueen 1991).

This approach also makes it straightforward to implement proper tail calls, as required by Racket and the Scheme standard (Sperber et al. 2010). On the other hand, one complication is that runtime primitives that call Racket procedures must be written in a variant of continuation-passing style, since each Racket-level function expects its continuation to be allocated explicitly and pushed on the continuation register. In contrast, typical interpreters written in RPython (and the Racket runtime system itself) expect user-level functions to return to their callers in the conventional way.
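The rules of Figure 1 map directly onto interpreter classes: each kind of continuation frame knows how to consume a value and produce the next machine triple. The following is a schematic sketch of the two lambda-calculus frames; the names and method signatures are our simplification, not Pycket's actual code:

    class Cont(object):
        pass

    class ArgCont(Cont):
        # arg(e, rho): the argument expression e still awaits evaluation
        def __init__(self, expr, env, prev):
            self.expr, self.env, self.prev = expr, env, prev

        def plug(self, w_value):
            # third transition: evaluate the argument next, remembering
            # the just-computed function value in a fun frame
            return self.expr, self.env, FunCont(w_value, self.prev)

    class FunCont(Cont):
        # fun(v, rho): the closure v waits for its argument value
        def __init__(self, w_closure, prev):
            self.w_closure, self.prev = w_closure, prev

        def plug(self, w_value):
            # fourth transition: enter the closure body in its captured
            # environment extended with the argument; no frame is pushed,
            # which is why tail calls consume no continuation space
            clo = self.w_closure
            return clo.body, clo.env.extend(clo.var, w_value), self.prev

Racket's additional core forms (begin, letrec, and so on) add further frame classes, but the pattern is the same: a frame receives a value and returns the next ⟨control, environment, continuation⟩ triple.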
Contracts and Chaperones

One distinctive feature of Racket is the extensive support for higher-order software contracts (Findler and Felleisen 2002). Software contracts allow programmers to specify preconditions, postconditions, and invariants using the programming language itself. This enables assertions, verification, program analysis, random testing, and gradual typing, among many other language features. Higher-order contracts allow greater expressiveness, scaling these features to modern linguistic constructs, but at the cost of wrappers and indirections, which often entail noticeable performance overhead. Contracts are also used in the implementation of gradual typing in Typed Racket (Tobin-Hochstadt and Felleisen 2008), where they protect the boundary between typed and untyped code. Here again, the cost of these wrappers has proved significant (St-Amour et al. 2012).

In Racket, contracts are implemented using the chaperone and impersonator proxying mechanism (Strickland et al. 2012), and make heavy use of Racket's structure system. These are the most complex parts of the Racket runtime system that Pycket supports—we provide comprehensive implementations of both. This support is necessary to run both the Racket standard library and most Racket programs. Our implementations of these features follow the high-level specifications closely. In almost all cases, the tracing JIT compiler is nonetheless able to produce excellent results.

Primitives and values

Racket comes with over 1,400 primitive functions and values; Pycket implements close to half of them. These range from numeric operations, where Pycket implements the full numeric tower including bignums, rational numbers, and complex numbers, to regular expression matching, to input/output including a port abstraction. As of this writing, more than half of the non-test lines of code in Pycket implement primitive functions.

One notable design decision in the implementation of primitive values is to abandon the Lisp tradition of pointer tagging. Racket and almost all Scheme systems, along with many other language runtimes, store small integers (in Racket, up to 63 bits on 64-bit architectures) as immediates, and only box large values, taking advantage of pointer-alignment restrictions to distinguish pointers from integers. Some systems even store other values as immediates, such as symbols, characters, or cons cells. Instead, all Pycket values are boxed, including small integers. This has the notable advantage that Pycket provides machine-integer-quality performance on the full range of machine integers, whereas systems that employ tagging suffer performance costs for applications that require true 64-bit integers. However, abandoning tagging means relying even more heavily on JIT optimization—when the extra boxing cannot be optimized away, even simple programs perform poorly.
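In code, the uniform object model is an ordinary class hierarchy; the W_ prefix follows RPython convention, though the exact classes here are our sketch rather than Pycket's source:

    class W_Object(object):
        # root of all run-time values; every value is a heap object
        pass

    class W_Fixnum(W_Object):
        def __init__(self, value):
            self.value = value   # the full machine integer, no tag bits

    class W_Flonum(W_Object):
        def __init__(self, value):
            self.value = value

A multiplication therefore reads two boxes and allocates a fresh one in the interpreter; in a compiled trace the boxes disappear and the raw values flow through registers, as figure 3 shows.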
ify preconditions, postconditions, and invariants using the program- When a hot loop is identified, the JIT starts tracing the loop. The ming language itself. This enables , verification, program tracing process records the operations executed by the interpreter analysis, random testing, and gradual typing, among many other lan- for one iteration of the loop. The JIT then optimizes the instruc- guage features. Higher-order contracts allow greater expressiveness, tion sequence and generates machine code which will be used on scaling these features to modern linguistic constructs, but at the subsequent iterations of the loop. During tracing, the JIT inserts cost of wrappers and indirections, which often entail noticeable per- guards into the trace to detect when execution diverges from the formance overhead. Contracts are also used in the implementation trace and needs to return control to the interpreter. Frequently taken of gradual typing in Typed Racket (Tobin-Hochstadt and Felleisen fall back paths are also candidates for tracing and optimization. In

In the dot function, guards are generated for the loop termination condition (one for each sequence). Additional tests, such as dynamic type tests, vector bounds checks, or integer overflow checks, are optimized away.

The RPython project (Bolz et al. 2009; Bolz and Tratt 2013) consists of a language and tool chain for implementing dynamic language interpreters. The RPython language is a subset of Python amenable to type inference and static optimization. The tool chain translates an interpreter, implemented in RPython, into an efficient virtual machine, automatically inserting the necessary runtime components, such as a garbage collector and JIT compiler. The translation process generates a control-flow graph of the interpreter and performs type inference. This representation is then simplified to a low-level intermediate representation that is easily translated to machine code and is suitable for use in the tracing JIT.

Tracing JITs typically operate directly on a representation of the program; in contrast, the JIT generated by RPython operates on a representation of the interpreter; that is, RPython generates a meta-tracing JIT. To make effective use of the RPython JIT, the interpreter source must be annotated to help identify loops in the interpreted program, and to optimize away the overhead of the interpreter.

For Pycket, we annotate the main loop of the CEK interpreter as in figure 2. The annotations indicate that this is the main loop of the interpreter (jit_merge_point) and that AST nodes marked with should_enter are places where a loop in the interpreted program might start (can_enter_jit). At these places the JIT inspects the state of the interpreter by reading the local variables and then transfers control to the tracer.

    try:
        while True:
            driver.jit_merge_point()
            if isinstance(ast, App):
                prev = ast
            ast, env, cont = ast.interpret(env, cont)
            if ast.should_enter:
                driver.can_enter_jit()
    except Done, e:
        return e.values

    Figure 2. Interpreter main loop with hints.
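Figure 2 leaves the declaration of driver implicit. In RPython, it is a JitDriver that names which interpreter variables determine the position in the interpreted program (the greens) and which are ordinary mutable state (the reds); the hint calls then pass these variables as keyword arguments. The sketch below shows a plausible declaration; the green/red split matches the figure, but the details are our assumption rather than Pycket's verbatim source (section 4.2 explains how the prev variable of figure 2 joins the greens):

    from rpython.rlib.jit import JitDriver

    def get_location(ast):
        # rendered in JIT logs; helps identify which loops get compiled
        return ast.tostring()

    driver = JitDriver(greens=['ast'], reds=['env', 'cont'],
                       get_printable_location=get_location)

    # inside the loop of figure 2, the hints carry the variables:
    #   driver.jit_merge_point(ast=ast, env=env, cont=cont)
    #   driver.can_enter_jit(ast=ast, env=env, cont=cont)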
This ranges from the tuple holding the three components of the In a conventional tracing JIT, loops can start at any target of a machine state, to the environments holding each variable, to the back-edge in the control-flow graph. In contrast, Pycket requires continuations created for each operation. For example, a simple two- special care to determine where loops can start because the control argument multiply operation, as found in dot, will create and use 3 flow of functional programs is particularly challenging to determine; continuation frames. Since both integers and floating-point numbers see section4 for the details. are boxed, unlike in typical Scheme implementations, many of these allocations must be eliminated to obtain high performance. Fortu- 3.1 Generic RPython optimizations nately, RPython’s optimizer is able to see and remove allocations RPython backend performs a large number of trace optimizations. that do not live beyond the scope of a trace (Bolz et al. 2011). These optimizations are generic and not specialized to Pycket, but they are essential to understand the performance of Pycket. 3.2 An optimized trace Standard Optimizations RPython’s trace optimizer includes The optimized trace for the inner loop of dot is shown in figure3. a suite of standard compiler optimizations, such as common- Traces are represented in SSA form (Cytron et al. 1991). Variable subexpression elimination, copy propagation, constant folding, and names are prefixed according to type, with p for pointers, f for many others (Ardo¨ et al. 2012). One advantage of trace compilation floats, and i for integers. Several aspects of this trace deserve for optimization is that the linear control-flow graph of a trace is mention. First, the loop header and final jump appear to pass just a straight line. Therefore, the optimizations and their support- arguments, but this is merely a device for describing the content ing analyses can be implemented in two passes over the trace, one of memory—no function call is made here. Second, we see that forward pass and one backward pass. there are two loop counters, as generated by the original Racket program. More sophisticated loop optimizations could perhaps Inlining Inlining is a vital compiler optimization for high-level remove one of them. Third, note that nothing is boxed—floating languages, both functional and object-oriented. In a tracing JIT point values are stored directly in the array, the sum is stored in a compiler such as RPython, inlining comes for free from tracing (Gal register, as are the loop counters. This is the successful result of the et al. 2006). A given trace will include the inlined code from allocation removal optimization. Third, note that no overflow, tag, any functions called during the trace. This includes Racket-level functions as well as runtime system functions (Bolz et al. 2009). The 3 Developed originally by Mike Pall in the context of LuaJIT.

Fourth, note that no overflow, tag, or bounds checks are generated. Some have been hoisted above the inner loop and others have been proved unnecessary. Finally, the guard_not_invalidated() call at the beginning of the trace does not actually become a check at run time—instead, it indicates that some other operation might invalidate this code, forcing a recompile.

At points where control flow may diverge from that observed during tracing, guard operations are inserted. Guards take a conditional argument and behave as no-ops when that condition evaluates to true, while a false condition will abort the trace and hand control back to the interpreter. Many guards generated during tracing are elided by the trace optimizer; the only guards which remain in figure 3 encode the loop exit condition by first comparing each loop index (i66 and i70) to the length of the corresponding input array (i17 and i28). The guard operations then assert that each loop index is less than the array length.

Due to the design of Pycket as a CEK machine, Pycket relies on these optimizations much more heavily than PyPy. During tracing, many instructions which manage the CEK triple are recorded: allocating environments and continuation frames, building and destructing CEK triples, and traversing the environment to look up variables. Allocation removal eliminates environments and continuations which do not escape a trace, constant folding and propagation eliminate management of the CEK triples, and the loop-invariant code motion pass eliminates environment lookups by hoisting them into a preamble trace. The result for the dot function is a tight inner loop without any of the management infrastructure needed to maintain the interpreter state.

4. Finding Loops

In Pycket, determining where loops start and stop is challenging. In a tracing JIT for a low-level language, the program counter is typically used to detect loops: a loop is a change of the program counter to a smaller value (Bala et al. 2000). Most RPython-based interpreters use a bytecode instruction set, where the same approach can still be used (Bolz et al. 2009).

However, Pycket, as a CEK machine, operates over the AST of the program, which is significantly more high-level than most bytecode instruction sets. The only AST construct which can lead to looping behavior is function application. Not every function application leads to a loop, so it is necessary to classify function applications into those that can close loops and those that cannot.

One approach would be to perform a static analysis that looks at the program and tries to construct a call graph statically. This is, however, very difficult (i.e., imprecise) in the presence of higher-order functions and the possibility of storing functions as values in the heap. In this section, we describe two runtime approaches to detecting appropriate loops, both of which are dynamic variants of the "false loop filtering" technique (Hayashizaki et al. 2011).

4.1 Why Cyclic Paths are Not Enough

In general, the body of every lambda may denote a trace header (i.e., may be the start of a loop). Though many functions encode loops, treating the body of every lambda as a trace header results in many traces that do not correspond to loops in the program text. For example, a non-recursive function called repeatedly in the body of a loop will initiate tracing at the top of the function and trace through its return into the body of the loop. Thus, identifying loops based on returning to the same static program location does not allow the JIT to distinguish between functions called consecutively ("false loops") and recursive invocations of a function. Consider the following example, which defines two functions f and g, both of two arguments.

    (define (g a b) (+ a b))
    (define (f a b) (g a b) (g a b) (f a b))

The g function computes the sum of its two arguments, while f invokes g on its arguments twice and then calls itself with the same arguments. Although it never terminates, f provides a simple example of a false loop. f forms a tail-recursive loop, with two trace headers: one at the beginning of the loop and one at each invocation of g.

The function g is invoked twice per iteration of f, so the JIT first sees g as hot and begins tracing at one of the invocations of g. Tracing proceeds from the top of g and continues until the interpreter next enters the body of g. This occurs by returning to the body of f and tracing to the next invocation of g. As a result, only part of the loop body is traced. To cover the entire body of the loop, the guard generated to determine the return point of g must also be traced, encoding a single loop as multiple traces that use guard operations to dispatch to the correct trace. This results in more time spent tracing and suboptimal code, as the JIT cannot perform inter-trace optimization.

4.2 Two State Representation

One method to "trick" the JIT into tracing a whole loop is to encode the location in the program as a pair of the current and previous AST node (with respect to execution). This distinction only matters at potential trace headers (the top of every lambda body). As such, each trace header is encoded as the body of the lambda along with its call site—only application nodes are tracked. The modified CEK loop (figure 2) stores this additional state in the prev variable, which is picked up by the JIT at the calls to the methods jit_merge_point and can_enter_jit.

In the example above, each invocation of g would appear to the JIT as a separate location. Tracing will now begin at the first invocation of g, but will proceed through the second call and around the loop. Even though tracing did not begin at the "top" of the loop, the trace still covers a whole iteration. Such a "phase shift" of the loop has no impact on performance. This approach is a simplified version of the one proposed by Hayashizaki et al. (2011); their solution looks at more levels of the stack to decide whether something is a false loop.

For recursive invocations of a function, this approach introduces little overhead, as a function body will only have a few recursive call sites, though one trace will be generated for each recursive call site. Such functions may trace through several calls, as tracing will only terminate at a recursive call from the same call site which initiated the trace.

Detecting loops in this manner also has the potential to generate more traces for a given function: one per call site rather than just one for the function's body, as each trace is now identified by a call site in addition to the function which initiated the trace. Even in the presence of more precise loop detection heuristics, tracking the previous state is beneficial, as the JIT will produce a larger number of more specialized traces.
4.3 Call Graph

When used in isolation, the approach described in the previous section does not produce satisfactory results for contracts. This is because the approach can be defeated by making calls go through a common call site, as in the following modified version of g:

    (define (g a b) (+ a b))
    (define (call-g a b) (g a b))
    (define (f* a b) (call-g a b) (call-g a b) (f* a b))

In such cases, the common call site hides the loop from the tracer, resulting in the same behavior as in the original example. This behavior is a particular problem for contracts, leading to unsatisfactory results for code that uses them. Contracts use many extra levels of indirection and higher-order functions in their implementation. Thus, the one extra level of context used by the two-state approach is not consistently able to identify false loops.

To address this problem we developed a further technique. It uses a more systematic approach to loop detection that makes use of runtime information to construct a call graph of the program. A call graph is a directed graph with nodes corresponding to the source-level functions in the program, and an edge between nodes A and B if A invokes B. We compute an under-approximation of the call graph at runtime. To achieve this, whenever the interpreter executes an application, the invoked function is inspected, and the lambda expression which created the function is associated with the lambda expression containing the current application. Thus portions of the full call graph are discovered incrementally.

Loop detection in the program then reduces to detecting cycles in the call graph. Cycle detection is also performed dynamically; when invoking a function, Pycket adds an edge to the call graph and checks for a cycle along the newly added edge. When a cycle is detected, a node in the cycle is annotated as a potential loop header, which the JIT will trace if it becomes hot. As opposed to the two-state representation, the call graph is an opt-in strategy—initially, no AST nodes are potential trace headers. If the loop is generated by a tail-recursive function, simple cycles are all the system needs to worry about. In the example above, f* is the only cycle in the call graph and it is correctly marked as the only loop.

Non-tail-recursive loops are broken up by the CEK machine into multiple loops. Consider the following append function, which loops over its first argument, producing one continuation frame per cons cell.

    (define (append xs ys)
      (if (null? xs) ys
          (cons (car xs) (append (cdr xs) ys))))

When append reaches the end of the first list, the accumulated continuation is applied to ys; the application of the continuation will continue, with each continuation cons-ing an element onto its argument. Thus, non–tail-recursive functions consist of a loop which builds up continuation frames (the "up" recursion) and a loop which applies continuation frames and performs outstanding actions (the "down" recursion).

To expose this fact to the JIT, the call graph also inspects the continuation after discovering a cycle. AST elements corresponding to continuations generated from the invoking function are marked in addition to the loop body. For the ANF version of the append example (shown in section 5.1), the beginning of the function body would be marked as a loop for an "up" recursion, as would the body of the innermost let, containing the application of the cons function, which receives the result of the recursive call, performs the cons operation, and invokes the previous continuation.⁴

⁴ A special case is primitive functions that are themselves loops, such as map. They must be marked in the interpreter so that the JIT generates a trace for them, even though there is no loop in the call graph.

Though call graph recording and, in particular, cycle detection are not cheap operations, the performance gains from accurately identifying source-level loops typically make up for the runtime overhead. Due to the large amount of indirection, the call graph approach is necessary to obtain good performance with contracted code. Contracts can generate arbitrary numbers of proxy objects, so special care must be taken to avoid unrolling the traversal of deep proxies into the body of a trace. Such an unrolling generates suboptimal traces which are overspecialized on the depth of the proxied objects they operate on. Pycket uses a cutoff whereby operations on deep proxy stacks are marked as loops to be traced separately, rather than unrolled into the trace operating on the proxied object. This produces a trace which loops over the proxy structure and dispatches any of the handler operations in the proxies.

Removal of the call graph leads to a slowdown of 28 % across the full benchmark suite.
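The bookkeeping behind this mechanism can be sketched as follows; the class and method names are ours, and a real implementation must keep both operations cheap, since they run on function invocation:

    class CallGraph(object):
        def __init__(self):
            self.edges = {}   # lambda -> list of lambdas it invokes

        def register_call(self, caller, callee):
            # record a caller -> callee edge the first time it is seen,
            # then look for a cycle along the newly added edge
            callees = self.edges.setdefault(caller, [])
            if callee not in callees:
                callees.append(callee)
                if self.reaches(callee, caller):
                    # the new edge closes a cycle: mark a node in it as
                    # a potential loop header for the JIT to trace
                    caller.should_enter = True

        def reaches(self, start, target):
            # depth-first search for a path start -> ... -> target
            todo, seen = [start], {}
            while todo:
                node = todo.pop()
                if node is target:
                    return True
                if node not in seen:
                    seen[node] = None
                    todo.extend(self.edges.get(node, []))
            return False

Here should_enter is the same flag consulted by the main loop in figure 2, which is what makes the scheme opt-in: only nodes proved to lie on a cycle ever become trace headers.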
5. Optimizing Interpreter Data Structures

In this section, we describe a number of independent optimizations we applied to the data structures used by the Pycket interpreter, some novel (Section 5.1) and some taken from the literature (Sections 5.2 and 5.3), which contribute significantly to overall performance. For each of these optimizations, we report how they impact performance by comparing Pycket in its standard mode (all optimizations on) to Pycket with the particular optimization off, across our full benchmark suite.

5.1 Optimizing Environments in the presence of ANF

As described in section 2, we translate all expressions to ANF prior to interpretation. This introduces additional let-bindings for all non-trivial subexpressions. Thus, function operands and the conditional expression of an if are always either constants, variables, or primitive operations that do not access the environment or the continuation, such as cons. The transformation to ANF is not required for our implementation, but significantly simplifies the continuations we generate, enabling the tracing JIT to produce better code.

Traditionally, ANF is used in the context of AOT compilers that perform liveness analysis to determine the lifetime of variables and make sure that they are not kept around longer than necessary. This is not the case in a naive interpreter such as ours. Therefore, ANF can lead to problems in the context of Pycket, since the inserted let-bindings can significantly extend the lifetime of an intermediate expression. As an example of this problem, the following shows the result of transforming the append function to ANF:

    (define (append xs ys)
      (let ([test (null? xs)])
        (if test ys
            (let ([head (car xs)])
              (let ([tail (cdr xs)])
                (let ([rec (append tail ys)])
                  (cons head rec)))))))

In the resulting code, (cdr xs) is live until the call to cons, whereas in the original code that value was only live until the recursive call to append. Even worse, the result of the test is live for the entire body of the append function. This problem is not unique to ANF—it can also affect code written by the programmer. However, ANF makes the problem more common, necessitating a solution in Pycket.

In Pycket, we counter the problem by attempting to drop environment frames early. Dropping an environment frame is possible when the code that runs in the frame does not access the variables of the frame at all, only those of previous frames. To this end, we do a local analysis when building a let AST. The analysis checks whether any of the outer environments of the let hold variables not read by the body of the let. If so, those environment frames are dropped when executing the let. In the example above, this is the case for the frame storing the tail variable, which can be dropped after the recursive call to append.

Additionally, if the parent of an environment frame is unreferenced, a new frame is created with just the child frame. This optimization produces an effect similar to closure conversion, ensuring that closures capture only used variables.
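The analysis runs once, while the let node is being built, and needs only static information: the variables bound by each enclosing frame and the free variables of the let body. A rough sketch under those assumptions (the names are ours, not Pycket's):

    class StaticFrame(object):
        def __init__(self, var_names, parent):
            self.vars = set(var_names)
            self.parent = parent
            self.droppable = False

    def mark_droppable(static_env, body_free_vars):
        # frames contributing no variable read by the let body are
        # marked; the interpreter elides them when executing the let
        frame = static_env
        while frame is not None:
            if not (frame.vars & body_free_vars):
                frame.droppable = True
            frame = frame.parent

In the ANF'd append above, once the innermost let (binding rec) executes, its body (cons head rec) no longer reads tail, so the frame holding tail is dropped instead of staying live across the cons.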
Disabling ANF entirely in the interpreter is not possible, but we can disable Pycket's environment optimizations. Across the full benchmark suite, Pycket is 3.5× faster when environment optimization is enabled.

5.2 Data Structure Specialization

A number of interpreter-internal data structures store arrays of Racket values. Examples of these are environments and several kinds of continuations, such as those for let and function application. These data structures are immutable.⁵ Therefore, our interpreter chooses, at creation time, between specialized variants of these data structures optimized for the particular data it stores. Simple examination of the arguments to the constructor (all data structures are classes in RPython) suffices to choose the variant.

⁵ Environments are immutable, despite the presence of set!, because of assignment conversion (section 2).

Pycket uses two kinds of variants. First, we specialize the data structures based on the number of elements, for all sizes between 0 and 10, inclusive. This avoids an indirection to a separate array to store the actual values. More importantly, there are several type-specialized variants. This helps to address the lack of immediate small integers (so-called fixnums), as mentioned in section 2. Boxed integers (and other boxed data) make arithmetic slower, because the results of arithmetic operations must be re-boxed. This is mitigated by storing the fixnums' values directly in a specialized environment or continuation, without the need for an extra heap object.

All of these specializations come into play for the compilation of dot. The continuations and environments allocated all contain fewer than 10 values. Also, there are multiple environments that type-specialize based on their contents, such as one that holds the two integer loop counters, enabling further optimization by other parts of the trace optimizer.

In addition to the systematic specialization for continuations and environments, a few special cases of type specialization are directly coded in the representation of data structures. The most important example of these is cons cells that store fixnums in the car. This case again uses an unboxed representation to store the value. The optimization is made possible by the fact that Racket's cons cells are immutable. As an example, figure 4 shows the data layout of a type-specialized cons cell that is storing an unboxed fixnum.

Figure 4. Optimized representation of (1 . #(1.4 5.5)) using cons specialization and vector strategies. [Figure omitted: a FixnumCons cell stores the fixnum 1 unboxed in its car; its cdr points to a vector of size 2 whose FloatVectorStrategy keeps the unboxed float array 1.4, 5.5.]

These specializations combine for significant performance benefits. Across the full benchmark suite, Pycket with all optimizations produces a speedup of 17 % over the version with type- and size-specialization disabled.
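The creation-time choice can be pictured as constructor dispatch in a factory function. The sketch below shows a size-1 fixnum variant and a size-2 variant; the class names are illustrative, and the real interpreter has variants for every size up to 10 and further type combinations:

    class W_Fixnum(object):
        def __init__(self, value):
            self.value = value

    class Env(object):
        pass

    class Env1(Env):
        def __init__(self, v0, prev):
            self.v0, self.prev = v0, prev

    class FixnumEnv1(Env):
        # the fixnum's machine value is stored directly: no heap box
        def __init__(self, i0, prev):
            self.i0, self.prev = i0, prev

    class Env2(Env):
        def __init__(self, v0, v1, prev):
            self.v0, self.v1, self.prev = v0, v1, prev

    class GenericEnv(Env):
        # fallback: values live behind an indirection, in an array
        def __init__(self, vals, prev):
            self.vals, self.prev = list(vals), prev

    def make_env(vals, prev):
        # one examination of the constructor arguments picks the variant
        if len(vals) == 1:
            v0 = vals[0]
            if isinstance(v0, W_Fixnum):
                return FixnumEnv1(v0.value, prev)
            return Env1(v0, prev)
        if len(vals) == 2:
            return Env2(vals[0], vals[1], prev)
        # ... further size- and type-specialized variants elided ...
        return GenericEnv(vals, prev)

Because the variant is fixed at creation and the data is immutable, no later checks or representation changes are needed, in contrast to the mutable objects of the next section.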
5.3 Strategies for Optimizing Mutable Objects

Optimizing mutable objects by specialization is harder than optimizing immutable objects. When the content of the mutable object is changed, the specialized representation might not be applicable any more. Thus a different approach is needed to optimize mutable data structures such as Racket's vectors and hash tables.

For example, one would like to use a different representation for vectors that only store floating-point numbers. In practice, many vectors are type-homogeneous in that way. Ideally the content of such a vector is stored in unboxed form, to save memory and make accessing the content quicker. However, because the vector is mutable, that representation needs to be changeable, for example if a value that is not a floating-point number is inserted into the vector.

The RPython JIT specializes mutable collections using the storage strategies approach that was pioneered in the context of the PyPy project (Bolz et al. 2013). In that approach, the implementation of a collection object is split into two parts, its strategy and its storage. The strategy object describes how the contents of the collection are stored, and all operations on the collection are delegated to the strategy.

If a mutating operation is performed that needs to change the strategy, a new strategy is chosen and assigned to the collection. The storage is rewritten to fit what the new strategy expects. As an example, if a string is written to a vector using the unboxed floating-point strategy, a new strategy for generic Racket objects is chosen. New storage is allocated and the current vector's contents are boxed and moved into the new storage location. For a large vector, this is an expensive operation, and thus strategies depend for their performance on 1) the hypothesis that representation-changing mutations are rare on large data structures, and 2) the change to a more general strategy being a one-way street (Bolz et al. 2013).
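The division of labor between collection, strategy, and storage can be sketched as follows. The names are in the spirit of the strategies literature rather than Pycket's exact code, and only the flonum case is shown:

    class W_Flonum(object):
        def __init__(self, value):
            self.value = value

    class ObjectVectorStrategy(object):
        # most general strategy: storage is a list of boxed values
        def ref(self, vec, i):
            return vec.storage[i]

        def set(self, vec, i, w_value):
            vec.storage[i] = w_value

    class FloatVectorStrategy(object):
        # homogeneous case: storage is a list of unboxed floats
        def ref(self, vec, i):
            return W_Flonum(vec.storage[i])

        def set(self, vec, i, w_value):
            if isinstance(w_value, W_Flonum):
                vec.storage[i] = w_value.value
            else:
                # one-way generalization: box the contents, switch the
                # strategy, and retry the write on the new representation
                vec.storage = [W_Flonum(f) for f in vec.storage]
                vec.strategy = ObjectVectorStrategy()
                vec.strategy.set(vec, i, w_value)

    class W_Vector(object):
        def __init__(self, strategy, storage):
            self.strategy, self.storage = strategy, storage

        def ref(self, i):
            return self.strategy.ref(self, i)

        def set(self, i, w_value):
            self.strategy.set(self, i, w_value)

In the type-stable common case every read and write stays on the unboxed fast path, and the JIT can remove even the boxing in ref when the result is consumed inside the same trace.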
Pycket uses strategies for the following kinds of objects: (a) vectors are type-specialized if they contain all fixnums or all floating-point numbers (flonums); (b) hash tables are type-specialized if their keys are all fixnums, bytes, symbols, or strings; (c) strings are specialized according to the kind of characters they store (Unicode or ASCII); (d) cells (used to store mutable variables) are type-specialized for fixnums and flonums.

The use of strategies for vectors is a crucial optimization for dot. When the vectors are allocated with floating-point numbers, they use a strategy specialized to flonums, avoiding unboxing and tag checks for each vector reference in the inner loop. Racket programmers can manually do this type specialization by using flvectors, gaining back much of the lost performance. Pycket obviates the need for manual specialization by making generic vectors perform on par with specialized ones. As an example, figure 4 shows the data layout of a vector using a float strategy, with an array of unboxed floats as storage.

For hash maps the benefits are even larger than for vectors: if the key type is known, the underlying implementation uses a more efficient hashing and comparison function. In particular, because the comparison and hash function of these types is known and does not use arbitrary stack space or call/cc, the hash table implementation is simpler.

Strings are mutable in Racket, so they also use storage strategies. Since most strings are never mutated, a new string starts out with a strategy that treats the string as immutable. If the string is later mutated, it is switched to a mutable strategy. A further improvement for strings comes from the observation that almost all of them are actually ASCII strings, even though the string datatype in Racket supports the full Unicode character range. Thus a more efficient ASCII strategy is used for strings that remain in the ASCII range. This makes them much smaller in memory (since every character needs only one byte, not four) and makes operations on them faster.

One special case of strategies is used for the mutable heap cells which implement mutable variables—those that are the target of set!. Quite often, the type of the value stored in the variable stays the same. Thus when writing a fixnum or a floating-point number into the cell, the cell switches to a special strategy that stores values of that type in unboxed form (bringing the usual advantages of unboxing).

Strategies are vital for high performance on benchmarks with mutable data structures. On dot, disabling strategies reduces performance by 75 %. For the benchmarks from section 6 that make extensive use of hash tables, disabling strategies makes some benchmarks 18 times slower. Over all benchmarks, the slowdown is 14 %.

7 2015/5/11 6. Evaluation Relative Runtime 0 1 2 3 In this section, we evaluate Pycket’s performance to test several ctak hypotheses, as described in the introduction: fibc string 1. Meta-tracing JITs produce performance competitive with mature

existing AOT compilers for functional languages. fft sumfp

2. Tracing JITs perform well for indirections produced by proxies mbrot and contracts. sumloop 3. Tracing JITs reduce the need for manual specialization. ray array1 To test the first hypothesis, we compare Pycket to Racket and 3 highly-optimizing Scheme compilers, across a range of Scheme trav2 puzzle benchmarks, and to Racket on benchmarks taken from the Racket repository. To test the second hypothesis, we measure Pycket and sum Racket’s performance on both micro- and macro-benchmarks taken pnpoly from the paper introducing Racket’s chaperones and impersonators, triangl the proxy mechanism underlying contracts, and also test V8 and sum1 PyPy on similar benchmarks. In particular, we show how the call cpstak graph based loop filtering of section 4.3 improves performance. trav1 To test the third hypothesis, we compare Pycket’s performance on browse Virtual Machine benchmarks with and without type specialization. diviter Our evaluation compares Pycket with multiple configurations Pycket fibfp Racket and systems on a variety of programs. We present the most important perm9 Larceny results and include full results in supplemental material. gcold Gambit simplex Bigloo 6.1 Setup nucleic System We conducted the experiments on an Intel Xeon E5-2650 paraffins (Sandy Bridge) at 2.8 GHz with 20 MB cache and 16 GB of RAM. deriv Although virtualized on Xen, the machine was idle. All benchmarks destruc are single-threaded. The machine ran Ubuntu 14.04.1 LTS with a 6 64 bit 3.2.0. We used the framework ReBench to carry out pi dderiv all measurements. RPython as of revision 954bf2fe475a was used for Pycket. takl primes Implementations Racket v6.1.1 , Larceny v0.97 , Gam- bit v4.7.2 , Bigloo v4.2a-alpha13Oct14 , V8 v3.25.30 (and ack 7 tak contracts.js version 0.2.0), PyPy v2.5.0, and Pycket as of re- boyer vision 535ee83 were used for benchmarking. Gambit programs lattice were compiled with -D SINGLE HOST. Bigloo was compiled with gcbench -DLARGE CONFIG to enable benchmarks to complete without running out of heap. In a few instances, Bigloo crashed, and in one case fib divrec Gambit did not compile. These results were excluded from the scheme average. nqueens Methodology Every benchmark was run 10 times uninterrupted at highest priority in a new process. The execution time was measured cat in-system and, hence, does not include start-up; however, warm-up wc tail was not separated, so all times include JIT compilation. We show graphs the arithmetic mean of all runs along with bootstrapped (Davison nboyer and Hinkley 1997) confidence intervals for a 95 % confidence level. conform

Availability All of our benchmarks and infrastructure are available earley at http://github.com/krono/pycket-bench. matrix mazefun

6.2 Benchmarks sboyer

Larceny cross-platform benchmarks The benchmark suite con- maze sists of the “CrossPlatform” benchmark suite from Larceny, com- peval prising well-known Scheme benchmarks originally collected for dynamic evaluating Gambit (about 27.7 KLOC in total). We increased iter- parsing ation counts until Pycket took approximately 5 seconds, to lower geometric jitter associated with fast-running benchmarks, and to ensure that mean we measure peak performance as well as JIT warmup (which is in- cluded in all measurements). Also, we moved all I/O out of the timed Figure 5. 6 Scheme benchmarks, with runtimes normalized to https://github.com/smarr/ReBench Racket. Racket is omitted from the figure for clarity. Shorter is bet- 7 http://disnetdev.com/contracts.coffee/ ter. The geometric mean (including Racket) is shown at the bottom of the figure.

8 2015/5/11 4 4

3 3

2 2 Relative Runtime Relative Runtime 1 1

0 0

fasta ary lists wc fibo nbody hashsieve fasta hash2 nsieve matrixnbody random sumcol pidigits meteor mean generic generic generic generic generic generic generic pidigits1 wordfreq fannkuch momentsrecursive heapsort nbody−vec mandelbrot nbody−vec binarytrees nsievebits spellcheck geometric partialsums nestedloop regexmatch mandelbrot ackermann binarytrees spectralnorm spectralnorm fannkuch−redux fannkuch−redux Figure 6. Racket benchmarks, runtimes normalized to Racket, consisting of those manually specialized for Racket (left) and generic versions with manual specialization removed (right). Lower is better. loop, and omitted one benchmark (the slatex LATEX preprocessor) Pycket ± Pycket* ± Racket ± V8 ± PyPy ± where I/O was the primary feature measured. Bubble direct 640 1 656 1 1384 4 336 0 593 1 The results are summarized in Figure 5. The runtime per bench- chaperone 768 0 778 1 6668 5 mark of each system is normalized to Racket. The geometric mean proxy 105891 2579 1153 8 of all measurements is given in bars at the right of the figure. The unsafe 496 1 550 1 955 1 unsafe* 495 0 508 1 726 1 top of the chart cuts-off at 3 times the speed of Racket for space and Church readability, but some of the benchmarks on both Pycket and Larceny direct 714 2 705 1 1243 6 2145 18 3263 14 are between 3 and 4 times slower than Racket, and on two bench- chaperone 6079 54 8400 34 38497 66 marks (pi and primes, which stress bignum performance) Larceny contract 1143 6 2310 8 10126 142 295452 1905 is many times slower. Since first-class continuations are difficult proxy 53953 277 95391 848 wrap 3471 7 3213 5 4214 26 8731 45 59016 405 to implement in a mostly-C environment, two benchmarks which Struct focus on this feature (ctak and fibc) perform very poorly on Racket direct 133 0 133 0 527 0 377 0 127 0 and Bigloo, both of which implement continuations by copying the chaperone 134 0 134 1 5664 68 C stack. proxy 26268 130 1168 38 unsafe 133 0 133 0 337 0 Pycket’s performance on individual benchmarks ranges from unsafe* 133 0 133 0 337 0 approximately 4× slower to 470× faster than Racket; in 14 of 50 ODE cases Pycket is the fastest implementation. On average, Pycket is direct 2158 6 2645 6 5476 91 the fastest system, and 45 % faster than Racket. contract 2467 8 5099 8 12235 128 Binomial Shootout benchmarks The traditional Scheme benchmark suite direct 1439 8 6879 83 2931 24 contract 17749 83 19288 61 52827 507 is useful for comparing across a variety of systems but often em- ploy features and styles which are unidiomatic and less-optimized Table 1. Execution Times (in ms) for the Chaperone Benchmarks. in Racket. Therefore, we also consider Racket-specific benchmarks Pycket* is Pycket without the call graph optimization written for the Computer Language Benchmarks Game (which are 1.9 KLOC in total).8 We also include variants with manual special- ization removed to demonstrate that Pycket can achieve good perfor- LOC).9 The results are presented in table1, with 95 % confidence mance without manual specialization (600 LOC). The benchmarks intervals. We show both Pycket’s performance in the standard config- are all taken from the Racket source repository, version 6.1.1. We uration (first column) as well as the performance without callgraph- omit two benchmarks, regexpdna and k-nucleotide, which require based loop detection (second column). 
regular-expression features Pycket (and the underlying RPython li- The micro-benchmarks form the upper part of the table, with brary) does not support, and three, chameonos, -ring and echo, the macro-benchmarks below. For each micro-benchmark, we in- which require threading. We also modified one benchmark to move clude representative implementations for V8 and PyPy. The Bubble I/O out of the main loop. benchmark is bubble-sort of a vector; Struct is structure access in a On average, Pycket is 44 % faster than Racket on these bench- loop; and Church is calculating 9 factorial using Church numerals. marks, is faster on 24 of the 31, and is up to 5 times faster on The direct versions of all benchmarks omit proxying and contracts numeric-intensive benchmarks such as random, nsievebits, or par- entirely. For Racket and Pycket, the chaperone versions use simple tialsums. wrappers which are merely indirections, and the contract versions enforce invariants using Racket’s contract library. Chaperone benchmarks To measure the impact of our optimiza- For V8, the proxy version uses JavaScript proxies (Van Cutsem tions of contracts and proxies, we use the benchmarks created by and Miller 2010) and the contract version uses the contracts.coffee Strickland et al.(2012). We run all of the micro-benchmarks (244 library (Disney et al. 2011), in both cases following the benchmarks LOC) from that paper, and two of the macro-benchmarks (1,438

8 http://shootout.alioth.debian.org/
9 All the other macro-benchmarks require either the FFI or the meta-programming system at run-time, neither of which Pycket supports.

For PyPy, we implemented a simple proxy using Python's runtime metaprogramming facilities. The wrap version of the Church benchmark simply wraps each function in an additional function.

Two additional benchmark versions are also included for Racket and Pycket, labeled unsafe and unsafe*. The unsafe measurement indicates that specialized operations which skip safety checks were used in the program—in all cases, these can be justified by the other dynamic checks remaining in the program. The unsafe* measurement additionally assumes that operations are not passed instances of Racket's proxying mechanisms, an assumption that cannot usually be justified statically. The difference between these measurements is a cost paid by all optimized programs for the existence of proxies. In Pycket, this difference is much smaller than in Racket.
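In Racket terms, the unsafe and unsafe* variants correspond to the two families of unsafe operations; a minimal sketch (our illustration, not the benchmark code) of the distinction:

    #lang racket
    (require racket/unsafe/ops)

    (define v (make-vector 1000 0))

    ;; safe: bounds-checked and chaperone-aware
    (vector-ref v 0)

    ;; unsafe: skips the bounds check, but still works on vectors
    ;; wrapped by chaperones or impersonators
    (unsafe-vector-ref v 0)

    ;; unsafe*: additionally assumes v is not a chaperone or
    ;; impersonator, so the indirection check is skipped as well
    (unsafe-vector*-ref v 0)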
For the macro-benchmarks, we show results only for Pycket and Racket, with contracts on and off. The two macro-benchmarks are an ODE solver and a binomial heap, replaying a trace from a computer vision application, a benchmark originally developed for lazy contract checking (Findler et al. 2008). We show only the "opt chap" contract configuration of the latter, running on a trace from a picture of a koala, again following Strickland et al. (2012).

The results show that Pycket is competitive with other JIT compilers on micro-benchmarks, and that Pycket's performance on contracts and proxies is far superior both to Racket's and to other systems. In many cases, Pycket improves on other systems by factors of 2 to 100, reducing contract overhead to almost 0. Furthermore, the callgraph optimization is crucial to the performance of Pycket on contract-oriented benchmarks. Finally, the performance of PyPy on proxied arrays in the Bubble benchmark is noteworthy, suggesting both that tracing JIT compilers may be particularly effective at optimizing proxies and that Pycket's implementation combined with Racket's chaperone design provides additional benefits.

On the ODE benchmark, Pycket with the call graph provides a 6× speedup over Racket, resulting in less than 15 % contract overhead. On the binomial heap benchmark, Pycket again outperforms Racket by a factor of 2, though substantial contract overhead remains in all configurations. We conjecture that the high contract overhead is generally due to the large numbers of proxy wrappers generated by this benchmark—as many as 2,800 on a single object. Racket's developers plan to address this by reducing the number of wrappers created in the contract system,10 which may allow Pycket to remove even more of the contract overhead.

Specialization benchmarks To measure Pycket's ability to eliminate the need for manual specialization, we constructed generic versions of several of the Racket benchmarks and compared Racket's and Pycket's performance. In all cases, Pycket's performance advantage over Racket improves. These results are presented in Figure 6. Pycket loses only 6 % of its performance when unspecialized, whereas Racket loses 30 %; a sketch of the kind of specialization removed follows.
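The sketch below (our illustration, not one of the benchmark programs; the names sum-squares and sum-squares/fl are hypothetical) contrasts a generic summation over any vector of numbers with a manually flonum-specialized version of the sort the generic variants eliminate:

    #lang racket
    (require racket/flonum)

    ;; generic: any vector, generic arithmetic
    (define (sum-squares v)
      (for/sum ([x (in-vector v)])
        (* x x)))

    ;; manually specialized: flonum-specific vector and arithmetic
    (define (sum-squares/fl fv)
      (for/fold ([acc 0.0]) ([x (in-flvector fv)])
        (fl+ acc (fl* x x))))

In Racket, the specialized version is substantially faster; in Pycket, the tracer recovers most of this specialization automatically, which is why the generic variants lose so little performance.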
Warmup costs To understand the overhead that using a JIT instead of an AOT compiler adds to the execution of the benchmarks, we measured the fraction of total benchmark execution time that is spent tracing, optimizing, and producing machine code for each benchmark. Those fractions are shown in Figure 7. The results show that for almost all benchmarks the overhead of using a JIT is below 10 %. However, there are several outliers where code generation takes more than 50 % of the runtime; we hope to address this in the future. Furthermore, this is not an issue for any program which runs longer than a few seconds. Figure 7 shows the time spent tracing and optimizing, but not the time spent interpreting code that is later JIT compiled or spent constructing the call graph, giving an underestimate of the total warmup time.

10 Robby Findler, personal communication

[Figure 7: boxplots of relative warmup time (y-axis, 0.0 to 1.0) for the Chaperone, Cross-platform, and Shootout suites, and OVERALL.]
Figure 7. Boxplots of the runtime spent in the JIT as a fraction of the total runtime. The horizontal lines correspond to the 25th, 50th, and 75th percentiles, going from the bottom upward. The dots above each plot are outliers.

6.3 Discussion
Our evaluation results support all of our hypotheses. Pycket is faster than Racket across a broad range of benchmarks, and is competitive with highly-optimizing AOT compilers. Furthermore, Pycket can largely eliminate the need for the manual specialization on types that is common in high-performance Racket and Scheme programs. Finally, call graph loop detection radically reduces the overhead of contracts, eliminating it entirely in some cases. In short, our tracing JIT is a success for Racket.

There are some cases where Pycket is substantially slower, specifically peval, dynamic, parsing, and fasta. These cases are almost exclusively recursive programs with data-dependent control flow in the style of an interpreter over an AST. Such programs are a known weakness of tracing JIT compilers, although we hope to improve Pycket's performance in the future. Another important source of slowness is JIT warmup. Because Pycket is written in such a high-level style, the traces are much longer than for bytecode interpreters, which taxes the JIT optimizer. For a number of slow benchmarks, JIT optimizations and code generation take up a substantive portion of the benchmark runtime. We plan to address this issue both by modifying the interpreter to be more low-level and by optimizing the RPython tracing infrastructure.

7. Related Work
As mentioned in the introduction, functional languages in general, and Scheme in particular, have a long tradition of optimizing AOT compilers. Rabbit, by Steele (1978), following on the initial design of the language, demonstrated the possibilities of continuation-passing style and of fast first-class functions. Subsequent systems such as Gambit (Feeley 2014), Bigloo (Serrano and Weis 1995), Larceny (Clinger and Hansen 1994), Stalin (Siskind 1999), and Chez (Dybvig 2011) have pioneered a variety of techniques for static analysis and optimization, among others. Most other Scheme implementations are AST or bytecode interpreters. Racket is the only widely used system in the Scheme family with a JIT compiler, and even that is less dynamic than many modern JIT compilers and uses almost no runtime type feedback.

Many Lisp implementations, including Racket, provide means for programmers to manually optimize code with specialized operations or type declarations. We support Racket's type-specialized vectors and specialized arithmetic operations, as well as unsafe operations (e.g., eliding bounds checks). The results of our evaluation show that manual optimizations are less necessary in Pycket.

Type specialization is leveraged heavily in dynamic language interpreters to overcome the overhead of pointer tagging and boxing.

There are two methods for generating type-specialized code in the context of a JIT compiler: type inference and type feedback. Both methods have potential drawbacks; Kedlaya et al. (2013) explored the interaction between type feedback and type inference, using a fast, up-front type inference pass to reduce the subsequent costs of type feedback. The performance impact of type information was studied in the context of ActionScript (Chang et al. 2011). Firefox's Spidermonkey compiler also uses a combination of type inference and type specialization (Hackett and Guo 2012).

In contrast, Pycket solely makes use of type specialization, which is a direct outcome of tracing. The RPython JIT has no access to type information of the Racket program aside from the operations recorded by the tracer during execution: a consequence of the JIT operating at the meta-level. In terms of type specialization, container strategies improve the effectiveness of type specialization by exposing type information about the contents of homogeneous containers to the tracer.

JIT compilation has been extensively studied in the context of object-oriented, dynamically typed languages (Aycock 2003). For Smalltalk-80, Deutsch and Schiffman (1984) developed a JIT compiler from bytecode to native code. Chambers et al. (1989) explored using type specialization and other optimizations in Self, a closely-related language. Further research on Self applied more aggressive type specialization (Chambers and Ungar 1991).

With the rise in popularity of Java, JIT compilation became a mainstream enterprise, with a significant increase in the volume of research. The Hotspot compiler (Paleczny et al. 2001) is representative of the Java JIT compilers. JIT compilation has also become an important topic in the implementation of JavaScript (see for example Hölttä 2013) and thus a core part of modern web browsers. For strict functional languages other than Scheme, such as OCaml, JIT compilers exist (Starynkevitch 2004; Meurer 2010); however, the AOT compilers for these languages are faster.

Mitchell (1970) introduced the notion of tracing JIT compilation, and Gal et al. (2006) used tracing in a Java JIT compiler. The core idea of meta-tracing, which is to trace an interpreter running a program rather than the program itself, was pioneered by Sullivan et al. (2003) in DynamoRIO. Since then, Gal et al. (2009) developed a tracing JIT compiler for JavaScript, TraceMonkey. LuaJIT11 is a very successful tracing JIT compiler for Lua. Further work was done by Bebenita et al. (2010), who created a tracing JIT compiler for Microsoft's CIL and applied it to a JavaScript implementation in C#. These existing tracing systems, as well as PyPy and other RPython-based systems, differ from Pycket in several ways. First, they have not been applied to functional languages, which present unique challenges such as first-class control, extensive use of closures, proper tail calls, and lack of explicit loops. Second, these systems all operate on a lower-level bytecode than Pycket's CEK machine, placing less burden on the optimizer. Third, few AOT compilers exist for these languages, making a head-to-head comparison difficult or impossible.

Schilling (2012, 2013) developed a tracing JIT compiler for Haskell based on LuaJIT, called Lambdachine. Due to Haskell's lazy evaluation, the focus is quite different from ours. One goal of Lambdachine is to achieve deforestation (Wadler 1988; Gill et al. 1993) by applying allocation-removal techniques to traces.

There have also been experiments with applying meta-tracing to a Haskell interpreter written in RPython (Thomassen 2013). That interpreter also follows a variant of a high-level semantics (Launchbury 1993) of Core, the intermediate representation of the GHC compiler. While the first results were promising, it supports only a very small subset of primitives, leading to a limited evaluation. It is unknown how well meta-tracing scales for a realistic Haskell implementation.

11 http://luajit.org

8. Conclusion
Pycket is a young system—it has been under development for little more than a year, yet it is competitive with the best existing AOT Scheme compilers, particularly on safe, high-level, generic code, while still supporting complex features such as first-class continuations. Furthermore, Pycket is much faster than any other system on the indirections produced by contracts, addressing a widely noted performance problem, and making safe gradual typing a possibility in more systems.

The implementation of Pycket provides two lessons for JITs for functional languages. First, the issue of finding and exploiting loops requires careful consideration—explicit looping constructs in imperative languages make the tracer's life easier. Second, once this issue is addressed, conventional JIT optimizations such as storage strategies are highly effective in the functional context.

Our success in obtaining high performance from the CEK machine suggests that other high-level abstract machines may be candidates for a similar approach. Often language implementations sacrifice the clarity of simple abstract machines for lower-level runtime models—with a meta-tracing JIT such as RPython, the high-level approach can perform well. More generally, Pycket demonstrates the value of the RPython infrastructure (Marr et al. 2014): we have built, in one year and 12,000 LOC, a compiler competitive with existing mature systems. We encourage other implementors to consider whether RPython can provide them with the same leverage.

Acknowledgements Bolz is supported by the EPSRC Cooler grant EP/K01790X/1. Siek is supported by NSF Grant 1360694. We thank Anton Gulenko for implementing storage strategies.

References
A. Appel and D. MacQueen. Standard ML of New Jersey. In J. Małuszyński and M. Wirsing, editors, Programming Language Implementation and Logic Programming, volume 528 of Lecture Notes in Computer Science, pages 1–13. Springer Berlin Heidelberg, 1991.
H. Ardö, C. F. Bolz, and M. Fijałkowski. Loop-aware optimizations in PyPy's tracing JIT. In Proceedings of the 8th Symposium on Dynamic Languages, DLS '12, pages 63–72, New York, NY, USA, 2012. ACM.
J. Aycock. A brief history of just-in-time. ACM Comput. Surv., 35(2):97–113, June 2003.
V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: A transparent dynamic optimization system. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI '00, pages 1–12, New York, NY, USA, 2000. ACM.
M. Bebenita, F. Brandner, M. Fahndrich, F. Logozzo, W. Schulte, N. Tillmann, and H. Venter. SPUR: A trace-based JIT compiler for CIL. In Proc. OOPSLA, pages 708–725, 2010.
C. F. Bolz and L. Tratt. The impact of meta-tracing on VM design and implementation. Science of Computer Programming, 2013.
C. F. Bolz, A. Cuni, M. Fijałkowski, and A. Rigo. Tracing the meta-level: PyPy's tracing JIT compiler. In Proc. ICOOOLPS, pages 18–25, 2009.
C. F. Bolz, A. Cuni, M. Fijałkowski, M. Leuschel, S. Pedroni, and A. Rigo. Allocation removal by partial evaluation in a tracing JIT. In Proc. PEPM, pages 43–52, 2011.
C. F. Bolz, L. Diekmann, and L. Tratt. Storage strategies for collections in dynamically typed languages. In Proc. OOPSLA, pages 167–182, 2013.
C. F. Bolz, T. Pape, J. G. Siek, and S. Tobin-Hochstadt. Meta-tracing makes a fast Racket. In Workshop on Dynamic Languages and Applications, 2014.
C. Chambers and D. Ungar. Iterative type analysis and extended message splitting: optimizing dynamically-typed object-oriented programs. Lisp Symb. Comput., 4(3):283–310, 1991.
C. Chambers, D. Ungar, and E. Lee. An efficient implementation of SELF, a dynamically-typed object-oriented language based on prototypes. In Proc. OOPSLA, pages 49–70, 1989.

M. Chang, B. Mathiske, E. Smith, A. Chaudhuri, A. Gal, M. Bebenita, C. Wimmer, and M. Franz. The impact of optional type information on JIT compilation of dynamically typed languages. In Symposium on Dynamic Languages, DLS '11, pages 13–24, 2011.
J. Clements. Portable and high-level access to the stack with Continuation Marks. PhD thesis, Citeseer, 2006.
W. D. Clinger and L. T. Hansen. Lambda, the ultimate label or a simple optimizing compiler for Scheme. In Proc. LFP, pages 128–139, 1994.
R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, and F. K. Zadeck. Efficiently computing static single assignment form and the control dependence graph. ACM Transactions on Programming Languages and Systems, 13(4):451–490, Oct. 1991.
O. Danvy. Three steps for the CPS transformation. Technical Report CIS-92-02, Kansas State University, 1991.
A. C. Davison and D. V. Hinkley. Bootstrap Methods and Their Application, chapter 5. Cambridge, 1997.
J. W. Demmel. Applied Numerical Linear Algebra. SIAM, 1997.
L. P. Deutsch and A. M. Schiffman. Efficient implementation of the Smalltalk-80 system. In Proc. POPL, pages 297–302, 1984.
T. Disney, C. Flanagan, and J. McCarthy. Temporal higher-order contracts. In Proceedings of the 16th ACM SIGPLAN International Conference on Functional Programming, ICFP '11, pages 176–188, New York, NY, USA, 2011. ACM.
R. K. Dybvig. Chez Scheme version 8 user's guide. Technical report, Cadence Research Systems, 2011.
M. Feeley. Gambit-C: A portable implementation of Scheme. Technical Report v4.7.2, Université de Montréal, February 2014.
M. Felleisen and D. P. Friedman. Control operators, the SECD-machine and the lambda-calculus. In Working Conf. on Formal Description of Programming Concepts - III, pages 193–217. Elsevier, 1987.
R. B. Findler and M. Felleisen. Contracts for higher-order functions. In International Conference on Functional Programming, ICFP, pages 48–59, October 2002.
R. B. Findler, S.-y. Guo, and A. Rogers. Lazy contract checking for immutable data structures. In Implementation and Application of Functional Languages, pages 111–128. Springer, 2008.
C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The essence of compiling with continuations. In Proc. PLDI, pages 502–514, 1993.
M. Flatt. Composable and compilable macros: you want it when? In ICFP '02: Proceedings of the Seventh ACM SIGPLAN International Conference on Functional Programming, pages 72–83. ACM Press, 2002.
A. Gal and M. Franz. Incremental dynamic code generation with trace trees. Technical Report ICS-TR-06-16, University of California, Irvine, 2006.
A. Gal, C. W. Probst, and M. Franz. HotpathVM: an effective JIT compiler for resource-constrained devices. In Proc. VEE, pages 144–153, 2006.
A. Gal, B. Eich, M. Shaver, D. Anderson, D. Mandelin, M. R. Haghighat, B. Kaplan, G. Hoare, B. Zbarsky, J. Orendorff, J. Ruderman, E. W. Smith, R. Reitmaier, M. Bebenita, M. Chang, and M. Franz. Trace-based just-in-time type specialization for dynamic languages. In Proc. PLDI, pages 465–478. ACM, 2009.
A. Gill, J. Launchbury, and S. L. Peyton Jones. A short cut to deforestation. In Proc. FPCA, pages 223–232, 1993.
B. Hackett and S.-y. Guo. Fast and precise hybrid type inference for JavaScript. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '12, pages 239–250, New York, NY, USA, 2012. ACM.
H. Hayashizaki, P. Wu, H. Inoue, M. J. Serrano, and T. Nakatani. Improving the performance of trace-based systems by false loop filtering. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 405–418, New York, NY, USA, 2011. ACM.
D. Hiniker, K. Hazelwood, and M. D. Smith. Improving region selection in dynamic optimization systems. In Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 38, pages 141–154. IEEE Computer Society, 2005.
M. Hölttä. Crankshafting from the ground up. Technical report, Google, August 2013.
M. N. Kedlaya, J. Roesch, B. Robatmili, M. Reshadi, and B. Hardekopf. Improved type specialization for dynamic scripting languages. In Proceedings of the 9th Symposium on Dynamic Languages, DLS '13, pages 37–48. ACM, 2013.
J. Launchbury. A natural semantics for lazy evaluation. In Proc. POPL, pages 144–154, 1993.
S. Marr, T. Pape, and W. De Meuter. Are we there yet? Simple language implementation techniques for the 21st century. IEEE Software, 31(5):60–67, Sept. 2014.
B. Meurer. OCamlJIT 2.0 - Faster Objective Caml. CoRR, abs/1011.1783, 2010.
J. G. Mitchell. The Design and Construction of Flexible and Efficient Interactive Programming Systems. PhD thesis, Carnegie Mellon University, 1970.
M. Paleczny, C. Vick, and C. Click. The Java Hotspot server compiler. In Proc. JVM, pages 1–12. USENIX Association, 2001.
T. Schilling. Challenges for a Trace-Based Just-In-Time Compiler for Haskell. In Implementation and Application of Functional Languages, volume 7257 of LNCS, pages 51–68. 2012.
T. Schilling. Trace-based Just-in-time Compilation for Lazy Functional Programming Languages. PhD thesis, University of Kent, 2013.
M. Serrano and P. Weis. Bigloo: a portable and optimizing compiler for strict functional languages. In Static Analysis, volume 983 of LNCS, pages 366–381. 1995.
J. M. Siskind. Flow-directed lightweight closure conversion. Technical Report 99-190R, NEC Research Institute, Inc., 1999.
M. Sperber, R. K. Dybvig, M. Flatt, A. van Straaten, R. Findler, and J. Matthews. Revised⁶ Report on the Algorithmic Language Scheme. Cambridge, 2010.
V. St-Amour, S. Tobin-Hochstadt, and M. Felleisen. Optimization coaching: Optimizers learn to communicate with programmers. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA '12, pages 163–178, New York, NY, USA, 2012. ACM.
B. Starynkevitch. OCAMLJIT - A faster just-in-time OCaml implementation. In First MetaOCaml Workshop, Oct. 20 2004.
G. L. Steele. Rabbit: A compiler for Scheme. Technical Report AI-474, MIT, 1978.
T. S. Strickland, S. Tobin-Hochstadt, R. B. Findler, and M. Flatt. Chaperones and impersonators: run-time support for reasonable interposition. In Object Oriented Programming Systems Languages and Applications, OOPSLA '12, 2012.
G. T. Sullivan, D. L. Bruening, I. Baron, T. Garnett, and S. Amarasinghe. Dynamic native optimization of interpreters. In Proc. IVME, pages 50–57, 2003.
E. W. Thomassen. Trace-based just-in-time compiler for Haskell with RPython. Master's thesis, Norwegian University of Science and Technology, Trondheim, 2013.
S. Tobin-Hochstadt and M. Felleisen. The design and implementation of Typed Scheme. In Proceedings of the 35th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '08, pages 395–406, New York, NY, USA, 2008. ACM.
S. Tobin-Hochstadt, V. St-Amour, R. Culpepper, M. Flatt, and M. Felleisen. Languages as libraries. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, pages 132–141. ACM, 2011.
T. Van Cutsem and M. S. Miller. Proxies: Design principles for robust object-oriented intercession APIs. In Proceedings of the 6th Symposium on Dynamic Languages, DLS '10, pages 59–72, New York, NY, USA, 2010. ACM.
P. Wadler. Deforestation: transforming programs to eliminate trees. In Proc. ESOP, pages 231–248, 1988.
