Tuplex: Data Science in Python at Native Code Speed
Leonhard Spiegelberg (Brown University), Rahul Yesantharao (MIT), Malte Schwarzkopf (Brown University), Tim Kraska (MIT)
[email protected] [email protected] [email protected] [email protected]

Abstract

Data science pipelines today are either written in Python or contain user-defined functions (UDFs) written in Python. But executing interpreted Python code is slow, and arbitrary Python UDFs cannot be compiled to optimized machine code.

We present Tuplex, the first data analytics framework that just-in-time compiles developers' natural Python UDFs into efficient, end-to-end optimized native code. Tuplex introduces a novel dual-mode execution model that compiles an optimized fast path for the common case, and falls back on slower exception code paths for data that fail to match the fast path's assumptions. Dual-mode execution is crucial to making end-to-end optimizing compilation tractable: by focusing on the common case, Tuplex keeps the code simple enough to apply aggressive optimizations. Thanks to dual-mode execution, Tuplex pipelines always complete even if exceptions occur, and Tuplex's post-facto exception handling simplifies debugging.

We evaluate Tuplex with data science pipelines over real-world datasets. Compared to Spark and Dask, Tuplex improves end-to-end pipeline runtime by 5×–109×, and comes within 22% of a hand-optimized C++ baseline. Optimizations enabled by dual-mode processing improve runtime by up to 3×; Tuplex outperforms other Python compilers by 5×; and Tuplex performs well in a distributed setting.

1 Introduction

Data scientists today predominantly write code in Python, as the language is easy to learn and convenient to use. But the features that make Python convenient for programming—dynamic typing, automatic memory management, and a huge module ecosystem—come at the cost of low performance compared to hand-optimized code and an often frustrating debugging experience.

Python code executes in a bytecode interpreter, which interprets instructions, tracks object types, manages memory, and handles exceptions. This infrastructure imposes a heavy overhead, particularly if Python user-defined functions (UDFs) are inlined in a larger parallel computation, such as a Spark [70] job. For example, a PySpark job over flight data [63] might convert a flight's distance from kilometers to miles via a UDF after joining with a carrier table:

    carriers = spark.read.load('carriers.csv')
    fun = udf(lambda m: m * 1.609, DoubleType())
    spark.read.load('flights.csv') \
         .join(carriers, 'code', 'inner') \
         .withColumn('distance', fun('distance')) \
         .write.csv('output.csv')

This code will run data loading and the join using Spark's compiled Scala operators, but must execute the Python UDF passed to withColumn in a Python interpreter. This requires passing data between the Python interpreter and the JVM [41], and prevents generating end-to-end optimized code across the UDFs—for example, an optimized pipeline might apply the UDF to distance while loading data from flights.csv.

Could we instead generate native code from the Python UDF and optimize it end-to-end with the rest of the pipeline? Unfortunately, this is not feasible today. Generating, compiling, and optimizing code that handles all possible code paths through a Python program is not tractable because of the complexity of Python's dynamic typing. Dynamic typing ("duck typing") requires that code always be prepared to handle any type: while the above UDF expects a numeric value for m, it may actually receive an integer, a float, a string, a null value, or even a list. The Python interpreter handles these possibilities through extra checks and exception handlers, but the sheer number of cases to handle makes it difficult to compile optimized code even for this simple UDF.

Tuplex is a new analytics framework that generates optimized end-to-end native code for data analytics pipelines with Python UDFs. Developers write their Tuplex pipelines using a LINQ-style API similar to PySpark's, and use Python UDFs without any type annotations. Tuplex compiles these pipelines into efficient native code by relying on a new dual-mode execution model. Dual-mode execution separates the common case, for which code generation offers the greatest benefit, from exceptional cases, which complicate code generation and inhibit optimization, but have minimal performance impact. Our key insight is that separating these cases and leveraging the regular structure of LINQ-style data analytics pipelines makes Python UDF compilation tractable, as the Tuplex compiler faces a simpler and more constrained problem than a general Python compiler.

Making dual-mode processing work required us to overcome several challenges. First, Tuplex must establish what the common case is. Tuplex's key idea is to sample the input, derive the common case from this sample, and infer types and expected cases across the pipeline. Second, Tuplex's generated native code must represent a semantically-correct Python execution in the interpreter. To guarantee this, Tuplex separates the input data into records for which the native code's behavior is identical to Python's, and ones for which it is not and which must be processed in the interpreter. Third, Tuplex's generated code must offer a fast bail-out mechanism if exceptions occur within UDFs (e.g., a division by zero), and resolve these in line with Python semantics. Tuplex achieves this by adding lightweight checks to generated code, and leverages the fact that UDFs are stateless to re-process the offending records for resolution. Fourth, Tuplex must generate code with high optimization potential but also achieve fast JIT compilation, which it does using tuned LLVM compilation.

In addition to enabling compilation, dual-mode processing has another big advantage: it can help developers write more robust pipelines that never fail at runtime due to dirty data or unhandled exceptions. Tuplex detects exception cases, resolves them via slow-path execution if possible, and presents a summary of the unresolved cases to the user. This helps prototype data-wrangling pipelines, but also helps make production pipelines more robust to data glitches.

While the focus of this paper is primarily on the multi-threaded processing efficiency of a single server, Tuplex is a distributed system, and we show results for a preliminary backend based on AWS Lambda functions.

In summary, we make the following principal contributions:
1. We combine ideas from query compilation with speculative compilation techniques in the dual-mode processing model for data analytics: an optimized common-case code path processes the bulk of the data, and a slower fallback path handles rare, non-conforming data.
2. We observe that data analytics pipelines with Python UDFs—unlike general Python programs—have sufficient structure to make compilation without type annotations feasible.
3. We build and evaluate Tuplex, the first data analytics system to embed a Python UDF compiler with a query compiler.

We evaluated our Tuplex prototype over real-world datasets, including Zillow real estate adverts, a decade of U.S. flight data [63], and web server logs from a large university. Tuplex outperforms other Python compilers by up to 32×, and its generated code comes within 2× of the performance of Weld [49] and Hyper [26] for pure query execution time, while achieving 2× faster end-to-end runtime in a realistic data analytics setting (§6.2). Tuplex's dual-mode processing facilitates end-to-end optimizations that improve runtime by up to 3× over simple UDF compilation (§6.3). Finally, Tuplex performs well both on a single server and distributed across a cluster of AWS Lambda functions (§6.4); and anecdotal evidence suggests that it simplifies the development and debugging of data science pipelines (§7). We will release Tuplex as open-source software.

2 Background and Related Work

Prior attempts to speed up data science via compilation or to compile Python to native code exist, but they fall short of the ideal of compiling end-to-end optimized native code from UDFs written in natural Python. We discuss key approaches and systems in the following; Table 1 summarizes the key points.

Python compilers. Building compilers for arbitrary Python programs, which lack the static types required for optimizing compilation, is challenging. PyPy [54] reimplements the Python interpreter in a compilable subset of Python, which it JIT-compiles via LLVM (i.e., it creates a self-compiling interpreter). GraalPython [16] uses the Truffle [61] language interpreter to implement a similar approach while generating JVM bytecode for JIT compilation. Numba [31] JIT-compiles Python bytecode for annotated functions on which it can perform type inference. Numba supports a subset of Python and targets array-structured data from numeric libraries like NumPy [2]. All of these compilers either myopically focus on optimizing hotspots without attention to high-level program structure, or are limited to a small subset of the Python language (e.g., numeric code only, no strings or exceptions). While Pyston [40] sought to create a full Python compiler using LLVM, it was abandoned due to insufficient performance, memory management problems, and complexity [39].

Python transpilers. Other approaches seek to cross-compile Python into languages for which optimizing compilers exist. Cython [4] unrolls the CPython interpreter and a Python module into C code, which interfaces with standard Python code. Nuitka [18] cross-compiles Python to C++ and also unrolls the interpreter when cross-compilation is not possible. The unrolled code represents a specific execution of the interpreter, which the compiler may optimize, but it still runs the interpreter code, which compromises performance and inhibits end-to-end optimization.

Data-parallel IRs. Special-purpose native code in libraries like NumPy can speed up some UDFs [24], but such pre-
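To make the dual-mode execution model from §1 concrete, the following is a minimal interpreted sketch, not Tuplex's implementation: Tuplex emits optimized LLVM native code for the fast path, whereas this illustration runs everything in Python. The function `dual_mode_map` and its majority-type sampling heuristic are our own illustrative inventions; they only mimic the flow of sampling the input, running a type-specialized common-case path, and diverting non-conforming or exception-raising records to a slow path whose unresolved failures are reported to the user.

```python
# Illustrative sketch of dual-mode execution (NOT Tuplex's actual code:
# Tuplex compiles the fast path to native code via LLVM).
from collections import Counter

def dual_mode_map(udf, records, sample_size=16):
    """Apply `udf` to `records` using a common-case fast path plus a
    general slow path, collecting unresolved exception records."""
    # 1. Sample the input and infer the common-case type.
    sample = records[:sample_size]
    common_type = Counter(type(r) for r in sample).most_common(1)[0][0]

    results, fallback_rows = [], []
    # 2. "Fast path": only records matching the inferred common-case
    #    type are processed; anything else is set aside.
    for i, rec in enumerate(records):
        if type(rec) is common_type:
            try:
                results.append((i, udf(rec)))
                continue
            except Exception:
                pass  # exception within the UDF: bail out to slow path
        fallback_rows.append(i)

    # 3. Slow path: re-process offending records (UDFs are stateless,
    #    so re-execution is safe); unresolved records are reported.
    unresolved = []
    for i in fallback_rows:
        try:
            results.append((i, udf(records[i])))
        except Exception as exc:
            unresolved.append((records[i], exc))

    results.sort()  # restore input order by record index
    return [value for _, value in results], unresolved

# The km->miles UDF from the PySpark example; None is a dirty record
# that the slow path cannot resolve and therefore reports.
vals, bad = dual_mode_map(lambda m: m * 1.609, [100, 250.0, None, 42])
```

Here the sample's majority type (`int`) becomes the common case, `250.0` falls back but resolves on the slow path, and `None` raises a `TypeError` that ends up in the unresolved summary rather than failing the whole pipeline — mirroring how dual-mode pipelines always complete on dirty data.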