Tuplex: Data Science in Python at Native Code Speed Leonhard Spiegelberg Rahul Yesantharao Malte Schwarzkopf Tim Kraska
[email protected] [email protected] [email protected] [email protected] Brown University MIT Brown University MIT Abstract compute a flight’s distance covered from kilometers to miles Data science pipelines today are either written in Python via a UDF after joining with a carrier table: or contain user-defined functions (UDFs) written in Python. carriers= spark.read.load( 'carriers.csv') But executing interpreted Python code is slow, and arbitrary fun= udf( lambda m: m* 1.609, DoubleType()) Python UDFs cannot be compiled to optimized machine code. spark.read.load('flights.csv') We present Tuplex, the first data analytics framework that .join(carriers, 'code', 'inner') just-in-time compiles developers’ natural Python UDFs into .withColumn('distance', fun('distance')) efficient, end-to-end optimized native code. Tuplex introduces .write.csv('output.csv') a novel dual-mode execution model that compiles an opti- mized fast path for the common case, and falls back on slower This code will run data loading and the join using Spark’s exception code paths for data that fail to match the fast path’s compiled Scala operators, but must execute the Python UDF assumptions. Dual-mode execution is crucial to making end- passed to withColumn in a Python interpreter. This requires to-end optimizing compilation tractable: by focusing on the passing data between the Python interpreter and the JVM [41], common case, Tuplex keeps the code simple enough to apply and prevents generating end-to-end optimized code across the aggressive optimizations.