Tuplex: Robust, Efficient Analytics When Python Rules

Leonhard F. Spiegelberg, Brown University
Tim Kraska, Massachusetts Institute of Technology

ABSTRACT

Spark became the de facto industry standard as an execution engine for data preparation, cleaning, distributed machine learning, streaming, and warehousing over raw data. However, with the success of Python the landscape is shifting again; there is a strong demand for tools that integrate better with the Python landscape and do not suffer from the impedance mismatch that Spark has. In this paper, we demonstrate Tuplex (short for tuples and exceptions), a Python-native data preparation framework that allows users to develop and deploy pipelines faster and more robustly while providing bare-metal execution times through code compilation whenever possible.

PVLDB Reference Format: Leonhard F.

[...] was only added as an afterthought and does not achieve the same performance as other frameworks, which were designed with Python in mind. For example, Tensorflow, Numpy, Scikit-Learn, and Keras combine a Python frontend seamlessly with a C++ backend for performance and the possibility to take advantage of modern hardware (GPUs, TPUs, FPGAs). This combination works particularly well because it allows data to be transferred between Python, other libraries, and C++ without unnecessary copies or conversions and with minimal communication overhead.

Moreover, of all the frameworks mentioned, only Spark is a general-purpose platform aiming to do it all, from data integration, through analytics and streaming, to model building. However, according to the “one-size-does-not-fit-all” theory [10], often huge performance gains can be achieved