Lachesis: Automatic Partitioning for UDF-Centric Analytics Jia Zou Arun Iyengar Binhang Yuan Amitabh Das IBM T.J.Watson Research Center Dimitrije Jankov Pratik Barhate
[email protected] Chris Jermaine Arizona State University Rice University (jia.zou,adas59,pbarhate)@asu.edu (by8,dj16,cmj4)@rice.edu ABSTRACT enumerate partitioner candidates based on foreign keys and select Partitioning is effective in avoiding expensive shuffling operations. the optimal candidate using a cost-based approach. However, it remains a significant challenge to automate this process However, it remains a significant challenge to automate this for Big Data analytics workloads that extensively use user defined process for UDF-centric analytics workloads, as illustrated in Fig. 1. functions (UDFs), where sub-computations are hard to be reused First, functional dependency that is widely utilized for partition- for partitionings compared to relational applications. In addition, ing selection is often unavailable in the unstructured data. Second, functional dependency that is widely utilized for partitioning selec- while the cost model based on relational algebra is widely used for tion is often unavailable in the unstructured data that is ubiquitous selecting optimal partitioner candidates for relational applications, in UDF-centric analytics. We propose the Lachesis system, which there is no widely acceptable cost model for UDF-centric applica- represents UDF-centric workloads as workflows of analyzable and tions due to the opaqueness of UDFs and objects [49, 67]. Third, reusable sub-computations. Lachesis further adopts a deep reinforce- sub-computations are opaque to the system and hard to be reused ment learning model to infer which sub-computations should be and matched for partitionings compared to relational applications.