Lachesis: Automatic Partitioning for UDF-Centric Analytics

Jia Zou, Amitabh Das, Pratik Barhate
Arizona State University
(jia.zou,adas59,pbarhate)@asu.edu

Arun Iyengar
IBM T.J. Watson Research Center
[email protected]

Binhang Yuan, Dimitrije Jankov, Chris Jermaine
Rice University
(by8,dj16,cmj4)@rice.edu

ABSTRACT

Partitioning is effective in avoiding expensive shuffling operations. However, it remains a significant challenge to automate this process for Big Data analytics workloads that extensively use user-defined functions (UDFs), where sub-computations are harder to reuse for partitioning than in relational applications. In addition, functional dependency, which is widely used for partitioning selection, is often unavailable in the unstructured data that is ubiquitous in UDF-centric analytics. We propose the Lachesis system, which represents UDF-centric workloads as workflows of analyzable and reusable sub-computations. Lachesis further adopts a deep reinforcement learning model to infer which sub-computations should be used to partition the underlying data. This analysis is then applied to automatically optimize the storage of the data across applications to improve performance and users' productivity.

PVLDB Reference Format:
Jia Zou, Amitabh Das, Pratik Barhate, Arun Iyengar, Binhang Yuan, Dimitrije Jankov, and Chris Jermaine. Lachesis: Automatic Partitioning for UDF-Centric Analytics. PVLDB, 14(8): 1262-1275, 2021.
doi:10.14778/3457390.3457392

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 8 ISSN 2150-8097.

1 INTRODUCTION

Big Data analytics systems such as Spark [63], Hadoop [60], Flink [2], and TupleWare [12] have been designed and developed to address analytics on unstructured data that cannot be efficiently represented in relational schemas. By supplying user-defined functions (UDFs) written in the host language, such as Python, Java, Scala, or C++, users can express complex computations with control structures such as conditional statements and loops. Such systems provide high flexibility and make it easy to develop complex analytics on top of unstructured data, which accounts for most of the world's data (above 80% by many estimates [54]).

Most Big Data analytics frameworks are deployed on distributed clusters and require a large dataset to be partitioned horizontally across multiple machines [7]. Because a large dataset can be involved in multiple join-based analytics workloads, finding the optimal partitioning is a non-trivial task [1, 6, 13, 18, 28, 38, 43, 65]. There is therefore a pressing need to automate this partitioning process. Existing works in physical database design [1, 23, 28, 38, 43, 65] automate the partitioning well for relational datasets. As illustrated in Fig. 1, they enumerate partitioner candidates based on foreign keys and select the optimal candidate using a cost-based approach.

However, it remains a significant challenge to automate this process for UDF-centric analytics workloads, as illustrated in Fig. 1. First, functional dependency, which is widely used for partitioning selection, is often unavailable in unstructured data. Second, while the cost model based on relational algebra is widely used for selecting optimal partitioner candidates for relational applications, there is no widely accepted cost model for UDF-centric applications due to the opaqueness of UDFs and objects [49, 67]. Third, sub-computations are opaque to the system and are harder to reuse and match for partitioning than in relational applications.

Figure 1: Lachesis vs. relational physical database design
[Figure 1 compares three stages: partitioner candidate enumeration, partitioner candidate selection, and partitioner matching. The relational pipeline uses foreign key analysis, cost estimation (cost = io_cost + cpu_cost), and attribute matching. The UDF-centric pipeline uses historical IR graph analysis, DRL-based selection with no cost estimation, and subgraph matching.]

Motivating Example. We have three datasets: (1) a collection of reddit comment objects in JSON ($\{c\}$); (2) a collection of reddit author objects ($\{a\}$) in CSV; and (3) a collection of subreddit community objects ($\{sr\}$), also in JSON. We need to rank comments by their impact. A classifier first predicts whether the author or the subreddit community is more important for a given comment. The impact score ($I_c$) of a comment ($c$) is determined by the classification result, as illustrated in Eq. 1. The classifier may use an arbitrary algorithm, such as a complex deep learning neural network or a simple conditional branch such as c.score > x.

\[
I_c =
\begin{cases}
\max\{a.lk,\ c.ck\} & \text{if } classify(c) \text{ is true} \\
sr.ns & \text{otherwise}
\end{cases}
\tag{1}
\]

The comment importance scores can be computed via a UDF-customized three-way join, whose join selection predicate is defined by the UDF shown in Listing 1.

Unlike SQL applications, where in many cases people can simply follow the foreign keys to perform co-partitioning, UDF-centric applications may involve arbitrary logic such as classify() in the above example. Even the application programmer cannot easily figure out the optimal partitioning. Worse, UDF-centric applications running over complex objects are almost opaque to the system. For example, the system does not understand what happens inside the join_selection() function in Listing 1, so automatic enumeration, selection, and matching of partitionings become difficult.

Listing 1: UDF-centric join selection predicate

    bool join_selection (string comment_line, string author_line, string subreddit_line) {
      string new_comment_line = schema_resolve(comment_line); //preprocessing
      json c = my_json::parse(new_comment_line); //parsing comment json object
      if (classify(c) == true) { //need to join with authors
        string c_a = c["author"]; //derive author name from comment
        vector<string> r = my_csv::parse(author_line); //parsing author CSV file
        string a_name = r[1]; //derive name from author
        return (c_a == a_name);
      } else { //need to join with subreddits
        string c_sr = c["subreddit"]; //derive subreddit name from comment
        json sr = my_json::parse(subreddit_line); //parsing subreddit json object
        string sr_name = sr["name"]; //derive name from subreddit
        return (c_sr == sr_name);
      }
    }

Figure 2: IR graph for Listing 1. Partitioner candidates for authors, comments, and subreddits are illustrated in different colors.

... partitioner candidate selection [1, 6, 13, 18, 28, 38, 43, 65] cannot describe the overhead of manipulations (i.e., parsing, (de)compression, and (de)serialization) of arbitrary objects [4, 48]. These issues make it challenging to select the optimal partitioner. Therefore, we propose a deep reinforcement learning (DRL) [32, 37, 50, 51] formulation that is based on a set of unique features extracted from historical workflows for each partitioner candidate, including frequency, recency, selectivity, complexity, key distributions, and the number and size of co-partitioned datasets (Sec. 3.2.2 and Sec. 4.1.3).

Problem 3. Partitioner Matching. Given a query, if its input has been partitioned using a UDF-based partitioner, the query optimizer should recognize the partitioning and decide whether a shuffling stage can be avoided. To facilitate such matching, we abstract the UDF matching problem into an IR subgraph isomorphism problem [10], utilizing the two-terminal characteristics of the partitioner IR graphs (Sec. 3.2.3 and Sec. 4.2).

Our contributions can be summarized as follows:
(1) To our knowledge, we are the first to systematically explore automatic partitioning for UDF-centric applications. We propose Lachesis, an end-to-end cross-layer system that automatically creates partitions to improve workflow performance.
(2) We propose a set of new functionalities for partitioner candidate enumeration, selection, and partitioner matching, based on subgraph searching and merging, DRL with historical workflow analysis, and isomorphic subgraph matching.
(3) We implement the Lachesis system and conduct a detailed performance evaluation and overhead analysis.

2 BACKGROUND

2.1 IR for UDF-centric analytics

User-defined functions (UDFs) were first proposed as an enrichment of the SQL language, allowing SQL programmers to implement their own functions for processing relational data. Later, MapReduce and dataflow platforms such as Hadoop [60], Spark [63], and Flink [2] further integrated UDFs with high-level languages such as Java and Python so that even non-relational data such as texts and images
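
As a rough illustration of the IR view introduced in this section, a partitioner candidate such as the comment-to-author key in Listing 1 can be modeled as a chain of IR operations running from a scan to a join key. The C++ sketch below is not the Lachesis implementation. IrNode, PartitionerChain, comment_author_chain, matches, and partition_of are hypothetical names introduced only for illustration, and exact chain equality stands in, as a deliberate simplification, for the subgraph isomorphism matching described in Problem 3.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical IR node: one operation on the path from a scan to a join key.
struct IrNode {
    std::string op;   // operation name, e.g. "scan", "my_json::parse", "[]"
    std::string arg;  // dataset or field name; empty when not applicable

    bool operator==(const IrNode& other) const {
        return op == other.op && arg == other.arg;
    }
};

// A partitioner candidate modeled as a two-terminal chain:
// the first node is the source scan, the last node yields the key.
using PartitionerChain = std::vector<IrNode>;

// Chain that derives the author name from a comment, following Listing 1.
PartitionerChain comment_author_chain() {
    return {{"scan", "comments.json"},
            {"schema_resolve", ""},
            {"my_json::parse", ""},
            {"[]", "author"}};
}

// Simplified matching (an assumption of this sketch): a stored partitioner
// matches a query only if both key-extraction chains are identical.
bool matches(const PartitionerChain& stored, const PartitionerChain& query) {
    return stored == query;
}

// Hash partitioning: equal keys always map to the same partition index,
// which is what allows co-partitioned inputs to be joined without a shuffle.
std::size_t partition_of(const std::string& key, std::size_t num_partitions) {
    assert(num_partitions > 0);
    return std::hash<std::string>{}(key) % num_partitions;
}
```

Under these assumptions, a query whose key-extraction chain equals the stored chain could reuse the existing partitioning, because equal keys hash to the same partition. A chain that ends in [] "subreddit" instead of [] "author" would not match and would still require a shuffle.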
