Scalable Querying of Nested Data
Jaclyn Smith†, Michael Benedikt†, Milos Nikolic‡, Amir Shaikhha†
†University of Oxford, UK   ‡University of Edinburgh, UK

ABSTRACT

While large-scale distributed data processing platforms have become an attractive target for query processing, these systems are problematic for applications that deal with nested collections. Programmers are forced either to perform non-trivial translations of collection programs or to employ automated flattening procedures, both of which lead to performance problems. These challenges only worsen for nested collections with skewed cardinalities, where both handcrafted rewriting and automated flattening are unable to enforce load balancing across partitions.

In this work, we propose a framework that translates a program manipulating nested collections into a set of semantically equivalent shredded queries that can be efficiently evaluated. The framework employs a combination of query compilation techniques, an efficient data representation for nested collections, and automated skew-handling. We provide an extensive experimental evaluation, demonstrating significant improvements provided by the framework in diverse scenarios for nested collection programs.

1. INTRODUCTION

Large-scale, distributed data processing platforms such as Spark [59], Flink [9], and Hadoop [21] have become indispensable tools for modern data analysis. Their wide adoption stems from powerful functional-style APIs that allow programmers to express complex analytical tasks while abstracting distributed resources and data parallelism. These systems use an underlying data model that allows for data to be described as a collection of tuples whose values may themselves be collections. This nested data representation arises naturally in many domains, such as web, biomedicine, and business intelligence. The widespread use of nested data also contributed to the rise in NoSQL databases [4, 5, 42], where the nested data model is central. Thus, it comes as no surprise that nested data accounts for most large-scale, structured data processing at major web companies [40].

Despite natively supporting nested data, existing systems fail to harness the full potential of processing nested collections at scale. One implementation difficulty is the discrepancy between the tuple-at-a-time processing of local programs and the bulk processing used in distributed settings. Though maintaining similar APIs, local programs that use Scala collections often require non-trivial and error-prone translations to distributed Spark programs. Beyond programming difficulties, nested data processing requires scaling in the presence of large or skewed inner collections. We next elaborate these difficulties by means of an example.

Example 1. Consider a variant of the TPC-H database containing a flat relation Part with information about parts and a nested relation COP with information about customers, their orders, and the parts purchased in the orders. The relation COP stores customer names, order dates, part IDs, and purchased quantities per order. The type of COP is:

    Bag(⟨ cname : string, corders : Bag(⟨ odate : date,
        oparts : Bag(⟨ pid : int, qty : real ⟩) ⟩) ⟩)

Consider now a high-level collection program that returns for each customer and for each of their orders, the total amount spent per part name. This requires joining COP with Part on pid to get the name and price of each part and then summing up the costs per part name. We can express this using nested relational calculus [11, 56] as:

    for cop in COP union
      {⟨ cname := cop.cname,
         corders := for co in cop.corders union
           {⟨ odate := co.odate,
              oparts := sumBy^{total}_{pname}(
                for op in co.oparts union
                  for p in Part union
                    if op.pid == p.pid then
                      {⟨ pname := p.pname, total := op.qty * p.price ⟩}) ⟩} ⟩}

We write Q for the whole query, Q_corders for the subquery constructing corders, and Q_oparts for the subquery constructing oparts. The program navigates to oparts in COP, joins each bag with Part, computes the amount spent per part, and aggregates using sumBy, which sums up the total amount spent for each distinct part name. The output type is:

    Bag(⟨ cname : string, corders : Bag(⟨ odate : date,
        oparts : Bag(⟨ pname : string, total : real ⟩) ⟩) ⟩)
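For readers who think of such programs in terms of the Scala collections API rather than nested relational calculus, the following is a purely local Scala sketch of Example 1. It is illustrative only: the case class and field names mirror COP and Part above, but the concrete types (String for date, Double for real), the toy data, and the use of groupBy with sum in place of sumBy are our own assumptions, not part of the paper's code.

// Local (non-distributed) sketch of Example 1 over plain Scala collections.
// Case classes mirror the COP and Part schemas; the sample data is made up.
case class PartInfo(pid: Int, pname: String, price: Double)
case class OPart(pid: Int, qty: Double)
case class Order(odate: String, oparts: Seq[OPart])
case class Customer(cname: String, corders: Seq[Order])

case class OPartNamed(pname: String, total: Double)
case class OrderOut(odate: String, oparts: Seq[OPartNamed])
case class CustomerOut(cname: String, corders: Seq[OrderOut])

object LocalExample {
  def main(args: Array[String]): Unit = {
    val part = Seq(PartInfo(1, "bolt", 0.5), PartInfo(2, "nut", 0.2))
    val cop = Seq(
      Customer("Alice", Seq(Order("2020-01-01", Seq(OPart(1, 10), OPart(2, 4))))),
      Customer("Bob", Seq.empty))

    // Join each inner oparts bag with Part on pid, then sum totals per part name
    // (the role played by sumBy in the nested relational calculus query).
    val result: Seq[CustomerOut] = cop.map { c =>
      CustomerOut(c.cname, c.corders.map { o =>
        val totals = for {
          op <- o.oparts
          p  <- part if op.pid == p.pid
        } yield (p.pname, op.qty * p.price)
        OrderOut(o.odate,
          totals.groupBy(_._1)
                .map { case (name, ts) => OPartNamed(name, ts.map(_._2).sum) }
                .toSeq)
      })
    }
    result.foreach(println)
  }
}

Such a program is easy to write and test on small, centralized data, which is precisely what makes this style attractive for prototyping; the difficulties discussed below arise once COP and Part must be distributed.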
A key motivation for our work is analytics pipelines, which typically consist of a sequence of transformations with the output of one transformation contributing to the input of the next, as above. Although the final output consumed by an end user or application may be flat, the intermediate nested outputs may still be important in themselves: either because the intermediate outputs are used in multiple follow-up transformations, or because the pipeline is expanded and modified as the data is explored. Our interest is thus in programs consisting of a sequence of queries, where both inputs and outputs may be nested.

We next discuss the challenges associated with processing such examples using distributed frameworks.

Challenge 1: Programming mismatch. The translation of collection programs from local to distributed settings is not always straightforward. Distributed processing systems natively distribute collections only at the granularity of top-level tuples and provide no direct support for using nested loops over different distributed collections. To address these limitations, programmers resort to techniques that are either prohibitively expensive or error-prone. In our example, a common error is to try to iterate over Part, which is a collection distributed across worker nodes, from within the map function over COP that navigates to oparts, which are collections local to one worker node. Since Part is distributed across worker nodes, it cannot be referenced inside another distributed collection. Faced with this impossibility, programmers could either replicate Part to each worker node, which can be too expensive, or rewrite to use an explicit join operator, which requires flattening COP to bring the pid values needed for the join to the top level. A natural way to flatten COP is by pairing each cname with every tuple from corders and subsequently with every tuple from oparts; however, the resulting (cname, odate, pid, qty) tuples encode incomplete information (e.g., they omit customers with no orders), which can eventually cause an incorrect result. Manually rewriting such a computation while preserving its correctness is non-trivial.
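To make this mismatch concrete, the sketch below (our own illustration, not the paper's code) reuses the case classes from the previous sketch and assumes a local Spark session. It first shows, commented out, the invalid pattern of referencing the distributed Part collection from inside a closure over COP, and then the kind of manual rewrite described above: flatten COP, join with Part on pid, and re-aggregate.

// Spark RDD sketch of the Challenge 1 pitfall and its manual workaround.
import org.apache.spark.sql.SparkSession

case class PartInfo(pid: Int, pname: String, price: Double)
case class OPart(pid: Int, qty: Double)
case class Order(odate: String, oparts: Seq[OPart])
case class Customer(cname: String, corders: Seq[Order])

object DistributedExample {
  def main(args: Array[String]): Unit = {
    // local[*] is only an assumed test configuration.
    val spark = SparkSession.builder().appName("cop-example").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val partRDD = sc.parallelize(Seq(PartInfo(1, "bolt", 0.5), PartInfo(2, "nut", 0.2)))
    val copRDD  = sc.parallelize(Seq(
      Customer("Alice", Seq(Order("2020-01-01", Seq(OPart(1, 10), OPart(2, 4)))))))

    // INVALID: referencing the distributed partRDD inside a closure that runs on
    // workers fails at runtime, since RDDs cannot be nested inside other RDDs.
    // copRDD.map(c => c.corders.map(o => o.oparts.flatMap(op =>
    //   partRDD.filter(_.pid == op.pid).collect())))

    // Manual rewrite: flatten COP to bring pid to the top level, join with Part,
    // then aggregate per (cname, odate, pname). Note that customers with no
    // orders vanish from `flat`, which is exactly the correctness pitfall above.
    val flat = copRDD.flatMap(c =>
      c.corders.flatMap(o => o.oparts.map(op => (op.pid, (c.cname, o.odate, op.qty)))))
    val joined = flat.join(partRDD.map(p => (p.pid, (p.pname, p.price))))
    val totals = joined
      .map { case (_, ((cname, odate, qty), (pname, price))) =>
        ((cname, odate, pname), qty * price) }
      .reduceByKey(_ + _)
    totals.take(10).foreach(println)

    spark.stop()
  }
}

Even this rewrite only recovers a flat result; rebuilding the nested output of Example 1 requires further grouping, and preserving customers and orders with empty inner bags needs additional care.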
Challenge 2: Input and output data representation. Default top-level partitioning of nested data can lead to poor scalability, particularly with few top-level tuples and/or large inner collections. The number of top-level tuples limits the level of parallelism by requiring all data on lower levels to persist on the same node, which is detrimental regardless of the size of these inner collections. Conversely, large inner collections can overload worker nodes and increase data transfer costs, even with larger numbers of top-level tuples. Exploration of succinctness and efficiency for different representations of nested output has not, to our knowledge, been the topic of significant study.

Challenge 3: Data skew. Last but not least, data skew can significantly degrade performance of large-scale data processing, even when dealing with flat data only. Skew can lead to load imbalance where some workers perform significantly more work than others, prolonging run times on platforms with synchronous execution such as Spark, Flink, and Hadoop. The presence of nested data only exacerbates the problem of load imbalance: inner collections may have skewed cardinalities (e.g., very few customers can have very many orders), and the standard top-level partitioning places each inner collection entirely on a single worker node. Moreover, programmers must ensure that inner collections do not grow large enough to cause disk spill or crash worker nodes due to insufficient memory.
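The framework proposed in this paper handles such skew automatically. Purely for contrast, the sketch below illustrates a common manual workaround in plain Spark, key salting, in which a heavy join key is spread over several sub-keys so its rows no longer land in a single task. The helper name saltedJoin, the Int key type, and the salt factor are illustrative assumptions and are not part of the proposed approach.

// Manual skew mitigation by salting a join key (illustrative only).
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
import scala.util.Random

object SaltedJoin {
  // Spread each key over `salt` sub-keys: the right side is replicated once per
  // salt value, the left side is scattered randomly, then the salt is dropped.
  def saltedJoin[V: ClassTag, W: ClassTag](
      left: RDD[(Int, V)], right: RDD[(Int, W)], salt: Int): RDD[(Int, (V, W))] = {
    val saltedRight = right.flatMap { case (k, w) => (0 until salt).map(s => ((k, s), w)) }
    val saltedLeft  = left.map { case (k, v) => ((k, Random.nextInt(salt)), v) }
    saltedLeft.join(saltedRight).map { case ((k, _), vw) => (k, vw) }
  }
}

Hand-tuning such tricks per query and per dataset is exactly the kind of effort the approach described next aims to remove.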
Prior work has addressed some of these challenges. High-level scripting languages such as Pig Latin [43] and Scope [13] ease the programming of distributed data processing systems. Apache Hive [6], F1 [48, 46], Dremel [40], and BigQuery [47] provide SQL-like languages with extensions for querying nested structured data at scale. Emma [3] and DIQL [27] support for-comprehensions and SQL-like syntax, respectively, via DSLs deeply embedded in Scala, and target several distributed processing frameworks. But alleviating the performance problems caused by skew and by manually flattening nested collections remains an open problem.

Our approach. To address these challenges and achieve scalable processing of nested data, we advocate an approach that relies on three key aspects:

Query Compilation. When designing programs over nested data, programmers often first conceptualize desired transformations using high-level languages such as nested relational calculus [11, 56], monad comprehension calculus [26], or the collection constructs of functional programming languages (e.g., the Scala collections API). These languages lend themselves naturally to centralized execution, thus making them the weapon of choice for rapid prototyping and testing over small datasets.

To relieve programmers from manually rewriting programs for scalable execution (Challenge 1), we propose using a compilation framework that automatically restructures programs to distributed settings. In this process, the framework applies optimizations that are often overlooked by programmers, such as column pruning, selection pushdown, and pushing nesting and aggregation operators past joins.