
Optimizing Across Relational and Linear Algebra in Parallel Analytics Pipelines

Asterios Katsifodimos Assistant Professor @ TU Delft

[email protected]
http://asterios.katsifodimos.com

Outline

• Introduction and Motivation
  • Relational databases vs. parallel dataflow systems
• Declarativity in Parallel-dataflow Systems
  • The Emma language and optimizer
  • The Lara language for mixed linear- and relational-algebra programs
• Optimizing across linear & relational Algebra
  • BlockJoin: A join algorithm for mixed relational & linear algebra programs
• The road ahead

A timeline of data analysis tools from a biased perspective

• 70s: Appearance of Relational Databases
• 80s: First parallel shared-nothing architectures
• 90s: Open Source Projects and mainstream databases
• 00s: First columnar storage Databases. Faster!
• 2004 - 2009: Scalable, UDF-based, data analytics blooms
• 2009 - 2016: Alternative MapReduce implementations
• 2016 - : Better Abstractions for dataflow systems go mainstream

Systems shown on the timeline: Ingres, Gamma, BerkeleyDB, MonetDB, Hadoop, Spark, Spark Dataframes, Oracle, TRACE, MS Access, C-Store, MapReduce, Flink, SystemML, System-R, Teradata, PostgreSQL, Vertica, Impala, Emma, MySQL, Aster Data, Giraph, MRQL, GreenPlum, Hive (SQL!), HBase, Blu, …

Annotations:
• 1974: Don Chamberlin publishes SQL
• Open Source rises
• 2003: Internet Explodes. Google publishes MapReduce.
• Criticism of MapReduce. NoSQL movement. Back to SQL?
• We need easier to use tools!

Relational DBs vs. Modern Big Data Processing Stack

Relational DB stack (top to bottom):
• SQL
• SQL Compiler & Optimizer
• Relational Dataflow Engine
• Transaction Manager
• Row/Column Storage
• Operating System

Modern Big Data processing stack (top to bottom):
• SQL-like abstraction; Iterative M/R-like Jobs (Java, Scala, …)
• SQL Compiler & Optimizer
• Dataflow engines: Apache Hadoop, Apache Flink, Spark; Put/Get ops: HBase
• Files/Bytes: Hadoop Distributed File System (HDFS)
• Cluster Management (Kubernetes, YARN, …)

Relational Databases
• SQL | Relations | RDBMS
• Declarative Data Processing

Parallel Dataflows
• Second-Order Functions (Map, Reduce, etc.) | Distributed Collections | Parallel Dataflow Engines

People with data analysis skills

[Diagram labels: Big Data systems programming experts; Big data analysts]

Quiz: guess the algorithm!

Apache Spark

    ... // initialize
    while (theta) {
      newCntrds = points
        .map(findNearestCntrd)
        .map( (c, p) => (c, (p, 1L)) )
        .reduceByKey( (x, y) => (x._1 + y._1, x._2 + y._2) )
        .map( x => Centroid(x._1, x._2._1 / x._2._2) )

      bcCntrs = sc.broadcast(newCntrds.collect())
    }

Apache Flink

    ... // initialize
    val cntrds = centroids.iterate(theta) { currCntrds =>
      val newCntrds = points
        .map(findNearestCntrd).withBcSet(currCntrds, "cntrds")
        .map( (c, p) => (c, p, 1L) )
        .groupBy(0).reduce( (x, y) => (x._1, x._2 + y._2, x._3 + y._3) )
        .map( x => Centroid(x._1, x._2 / x._3) )

      newCntrds // the step function returns the centroids for the next iteration
    }

K-Means Clustering

• Input
  • Dataset (set of points in 2D), large
  • K initial centroids, small
• Step 1:
  • Read K-centroids
  • Assign each point to the closest centroid (wrt distance)
  • Output: each point with its assigned centroid (or, cluster)
• Step 2:
  • For each centroid
  • Calculate its average position wrt distance from its points
  • Output: the new centroid (coordinates)
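The two steps above can be sketched on plain Scala collections, without any dataflow engine. This is a minimal single-machine illustration, not the Spark or Flink program from the quiz: `Point`, `nearest` and `step` are hypothetical names, squared Euclidean distance is assumed as the distance measure, and a centroid that attracts no points is assumed to keep its position.

```scala
object KMeansSketch {
  // Hypothetical 2D point type; not taken from the slides.
  final case class Point(x: Double, y: Double) {
    def +(o: Point): Point = Point(x + o.x, y + o.y)
    def /(n: Long): Point = Point(x / n, y / n)
    def distSq(o: Point): Double = {
      val dx = x - o.x
      val dy = y - o.y
      dx * dx + dy * dy
    }
  }

  // Step 1: index of the centroid closest to point p.
  def nearest(centroids: Vector[Point], p: Point): Int =
    centroids.indices.minBy(i => centroids(i).distSq(p))

  // Steps 1 + 2: one iteration. Per centroid we keep a running (sum, count)
  // pair, then divide to get the mean position of its points.
  def step(points: Seq[Point], centroids: Vector[Point]): Vector[Point] = {
    val sums: Map[Int, (Point, Long)] =
      points
        .map(p => (nearest(centroids, p), (p, 1L)))
        .groupMapReduce(_._1)(_._2) { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
    centroids.indices.map { i =>
      // Assumption: a centroid without points stays where it is.
      sums.get(i).map { case (sum, n) => sum / n }.getOrElse(centroids(i))
    }.toVector
  }
}
```

Calling `KMeansSketch.step` repeatedly until the centroids stop moving gives the full algorithm; the (sum, count) pairs play the same role as the `(p, 1L)` tuples in the quiz code.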

Partial Aggregates

Efficient aggregation requires a special combination of primitives (in the K-Means code above: map followed by reduceByKey in Spark, groupBy(0).reduce in Flink).

Broadcast Variables

Explicit data-exchange between the driver and the parallel dataflows (in the K-Means code above: sc.broadcast(newCntrds.collect()) in Spark, withBcSet in Flink).

Native Iterations

Feedback edges in the dataflow graph have to be specified explicitly (in the K-Means code above: centroids.iterate(theta) in Flink, versus a driver-side while loop in Spark).

Beyond Datasets: End-to-end ML Pipelines

A typical Data Science Workflow

Data Management Machine Learning

Eight numbered steps (1 to 8) in the figure:
• Question Formulation
• Data Source Exploration & Selection
• Data cleaning & preparation
• Feature Selection
• ML Algorithm Selection
• Hyper Parameter Search
• Model Training
• Model Evaluation

80% of data scientists’ time

A very simple ML pipeline example

Factory 1, Factory 2, Factory 3 → ⋈ → Gather and clean sensor data → Principal Component Analysis → Train ML Model

Preprocessing

Naturally expressed in
• SQL
• Dataflow APIs

Pipeline stage: Factory 1, Factory 2, Factory 3 → ⋈ → Gather and clean sensor data

    SELECT f1.v, f2.v, f3.v
    FROM factory_1 AS f1,
         factory_2 AS f2,
         factory_3 AS f3
    WHERE f1.sid = f2.sid AND
          f2.sid = f3.sid AND
          f1.v IS NOT NULL AND
          f2.v IS NOT NULL AND
          f3.v IS NOT NULL
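The same preprocessing step can also be written against a collection API. The sketch below uses plain Scala sequences rather than a real dataflow engine; `Reading` and `joinFactories` are hypothetical names, and a missing sensor value is assumed to be modelled as `None` instead of SQL `NULL`. It evaluates the join as a nested loop, which is fine for illustration but not how an optimizer would execute it.

```scala
object PreprocessSketch {
  // Hypothetical sensor reading: v is None when the value is missing (SQL NULL).
  final case class Reading(sid: Int, v: Option[Double])

  // Three-way equi-join on sid that keeps only rows where all three values exist.
  def joinFactories(
      f1s: Seq[Reading],
      f2s: Seq[Reading],
      f3s: Seq[Reading]): Seq[(Double, Double, Double)] =
    for {
      f1 <- f1s
      f2 <- f2s if f1.sid == f2.sid
      f3 <- f3s if f2.sid == f3.sid
      v1 <- f1.v // an empty Option drops the row, like IS NOT NULL
      v2 <- f2.v
      v3 <- f3.v
    } yield (v1, v2, v3)
}
```

The comprehension mirrors the SELECT-FROM-WHERE structure: generators correspond to FROM, guards to WHERE, and `yield` to SELECT.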

Matrix Operations/ML in SQL

Calculating covariance matrix in SQL

Impedance mismatch!

Pipeline stages: Principal Component Analysis → Train ML Model

* From: http://stackoverflow.com/questions/18019469/finding-covariance-using-

Machine Learning in “Native” Syntax

Naturally expressed in
• Matlab
• R
• …

Pipeline stages: Principal Component Analysis → Train ML Model

    % read factories as matrix FacMatrix
    [rows, cols] = size(FacMatrix);
    colMeans = mean(FacMatrix);
    NormMat = zeros(rows, cols);
    for i = 1:rows
        NormMat(i, :) = FacMatrix(i, :) - colMeans;
    end
    CovMat = NormMat' * NormMat / (rows - 1);
    % …
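For comparison, the same covariance computation can be spelled out on plain Scala vectors. This is only a sketch under assumptions: `Matrix` is a hypothetical row-major `Vector[Vector[Double]]`, no linear algebra library is used, and the input is assumed to have at least two rows.

```scala
object CovarianceSketch {
  type Matrix = Vector[Vector[Double]] // row-major; hypothetical representation

  // Mean of each column.
  def colMeans(m: Matrix): Vector[Double] =
    m.transpose.map(col => col.sum / col.length)

  // Sample covariance: center every column, then (X^T X) / (rows - 1).
  def covariance(m: Matrix): Matrix = {
    val means = colMeans(m)
    val norm = m.map(row => row.zip(means).map { case (v, mu) => v - mu })
    val cols = norm.transpose
    cols.map { ci =>
      cols.map { cj =>
        ci.zip(cj).map { case (a, b) => a * b }.sum / (m.length - 1)
      }
    }
  }
}
```

The explicit loops over columns are what the matrix syntax hides, which is why expressing this step in a linear algebra notation is more natural than in SQL.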

SQL + ML Native Syntax = Optimization Barrier

Optimization barrier! (between the SQL preprocessing stage and the natively expressed ML stages)

Factory 1, Factory 2, Factory 3 → ⋈ → Gather and clean sensor data → Principal Component Analysis → Train ML Model

Big Data Language Desiderata

• Declarativity: what vs. how
  • Enables optimizability, eases programming
• User-Defined Functions as first-class citizens
  • Important reason why MapReduce took off
• Control-flow for Machine Learning and Graph workloads
  • Iterations, if-then-else branches
  • Turing complete “driver” with parallelizable constructs
• Portability
  • Should support multiple backend engines (Flink, Spark, MapReduce, etc.)
• Deep embedding
  • e.g. à la LINQ (Language Integrated Query) in C#, Ferry in Haskell, etc.

Emma/Lara: Databags & Matrices

Deeply Embedded in Scala

Gather and clean data:

    //SELECT f1.v, f2.v
    //FROM factory_1 as f1, factory_2 as f2
    //WHERE f1.sid == f2.sid and f1.v != null and f2.v != null
    val factories = for {
      f1 <- factory_1
      f2 <- factory_2
      if f1.sid == f2.sid && f1.v != null && f2.v != null
    } yield (f1.v, …, f2.v, …)

Principal Component Analysis:

    while (…) {
      val FacMatrix = factories.toMatrix()
      //Calculate the mean of each of the columns
      val colMeans = FacMatrix.cols( col => col.mean() )
      val NormMat = FacMatrix - colMeans
      //Calculate covariance matrix
      val CovMat = NormMat.t %*% NormMat / (NormMat.numRows - 1)
      // ...
    }

Databag type
§ Like RDDs on Spark, Datasets on Flink, etc.
§ Operations on Databags are parallelized

Declarative for-comprehensions
§ Like Select-From-Where in SQL
§ Can express join, cross, filter, etc.

Loops: First-class citizens

Databag => Matrix
§ Enable optimization across Linear and relational algebra
§ Intermediate representation to reason across linear and relational (WiP)

UDFs: First-class citizens

“Native” Matrix Syntax
§ Against Impedance mismatch

[SIGMOD15/16, SIGMOD Record 16, BeyondMR 2016, PVLDB 2017]

FP goodness: Comprehension Syntax

Comprehensions: generalize SQL and are available as first-class syntax in modern general purpose programming languages.

Math

    ⦃ (x, y) | x ∈ xs, y ∈ ys, x = y ⦄

SQL

    SELECT x, y FROM xs AS x, ys AS y WHERE x = y

Python

    [ (x, y) for x in xs for y in ys if x == y ]

Scala

    for (x <- xs; y <- ys; if x == y) yield (x, y)

Automatic Optimization of Programs

• Caching of intermediate results
  • Reused intermediate results (e.g., within a loop) are automatically cached
• Choice of Partitioning
  • Pre-partition datasets which are used in multiple joins
• Unnesting
  • e.g., converting exists() clauses into joins
• Join order/algorithm selection
  • Classic database optimizations
• Fold-group Fusion
  • Calculate multiple aggregates in one pass
• …

A concrete optimization

BlockJoin: partitioning data for linear algebra operations through joins

[PVLDB ’17 (VLDB ’18)] “BlockJoin: Efficient Matrix Partitioning Through Joins”. A. Kunft, A. Katsifodimos, S. Schelter, T. Rabl, V. Markl. In the Proceedings of the VLDB Endowment, Vol. 10, No. 13, to be presented in VLDB 2018.

Row- vs. block-partitioned operations

Relational Algebra: Aggregate over a row (avg)
Certain relational operations are better executed over row-partitioned tables.

Linear Algebra: Matrix Multiplication
Scalable Linear Algebra operations such as Matrix Multiplication are better executed over block-partitioned matrices. Orders of magnitude speedups.

Matrix Block-partitioning

(Excerpt from the BlockJoin paper shown on the slide.)

… operators in order to produce or consume block-partitioned datasets has the potential to significantly speedup workloads that mix operators from linear and relational algebra. We put our focus on joins, and propose BlockJoin, a join operator which consumes row-partitioned relational data and outputs a block-partitioned matrix. The task of designing such an operator poses a number of challenges. First, in order to block-partition a matrix which results from a join, the matching tuples of the two joining sides need to be known in advance. This stems from the fact that a matrix block can contain columns from both input relations. Thus, the partitioning function cannot know which those columns are, until there is an association of all pairs of tuples in the two joined relations. Therefore, such workloads are typically implemented by first materializing the join results, and subsequently block-partitioning those results.

In contrast, BlockJoin is based on the observation that it is sufficient to know which keys and corresponding identifiers match, prior to materializing the complete join result. Analogous to joins which have been proposed for columnar databases [23, 8, 1], BlockJoin builds on two main concepts: index joins and late materialization. More specifically, BlockJoin first identifies the matching tuple pairs by performing a join on the keys of the two relations (analogous to TID-Joins [24]). After the matching tuple-id pairs have been identified, a partitioning function derives, for each tuple, the blocks to which that tuple contributes data. Finally, it materializes the join result by sending tuples to the machines hosting the corresponding blocks. Thus, BlockJoin completely avoids to materialize the join results in a row-partitioned manner. Instead, it performs the joining and block-partitioning phases together. Our experiments show that BlockJoin performs up to 7x faster than the baseline of conducting a row-wise join followed by a block-partitioning step.

The contributions of this paper are summarized as follows:
• We discuss and motivate the need for implementing relational operators producing block-partitioned datasets (Section 2.2).
• We propose BlockJoin, a distributed join algorithm which produces block-partitioned results for workloads mixing linear and relational algebra operations. To the best of our knowledge, this is the first work proposing a join algorithm for block-partitioned results (Section 3).
• We compare the performance of BlockJoin against a baseline approach, which first joins row-partitioned relations and then block-partitions the results. We thereby exemplify BlockJoin’s superiority experimentally and via a cost model (Section 3.3).
• We provide a reference implementation of BlockJoin based on Apache Spark [38] and experimentally show that, depending on the size and shape of the input relations, BlockJoin is up to 7x faster compared to the baseline approach. Moreover, we show that for tables with large numbers of columns, BlockJoin scales gracefully when the baseline approach fails (Sections 4 and 5).

[Figure 2: A matrix and its blocked representation. (a) Original matrix M, 4 x 4 with entries 1.1 to 4.4; (b) 2 x 2 blocked matrix.]

2. BACKGROUND

In this section we introduce the blocked matrix representation and discuss a running example which will be used throughout the paper. For the example, we also depict a baseline implementation in a distributed dataflow system.

2.1 Block-Partitioned Matrix Representation

Distributed dataflow systems use an element-at-a-time processing model in which an element typically represents a line in a text file or a tuple of a relation. Systems which implement matrices in this model can choose among a variety of partition schemes (e.g., cell-, row- or column-wise partitioning) of the matrix. For common operations such as matrix multiplications, all of these representations incur huge performance overhead [16]. Block-partitioning the matrix provides significant performance benefits, including a reduction in the number of tuples required to represent and process a matrix, block-level compression, and the optimization of operations like multiplication on a block-level basis. These benefits have led to the widespread adoption of block-partitioned matrices in parallel data processing platforms [16, 18, 19]. In a blocked representation, the matrix is split into submatrices of fixed size, called blocks. These blocks become the processing elements in the dataflow system. Figure 2 shows a matrix and its blocked representation. In the remainder of the paper, we will refer to individual blocks using horizontal and vertical block indices: for instance, given a matrix M, we denote the bottom left block of Figure 2 (b) as M_{2,1}. While our algorithm works with arbitrarily shaped blocks, the majority of systems uses square blocks only, and we adopt this convention in the rest of the paper for the sake of simplicity.

2.2 Motivating Example

For our motivating example we choose a common use case of learning a spam detection model in an e-commerce application. Assume that customers write reviews for products, some of which are spam, and we want to train a classifier to automatically detect the spam reviews. The data for products and reviews is stored in different files in a distributed filesystem, and we need an ML algorithm to learn from attributes of both relations. Therefore, we first need to join the records from these tables to obtain reviews with their corresponding products. Next, we need to transform these product-review pairs into a suitable representation for an ML algorithm, which means we need to apply a user defined function (UDF) that transforms the attributes into a vector representation. Finally, we aggregate these vectors into a distributed, blocked feature matrix to feed them into an ML system (such as SystemML).

Common Pattern: Blocked Matrices as Join Results

e.g., join products and their reviews, to form a feature matrix. The matrix is block partitioned to be used in linear algebra operations.

Products, Reviews → ⋈ → Feature Matrix → repartition → Block-Partitioned Matrix
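The two-step pattern (join first, repartition afterwards) can be sketched on local Scala collections as follows. Everything here is an illustrative assumption: `Product`, `Review`, `joinRows` and `blockPartition` are hypothetical names, features are assumed to be already numeric, and a `Map` keyed by block indices stands in for a distributed blocked matrix.

```scala
object BlockPartitionSketch {
  // Hypothetical record types for the products/reviews example.
  final case class Product(id: Int, features: Vector[Double])
  final case class Review(productId: Int, features: Vector[Double])

  // Step 1: materialize the join row by row. Each joined row concatenates
  // the product attributes and the review attributes.
  def joinRows(products: Seq[Product], reviews: Seq[Review]): Vector[Vector[Double]] = {
    val byId = products.map(p => p.id -> p).toMap
    reviews.flatMap(r => byId.get(r.productId).map(p => p.features ++ r.features)).toVector
  }

  // Step 2: cut the row-wise result into square blocks of side b, keyed by
  // (block row index, block column index), both starting at 0.
  def blockPartition(rows: Vector[Vector[Double]], b: Int): Map[(Int, Int), Vector[Vector[Double]]] = {
    val nCols = rows.headOption.map(_.length).getOrElse(0)
    val nColBlocks = (nCols + b - 1) / b
    val blocks = for {
      (rowGroup, bi) <- rows.grouped(b).zipWithIndex
      bj <- 0 until nColBlocks
    } yield (bi, bj) -> rowGroup.map(_.slice(bj * b, (bj + 1) * b))
    blocks.toMap
  }
}
```

In a distributed setting each of the two steps implies its own shuffle of the full joined rows, which is the cost the following slides set out to reduce.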

Data shuffling dominates execution time

• Challenge: can we join and block-partition results with less shuffling?
• Solution: BlockJoin, a one-step parallel join algorithm which outputs block-partitioned results
• Basic idea
  1. Find out which and how many tuples survive the join (without shuffling)
  2. Give each surviving tuple a unique, sequential ID
  3. Shuffle only surviving tuples and block-partition the joining tuples on the fly
• Results: up to 7x speedup over row-wise join followed by block-partitioning
  • Implementation can be further optimized (compression, sorting, etc.)
  • BlockJoin can be used for multi-way joins (where we expect larger speedups)
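The three-step idea can be illustrated with a small single-process sketch. It is not the Spark implementation from the paper: `Row`, `blockJoin` and the use of input positions as tuple IDs are assumptions made for the example, there is no real shuffling or broadcasting, and all left rows are assumed to have the same number of columns.

```scala
object BlockJoinSketch {
  // (join key, attribute values); hypothetical row layout.
  type Row = (String, Vector[Double])

  def blockJoin(left: Vector[Row], right: Vector[Row], b: Int): Map[(Int, Int), Vector[Vector[Double]]] = {
    // Step 1: join on keys only. A tuple ID here is the position in its input.
    val rightIdsByKey: Map[String, Vector[Int]] =
      right.zipWithIndex.groupMap(_._1._1)(_._2)
    val matches: Vector[(Int, Int)] = for {
      ((key, _), l) <- left.zipWithIndex
      r <- rightIdsByKey.getOrElse(key, Vector.empty)
    } yield (l, r)

    // Step 2: every surviving pair gets a sequential row index in the result.
    val indexed = matches.zipWithIndex

    // Step 3: only surviving tuples are touched. Each value is routed straight
    // to its target block; no row-wise join result is built in between.
    val nLeftCols = left.headOption.map(_._2.length).getOrElse(0)
    val cells = indexed.flatMap { case ((l, r), rowIdx) =>
      val fromLeft = left(l)._2.zipWithIndex.map { case (v, j) => (rowIdx, j, v) }
      val fromRight = right(r)._2.zipWithIndex.map { case (v, j) => (rowIdx, nLeftCols + j, v) }
      (fromLeft ++ fromRight).map { case (i, j, v) => ((i / b, j / b), (i, j, v)) }
    }
    cells.groupMap(_._1)(_._2).map { case (blk, cs) =>
      // Cells arrive in row-major order, so columns stay sorted within a row.
      blk -> cs.groupBy(_._1).toVector.sortBy(_._1).map(_._2.map(_._3))
    }
  }
}
```

Because the row index of each result tuple is known after step 2, the block of every value can be computed from (row index, column index) alone, which is what allows joining and block-partitioning to be fused.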

Goal: Join two tables, then block-partition them

    User     Transaction     Join Result (User ⋈ Transaction)
    a 1      a 2 3           a 1 2 3    B[0,0], B[0,1]
    b 1      a 2 3           a 1 2 3    B[0,0], B[0,1]
    d 1      a 2 3           a 1 2 3    B[1,0], B[1,1]
             b 2 3           b 1 2 3    B[1,0], B[1,1]
             c 2 3

Join two tables, then block-partition them

                       Node 1                        Node 2
    Input              U: a 1; b 1                   U: d 1
                       T: a 2 3; a 2 3               T: a 2 3; b 2 3; c 2 3
    Shuffle
    After shuffle      U: a 1                        U: b 1; d 1
                       T: a 2 3; a 2 3; a 2 3        T: b 2 3; c 2 3
    Local join         a 1 2 3; a 1 2 3; a 1 2 3     b 1 2 3
    Shuffle
    Blocks             B[0,0], B[0,1]:               B[1,0], B[1,1]:
                       a 1 2 3; a 1 2 3              a 1 2 3; b 1 2 3

BlockJoin: fused join + block-partitioning operator

Saves heavy repartitioning costs.

Step I: Local Index Join
1. Find which tuples survive the join.
2. Prepare surviving tuple IDs and metadata.
3. Broadcast tuple IDs and metadata to all nodes.

Step II: Block Materialization
Exchange tuple parts in order to materialize blocks on each node participating in the join. Two materialization strategies, late & early.

[Figure: BlockJoin on the Products/Reviews example. Products: (Prod1,‘XPhone’,199,‘Cat_SP’), (Prod2,‘YPhone’,599,‘Cat_UT’), (Prod3,‘ZPhone’,99,‘Cat_Phone_old’), (Prod4,‘ZPhone2’,199,‘Cat_UT’). Reviews: (Prod1,‘...’,4,false), (Prod2,‘...’,5,true), (Prod1,‘...’,5,false), (Prod4,‘...’,2,false). Stage labels: collect <join key, TID>, index-join & apply row-idx, broadcast, fetch kernel, split, range-partition & local sort, create blocks, merge & materialize blocks. Panels: (a) Late Materialization, (b) Early Materialization.]

TID: <Relation, Partition-ID, Index within Partition>

Main results
• 7x speedups compared to separate joining + block-partitioning
• Scales more gracefully

[VLDB ’17] “BlockJoin: Efficient Matrix Partitioning Through Joins”. A. Kunft, A. Katsifodimos, S. Schelter, T. Rabl, V. Markl. PVLDB 2017, to appear in VLDB 2018, Rio.

BlockJoin step 1: Find the surviving tuples with a semi-join/index-join

Driver Node: calculates scaffolds of final results (similar to index-join). The scaffold can be used for block partitioning (next step).

1. Collect join keys at the driver. Node 1 holds U = {a 1, b 1} and T = {a 2 3, a 2 3}; Node 2 holds U = {d 1} and T = {a 2 3, b 2 3, c 2 3}. Only the keys are sent: U = {a, b, d}, T = {a, a, a, b, c}.
2. The driver joins the keys (U ⋈ T) into a scaffold of the result rows: 0 a, 1 a, 2 a, 3 b.
3. Broadcast the scaffold to Node 1 and Node 2. Rows 0 and 1 belong to blocks B[0,0] and B[0,1], rows 2 and 3 to blocks B[1,0] and B[1,1].

BlockJoin step 2: Fill-in the blanks and shuffle the right tuples to their respective blocks

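[Figure: each node fills the scaffold rows (0 a, 1 a, 2 a, 3 b) with the values it holds locally (Node 1: U = {a 1, b 1}, T = {a 2 3, a 2 3}; Node 2: U = {d 1}, T = {a 2 3, b 2 3, c 2 3}). A shuffle then sends the values to the node hosting the target block: Node 1 ends up with B[0,0] and B[0,1] (rows a 1 2 3; a 1 2 3), Node 2 with B[1,0] and B[1,1] (rows a 1 2 3; b 1 2 3).]

The devil is in the detail but details were skipped to simplify presentation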

• The shown example assumed
  • Primary key on User and on Transaction
    • This can be relaxed either 1) similarly to late materialization in columnar databases or 2) with extra counters
  • Ordered relations (the algorithm also works with unordered relations)
• The algorithm performs best when
  • Join selectivity is low (fewer tuples survive, huge savings vs. row-wise partitioning)
  • The number of columns is big
• Up to 7x speedup over row-wise join followed by block-partitioning
  • Implementation can be further optimized (compression, sorting, etc.)
  • BlockJoin can be used for multi-way joins (where we expect larger speedups)

The road ahead

Deep in-database analytics

Linear Algebra + Relational Algebra = Theoretical Foundation

• Goal
  • Intermediate representation/algebra to support equivalences and transformations of mixed linear-relational algebra programs.
• Fundamental questions
  • What is the right intermediate representation?
  • Which is the set of transformations/equivalences for optimization?
  • Do current cost models work?
• Main challenge
  • Mismatch: unordered sets (relations) vs. ordered/indexed matrices

Programming Abstraction

Pipeline shown on the slide: program → Compiler/Parser → Intermediate Representation → Optimizer → Optimized Program → Code Generator → Dataflow Processing Jobs, which run on Apache Hadoop, Apache Flink or Spark, on top of the Hadoop Distributed File System (HDFS) and Cluster Management (Kubernetes, YARN, …).

    //SELECT f1.v, f2.v
    //FROM factory_1 as f1, factory_2 as f2
    //WHERE f1.id == f2.id and f1.v != null and f2.v != null
    val factories = for { //Emma’s for-comprehension for joining
      f1 <- factory_1
      f2 <- factory_2
      if f1.id == f2.id && f1.v != null && f2.v != null
    } yield (f1.v, …, f2.v, …)

    val FacMatrix = factories.toMatrix()
    //Calculate the mean of each of the columns
    val colMeans = FacMatrix.cols( col => col.mean() )
    val NormMat = FacMatrix - colMeans
    //Calculate covariance matrix
    val CovMat = NormMat.t %*% NormMat / (NormMat.numRows - 1)
    // ...

Linear Algebra + Relational Algebra = Systems Research

• Goal
  • Systems natively supporting mixed linear-relational algebra programs.
• Fundamental questions
  • Data partitioning schemes (row- vs. column- vs. block-partitioning)?
  • Relational operators operating on block-partitioned data?
  • Linear algebra operators operating on row/column-partitioned data?
  • Operator Fusion?
• Main challenges
  • Novel, non-trivial operators (e.g., BlockJoin)
  • Reuse of existing operators

Thank you!

[email protected]
http://asterios.katsifodimos.com

The DataBag abstraction

    class DataBag[+A] {
      // Type Conversion
      def this(s: Seq[A])   // Scala Seq -> DataBag
      def fetch()           // DataBag -> Scala Seq
      // Input/Output (static)
      def read[A](url: String, format: ...): DataBag[A]
      def write[A](url: String, format: ...)(in: DataBag[A])
      // Monad Operators (enable comprehension syntax)
      def map[B](f: A => B): DataBag[B]
      def flatMap[B](f: A => DataBag[B]): DataBag[B]
      def withFilter(p: A => Boolean): DataBag[A]
      // Nesting
      def groupBy[K](k: (A) => K): DataBag[Grp[K,DataBag[A]]]
      // Difference, Union, Duplicate Removal
      def minus[B >: A](subtrahend: DataBag[B]): DataBag[B]
      def plus[B >: A](addend: DataBag[B]): DataBag[B]
      def distinct(): DataBag[A]
      // Structural Recursion
      def fold[B](z: B, s: A => B, p: (B, B) => B): B
      // Aggregates (aliases for various folds)
      def minBy, min, sum, product, count, empty, exists, ...
    }
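To make the signatures above concrete, here is a minimal local stand-in. It is a sketch under assumptions, not Emma's DataBag: `MiniBag`, `join` and `mean` are hypothetical names, the bag simply wraps a Scala `Seq`, nothing is parallelized, and `fold` assumes that `z` is a neutral element of `p`.

```scala
object DataBagSketch {
  // Hypothetical local stand-in for the DataBag API; wraps a Seq, not distributed.
  final case class MiniBag[A](elems: Seq[A]) {
    // Monad operators: these three make for-comprehension syntax work.
    def map[B](f: A => B): MiniBag[B] = MiniBag(elems.map(f))
    def flatMap[B](f: A => MiniBag[B]): MiniBag[B] = MiniBag(elems.flatMap(a => f(a).elems))
    def withFilter(p: A => Boolean): MiniBag[A] = MiniBag(elems.filter(p))

    // Structural recursion: z for the empty bag, s for a single element,
    // p to merge two partial results.
    def fold[B](z: B, s: A => B, p: (B, B) => B): B =
      elems.foldLeft(z)((acc, a) => p(acc, s(a)))

    // Aggregates as aliases for folds.
    def count: Long = fold[Long](0L, _ => 1L, _ + _)
    def sum(implicit n: Numeric[A]): A = fold[A](n.zero, identity, n.plus)
  }

  // A join written as a comprehension; the compiler desugars it into
  // xs.flatMap(x => ys.withFilter(y => x == y).map(y => (x, y))).
  def join(xs: MiniBag[Int], ys: MiniBag[Int]): MiniBag[(Int, Int)] =
    for (x <- xs; y <- ys; if x == y) yield (x, y)

  // Sum and count computed together in a single fold, i.e. one pass over the data.
  def mean(xs: MiniBag[Double]): Double = {
    val (s, n) = xs.fold[(Double, Long)](
      (0.0, 0L),
      x => (x, 1L),
      (a, b) => (a._1 + b._1, a._2 + b._2))
    s / n
  }
}
```

The `mean` helper shows by hand the kind of rewrite that fold-group fusion aims for: two aggregates that would each need their own pass are combined into one fold over pairs.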