Optimizing Across Relational and Linear Algebra in Parallel Analytics Pipelines
Asterios Katsifodimos, Assistant Professor @ TU Delft
[email protected] — http://asterios.katsifodimos.com

Outline
• Introduction and motivation: databases vs. parallel dataflow systems
• Declarativity in parallel dataflow systems: the Emma language and optimizer; the Lara language for mixed linear- and relational-algebra programs
• Optimizing across linear & relational algebra: BlockJoin, a join algorithm for mixed relational & linear algebra programs
• The road ahead

A timeline of data analysis tools, from a biased perspective
• 70s — Relational databases appear: Ingres, Oracle, System-R, … (1974: Don Chamberlin publishes SQL.)
• 80s — First parallel shared-nothing architectures: Gamma, TRACE, Teradata, …
• 90s — Open-source projects and mainstream databases: BerkeleyDB, MS Access, PostgreSQL, MySQL, … ("We need easier-to-use tools!" Open source rises.)
• 00s — First columnar-storage databases. Faster: MonetDB, C-Store, Vertica, Aster Data, GreenPlum, … (2003: the Internet explodes; Google publishes MapReduce.)
• 2004–2009 — Scalable, UDF-based data analytics blooms: Hadoop MapReduce, Hive (SQL!), HBase, …
• 2009–2016 — Alternative MapReduce implementations go mainstream: Spark, Flink, Impala, Giraph, Blu, … (Criticism of MapReduce; the NoSQL movement; back to SQL?)
• 2016– — Better abstractions for dataflow systems: Spark Dataframes, SystemML, Emma, MRQL, …

Relational DBs vs. the Modern Big Data Processing Stack
• Relational stack: SQL → SQL compiler & optimizer → relational engine → transaction manager → row/column storage → operating system.
• Big data stack: SQL-like abstractions and iterative M/R-like jobs (Java, Scala, …) → SQL compiler & optimizer → dataflow engines (Apache Flink, Apache Spark, Hadoop) → put/get operations (HBase) → files/bytes (Hadoop Distributed File System, HDFS) → cluster management (Kubernetes, YARN, …).

Declarative Data Processing
• Relational databases: SQL queries over relations, executed by an RDBMS.
• Parallel dataflows: second-order functions (map, reduce, etc.) over distributed collections, executed by parallel dataflow engines.
• The skills gap: people with data analysis skills vs. Big Data systems programming experts — big data analysts need both.

Quiz: guess the algorithm!

Apache Spark:

    ... // initialize
    while (theta) {
      newCntrds = points
        .map(findNearestCntrd)
        .map( (c, p) => (c, (p, 1L)) )
        .reduceByKey( (x, y) => (x._1 + y._1, x._2 + y._2) )
        .map( x => Centroid(x._1, x._2._1 / x._2._2) )
      bcCntrs = sc.broadcast(newCntrds.collect())
    }

Apache Flink:

    ... // initialize
    val cntrds = centroids.iterate(theta) { currCntrds =>
      val newCntrds = points
        .map(findNearestCntrd).withBcSet(currCntrds, "cntrds")
        .map( (c, p) => (c, p, 1L) )
        .groupBy(0).reduce( (x, y) => (x._1, x._2 + y._2, x._3 + y._3) )
        .map( x => Centroid(x._1, x._2 / x._3) )
      newCntrds
    }

Answer: K-Means Clustering
• Input: a dataset (set of points in 2D) — large; K initial centroids — small.
• Step 1: read the K centroids; assign each point to the closest centroid (w.r.t. distance); output <centroid, point> pairs (i.e., clusters).
• Step 2: for each centroid, calculate the average position of its assigned points; output the new centroid coordinates.

Partial Aggregates
Same snippets as above. Efficient aggregation requires a special combination of primitives: in the Spark version, mapping each point to (p, 1L) and summing with reduceByKey computes per-centroid partial sums and counts; in the Flink version, groupBy(0).reduce does the same.
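The partial-aggregation pattern above can be written out in plain Python as a minimal, single-machine sketch. The helper names (nearest, kmeans_step) and the tuple-based point representation are illustrative assumptions, not the slides' API; the point is that each point contributes a (coordinates, 1) pair and the pairs are summed per centroid, mirroring map → reduceByKey / groupBy.reduce.

```python
def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: (p[0] - centroids[i][0]) ** 2
                           + (p[1] - centroids[i][1]) ** 2)

def kmeans_step(points, centroids):
    """One K-means iteration via per-centroid partial aggregates (sum, count)."""
    agg = {}  # centroid index -> ((sum_x, sum_y), count)
    for p in points:
        i = nearest(p, centroids)
        (sx, sy), c = agg.get(i, ((0.0, 0.0), 0))
        agg[i] = ((sx + p[0], sy + p[1]), c + 1)
    # New centroid = partial sum / partial count, as in the final map above.
    new_centroids = list(centroids)  # empty clusters keep their old centroid
    for i, ((sx, sy), c) in agg.items():
        new_centroids[i] = (sx / c, sy / c)
    return new_centroids
```

For example, points [(0,0), (0,2), (10,0), (10,2)] with initial centroids [(1,1), (9,1)] move to (0.0, 1.0) and (10.0, 1.0) in a single step.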
Broadcast Variables
Same snippets as above. Data exchange between the driver and the parallel dataflows is explicit: in Spark, the driver collects the new centroids and re-broadcasts them on every iteration (sc.broadcast(newCntrds.collect())); in Flink, withBcSet(currCntrds, "cntrds") ships the broadcast set into the map UDF.

Native Iterations
Same snippets as above. Feedback edges in the dataflow graph have to be specified explicitly: Flink's iterate construct embeds the loop as a native part of the dataflow, whereas Spark re-submits the loop body from the driver's while loop.

Beyond Datasets: End-to-End ML Pipelines
A typical data science workflow spans data management (steps 1–4) and machine learning (steps 5–8):
1. Question formulation
2. Data source selection
3. Data exploration
4. Data cleaning & preparation
5. Feature selection
6. ML algorithm selection
7. Model training & hyper-parameter search
8. Model evaluation
The data-management steps take 80% of data scientists' time.

A very simple ML pipeline example
Factory 1, Factory 2, Factory 3 → ⋈ gather and clean sensor data → Principal Component Analysis → Train ML model

Preprocessing
Naturally expressed in SQL or in dataflow APIs:

    SELECT f1.v, f2.v, f3.v
    FROM   factory_1 AS f1,
           factory_2 AS f2,
           factory_3 AS f3
    WHERE  f1.sid = f2.sid AND f2.sid = f3.sid
      AND  f1.v IS NOT NULL AND f2.v IS NOT NULL AND f3.v IS NOT NULL

Matrix Operations / ML in SQL
Calculating a covariance matrix in SQL is possible*, but it is an impedance mismatch: relational syntax is a poor fit for the linear algebra behind PCA and model training.
* From: http://stackoverflow.com/questions/18019469/finding-covariance-using-sql

Machine Learning in "Native" Syntax
Naturally expressed in Matlab, R, …:

    // read factories table as matrix FacMatrix
    [rows, cols] = size(FacMatrix)
    colMeans = mean(FacMatrix)
    NormMat = zeros(rows, cols)
    for i = 1:rows
        NormMat(i, :) = FacMatrix(i, :) - colMeans;
    end
    CovMat = NormMat' * NormMat / (rows - 1)
    // ...

SQL + ML Native Syntax = Optimization Barrier
When preprocessing lives in SQL and model training lives in native matrix syntax, the hand-over between the two is an optimization barrier: no single optimizer sees the whole pipeline from the factory joins through PCA to model training.

Big Data Language Desiderata
• Declarativity (what vs. how): enables optimizability, eases programming.
• User-defined functions as first-class citizens: an important reason why MapReduce took off.
• Control flow for machine learning and graph workloads: iterations, if-then-else branches; a Turing-complete "driver" with parallelizable constructs.
• Portability: should support multiple backend engines (Flink, Spark, MapReduce, etc.).
• Deep embedding: e.g., à la LINQ (Language Integrated Query) in C#, Ferry in Haskell, etc.

Emma/Lara: Databags & Matrices — Deeply Embedded in Scala
Gathering and cleaning the data, declaratively:

    // SELECT f1.v, f2.v
    // FROM factory_1 AS f1, factory_2 AS f2
    // WHERE f1.sid == f2.sid AND f1.v != null AND f2.v != null
    val factories = for {
      f1 <- factory_1
      f2 <- factory_2
      if f1.sid == f2.sid &&
         f1.v != null &&
         f2.v != null
    } yield (f1.v, …, f2.v, …)

• Databag type: like RDDs on Spark, Datasets on Flink, etc.; operations on Databags are parallelized.
• Declarative for-comprehensions: like Select-From-Where in SQL; can express join, cross, filter, etc.
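The same join-and-filter reads almost identically as a comprehension in any language that supports them. A minimal Python sketch follows; the sid/v fields match the slides' schema, but the sample records are made up for illustration:

```python
# The Emma-style for-comprehension above, as a Python list comprehension:
# a join on sid plus null filters, with no explicit join operator.
factory_1 = [{"sid": 1, "v": 0.5}, {"sid": 2, "v": None}, {"sid": 3, "v": 1.5}]
factory_2 = [{"sid": 1, "v": 2.0}, {"sid": 3, "v": None}]

factories = [
    (f1["v"], f2["v"])
    for f1 in factory_1
    for f2 in factory_2
    if f1["sid"] == f2["sid"]    # join predicate
    and f1["v"] is not None      # null filters
    and f2["v"] is not None
]
# factories == [(0.5, 2.0)]
```

An optimizer that understands comprehensions can recognize the equality predicate and turn the nested traversal into a hash or sort-merge join, exactly as a SQL optimizer would.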
Training the model, in "native" matrix syntax:

    while (…) {
      val FacMatrix = factories.toMatrix()
      // Calculate the mean of each of the columns
      val colMeans = FacMatrix.cols( col => col.mean() )
      val NormMat  = FacMatrix - colMeans
      // Calculate the covariance matrix
      val CovMat = NormMat.t %*% NormMat / (NormMat.numRows - 1)
      // ...
    }

• Loops: first-class citizens.
• Databag => Matrix conversion (toMatrix()) enables optimization across linear and relational algebra — an intermediate representation to reason across the two algebras (work in progress).
• UDFs: first-class citizens.
• "Native" matrix syntax: against the impedance mismatch.
[SIGMOD 15/16, SIGMOD Record 16, BeyondMR 2016, PVLDB 2017]

FP Goodness: Comprehension Syntax
Comprehensions generalize SQL and are available as first-class syntax in modern general-purpose programming languages:

    Math:   ⦃ (x, y) | x ∈ xs, y ∈ ys, x = y ⦄
    SQL:    SELECT x, y FROM xs AS x, ys AS y WHERE x = y
    Python: [(x, y) for x in xs for y in ys if x == y]
    Scala:  for (x <- xs; y <- ys; if x == y) yield (x, y)

Automatic Optimization of Programs
• Caching of intermediate results: reused intermediate results (e.g., within a loop) are cached automatically.
• Choice of partitioning: pre-partition datasets that are used in multiple joins.
• Unnesting: e.g., converting exists() clauses into joins.
• Join order / algorithm selection: classic database optimizations.
• Fold-group fusion: calculate multiple aggregates in one pass.
• …

A Concrete Optimization
BlockJoin: partitioning data for linear algebra operations through joins.
"BlockJoin: Efficient Matrix Partitioning Through Joins." A. Kunft, A. Katsifodimos, S. Schelter, T. Rabl, V. Markl. Proceedings of the VLDB Endowment, Vol. 10, No. 13; presented at VLDB 2018.

Row- vs. Block-Partitioned Operations
Certain relational operations, such as an aggregate (avg) over a row, are better executed over row-partitioned tables, whereas scalable linear algebra operations such as matrix multiplication are better executed over block-partitioned matrices. Choosing the right partitioning yields orders-of-magnitude speedups.
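To see why block partitioning suits matrix multiplication, consider a minimal pure-Python sketch (the square block size and divisibility assumption are simplifications for illustration, not BlockJoin's actual layout): each output block depends only on one block-row of the left operand and one block-column of the right one, so a distributed worker needs to hold only the blocks it touches.

```python
def matmul(A, B):
    """Naive dense product of row-major nested lists, for reference."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(m)]
            for i in range(n)]

def blocked_matmul(A, B, bs):
    """Same product, computed block-by-block with square blocks of size bs.
    Assumes all dimensions are divisible by bs (illustration only)."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for j0 in range(0, m, bs):
            for l0 in range(0, k, bs):
                # Output block (i0, j0) consumes only A's block-row i0 and
                # B's block-column j0 -- the unit of data exchange in a
                # block-partitioned engine.
                for i in range(i0, i0 + bs):
                    for j in range(j0, j0 + bs):
                        C[i][j] += sum(A[i][l] * B[l][j]
                                       for l in range(l0, l0 + bs))
    return C

A  = [[1, 2], [3, 4], [5, 6], [7, 8]]
At = [list(row) for row in zip(*A)]   # transpose
gram = blocked_matmul(At, A, 2)       # A' * A, as in the covariance step above
```

This is the access pattern behind the CovMat = NormMat.t %*% NormMat step: computing it at scale wants the matrix block-partitioned, while the join that produced it wants row partitioning — the tension BlockJoin resolves.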