Report from the Fourth Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR’17)

Foto N. Afrati, National Technical University of Athens, Greece, [email protected]
Jan Hidders, Vrije Universiteit Brussel, Belgium, [email protected]
Paraschos Koutris, University of Wisconsin-Madison, USA, [email protected]
Jacek Sroka, University of Warsaw, Poland, [email protected]
Jeffrey Ullman, Stanford University, USA, [email protected]

ABSTRACT

This report summarizes the presentations and discussions of the fourth workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR'17). The BeyondMR workshop was held in conjunction with the 2017 SIGMOD/PODS conference in Chicago, Illinois, USA on Friday May 19, 2017. The goal of the workshop was to bring together researchers and practitioners to explore algorithms, computational models, languages and interfaces for systems that provide large-scale parallelization and fault tolerance. These include specialized programming and data-management systems based on MapReduce and extensions thereof, graph processing systems and data-intensive workflow systems. The program featured two well-attended invited talks, the first on current and future developments in big-data processing by Matei Zaharia from Databricks and Stanford University, and the second on computational models for the analysis and development of big-data processing algorithms by Ke Yi from the Hong Kong University of Science and Technology.

1. INTRODUCTION

In this edition of BeyondMR four main themes emerged: (i) languages and interfaces, which investigates new interfaces and languages for specifying data-processing workflows to increase usability, including both high-level and declarative languages as well as more low-level workflow evaluation plans, (ii) algorithms, which involves the design, evaluation and analysis of algorithms for large-scale data processing, (iii) integration, which concerns the integration of different types of analytics frameworks, e.g., for relational analytics and for linear algebra, and (iv) hardware, which concerns leveraging new hardware features for large-scale data processing.

Both keynotes covered the aforementioned themes. The keynote by Matei Zaharia addressed usability, integration and hardware by describing recent and future developments of the Spark framework, and the keynote by Ke Yi gave an overview of computational models for large-scale parallel algorithms.

The presented papers addressed the themes as follows. Paper [13] covered the themes languages and interfaces and algorithms by presenting an approach where spreadsheets are used as the programming interface for a large-scale data-processing framework, with special algorithms for executing typical spreadsheet data-manipulation operations. Paper [7] tackles the themes languages and interfaces and integration by presenting an integrated algebra for implementing and optimizing data-processing workflows with both relational algebra and linear algebra operators. In paper [10] the themes languages and interfaces and algorithms are both addressed by presenting distributed algorithms and implementations for computing the closure of certain OWL ontologies. The same themes also return in paper [12], which discusses distributed algorithms and implementations for computing navigational queries over RDF graphs formulated in a high-level query language. The theme algorithms was again covered in paper [2], which discussed benchmarking data-flow systems for the purpose of machine learning, emphasizing scalability for high-dimensional but sparse data. In paper [8] the themes languages and interfaces, algorithms and integration were addressed by presenting a framework translating high-level specifications of data-processing pipelines to different evaluation plans, making different choices concerning parallelization frameworks, acceleration

44 SIGMOD Record, December 2017 (Vol. 46, No. 4)

frameworks and data-processing platforms. Furthermore, the theme algorithms was covered by paper [4], which discusses algorithms for matrix multiplication on MapReduce-like data-processing platforms. The theme was also addressed by paper [6], presenting streaming algorithms for multi-way theta-joins, and by paper [3], presenting distributed algorithms for binning in big genomic datasets.

We will give a summary of the two keynotes in Section 2, followed by a description of the contributions of the presented papers in Section 3.

2. SUMMARY OF KEYNOTES

We give a summary of the two keynotes, which contributed to the visibility of the workshop.

What's Changing in Big Data?

The first keynote was presented by Matei Zaharia, Stanford University, USA, and Databricks. His presentation discussed the main changes to big-data processing in the last 10 years, which he has experienced as codeveloper of Spark and chief technologist of Databricks. It focused on three developments, namely (1) the user base moving from software engineers to data analysts, (2) the changing hardware and the computational bottleneck moving from I/O to computation, and (3) the delivery of big-data processing services through cloud computing.

The discussion of the changing user base started with the observation that usability had been, and remains, a crucial factor in the success of systems such as Spark. The changing user base shows in the selection of programming languages for Spark, where a shift is seen from Scala and Java to Python and R. Three areas where there are currently usability problems are (a) the understandability of the API, (b) the complexity of performance predictability, and (c) the difficulty of accessing the system from smaller front-end tools such as Tableau and Excel. These problems are addressed by Spark projects such as DataFrames and Spark SQL. The first explicitly deals with structured data as it appears in Python and R, and the second introduces a high-level API for SQL-like processing with database-like query optimization. Next to that, there are also projects such as ML Pipelines for machine-learning pipelines, GraphFrames for graph processing, and the introduction of streaming over DataFrames.

The changes in hardware in the last 10 years were summarized by the observation that although the size of storage and the speed of networks have increased by an order of magnitude, the speed of CPUs has not. Moreover, GPUs and FPGAs have become more ubiquitous and powerful, and so need to be better utilized. These concerns are addressed in the projects Tungsten [16] and Weld [11]. The first tries to make better use of the hardware by run-time code generation that exploits cache locality and uses off-heap memory management. The second allows the easy integration of different analytics libraries by offering a common algebra for relational algebra operations, linear algebra operations, and graph operations.

The final change discussed was that big-data processing services are increasingly provided through cloud computing. This provides new challenges as well as opportunities. New challenges are scaling up state management and stress testing, but opportunities are created by fast roll-out of new functionality and consequently rapid, comprehensive feedback.

At the end of the presentation research challenges were discussed. These included new types of interfaces for new types of users, new optimization techniques to deal with new types and configurations of hardware, the integration of different processing platforms and the support of interaction between users of such platforms.

Sequential vs Parallel, Fine-grained vs Coarse-grained, Models and Algorithms

Ke Yi, Hong Kong University of Science and Technology, China, was the second keynote speaker. His talk focused on computational models for approximating the running time of distributed algorithms. He warned at the start by quoting George Box: "All models are wrong, but some are useful." After this, the talk started with a quick introduction to the RAM and PRAM models, emphasizing the Concurrent-Read Exclusive-Write (CREW) and Exclusive-Read Exclusive-Write (EREW) variants. He also presented the External Memory (EM) model, where internal memory is limited to size M and the data is transferred between internal and external memory in blocks of size B. This part of the presentation was concluded by discussing the known relationships between those classes and their classification according to two independent criteria, namely whether the model is fine-grained or coarse-grained and whether it is parallel or sequential.

Next, the Parallel External-Memory (PEM) model for private-cache chip multiprocessors [1] was presented, followed by the Bulk-Synchronous Parallel (BSP) model. The latter is a shared-nothing, coarse-grained parallel model [15] that is well suited to systems used in practice such as MapReduce and Pregel. Simplifications and variants were presented and the models were compared. Also, methods to simulate one model by another were discussed, as well as the relations between BSP, PRAM and EM. It was concluded that it is unlikely that optimal BSP algorithms are produced by simulating the fine-grained PRAM or the sequential EM model.

The presentation concluded with a discussion of how to design algorithms for those models. Work-depth models and Brent's theorem were recalled. Finally, the multithreaded and external-memory versions of work-depth models were presented, and several problems were discussed for join algorithms, such as the nested-loop and hypercube algorithms.

3. REVIEW OF PRESENTED WORK

(short paper) Towards minimal algorithms for big data analytics with spreadsheets

The paper [13] presents research done by Jacek Sroka, Artur Leśniewski, Miroslaw Kowaluk, Krzysztof Stencel and Jerzy Tyszkiewicz, from the Institute of Informatics at the University of Warsaw. The contribution of the paper is an algorithm to answer bulk range queries. The idea is based on a range tree, but adapted to work on datasets that are too big for processing on a single machine. The algorithm is organized so that the computation is perfectly balanced between the nodes in the cluster. It also allows answering queries in bulk, i.e., a large number of queries at the same time.

The motivation for the algorithm is to significantly lower the technological barrier which currently prevents average users from analysing big datasets available as CSV files. The authors propose the creation of a tool which translates spreadsheets of regular structure into semantically equivalent MapReduce or Spark computations. Such a tool would enable users to prepare spreadsheets on a small sample of the data and then automatically translate them into MapReduce or Spark to be executed on the full dataset. As it is estimated that spreadsheet programmers outnumber the programmers of all other languages together, achieving that goal would allow a huge number of new users to benefit from big-data processing.

(full paper) LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation

The paper [7] was produced by Dylan Hutchison, Bill Howe and Dan Suciu, from the Department of Computer Science and Engineering at the University of Washington. It presents Lara, an algebra consisting of three operators that expresses Relational Algebra (RA) and Linear Algebra (LA), as well as optimization rules for a certain programming model. A proof is provided that the algebra is more explicit than MapReduce, but also more general than RA and LA. The practicality and efficiency of the algebra is demonstrated by implementing its operators using range scans over partitioned, sorted maps, a primitive that is available in a wide range of back-end engines. Concretely, the operators were implemented as range iterators in Apache Accumulo, an implementation of Google's BigTable. This implementation is shown to outperform Accumulo's native MapReduce integration for mixed-abstraction workflows involving joins and aggregations in the form of matrix multiplication.

(full paper) SPOWL: Spark-based OWL 2 Reasoning Materialisation

The paper [10] was the result of work by Yu Liu and Peter McBrien, from the Department of Computing at Imperial College London. It presents SPOWL, which uses Spark to implement OWL inferencing over large ontologies. The system compiles T-Box axioms to Spark programs, which are iteratively executed to compute and materialize the closure of the ontology. The result can, for example, be used to compute SPARQL queries. The system leverages the advantages of Spark by caching results in distributed memory and parallelizing jobs in a more flexible manner. It also optimises the number of necessary iterations by analysing the dependencies in the T-Box hierarchy and compiling the axioms correspondingly. The performance of SPOWL is evaluated against WebPIE on LUBM datasets, and is shown to be generally comparable or better.

(full paper) Querying Semantic Knowledge Bases with SQL-on-Hadoop

The paper [12] describes research done by Martin Przyjaciel-Zablocki, Alexander Schätzle and Georg Lausen, in the Databases and Information Systems group at the Institut für Informatik of the Albert-Ludwigs-Universität Freiburg. The paper presents a new processor for TriAL-QL, an SQL-like query language for RDF that is based on the Triple Algebra with Recursion (TriAL*) [9] and is therefore especially suited to express navigational queries. The presented TriAL-QL processor is built on top of Impala (a massively parallel SQL engine on Hadoop) and Spark, using one unified data storage system based on Parquet and HDFS. The paper studies and compares different evaluation algorithms and storage strategies, and compares the performance to other RDF management systems such as Sempala, Neo4j, PigSPARQL, H2RDF+ and SHARD. Interactive and competitive response times are shown on datasets of more than a billion triples, and even

results with more than 25 billion triples could be produced in minutes.

(full paper) Benchmarking Data Flow Systems for Scalable Machine Learning

The paper [2] presents the contribution by Christoph Boden, Andrea Spina, Tilmann Rabl and Volker Markl, in the Database Systems and Information Management Group at the TU Berlin. The paper investigates the scalability of data-flow systems such as Apache Flink and Apache Spark for typical machine-learning workflows. The workloads are assumed to differ from the conventional workloads found in existing benchmarks, which are often based on tasks such as WordCount, Grep or Sort. The main difference is that the data tends to be more sparse but at the same time high-dimensional, so scalability in the number of dimensions becomes more important. The paper therefore presents a representative set of distributed machine-learning algorithms suitable for large-scale distributed settings that have close resemblance to industry-relevant applications and provide generalizable insights into system performance. These algorithms were implemented in Apache Spark and Apache Flink, and after tuning, the two systems were compared on scalability in both the size of the data and the dimensionality of the data. Measurements were done with datasets containing up to 4 billion data points and 100 million dimensions. The main conclusion is that current systems are surprisingly inefficient in dealing with such high-dimensional data.

(full paper) A containerized analytics framework for data- and compute-intensive pipeline applications

The paper [8] describes the result obtained by Yuriy Kaniovskyi, Martin Köhler and Siegfried Benkner, in the Research Group Scientific Computing at the University of Vienna and at the School of Computer Science at the University of Manchester. It presents a framework for executing high-level specifications of data-processing pipelines on heterogeneous architectures using a variety of programming paradigms. The presented framework allows several ways of executing each step in the pipeline, using different parallelization libraries (such as OpenMP, TBB, MPI and Cilk), different acceleration frameworks (such as OpenACC, CUDA and OpenCL) or different data-processing platforms (e.g., MapReduce, Spark, Storm and Flink). Moreover, the intermediate results might be passed in different ways, e.g., by storing them in HDFS, in a local file system or in memory, or by streaming them.

The approach starts with a high-level language for describing the global data-processing pipeline. It also allows the separate specification of the different implementation variants for each step in the pipeline, together with their computational requirements. These are self-contained execution units that can encapsulate legacy, native or accelerator-based code. This enables the framework to fuse data-intensive programming paradigms (via native YARN applications) with computation-intensive programming paradigms (via containerization). The framework then optimizes the execution plan by taking into account the specified performance models and the available resources as specified in a specially developed generic resource-description framework.

(full paper) MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication

The paper [4] presents the work by Minhao Deng and Prakash Ramanan, in the Electrical Engineering and Computer Science department at Wichita State University. It studies MapReduce implementations of Strassen's algorithm for multiplying two n × n matrices, introduced by Strassen in [14]. The Strassen algorithm has time complexity O(n^log2(7)), which is an improvement over the standard algorithm, since the latter has time complexity O(n^3) and log2(7) ≈ 2.81.

When considering MapReduce implementations of the straightforward algorithm, it has been shown that this can be done in typically one or two passes. The paper investigates in how many passes the Strassen algorithm can be implemented. This algorithm consists of three parts: Part I, where sums and differences of quadrants of the input matrices are computed; Part II, where several products of the matrices from Part I are computed; and Part III, where each quadrant of the final matrix is computed by taking sums and differences of the matrices from Part II.

Direct implementation of these parts in MapReduce is shown to be possible in log2(n), 1 and log2(n) passes, respectively. Moreover, it is shown that Part I can be implemented in 2 passes, with total work O(n^log2(7)), and reducer size and reducer workload o(n). It is conjectured that Part III can also be implemented in a constant number of passes with small reducer size and workload.

(short paper) Scaling Out Continuous Multi-Way Theta Joins

The paper [6] presents work by Manuel Hoffmann and Sebastian Michel, performed at the Databases and Information Systems Group at the department

of computer science at the University of Kaiserslautern. The paper presents generic tuple-routing schemes for computing distributed multi-way theta-joins over streaming data. These schemes are implemented in an architecture where query plans consisting of logical operators are mapped to Apache Storm topologies. The paper reports first results for the TPC-H benchmark, where the topologies are executed on Amazon EC2 instances.

(short paper) Bi-Dimensional Binning for Big Genomic Datasets

The paper [3] describes results obtained by Pietro Pinoli, Simone Cattani, Stefano Ceri and Abdulrahman Kaitoua, in the Department of Electronics, Information, and Bioengineering at the Politecnico di Milano. The main result is a bi-dimensional binning strategy for the SciDB array database, motivated by the implementation of the GenoMetric Query Language (GMQL) — a high-level query language for genomics — that the authors previously developed. Due to the particularities of the array database, it is not possible to dynamically split a region and distribute its replicas to an arbitrary number of adjacent cells. Yet in order to apply a binning strategy, all the regions need to be replicated an identical number of times. The authors propose to organize the bins in a two-dimensional grid. In such a grid the values are located above the diagonal, and it is possible to define spaces that contain the values for one bin without replicating the values. Such bins (spaces) can share values because they share some grid cells.

4. CONCLUSION

The presentations and keynotes at BeyondMR'17 provided an overview of current developments and emerging issues in the area of algorithms, computational models, architectures and interfaces for systems that provide large-scale parallelization. The workshop attracted 17 submissions, from which the program committee, led by Paris Koutris from the University of Wisconsin-Madison, accepted 6 full papers and 3 short papers.

Keynotes and papers covered topics such as user interfaces, integration of programming paradigms, algorithmics and leveraging new hardware features. The contributions suggest that while MapReduce has been extended and replaced by newer models, there is an active area of research centred around data-management systems based on MapReduce and extensions, providing ever more insight for developing more effective graph processing systems and data-intensive workflow systems.

The sessions were well attended, with an average of 35 participants, and the proceedings are published in the ACM Digital Library [5].

Acknowledgements: We would like to thank the PC members, keynote speakers, authors, local workshop organizers and attendees for making BeyondMR 2017 a success. We also express our appreciation for the support from Google Inc. Finally we would like to thank the anonymous reviewers for their constructive remarks on how to improve this report.

5. REFERENCES

[1] Lars Arge, Michael T. Goodrich, Michael Nelson, and Nodari Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In Proc. of SPAA '08, pages 197–206, New York, NY, USA, 2008. ACM.
[2] Christoph Boden, Andrea Spina, Tilmann Rabl, and Volker Markl. Benchmarking data flow systems for scalable machine learning. In Hidders [5], pages 5:1–5:10.
[3] Simone Cattani, Stefano Ceri, Abdulrahman Kaitoua, and Pietro Pinoli. Bi-dimensional binning for big genomic datasets. In Hidders [5], pages 9:1–9:4.
[4] Minhao Deng and Prakash Ramanan. MapReduce implementation of Strassen's algorithm for matrix multiplication. In Hidders [5], pages 7:1–7:10.
[5] Jan Hidders, editor. Proc. of BeyondMR'17, New York, NY, USA, 2017. ACM.
[6] Manuel Hoffmann and Sebastian Michel. Scaling out continuous multi-way theta-joins. In Hidders [5], pages 8:1–8:4.
[7] Dylan Hutchison, Bill Howe, and Dan Suciu. LaraDB: A minimalist kernel for linear and relational algebra computation. In Hidders [5], pages 2:1–2:10.
[8] Yuriy Kaniovskyi, Martin Koehler, and Siegfried Benkner. A containerized analytics framework for data and compute-intensive pipeline applications. In Hidders [5], pages 6:1–6:10.
[9] Leonid Libkin, Juan Reutter, and Domagoj Vrgoč. TriAL for RDF: Adapting graph query languages for RDF data. In Proc. of PODS '13, pages 201–212, New York, NY, USA, 2013. ACM.
[10] Yu Liu and Peter McBrien. SPOWL: Spark-based OWL 2 reasoning materialisation. In Hidders [5], pages 3:1–3:10.
[11] Shoumik Palkar, James J. Thomas, Anil Shanbhag, Malte Schwarzkopf, Saman P. Amarasinghe, and Matei Zaharia. Weld: A common runtime for high performance data analytics. In Proc. of CIDR 2017. www.cidrdb.org, January 2017.
[12] Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen. Querying semantic knowledge bases with SQL-on-Hadoop. In Hidders [5], pages 4:1–4:10.
[13] Jacek Sroka, Artur Leśniewski, Miroslaw Kowaluk, Krzysztof Stencel, and Jerzy Tyszkiewicz. Towards minimal algorithms for big data analytics with spreadsheets. In Hidders [5], pages 1:1–1:4.
[14] Volker Strassen. Gaussian elimination is not optimal. Numer. Math., 13(4):354–356, August 1969.
[15] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.
[16] Reynold Xin and Josh Rosen. Project Tungsten: Bringing Apache Spark closer to bare metal. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html, April 2015.
