Report from the Fourth Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR’17)

Foto N. Afrati, National Technical University of Athens, Greece, [email protected]
Jan Hidders, Vrije Universiteit Brussel, Belgium, [email protected]
Paraschos Koutris, University of Wisconsin-Madison, USA, [email protected]
Jacek Sroka, University of Warsaw, Poland, [email protected]
Jeffrey Ullman, Stanford University, USA, [email protected]

ABSTRACT

This report summarizes the presentations and discussions of the fourth workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR'17). The BeyondMR workshop was held in conjunction with the 2017 SIGMOD/PODS conference in Chicago, Illinois, USA on Friday May 19, 2017. The goal of the workshop was to bring together researchers and practitioners to explore algorithms, computational models, languages and interfaces for systems that provide large-scale parallelization and fault tolerance. These include specialized programming and data-management systems based on MapReduce and extensions thereof, graph processing systems and data-intensive workflow systems. The program featured two well-attended invited talks, the first on current and future developments in big-data processing by Matei Zaharia from Databricks and Stanford University, and the second on computational models for the analysis and development of big-data processing algorithms by Ke Yi from the Hong Kong University of Science and Technology.

1. INTRODUCTION

In this edition of BeyondMR four main themes emerged: (i) languages and interfaces, which investigates new interfaces and languages for specifying data-processing workflows to increase usability, including both high-level and declarative languages as well as more low-level workflow evaluation plans, (ii) algorithms, which involves the design, evaluation and analysis of algorithms for large-scale data processing, (iii) integration, which concerns the integration of different types of analytics frameworks, e.g., for relational analytics and for linear algebra, and (iv) hardware, which concerns leveraging new hardware features for large-scale data processing.

Both keynotes covered the aforementioned themes. The keynote by Matei Zaharia addressed usability, integration and hardware by describing recent and future developments of the Spark framework, and the keynote by Ke Yi gave an overview of computational models for large-scale parallel algorithms.

The presented papers addressed the themes as follows. Paper [13] covered the themes languages and interfaces and algorithms by presenting an approach where spreadsheets are used as the programming interface for a large-scale data-processing framework, with special algorithms for executing typical spreadsheet data-manipulation operations. Paper [7] tackles the themes languages and interfaces and integration by presenting an integrated algebra for implementing and optimizing data-processing workflows with both relational algebra and linear algebra operators. In paper [10] the themes languages and interfaces and algorithms are both addressed by presenting distributed algorithms and implementations for computing the closure of certain OWL ontologies. The same themes also return in paper [12], which discusses distributed algorithms and implementations for computing navigational queries over RDF graphs formulated in a high-level query language. The theme algorithms was again covered in paper [2], which discussed benchmarking data-flow systems for the purpose of machine learning, emphasizing scalability for high-dimensional but sparse data. In paper [8] the themes languages and interfaces, algorithms and integration were addressed by presenting a framework translating high-level specifications of data-processing pipelines to different evaluation plans, making different choices concerning parallelization frameworks, acceleration

44 SIGMOD Record, December 2017 (Vol. 46, No. 4)

frameworks and data-processing platforms. Furthermore, the theme algorithms was covered by paper [4], which discusses algorithms for matrix multiplication on MapReduce-like data-processing platforms. The theme was also addressed by paper [6], presenting streaming algorithms for multi-way theta-joins, and by paper [3], presenting distributed algorithms for binning in big genomic datasets.

We will give a summary of the two keynotes in Section 2, followed by a description of the contributions of the presented papers in Section 3.

2. SUMMARY OF KEYNOTES

We give a summary of the two keynotes, which contributed to the visibility of the workshop.

What's Changing in Big Data?

The first keynote was presented by Matei Zaharia, Stanford University, USA, and Databricks. His presentation discussed the main changes to big-data processing in the last 10 years, which he has experienced as codeveloper of Spark and chief technologist of Databricks. It focused on three developments, namely (1) the user base moving from software engineers to data analysts, (2) the changing hardware and the computational bottleneck moving from I/O to computation, and (3) the delivery of big-data processing services through cloud computing.

The discussion of the changing user base started with the observation that usability had been, and remains, a crucial factor in the success of systems such as Spark. The changing user base shows in the selection of programming languages for Spark, where a shift is seen from Scala and Java to Python and R. Three areas where there are currently usability problems are (a) the understandability of the API, (b) the complexity of performance predictability, and (c) the difficulty of accessing the system from smaller front-end tools such as Tableau and Excel. These problems are addressed by Spark projects such as DataFrames and Spark SQL. The first explicitly deals with structured data as it appears in Python and R, and the second introduces a high-level API for SQL-like processing with database-like query optimization. Next to that, there are also projects such as ML Pipelines for machine-learning pipelines, GraphFrames for graph processing, and the introduction of streaming over DataFrames.

The changes in hardware in the last 10 years were summarized by the observation that although the size of storage and the speed of networks have increased by an order of magnitude, the speed of CPUs has not. Moreover, GPUs and FPGAs have become more ubiquitous and powerful, and so need to be better utilized. These concerns are addressed in the projects Tungsten [16] and Weld [11]. The first tries to make better use of the hardware by run-time code generation that exploits cache locality and uses off-heap memory management. The second allows the easy integration of different analytics libraries by offering a common algebra for relational algebra operations, linear algebra operations, and graph operations.

The final change discussed was that big-data processing services are increasingly provided through cloud computing. This provides new challenges as well as opportunities. New challenges are scaling up state management and stress testing, but opportunities are created by fast roll-out of new functionality and consequently rapid, comprehensive feedback.

At the end of the presentation research challenges were discussed. These included new types of interfaces for new types of users, new optimization techniques to deal with new types and configurations of hardware, the integration of different processing platforms and the support of interaction between users of such platforms.

Sequential vs Parallel, Fine-grained vs Coarse-grained, Models and Algorithms

Ke Yi, Hong Kong University of Science and Technology, China, was the second keynote speaker. His talk focused on computational models for approximating the running time of distributed algorithms. He warned at the start by quoting George Box: "All models are wrong, but some are useful." After this, the talk started with a quick introduction to the RAM and PRAM models, emphasizing the Concurrent-Read Exclusive-Write (CREW) and Exclusive-Read Exclusive-Write (EREW) variants. He also presented the External Memory (EM) model, where internal memory is limited to size M and the data is transferred between internal and external memory in blocks of size B. This part of the presentation was concluded by discussing the known relationships between those classes and their classification according to two independent criteria, namely whether the model is fine-grained or coarse-grained and whether it is parallel or sequential.

Next, the Parallel External-Memory (PEM) model for private-cache chip multiprocessors [1] was presented, followed by the Bulk-Synchronous Parallel (BSP) model. The latter is a shared-nothing, coarse-grained parallel model [15] that is well suited to systems used in practice such as MapReduce and Pregel. Simplifications and variants were presented and the models were compared. Also, methods to simulate one model by another were discussed, as well as the relations between BSP, PRAM and EM. It was concluded that it is unlikely that optimal BSP algorithms are produced by simulating the fine-grained PRAM or the sequential EM model.

The presentation concluded with a discussion of how to design algorithms for those models. Work-depth models and Brent's theorem were recalled. Finally, the multithreaded and external-memory versions of work-depth models were presented, and several problems were discussed for join algorithms, such as the nested-loop and hypercube algorithms.

3. REVIEW OF PRESENTED WORK

(short paper) Towards minimal algorithms for big data analytics with spreadsheets

The paper [13] presents research done by Jacek Sroka, Artur Leśniewski, Miroslaw Kowaluk, Krzysztof Stencel and Jerzy Tyszkiewicz, from the Institute of Informatics at the University of Warsaw. The contribution of the paper is an algorithm to answer bulk range queries. The idea is based on a range tree, but adapted to work on datasets that are too big for processing on a single machine. The algorithm is organized so that the computation is perfectly balanced between the nodes in the cluster. It also allows answering queries in bulk, i.e., a large number of queries at the same time.

The motivation for the algorithm is to significantly lower the technological barrier which currently prevents average users from analysing big datasets available as CSV files. The authors propose the creation of a tool which translates spreadsheets of regular structure into semantically equivalent MapReduce or Spark computations. Such a tool would enable users to prepare spreadsheets on a small sample of the data and then automatically translate them into MapReduce or Spark to be executed on the full dataset. As it is estimated that spreadsheet programmers outnumber the programmers of all other languages together, achieving that goal would allow a huge number of new users to benefit from big-data processing.

(full paper) LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation

The paper [7] was produced by Dylan Hutchison, Bill Howe and Dan Suciu, from the Department of Computer Science and Engineering at the University of Washington. It presents Lara, an algebra consisting of three operators that expresses Relational Algebra (RA) and Linear Algebra (LA), as well as optimization rules for a certain programming model. A proof is provided that the algebra is more explicit than MapReduce, but also more general than RA and LA. The practicality and efficiency of the algebra is demonstrated by implementing its operators using range scans over partitioned, sorted maps, a primitive that is available in a wide range of back-end engines. Concretely, the operators were implemented as range iterators in Apache Accumulo, an implementation of Google's BigTable. This implementation is shown to outperform Accumulo's native MapReduce integration for mixed-abstraction workflows involving joins and aggregations in the form of matrix multiplication.

(full paper) SPOWL: Spark-based OWL 2 Reasoning Materialisation

The paper [10] was the result of work by Yu Liu and Peter McBrien, from the Department of Computing at Imperial College London. It presents SPOWL, which uses Spark to implement OWL inferencing over large ontologies. The system compiles T-Box axioms to Spark programs, which are iteratively executed to compute and materialize the closure of the ontology. The result can, for example, be used to compute SPARQL queries. The system leverages the advantages of Spark by caching results in distributed memory and parallelizing jobs in a more flexible manner. It also optimises the number of necessary iterations by analysing the dependencies in the T-Box hierarchy and compiling the axioms correspondingly. The performance of SPOWL is evaluated against WebPIE on LUBM datasets, and is shown to be generally comparable or better.

(full paper) Querying Semantic Knowledge Bases with SQL-on-Hadoop

The paper [12] describes research done by Martin Przyjaciel-Zablocki, Alexander Schätzle and Georg Lausen, in the Databases and Information Systems group at the Institut für Informatik of the Albert-Ludwigs-Universität Freiburg. The paper presents a new processor for TriAL-QL, an SQL-like query language for RDF that is based on the Triple Algebra with Recursion (TriAL*) [9] and is therefore especially suited to express navigational queries. The presented TriAL-QL processor is built on top of Impala (a massively parallel SQL engine on Hadoop) and Spark, using one unified data storage system based on Parquet and HDFS. The paper studies and compares different evaluation algorithms and storage strategies, and compares the performance to other RDF management systems such as Sempala, Neo4j, PigSPARQL, H2RDF+ and SHARD. Interactive and competitive response times are shown on datasets of more than a billion triples, and even

results with more than 25 billion triples could be produced in minutes.

(full paper) Benchmarking Data Flow Systems for Scalable Machine Learning

The paper [2] presents the contribution by Christoph Boden, Andrea Spina, Tilmann Rabl and Volker Markl, in the Database Systems and Information Management Group at the TU Berlin. The paper investigates the scalability of data-flow systems such as Apache Flink and Apache Spark for typical machine-learning workflows. The workloads are assumed to differ from the conventional workloads found in existing benchmarks, which are often based on tasks such as WordCount, Grep or Sort. The main difference is that the data tends to be more sparse but at the same time high-dimensional, so scalability in the number of dimensions becomes more important. The paper therefore presents a representative set of distributed machine-learning algorithms suitable for large-scale distributed settings that have close resemblance to industry-relevant applications and provide generalizable insights into system performance. These algorithms were implemented in Apache Spark and Apache Flink, and after tuning, the two systems were compared on scalability in both the size of the data and the dimensionality of the data. Measurements were done with datasets containing up to 4 billion data points and 100 million dimensions. The main conclusion is that current systems are surprisingly inefficient in dealing with such high-dimensional data.

(full paper) A containerized analytics framework for data- and compute-intensive pipeline applications

The paper [8] describes the result obtained by Yuriy Kaniovskyi, Martin Köhler and Siegfried Benkner, in the Research Group Scientific Computing at the University of Vienna and at the School of Computer Science at the University of Manchester. It presents a framework for executing high-level specifications of data-processing pipelines on heterogeneous architectures using a variety of programming paradigms. The presented framework allows several ways of executing each step in the pipeline, using different parallelization libraries (such as OpenMP, TBB, MPI and Cilk), different acceleration frameworks (such as OpenACC, CUDA and OpenCL) or different data-processing platforms (e.g., MapReduce, Spark, Storm and Flink). Moreover, the intermediate results might be passed in different ways, e.g., by storing them in HDFS, in a local file system or in memory, or by streaming them.

The approach starts with a high-level language for describing the global data-processing pipeline. It also allows the separate specification of the different implementation variants for each step in the pipeline, together with their computational requirements. These are self-contained execution units that can encapsulate legacy, native or accelerator-based code. This enables the framework to fuse data-intensive programming paradigms (via native YARN applications) with computation-intensive programming paradigms (via containerization). The framework then optimizes the execution plan by taking into account the specified performance models and the available resources as specified in a specially developed generic resource-description framework.

(full paper) MapReduce Implementation of Strassen's Algorithm for Matrix Multiplication

The paper [4] presents the work by Minhao Deng and Prakash Ramanan, in the Electrical Engineering and Computer Science department at Wichita State University. It studies MapReduce implementations of Strassen's algorithm for multiplying two n × n matrices, introduced by Strassen in [14]. The Strassen algorithm has time complexity O(n^log2(7)), which is an improvement over the standard algorithm, since the latter has time complexity O(n^3) and log2(7) ≈ 2.81.

When considering MapReduce implementations of the straightforward algorithm, it has been shown that this can be done in typically one or two passes. The paper investigates in how many passes the Strassen algorithm can be implemented. This algorithm consists of three parts: Part I, where sums and differences of quadrants of the input matrices are computed; Part II, where several products of the matrices from Part I are computed; and Part III, where each quadrant of the final matrix is computed by taking sums and differences of the matrices from Part II.

Direct implementation of these parts in MapReduce is shown to be possible in log2(n), 1 and log2(n) passes, respectively. Moreover, it is shown that Part I can be implemented in 2 passes, with total work O(n^log2(7)), and reducer size and reducer workload o(n). It is conjectured that Part III can also be implemented in a constant number of passes with small reducer size and workload.

(short paper) Scaling Out Continuous Multi-Way Theta Joins

The paper [6] presents work by Manuel Hoffmann and Sebastian Michel, performed at the Databases and Information Systems Group at the department

of computer science at the University of Kaiserslautern. The paper presents generic tuple-routing schemes for computing distributed multi-way theta-joins over streaming data. These schemes are implemented in an architecture where query plans consisting of logical operators are mapped to Apache Storm topologies. The paper reports first results for the TPC-H benchmark, where the topologies are executed on Amazon EC2 instances.

(short paper) Bi-Dimensional Binning for Big Genomic Datasets

The paper [3] describes results obtained by Pietro Pinoli, Simone Cattani, Stefano Ceri and Abdulrahman Kaitoua, in the Department of Electronics, Information, and Bioengineering at the Politecnico di Milano. The main result is a bi-dimensional binning strategy for the SciDB array database, motivated by the implementation of the GenoMetric Query Language (GMQL) — a high-level query language for genomics — that the authors previously developed. Due to the particularities of the array database, it is not possible to dynamically split a region and distribute its replicas to an arbitrary number of adjacent cells. Yet in order to apply a binning strategy, all the regions need to be replicated an identical number of times. The authors propose to organize the bins in a two-dimensional grid. In such a grid the values are located above the diagonal, and it is possible to define spaces that contain the values for one bin without replicating the values. Such bins (spaces) can share values because they share some grid cells.

4. CONCLUSION

The presentations and keynotes at BeyondMR'17 provided an overview of current developments and emerging issues in the area of algorithms, computational models, architectures and interfaces for systems that provide large-scale parallelization. The workshop attracted 17 submissions, from which the program committee, led by Paris Koutris from the University of Wisconsin-Madison, accepted 6 full papers and 3 short papers.

Keynotes and papers covered topics such as user interfaces, integration of programming paradigms, algorithmics and leveraging new hardware features. The contributions suggest that while MapReduce has been extended and replaced by newer models, there is an active area of research centred around data-management systems based on MapReduce and extensions, providing ever more insight for developing more effective graph processing systems and data-intensive workflow systems.

The sessions were well attended, with an average of 35 participants, and the proceedings are published in the ACM Digital Library [5].

Acknowledgements: We would like to thank the PC members, keynote speakers, authors, local workshop organizers and attendees for making BeyondMR 2017 a success. We also express our appreciation for the support from Google Inc. Finally we would like to thank the anonymous reviewers for their constructive remarks on how to improve this report.

5. REFERENCES

[1] Lars Arge, Michael T. Goodrich, Michael Nelson, and Nodari Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. In Proc. of SPAA '08, pages 197–206, New York, NY, USA, 2008. ACM.
[2] Christoph Boden, Andrea Spina, Tilmann Rabl, and Volker Markl. Benchmarking data flow systems for scalable machine learning. In Hidders [5], pages 5:1–5:10.
[3] Simone Cattani, Stefano Ceri, Abdulrahman Kaitoua, and Pietro Pinoli. Bi-dimensional binning for big genomic datasets. In Hidders [5], pages 9:1–9:4.
[4] Minhao Deng and Prakash Ramanan. MapReduce implementation of Strassen's algorithm for matrix multiplication. In Hidders [5], pages 7:1–7:10.
[5] Jan Hidders, editor. Proc. of BeyondMR'17, New York, NY, USA, 2017. ACM.
[6] Manuel Hoffmann and Sebastian Michel. Scaling out continuous multi-way theta-joins. In Hidders [5], pages 8:1–8:4.
[7] Dylan Hutchison, Bill Howe, and Dan Suciu. LaraDB: A minimalist kernel for linear and relational algebra computation. In Hidders [5], pages 2:1–2:10.
[8] Yuriy Kaniovskyi, Martin Koehler, and Siegfried Benkner. A containerized analytics framework for data and compute-intensive pipeline applications. In Hidders [5], pages 6:1–6:10.
[9] Leonid Libkin, Juan Reutter, and Domagoj Vrgoč. TriAL for RDF: Adapting graph query languages for RDF data. In Proc. of PODS '13, pages 201–212, New York, NY, USA, 2013. ACM.
[10] Yu Liu and Peter McBrien. SPOWL: Spark-based OWL 2 reasoning materialisation. In Hidders [5], pages 3:1–3:10.
[11] Shoumik Palkar, James J. Thomas, Anil Shanbhag, Malte Schwarzkopf, Saman P. Amarasinghe, and Matei Zaharia. Weld: A common runtime for high performance data analytics. In Proc. of CIDR 2017. www.cidrdb.org, January 2017.
[12] Martin Przyjaciel-Zablocki, Alexander Schätzle, and Georg Lausen. Querying semantic knowledge bases with SQL-on-Hadoop. In Hidders [5], pages 4:1–4:10.
[13] Jacek Sroka, Artur Leśniewski, Miroslaw Kowaluk, Krzysztof Stencel, and Jerzy Tyszkiewicz. Towards minimal algorithms for big data analytics with spreadsheets. In Hidders [5], pages 1:1–1:4.
[14] Volker Strassen. Gaussian elimination is not optimal. Numer. Math., 13(4):354–356, August 1969.
[15] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990.
[16] Reynold Xin and Josh Rosen. Project Tungsten: Bringing Apache Spark closer to bare metal. https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html, April 2015.
