Representations and Optimizations for Embedded Parallel Dataflow Languages

vorgelegt von M.Sc. Alexander Alexandrov geb. in Sofia, Bulgarien

von der Fakultät IV - Elektrotechnik und Informatik der Technischen Universität Berlin zur Erlangung des akademischen Grades

Doktor der Ingenieurwissenschaften - Dr.-Ing. -

genehmigte Dissertation

Promotionsausschuss:

Vorsitzender: Prof. Dr. Odej Kao
Gutachter: Prof. Dr. Volker Markl
Gutachterin: Prof. Dr. Mira Mezini
Gutachter: Prof. Dr. Torsten Grust

Tag der wissenschaftlichen Aussprache: 31. Oktober 2018

Berlin 2019

In memory of my grandfather, who taught me how to count when I was very young and Prof. Hartmut Ehrig, who taught me how to comprehend counting twenty years later.

Acknowledgments

I would like to express my gratitude to the people who made this dissertation possible.

First and foremost, I would like to thank my advisor Prof. Dr. Volker Markl. He offered me the chance to work in the area of data management and encouraged me to search for a motivating topic. Throughout my time at the Database Systems and Information Management (DIMA) Group at TU Berlin, his constant engagement and valuable advice allowed me to substantially improve the quality of my research.

I am also deeply indebted to everybody who contributed to the Emma project. Asterios Katsifodimos and Andreas Kunft showed tremendous dedication and work ethic and played an essential role in bringing the original SIGMOD 2015 submission to an acceptable shape under a very tight deadline. Georgi Krastev was instrumental in shaping the design and implementation of the compiler internals. Without his passion for functional programming and sharp eye for elegant API design, the software artifact accompanying this thesis would undoubtedly have ended up in a much more rudimentary form. Gábor Gévay influenced the story presented in this thesis with a number of incisive comments. Most notably, he rigorously pointed out that state-of-the-art solutions fall into the category of deeply embedded DSLs, which forced me to pinpoint quotation-based embedding as the crux of the proposed solution. Andreas Salzmann developed the GUI for the demonstrator, and Felix Schüler and Bernd Louis contributed a number of algorithms to the Emma library.

This work represents a natural fusion between two distinct lines of research. Stephan Ewen and Fabian Hüske developed the original PACT programming model, and I was lucky enough to work with both of them during my time at DIMA. The adopted categorical approach highlights the timeless relevance of the foundational research conducted by Phil Wadler, Peter Buneman and Val Tannen, and Torsten Grust in the 1990s. The intimate connection between the two areas was pointed out by Alin Deutsch during a visit to UC San Diego in the autumn of 2012.

Last but not least, I am also grateful to my family and friends for their continuous love and support, to all past and future teachers of mine for their shared knowledge, and to Katya Tasheva, Emma Greenfield, and Petra Nachtmanova for the music.


Declaration of Authorship

I, Alexander Alexandrov, declare that this thesis, titled “Representations and Optimizations for Embedded Parallel Dataflow Languages”, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at this University.
• Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
• Where the thesis is based on work done by myself jointly with others, I have made clear exactly what was done by others and what I have contributed myself.

Berlin, February 12, 2019 ......


Abstract

Parallel dataflow engines such as Apache Hadoop, Apache Spark, and Apache Flink have emerged as an alternative to relational databases that is more suitable for the needs of modern data analysis applications. One of the main characteristics of these systems is their scalable programming model, based on distributed collections and parallel transformations. Notable examples are Flink’s DataSet and Spark’s RDD programming abstractions. The programming model is typically realized as an eDSL – a domain specific language embedded in a general-purpose host language such as Java, Scala, or Python. This approach has several advantages over traditional stand-alone DSLs such as SQL or XQuery. First, it allows for reuse of linguistic constructs from the host language – for example, anonymous function syntax, value definitions, or fluent syntax via method chaining. This eases the learning curve for developers already familiar with the host language syntax. Second, it allows for seamless integration of library methods written in the host language via the function parameters passed to the parallel dataflow operators. This reduces the development effort for dataflows that go beyond pure SQL and require domain-specific logic, for example for text or image pre-processing.

At the same time, state-of-the-art parallel dataflow eDSLs exhibit a number of shortcomings. First, one of the main advantages of a stand-alone DSL such as SQL – the high-level, declarative Select-From-Where syntax – is either lost or mimicked in a non-standard way. Second, execution aspects such as caching, join order, and partial aggregation need to be decided by the programmer. Automatic optimization is not possible due to the limited program context reflected in the eDSL intermediate representation (IR).

In this thesis, we argue that these limitations are a side effect of the adopted type-based embedding approach. As a solution, we propose an alternative eDSL design based on quasi-quotations. We present a DSL embedded in Scala and discuss its compiler pipeline, IR, and some of the enabled optimizations. We promote the algebraic type of bags in union representation as a model for distributed collections, and its associated structural recursion scheme and monad as a model for parallel collection processing. At the source code level, Scala’s for-comprehensions can be used to encode Select-From-Where expressions in a standard way. At the IR level, maintaining comprehensions as first-class citizens simplifies the analysis and implementation of holistic dataflow optimizations that accommodate nesting and control flow. The proposed DSL design therefore reconciles the benefits of embedded parallel dataflow DSLs with the declarativity and optimization potential of external DSLs such as SQL.


Zusammenfassung

Parallele Datenflusssysteme wie Apache Hadoop, Apache Spark und Apache Flink haben sich als Alternative zu relationalen Datenbanken etabliert, die für die Anforderungen moderner Datenanalyseanwendungen besser geeignet ist. Zu den Hauptmerkmalen dieser Systeme gehört ein auf verteilten Datenkollektionen und parallelen Transformationen basierendes Programmiermodell. Beispiele dafür sind die DataSet- und RDD-Programmierschnittstellen von Flink und Spark. Diese Schnittstellen werden in der Regel als eDSLs realisiert, d.h. als domänenspezifische Sprachen, die in einer Hostsprache wie Java, Scala oder Python eingebettet sind. Dieser Ansatz bietet mehrere Vorteile gegenüber herkömmlichen externen DSLs wie SQL oder XQuery. Zum einen kann man bei einer eDSL syntaktische Konstrukte aus der Hostsprache wiederverwenden. Dies verringert die Lernkurve für Entwickler, die bereits mit der Syntax der Hostsprache vertraut sind. Zum anderen ermöglicht der Ansatz eine nahtlose Integration von Bibliotheksmethoden, die in der Hostsprache verfügbar sind, und reduziert somit den Entwicklungsaufwand für Datenflüsse, die über reines SQL hinausgehen und domänenspezifische Logik erfordern.

Gleichzeitig weisen eDSLs wie DataSet und RDD eine Reihe von Nachteilen auf. Erstens ist einer der Hauptvorteile von externen DSLs wie SQL – die deklarative Select-From-Where-Syntax – entweder verloren oder auf eine nicht-standardisierte Weise nachgeahmt. Zweitens werden Ausführungsaspekte wie Caching, Join-Reihenfolge und verteilte Aggregate vom Programmierer manuell festgelegt. Eine automatische Optimierung ist aufgrund des begrenzten Programmkontexts in der eDSL-Zwischenrepräsentation nicht möglich.

Wir zeigen, dass diese Einschränkungen als Nebeneffekt des auf Typen basierenden Einbettungsansatzes verursacht werden. Als Lösung schlagen wir ein alternatives Design vor, das auf Quasi-Quotations basiert. Wir präsentieren eine Scala eDSL und diskutieren deren Compiler, Zwischenrepräsentation sowie einige der ermöglichten Optimierungen. Als Grundlage für das verteilte Datenmodell benutzen wir den algebraischen Typ von Kollektionen in Union-Repräsentation, und für die parallele Datenverarbeitung – die damit verbundene strukturelle Rekursion und Monade. Auf der Quellcode-Ebene kann man Comprehensions über die Monade verwenden, um Select-From-Where-Ausdrücke in einer Standardform zu kodieren. In der Zwischenrepräsentation bieten Comprehensions eine Basis, auf der man Datenflussoptimierungen einfacher gestalten kann. Das vorgeschlagene Design vereinigt somit die Vorteile von eingebetteten parallelen Datenfluss-DSLs mit der deklarativen Natur und dem Optimierungspotenzial von externen DSLs wie SQL.


Contents

Acknowledgments i

Declaration of Authorship iii

Abstract (English/Deutsch) v

List of Figures xiii

1 Introduction 1

2 State of the Art and Problems 5
2.1 DSL Implementation Approaches ...... 5
2.2 eDSL Design Objectives ...... 6
2.3 Parallel Dataflow DSLs – Evolution and Problems ...... 7
2.3.1 Origins: MapReduce & Pregel ...... 7
2.3.2 Spark RDD and Flink DataSet ...... 8
2.3.3 Current Solutions ...... 13

3 Solution Approach 15

4 Background 17
4.1 Category Theory ...... 17
4.1.1 Basic Constructions ...... 18
4.1.2 Functors ...... 22
4.1.3 F-Algebras ...... 23
4.1.4 Polymorphic Collection Types as Functors ...... 27
4.1.5 Collection Types in Union Representation ...... 32
4.1.6 Monads and Monad Comprehensions ...... 35
4.1.7 Fusion ...... 46
4.2 Static Single Assignment Form ...... 47

5 Source Language 49
5.1 Linguistic Features and Restrictions ...... 49
5.2 Abstract Syntax ...... 50
5.3 Programming Abstractions ...... 52


5.3.1 Sources and Sinks ...... 52
5.3.2 Select-From-Where-like Syntax ...... 52
5.3.3 Aggregation and Grouping ...... 53
5.3.4 Caching and Native Iterations ...... 54
5.3.5 API Implementations ...... 54

6 Core Language 55
6.1 Administrative Normal Form ...... 55
6.2 First-Class Monad Comprehensions ...... 59
6.3 Comprehension Normalization ...... 61
6.4 Binding Context ...... 62
6.5 Compiler Pipelines ...... 63

7 Optimizations 67
7.1 Comprehension Compilation ...... 67
7.1.1 Naïve Approach ...... 67
7.1.2 Qualifier Combination ...... 68
7.1.3 Structured API Specialization in Spark ...... 71
7.2 Fold Fusion ...... 72
7.2.1 Fold-Forest Fusion ...... 73
7.2.2 Fold-Group Fusion ...... 76
7.3 Caching ...... 77
7.4 Native Iterations ...... 80

8 Implementation 83
8.1 Design Principles ...... 83
8.2 Design Space ...... 83
8.2.1 LMS ...... 84
8.2.2 Scala Macros and Scala Reflection ...... 87
8.2.3 Current Solutions ...... 89
8.3 Object Language Encoding ...... 91
8.4 Tree Manipulation API ...... 93
8.4.1 Strategies ...... 94
8.4.2 Attributes ...... 94
8.4.3 Rules ...... 96
8.5 Code Modularity and Testing Infrastructure ...... 96

9 Evaluation 101
9.1 Effects of Fold-Group Fusion ...... 101
9.2 Effects of Cache-Call Insertion ...... 103
9.3 Effects of Relational Algebra Specialization ...... 103
9.4 Effects of Native Iteration Specialization ...... 104
9.5 Cumulative Effects ...... 105

10 Related Work 107
10.1 Formal Foundations ...... 107
10.2 Related DSLs ...... 109
10.2.1 sDSL Targeting Parallel Dataflow Engines ...... 109
10.2.2 eDSLs Targeting RDBMS Engines ...... 109
10.2.3 eDSLs Targeting Parallel Dataflow Engines ...... 110
10.2.4 eDSLs with Custom Runtimes ...... 111

11 Conclusions and Future Work 113

Bibliography 124

List of Acronyms 126


List of Figures

2.1 Classification of DSLs...... 6

4.1 Example program in source, SSA, and ANF form...... 48

5.1 Abstract syntax of Emma Source...... 51
5.2 BagA and BagCompanion API in Emma...... 53

6.1 Abstract syntax of Emma CoreANF ...... 56
6.2 Inference rules for the anf transformation...... 57
6.3 Inference rules for the dscf transformation...... 58
6.4 Abstract syntax of Emma Core...... 59

6.5 Inference rules for the resugarM transformation...... 60
6.6 Inference rules for the normalizeM transformation...... 62
6.7 Binding context example...... 63

7.1 Inference rules for the combine transformation...... 69
7.2 Flink iterate specialization example...... 81

8.1 Emma transversal API example...... 99

9.1 Effects of fold-group fusion (FGF) in Flink and Spark...... 102
9.2 Effects of cache-call insertion (CCI) in Flink and Spark...... 103
9.3 Effects of relational algebra specialization (RAS) in Spark...... 104
9.4 Effects of native iterations specialization (NIS) in Flink...... 105
9.5 Cumulative optimization effects for the NOMAD use case...... 105


1 Introduction

One of the key principles behind the pervasive success of data management technology and the emergence of a multi-billion dollar market in the past 40+ years is the idea of declarative data processing. The notion of data in this context has been traditionally associated with the relational model proposed in [Cod70]. The notion of processing has been traditionally associated with relational engines – specialized runtimes for efficient evaluation of relational algebra programs. Finally, the notion of declarativity has two aspects: (i) the existence of high-level syntactic forms, and (ii) the ability to automatically optimize such syntactic forms by compiling them into efficient execution plans based on the relational algebra. Traditionally, (i) has been associated with the Select-From-Where syntax [CB74] used in the Structured Query Language (SQL), and (ii) with data-driven query compilation techniques [SAC+79]. Data management solutions based on the declarative data processing paradigm therefore traditionally interface with their clients through a stand-alone Domain Specific Language (DSL), most commonly SQL.

While SQL is easy to teach and straightforward to use for simple descriptive analytics, it is not so well-suited for more advanced pipelines. The limitations of SQL manifest themselves most notably in domains like data integration or predictive data analysis. Programs in these domains are characterized by dataflow features not directly supported by SQL, such as dataflows with iterative or nested structure or application-specific element-wise transformations. To illustrate this, imagine a text processing pipeline which clusters text documents using an algorithm such as k-means. Conceptually, the input of such a pipeline is a collection (document corpus) of nested collections (the words in a specific document). The first part of the pipeline therefore has to operate on this nested collection structure in order to reduce each document into a suitable data point – for example, a feature vector representing the tf-idf values of the words appearing in the document. The second part performs the actual clustering as a loop of repeated cluster re-assignment and centroid re-computation steps. Depending on the specific engine and SQL dialect, implementing this pipeline entirely in SQL ranges from impossible to cumbersome. If possible, an efficient encoding requires expert knowledge in advanced SQL features (usually offered as non-standard extensions) such as User-Defined Types (UDTs), User-Defined Functions (UDFs), and control-flow primitives such as the ones provided by PL/SQL. Language-integrated SQL technologies such as Microsoft’s Language-Integrated Query (LINQ) mitigate some of these issues, but do not deal well with the iterative dataflows characteristic of most data analysis pipelines.

In contrast, systems such as Apache Hadoop, Apache Spark, and Apache Flink offer more flexibility for programming data analysis pipelines. The notion of processing thereby corresponds to parallel dataflow engines designed to operate on very large shared-nothing clusters of commodity hardware. The notion of data corresponds to homogeneous distributed collections with user-defined element types. The notion of declarativity, however, is not mirrored at the language level.

Instead, dataflow engines adopt a functional programming model where the programmer assembles dataflows by composing terms of higher-order functions such as map f and reduce g. The semantics of these higher-order functions guarantee certain degrees of data-parallelism that are unconstrained by the concrete function parameters (f and g). Rather than using a stand-alone syntax, the programming model is realized as a domain specific language embedded in a general-purpose host language such as Java, Scala, or Python. This approach is more flexible, as it allows for seamless integration of data types and data processing functions available in the host language.

Despite this advantage, state-of-the-art Embedded Domain Specific Languages (eDSLs) offered by Spark and Flink also exhibit some common problems. First, one of the main benefits of a Stand-alone Domain Specific Language (sDSL) such as SQL – the standardized declarative Select-From-Where syntax – is either replaced in favor of a functional join-tree assembly or mimicked through function chaining in a non-standard way. Second, execution aspects such as caching, join order, and partial aggregation need to be controlled and manually hard-coded by the programmer. Automatic optimization is either restricted or not possible due to the limited program context available in the Intermediate Representation (IR) constructed by the eDSL. As a consequence, the construction of efficient pipelines requires programmers with deep understanding of the underlying dataflow engine. Further, mixing in physical execution aspects in the application code increases its long-term maintenance cost.

In this thesis, we argue that the problems listed above are a symptom of the type-based embedding approach adopted by these eDSLs. As a solution, we propose an alternative DSL design based on quasi-quotations. Our contributions are as follows.

(C1) We analyze state-of-the-art eDSLs for parallel collection processing and identify type-based embedding as the root cause for a set of commonly exhibited deficiencies.

(C2) We promote the algebraic type of bags in union representation as a model for distributed collections, and the associated structural recursion scheme (fold) and monad extension as a model for parallel collection processing.

(C3) As a solution to the problems highlighted in (C1), we propose a Scala DSL for parallel collection processing based on quasi-quotations1. We discuss the eDSL concrete syntax and Application Programming Interface (API), its abstract syntax IR, and a compiler frontend that mediates between the two.

(C4) Building on top of the IR from (C3), we develop optimizations that cannot be attained by parallel dataflow eDSLs with type-based embedding (e.g., fold-based fusion, operator specialization, and auto-caching) and highlight their relation to traditional optimizations from the compiler and data management domains.

(C5) We implement backends that offload data-parallel computation to Apache Spark and Apache Flink, and demonstrate that the performance of the code produced by these backends is on par with that of hand-optimized Spark and Flink dataflows. The automatic optimizations from (C4) thereby lower the requirements on the programmer. At the same time, separating the user-facing source language, the IR, and the backend parallel dataflow co-processor ensures performance portability.

(C6) We argue about the utility of monad comprehensions as a first-class citizen. At the source level, native comprehension syntax can be used to encode Select-From-Where expressions in a standard, host-language specific way, e.g., with for-comprehensions in Scala. At the IR level, treating comprehensions as a primitive building block simplifies the definition and analysis of holistic dataflow optimizations in the presence of nesting and control flow.

The proposed design can therefore be seen as a step towards reconciling the flexibility of modern eDSLs for parallel collection processing with the declarative nature and optimization potential traditionally associated with sDSLs such as XQuery and SQL.

The thesis is structured as follows. Chapter 2 reviews state-of-the-art technology and the research problem, while Chapter 3 outlines the proposed solution. Chapter 4 provides the background necessary for the methodology we employ towards our solution. Chapter 5 presents the abstract syntax and core API of Emma – a quotation-based DSL for parallel collection processing embedded in Scala. Chapter 6 presents Emma Core – an IR suitable for optimization, and a transformation from Emma Source to Emma Core. Chapter 7 develops optimizations on top of Emma Core. Chapter 8 discusses possible implementation infrastructures and some aspects of our prototype implementation.

1 In quotation-based DSLs, terms are delimited not by their type, but by an enclosing function which can access and transform the Abstract Syntax Tree of its arguments. For example, in the Scala expression

onSpark { ... code ... }

the onSpark quasi-quote delimits a Scala code snippet that will be automatically optimized and evaluated on Spark by the eDSL compiler presented in this thesis.


Chapter 9 highlights the impact and importance of the proposed optimizations through an experimental evaluation. Chapter 10 reviews related work. Finally, Chapter 11 concludes and discusses future research directions.

The material presented in this thesis is based on the following publications. The Bag API from Section 5.3, the comprehension compilation scheme from Section 7.1.2, and the fold-group-fusion optimizing transformation from Section 7.2.2 were first published at the SIGMOD 2015 conference [AKK+15]. A revised version of this work appeared in the SIGMOD Record journal in 2016 [AKKM16], and a demonstrator was showcased at the SIGMOD 2016 and BTW 2017 conferences [AKL+17, ASK+16]. Notably, the publications listed above do not rely on the Emma Core IR discussed in Chapter 6. Instead, the suggested implementation methodology is based on vanilla Scala Abstract Syntax Trees (ASTs) and an auxiliary “comprehension layer” developed on top of the Scala AST representation. Using Emma Core as a basis for the optimizations discussed in Chapter 7 represents a more general approach, as it decouples the eDSL IR from the host language IR. For example, the fold-group-fusion optimization discussed in [AKK+15, AKKM16] is presented only in conjunction with the Banana-Split law. The variant presented here, on the other hand, combines the Banana-Split and the Cata-Map-Fusion laws as a dedicated fold-forest-fusion transformation based on the Emma Core IR. The metaprogramming API discussed in Section 8.4 was developed jointly with Georgi Krastev in 2016-2017 and has not been published before. A shortened version of the material presented in this thesis (excluding Section 4.1 and Chapter 8) has been reworked as a journal paper and is currently under submission.

4 2 State of the Art and Problems

In this section we review open problems with state-of-the-art technology. We begin by introducing common notions related to the implementation (Section 2.1) and design (Section 2.2) of DSLs which are relevant for the subsequent discussion. We then provide a historical perspective of the evolution of embedded DSLs for scalable data analysis (Section 2.3), highlighting the benefits and problems of the various implementation approaches by example.

2.1 DSL Implementation Approaches

The DSL classes discussed below are depicted in Figure 2.1, with definitions adapted from [GW14]. With regard to their implementation approach and relation to General-purpose Programming Languages (GPLs), DSLs can be divided into two classes – stand-alone and embedded.

Stand-alone Domain Specific Languages (sDSLs) define their own syntax and semantics. The main benefit of this approach is the ability to define suitable language constructs and optimizations in order to maximize the convenience and productivity of the programmer. The downside is that, by necessity, a stand-alone DSL requires (i) building a dedicated parser, type-checker, and a compiler or interpreter, (ii) additional tools for IDE integration, debugging, and documentation, and (iii) off-the-shelf functionality in the form of standard or third-party libraries. Examples of widely adopted stand-alone DSLs are Verilog and SQL.

Embedded Domain Specific Languages (eDSLs) are embedded into a GPL usually referred to as the host language. This approach can be seen as more pragmatic compared to sDSLs for at least two reasons. First, it allows one to reuse the concrete syntax of the host language. Second, it allows one to reuse existing host-language infrastructure such as Integrated Development Environments (IDEs), debugging tools, and dependency management.

[Figure: classification tree. A DSL is either stand-alone (sDSL; e.g., SQL, XQuery) or embedded (eDSL). An embedded DSL is either shallowly embedded (e.g., Scala Collections) or deeply embedded. A deeply embedded DSL is either quotation-based (e.g., Emma) or type-based (e.g., Flink DataSet, Spark RDD).]

Figure 2.1: Classification of DSLs. Examples in each class are given in italic.

Based on the embedding strategy, eDSLs can be further differentiated into two sub-classes. With shallow embedding, DSL terms are implemented directly by defining their semantics in host language constructs. With deep embedding, DSL terms are implemented reflectively by constructing an IR of themselves. The IR is subsequently optimized and either interpreted or compiled.

Finally, the method used for IR construction in deeply embedded DSLs yields two more sub-classes. With the type-based approach, the eDSL consists of dedicated types, and operations on these types incrementally construct its IR. Host language terms that belong to the eDSL are thereby delimited by their type. With the quotation-based approach, the eDSL derives its IR from a host-language AST using some form of metaprogramming facilities offered by the host language. Host language terms that belong to the eDSL are thereby delimited by the surrounding quasi-quotes.

2.2 eDSL Design Objectives

Sharing its concrete syntax with the host language is an important property of an eDSL that can be utilized to improve its learning curve and adoption. In that regard, eDSL design should be guided by two principles. The first is to maximize linguistic reuse – that is, to exploit the programmer’s familiarity with the syntactic conventions and tools of the host language by carrying those over to the eDSL as much as possible. The second is to minimize linguistic noise – that is, to reduce the amount of idiosyncratic constructs specific to the eDSL as much as possible. In addition, the eDSL should aim to improve developer productivity while maximizing runtime performance through advanced, domain-specific optimizations.


2.3 Parallel Dataflow DSLs – Evolution and Problems

Based on the discussion above, we can review the evolution of parallel dataflow DSLs, outline limitations of state-of-the-art solutions, and discuss current solution strategies.

2.3.1 Origins: MapReduce & Pregel

In its early days, Google faced problems with two integral parts of its data engineering pipeline – (i) computing an inverted index and (ii) ranking the pages in a crawled Web corpus. Conceptually, the input for the first task is a collection of documents identified by a URL, and the goal is to tokenize the text content of each document into distinct words, pairing each word with the URLs of the documents where this word occurs. The input and output can therefore be seen as collections with the following element types1.

Document = (url : URL) × (content : String)        (input element type)
Index = (word : String) × (urls : URL*)             (output element type)

The input for the second task is a collection of URLs with adjacent URLs induced by their outbound hyperlinks, and the output – an assignment of ranks to each URL – is computed by iteratively re-distributing the current rank across adjacent links until convergence. The input and output element types look as follows.

Page = (url : URL) × (links : URL*)                 (input element type)
Rank = (url : URL) × (rank : Double)                (output element type)
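For concreteness, such element types can be rendered as Scala case classes, the data model used by the eDSLs discussed below. A minimal sketch (modeling URLs with java.net.URL and the starred collection components with Seq are assumptions made purely for illustration):

    import java.net.URL                              // assumption: URLs modeled with java.net.URL

    case class Document(url: URL, content: String)   // input element type of the indexing task
    case class Index(word: String, urls: Seq[URL])   // output element type of the indexing task
    case class Page(url: URL, links: Seq[URL])       // input element type of the ranking task
    case class Rank(url: URL, rank: Double)          // output element type of the ranking task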

Initially, Google attempted to implement both tasks in a relational database. This approach, however, had two major problems. First, handling the extreme input size required a distributed setup with thousands of nodes, which significantly increased the risk of a node failure during job execution. However, distributed database technology was not designed to resiliently execute long-running queries at such scale in the presence of frequent failures, and the pipeline was breaking too often. Second, the relational data model and query language were not the best fit for the tasks at hand. Due to their nested structure, neither Index nor Rank could be encoded natively as SQL tables. Additionally, expressing the two tasks as queries required non-standard SQL extensions, e.g., a tokenize UDF and an unnest operator for the inverted index task, and support for iterative dataflows for the page ranking task.

To overcome these problems, Google implemented purpose-built systems – MapReduce [DG04] for task (i) and Pregel [MAB+10] for task (ii). To address the first problem,

1Slightly varying from standard mathematical notation, we write the projection functions associated with the product type components (such as creditType) inlined in the product type definition.

these systems were designed to scale out to thousands of commodity hardware nodes in the presence of frequent failures. To address the second problem, each system adopted a parallel dataflow graph with a fixed shape that was suitable for the targeted task. Instead of SQL, the dataflows were constructed in a general-purpose programming environment such as C++, using UDFs and UDTs.

The impact of these systems was twofold. On the one hand, they triggered the development of open-source projects that re-implemented the proposed designs and programming models – Apache Hadoop (for MapReduce) and Apache Giraph (for Pregel). On the other, they spurred the interest of the data management and distributed systems research communities, where many of the ideas presented below originated.

2.3.2 Spark RDD and Flink DataSet

MapReduce and Pregel allowed users to process data flexibly and at a scale that was not possible with traditional data management solutions. At the same time, encoding arbitrary dataflows in the fixed shapes offered by those systems was cumbersome to program, hard to optimize, and inefficient to execute. Next-generation dataflow engines and programming models such as Spark [ZCF+10] and Nephele/PACTs [BEH+10] (which became Stratosphere and then Flink) were designed to overcome these limitations.

Generalizing MapReduce, these systems were able to execute dataflow graphs freely composed from a base set of second-order operators. Going beyond map and reduce, this set was extended with binary operators such as join, coGroup, and cross. To construct a dataflow graph in a convenient way, the systems offered type-based DSLs deeply embedded in JVM-based GPLs like Scala or Java. This technique was used from the outset by Spark, and later also adopted by Stratosphere/Flink [Har13]. Both eDSLs are based on a generic type representing a distributed, unordered collection of homogeneous elements with duplicates. In Spark, the type is called RDD (short for Resilient Distributed Dataset), while in Flink the type is called DataSet.

The RDD and DataSet eDSLs represent a significant improvement over the imperative style of dataflow assembly employed by Hadoop’s MapReduce or Stratosphere’s PACTs APIs. Nevertheless, a closer look reveals a number of important limitations shared by both eDSLs. To illustrate those, we present a series of examples based on a simplified film database schema2.

Person = (id : Long) × (name : String)
Credit = (personID : Long) × (movieID : String) × (creditType : String)
Movie = (id : Long) × (title : String) × (year : Short) × (titleType : String)

2 Product (or struct) types can be encoded as case classes in Scala and used as data model in both eDSLs.
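Following footnote 2, the film schema above can be written as Scala case classes; a minimal sketch using the field names and types from the product type definitions:

    case class Person(id: Long, name: String)
    case class Credit(personID: Long, movieID: String, creditType: String)
    case class Movie(id: Long, title: String, year: Short, titleType: String)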


Example 2.1 (Operator Chains). To demonstrate the similarity between the two eDSLs, consider the following Scala code snippet, which selects movies released between 1900 and 2000 and projects their year and title. Modulo the collection type, the code is identical (the color-coding will be explained later).

val titles = movies                  // either RDD[Movie] or DataSet[Movie]
  .filter( m => m.year >= 1900 )     // (1)
  .map( m => (m.year, m.title) )     // (2)
  .filter( m => m._1 < 2000 )        // (3)

Executing the above code in Scala will append a chain of a filter (1), a map (2), and a filter (3) operator to the dataflow graph associated with movies and wrap the result in a new RDD/DataSet instance bound to titles. While this functional (or algebraic) style of dataflow assembly is concise and elegant, it is neither truly declarative nor amenable to optimization. To see why, compare the above code with the equivalent SQL statement.

CREATE VIEW titles AS
SELECT m.year, m.title
FROM movies AS m
WHERE m.year >= 1900 AND m.year < 2000

A SQL optimizer will push both selection predicates behind the projection. For the dataflow graph discussed above, however, swapping (2) with (3) also implies adapting the function passed to (3), as the element type changes from (Short, String) to Movie. Since both eDSLs treat functions bound to second-order operators as “black-box” objects, the same rewrite cannot be realized directly in their IRs. To implement such rewrites, one has to resort to bytecode-level analysis and manipulation [HPS+12].
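For illustration, the chain that such a rewrite would have to produce might look as follows (a hypothetical hand-written sketch, not output of either eDSL; note how the pushed-down predicate has to be re-expressed over Movie instead of the projected (Short, String) pair – exactly the adaptation that black-box lambdas prevent):

    val titles = movies                 // either RDD[Movie] or DataSet[Movie]
      .filter( m => m.year >= 1900 )    // (1) unchanged
      .filter( m => m.year < 2000 )     // (3) pushed down, predicate rewritten against Movie
      .map( m => (m.year, m.title) )    // (2) the projection now runs last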

Example 2.2 (Join Cascades). For the next example, consider a code fragment that joins movies with people over credits.

// RDD (Spark)
val xs = movies.keyBy(_.id)
  .join(credits.keyBy(_.movieID))
  .values
val ys = xs.keyBy(_._2.personID)
  .join(people.keyBy(_.id))
  .values

// DataSet (Flink)
val xs = (movies join credits)
  .where(_.id)
  .equalTo(_.movieID)
val ys = (xs join people)
  .where(_._2.personID)
  .equalTo(_.id)

Two problems become evident from these code snippets. First, a standard, declarative syntax like Select-From-Where in SQL is not available. Instead, n-ary joins must be specified as a DSL-specific cascade of binary join operators. Consequently, the element type in the result is a tuple of nested pairs whose shape reflects the shape of the join tree. For example, the type of ys is ((Movie, Credit), Person). Field access therefore requires projection chains that traverse the nested tuple tree to its leaves.

For example, projecting (movie title, person name) pairs from ys can be done in one of two ways.

// total function with explicit projections
ys.map(y => {
  val m = y._1._1; val p = y._2
  (m.title, p.name)
})

// partial function with pattern matching
ys.map {
  case ((m, c), p) =>
    (m.title, p.name)
}

The second problem is again related to the ability to optimize the constructed IR terms. Consider a situation where the code listed above represents the entire dataflow. Since not all base data fields are actually used, performance can be improved through insertion of early projections. In addition to that, changing the join order might also be beneficial. For the same reason as in Example 2.1 (black-box function parameters), neither of these optimizations is possible in the discussed eDSLs. Solutions proposed in the past [GFC+12, HPS+12] indicate the potential benefits of such optimizations, but rely on an auxiliary bytecode inspection or bytecode de-compilation step.

Example 2.3 (Reducers). Computing global or per-group aggregates is an integral operation in most data analysis pipelines. MapReduce is a powerful model for computing User-Defined Aggregates (UDAs) in parallel. Here is how one can count the total number of movies using map and reduce in Spark’s RDD and Flink’s DataSet APIs.

movies // either RDD[Movie] or DataSet[Movie]
  .map(_ => 1L)
  .reduce((u, v) => u + v)

And here is how one can count the number of movies per decade.

// RDD (Spark)
movies
  .map(m => (decade(m.year), 1L))
  .reduceByKey((u, v) => u + v)

// DataSet (Flink)
movies
  .map(m => (decade(m.year), 1L))
  .groupBy(_._1).reduce((u, v) => (u._1, u._2 + v._2))

The reduce and reduceByKey operators enforce a specific parallel execution strategy. The input values (or the values of each group) are thereby reduced to a single aggregate value (or one aggregate per group) in parallel by means of repeated application of an associative and commutative binary function specified by the programmer. Aggressive use of reducers can therefore improve performance and scalability of the evaluated dataflows.

Optimal usage patterns, however, can be hard to identify, especially without a good background in functional programming. For example, in order to check whether Alfred Hitchcock or Woody Allen has directed more movies, one might build upon the ys collection of ((Movie, Credit), Person) triples defined in Example 2.2.


// count movies directed by Alfred Hitchcock
val c1 = ys
  .filter(_._1._2.creditType == "director")
  .map(y => if (y._2.name == "Hitchcock, Alfred") 1L else 0L)
  .reduce((u, v) => u + v)
// count movies directed by Woody Allen
val c2 = ys
  .filter(_._1._2.creditType == "director")
  .map(y => if (y._2.name == "Allen, Woody") 1L else 0L)
  .reduce((u, v) => u + v)
// compare the two counts
c1 < c2

One problem with this specification is that it requires two passes over ys. A skilled programmer will write code that achieves the same result in a single pass.

// pair-count movies directed by (Alfred Hitchcock, Woody Allen)
val (c1, c2) = ys
  .filter(_._1._2.creditType == "director")
  .map(y => (
    if (y._2.name == "Hitchcock, Alfred") 1L else 0L,
    if (y._2.name == "Allen, Woody") 1L else 0L
  )).reduce((u, v) => (u._1 + v._1, u._2 + v._2))
// compare the two counts
c1 < c2
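The single-pass version is an instance of a more general pattern: independent aggregations over the same collection can be fused into one traversal by pairing their intermediate results. A minimal sketch of the idea over an ordinary Scala Seq (the name fusedAggregate and its signature are illustrative and not part of either eDSL):

    // Fuse two map-then-reduce aggregations over xs into a single pass.
    def fusedAggregate[A, B, C](xs: Seq[A])(
        f: A => B, combB: (B, B) => B, zeroB: B,
        g: A => C, combC: (C, C) => C, zeroC: C): (B, C) =
      xs.foldLeft((zeroB, zeroC)) { case ((b, c), a) =>
        (combB(b, f(a)), combC(c, g(a)))
      }

With f and g mapping each triple to 1L or 0L and both combining functions set to addition, this computes c1 and c2 in one traversal. Chapter 7 develops this idea systematically as fold fusion.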

Another pitfall arises when handling groups. As group values cannot always be processed by an associative and commutative function, the discussed eDSLs offer alternative operators for “holistic” group processing. These operators apply the function parameter once per group, exposing all group values as a Scala Iterator (Flink) or Iterable (Spark). For example, the number of movies per decade can also be counted as follows.

// RDD (Spark)
movies
  .groupBy(m => decade(m.year))
  .map { case (k, vs) => {
    val v = vs.size
    (k, v)
  }}

// DataSet (Flink)
movies
  .groupBy(m => decade(m.year))
  .reduceGroup(vs => {
    val k = decade(vs.next().year)
    val v = 1 + vs.size
    (k, v)
  })

Understanding group processing in terms of a groupBy and a map over each group is more convenient than in terms of a map followed by a reduce or reduceByKey. However, a common mistake is to encode dataflows in the former style even if they can be defined in the latter. Grouping requires data re-partitioning, and in the case of a subsequent reduce the amount of shuffled data can be significantly reduced by pushing partial reduce computations before the shuffle step. Flink fuses a groupBy followed by a reduce implicitly, while Spark requires a dedicated operator called reduceByKey.


As with the previous two examples, optimizing these cases by means of automatic term rewriting is not possible in the presented eDSLs. Constructing efficient dataflows is predicated on the programmer’s understanding of the operational semantics of operators like reduce and reduceByKey.

Example 2.4 (Caching). Dataflow graphs constructed by RDD and DataSet terms might be related by the enclosing data- and control-flow structure3. For example, the ys collection from Example 2.2 is referenced twice in the naïve “compare movie-counts” implementation from Example 2.3 – once when counting the movies of Hitchcock (c1) and once when counting the movies of Allen (c2). Since a global reduce implicitly triggers evaluation, the dataflow graph associated with ys is expanded and evaluated twice. To amortize the evaluation cost of the shared sub-graph, the RDD eDSL offers a dedicated cache operator.

// cache shared sub-graph
val us = ys
  .filter(_._1._2.creditType == "director")
  .cache()
// count movies directed by Alfred Hitchcock
val c1 = us
  .map(y => if (y._2.name == "Hitchcock, Alfred") 1L else 0L)
  .reduce((u, v) => u + v)
// count movies directed by Woody Allen
val c2 = us
  .map(y => if (y._2.name == "Allen, Woody") 1L else 0L)
  .reduce((u, v) => u + v)
// compare the two counts
c1 < c2

Although Flink currently lacks first-class support for caching, a cache operator can be defined in terms of a pair of write and read operators and used with similar effect.
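A minimal sketch of such a write/read-based cache for a DataSet of strings (the tempPath parameter is a hypothetical helper, and text files are chosen only for brevity – a realistic variant would use a binary format and generic element types):

    import org.apache.flink.api.scala._

    def cache(xs: DataSet[String], env: ExecutionEnvironment, tempPath: String): DataSet[String] = {
      xs.writeAsText(tempPath)    // append a sink that materializes the dataflow result
      env.execute()               // force evaluation of the write
      env.readTextFile(tempPath)  // subsequent dataflows read the materialized data
    }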

Data caching can also significantly improve performance in the presence of control-flow, which is often the case in data analysis applications. To demonstrate this, consider a scenario where a collection w representing the parameters of some Machine Learning (ML) model is initialized and subsequently updated N times with the help of a static collection S.

// RDD (Spark)
val S = static().cache()
var w = init()
for (i <- 0 until N) {
  w = update(S, w).cache()
}

// DataSet (Flink)
val S = static()
var w = init()
w.iterate(N) ( w =>
  update(S, w)
)

3We use the spelling dataflow to denote bulk collection processing programs, and data-flow to denote the def-use relation between value bindings in the sense used in the language compilation literature.


The Spark version requires two explicit cache calls. If we do not call cache on the static result, the static dataflow graph will be evaluated N times. If we do not call cache on the update result, the loop body will be replicated N times without enforcing evaluation. As in the previous case, the Flink optimizer can automatically decide which dataflow graphs are loop-invariant and can be cached. However, in order to do this, the DataSet eDSL enforces the use of a dedicated iterate operator which models a restricted class of control-flow structures.

To summarize, Spark reuses Scala control-flow constructs, but delegates decisions about caching to the programmer. Flink, on the other hand, can often optimize some of these decisions automatically, but to achieve this it requires dedicated (and restricted) control-flow primitives and thereby violates the “linguistic reuse” design principle.

2.3.3 Current Solutions

To address the problems outlined above, two solution approaches are currently pursued. The first approach is to fall back to stand-alone DSLs. Notable sDSLs in this category are Pig Latin [ORS+08], Hive [TSJ+09], and SparkSQL [AXL+15]. This approach allows for both declarative syntax and advanced optimizations, as the entire AST of the input program can be considered in the compiler pipeline. Unfortunately, it also brings back the original problems associated with SQL – lack of flexibility and treatment of UDFs and UDTs as second-class constructs.

The second approach is to “lift” lambda expressions passed to second-order operators from “black-box” host-language constructs to first-class eDSL citizens. Notable examples in this category are the DataFrame and Dataset eDSLs in Spark [AXL+15] and the Table eDSL in Flink [Kre15]. The benefit of this approach is that filter, selection, and grouping expressions are represented in the IR. This enables logical optimizations such as join reordering, filter and selection push-down, and automatic use of partial aggregates. The problem is that by “lifting” the expression language one loses the ability to reuse host-language syntax for anonymous function declaration, field projections, and arithmetic operators and types. The embedding strategy of state-of-the-art solutions is based either on plain strings or on a dedicated type (Expression in Flink, Column in Spark). The linguistic reuse principle is violated in both cases. The following examples illustrate the result with a simple select-and-project dataflow.


// string-based embedding (Spark)
credits.toDF()
  .select("creditType", "personID")
  .filter("creditType == 'director")

// type-based (Column) embedding (Spark)
credits.toDF()
  .select($"creditType", $"personID")
  .filter($"credytType" === "director")

// string-based embedding (Flink)
credits.toTable(tenv)
  .select("creditType, personID")
  .where("creditType == 'director")

// type-based (Expression) embedding (Flink)
credits.toTable(tenv)
  .select('creditType, 'personID)
  .where('credytType === "director")

Neither of these approaches benefits from the type-safety or syntax checking capabilities of the host language. For example, the filter expression in the string-based approach is syntactically incorrect, as it lacks the closing quote after director, and the type- based versions have creditType misspelled. However, the enclosing Scala programs will compile silently, and the errors will be caught only at runtime, once the eDSL attempts to evaluate the resulting dataflows. In situations where long-running, possibly iterative computations are aborted at the very end due to a typing error, these issues can be particularly frustrating to the programmer. As an additional source of confusion, filter is overloaded to accept (black-box) regular Scala lambdas next to the reflected, but more idiosyncratic DSL-specific expressions, and, similarly, one can use map instead of select. Why and when should we prefer one variant over the other? Which expressions can be specified in the embedded language and which cannot? As with the eDSLs discussed in Section 2.3.2, the burden of understanding and navigating these trade-offs is on the programmer.

3 Solution Approach

Section 2.3 outlined a number of limitations shared between state-of-the-art DSLs for parallel dataflow systems such as DataSet and RDD as well as problems with existing solutions. To identify the root cause of these problems, we have to position these DSLs in the design space from Figure 2.1. Observe that the embedding strategy is based on types. Because of this, the IR lifted by DSL terms can only reflect method calls on these types as well as their def-use relation. In Section 2.3, the source code fragments reflected in the IR were highlighted in a different color. The remaining syntax (printed in black) is not reflected at the IR level. This includes the “glue code” connecting dataflow definitions, as well as lambdas passed as operator arguments.

The implications of this design decision for the optimizability, linguistic reuse, and declarativity are severe. In Example 2.1, it prohibits automatic operator reordering. In Example 2.2, it prohibits automatic join-order optimization as well as the use of for- comprehensions – a standard, declarative syntax for Select-From-Where-style expressions available in Scala. In Example 2.3, it prohibits automatic use of partial aggregates. In Example 2.4, it either prohibits automatic selection of optimal caching strategies or violates the linguistic reuse principle.

The net effect is a class of eDSLs that seem straightforward to use on the surface, yet for most applications require some degree of expert knowledge in data management and distributed systems in order to produce fast and scalable programs. The appeal of declarative, yet performant bulk dataflow languages such as SQL is lost.

As a solution to these problems, we propose a design for a quotation-based DSL for parallel collection processing embedded in Scala. Utilizing Scala’s reflection capabilities, this approach allows for deeper integration with the host language. In line with the objectives from Section 2.2, this leads to improved linguistic reuse and reduced linguistic noise. At the same time, a more principled collection processing API allows for optimizing transformations targeting the type-based eDSLs presented above. This results in a language for scalable data analysis where notions of data-parallel computation no longer

leak through the core programming abstractions. Instead, parallelism becomes implicit for the programmer without incurring significant performance penalty.

To illustrate the main difference between the proposed quotation-based approach and state-of-the-art type-based embedding, compare the RDD-based movies-per-decade example from Example 2.3 against a code snippet expressed in the Emma API wrapped in an onSpark quote.

// RDD (Spark)
movies
  .map(m => (decade(m.year), 1L))
  .reduceByKey((u, v) => u + v)

// Our solution (with Spark backend)
onSpark {
  for {
    g <- movies.groupBy(decade(_.year))
  } yield (g.key, g.values.size)
}

Observe how with the quotation-based approach, we use the more intuitive groupBy followed by a map over each group, expressed as a Scala for-comprehension. The quoted code snippet is highlighted, indicating that we can reflect everything in the IR of our eDSL. This allows us to (i) inspect all uses of g.values (in this case, only g.values.size), (ii) determine that size can be expressed in terms of partial aggregates, and (iii) automatically generate code which executes the program on Spark using the reduceByKey primitive from the RDD-based snippet.
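For reference, the for-comprehension inside the quote is plain Scala syntax: by the standard desugaring of a single-generator comprehension it is equivalent to a map over the grouped collection, which makes the correspondence to the RDD snippet easy to see:

    // Desugared form of the quoted for-comprehension (standard Scala rewriting).
    movies
      .groupBy(decade(_.year))
      .map(g => (g.key, g.values.size))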

4 Background

This section gives methodological background relevant to our solution approach. Section 4.1 outlines a category-theoretic foundation for distributed collections and parallel collection processing based on Algebraic Data Types (ADTs), structural recursion, and monads, introducing these concepts from first principles. Section 4.2 reviews IRs common in the compiler community – Static Single Assignment (SSA) form and a functional encoding of SSA called Administrative Normal Form (ANF).

4.1 Category Theory

Category theory can be used as a framework for modeling various subjects of study in a concise mathematical way. We use category theory to set up a constructive model for distributed collections and parallel collection processing, highlighting the connection between some theorems associated with the categorical constructions and the corresponding optimizations for parallel collection processing workloads.

The development in this section is restricted to definitions and constructions relevant to the subject of this thesis. Pierce gives a general introduction to category theory with focus on computer science applications [Pie91]. Bird and de Moor offer a more detailed treatment with focus on calculational program reasoning [BdM97]. Ehrig and Mahr outline a categorical view of algebraic specifications based on initial semantics [EM85]. Wadler [Wad92] gives a detailed introduction to monads and monad comprehensions. Chapter 2 in [Gru99] uses categorical collection types and monads as a basis for the development of a functional IR for database queries. Here, we essentially recast a subset of the theory presented in [Gru99] as a formalism which explains the optimizations outlined in Section 2.3 and therefore guides the design of the user-facing API (in Chapter 5) and IR (in Chapter 6) of the proposed embedded DSL. The equational rewrites in this thesis are carried out using the so-called Bird-Meertens formalism [Bac88, Gib94] also adopted in [BdM97, Gru99].


4.1.1 Basic Constructions

Category. A category C = (ObC, MorC, ∘, id) is a mathematical structure consisting of the following components.

• A class of objects A, B, C, ... ∈ ObC.
• A set of morphisms MorC(A, B) for each pair of objects A, B ∈ ObC. Each morphism can be seen as a unique C-arrow f : A →C B connecting the source src f = A and target tgt f = B objects of the underlying set MorC(A, B). We omit the subscript C and write f : A → B when the underlying category C is clear from the context.
• A composition operator

∘ : MorC(A, B) × MorC(B, C) → MorC(A, C)

which maps pairs of morphisms with a matching “apex” object to their composition

(f, g) ↦ g ∘ f.

• A family of identity morphisms idA ∈ MorC(A, A) for all A ∈ ObC.

In addition, a category C satisfies the following associativity and identity axioms for all

A, B, C, D ∈ ObC, f ∈ MorC(A, B), g ∈ MorC(B, C), and h ∈ MorC(C, D).

(h ∘ g) ∘ f = h ∘ (g ∘ f)
f ∘ idA = f   and   idB ∘ f = f        (Category)

Categories can be represented visually as directed multi-graphs whose nodes correspond to objects and whose edges correspond to morphisms. Note that not all directed multi-graphs constitute a valid category. For example, from the following two graphs

[Diagram: two directed multi-graphs over the nodes A, B, C, each with identity loops and edges f : A → B and g : B → C; the left graph additionally contains the edge h = g ∘ f : A → C, the right one does not.]

the left one corresponds to a category with three nodes and six morphisms, while the right one does not, as it lacks an A → C edge corresponding to the g ∘ f morphism. Similarly, the Category axioms can be represented as commutative diagrams. Stating that the left- and right-hand sides of the Category equations must be equal is the same as stating that the square (on the left) and the two triangles (on the right) of the following two diagrams must commute.

[Diagram: on the left, a commuting square with corners A, B, C, D and edges f : A → B, h ∘ g : B → D, g ∘ f : A → C, and h : C → D; on the right, two commuting triangles over A and B built from f, idA, and idB, expressing the identity equations.]

For the purposes of this thesis, our focus lies primarily in the category Set. Objects like

A, B ∈ ObSet denote types, while morphisms like f ∈ MorSet(A, B) and g ∈ MorSet(B, C) denote total functions with corresponding domain and codomain types given respectively by src · and tgt ·. The identity morphisms idA ∈ MorSet(A, A) denote identity functions ∀a ∈ A. a ↦ a, and the composition operator denotes function composition ∀a ∈ A. (f ∘ h) a = f (h a). The validity of the Category axioms follows immediately from these definitions and the associativity of function application.
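These Set notions have direct Scala counterparts; a small sketch of identity and composition as total functions (the names id and compose are illustrative):

    // Identity morphism idA and composition g ∘ f in Set, written as Scala functions.
    def id[A]: A => A = a => a
    def compose[A, B, C](g: B => C, f: A => B): A => C = a => g(f(a))

    // Example: compose((b: Int) => b + 1, (a: String) => a.length)("abc") == 4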

While conceptually thrifty, the language of category theory is surprisingly expressive. For example, a concept that can be generalized from Set to an arbitrary category C and defined in pure categorical terms is the notion of isomorphism.

Isomorphism. An isomorphism is a morphism f ∈ MorC(A, B) with a corresponding inverse morphism. That is, there exists some g ∈ MorC(B, A), sometimes denoted f⁻¹, such that the following equations hold.

f ∘ g = idB   and   g ∘ f = idA        (Isomorphism)

Two objects A and B related by an isomorphism are said to be isomorphic, written A ≅ B. The simplest kind of categorical constructions relate objects and morphisms within the same category. Each construction is defined as the solution of an associated family of equations (sometimes also called universal properties). The solution can be shown to be unique up to isomorphism. An interesting property of most constructions is that by reversing the direction of all morphisms one can obtain an associated dual construction. We now introduce two pairs of dual constructions and discuss their interpretation in Set.

Initial Object. An object 0 ∈ ObC is called initial in C if, for every object A ∈ ObC, the set MorC(0, A) consists of exactly one morphism, denoted !A.

Final Object. An object 1 ∈ ObC is called final in C if, for every object A ∈ ObC, the set MorC(A, 1) consists of exactly one morphism, denoted ¡A.

19 Chapter 4. Background

[Diagram: the unique (dashed) morphisms !A : 0 → A and ¡A : A → 1.]

To illustrate the flavor of categorical proofs, we show that, if they exist, all initial objects in a category are unique up to isomorphism. Suppose that two objects A and B are both initial in C. Then, from Initial Object applied to A, it follows that MorC(A, B) = {!B} and MorC(A, A) = {idA}. Similarly, from Initial Object applied to B, it follows that MorC(B, A) = {!A} and MorC(B, B) = {idB}. Since C is a category, the compositions !A ∘ !B ∈ MorC(A, A) and !B ∘ !A ∈ MorC(B, B) must also exist, and the only options we have are !A ∘ !B = idA and !B ∘ !A = idB, asserting that A and B are indeed isomorphic. A similar proof for terminal objects follows along the same line of reasoning.

The initial object in Set is the empty set $\emptyset$, and the initial morphisms are the empty functions $\emptyset \to A$. Dually, the final object in Set is the singleton set $\{()\}$ (that is, the set with one element), and the final morphisms $A \to \{()\}$ are the constant functions $\forall a \in A.\ a \mapsto ()$. If Set is viewed from a type-theoretic perspective, the initial object corresponds to the bottom type (called Nothing in Scala), and the final object to the unit type (called Unit in Scala). Note also that in Set we have $A \cong Mor_{\mathbf{Set}}(1, A)$ – elements $a \in A$ are in one-to-one correspondence with the constant functions $() \mapsto a$. The next definitions allow us to construct objects and morphisms out of existing ones.

Product. Given a pair of objects A and B in C, their product, denoted $A \times B$, is an object in C with associated projection morphisms $out_A : A \times B \to A$ and $out_B : A \times B \to B$ which satisfies the following universal property. For every object $C \in Ob_{\mathbf{C}}$ with morphisms $f : C \to A$ and $g : C \to B$ there exists a unique morphism, denoted $f \vartriangle g$, such that the following equations hold.

$$out_A \circ (f \vartriangle g) = f \quad\text{and}\quad out_B \circ (f \vartriangle g) = g \tag{Product}$$

Coproduct. Given a pair of objects A and B in C, their coproduct, denoted $A + B$, is an object in C with associated injection morphisms $in_A : A \to A + B$ and $in_B : B \to A + B$ which satisfies the following universal property. For every object $C \in Ob_{\mathbf{C}}$ with morphisms $f : A \to C$ and $g : B \to C$ there exists a unique morphism, denoted $f \triangledown g$, such that the following equations hold.

$$(f \triangledown g) \circ in_A = f \quad\text{and}\quad (f \triangledown g) \circ in_B = g \tag{Coproduct}$$

Again, if they exist, the product and coproduct objects can be shown to be unique up to isomorphism. The diagrams associated with these two definitions look as follows.


[Diagram: on the left, the product $A \times B$ with projections $out_A$ and $out_B$ and the unique mediating morphism $f \vartriangle g : C \to A \times B$; on the right, the coproduct $A + B$ with injections $in_A$ and $in_B$ and the unique mediating morphism $f \triangledown g : A + B \to C$.]

As a corollary, we obtain two laws which allow us to fuse a morphism h with a subsequent product morphism or preceding coproduct morphism.

$$(f \vartriangle g) \circ h = (f \circ h) \vartriangle (g \circ h) \tag{Product-Fusion}$$
$$h \circ (f \triangledown g) = (h \circ f) \triangledown (h \circ g) \tag{Coproduct-Fusion}$$

In Set, the product $A \times B$ corresponds to the cartesian product of the sets A and B, and the coproduct $A + B$ to their tagged union. From a type-theoretic perspective the constructions in Set can be interpreted as product and sum types, and the corresponding universal morphisms as the following functions.

$$\forall c \in C.\ (f \vartriangle g)\ c = (f\ c,\ g\ c)$$
$$\forall x \in A + B.\ (f \triangledown g)\ x = \begin{cases} f\ x & \text{if } x \in A \\ g\ x & \text{if } x \in B \end{cases}$$

Conceptually, products offer a categorical notion of delineation – we describe an action (morphism) into a product object $A \times B$ component-wise, by individually describing the actions for each possible part (A and B). Dually, coproducts offer a categorical notion of lineage – we describe an action (morphism) out of a coproduct object $A + B$ component-wise, by individually describing the actions for each possible case (A or B).
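To make the Set interpretation concrete, the following minimal Scala sketch (our own encoding; the names fanout and caseOf are hypothetical and not part of any library discussed in this thesis) encodes the two universal morphisms, using Tuple2 for products and Either for coproducts.

object ProductCoproduct {
  // (f △ g): build a morphism into a product component-wise
  def fanout[C, A, B](f: C => A, g: C => B): C => (A, B) =
    c => (f(c), g(c))

  // (f ▽ g): build a morphism out of a coproduct case-wise
  def caseOf[A, B, C](f: A => C, g: B => C): Either[A, B] => C = {
    case Left(a)  => f(a)
    case Right(b) => g(b)
  }

  def main(args: Array[String]): Unit = {
    val split = fanout[Int, Int, Boolean](_ + 1, _ % 2 == 0)
    println(split(41))                              // (42,false)

    val join = caseOf[Int, String, Int]((i: Int) => i, _.length)
    println(join(Left(7)))                          // 7
    println(join(Right("seven")))                   // 5
  }
}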

Binary products and coproducts are commutative and associative up to isomorphism. The Coproduct and Product definitions therefore can be generalized to n-ary products and coproducts, denoted respectively $\prod_{i=1}^{n} A_i$ and $\coprod_{i=1}^{n} A_i$.

Finally, we introduce the notion of a product category, which is needed for the generalization of products and coproducts as functors in the next section.

Product Category. For any pair of categories C and D, the product category $\mathbf{C} \times \mathbf{D}$ has as objects pairs $(A, B)$ where $A \in Ob_{\mathbf{C}}$ and $B \in Ob_{\mathbf{D}}$, and as morphisms pairs $(f, g)$ where $f \in Mor(\mathbf{C})$ and $g \in Mor(\mathbf{D})$. Morphism composition and identity are defined pairwise: $(g, f) \circ (i, h) = (g \circ i,\ f \circ h)$ and $id_{(A,B)} = (id_A, id_B)$.


4.1.2 Functors

So far, our categorical vocabulary has been restricted to constructions within a single category. As a next step, we focus on constructions between categories. The most basic case is a mapping between categories that preserves the structure of the source category.

Functor. Given two categories, C and D, a functor $F = (F_{Ob}, F_{Mor})$ is a mapping from C to D consisting of a component

$$F_{Ob} : Ob_{\mathbf{C}} \to Ob_{\mathbf{D}}$$

that operates on objects, and a component

$$F_{Mor} : Mor_{\mathbf{C}}(A, B) \to Mor_{\mathbf{D}}(F_{Ob}(A), F_{Ob}(B))$$

that operates on morphisms, preserving identity and composition:

$$F\,id_A = id_{FA} \quad\text{and}\quad F(g \circ f) = Fg \circ Ff. \tag{Functor}$$

To simplify notation, we omit the component subscript and write $FA$ instead of $F_{Ob}(A)$ and $Ff$ instead of $F_{Mor}(f)$. An endofunctor is a functor whose source and target categories coincide. From a type-theoretic perspective, Set endofunctors encode the notion of universal polymorphism. For example, the type of lists with elements of type A, usually written $\forall A.\ List_A$, can also be seen as a $\mathbf{Set} \to \mathbf{Set}$ functor $A \mapsto List_A$ that maps an element type A to its corresponding list type $List_A$. In a similar way, collection types such as Bag and Set can also be understood as functors. With the definitions so far, however, the internals of these functors are "black-box". For a "white-box" view, we have to formalize the notion of an Algebraic Data Type (ADT) in a categorical setting. To achieve this, we start by introducing a number of base functors.

Identity Functor. The identity functor $Id : \mathbf{C} \to \mathbf{C}$ maps objects and morphisms to themselves: $Id\,A = A$ and $Id\,f = f$.

Constant Functor. The constant functor $K_A : \mathbf{C} \to \mathbf{D}$ maps C-objects to a fixed D-object A and morphisms to $id_A$, i.e. $K_A\,B = A$ and $K_A\,f = id_A$.

Assuming that products and coproducts exist for arbitrary A and B in $Ob_{\mathbf{C}}$, we can define corresponding $\mathbf{C} \times \mathbf{C} \to \mathbf{C}$ functors.

Product Functor. Let C be a category with products. Then the product functor $\cdot \times \cdot$ is a $\mathbf{C} \times \mathbf{C} \to \mathbf{C}$ functor defined as follows. For any two C-objects A and B, the functor mapping is their product construction $(A, B) \mapsto A \times B$. Similarly, for any two morphisms $f : A \to B$ and $g : C \to D$, the functor mapping $f \times g : A \times C \to B \times D$ is defined as

$$f \times g = (f \circ out_A) \vartriangle (g \circ out_C).$$

Coproduct Functor. Let C be a category with coproducts. Then the coproduct functor $\cdot + \cdot$ is a $\mathbf{C} \times \mathbf{C} \to \mathbf{C}$ functor defined as follows. For any two C-objects A and B, the functor mapping is their coproduct construction $(A, B) \mapsto A + B$. Similarly, for any two morphisms $f : A \to B$ and $g : C \to D$, the functor mapping $f + g : A + C \to B + D$ is defined as $f + g = (in_B \circ f) \triangledown (in_D \circ g)$.

As a corollary, we obtain laws which enable fusing a functor mapping of a pair of morphisms $(h, i)$ with a preceding product morphism or subsequent coproduct morphism.

$$(h \times i) \circ (f \vartriangle g) = (h \circ f) \vartriangle (i \circ g) \tag{Product-Functor-Fusion}$$
$$(f \triangledown g) \circ (h + i) = (f \circ h) \triangledown (g \circ i) \tag{Coproduct-Functor-Fusion}$$

Functors are closed under composition – if F and G are functors, so is $GF = G(F\,\cdot)$. Functors composed from $Id$, $K_A$, $\times$, and $+$ are called polynomial functors. Polynomial functors are closely related to the concept of F-algebras.

4.1.3 F-Algebras

F-algebra. Let F denote an endofunctor in a category C. An F-algebra $\alpha : FA \to A$ is a morphism in C. The functor F is called the signature or base functor, and the object A is called the carrier of α.

F-algebras provide a compact framework for modeling terms of type A. If F is polynomial, its general form $FA = \coprod_{i=1}^{n} X_i$ implies that α factors into a family of morphisms $\alpha_i : X_i \to A$. For Set-valued functors, this factorization can be seen as an encoding of a polymorphic interface consisting of n functions $\alpha_i$ with a shared, generic return type A.

As an example, consider F-algebras for the Set endofunctor $F = K_1 + Id$ which maps A to $1 + A$. In this case, F-algebras are functions $\alpha : 1 + A \to A$ with carrier type A. From the Coproduct universal property, we know that α can be factored as $zero \triangledown succ$, where $zero = \alpha \circ in_1 : 1 \to A$ and $succ = \alpha \circ in_A : A \to A$. For a fixed type A, every possible combination of suitable zero and succ functions gives rise to a different F-algebra. The following lines list three F-algebras.

$$\alpha = zero_\alpha \triangledown succ_\alpha : F\mathbb{Z} \to \mathbb{Z} \qquad zero_\alpha\,() = 0 \qquad succ_\alpha\,x = x + 1$$
$$\beta = zero_\beta \triangledown succ_\beta : F\mathbb{Z} \to \mathbb{Z} \qquad zero_\beta\,() = 0 \qquad succ_\beta\,x = x - 1$$
$$\gamma = zero_\gamma \triangledown succ_\gamma : F\mathbb{C} \to \mathbb{C} \qquad zero_\gamma\,() = 0 \qquad succ_\gamma\,x = x^2 - 1$$

For each $\chi \in \{\alpha, \beta, \gamma\}$, we can then compose $zero_\chi$ and $succ_\chi$ in order to build terms of the corresponding carrier type. Terms with the general form $succ_\alpha^n \circ zero_\alpha$ for $n > 0$ correspond to positive integers, $succ_\beta^n \circ zero_\beta$ terms correspond to negative integers, and $succ_\gamma^n \circ zero_\gamma$ terms correspond to members of the sequence $P_c^n(0)$ of iterated applications of the complex polynomial $P_c : x \mapsto x^2 + c$ for $c = -1$ (this sequence is used in the definition of the Mandelbrot set). As a next step, consider carrier morphisms preserving the structure of F-algebra terms.
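The following Scala sketch (our own encoding, not part of the thesis library) renders this example executable: the base functor $FA = 1 + A$ is represented as Option[A], and an F-algebra with carrier A as a pair (zero, succ) evaluated on Option[A].

object NatAlgebras {
  type F[A] = Option[A]

  final case class Algebra[A](zero: A, succ: A => A) {
    def apply(fa: F[A]): A = fa.fold(zero)(succ)
  }

  val alpha = Algebra[Int](0, _ + 1)                 // builds non-negative integers
  val beta  = Algebra[Int](0, _ - 1)                 // builds non-positive integers
  val gamma = Algebra[Double](0.0, x => x * x - 1)   // iterates P_{-1}(x) = x^2 - 1

  def main(args: Array[String]): Unit = {
    // the term succ(succ(succ(zero))) evaluated in each algebra
    def term[A](alg: Algebra[A]): A =
      alg(Some(alg(Some(alg(Some(alg(None)))))))
    println(term(alpha)) // 3
    println(term(beta))  // -3
    println(term(gamma)) // -1.0  (0 -> -1 -> 0 -> -1)
  }
}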

F-homomorphism. Fix two F-algebras $\alpha : FA \to A$ and $\beta : FB \to B$. An F-homomorphism is a C-morphism $h : A \to B$ satisfying the equation

$$h \circ \alpha = \beta \circ Fh \tag{F-Hom}$$

which is also represented by the following commutative diagram.

[Diagram: a commuting square with edges $Fh : FA \to FB$, $\alpha : FA \to A$, $\beta : FB \to B$, and $h : A \to B$.]

As before, if F is a polynomial functor in Set, the above definition has a more specific interpretation. Informally, in this case F-Hom states that applying h to the result of $\alpha_i$ is the same as applying h to the A-arguments of $\alpha_i$ and then applying $\beta_i$ instead.

In our running example where $F = K_1 + Id$, F-Hom states that $h : A \to B$ is an F-homomorphism between any two F-algebras $(A, \alpha)$ and $(B, \beta)$ if and only if

$$h\,zero_\alpha = zero_\beta \quad\text{and}\quad h\,(succ_\alpha\,a) = succ_\beta\,(h\,a)$$

for all $a \in A$. For example, $h : \mathbb{Z} \to \mathbb{Z}$ with $h\,z = -z$ is an F-homomorphism between α and β, verified as follows.

$$h\,zero_\alpha = 0 = zero_\beta$$
$$h\,(succ_\alpha\,a) = h\,(a + 1) = -(a + 1) = -a - 1 = h\,a - 1 = succ_\beta\,(h\,a)$$

F-homomorphisms preserve identity morphisms and are closed under composition. F-algebras (as objects) and F-homomorphisms (as morphisms) thereby form a category denoted $Alg(F)$. To understand the connection between F-algebras and ADTs, we fix F and consider initial objects in $Alg(F)$. An initial object in $Alg(F)$ is an F-algebra $\tau : FT \to T$ such that each F-algebra $\alpha : FA \to A$ induces a unique F-homomorphism between τ and α, denoted $(|\alpha|) : T \to A$. If τ is an isomorphism in C, we can define $(|\alpha|)$

using the so-called catamorphism construction

$$(|\alpha|) = \alpha \circ F(|\alpha|) \circ \tau^{-1} \tag{Cata}$$

as illustrated by the modified F-Hom diagram depicted below.

[Diagram: a commuting square with edges $F(|\alpha|) : FT \to FA$, $\tau^{-1} : T \to FT$, $\alpha : FA \to A$, and $(|\alpha|) = \alpha \circ F(|\alpha|) \circ \tau^{-1} : T \to A$.]

Verifying that Cata satisfies F-Hom (in other words, that the above diagram commutes) is a straight-forward consequence of the Isomorphism property of τ.

$$\begin{array}{ll}
 & (|\alpha|) = \alpha \circ F(|\alpha|) \circ \tau^{-1} \\
\Leftrightarrow & \{\ \text{apply } \cdot \circ \tau \text{ on both sides}\ \} \\
 & (|\alpha|) \circ \tau = \alpha \circ F(|\alpha|) \circ \tau^{-1} \circ \tau \\
\Leftrightarrow & \{\ \text{Isomorphism property of } \tau\ \} \\
 & (|\alpha|) \circ \tau = \alpha \circ F(|\alpha|)
\end{array}$$

Lambek's lemma [Lam93] asserts that, whenever the initial algebra τ exists, it is an isomorphism, so the induced unique homomorphisms always have the structure defined by Cata.

Lambek's Lemma. Let F be a C-endofunctor such that $Alg(F)$ has an initial object τ. Then the carrier T of τ and FT are isomorphic via τ.

Proof. To prove the above statement, we apply F to the initial F-algebra $\tau : FT \to T$. The resulting C-morphism $F\tau : F(FT) \to FT$ is also an F-algebra and, since τ is initial, there is a unique catamorphism $(|F\tau|) : T \to FT$. At the same time, τ can also be seen as an F-homomorphism between the F-algebras $F\tau$ and τ. The above two observations are depicted by the following pair of commutative squares.

[Diagram: on the left, a square with edges $F(|F\tau|) : FT \to F(FT)$, $\tau : FT \to T$, $F\tau : F(FT) \to FT$, and $(|F\tau|) : T \to FT$; on the right, a square with edges $F\tau : F(FT) \to FT$ on top and bottom and $\tau : FT \to T$ on both sides.]

Since both $(|F\tau|)$ and τ are F-homomorphisms, their composition $\tau \circ (|F\tau|)$ must also be one.

But so is $id_T$, and from the uniqueness of initial morphisms it follows that $\tau \circ (|F\tau|) = id_T$. In the other direction, to show that $(|F\tau|) \circ \tau = id_{FT}$, we argue as follows.

$$\begin{array}{ll}
 & \tau \circ (|F\tau|) = id_T \\
\Leftrightarrow & \{\ \text{apply } F \text{ on both sides}\ \} \\
 & F(\tau \circ (|F\tau|)) = F\,id_T \\
\Leftrightarrow & \{\ \text{Functor properties}\ \} \\
 & F\tau \circ F(|F\tau|) = id_{FT} \\
\Leftrightarrow & \{\ \text{F-Hom property of } (|F\tau|)\ \} \\
 & (|F\tau|) \circ \tau = id_{FT}
\end{array}$$

The existence of an initial algebra in $Alg(F)$ is ensured if F is a polynomial functor [MA86]. Furthermore, if the underlying category is Set, Lambek's Lemma is tantamount to saying that τ must be a bijective function. The carrier set T of an initial algebra in Set therefore has the following properties. First, T has no junk: for all $t \in T$ there exists some $x \in FT$ such that $\tau\,x = t$ (because τ is surjective). Second, T has no confusion: for all $x, y \in FT$, $\tau\,x = \tau\,y$ implies $x = y$ (because τ is injective). In this case, we can interpret the components $\tau_i : X_i \to T$ of τ as data constructors, and the action of τ as a specific data constructor application. In the reverse direction, $\tau^{-1}$ acts like a parser, returning the constructor application term associated with a specific data point.

To illustrate these concepts, let us revisit the three F-algebras listed for the example functor $F = K_1 + Id$. In the case of α (β), the carrier $\mathbb{Z}$ has (i) junk, because negative (positive) integers cannot be expressed in terms of applications of the algebra functions, and (ii) confusion, since, for example, $succ_\alpha\,(-1) = 0 = zero_\alpha\,()$ ($succ_\beta\,1 = 0 = zero_\beta\,()$). In the case of γ, the carrier has confusion because, for example, $succ_\gamma\,(-1) = 0 = succ_\gamma\,1$. We can make the carrier sets of α and β initial by restricting them to $\mathbb{N}$ and $\mathbb{Z}_{\leq 0}$. Since initial objects are unique up to isomorphism, we can then assert that the F-homomorphism $h : \mathbb{N} \to \mathbb{Z}_{\leq 0}$, $h\,n = -n$ is a bijection. As a straight-forward consequence of Cata and F-Hom, we obtain two useful properties.

$$(|\tau|) = id_T \tag{Cata-Reflect}$$
$$h \circ \alpha = \beta \circ Fh \;\Rightarrow\; h \circ (|\alpha|) = (|\beta|) \tag{Cata-Fusion}$$

We will make extensive use of those in the calculational proofs carried out in the next sections.

The theory developed so far suffices to devise a constructive model for lists with a fixed element type. To do that, consider the Set-endofunctor $F = K_1 + K_{Int} \times Id$. The carrier of the initial algebra in $Alg(F)$ is the type $List_{Int}$ of lists with integer elements. The initial algebra itself decomposes into a pair of list constructors

$$emp : 1 \to List_{Int} \quad\text{and}\quad cons : Int \times List_{Int} \to List_{Int} \tag{ListInt-Ctor}$$

where $emp\,()$ denotes the empty list and $cons\,x\,xs$ the list constructed by inserting x into an existing list xs. Lists modeled by the above functor are therefore also called lists in insert representation. Each list instance is represented by a right-deep binary tree of cons applications terminating with an emp application. Catamorphisms in $Alg(F)$ correspond to the functional fold. More specifically, for $\alpha = zero_\alpha \triangledown plus_\alpha : FA \to A$ the catamorphism $(|\alpha|)$ corresponds to the function $fold\ \alpha$ which reduces $List_{Int}$ values to a result of type A by means of structural recursion over its input.

$$(|\alpha|)\,(emp\,()) = zero_\alpha\,() \tag{ListInt-Fold}$$
$$(|\alpha|)\,(cons\,a\,as) = plus_\alpha\,a\,((|\alpha|)\,as)$$
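The following Scala sketch (our own encoding, not the Emma library) mirrors ListInt-Ctor and ListInt-Fold: the initial algebra becomes an algebraic data type, and the catamorphism becomes structural recursion over its constructors.

// Insert representation of integer lists: the initial algebra of F A = 1 + Int x A.
sealed trait IntList
case object Emp extends IntList
final case class Cons(head: Int, tail: IntList) extends IntList

object IntListFold {
  // The catamorphism (|zero ▽ plus|): structural recursion over the constructors.
  def fold[A](zero: A)(plus: (Int, A) => A)(xs: IntList): A = xs match {
    case Emp           => zero
    case Cons(x, rest) => plus(x, fold(zero)(plus)(rest))
  }

  def main(args: Array[String]): Unit = {
    val xs = Cons(7, Cons(42, Emp))
    println(fold(0)((x, acc) => x + acc)(xs)) // 49, cf. the worked example below
  }
}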

To illustrate how Lambek's Lemma allows us to interpret structural recursion schemes such as ListInt-Fold in terms of their Cata components, consider applying the catamorphism

$$(|0 \triangledown +|) = (0 \triangledown +) \circ F(|0 \triangledown +|) \circ (emp \triangledown cons)^{-1}$$

to the $List_{Int}$ value $[7, 42]$. The $(emp \triangledown cons)^{-1}$ action deconstructs the input value, one layer at a time. The $F(|0 \triangledown +|)$ action substitutes emp with 0 and cons with $+$ in the resulting parse tree, recursively calling $(|0 \triangledown +|)$ on all arguments of type $List_{Int}$. Finally, $0 \triangledown +$ evaluates the $(0 \triangledown +)$-algebra on the resulting tree, producing the final result 49. Expanding the recursive calls, these steps can be represented as follows.

[Diagram: the value $[7, 42]$ is deconstructed by $(emp \triangledown cons)^{-1}$ into the parse tree $cons\,7\,(cons\,42\,(emp\,()))$; $F(|0 \triangledown +|)$ rewrites it into the tree $+\,7\,(+\,42\,0)$; evaluating the $(0 \triangledown +)$-algebra on this tree yields 49.]

4.1.4 Polymorphic Collection Types as Functors

In the previous section, we demonstrated how F-algebras for polynomial Set-endofunctors can be used to model collection types such as $List_{Int}$. The functor F encodes the signature of a family of base functions. The objects in $Alg(F)$ represent all possible implementations (i.e., models) of this signature – a concept known as classical semantics. Finally, the carrier of the initial object and the associated catamorphisms in $Alg(F)$ represent the unique inductive (i.e., least fixpoint) type defined by F and its associated structural recursion scheme – a concept known as initial semantics.

The approach developed so far has two important limitations. First, the $List_{Int}$ type is monomorphic (that is, with a fixed element type). Our goal, however, is to develop a polymorphic model where collection type constructors are understood as functors such as $List : A \mapsto List_A$. Second, with the theory presented so far we can only model lists. This is at odds with the collection types exposed by systems such as Spark and Flink, where element order is not guaranteed. Formally, we want to model collections T where

$$cons\,a'\,(cons\,a\,as) = cons\,a\,(cons\,a'\,as)$$

for all $a, a' \in A$ and $as \in TA$. This implies that the associated carrier cannot be initial in the category $Alg(K_1 + K_A \times Id)$ due to the introduced confusion. Consequently, we cannot define collection processing operations in terms of catamorphisms or derive optimizing program transformations from the catamorphism properties.

To overcome the first limitation, we generalize F-algebras to polymorphic F-algebras and use those to define the type functor $List : A \mapsto List_A$. To overcome the second limitation, we extend signature functors F to specifications Spec and F-algebra categories $Alg(F)$ to model categories $Mod(Spec)$. Based on that, we define a hierarchy of collection type constructors List, Bag, and Set in the so-called insert representation. In Section 4.1.5, we show how these type constructors can be defined in the so-called union representation using another signature functor G. Further, we discuss why G is a better fit for the distributed collections and dataflow frameworks presented in Section 2.3.

To simplify presentation, we restrict the definitions of an Algebraic Specification and Quotient F-Algebra to Set-endofunctors. This is sufficient for the goals of this thesis and allows for writing datatype equations in the more familiar and readable point-wise form. A purely categorical, point-free development of datatype equations is proposed by Fokkinga [Fok92, Fok96] and adopted by Grust [GS99].

Polymorphic F-algebra. Let $F : \mathbf{C} \times \mathbf{C} \to \mathbf{C}$ be a functor and $F_A : \mathbf{C} \to \mathbf{C}$ be defined by $F_A\,B = F(A, B)$ and $F_A\,g = F(id_A, g)$. A polymorphic F-algebra is a collection of $F_A$-algebras $\alpha_A : F_A\,B \to B$ indexed by a type parameter $A \in Ob_{\mathbf{C}}$. The $F_A$-algebras and $F_A$-homomorphisms for a fixed A form the category $Alg(F_A)$.

Polymorphic F-algebras allow us to abstract over the element type A in the collection types we want to model. For example, we can parameterize the signature functor for the $List_{Int}$ algebra from $F\,A = 1 + Int \times A$ to the following polymorphic signature.

$$F(A, B) = 1 + A \times B \tag{Ins-Sign}$$

Type Former. Let FA : C C be a family of signature functors (or more generally Ñ


a family of specifications $Spec_A$) such that the associated family of model categories $Alg(F_A)$ (or more generally $Mod(Spec_A)$) has initial algebras $\tau_A$ for all $A \in \mathbf{C}$. The type former T associates each object $A \in \mathbf{C}$ with the carrier $TA$ of $\tau_A$.

Ultimately, we want to establish $T \in \{List, Bag, Set\}$ as type formers. To do so, we need to define specifications $Spec^T_A$ based on Ins-Sign and construct the associated model categories $Mod(Spec^T_A)$. Assuming for the moment that this can be done, let

$$\forall A \in \mathbf{Set}.\ \tau^T_A = emp^T_A \triangledown cons^T_A : F_A(TA) \to TA \tag{Ins-Ctor}$$

denote the polymorphic initial algebra associated with T. The structural recursion scheme ListInt-Fold generalizes to a family of definitions $(|\cdot|)^T_A$ indexed by A. The catamorphism $(|\alpha|)^T_A$ is the unique $F_A$-homomorphism in $Mod(Spec^T_A)$ between the $F_A$-algebras $\tau^T_A$ and some $\alpha = zero_\alpha \triangledown plus_\alpha : F_A\,B \to B$, and is defined component-wise as follows.

For collection type functors T, the adapted algebra factors into α empα Ź consα, and Ins-Ctor “ T the above statement can be formalized in terms of the definition of τB .

T empα : 1 TB defined as empα emp Ñ pq “ Bpq T consα : A TB TB defined as consα a bs cons f a bs ˆ Ñ “ B p q

T We now have two FA-algebras τA and α with carriers corresponding to the source and target objects of the morphism f : A B. Since τ T is initial in Mod SpecT , we can Ñ A p Aq just set Tf to be the catamorphism α T : TA TB, which is also a C-morphism. A Ñ L M 29 Chapter 4. Background

If T is a collection type functor, we can plug-in the definitions of empα and consα into Ins-Fold, resulting in a definition of Tf based on structural recursion.

$$Tf\,(emp^T_A\,()) = emp^T_B\,() \tag{Ins-Map}$$
$$Tf\,(cons^T_A\,a\,as) = cons^T_B\,(f\,a)\,(Tf\,as)$$

Ins-Map reveals that the morphism mapping component of T corresponds to the higher-order function $map^T$, allowing us to use both notations interchangeably. The resulting $TA \to TB$ function maps input collection elements a to $f\,a$ and returns the collection of mapped results. For example, $map^{List}\,strlen : List\,String \to List\,Int$ emits the string length of each element in a list of strings, producing a list of integers.

$$map^{List}\,strlen\ [\text{Show},\ \text{me},\ \text{what},\ \text{you},\ \text{got}] = [4, 2, 4, 3, 3]$$
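A small sketch of the same idea on Scala's built-in List (rather than the categorical encoding): the morphism mapping of the list type functor defined by structural recursion as in Ins-Map, i.e. as a fold that re-applies the constructors.

object ListMap {
  def mapVia[A, B](f: A => B)(xs: List[A]): List[B] =
    xs.foldRight(List.empty[B])((a, acc) => f(a) :: acc)

  def main(args: Array[String]): Unit = {
    val strlen: String => Int = _.length
    println(mapVia(strlen)(List("Show", "me", "what", "you", "got")))
    // List(4, 2, 4, 3, 3), matching the example above
  }
}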

Having established the polymorphic type List categorically as a Set-endofunctor, we consider the collection types Bag and Set. As with List, our goal is to show that the two are Set-endofunctors. As a first attempt, we stick to the insert representation and continue using the polymorphic signature functor associated with List.

Obviously, for a fixed A both $\tau^{Bag}_A$ and $\tau^{Set}_A$ belong to $Alg(F_A)$. However, the two algebras are constrained by additional axioms that capture the semantics of the corresponding type. First, the order of element insertion is not relevant for the constructed values, captured by the so-called commutativity axiom

$$cons^T_A\,a'\,(cons^T_A\,a\,as) = cons^T_A\,a\,(cons^T_A\,a'\,as) \tag{Ins-Comm}$$

for all $A \in \mathbf{Set}$ and $T \in \{Bag, Set\}$. Second, for all $A \in \mathbf{Set}$ and $T = Set$, inserting an element twice does not affect the value, captured by the so-called idempotence axiom.

$$cons^T_A\,a\,(cons^T_A\,a\,as) = cons^T_A\,a\,as \tag{Ins-Idem}$$

The controlled form of confusion introduced by these axioms means that the carriers

BagA and SetA are not initial in Alg FA . This prohibits us from using the Type Former p q and Type Functor definitions in order to establish Bag and Set as functors. To overcome this limitation, we generalize signature functors to specifications, rendering both BagA and SetA initial in the category of models satisfying the corresponding specification.

Algebraic Specification. Let F be a polynomial signature endofunctor in Set, with

F-algebras $\alpha : FA \to A$ factoring into a family of functions $\alpha_i : X_i \to A$. Let E denote a set of equations relating expressions defined solely in terms of the $\alpha_i$ and variables universally quantified over their type. Then the pair $(F, E)$ is called an algebraic specification. We can now define algebraic specifications for various collection types by pairing the

partially applied signature functor Ins-Sign with subsets of {Ins-Comm, Ins-Idem}.

$$Spec^{List}_A = (F_A, \emptyset)$$
$$Spec^{Bag}_A = (F_A, \{\text{Ins-Comm}\})$$
$$Spec^{Set}_A = (F_A, \{\text{Ins-Comm}, \text{Ins-Idem}\})$$

The containment relation between the equation sets in the above specifications induces the so-called Boom hierarchy of types [Bir87].

Now, given a specification Spec F, E , consider all F-algebras satisfying the equations “ p q in E and all F-homomorphisms between them. This subset of Alg F constitutes a p q subcategory, denoted Mod Spec , sometimes also called the classical semantics of Spec. p q The objects in Mod Spec are called models for Spec, and morphisms in Mod Spec p q p q are called Spec-homomorphisms.

To illustrate, consider the $F_{Int}$-algebras $sum = 0 \triangledown +$ and $min = \infty \triangledown min_2$. Both are contained in $Mod(Spec^{Bag}_{Int})$, because both satisfy Ins-Comm. However, since $+$ is not idempotent, sum does not satisfy Ins-Idem and only min is a model for $Spec^{Set}_{Int}$.

As a final step, the following fact allows for constructing initial objects in Mod Spec p q from initial objects in the underlying category Alg F . p q Quotient F-Algebra. Let F be an endofunctor in Set such that Alg F has an initial p q object τ : FT T and Spec F, E be an associated algebraic specification. Relate Ñ “ p q with t t1 elements of T identified (via τ ´1) with terms rendered equal by some equation „ in E. Complete to an equivalence relation by taking its transitive closure. Let „ – T 1 T denote the quotient set of T w.r.t. , that is the set of equivalence classes “ {– – t of t T. Construct the quotient algebra τ 1 : FT 1 T 1 component-wise, defining r 1s 1 P 1 Ñ τi : Xi T in terms of its corresponding τ-component τi : Xi T using the general Ñ1 Ñ scheme τi t x τi t x . Basically, the scheme adapts all equivalence class parameters 1 r s “ r s t T as regular parameters t T, calls τi , and finally maps the result of the τi call r s P P r T to its equivalence class r T 1. The quotient algebra is then an isomorphism in P r s P Set, and consequently τ 1 is an initial object in Mod Spec . p q For a proof of this statement, see Theorem 3.7 in [EM85]. For a fixed A Set and T T P T P Bag, Set we can use Quotient F-Algebra to define the initial object τA Mod SpecA in t u List P T p q terms of τ Alg FA and the equivalence classes induced by the Spec equations. A P p q r¨s A empT : 1 TA defined as empT empList A Ñ Apq “ r A pqs consT : A TA TA defined as consT a bs consList a bs A ˆ Ñ A r s “ r A s T Further, properties of catamorphisms transfer from Alg FA to Mod Spec . This p q p Aq allows to establish Bag and Set as type functors, reusing Type Former and Type Functor.


Note that the model-theoretic perspective also characterizes situations where a catamorphic definition of a function does not exist. To illustrate one such scenario, consider again the $F_{Int}$-algebra $sum = 0 \triangledown +$. Specializing Ins-Fold yields a family of catamorphisms $(|sum|)^T_{Int} : T\,Int \to Int$ for $T \in \{List, Bag, Set\}$ with the following definitions.

$$(|sum|)^T_{Int}\,(emp^T_{Int}\,()) = 0$$
$$(|sum|)^T_{Int}\,(cons^T_{Int}\,a\,as) = a + (|sum|)^T_{Int}\,as$$

Since $sum \notin Mod(Spec^{Set}_{Int})$, the catamorphism $(|sum|)^{Set}_{Int}$ does not exist. In other words, in this case the above definition does not constitute a well-defined function, as witnessed by the following derivation of $1 = 2$.

$$\begin{array}{ll}
 & 1 + 0 \\
= & \{\ \text{definition of } (|sum|)^{Set}_{Int} \text{, backwards}\ \} \\
 & (|sum|)^{Set}_{Int}\,(cons^{Set}_{Int}\,1\,(emp^{Set}_{Int})) \\
= & \{\ \text{Ins-Idem axiom for Set}\ \} \\
 & (|sum|)^{Set}_{Int}\,(cons^{Set}_{Int}\,1\,(cons^{Set}_{Int}\,1\,(emp^{Set}_{Int}))) \\
= & \{\ \text{definition of } (|sum|)^{Set}_{Int} \text{, forwards}\ \} \\
 & 1 + 1 + 0
\end{array}$$

4.1.5 Collection Types in Union Representation

Using the constructions from Section 4.1.4, we can define an alternative view of the List, Bag, and Set types known as union representation. It is based on the signature functor

$$G(A, B) = 1 + A + B \times B \tag{Uni-Sign}$$

and the polymorphic initial G-algebra

$$\forall A \in \mathbf{Set}.\ \tau^T_A = emp^T_A \triangledown sng^T_A \triangledown uni^T_A : G_A(TA) \to TA \tag{Uni-Ctor}$$

where $emp^T_A\,()$ denotes the empty collection of type TA (as before), $sng^T_A\,a$ denotes the TA collection containing a single element $a \in A$, and $uni^T_A\,xs\,ys$ denotes the union of two TA collections xs and ys. Type constructors are constrained by the following axioms (we omit the superscript T and subscript A for readability)

$$uni\,xs\,(emp\,()) = xs \quad\text{and}\quad uni\,(emp\,())\,xs = xs \tag{Uni-Unit}$$
$$uni\,xs\,(uni\,ys\,zs) = uni\,(uni\,xs\,ys)\,zs \tag{Uni-Asso}$$
$$uni\,xs\,ys = uni\,ys\,xs \tag{Uni-Comm}$$


$$uni\,xs\,(uni\,xs\,ys) = uni\,xs\,ys \tag{Uni-Idem}$$

and the associated type functors are induced over the following specifications.

$$Spec^{List}_A = (G_A, \{\text{Uni-Unit}, \text{Uni-Asso}\})$$
$$Spec^{Bag}_A = (G_A, \{\text{Uni-Unit}, \text{Uni-Asso}, \text{Uni-Comm}\})$$
$$Spec^{Set}_A = (G_A, \{\text{Uni-Unit}, \text{Uni-Asso}, \text{Uni-Comm}, \text{Uni-Idem}\})$$

As with Ins-Map, the morphism mapping $Tf = (|emp^T_B \triangledown (sng^T_B \circ f) \triangledown uni^T_B|)^T_A$ follows from Type Functor and encodes a structural recursion scheme.

$$Tf\,(emp^T_A\,()) = emp^T_B\,()$$
$$Tf\,(sng^T_A\,a) = sng^T_B\,(f\,a) \tag{Uni-Map}$$
$$Tf\,(uni^T_A\,xs\,ys) = uni^T_B\,(Tf\,xs)\,(Tf\,ys)$$

Collection types in Spark and Flink do not guarantee element order and allow for duplicates, so they are most accurately modeled by the Bag type constructor. Furthermore, these collections are partitioned across multiple nodes in a shared-nothing cluster, and their value is defined as the disjoint union of all their partitions: $xs = \biguplus_{i}^{n} xs_i$. The union representation supports this view directly, which is why for the purposes of this thesis we prefer Uni-Sign over Ins-Sign. The utility of Uni-Sign becomes more evident when we consider the associated structural recursion scheme. Let $\alpha = zero_\alpha \triangledown init_\alpha \triangledown plus_\alpha : G_A\,B \to B$ be an arbitrary algebra in union representation. Then for $T \in \{List, Bag, Set\}$ the catamorphism $(|\alpha|)^T_A$ is recursively defined in terms of the $G_A$ structure as follows.

$$(|\alpha|)^T_A\,(emp^T_A\,()) = zero_\alpha\,()$$
$$(|\alpha|)^T_A\,(sng^T_A\,a) = init_\alpha\,a \tag{Uni-Fold}$$
$$(|\alpha|)^T_A\,(uni^T_A\,as_1\,as_2) = plus_\alpha\,((|\alpha|)^T_A\,as_1)\,((|\alpha|)^T_A\,as_2)$$

The key insight is that the properties of algebras in union representation ensure a parallel execution scheme regardless of the concrete choice of zero, init, and plus. For a distributed collection $as = \biguplus_{i}^{n} as_i$, a generalized version of the homomorphism property of uni

$$(|\alpha|)^T_A\,\Big(\biguplus_{i}^{n} as_i\Big) = (|\alpha|)^T_A\,\Big(\biguplus_{i}^{n} (|\alpha|)^T_A\,as_i\Big) \tag{Uni-Fold-Dist}$$

implies that we can evaluate $(|\alpha|)^T_A$ independently on each partition $as_i$ using function shipping, and then fold the Bag of partial results on a single machine in order to compute the final result. In the functional programming community, this idea was highlighted by Steele [Jr.09]. In the Flink and Spark communities, the underlying mathematical principles seem to go largely unnoticed, although their utility has been

recently demonstrated by projects like Summingbird [BROL14] and MRQL [FLG12]. The relevance of the Uni-Fold recursion scheme is also indicated by the fact that variations of it (under different names), as well as derived operators such as reduce and count, are an integral part of the Flink and Spark APIs.
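A minimal Scala sketch (our own encoding, with made-up names) of a union-representation fold and the distributed evaluation scheme suggested by Uni-Fold-Dist: every partition is folded independently with (zero, init, plus), and the partial results are then combined with plus. The "partitions" here are plain local Seqs; in a real dataflow engine each partition fold would run on a different worker.

object UnionFold {
  final case class Alg[A, B](zero: B, init: A => B, plus: (B, B) => B)

  // sequential fold of one partition
  def fold[A, B](alg: Alg[A, B])(xs: Seq[A]): B =
    xs.map(alg.init).foldLeft(alg.zero)(alg.plus)

  // "distributed" fold: fold each partition, then fold the partial results
  def foldDist[A, B](alg: Alg[A, B])(partitions: Seq[Seq[A]]): B =
    partitions.map(fold(alg)).foldLeft(alg.zero)(alg.plus)

  def main(args: Array[String]): Unit = {
    val sum   = Alg[Int, Int](0, identity, _ + _)
    val parts = Seq(Seq(1, 2, 3), Seq(4, 5), Seq(6))
    println(fold(sum)(parts.flatten)) // 21
    println(foldDist(sum)(parts))     // 21, the same result, as Uni-Fold-Dist predicts
  }
}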

To conclude, we compare the expressiveness of the two representations. For the rest of this section, let objects, morphisms, and categories associated with the insert representation be color-coded in red, and those associated with the union representation in blue. The fact that we can define the type former $T \in \{List, Bag, Set\}$ using either Ins-Sign or Uni-Sign implies that the carriers of the initial F- and G-algebras in the respective model categories are isomorphic. The isomorphism is established by a pair of catamorphisms $(|i2u|)^{T_F}_A \in Mod(Spec^{T_F}_A)$ and $(|u2i|)^{T_G}_A \in Mod(Spec^{T_G}_A)$. The components of the polymorphic algebras $i2u = e \triangledown c$ and $u2i = e \triangledown s \triangledown u$ are defined in terms of constructors in the respective target representation (we omit the subscript A from the algebra components for clarity).

e emp e emp pq “ pq pq “ pq s a cons a emp c a as uni sng a as “ “ p q TF u as1 as2 as2 Ź cons as1 “ A L M A natural follow-up question is whether the above correspondence generalizes to non-initial algebras in the F-based model category Mod SpecTF and the G-based Mod SpecTG . p A q p A q The G-to-F part can be derived from the initial case depicted above – given a G-algebra T T T z Ź i Ź G Mod Spec G , we can construct the corresponding F-algebra z Ź F ‘ A P p A q b A P Mod SpecTF component-wise as follows. p A q L M L M z z pq “ pq a b i a b b “ p q ‘ Similarly, one might think that the same generalization applies in the F-to-G direction.

z z pq “ pq i a a z “ b TF b1 b2 b2 Ź p b1 ‘ “ A Unfortunately, the above construction is notL valid.M The reason for this is that z does not have to be a unit of . As a counter-example, consider the G-representation derived b TF for the F-algebra α 3 Ź . For TF ListF, BagF the F-catamorphism α is the “ ` P t u Int well-defined function adding 3 to the sum of the elements in the input collection. The following examples illustrate this – the result of α does not depend on the representationL M of its argument (we use infix notation a : as instead of cons a as). L M α 1, 2 α 1 : 2 : emp 1 2 3 6 tt uu “ p q “ ` ` “ 34 L M L M 4.1. Category Theory

α 1, 2 α 2 : 1 : emp 2 1 3 6 tt uu “ p q “ ` ` “

TF However, the derived LG-catamorphismM L M β Int is not a well-defined function – the value of α as depends on the number of emp and sng calls in the as representation. The following examples illustrate this (using the infixL notationM as1 as2 instead of uni as1 as2 ). Y¨ L M α 1, 2 α sng 1 sng 2 1 3 2 3 9 tt uu “ pp q Y¨p qq “ p ` q ` p ` q “ α 1, 2 α sng 1 sng 2 emp 1 3 2 3 3 12 L M tt uu “ L M pp q Y¨p q Y¨ q “ p ` q ` p ` q ` “ L M L M A well-defined F-to-G translation for catamorphisms does not come for free. Tannen and Subrahmanyam give a solution which adds complexity in requiring function types ( ) Ñ as an additional type constructor [TS91]. Suciu and Wong [SW95] give another solution which adds complexity in mapping F-catamorphisms that compute in polynomial time to G-catamorphisms that require exponential space. In general, these solutions support a well-known conjecture – there is no “silver bullet” for automatic program parallelization. Conversely, there is a simple way to translate a parallel program (like a catamorphism in TG TF SpecA ) into an equivalent sequential program (a catamorphism in SpecA ).

4.1.6 Monads and Monad Comprehensions

The type constructors introduced in Section 4.1.4 and Section 4.1.5 can be used to formalize collection-based data models. In addition to distributed collections managed by dataflow engines such as Flink and Spark, relations managed by Relational Database Management Systems (RDBMSs) can also be understood in terms of bags of database records. As a formalism for the associated processing model, Bag catamorphisms can be used to define primitives for parallel collection processing such as map and reduce. However, so far we have not seen how catamorphisms relate to declarative languages such as SQL. To make this connection, we show that collection type constructors can be extended to a structure known as a monad with zero.

A monad with zero allows for declarative syntax known as monad comprehensions. In the Bag monad, the syntax and semantics of Bag comprehensions correspond to the syntax and semantics of Select-From-Where expressions in SQL. To illustrate this at the syntactic level, compare the SQL expression

SELECT x.b, y.d FROM xs as x, ys as y WHERE x.a = y.c

with the abstract syntax of the corresponding Bag comprehension

$$[\,(x.b,\ y.d) \mid x \leftarrow xs,\ y \leftarrow ys,\ x.a = y.c\,]^{Bag}.$$

Formally, a comprehension $[\,e \mid qs\,]^T$ over a collection monad T consists of a head

35 Chapter 4. Background expression e and a qualifier sequence qs. A qualifier is either i a generator x xs p q Ð binding the elements of xs : TA to x : A, or ii a boolean guard p (such as x.a y.c) p q “ restricting the combinations of generator bindings that contribute to the final result.

As a programming language construct, comprehensions were first adopted by Haskell. Nowadays, comprehension syntax is also natively supported by some GPLs. Python supports List comprehensions, written

[(x.b, y.d) for x in xs for y in ys if x.a == y.c].

Similarly, Scala supports for-comprehensions, written

for { x <- xs; y <- ys; if x.a == y.c } yield (x.b, y.d)

for arbitrary generic types TA implementing an interface consisting of the functions map, flatMap, and withFilter. We make extensive use of Scala's for-comprehensions when designing the source language of our embedded DSL in Chapter 5.
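The Scala compiler desugars such a for-comprehension into nested calls to flatMap, withFilter, and map. The small program below (the case classes X and Y and their field names are made up for illustration) evaluates both forms over plain Seq values and checks that they agree.

object ForDesugaring {
  final case class X(a: Int, b: String)
  final case class Y(c: Int, d: String)

  def main(args: Array[String]): Unit = {
    val xs = Seq(X(1, "b1"), X(2, "b2"))
    val ys = Seq(Y(1, "d1"), Y(3, "d3"))

    val sugared =
      for { x <- xs; y <- ys; if x.a == y.c } yield (x.b, y.d)

    val desugared =
      xs.flatMap(x => ys.withFilter(y => x.a == y.c).map(y => (x.b, y.d)))

    println(sugared)               // List((b1,d1))
    println(sugared == desugared)  // true
  }
}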

In the remainder of this section, we formally introduce the notions of monad and monad with zero, show how the union-style collection type constructors List, Bag, and Set can be extended to the latter, and define the denotational semantics of monad comprehensions. As a prerequisite, we need to define the notion of a natural transformation.

Natural Transformation. Let C and D be two categories and F and G two functors from C to D. A natural transformation $\mu : F \to G$ is a family of D-morphisms $\mu_A : FA \to GA$ indexed by $A \in \mathbf{C}$, which satisfies the following property for all C-morphisms $f : A \to B$.

$$\mu_B \circ Ff = Gf \circ \mu_A \tag{Naturality}$$

The associated commutative diagram is also known as a naturality square.

[Diagram: the naturality square with edges $Ff : FA \to FB$, $\mu_A : FA \to GA$, $\mu_B : FB \to GB$, and $Gf : GA \to GB$.]

If F and G represent polymorphic data containers such as List, Bag, and Set, a natural transformation can be interpreted as container conversion. The Naturality property states that the operation commutes with map f applications over the source and target containers – we can either first map over the source container F and then apply the conversion, or convert first and then map over the target container G.


A simple natural transformation was already encountered in Section 4.1.4 – the family of

T-constructors $sng_A : A \to TA$ defines a natural transformation $sng : Id \to T$ between Id (which represents the trivial container for an element of type A) and the collection container T. As a second example, consider the List-to-Set conversion defined by the family of List catamorphisms $list2set_A = (|emp^{Set}_A \triangledown sng^{Set}_A \triangledown uni^{Set}_A|)^{List}_A$ for all $A \in \mathbf{Set}$. The definition satisfies Naturality – we can either first convert a List to a Set and then apply f to all $a \in Set\,A$, or swap the order of these actions with the same net result. Natural transformations and endofunctors are all we need in order to define monads.
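The list2set example just discussed can be spot-checked in a few lines of Scala, letting the built-in toSet play the role of list2set (a sketch, not the categorical construction itself): mapping f first and converting afterwards gives the same result as converting first and mapping over the Set.

object List2SetNaturality {
  def main(args: Array[String]): Unit = {
    val xs = List("emma", "flink", "spark", "emma")
    val f: String => Int = _.length
    val lhs = xs.map(f).toSet   // map over the List, then convert
    val rhs = xs.toSet.map(f)   // convert, then map over the Set
    println(lhs == rhs)         // true: the naturality square commutes
  }
}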

Monad. Let $T : \mathbf{C} \to \mathbf{C}$ be an endofunctor in C. A monad is a triple $(T, unit, flatten)$ consisting of T and two natural transformations – $unit : Id \to T$ and $flatten : T^2 \to T$ – such that the following properties are satisfied for all $A \in \mathbf{C}$.

$$flatten_A \circ unit_{TA} = id_{TA} \quad\text{and}\quad flatten_A \circ T\,unit_A = id_{TA} \tag{Monad-Unit}$$
$$flatten_A \circ flatten_{TA} = flatten_A \circ T\,flatten_A \tag{Monad-Asso}$$

The above properties are also represented by the following pair of commutative diagrams.

unitTA TflattenA TA T2A T3A T2A

TunitA flattenA flattenTA flattenA

T2A TA T2A TA flattenA flattenA

The use of monads in the domain of programming languages semantics can be traced back to Moggi [Mog91]. Conceptually, the type TA denotes a value of type A attached to some kind of computational context (or, in Moggi’s parlance, some notion of computation) that is modeled by T. Various notions of computation such as optionality, non-determinism, and exceptions can be modeled using a suitable Set-endofunctor T. For the purposes of this thesis, however, we are only interested in situations where T is one of the collection type constructors introduced in Section 4.1.5, and most commonly where T Bag. “ In these cases, the computational context associated with TA is the context of the constructor application tree where the element values a A are inserted in the enclosing P collection of type TA. Alternatively, one can say that T represents the computational notion of the state of a collection traversal procedure with cursor pointing at some a A. P The natural transformation unit defines a family of constructors which inject a pure value of type A into a context of type T. If T represents one of the collection type functors encountered in the previous section, we can define unit as follows.

$$unit^T_A = sng^T_A \tag{Uni-Monad-Unit}$$

T The TA-value unitA a then denotes the trivial context of a collection containing only a.

Now consider a situation where we have obtained an instance of a computational context ta TA. The morphism component of T allows for transforming the A-values of ta into B- P values using some function f : A B. Crucially, the type of f asserts that f cannot access Ñ or modify the enclosing T-context. If T is a collection type functor, then Tf mapT f “ (as defined in Uni-Map) denotes the function wrapping all ta elements in an f call. The catamorphic interpretation of mapT f ta is consistent with the monadic perspective – the trees denoted by ta and mapT f ta have identical shape. To illustrate this, consider the input and output of the application mapT strlen Don’t, fear, the, monad . tt uu uni uni uni uni uni uni sng sng sng sng ÞÑ sng sng sng sng Don’t fear the monad 5 4 3 5 While the input type of f is determined by the ta value type A, its output type can be any object B ObC. In particular, f might be a value of type TB C, i.e. a value P P within the same monad. Consequently, Tf ta is of type TTB, shortened as T2B. In such cases, the natural transformation flatten can be used to reduce the structural nesting of the Tf ta result. As its name implies, flatten transforms a nested context of type T2A into a flat context of type TA. If T is a collection type functor, T2 denotes the context of a tree (representing an inner collection) nested within the context of another tree (representing an outer collection). To illustrate this, consider combining the concrete Bag values xs a, b and ys 1, 2 with the following expression. “ tt uu “ tt uu map x map y x, y ys xs a, 1 , a, 2 , b, 1 , b, 2 (4.1) p ÞÑ p ÞÑ p qq q “ ttttp q p quu ttp q p quuuu Expanding Uni-Map at the representation level, the above expression yields an outer collection structurally identical to xs whose values are inner collections identical to ys.

uni

uni uni sng sng sng sng , sng sng uni uni ˜ a b 1 2 ¸ ÞÑ sng sng sng sng a, 1 a, 2 b, 1 b, 2 p q p q p q p q

Applying flatten on the above result effectively inlines the inner collections in their outer context, resulting in a flat collection of x, y values. p q


uni uni sng sng sng sng uni uni sng sng sng sng sng sng sng sng ÞÑ a, 1 a, 2 b, 1 b, 2 a, 1 a, 2 b, 1 b, 2 p q p q p q p q p q p q p q p q

The above picture also illustrates the catamorphic nature of $flatten^T$. To flatten a nested T-collection, all we have to do is traverse the (outer) TTA tree, inline all (inner) TA trees passed as arguments of $sng^T_{TA}$ calls, and change the element type of all $emp^T$ and $uni^T$ constructors from TA to A. This yields the following definition of $flatten^T$, overloaded for collection type functors $T \in \{List, Bag, Set\}$.

$$flatten^T_A = (|emp^T_A \triangledown id_{TA} \triangledown uni^T_A|)^T_{TA} \tag{Uni-Monad-Flatten}$$

We can now interpret the meaning of the Monad-Unit and Monad-Asso properties from the Monad definition.

As we saw above, nesting a Tg tb call within the lambda f passed to an outer Tf ta call yields a nested monadic type TTA. Pushing this pattern further, we can imagine nesting of Tf calls of depth n with corresponding results of type TnA. For simplicity, let us assume that n 3. To transform a value of type TTTA into its flat version of type “ TA, we need to apply flatten twice. The morphism mapping component of the functor

T and the polymorphic component A in flattenA give us two options to achieve this. The first option corresponds to the bracketing T TT – we first flatten the middle and p p qq inner T using TflattenA (i.e., we map flattenA over the outer T instance), and then apply flattenA on the intermediate result. The second option corresponds to the bracketing TT T – we start with a flattenTA call which merges the outer and middle T, and then pp q q apply flattenA as in the previous case. The Monad-Asso property states that these two options are equivalent. In other words, flatten is associative – a nested type TnA can be transformed into a flat type TA by an arbitrary bracketing of flatten applications.

Similarly, the polymorphic component A in unitA and the morphism mapping of T allow for introducing a level of nesting both from the left and from the right of an existing type TA. An application of TunitA results in a type-level transformation TA TTA. ÞÑ Conversely, an application of unitTA results in a type-level transformation TA TTA. ÞÑ The Monad-Asso property states if we apply flatten after introducing a level of nesting on either side as shown above, we end up with the original TA value. In other words,

flattenA has TunitA as a left unit and unitTA as a right unit.

As illustrated above, the Monad-Unit and Monad-Asso properties of a monad corre- spond to the unit and associativity laws of monoidal structures in abstract algebra. This

observation motivates the widely cited explanation that "monads are just monoids in the category of endofunctors".
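The following minimal Scala sketch makes unit and flatten tangible. We let Scala's List stand in for the Bag functor (ignoring that List also tracks element order) and check the two Monad-Unit equations on a sample value; this is an illustration under that simplifying assumption, not the thesis' own encoding.

object CollectionMonad {
  def unit[A](a: A): List[A] = List(a)
  def flatten[A](tta: List[List[A]]): List[A] = tta.flatMap(identity)

  def main(args: Array[String]): Unit = {
    val ta = List(1, 2, 3)
    // right unit: map unit over the collection, then flatten
    println(flatten(ta.map(a => unit(a))) == ta) // true
    // left unit: wrap the whole collection with unit, then flatten
    println(flatten(unit(ta)) == ta)             // true
  }
}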

Extending a type functor T to a monad is sufficient in order to define the semantics of comprehension syntax expressions without guards. Adding a zero element to the monad then allows for extending these semantics with support for guard qualifiers.

1 Monad with Zero. Let zeroA : 1 TA be a natural transformation, and zero : Ñ A X TA denote the composition zeroA ¡ for arbitrary X.A monad with zero is a Ñ ˝ X tuple T, unit, flatten, zero where T, unit, flatten forms a monad and zero satisfies the p q p q following properties.

1 1 flatten zeroTA zeroA & flatten Tzero zero (Monad-Zero) A ˝ “ A ˝ A “ A The above equations can be represented by the following pair of commutative triangles.

zeroTA TzeroA1 1 T2A TA

zero A flattenA zeroA1

TA

The Monad-Zero property can be interpreted similarly to Monad-Unit. Given a 1 computational context of type TA, an application of TzeroA operates from the inside, replacing arbitrary A-values with an empty TA-context in order to construct a context 1 of type TTA. Conversely, an application of zeroTA operates from the outside, replacing the entire TA-context with a empty context of type TTA. In both cases, a subsequent flatten reduces the resulting nested empty context to its flat form. If T is a collection type functor based on Uni-Sign, the obvious choice is to identify zero with emp.

$$zero^T_A = emp^T_A \tag{Uni-Monad-Zero}$$

Note how Monad-Zero states that zero annihilates flatten, similar to the way a zero element acts under multiplication in a ring structure. Bringing this similarity further, we can extend a monad with zero with a natural transformation mplusA : TA TA TA, ˆ Ñ requiring that i zero and mplus form a monoid and that ii flatten distributes over p q p q mplus. As before, if T is a type functor derived from Uni-Sign, using mplus uni is “ the obvious choice. The resulting structure T, unit, flatten, zero, mplus is also known as p q a ringad. Ringads were first proposed by Trinder and Wadler [TW89, Tri91] as a formal foundation for comprehensions in database programming languages, and their utility as a formal foundation when reasoning about database optimizations has been recently highlighted by Gibbons [Gib16].


To define the semantics of comprehensions, however, we only need a monad with zero. Here, we use a restricted version of the MC translation scheme proposed by Grust [GS99], requiring that all qualifiers and the enclosing comprehension are in the same monad T.

$$\begin{array}{ll}
MC\,[\,e\,]^T & = unit^T\,(MC\,e) \\
MC\,[\,e \mid q, qs\,]^T & = flatten^T\,(MC\,[\,MC\,[\,e \mid qs\,]^T \mid q\,]^T) \\
MC\,[\,e \mid x \leftarrow xs\,]^T & = map\,(x \mapsto MC\,e)\,(MC\,xs) \\
MC\,[\,e \mid p\,]^T & = \textbf{if}\ MC\,p\ \textbf{then}\ unit^T\,(MC\,e)\ \textbf{else}\ zero^T \\
MC\,e & = e
\end{array} \tag{MC}$$

As an example of the MC scheme at work, observe how the Bag comprehension

$$[\,(x, y) \mid x \leftarrow xs,\ y \leftarrow ys\,]^{Bag}$$

translates to the left-hand side of Equation 4.1 followed by $flatten^{Bag}$.

$$\begin{array}{ll}
 & MC\,[\,(x, y) \mid x \leftarrow xs,\ y \leftarrow ys\,]^{Bag} \\
= & \{\ \text{apply } MC\,[\,e \mid q, qs\,]^T\ \} \\
 & flatten^{Bag}\,(MC\,[\,MC\,[\,(x, y) \mid y \leftarrow ys\,]^{Bag} \mid x \leftarrow xs\,]^{Bag}) \\
= & \{\ \text{apply } MC\,[\,e \mid x \leftarrow xs\,]^T\ \} \\
 & flatten^{Bag}\,(map\,(x \mapsto MC\,[\,(x, y) \mid y \leftarrow ys\,]^{Bag})\,xs) \\
= & \{\ \text{apply } MC\,[\,e \mid x \leftarrow xs\,]^T\ \} \\
 & flatten^{Bag}\,(map\,(x \mapsto map\,(y \mapsto (x, y))\,ys)\,xs)
\end{array}$$
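The derived term flatten (map (x => map (y => (x, y)) ys) xs) is exactly what Scala's flatMap computes, so the Bag comprehension corresponds to a nested flatMap/map pipeline. A short check (again using List as a stand-in for Bag):

object ComprehensionTranslation {
  def main(args: Array[String]): Unit = {
    val xs = List("a", "b")
    val ys = List(1, 2)
    val viaFlatten = xs.map(x => ys.map(y => (x, y))).flatten
    val viaFlatMap = xs.flatMap(x => ys.map(y => (x, y)))
    println(viaFlatten)               // List((a,1), (a,2), (b,1), (b,2))
    println(viaFlatten == viaFlatMap) // true
  }
}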

To conclude the section, we prove the following fact. The proof uses the Cata-Map- Fusion law, which is listed in Section 4.1.7.

T-Monads with Zero. A type functor T List, Bag, Set equipped with unitT, P t u flattenT, and zeroT as defined in Uni-Monad-Unit, Uni-Monad-Flatten, and Uni- Monad-Zero forms a monad with zero.

Proof. We first verify that unitT, flattenT, and zeroT are natural transformations. Natu- rality follows immediately from the free theorems of Wadler [Wad89] and the parametric function types associated with the three functions. For the sake of completeness, we list categorical proofs based on calculational reasoning.

To show that sngT : Id T is a natural transformation, verify Naturality as follows. Ñ Tf sngT ˝ A Uni-Map “ t u 41 Chapter 4. Background

sngT f B ˝ Identity Functor “ t u sngT Idf B ˝

T To show that emp : K1 T is a natural transformation, verify Naturality as follows. Ñ Tf empT ˝ A Uni-Map “ t u T empB Category “ t u T emp id1 B ˝ Constant Functor “ t u T emp K1 f B ˝

To show that flattenT : TT T is a natural transformation, use the fact that Tf is a T Ñ SpecTA-homomorphism between the following two GTA-algebras.

T T empA Ź idTA Ź uniA : GTA TA TB p q Ñ T T empB Ź Tf Ź uniB : GTA TB TB p q Ñ This is verified as follows.

T T Tf empA Ź idTA Ź uniA ˝ p q Coproduct-Fusion “ t u T T Tf empA Ź Tf idTA Ź Tf uniA p ˝ q p ˝ q p ˝ q Uni-Map “ t u T T empB Ź Tf Ź uniB Tf Tf p ˝ ˆ q Category “ t u T T empB id1 Ź Tf idTA Ź uniB Tf Tf p ˝ q p ˝ q p ˝ ˆ q Coproduct-Functor-Fusion “ t u T T empB Ź Tf Ź uniB id1 idTA Tf Tf p q ˝ p ` ` ˆ q Uni-Sign “ t u T T empB Ź Tf Ź uniB GTA Tf p q ˝ p q


Naturality of flattenT : TT T is then established by the following calculation. Ñ Tf flattenT ˝ A Uni-Monad-Flatten “ t u T T T Tf empA Ź idTA Ź uniA TA ˝ Cata-Fusion (because Tf is a SpecT -homomorphism) “ t TA u LT T T M empB Ź Tf Ź uniB TA Category, Product Functor “ t u L T M T T empB id1 Ź idTB Tf Ź uniB idTB idTB TA p ˝ q p ˝ q p ˝ ˆ q Coproduct-Functor-Fusion “ t u L T T MT empB Ź idTB Ź uniB id1 Tf idTB idTB TA p q ˝ p ` ` ˆ q Uni-Sign “ t u L T T T M empB Ź idTB Ź uniB G Tf , idTB TA p q ˝ p q Cata-Map-Fusion “ t u L T T T M empB Ź idTB Ź uniB TB T Tf ˝ p q Uni-Monad-Flatten, functor composition “L t M u flattenT TTf B ˝

The proof that T, unitT, flattenT constitutes a Monad is completed in two steps. First, p q we show that Uni-Monad-Unit satisfies the two Monad-Unit equations.

Right unit:

flattenT TunitT A ˝ A Uni-Monad-Flatten, Uni-Monad-Unit “ t u T T T T empA Ź idTA Ź uniA TA TsngA ˝ Cata-Map-Fusion “ t u L T TM T T empA Ź idTA Ź uniA G sngA, idTA A p q ˝ p q Uni-Sign “ t u L T T T M T empA Ź idTA Ź uniA id1 sngA idTA idTA A p q ˝ p ` ` ˆ q Coproduct-Functor-Fusion “ t u L T T T T M empA Ź sngA Ź uniA A Cata-Reflect “L t Mu idTA


Left unit:

flattenT unitT A ˝ TA Uni-Monad-Flatten, Uni-Monad-Unit “ t u T T T T empA Ź idTA Ź uniA TA sngTA ˝ Uni-Fold “L t u M idTA

Second, show that Uni-Monad-Unit and Uni-Monad-Flatten satisfy Monad-Asso. T T Again, the proof relies on the auxiliary fact that flattenA is a SpecTTA-homomorphism between the following two GTTA-algebras

T T empTA Ź idTTA Ź uniTA : GTTA TTA TA p q Ñ T T T empA Ź flattenA Ź uniA : GTTA TA TA p q Ñ which is verified as follows.

T T T T empA Ź flattenA Ź uniA GTTA flattenA p q ˝ Uni-Sign “ t u T T T T T empA Ź flattenA Ź uniA id1 idTTA flattenA flattenA p q ˝ p ` ` ˆ q Coproduct-Functor-Fusion “ t u T T T T T empA Ź flattenA Ź uniA flattenA flattenA p p ˝ ˆ q q Uni-Monad-Zero, free theorem for uniT “ t u T T T T T flattenA empTA Ź flattenA idTTA Ź flattenA uniTA pp ˝ q p ˝ q p ˝ q q Coproduct-Fusion “ t u T T T flattenA empTA Ź idTTA Ź uniTA ˝ p q

Monad-Asso is then verified by the following calculation.

flattenT flattenT A ˝ TA Uni-Monad-Flatten “ t u T T T T flattenA empTA Ź idTTA Ź uniTA TTA ˝ Cata-Fusion (because flattenT is a SpecT -homomorphism) “ t A TTA u T L T T T M empA Ź flattenA Ź uniA TTA Coproduct-Functor-Fusion “ t u L T T M T T empA Ź idTA Ź uniA id1 flattenA idTA idTA TTA p q ˝ p ` ` ˆ q Uni-Sign “L t u M


T T T T empA Ź idTA Ź uniA G flattenA, idTA TTA p q ˝ p q Cata-Map-Fusion “ t u L T T T T M empA Ź idTA Ź uniA TA TflattenA ˝ Uni-Monad-Flatten “L t M u flattenT TflattenT A ˝ A

Finally, to show that T, unitT, flattenT, zeroT constitutes a Monad with Zero, we show p q that Uni-Monad-Zero satisfies the two Monad-Zero equations.

Outer:

flattenT zeroT A ˝ TA Uni-Monad-Flatten, Uni-Monad-Zero “ t u T T T T empA Ź idTA Ź uniA TA empTA ˝ Uni-Fold, Uni-Monad-Zero “ t u L T M zeroA

Inner:

flattenT T zeroT ¡ A ˝ p A ˝ Aq Uni-Monad-Flatten, Uni-Monad-Zero “ t u T T T T empA Ź idTA Ź uniA TA T empA ¡A ˝ p ˝ q Cata-Map-Fusion, Uni-Sign “ t u L T TM T T empA Ź idTA Ź uniA id1 empA ¡A idTA idTA A p q ˝ p ` p ˝ q ` ˆ q Coproduct-Functor-Fusion “ t u L T T T T M empA Ź empA ¡A Ź uniA A p ˝ q Uni-Unit applied in step iii of the catamorphism “L t M p q u empT ¡ A ˝ A Uni-Monad-Zero “ t u zeroT ¡ A ˝ A


4.1.7 Fusion

A number of useful laws can be established from the properties of the collection type functors T and the model categories $Spec^T_A$. Three of those are of special interest for us, as they explain the optimizing program transformations from Example 2.3. The first law fuses a $Spec^T_B$-catamorphism $(|\beta|)^T_B$ with a preceding functor application $Tf : TA \to TB$.

T T reduceA p error Ź idA Ź p A “ Substituting the above definition in Cata-Map-FusionL Myields the following equation.

T T T reduceB p map f error Ź f Ź p B ˝ “ A chain of map and reduce applications therebyL can be fusedM into a single catamorphism – i.e., they can be evaluated with a single pass over the data. This fact is well known in the area of dataflow engines – systems like Flink and Spark automatically execute chains of map and reduce operators in a pipelined manner.

T The second law fuses a pair of SpecA-catamorphisms with result types B and C into a single SpecT -catamorphism with result type B C. A ˆ T T T β γ β GA πB γ GA πC (Banana-Split) A ˆ A “ p ˝ q ˆ p ˝ q A Together withL Cata-Map-FusionM L M L , the law explains theM optimizing program transfor- mation applied to the “movies by Hitchcock and Allen” code in Example 2.3. We first apply Cata-Map-Fusion, fusing the two map applications (which return either 1 or 0 depending on the director’s name) with the subsequent reduce applications (which sum the resulting bag of numbers). Then, we apply Banana-Split and obtain a single catamorphism which accumulates movies by Alfred Hitchcock and Woody Allen simulta- neously as a pair of counts. Finally, we apply Cata-Map-Fusion in the reverse direction, resulting in a version of the fused catamorphism expressed as a reduceTg mapTf dataflow. ˝ In Section 7.2.1, we describe how these optimizing rewrites can be performed as part of the Emma compiler pipeline.

The third law makes use of the following monadic definition of groupBy

T T T groupBy k xs kx, x x xs; k x kx kx distinct k x x xs “ r p r | Ð “ s q | Ð r | Ð s s

46 4.2. Static Single Assignment Form and fuses a groupBy k xs with a catamorphism applied to the values of each group.

T T idK z Ź i Ź p A groupBy k xs p ˆ q p q (Fold-Group-Fusion) “ ¯ T L M T T kx, z Ź i Ź p A xs kx distinct k x x xs r p q | Ð r | Ð s s ¯ The functionL i usedM in the above equation is defined as follows.

$$\bar{i}\,x = \begin{cases} i\,x & \text{if } k\,x = kx \\ z & \text{otherwise} \end{cases}$$

The insights of Fold-Group-Fusion and Uni-Fold-Dist explain why Spark dataflows should be specified in the shape given by (4.2) and not in the more intuitive shape given by (4.3), as illustrated by the "movies per decade" code snippet from Example 2.3.

$$reduceByKey\,p \circ mapValues\,i \circ keyBy\,k \tag{4.2}$$
$$mapValues\,(vs \Rightarrow (|error \triangledown i \triangledown p|)\,vs) \circ groupBy\,k \tag{4.3}$$

The (4.2) shape facilitates an execution strategy where the catamorphism $(|z \triangledown \bar{i} \triangledown p|)^T_A$ is applied twice in parallel – once on each partition $xs_i$, and once more after partial results are repartitioned based on their kx value. This strategy is more efficient because it reduces the amount of shuffled data. On the other side, as we already indicated in Chapter 3, dataflows shaped like (4.3) do not permit the same execution strategy, because the $vs \Rightarrow (|error \triangledown i \triangledown p|)\,vs$ function cannot be inspected or modified using the embedding methodology adopted by Spark. An analogous argument holds for Flink. In Section 7.2.2, we propose an automatic optimization which builds on Fold-Group-Fusion and translates Emma dataflows with (4.3) shape into dataflows with (4.2) shape.
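The two shapes can be contrasted on plain Scala collections with a made-up "count per decade" example (the movie data set from Example 2.3 is not available here; groupMapReduce assumes Scala 2.13). Shape (4.2) keys each element, maps it to a partial aggregate, and reduces per key; shape (4.3) groups first and folds each group afterwards. Only the first shape exposes the per-key fold, which is what reduceByKey-style operators exploit to pre-aggregate before the shuffle.

object GroupFoldShapes {
  def main(args: Array[String]): Unit = {
    val years  = Seq(1959, 1960, 1963, 1977, 1979)
    val decade = (y: Int) => (y / 10) * 10

    // shape (4.2): key, map to a partial aggregate, reduce per key
    val shape42 = years
      .map(y => decade(y) -> 1)
      .groupMapReduce(_._1)(_._2)(_ + _)

    // shape (4.3): group first, then fold each group's values
    val shape43 = years
      .groupBy(decade)
      .map { case (k, vs) => k -> vs.map(_ => 1).sum }

    println(shape42)            // Map(1950 -> 1, 1960 -> 2, 1970 -> 2)
    println(shape42 == shape43) // true
  }
}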

4.2 Static Single Assignment Form

Language compilers typically perform a number of program optimizations. These are usually conditioned on analysis information derived from the data- and control-flow structure of the underlying program. An IR facilitating this kind of analysis is therefore a necessary prerequisite for any optimizing compiler. Since the beginning of the 1990s, SSA form and its functional encoding – ANF – have been successfully used in a number of programming language compilers. As the IR of the DSL proposed in this thesis depends on ANF, this section introduces the main ideas behind SSA and ANF based on a simple example (Figure 4.1). For a more thorough primer on these concepts, we refer the reader to the overview paper by Appel [App98].

The source code formulation of the example program (Figure 4.1a) offers various degrees of syntactic freedom. For instance, we could have inlined y in its call sites or defined z as a variable assigned in the two branches. Therefore, it is hard to define program analysis on top of a source-code-like syntax tree – we have to accommodate for all forms of variability and their (possibly non-orthogonal) interactions. In contrast, the derived SSA graph (Figure 4.1b) offers a normalized representation of the source code where data- and control-flow information is encoded directly.

[Figure 4.1: Example program in source, SSA, and ANF forms – (a) the source code, (b) the SSA graph with basic blocks and a phi node z = phi(z1, z2), and (c) the equivalent ANF code with continuations k1–k3.]

The defining properties of the SSA form are that (i) every value is assigned only once, and (ii) every assignment abstracts over exactly one function application. In the SSA version of our example, the sub-expression h(x) is assigned to a fresh variable x1 and referenced in the division application bound to y. Values with control-flow dependent data dependencies are encoded as phi nodes. In our example, z = phi(z1, z2) indicates that the value of z corresponds to either z1 or z2, depending on the input edge along which we have arrived at the corresponding block at runtime.

The SSA graph can also be represented as a functional program in ANF (Figure 4.1c). In this representation, control-flow blocks are encoded as continuation functions such as k1 – k3, and control-flow edges are encoded as continuation calls such as k3(z1) or k3(z2). Values bound to the same continuation parameter correspond to phi nodes. For example, z1 and z2 are bound to the z parameter of k3, which corresponds to the z = phi(z1, z2) definition in Figure 4.1b.

5 Source Language

To address the problems with state-of-the-art parallel dataflow eDSLs outlined in Chapter 2, we propose Emma – a quotation-based DSL embedded in Scala. Section 5.1 outlines the linguistic features and restrictions driving our design. Based on those, in Section 5.2 we derive a formal definition of Emma Source – a subset of Scala accepted by our quotation-based compiler. Finally, Section 5.3 presents and illustrates the programming abstractions that form the core Emma API.

5.1 Linguistic Features and Restrictions

As outlined in Chapter 3, we claim that problems with state-of-the-art eDSLs for parallel collection processing are a consequence of the adopted type-based embedding strategy. The difficulties stem from the fact that program structure critical for optimization is either not represented or treated as a black box in the IR lifted by these eDSLs. To tackle these problems, we analyzed a wide range of algorithms implemented in the Spark RDD and Flink DataSet eDSLs and identified a set of host language features needed to express these algorithms in a direct and concise way. The ability to freely compose these features at the Scala (source) level and reflect them at the IR level is crucial for any eDSL that wants to attain maximal linguistic reuse without sacrificing optimization potential. The features are:

(F1) control-flow primitives such as if-else, while, and do-while;
(F2) var and val definitions as well as var assignments;
(F3) lambda function definitions;
(F4) def method calls and new object instantiations;
(F5) statement blocks.

In addition to those, the following Scala features are either defined as syntactic sugar that desugars in terms of (F1-F5) in the Scala ASTs, or they can be eliminated with a simple transformation:

(F6) for-comprehensions – those are represented as chains of nested flatMap, withFilter, and map calls using a desugaring scheme similar to the MC transformation in Section 4.1;
(F7) irrefutable patterns (that is, patterns that are statically guaranteed to always match) – those can be transformed in terms of val definitions and def calls;
(F8) for loops – those are represented as foreach calls in the Scala AST and can be subsequently rewritten as while loops.

Some restrictions are also made in order to simplify the definition and development of the compiler frontend and the optimizing program transformations presented in the rest of this thesis.

(R1) def definitions;
(R2) lazy and implicit val definitions;
(R3) refutable patterns;
(R4) call-by-name parameters;
(R5) try-catch blocks;
(R6) calls of referentially opaque (that is, effectful) def methods;
(R7) var assignments outside of their defining scope (i.e., inside a lambda).

We proceed by formalizing a user-facing language called Emma Source. The abstract syntax of Emma Source models a subset of Scala covering (F1-F5) and therefore can be easily derived from the ASTs of quoted Scala code snippets.

5.2 Abstract Syntax

The Emma Source specification presented below relies on the following terminology and notational conventions. The approach is based on metaprogramming – the ability of computer programs to treat other programs as data. The language in which the metaprogram is written is called metalanguage. The language being manipulated is called object language. The ability of a programming language to be its own metalanguage is called reflection. Emma Source represents a subset of Scala, and (since it is an embedded DSL) the metalanguage is also Scala. The compiler infrastructure presented in the next sections is based on Scala’s compile- and runtime reflection capabilities.

We denote metalanguage expressions in italic and object language expressions in a teletype font family. The following example illustrates the difference.

    xs.take(10)  (metalanguage)   ⇔   xs.take(10)  (object language)

Syntactic forms in the object language may be parameterized over metalanguage variables standing for other syntactic forms. For example, t.take(10) represents an object language expression where t ranges over object language terms like xs or ys.tail.

t ::=                               term
    t.m[Tj](ts)i                    method call
    new X[Tj](ts)i                  new instance
    pdefs => t                      lambda
    t.module                        module access
    t : T                           ascription
    if (t1) t2 else t3              conditional
    { stats; t }                    statement block
    a                               atomic

a ::=                               atomic
    lit                             literal
    this                            this reference
    module                          module reference
    x                               binding reference

stat ::=                            statement
    x = t                           assignment
    loop                            loop
    bdef                            binding definition

loop ::=                            loop
    while (t) block                 while
    do block while (t)              do-while

bdef ::=                            binding definition
    val x : T = t                   val definition
    var x : T = t                   var definition
    x : T                           parameter definition (pdef)

Figure 5.1: Abstract syntax of Emma Source.

A name suffixed with s denotes a sequence, and an indexed subexpression a repetition.

For example, (ts)i denotes repeated term sequences enclosed in parentheses.

The abstract syntax of Emma Source is specified in Figure 5.1. The language consists of two mutually recursive definitions – terms, which always return a value, and statements, which modify the computation state. Some of the more critical aspects of the syntax are discussed below.

First, we assume that Source expressions are (i) typed – that is, every term is related to a unique type t : T, and (ii) named – that is, every definition is related to a unique symbol with an enclosing scope. Both assumptions are already met when using compile-time reflection (i.e., Scala macros), and can be enforced at runtime with an extra typeCheck(expr) call. This implies that m in method calls, X in new instances, and x in binding references denote unique symbols rather than ambiguous names.
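For illustration, the runtime variant of this step can be realized with Scala’s reflection toolbox. The snippet below is a hedged sketch (it requires the scala-compiler/toolbox artifacts on the classpath) and is not Emma-specific code.

import scala.reflect.runtime.currentMirror
import scala.tools.reflect.ToolBox

val tb    = currentMirror.mkToolBox()
val tree  = tb.parse("Seq(1, 2, 3).map(_ + 1)") // an untyped object-language AST
val typed = tb.typecheck(tree)                  // every sub-term now carries a type and a symbol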


Second, the language offers mechanisms for (i) abstraction – via lambda terms, and for (ii) control-flow – via conditional terms and loop constructs. Crucially, the abstract syntax ensures that (ii) is stratified with respect to (i). Were recursive functions (def definitions in Scala) included in Source, this assumption would have been violated. This restriction simplifies the definition of a decision procedure for the concept of binding context in Section 6.4.

5.3 Programming Abstractions

The core programming abstraction is a trait Bag[A] which represents a distributed collection with elements of type A and a matching BagCompanion trait defining various Bag constructors. The API is listed in Figure 5.2. To illustrate and outline key differences between the Bag and RDD/DataSet APIs, in the remainder of this section we re-cast some examples from Section 2.3.

5.3.1 Sources and Sinks

The data source operators in the BagCompanion trait define various Bag constructors. For each source there is a corresponding sink which operates in the reverse direction.

val movies  = Bag.readCSV[Person]("hdfs://.../movies.csv", ...) // from file
val credits = Bag.from(creditsRDD)  // from a Spark RDD / Flink DataSet
val people  = Bag.apply(peopleSeq)  // from a local Scala Seq

5.3.2 Select-From-Where-like Syntax

The operators in the right column in Figure 5.2 define a Scala-native interface for parallel collection processing similar to SQL. Binary operators like join and cross are omitted from the API. Instead, the Bag type implements the monad operations discussed in Section 4.1. This allows for declarative Select-From-Where-like expressions using Scala’s for-comprehension syntax. The joins from Example 2.2 can be expressed as follows.

// join movies, credits, and people and build intermediate (m, c, p) triples
val ys = for {
  m <- movies
  c <- credits
  p <- people
  if m.id == c.movieID
  if p.id == c.personID
} yield (m, c, p)
// pattern-match (m, c, p) triples and project final result
for { (m, c, p) <- ys } yield (m.title, p.name)


Data sinks (in Bag[A]):
    fetch(): Seq[A]
    as[DColl[_]]: DColl[A]
    writeParquet(path: String, ...): Unit
    writeCSV(path: String, ...): Unit

Data sources (in BagCompanion):
    empty[A]: Bag[A]
    apply[A](values: Seq[A]): Bag[A]
    from[DColl[_], A](coll: DColl[A]): Bag[A]
    readParquet[A](path: String, ...): Bag[A]
    readCSV[A](path: String, ...): Bag[A]

SQL-like (in Bag[A]):
    map[B](f: A => B): Bag[B]
    flatMap[B](f: A => Bag[B]): Bag[B]
    withFilter(p: A => Boolean): Bag[A]
    union(that: Bag[A]): Bag[A]
    groupBy[K](k: A => K): Bag[Group[K, Bag[A]]]
    distinct: Bag[A]

Folds (in Bag[A]):
    fold[B](alg: Alg[A, B]): B
    size: Long = fold(0L)(_ => 1L, _ + _)
    nonEmpty, min, max, ...

Figure 5.2: BagA and BagCompanion API.

Moreover, maintaining comprehension syntax at the IR level allows us to employ query compilation techniques such as projection- and filter-pushdown (see Section 7.1).

5.3.3 Aggregation and Grouping

Aggregating the values of a Bag is based on a single primitive – fold – which represents structural recursion over Union-style bags. The method accepts a Union-algebra instance that encapsulates substitution functions for the three basic Union-style bag constructors. The algebra trait Alg and an example Size algebra that counts the number of elements in the input collection are defined as follows.

trait Alg[-A, B] {
  val zero: B
  val init: A => B
  val plus: (B, B) => B
}

object Size extends Alg[Any, Long] {
  val zero: Long = 0
  val init: Any => Long = const(1)
  val plus: (Long, Long) => Long = _ + _
}

Various common folds are aliased as dedicated methods. For example, xs.size is defined as follows.

def size: Long = this.fold(Size) // using the ’Size’ algebra from above
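Other aggregates follow the same pattern. As a hedged example, a minimum over an integer Bag could be expressed with the following Union-algebra; MinInt is an illustrative name, not part of the Emma library, and the corresponding alias would be defined in the same way as size above.

object MinInt extends Alg[Int, Option[Int]] {
  val zero: Option[Int] = None                 // neutral element for empty bags
  val init: Int => Option[Int] = x => Some(x)  // per-element value
  val plus: (Option[Int], Option[Int]) => Option[Int] = {
    case (Some(a), Some(b)) => Some(math.min(a, b))
    case (a, b)             => a orElse b
  }
}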

Per-group aggregations are defined in a straight-forward way using groupBy and for- comprehensions. Example 2.3 can be written as follows.

for {
  Group(d, ms) <- movies.groupBy(decade(_.year))
} yield (d, ms.size)

Rewriting this definition in terms of primitives such as reduceByKey is enabled by (i) the insight that structural recursion (i.e., folds) over Union-style collections models data-parallel computation, and (ii) the ability to represent nested Bag computations in the IR. Details are discussed in Section 7.2.

5.3.4 Caching and Native Iterations

The API does not require explicit caching. Bag terms referenced more than once or inside a loop are implicitly cached (Section 7.3). For example, in

val S = static()
var w = init() // outer ‘w‘
for (i <- 0 until N) {
  w = update(S, w) // inner ‘w‘
}

both S and the inner w will be implicitly cached. In addition, we propose API extensions and transformations that rewrite loop structures to Flink’s iterate operator whenever possible (Section 7.4).

5.3.5 API Implementations

The Bag and BagCompanion traits are implemented once per backend. At the moment, we provide a FlinkBag (backed by a Flink DataSet) and a SparkBag (backed by a Spark Dataset). The backend implementation is introduced transparently as part of the compilation pipeline as sketched in Section 6.5. This design allows for introducing other backends in the future. A ScalaBag (backed by a Scala Seq) is used by default – the Bag object just delegates to the ScalaBag companion. Unquoted Emma code snippets can therefore be executed and debugged as regular Scala programs. Consequently, programmers can first focus on writing correct code without thinking about distributed execution, and quote the code in order to parallelize it later.

6 Core Language

In line with the envisioned optimizations, we propose a functional intermediate language called Emma Core. To simplify program analysis, we build on the A-normal form (ANF) of Flanagan et al. [FSDF93] presented in Section 4.2. To that end, in Section 6.1 we define Emma CoreANF and present a translation scheme from Emma Source to Emma CoreANF. To accommodate for SQL-like program rewrites such as join-order enumeration, in Section 6.2 we add first-class support for monad comprehensions, extending Emma CoreANF to Emma Core, and in Section 6.3 we sketch a comprehension normalization scheme. Section 6.4 introduces the notion of binding context. Finally, Section 6.5 gives an overview of the Emma compiler pipeline.

6.1 Administrative Normal Form

The abstract syntax of the Emma CoreANF language is specified in Figure 6.1. Below, we outline the main differences between the terms and statements in Emma CoreANF and Emma Source.

The sub-language of atomic terms (denoted by a) is shared between the two languages. Imperative statement blocks are replaced by functional let blocks. Terms that may appear on the right-hand side of val definitions are restricted from t to b, ensuring that all sub-terms (except lambda) are atomic. Loops are replaced by continuation functions in the so-called direct-style, and var definitions and assignments are replaced by continuation parameters. Continuation definitions appear after the vdefs sequence in let blocks and are called only at the return position c. As noted by Appel [App98] and illustrated in Section 4.2, the resulting functional representation corresponds to the SSA form commonly used in modern compilers. In particular:

• Value assignments are static – every value x is associated with a unique binding definition.


a ::= ... (as in Figure 5.1)        atomic

b ::=                               binding
    a.m[Tj](as)i                    method call
    new X[Tj](as)i                  new instance
    pdefs => let                    lambda
    a.module                        module access
    a : T                           ascription
    a                               atomic

let ::= { vdefs; kdefs; c }         let block

stat ::=                            statement
    kdef: def k(pdefs) = let        continuation definition
    bdef                            binding definition

bdef ::=                            binding definition
    vdef: val x = b                 val definition
    pdef: x : T                     parameter definition

c ::=                               continuation call
    if (a) k(as) else k(as)         branching
    k(as)                           simple
    a                               atomic

Figure 6.1: Abstract syntax of Emma CoreANF .

• Control-flow blocks are in 1:1 correspondence with let blocks. Every block is uniquely identified by its nearest enclosing lambda or continuation definition. If this does not exist, then the block is the (unique) root of the control-flow graph. • Control-flow edges are in 1:1 correspondence with continuation calls. Calling a

continuation k2 from the let block enclosed in k1 implies a control-flow edge from k1 to k2.
• SSA φ nodes are in 1:1 correspondence with arguments bound to continuation parameters. A continuation def k(x: X, y: Y) = ... called twice with k(v, w) and k(v′, w′) implies two φ calls x = φ(v, v′) and y = φ(w, w′).
• The dominance tree associated with the control-flow graph is in 1:1 correspondence with the hierarchy induced by nested continuation definitions.

The translation from Source to CoreANF is defined as the composition of two distinct transformations. The anf transformation (Figure 6.2) destructs compound t terms as statement blocks with restricted structure. Each sub-term becomes a named b-term in a val definition. The return expression of the resulting block is always atomic. The anf-var and anf-asgn rules ensure that terms appearing on the right-hand side of var definitions and assignments are always atomic. The range of the transformation is denoted as SourceANF. To illustrate anf in action, consider the following expression.

    anf⟦ { z = x * x + y * y; Math.sqrt(z) } ⟧

anf z x x y y; Math.sqrt z t “ ˚ ` ˚ p q u The resulting block directlyJ encodes the data-flow dependenciesK of the original program.

    { val u1 = x * x; val u2 = y * y; val u3 = u1 + u2; z = u3; val u4 = Math.sqrt(z); u4 }

[Figure 6.2 lists the inference rules defining the anf transformation: anf-atom, anf-blck, anf-ascr, anf-fun, anf-if, anf-mod, anf-call, anf-new, anf-val, anf-var, anf-asgn, anf-loop, anf-wdo, and anf-dow.]

Figure 6.2: Inference rules for the anf transformation.


[Figure 6.3 lists the inference rules defining the dscf transformation: dscf-ref1, dscf-ref2, dscf-ascr, dscf-mod, dscf-fun, dscf-call, dscf-new, dscf-let, dscf-val, dscf-var, dscf-asgn, dscf-lit, dscf-this, dscf-if1, dscf-if2, dscf-wdo, and dscf-dow.]

Figure 6.3: Inference rules for the dscf transformation.


The subsequent translation from SourceANF to CoreANF is handled by the dscf transformation (Figure 6.3). For Source terms t without var definitions, assignments, and control-flow, dscf will simply map the stats blocks in anf⟦t⟧ to CoreANF let blocks. To eliminate variables, the transformation relies on an environment V that keeps track of the most recent atomic term a associated with each variable x. The environment is updated in rules dscf-var and dscf-asgn and accessed in rule dscf-ref2. Loop constructs and conditional terms are translated by rules dscf-if1, dscf-if2, dscf-wdo, and dscf-dow. The antecedents of these rules rely on two auxiliary functions: R⟦t⟧ computes the set of binding symbols referenced in t, while A⟦t⟧ computes the set of variable symbols assigned in t. Variables xi assigned in the matched control-flow structure are converted to parameters pi in the corresponding continuation definitions. Rules dscf-if1 and dscf-if2 handle a conditional term of the form

    { val x = if (a) { ss1; a1 } else { ss2; a2 }; ss3; c3 }

and diverge depending on whether x ∈ R⟦{ ss3; c3 }⟧ or not. If x is referenced in the suffix, the signature and the calls of the corresponding continuation k3 have to be adapted accordingly.

The dscf rewrite also ensures certain properties of the resulting trees. First, the dominator tree of the control-flow graph is encoded by the parent-child relationship of the nested continuation function definitions. Second, lexical scope is preserved – continuation functions do not have parameters that always bind to the same argument. As these properties are commonly assumed by compiler optimizations, this alleviates the need for a dedicated lambda dropping rewrite [DS00]. Third, excluding terms in nested lambda bodies, the resulting term has exactly one let block of the form { vals; a }, which we denote as suffix⟦t⟧.

6.2 First-Class Monad Comprehensions

An IR for Emma should accommodate common optimizations from both the language and the query compilation domains. Emma CoreANF is a good fit for the first, but too

b ::= ...                           binding term
    for { qs } yield let            comprehension

stat ::= ...                        statement
    q                               qualifier

q ::=                               qualifier
    x <- let                        generator
    if let                          guard

Figure 6.4: Extending the abstract syntax of Emma CoreANF to Emma Core.


res-map:     X(f) = x: A => let      a: M[A]
             ─────────────────────────────────────────────────────────
             X ⊢ a.map(f) ↦ for { x <- a } yield let

res-fmap:    X(f) = x1: A => let     a: M[A]
             ─────────────────────────────────────────────────────────
             X ⊢ a.flatMap(f) ↦ for { x1 <- a; x2 <- let } yield { x2 }

res-filter:  X(f) = x: A => let      a: M[A]
             ─────────────────────────────────────────────────────────
             X ⊢ a.withFilter(f) ↦ for { x <- a; if let } yield { x }

Figure 6.5: Inference rules for the resugarM transformation. The type former M should be a monad, i.e., it should implement map, flatMap, and withFilter obeying the “monad with zero” laws.

low level for the second. Query compilation usually starts with queries expressed in a Select-From-Where-like concrete syntax and translates them to a relational algebra expression. To that end, most commonly one uses a join graph extracted from the Select-From-Where expression as a basis for join-order enumeration based on dynamic programming [SAC+79, GLSW93, MN06]. In line with the similarities between Select-From-Where expressions and for-comprehensions outlined in Section 5.3.2, our goal is to enable similar techniques on Emma Bag comprehensions. Unfortunately, traditional ANF forms such as CoreANF encode for-comprehensions in their desugared form, i.e., as chains of nested flatMap, withFilter, and map operators. To overcome this limitation, we extend CoreANF with support for first-class monad comprehension syntax.

The resulting extended language, called Emma Core, is depicted on Figure 6.4. Observe that, similar to lambdas, sub-terms in the new syntactic forms – comprehension head, generator right-hand-side, and guard expression – are restricted to be let blocks. This constraint simplifies definitions on top of Emma without loss of expressive power – a terms expand to { a } and b terms to { val x = b ; x }.

The translation from Emma CoreANF to Emma Core proceeds in two steps. First, we apply the resugarBag transformation, which converts flatMap, withFilter, and map calls on Bag targets to simple monad comprehensions. Figure 6.5 lists the main inference rules of resugar. Application of these rules depends on a context X of available lambda definitions. Due to space considerations, the rules that accumulate X and eliminate the monad operator applications if f is not in X are omitted from the figure. Informally, the transformation operates as if X is defined as follows.


    X(f) = x: A => let                        if val f = (x: A) => let is available in scope
    X(f) = x: A => { val x1 = f(x); x1 }      otherwise

In other words, if f is not available in the current scope, X associates f with an eta-expansion of itself – that is, with a function that just applies f. This allows us to resugar not only terms representing desugared for-comprehensions, but also Emma Source terms defined directly in desugared form, as for example xs.withFilter(isPrime) where isPrime is not defined in the quoted code fragment.

6.3 Comprehension Normalization

Upon applying the resugarBag transformation, we proceed with a normalization step that repeatedly constructs bigger comprehensions by merging def-use chains of smaller ones. The benefits of this process are motivated by the targeted query compilation techniques – optimizing bigger comprehensions improves the chances of producing better execution plans.

The normalizing transformation normalizeM consists of a single rule – unnest-head, which is applied repeatedly until convergence. The rule is depicted in Figure 6.6. The consequent matches an enclosing let block which contains an M[A] comprehension definition identified by x1 with a generator symbol x3 that binds values from x2. The rule triggers if x2 identifies a comprehension which is defined in vdefs1 or vdefs2 and is referenced only once (in the x1 definition). The rewrite depends on the auxiliary functions split, fix and remove, which operate as follows. First,

    remove(x, vdefs)

removes a value definition val x = b from vdefs. Second,

    split(vdefs, qs)

partitions vdefs into two subsequences – vdefsD and vdefsI – which respectively (transitively) depend and do not depend on generator symbols defined in qs. Finally,

    fix(e)

where e = x ← let | if let | let, adapts let = { vals; defs; c } in two steps. First, it obtains let′ by inlining let2 (the definition of x3) in let. If x3 ∉ R⟦let⟧ we have let′ = let; otherwise let′ is derived from let2 by extending its suffix suffix⟦let2⟧ = { valsS; aS } to { valsS; [x3 := aS](vals; defs; c) }. Second, copies of the dependent values vdefs2D that are referenced in let′ are prepended to let1.


[Figure 6.6 depicts the unnest-head rewrite rule, which matches an M[A] comprehension bound to x2 that is referenced only once – as the right-hand side of a generator x3 in the comprehension bound to x1 – and merges it into the enclosing comprehension.]

Figure 6.6: The unnest-head rule used in the normalizeM transformation. As in Figure 6.5, M can be any type former which is also a monad.

6.4 Binding Context

The Bag abstraction presented in Section 5.3 allows for nesting. A nested Bag can be constructed either as a result of a groupBy application or directly, e.g. by the following expression.

    val xs: Bag[Bag[String]] = for { d <- docs } yield tokenize(d)

While this leads to a more unified programming model, it poses some challenges at compile-time. The goal of the Emma compiler is to execute Bag expressions on a parallel dataflow engine by implementing them in an engine-specific API such as Spark’s RDD or Flink’s DataSet. Due to the limitations identified in Chapter 3, however, the target APIs lack the nesting support of our Bag abstraction. To write the above expression in Spark’s RDD API, for example, one has to change the return type of tokenize to Seq[String] instead of RDD[String].

    val xs: Bag[Seq[String]] = for { d <- docs } yield tokenize(d)

A naïve type-directed translation scheme which implements all Bag terms in the target API therefore is not a feasible compilation strategy, as it might lead to runtime errors. Instead, we want to translate only those Bag expressions that occur at the top level – that is, those that are not nested within other Bag expressions. To achieve that, we have to estimate the binding context of all symbols.

Definition 6.1 (Binding Context). The binding context of a binding symbol x, denoted C(x), is a value from the {Driver, Engine, Ambiguous} domain that identifies the context in which that symbol might be bound to a value (i.e., evaluated) at runtime.


val f = (doc: Document) => {
  // ... extract ‘brands‘ from ‘doc‘
  brands
}
val bs = f(d0)
val rs = for {
  d <- docs
  b <- f(d)
  if bs contains b
} yield d

C(x) = Driver      if x ∈ {f, bs, rs}
C(x) = Engine      if x ∈ {d, b}
C(x) = Ambiguous   if x ∈ {doc, brands}

(a) Emma Source snippet          (b) computed binding context values

Figure 6.7: Binding context example.

To determine C(x) for all binding symbols x defined in an Emma Core term t we use a procedure called context⟦t⟧. To illustrate how context works, consider the example from Figure 6.7a. We first define a function f which extracts brands mentioned in a document doc. Upon that, we use f to compute the Bag of brands bs mentioned in a seed document d0. Finally, from the Bag of documents docs we select only those documents d which mention a brand b also contained in bs.

Figure 6.7b depicts the result of the context procedure for this example snippet. The binding context of symbols defined in the outer-most scope (such as f, bs, and rs) is always Driver. The binding context of symbols representing comprehension generators (such as d and b) is always Engine. The binding context of symbols nested in lambdas, however, depends on the lambda uses. In our running example, f is used both in a Driver context (in the bs definition), as well as in an Engine context (in the rs definition). Consequently, the binding context of all symbols defined in the f lambda (such as doc and brands) is Ambiguous. The context of nested lambdas can be computed recursively.

We want to specialize all definitions of terms which denote a Bag constructor application and are evaluated in the driver. As a conservative approach, we therefore prohibit programs where such terms have an Ambiguous binding context. In our running example, compilation will fail because C(brands) = Ambiguous. To alleviate this restriction, one can employ a more elaborate strategy that duplicates lambdas with ambiguous use (such as f) and disambiguates their use sites. In practice, however, the data analysis programs we analyzed and implemented so far did not suffer from this issue, so we opted for the more restrictive, but simpler approach.

6.5 Compiler Pipelines

Putting the pieces together, we can now sketch the high-level layout of all compilation pipelines realized on top of the Emma compiler infrastructure. The transformations presented so far form the basis of a generic compiler frontend which is defined as follows.

    lift = normalizeBag ∘ resugarBag ∘ dscf ∘ anf

Quoted Emma Source terms are first lifted from Emma Source to Emma Core by the lift pipeline. The resulting term is then iteratively transformed by a chain of Core ⇒ Core optimizing transformations such as the ones discussed in Section 7.1 through Section 7.4.

    optimize = optimize_n ∘ ... ∘ optimize_1

The specific optimizing transformations might be defined either in a backend-agnostic or a backend-specific way. The concrete optimize chain is backend-dependent, as it contains at least one backend-specific transformation (e.g., native iteration specialization for Flink or structured API specialization in Spark). Nevertheless, the optimize chain always ends with a transformation that specializes the backend. This transformation identifies vdef terms of the form val x = Bag.m[...](...), where C(x) = Driver and m matches one of the source methods listed in Figure 5.2. The Bag companion object is then substituted either with SparkBag or FlinkBag, depending on the desired backend.

Finally, we apply an inverse dscf transformation that “lowers” the resulting code to an executable form. Continuation definitions are thereby converted back to control-flow statements (such as while loops), and continuation parameters are converted back to variable definitions and variable assignments. This is necessary in order to avoid stack overflow errors at runtime, as in contrast to other functional programming languages, Scala eliminates tail calls only in self-recursive methods.

We end up with two different basic pipelines defined as Scala macros – one for Spark and one for Flink. Scala code that is “quoted” (that is, enclosed) in one of these macros is transformed by the corresponding pipeline. Optimizing transformations in the pipeline can be switched off and on with an optional configuration. The following snippet illustrates the use of Emma macros.

onSpark("noCache.conf"){ onFlink { val ds = Bag.readCSV[Doc](...) val ds = Bag.readCSV[Doc](...) val ps = tfidf(d) val ps = tfidf(d) ...... } }

The first snippet compiles the enclosed code for Spark with a customized optimize pipeline that excludes automatic cache insertion. The second snippet compiles the same code for Flink using the default Flink pipeline. The ds vdef is specialized accordingly in each case.

In order to facilitate modularity, we also add a lib macro-annotation which can be used to annotate objects containing library functions. Quoted calls to such functions are (recursively) inlined before applying the lift frontend, and lambdas used only once in a direct application are β-reduced before the optimize step. In the above example, tfidf is a library method that calls another library method called tokenize, so both methods are first inlined in the enclosing code snippet. This mechanism enables authors to write modular and composable libraries based on Emma without impeding the optimization potential of quoted code fragments in which these libraries are used.


7 Optimizations

Having established Emma Core as an IR for our embedded DSL, we can now demonstrate its utility on a variety of enabled optimizations. Section 7.1 discusses a compilation scheme that translates comprehension syntax terms into parallel dataflows comprised of binary combinators such as equiJoin and cross. Section 7.2 discusses an optimization which reduces the data-shuffle footprint of an application through automatic insertion of primitives for partial aggregates. Section 7.3 outlines a strategy for automatic insertion of cache calls. Finally, Section 7.4 presents a Flink-specific optimization which introduces specialized iterate calls for certain types of while loops.

7.1 Comprehension Compilation

The Emma Core language presented in Chapter 6 resugars applications of Bag monad operators as Bag comprehensions, normalizes those, and integrates the result as first-class syntax in the Emma Core IR. The Emma compiler then has to rewrite the normalized Bag comprehensions as dataflow expressions based on the operators supported by the targeted parallel dataflow engines.

7.1.1 Naïve Approach

A naïve approach is to adopt Scala’s desugarBag scheme (see F6 from Section 5.1). Unfortunately, this strategy can easily produce suboptimal dataflows. To illustrate why, let e denote a comprehension defining an equi-join between two Bag terms xs and ys.

    for { x <- xs; y <- ys; if kx(x) == ky(y) } yield (x, y)

Then desugarBag⟦e⟧ denotes the following dataflow expression.

    xs.flatMap(x => ys.withFilter(y => kx(x) == ky(y)).map(y => (x, y)))


A subsequent specialization of xs to a FlinkBag or SparkBag will parallelize the application of the flatMap operator. However, the withFilter and map calls on ys are nested inside the already parallelized flatMap lambda. The resulting dataflow therefore acts like a broadcast nested-loop join where ys corresponds to the inner (broadcast) and xs to the outer (partitioned) relation.

7.1.2 Qualifier Combination

As we saw in the Flink and Spark examples listed in Section 2.3, the parallel dataflow engines we target support efficient distributed equi-joins via dedicated operators. To utilize those, we adopt the approach of Grust [GS99, Gru99] and abstract over equi-join and cross comprehensions with corresponding comprehension combinator definitions.

def equiJoin[A,B,K]( kx: A => K, ky: B => K)(xs: Bag[A], ys: Bag[B] ): Bag[(A,B)] = for { x <- xs; y <- ys; if kx(x) == ky(y) } yield (x, y)

def cross[A,B]( xs: Bag[A], ys: Bag[B] ): Bag[(A, B)] = for { x <- xs; y <- ys } yield (x, y)

Combinator signatures are bundled in a ComprehensionCombinators trait and implemented three times. The LocalOps implementation uses the above naïve definitions, whereas FlinkOps and SparkOps directly apply the corresponding native operator on the backing distributed collection. For example, assuming that the backing Flink DataSet of a FlinkBag xs can be extracted with an asDataSet(xs) call, equiJoin can be defined in FlinkOps as follows.

def equiJoin[A, B, K](
  kx: A => K, ky: B => K)(xs: Bag[A], ys: Bag[B]
): Bag[(A, B)] = FlinkBag(
  (asDataSet(xs) join asDataSet(ys)) where kx equalTo ky
)
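For comparison, a hedged sketch of a corresponding SparkOps definition in terms of the RDD API could look as follows; asRDD and the SparkBag constructor are assumed analogues of asDataSet and FlinkBag above, and the ClassTag bounds are only required by the RDD API.

import scala.reflect.ClassTag

def equiJoin[A: ClassTag, B: ClassTag, K: ClassTag](
  kx: A => K, ky: B => K)(xs: Bag[A], ys: Bag[B]
): Bag[(A, B)] = SparkBag(
  asRDD(xs).map(x => (kx(x), x))            // key the left input
    .join(asRDD(ys).map(y => (ky(y), y)))   // distributed equi-join on the keys
    .values                                 // drop the keys, keep the (A, B) pairs
)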

Based on these combinators, we design a rule-based comprehension compilation strategy called combine (the rules are depicted in Figure 7.1). In addition to the rules com-join and com-cross, we add rules for each of the three monad operators – map, flatMap, and withFilter. Each rule eliminates at least one qualifier in the matched comprehension and introduces a binary combinator or a monad operator. The flatMap rule comes in two flavors – com-fmap2 is applied if the eliminated generator variable x is referenced in subsequent terms, while com-fmap1 is applied otherwise.

The rules rely on the following auxiliary functions. R⟦t⟧ denotes the set of symbols referenced by t (as in Figure 6.3), and R*⟦t⟧ the ones upon which t transitively depends.

[Figure 7.1 lists the rewrite rules that introduce comprehension combinators as part of the combine transformation: com-filter, com-fmap1, com-fmap2, com-join, com-cross, and com-map. For example, com-map rewrites for { x <- xs } yield let as xs.map(x => let).]

Figure 7.1: Rules introducing comprehension combinators as part of the combine transformation.


G⟦qs⟧ denotes the set of generator symbols bound by the qualifier sequence qs. For example, the premises of com-filter state that p should reference x, but should not reference any symbols bound by generators in qs1 or qs2.

Note that the presentation in Figure 7.1 is simplified, as the actual implementations maintain Emma Core form. For example, instead of xs, com-filter actually matches a let_xs term with suffix⟦let_xs⟧ = { vals; defs; x } and rewrites suffix⟦let_xs⟧ using fresh symbols f and y as follows.

    { vals; val f = x => p; val y = x.withFilter(f); defs; y }

The combine scheme iteratively applies the first matching rule. The specific rule order indicated in Figure 7.1 ensures that (i) filters are pushed down as much as possible, (ii) flattening occurs as early as possible, and (iii) the join tree has a left-deep structure. The resulting dataflow graph thereby aligns with common heuristics exploited by rule-based query optimizers [Fre87]. To illustrate the rewrite, consider the normalized comprehension from Section 5.3.2.

for {
  m <- movies; c <- credits; p <- people
  if m.id == c.movieID; if p.id == c.personID
} yield (m.title, p.name)

Normalization proceeds in three steps. In the first two, the com-join rule combines m and c (introducing u), and then u and p (introducing v). The intermediate results look as follows.

for {
  u <- LocalOps.equiJoin(
    m => m.id, c => c.movieID
  )(movies, credits)
  p <- people
  if p.id == u._2.personID
} yield (u._1.title, p.name)

for {
  v <- LocalOps.equiJoin(
    u => u._2.personID, p => p.id
  )(LocalOps.equiJoin(
    m => m.id, c => c.movieID
  )(movies, credits), people)
} yield (v._1._1.title, v._2.name)

Finally, the com-map rule rewrites the resulting single-generator comprehension as a map call.

LocalOps.equiJoin(u => u._2.personID, p => p.id)(
  LocalOps.equiJoin(m => m.id, c => c.movieID)(
    movies, credits), people
).map(v => (v._1._1.title, v._2.name))


The combine translation scheme is complemented by an extension of the Bag special- ization procedure outlined in Section 6.5. In addition to Bag companion constructors, we also specialize combinator applications, replacing LocalOps with either FlinkOps or SparkOps depending on the selected backend.

7.1.3 Structured API Specialization in Spark

The combine transformation uses established query processing heuristics in order to translate Bag comprehensions as parallel dataflows targeting Flink or Spark. The resulting dataflow graphs are then further optimized by the target engine. Both engines automatically fuse operators that can be executed in a single pass (e.g., a chain of map and withFilter applications). In addition, Flink’s built-in optimizer automatically selects optimal data distribution and local execution strategies for operators such as cross and equiJoin. To enable similar functionality in Spark, however, one has to express the target dataflows in Spark’s structured Dataset API. To achieve this automatically, we extend Spark’s optimize pipeline with a corresponding specializing transformation.

The transformation proceeds in two steps. First, we identify lambdas used in dataflow operators backed by Spark, and for each lambda, we check whether its definition can be specialized as a corresponding Spark Column expression. In the second step, we specialize dataflow operators whose lambdas can be specialized. Below, we sketch these steps based on our running example.

We model the set of supported Spark Column expressions as an Expr ADT equipped with an evaluator function eval: Expr => Column. Lambda specialization is restricted to lambdas without control-flow and preserves the ANF structure of the lambda body. More specifically, for each vdef in the let block constituting the lambda body, we check whether its right-hand side can be mapped to a corresponding Expr. If this is true for all vdefs, we can specialize the lambda, changing its type from A => B to Expr => Expr. To illustrate this process, consider the top-level join of the dataflow depicted at the end of Section 7.1.2. The u => u._2.personID lambda is specialized as follows (showing the Emma Core version first and the specialized result below it).

val kuOrig = (u: (Movie, Credit)) => {
  val x1 = u._2
  val x2 = x1.personID
  x2
}

val kuSpec = (u: Expr) => {
  val x1 = Proj(u, "_2")
  val x2 = Proj(x1, "personID")
  x2
}
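For concreteness, the following hedged sketch shows one possible shape of such an Expr ADT and its evaluator; the actual set of constructors and evaluation rules in Emma may differ.

import org.apache.spark.sql.{Column, Dataset}

sealed trait Expr
case class Root(ds: Dataset[_]) extends Expr               // the row of a backing Dataset
case class Proj(parent: Expr, field: String) extends Expr  // a field projection

def eval(e: Expr): Column = e match {
  case Proj(Root(ds), field) => ds.col(field)                 // top-level column
  case Proj(parent, field)   => eval(parent).getField(field)  // nested struct field
  case Root(ds)              => ds.col("*")                   // the whole row
}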

All other lambdas in the example dataflow can be specialized in a similar way. Con- sequently, the equiJoin and map applications using these lambdas can be specialized as well. To that end we define an object SparkNtv with specialized dataflow operators equiJoin, select, and project corresponding to equiJoin, map, and withFilter. For

example, equiJoin is defined as follows.

def equiJoin[A, B, K](
  kx: Expr => Expr, ky: Expr => Expr)(xs: Bag[A], ys: Bag[B]
): Bag[(A, B)] = {
  val (us, vs) = (asDataset(xs), asDataset(ys))
  val cx = eval(kx(Root(us)))
  val cy = eval(ky(Root(vs)))
  SparkBag(us.joinWith(vs, cx === cy))
}

The implementation accepts the original bags xs and ys next to the specialized lambdas kx and ky. We first extract the Dataset representations of the two input bags. We then use those to evaluate the specialized lambdas and obtain Column expressions for the corresponding join keys. Finally, we construct a Spark Dataset equi-join and wrap the result in a new SparkBag.

The presented approach ensures that we implement Emma dataflows on Spark in terms of the more efficient, optimizable Dataset API whenever possible, and in terms of the more general RDD API otherwise. The strategy is also more future-proof than writing hard-coded Spark dataflows. When a new Spark version rolls out, we only need to add support for the new Column expressions to the lambda specialization logic. Clients can then re-compile their Emma Source code without client-side code modifications, and benefit from the larger dataflow fragments compiled to the Spark Dataset API.

7.2 Fold Fusion

The fold-fusion optimization presented in this section resolves the issues outlined in Example 2.3 and is facilitated by the following Emma design aspects. First, the Bag API is derived from a solid algebraic foundation, using the Union-representation as a model for distributed data and its associated structural recursion operator (fold) as a model for parallel collection processing (see Section 4.1). Second, the API allows for nested computations – the groupBy method transforms a Bag[A] into a Bag of Group instances where each group contains a values member of type Bag[A] (see Section 5.3.3). Third, the quotation-based compilation approach allows for representing such nested computations in Emma Core and designing algebraic rewrites based on this holistic IR.

Internally, the fold-fusion optimization is defined as the composition of two rewrites

    fold-group-fusion ∘ fold-forest-fusion

We discuss each rewrite in detail. As a running example, consider a code snippet which computes min and avg values per group from a Bag of data points grouped by their label.


val stats = for (Group(label, pnts) <- points.groupBy(_.label)) yield {
  val poss = for (p <- pnts) yield p.pos
  val min = stat.min(D)(poss)
  val avg = stat.avg(D)(poss)
  (label, min, avg)
}

7.2.1 Fold-Forest Fusion

The goal of fold-forest-fusion is to rewrite a tree of folds over different Union- algebras as a single fold over a corresponding tree of Union-algebras. The rewrite proceeds in three steps.

Fold Inlining and Fold-Forest Construction

As a first step, we inline all aliased folds and extract a forest of fold applications. Each tree in the forest is rooted in a different Bag instance. Leaf nodes in the tree represent fold applications. Inner nodes represent linear Bag comprehensions, i.e., comprehensions of the general form (omitting possibly occurring guards)

    for { x1 <- let1; ...; xn <- letn } yield leth

where each generator references the symbol bound by the previous one, i.e., ∀ 1 ≤ i < n : xi ∈ R⟦leti+1⟧. In our running example, the definitions of the stat.min and stat.avg folds are expanded as shown below. The forest consists of a single tree rooted at pnts with one inner node – poss – and three leaf nodes – min, sum, and siz.

for (Group(label, pnts) <- ...) yield {
  val poss = for (p <- pnts) yield p.pos
  val aMin = stat.Min(D)
  val min = poss.fold(aMin)
  val aSum = stat.Sum(D)
  val sum = poss.fold(aSum)
  val siz = poss.fold(Size)
  val avg = sum / siz
  (label, min, avg)
}

[Fold tree: pnts is the root, poss is its inner node, and min (over aMin), sum (over aSum), and siz (over Size) are its leaves.]

Trees are then collapsed in a bottom-up way by a fold-forest-fusion rewrite, realized as an interleaved application of two rewrite rules. The banana-fusion rewrite merges leaf siblings into a single leaf, whereas cata-fusion merges an inner node which has a single leaf as a child.


Banana-Fusion

The rewrite is enabled by the Banana-Split law from Section 4.1.7, which states that any pair of folds can be fused into a single fold over a pair of algebras, i.e.

    (xs.fold(alg1), xs.fold(alg2)) = xs.fold(Alg2(alg1, alg2))

where Alg2 represents the fusion of two algebras and is defined as follows.

class Alg2[A, B1, B2](a1: Alg[A, B1], a2: Alg[A, B2]) extends Alg[A, (B1, B2)] {
  val zero = (a1.zero, a2.zero)
  val init = (x: A) => (a1.init(x), a2.init(x))
  val plus = (x: (B1, B2), y: (B1, B2)) => (a1.plus(x._1, y._1), a2.plus(x._2, y._2))
}
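The rewrite below also uses a three-way fusion Alg3. A hedged sketch analogous to Alg2 (the actual Emma definition may differ) is:

case class Alg3[A, B1, B2, B3](a1: Alg[A, B1], a2: Alg[A, B2], a3: Alg[A, B3])
  extends Alg[A, (B1, B2, B3)] {
  val zero = (a1.zero, a2.zero, a3.zero)
  val init = (x: A) => (a1.init(x), a2.init(x), a3.init(x))
  val plus = (x: (B1, B2, B3), y: (B1, B2, B3)) =>
    (a1.plus(x._1, y._1), a2.plus(x._2, y._2), a3.plus(x._3, y._3))
}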

The law generalizes to n-ary tuples, which means that with a single application of the above equation from left to right we can “fuse” leaves sharing a common parent. In our running example, we first fuse the aMin, aSum, and Size algebras as alg1, and then fuse the corresponding min, sum and siz folds as fld1. The three leaves of the original fold tree thereby collapse into a single leaf. The original structure is now mirrored in the tree of Union-style algebras.

val poss = for (p <- pnts) yield p.pos
...
val alg1 = Alg3(aMin, aSum, Size)
val fld1 = poss.fold(alg1)
val min = fld1._1
val sum = fld1._2
val siz = fld1._3
...

[Fold tree: pnts → poss → fld1; algebra tree: alg1 over aMin, aSum, and Size.]

Cata-Fusion

The rewrite is inspired by the Cata-Map-Fusion law from Section 4.1.7, which states that a fold over a recursive datatype can be fused with a preceding map f application.

    xs.map(f).fold(a) = xs.fold(AlgMap(f, a))

The AlgMap algebra fuses the per-element application of f with a child algebra a.

class AlgMap[A, B, C](f: A => B, a: Alg[B, C]) extends Alg[A, C] {
  val zero = a.zero
  val init = f andThen a.init
  val plus = a.plus
}


In our running example, poss is defined as a Bag comprehension with a single generator and no guards, so due to the desugarBag scheme it is equivalent to a map call. We can therefore apply the cata-fusion law directly in order to fuse poss with fld1. The final result looks as follows.

...
val alg1 = Alg3(aMin, aSum, Size)
val alg2 = AlgMap(p => p.pos, alg1)
val fld2 = pnts.fold(alg2)
val min = fld2._1
val sum = fld2._2
val siz = fld2._3
...

[Fold tree: pnts → fld2; algebra tree: alg2 → alg1 over aMin, aSum, and Size.]

Observe the symmetry between the original tree of folds and the resulting tree of algebras.

Based on the insight that all Bag comprehensions admit a catamorphic interpretation [Gru99], we extend the cata-fusion rewrite with two more algebras which allow for fusing arbitrary linear comprehensions. Folds ys.fold(a) where ys is a comprehension of the form

    val ys = for { x <- xs; if let1; ...; if letn } yield x

are thereby fused as xs.fold(AlgFilter(p, a)), where AlgFilter is defined as

class AlgFilter[A, B](p: A => Boolean, a: Alg[A, B]) extends Alg[A, B] {
  val zero = a.zero
  val init = (x: A) => if (p(x)) a.init(x) else a.zero
  val plus = a.plus
}

and the predicate p is constructed as follows.

    val p = x => let1 && ... && letn

Similarly, folds ys.fold(a) where ys is a linear comprehension of the general form defined in Section 7.2.1 are fused as xs.fold(AlgFlatMap(f, a)), where AlgFlatMap is defined as

class AlgFlatMap[A, B, C](f: A => Bag[B], a: Alg[B, C]) extends Alg[A, C] {
  val zero = a.zero
  val init = f andThen (_.fold(a))
  val plus = a.plus
}

and the argument f is constructed as follows.

    val f = x1 => for { x2 <- let2; ...; xn <- letn } yield leth

The outlined fusion approach therefore works on fold trees with arbitrary shape. For example, consider a variation of our running example where the siz aggregate is defined not as poss.fold(Size) but as pnts.fold(Size). The original fold tree and the resulting tree of algebras change their shape in the following way.

[Original fold tree: pnts is the root with children poss (whose leaves are min and sum) and siz. Resulting: a single fold fld3 over the algebra tree alg3, whose children are alg2 (over aMin and aSum) and Size.]

Compared to the original example, we fuse only two leaves (aMin and aSum) in the first step, and apply an additional banana-fusion between alg2 and Size in order to construct the root of the resulting tree of algebras, alg3.

7.2.2 Fold-Group Fusion

While the fold-forest fusion optimization ensures that multiple aggregates derived from the same Bag instance can be computed in a single pass, the fold-group fusion optimization discussed in this section fuses group values consumed by a single fold with a preceding groupBy operation which constructs the groups. Note that fold-forest fusion therefore enables a subsequent fold-group fusion in situations where the values of a group are consumed by multiple folds. In our running example, in Section 7.2.1 we managed to rewrite the tree of folds consuming pnts as a single fold consuming a mirrored tree of algebras.

val ptgrs = points.groupBy(_.label)
val stats = for (Group(label, pnts) <- ptgrs) yield {
  ... // constructs the tree of algebras rooted at alg2
  val fld2 = pnts.fold(alg2)
  ... // projects min, sum, siz aggregates from fld2 and computes avg
  (label, min, avg)
}

The fold-group fusion rewrite matches groupBy applications (such as ptgrs) that are used only once, where the use occurs on the right-hand side of a Bag comprehension generator (as in stats). The rewrite is subject to two conditions. First, the values field bound from each group (pnts) must be used only once, as the target of a fold application. Second, the algebra passed to the fold application (alg2) should not depend on any other values bound by the enclosing comprehension (such as label). If these conditions are met, we can pull the vdefs which construct the algebra out of the enclosing comprehension and replace the groupBy with a foldGroup call. Our running example is rewritten as follows.

... // constructs the tree of algebras rooted at alg2
val ptgrs = LocalOps.foldGroup(_.label, alg2)
val stats = for (Group(label, fld2) <- ptgrs) yield {
  ... // projects min, sum, siz aggregates from fld2 and computes avg
  (label, min, avg)
}

Similar to the comprehension combinators introduced in Section 7.1, the foldGroup operator is defined in a RuntimeOps trait and mixed into LocalOps, SparkOps, and FlinkOps. The subsequent specializing transformation replacing LocalOps with one of the other two implementations (as described in Section 7.1.2) enables targeting the right parallel dataflow primitives. For example, SparkOps can define foldGroup in terms of the RDD API as follows.

def foldGroup[A, B, K](xs: Bag[A], k: A => K, a: Alg[A, B]): Bag[Group[K, B]] =
  xs match {
    case SparkBag(us) => SparkBag(us
      .map(x => k(x) -> a.init(x))   // prepare partial aggregates
      .reduceByKey(a.plus)           // reduce by key
      .map(x => Group(x._1, x._2)))  // wrap the result (k, v) pair in a group
  }
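For reference, a hedged sketch of how LocalOps could realize the same operator on plain Scala collections (assuming the local Bag is backed by a Seq) mirrors the Spark variant above:

def foldGroup[A, B, K](xs: Seq[A], k: A => K, a: Alg[A, B]): Seq[Group[K, B]] =
  xs.groupBy(k).toSeq.map { case (key, vs) =>
    Group(key, vs.map(a.init).foldLeft(a.zero)(a.plus))  // fold the values of each group
  }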

7.3 Caching

The next optimization we consider is automatic cache call insertion. Recall that due to the type-based deep embedding strategy, the distributed collection types exposed by Spark and Flink are lazy. Consequently, the same applies to the Emma-based FlinkBag and SparkBag terms, which are backed by Flink and Spark distributed collections. To illustrate the issues arising from this observation, consider a more specific, Emma-based variation of the second code snippet from Example 2.4.

val points = for (d <- Bag.readCSV(/* read text corpus */)) yield
  LPoint(d.id, langs(d.lang), encode.freq(N)(tokenize(d.content)))
val kfolds = kfold.split(K)(points)
var models = Array.ofDim[DVector](K)
var scores = Array.ofDim[Double](K)
for (k <- 0 until K) { // run k-fold cross-validation
  models(k) = linreg.train(logistic, kfold.except(k)(kfolds))
  scores(k) = eval.f1score(models(k), kfold.select(k)(kfolds))
}
...

77 Chapter 7. Optimizations

The code reads a text corpus and converts it into a Bag of labeled points. Feature extraction is done by tokenizing the document contents into a “bag of words” and feature hashing the resulting representation with the help of the encode.freq Emma library function. The constructed points are then randomly assigned to one of K folds and used for k-fold cross-validation with a logistic regression model. The cross-validation for-loop thereby accesses the kfolds Bag in each iteration. If the code is enclosed in an onSpark or onFlink quote, kfolds will be specialized as a SparkBag or a FlinkBag, respectively. The uses of kfolds in the train and f1score calls will consequently re-evaluate the backing distributed collection in each of the K iterations.

The code snippets below provide two more examples where Bag instances have to be cached.

val docs = /* read corpus */
val size = docs.size
val tags =
  if (lang == "de")      docs.map(posTagger1)
  else if (lang == "fr") docs.map(posTagger2)
  else                   docs.map(posTagger3)
val rslts = tags.withFilter(p)

val edges = Bag.readCSV(/* ... */)
var paths = edges.map(edge2path)
for (i <- 1 until N) {
  paths = for {
    p <- paths
    e <- edges
    if p.head == e.dst
  } yield e.src +: p
}
...

In the first of these examples, we read a text corpus docs, compute its size, and, depending on the variable lang, apply one of three part-of-speech taggers in order to compute tags. Since docs is referenced more than once, it makes sense to cache it. Note that cache call insertion should not be too aggressive. For example, even if we exclude size from the snippet, docs is still referenced more than once. However, in this case it is actually beneficial to avoid caching in order to pipeline the execution of docs, tags, and rslts in a single operator chain. To capture these situations, we have to ensure that docs references in mutually exclusive control-flow blocks are counted only once.

The second example computes all paths of length N from a Bag of edges. In this scenario, caching the loop-invariant edges Bag is not too beneficial, as it will only amortize the cost of a single readCSV execution per loop. On the other hand, the loop-dependent Bag paths represents a dataflow with depth proportional to the value of the loop variable i. After the loop, paths wraps a dataflow with N joins and N maps. In order to ensure that the size of iteratively constructed dataflows does not depend on the loop variable, we should conservatively cache loop-dependent Bag instances such as paths.
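To make the intended outcome concrete, the following sketch shows where the rewrite would place cache calls in the two snippets above, using the LocalOps.cache primitive introduced at the end of this section (the exact placement is illustrative; cache is assumed to return the cached Bag).

// First example: docs is referenced more than once (by size and tags).
val docs = LocalOps.cache(/* read corpus */)
val size = docs.size
val tags =
  if (lang == "de")      docs.map(posTagger1)
  else if (lang == "fr") docs.map(posTagger2)
  else                   docs.map(posTagger3)

// Second example: paths is updated inside the loop, so every new version
// is cached to keep the iteratively built dataflow shallow.
var paths = edges.map(edge2path)
for (i <- 1 until N) {
  paths = LocalOps.cache(for {
    p <- paths
    e <- edges
    if p.head == e.dst
  } yield e.src +: p)
}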

To cover the three cases outlined above, we define an optimization add-cache-calls based on the Emma Core IR. The rewrite caches Bag instances x if one of the following three conditions is met:


(C1) x is referenced inside a subsequent loop;
(C2) x is referenced more than once in a subsequent acyclic code path;
(C3) x is updated inside a loop.

All three cases can be identified based on analysis of the control- and data-flow graphs and the dominance tree derived from the Emma Core representation. C1 corresponds to situations where

• x is a value referenced in a continuation k;

• k is part of a cycle k1, ..., kn embedded in the derived control-flow graph;
• x is not defined in any of the ki contained in the cycle.

Further, let uses_k(x) denote the number of uses for a symbol x in the continuation k (excluding uses in continuation calls), and let dom(k) denote the set of continuations dominated by k (i.e., all continuation definitions nested in k). Then, C2 corresponds to situations where

• x is a value defined in a continuation k;
• the control-flow graph restricted to dom(k) contains at least two strongly connected components Si such that Σ_{ki ∈ Si} uses_ki(x) > 0;
• at least two of the above components are also weakly connected with each other.

Finally, C3 corresponds to situations where

• x is a parameter of a continuation definition k;
• the closure of the control-flow graph restricted to dom(k) contains the edge (k, k).

The following control-flow graphs illustrate the three types of checks presented above.

[Control-flow graphs: the left graph consists of the continuations k0, k1, k2, and k3, with a cycle between k1 and k3; the right graph consists of k0, k1, k2, k3, k4, k5, and k6, where k0, k1, k4, and k5 carry a superscript + because they reference docs.]

The left graph corresponds to the snippets associated with C1¹ and C3. In both cases, the for loop is desugared as a while loop. The loop is represented by k1 and its body by k3. C1 applies to the first snippet, as

• kfolds is referenced in the k3 continuation;

1For simplicity of presentation, we assume that the train and f1score calls inside the loop are not inlined.


• k3 is part of the (k1, k3) cycle;
• kfolds is defined in k0 ∉ {k1, k3}.

In the third snippet, C3 applies as

• the dscf-wdo rule converts the variable paths into a parameter of the k1 continuation;
• the closure of the graph restricted to dom(k1) = {k1, k3} contains (k1, k1) as an edge.

The right graph corresponds to the code snippet illustrating C2. The superscript notation k⁺ indicates that the value docs is referenced in the continuation k. C2 applies to the second snippet, as

• docs is defined in k0;
• the graph (without restrictions, because dom(k0) = {k0, ..., k6}) has no cycles, so each continuation ki represents a trivial strongly connected component Si, and from those only S0, S1, S4, and S5 reference docs;
• from these candidate components, S0 is connected with S1, S4, and S5.

If we omit the size from k0, the last condition is not met, as no pair from {S1, S4, S5} is connected.

The add-cache-calls optimization is backend-agnostic. The cache method is defined in the RuntimeOps trait and inserted as LocalOps.cache calls. These are then specialized in the same way as comprehension combinator and fold-fusion calls (see Section 7.1.2 and Section 7.2): depending on the enclosing quote, LocalOps is replaced either by FlinkOps or SparkOps.
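The following sketch illustrates one possible shape of the cache primitive and its backend specializations; the signatures, the object layout, and the writeAndReadBack helper are assumptions, and the cost of the Flink variant is discussed in Chapter 9.

trait RuntimeOps {
  def cache[A](xs: Bag[A]): Bag[A]
  // ... comprehension combinators, foldGroup, and the other runtime primitives
}

object SparkOps extends RuntimeOps {
  def cache[A](xs: Bag[A]): Bag[A] = xs match {
    case SparkBag(us) => SparkBag(us.cache()) // first-class RDD caching
  }
}

object FlinkOps extends RuntimeOps {
  def cache[A](xs: Bag[A]): Bag[A] = xs match {
    // No first-class caching in the DataSet API: materialize the result
    // (e.g. to HDFS) and read it back (writeAndReadBack is a hypothetical helper).
    case FlinkBag(us) => FlinkBag(writeAndReadBack(us))
  }
}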

7.4 Native Iterations

As a final optimization, we discuss a specializing transformation called specialize-loops, which maps Emma Core loops to Flink iterate operator calls. As discussed in Example 2.4, in contrast to Spark, Flink lacks full-fledged support for multi-dataflow applications. If the driver application has control flow and wants to execute multiple dataflows, the client should manually simulate caching of intermediate results by writing them to disk. Flink, however, has a dedicated iterate operator that can be used to express certain classes of iterative dataflows.

As a running example, consider again the edges and paths code snippet from Section 7.3. The Emma Source and Emma Core representations are depicted in Figure 7.2. The Emma Core expression matches the following criteria:


val edges = Bag.readCSV(/* ... */)
var paths = edges.map(edge2path)
val it = (0 to 5).toIterator
var i = null.asInstanceOf[Int]
while (it.hasNext) {
  i = it.next()
  paths = for {
    p <- paths
    e <- edges
    if p.head == e.dst
  } yield e.src +: p
}
... // suffix

val edges = ANF{ Bag.readCSV(/*...*/) }
val p$1 = ANF{ edges.map(edge2path) }
val it = ANF{ (0 to 5).toIterator }
val i$1 = null.asInstanceOf[Int]
def k1(i: Int, paths: Bag[Path]) = {
  val hasNext = it.hasNext
  def k3() = { // loop body
    val i$2 = it.next()
    val p$2 = for {
      p <- { paths }
      e <- { edges }
      if ANF{ p.head == e.dst }
    } yield ANF{ e.src +: p }
    k1(i$2, p$2)
  }
  def k2() = ... // suffix
  if (hasNext) k3() else k2()
}
k1(i$1, p$1)

Figure 7.2: Emma Source (top) and Emma Core (bottom) versions of a code snippet that can be specialized to Flink’s iterate operator. For readability, not all code fragments in the Emma Core representation are translated to ANF form (indicated with a surrounding ANF{ ... } call).

• k1–k3 form a control-flow graph corresponding to a simple while loop;
• k1 has two parameters – an induction variable i of type Int and a Bag instance paths;
• the induction variable i binds to values in the [0, N) range (in the example we have N = 5);
• except the induction variable update i$2, all value definitions in the loop body continuation k3 form a dataflow graph rooted at p$2;
• p$2 binds to the Bag parameter paths in the recursive k1 call.

Because of that, we can replace the k1–k3 loop with a Flink iterate call. To that end, we eliminate the k1 subtree as well as preceding values contributing only to the induction variable i (e.g., it and i$1). The rewrite

• wraps the original body (minus the induction update term) in a fresh lambda function f$1;
• re-defines paths as a value definition that binds to the result of a FlinkNtv.iterate call;
• appends the body of the original suffix k2 to the enclosing root continuation.

In our running example, the resulting expression looks as follows.

val edges = ANF{ Bag.readCSV(/*...*/) }
val p$1 = ANF{ edges.map(edge2path) }
val f$1 = (paths: Bag[Path]) => {
  val p$2 = for {
    p <- { paths }
    e <- { edges }
    if ANF{ p.head == e.dst }
  } yield ANF{ e.src +: p }
  p$2
}
val paths = FlinkNtv.iterate(5)(f$1)(p$1)
... // suffix

The iterate primitive is defined in a FlinkNtv module and delegates to Flink’s native iterate operator. From our discussion in Example 2.4, recall that because Flink’s iterate virtualizes the notion of an iterative dataflow, Flink’s optimizer can analyze the body and automatically cache loop-invariant data. In order to avoid naïve caching of loop-invariant Bag instances, specialize-loops precedes add-cache-calls in the Flink-specific optimize chain.
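A minimal sketch of how FlinkNtv.iterate could delegate to the native operator is shown below; it assumes that FlinkBag wraps an org.apache.flink.api.scala.DataSet and that the necessary TypeInformation evidence is in scope.

object FlinkNtv {
  def iterate[A](n: Int)(f: Bag[A] => Bag[A])(xs: Bag[A]): Bag[A] = xs match {
    case FlinkBag(us) =>
      // DataSet#iterate builds a single iterative dataflow, so Flink's
      // optimizer can cache the loop-invariant inputs of the body f.
      FlinkBag(us.iterate(n) { ds =>
        val FlinkBag(vs) = f(FlinkBag(ds))
        vs
      })
  }
}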

8 Implementation

The translation from Emma Source to Emma Core presented in Chapter 6 and the optimizations discussed in Chapter 7 were prototyped in Scala. In this chapter, we discuss some critical decisions and techniques employed in our prototype implementation. Section 8.1 outlines the core set of design principles guiding our decisions. Based on those, in Section 8.2 we discuss the trade-offs of possible metaprogramming infrastructures and explain the choice of Scala’s macro and reflection-based APIs as the underlying foundation for our prototype. Finally, in Section 8.3 through Section 8.5 we discuss a number of implementation strategies used to overcome the challenges outlined in Section 8.2.

8.1 Design Principles

The main design principles guiding our implementation align with the general eDSL design objectives outlined in Section 2.2 – reuse as much Scala syntax as possible and at the same time minimize the number of idiosyncratic patterns required to encode DSL terms in Scala. In addition, to position Emma as a lightweight alternative to Spark’s RDD and Flink’s DataSet APIs, we aimed for an implementation that integrates well with off-the-shelf versions of Spark and Flink and does not require custom builds of Flink, Spark, or Scala. Finally, while the optimizing rewrites presented in Chapter 7 are data-independent, data-dependent optimizations were anticipated as part of future research. To facilitate both kinds of optimizations, we wanted the Emma compiler to be capable of staging, transforming, and compiling DSL terms both at compile time and at run time.

8.2 Design Space

Development of the Emma prototype commenced in early 2014. At that time, the Scala ecosystem offered two different platforms for implementing optimizing DSLs – Scala


Macros and Lightweight Modular Staging (LMS). In Section 8.2.1 and Section 8.2.2, we discuss the benefits and drawbacks of these platforms, motivating the choice of Scala Macros in view of the design objectives outlined above. In Section 8.2.3, we mention other platforms and tools that have recently emerged, discussing their suitability for eDSL designs similar to the one proposed in this thesis.

8.2.1 LMS

LMS [RO10, RO12] is a framework for rapid development of embedded DSLs based on the concepts of staging [JS86, TS00] and partial evaluation [JGS93]. DSL programs are staged to an intermediate representation and optimized by means of partial evaluation in a series of successive stages. Each stage evaluates a staged program into a new program representation to be consumed by the next stage. Finally, the resulting program representation is translated into executable code. To illustrate the idea of modular staging advocated by LMS, we use the power function example from [RO10]. The power definition can be made available in objects and classes inheriting from the Power trait1.

trait Power {
  def power(b: Double, x: Int): Double =
    if (x == 0) 1.0
    else b * power(b, x - 1)
}

In LMS, staged terms are delimited via type-based annotations – the type Rep[T] denotes a staged computation whose unstaged variant will yield a value of type T. Unstaged expressions are partially evaluated in the current stage. In the above code snippet, we want to stage the parameter b and the return type of the power function. To achieve that we simply change their type from Double to Rep[Double]. The resulting version partially evaluates the exponent x – that is, the resulting Rep[Double] program represents a power computation specialized for a specific value of x.

trait Power {
  def power(b: Rep[Double], x: Int): Rep[Double] =
    if (x == 0) 1.0
    else b * power(b, x - 1)
}

This modified Scala code snippet will not compile initially because the compiler will not be able to find (i) an implicit staging of the Double literal 1.0 to a Rep[Double] value, and (ii) a staged variant of the * method which operates on Rep[Double] instead of on plain Double types. The approach advocated by LMS is to bundle and install such operations in a modular manner, using the so-called cake pattern [Hun13]. To make the example above compile, we must constrain the this type of the enclosing Power trait.

1In Scala, objects and classes can inherit from multiple traits.


trait Power { this: Arith =>
  def power(b: Rep[Double], x: Int): Rep[Double] =
    if (x == 0) 1.0
    else b * power(b, x - 1)
}

The this : Arith type constraint asserts that objects and classes inheriting from Power also inherit from Arith, which provides the staged versions of * and 1.0. Other modules of staged functions are provided by different traits in a similar manner. For example, if a staged program relies both on arithmetic operations and on trigonometric functions such as sin and cos, the enclosing trait needs to be constrained with Arith and Trig.

trait SomeTrait { this: Arith with Trig =>
  ... // access to staged versions of *, +, cos, and sin
}

The LMS package provides default implementations for all modules. The method definitions in these implementations simply construct an ANF representation of the staged program term. For example, a power(b, 3) call in the first stage will partially evaluate the recursive power calls based on the (unstaged) x parameter. The result, consumed by the second stage, will be a Rep[Double] value representing the following ANF program.

val x0 = 1.0
val x1 = b * x0
val x2 = b * x1
val x3 = b * x2
x3

As part of the staging process, the framework also implicitly performs Common Subex- pression Elimination (CSE), ensuring that the resulting ANF representation does not contain duplicated code.

Code generation in LMS is done explicitly with a dedicated compile call provided by a Compile trait. Depending on the used Compile implementation, the framework can generate code for different backends, e.g. Scala or C. For example, the following definition allows for instantiating specialized power(·, x) implementations via fastpower(x) calls.

object fastpower extends Power with CompileScala {
  def apply(x: Int): Double => Double = compile {
    (b: Rep[Double]) => power(b, x)
  }
}

An optimizing DSL for parallel collection processing which follows the design outlined in Chapter 5 through Chapter 7 of this thesis can be realized on top of the LMS framework using the following implementation guidelines.


(G1) Define a BagOps trait which provides staged versions of the Bag[A] and BagCompanion API from Figure 5.2.
(G2) Implement the lift transformation from Chapter 6. Since the anf and dscf conversion is already handled by the staging facilities provided by LMS, we only have to implement normalizeBag and resugarBag.
(G3) Implement the optimizing transformations from Chapter 7.
(G4) Implement backends specializing staged Bag operators to either SparkBag or FlinkBag operators and use them in conjunction with the CompileScala backend.
(G5) Define onSpark and onFlink compilation pipelines using chains of the above stages.

Realizing Emma on top of LMS, however, is problematic with respect to some of the objectives identified in Section 8.1. In the following, we discuss some of these problems.

First, while staging based on type annotations offers fine-grained control over which paths of the original program are staged, it also imposes a higher technical barrier for the eDSL users. In our running example, understanding the concepts of staging and partial evaluation was required in order to decide which types of the original power function should be adapted from T to Rep[T]. One simple way to eliminate this complexity dimension is to always stage the entire program. In type-based staging, this means changing the type of all terms from T to Rep[T]. However, this approach introduces some level of linguistic noise and violates the linguistic reuse principle from Section 2.2. With quotation-based embedding, the same effect is achieved by a single quotation and thereby requires fewer changes to the concrete syntax of the original program.

Second, the Emma Source language defined in Figure 5.1 assumes an open universe of methods and modules. This assumption is important for predictive analytics applications, as those typically use logic provided by third-party libraries. In the data integration and preprocessing phase, the elements of the input datasets are often normalized and vectorized using domain-specific methods such as Gaussian curve fitting or Radial Distribution Function (RDF) conversion. Data practitioners often rely on libraries that provide trusted implementations of these methods. In the Flink and Spark APIs, vectorization and normalization methods provided by third-party libraries can be easily called in lambdas passed to higher-order functions such as map or reduce. The LMS staging scheme outlined above, however, does not offer a mechanism to stage an open universe of methods and symbols. Therefore, in an LMS-based implementation of Emma, DSL users would have to extend the compilation infrastructure in an ad-hoc manner in order to add staging and code generation support for all library methods used in Emma pipelines. As before, we would like to remove compiler and code generation aspects from the user-facing API. An implementation based on Scala macros and quotations offers a straight-forward solution to the problem, as an open universe of methods is directly supported in the AST of the quoted terms.


Third, the LMS framework requires a modified version of the Scala runtime called Scala-Virtualized [RAM+12]. This requirement is dictated by the need to employ the method-based staging strategy outlined above for language features such as variable declarations and assignments, control-flow and pattern matching statements, and record types. To achieve that, the Scala runtime is modified in order to represent these features as virtual method calls. The semantics of these methods can then be overloaded by hosted DSLs or DSL frameworks such as LMS. Unfortunately, this modification is at odds with the requirement to integrate Emma with off-the-shelf versions of Flink and Spark. As both frameworks depend on Scala, an LMS-based implementation of Emma will only work with modified versions of Flink and Spark which are based on Scala-Virtualized. Again, a macro-based implementation is not affected by this problem – Scala macros ship as an experimental feature with vanilla Scala since version 2.10 and are therefore compatible with any vanilla Flink or Spark distribution.

8.2.2 Scala Macros and Scala Reflection

Starting from version 2.10, Scala ships with experimental metaprogramming support consisting of two separate libraries. Scala macros [Bur13] offer facilities for compile-time metaprogramming, while Scala reflection [COD08] provides runtime reflection support. An important aspect is that the two libraries are based on the same API and share a substantial amount of code.

A Scala def method can be declared as a macro as follows2.

def assert(cond: Boolean, msg: Any): Unit = macro Asserts.assertImpl

The signature of the macro implementation method Asserts.assertImpl mirrors the signature of assert.

def assertImpl(cond: c.Expr[Boolean], msg: c.Expr[Any]): c.Expr[Unit] = ...

In the above definition, c is a variable containing the enclosing macro Context, and the path-dependent type c.Expr[T] wraps an AST of type T. Client calls of the assert method are delegated to the assertImpl macro at compile-time using the ASTs of the passed arguments. The macro returns the AST of a program of type Unit wrapped in a container of type c.Expr[Unit]. The resulting expressions are inlined at the assert call sites. For example, the call

assert(x < 10, "limit exceeded")

2The example is adapted from the official Scala documentation

will result in an assertImpl call where the c.Expr parameters wrap the following ASTs.

// AST for the ‘cond‘ argument (x < 10)
Apply(
  Select(Ident(TermName("x")), TermName("$less")),
  List(Literal(Constant(10))))

// AST for the ‘msg‘ argument ("limit exceeded")
Literal(Constant("limit exceeded"))

The assertImpl implementation can inspect the structure of these ASTs and use it to generate its output. For example, if the cond argument is an AST corresponding to the false literal, it can return an expression node that simply wraps the Unit value.

// AST for the ‘cond‘ argument
Literal(Constant(false))

// AST for the result expression
Literal(Constant(()))
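A hedged sketch of an assertImpl along these lines is shown below (with the macro Context passed explicitly, unlike the bundle-style signature above); the runtime-check branch and the use of quasiquotes are assumptions rather than the actual implementation from the Scala documentation.

import scala.reflect.macros.blackbox

def assertImpl(c: blackbox.Context)(cond: c.Expr[Boolean], msg: c.Expr[Any]): c.Expr[Unit] = {
  import c.universe._
  cond.tree match {
    // Statically false condition: return an expression that merely wraps
    // the Unit value, as described above.
    case Literal(Constant(false)) =>
      c.Expr[Unit](Literal(Constant(())))
    // Otherwise emit a runtime check.
    case _ =>
      c.Expr[Unit](q"if (!${cond.tree}) throw new AssertionError(${msg.tree})")
  }
}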

As outlined in Section 8.2.1, Scala macros and Scala reflection do not suffer from the problems associated with the LMS-based approach. First, a quotation-based design based on Scala macros allows for deep reuse of native Scala syntax with a minimum amount of linguistic noise. To achieve that, we simply define onSpark and onFlink as polymorphic macros that can access the AST of their enclosing expression. Second, the metaprogramming API can access Scala’s internal symbol table. In the above example, the Select node of the cond parameter has a symbol field which points to the unique ‘<’ method symbol, and the Ident node has a symbol field which points to the unique term symbol associated with x. An implementation based on Scala macros therefore can easily incorporate an open universe of methods and types. Third, Scala macros and runtime reflection can be used out of the box with the latest versions of Flink and Spark. In addition, similar to LMS, Scala’s reflection API ships with a lot of useful tooling and infrastructure, e.g. for tree traversal and transformation, manipulation of symbols and types, and AST inspection. Again, an implementation of Emma based on Scala macros can reuse this functionality.
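As a rough sketch (with assumed names and signatures, not the actual Emma definitions), such a quotation-based entry point could be declared as follows.

import scala.language.experimental.macros
import scala.reflect.macros.blackbox

object emma {
  // Quoting a code block: the macro implementation receives its AST.
  def onSpark[A](e: A): A = macro onSparkImpl[A]

  def onSparkImpl[A: c.WeakTypeTag](c: blackbox.Context)(e: c.Expr[A]): c.Expr[A] = {
    // Here the real pipeline would lift e.tree to Emma Core, optimize it,
    // and specialize LocalOps calls to SparkOps.
    c.Expr[A](e.tree) // placeholder: identity pipeline
  }
}

A client would then wrap a code block in emma.onSpark { ... }, and the AST of the entire quoted block becomes available to the compiler pipeline at compile time.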

Despite the benefits stated above, Scala macros and Scala reflection also exhibit a number of deficiencies when considered as a foundation for the Emma DSL. First, there is a mismatch between the tree structure of Scala AST terms and the abstract syntax of Emma Source (Figure 5.1) and Emma Core (Figure 6.1 and Figure 6.4). To illustrate this, consider again the AST for the x < 10 code fragment depicted above. An Emma Core tree depicting this term would consist of a single DefCall node.

DefCall(
  Some(TermRef(x)),
  /* method symbol for ‘<‘ */,
  Seq.empty[Type],       /* no type arguments */
  Seq(Seq(Literal(10)))) /* A single singleton parameter list */

A solution to the AST mismatch problem is outlined in Section 8.3.

Second, the tree traversal and manipulation logic provided by Scala’s metaprogramming API is too rudimentary and lacks support for commonly used high-level code manipulation and inspection patterns. As an example, consider a utility method that associates each subtree with its set of referenced binding symbols (i.e., the R function used in the dscf and combine transformations from Figure 6.3 and Figure 7.1). A high-level compiler API overcoming these limitations is presented in Section 8.4.

Third, Scala macros and Scala reflection share structurally identical, yet incompatible metaprogramming APIs. This is a consequence of the fact that the API types, operations and fields are imported in a path-dependent way through a dedicated Universe instance. At compile time, the enclosing universe can be accessed through the macro Context (c.universe), while at runtime it is available statically through scala.reflect.runtime.universe. Because of this, it is challenging to ensure that DSL compiler code (a) can be shared between compile-time and runtime components and (b) is organized in a modular manner. Section 8.5 discusses how our prototype compiler code is organized in view of this objective.

8.2.3 Current Solutions

Various solutions proposing different improvements over state-of-the-art tooling for staged compilation and metaprogramming in the Scala ecosystem have emerged after the inception of Emma. Here, we briefly discuss those that might be a useful foundation for future implementations of the ideas presented in this work.

Scalamacros

Scalamacros3 is a metaprogramming library which has been influenced by experiences and lessons learned in developing the scala.reflect-based macro system and its successor Scalameta4. The development roadmap for Scalamacros positions them as the long-term, production-ready successor of the experimental scala.reflect-based macros currently shipped with Scala. The main benefit of Scalamacros is a novel design approach where macros operate on a portable syntax abstraction decoupled from the AST of the underlying Scala compiler [LB17]. This leads to better tooling support, deeper IDE integration,

3http://github.com/scalacenter/macros
4http://scalameta.org

and painless migration of existing macros to new versions of Scala. Although it was developed independently from the Scalamacros effort, the encoding technique presented in Section 8.3 is quite similar to the approach proposed by Liu and Burmako [LB17]. An implementation of Emma on top of Scalamacros is likely to benefit from this similarity.

Squid

Another metaprogramming framework that has been recently proposed is Squid [PVSK18]. Squid combines the flexibility of dynamic quasi-quotes (in the style pioneered by Lisp) with the typing and scoping guarantees of static quasi-quotes (in the style pioneered by MetaML [TS00]). Squid can be used as an LMS alternative using a technique called quoted staged rewriting [PSK17]. A Squid-based implementation of Emma would therefore reconcile the simplicity of quotation-based delimiting of DSL terms with the elegance and power of staging as a principal method for program optimization.

Fusion-Enabling Transformation API

Matryoshka is a library that provides generalized folds, unfolds, and traversals for fixed-point data structures in Scala. The functionality offered by Matryoshka overlaps with recent work in structured recursion schemes by Hinze and Wu [HWG13, HW16]. Because the supported recursion schemes satisfy algebraic properties such as the Banana-Split and Cata-Fusion laws from Chapter 4, Matryoshka-based tree manipulation APIs automatically support a number of fusion-based optimizations, facilitating the construction of nanopass compilers. This makes it possible to reconcile the software engineering benefits of structuring code around semantically isolated tree transformers with the performance benefits of executing a fused version of the chain of transformers constituting the DSL compiler.

The formal approach adopted by Matryoshka, however, also imposes a higher technical barrier for compiler developers, as they need to understand concepts such as catamorphism, anamorphism, zygomorphism, etc. in order to use the Matryoshka API. Mapping tree traversals conceptualized as a set of inference rules to the right recursion scheme could be a challenging task, especially for people with no prior experience. To that end, Petrashko et al. [PLO17] offer a more pragmatic approach called miniphases. While Matryoshka advocates fusion based on soundness criteria inherent in the mathematical theory behind the underlying recursion schemes, the miniphases approach advocates fusion based on high-level criteria decided by the developer. Compiler developers provide a list of tree invariants that each tree transformation is guaranteed to satisfy, and the compiler automatically checks these invariants during execution. Extensive testing is identified as a principal method to identify and mitigate errors in fused transformations.

Either of these two approaches would make it possible to encode the transformations presented

in Chapter 5 through Chapter 7 in a modular way and at the same time construct fast versions of the onFlink and onSpark compilation pipelines due to the applied transformation fusion.

8.3 Object Language Encoding

A practical problem that occurred when we implemented Emma on top of the Scala macros infrastructure was the mismatch between the AST representation of macro-based Scala terms and the abstract syntax of the object languages defined in Figure 5.1, Figure 6.1 and Figure 6.4. The main cause for this mismatch was our desire to decouple the abstract syntax forms used by Emma from the specifics of Scala’s AST encoding in order to simplify the definitions of Emma Core and Emma Source. In the following, we give a couple of examples that illustrate why this simplification was desirable.

As discussed in Section 8.2.2, the Scala macros API exposes the same AST data structure as the one used internally by the Scala compiler. Consequently, the shape of the ASTs reflects some of the inner workings of the Scala compiler. To illustrate this, consider the following two code snippets.

var i = 0
var r = 0
do {
  i = i + 1
  r = r * i
} while (i < x)

var r = 0
var i = 0
while (i < x) {
  i = i + 1
  r = r * i
}

Internally, the loops in the above code snippets are represented by a LabelDef node and a nested If node. The condition and branches of the If node as well as the body of the LabelDef node are derived from the original loop condition and body. The shape of the actual ASTs is represented by the following Scala-like code.

doWhile$1() { // label definition
  { // loop body
    i = i + 1
    r = r * i
    ()
  }
  if (i < x)    // loop condition
    doWhile$1() // label call
  else ()
}

while$1() { // label definition
  if (i < x) { // loop condition
    { // loop body
      i = i + 1
      r = r * i
      ()
    }
    while$1() // label call
  } else ()
}

As part of the code generation phase, the Scala compiler converts label calls to Java Virtual Machine (JVM) jump bytecode instructions. The Emma Source language depicted on Figure 5.1, however, is based on first-class while and do-while syntax.

Another source of syntactic diversity in macro ASTs stems from the variety of supported method calls, as illustrated by the following lines.

bar(1, 2)           // monomorphic, unqualified
Foo.bar(1, 2)       // monomorphic, qualified
baz[Int](1, 2)      // polymorphic, unqualified
Foo.baz[Int](1, 2)  // polymorphic, qualified

The corresponding ASTs of these method calls look as follows.

Apply(Ident(/*bar*/), /*(1,2)*/)
Apply(Select(Ident(/*Foo*/), /*bar*/), /*(1,2)*/)
Apply(TypeApply(Ident(/*baz*/), /*Int*/), /*(1,2)*/)
Apply(TypeApply(Select(Ident(/*Foo*/), /*baz*/), /*Int*/), /*(1,2)*/)

In each of the above examples, the top-level AST node is Apply and the argument list used as its second child is identical. The child representing the applied method, however, differs based on the method type and the shape of the application. For monomorphic methods, this can be either an Ident node denoting an unqualified method declared on an enclosing instance, or a Select node denoting the selection path of a qualified method. Orthogonally, the method denotation of polymorphic methods is wrapped in a TypeApply node which denotes application of type arguments.

In order to simplify the definition of and reasoning about program transformations, our goal was to remove the syntactic diversity illustrated above from the abstract syntax of the developed eDSL. At the same time, we wanted to remain compatible with the macro AST in order to reuse the macro API whenever possible. As a pragmatic solution, the syntactic forms outlined in Figure 5.1, Figure 6.1 and Figure 6.4 are encoded as virtual nodes. A virtual node is an object which defines a pair of apply and unapply methods which respectively construct and deconstruct a macro AST. For example, the virtual node corresponding to the while syntax in Figure 5.1 has the following form.

object While extends Node {
  def apply(cond: Tree, body: Tree): LabelDef = ...
  def unapply(loop: LabelDef): Option[(Tree, Tree)] = ...
}

This enables convenient Scala syntax for construction and deconstruction of While loops.

val loop = While(cond, body)                 // construct a While loop
loop match { case While(cond, body) => ... } // deconstruct a While loop

Note that the arguments and the return types of the apply and unapply functions in the While object are of type Tree, ensuring that we operate on macro AST values. This

allows us to seamlessly integrate the Scala macro API in the Emma compiler. For example, we can retrieve the Type of the cond AST (which should be Boolean) using cond.tpe.
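As a hedged sketch, the apply and unapply bodies could construct and match the LabelDef encoding of while loops shown earlier in this section roughly as follows (fresh-name generation via api.TermName.fresh and the exact tree shape are assumptions, and the universe types are assumed to be imported as in the rest of the compiler).

object While extends Node {
  def apply(cond: Tree, body: Tree): LabelDef = {
    val label = api.TermName.fresh("while")
    LabelDef(label, Nil,
      If(cond,
        Block(body :: Nil, Apply(Ident(label), Nil)), // body, then jump back
        Literal(Constant(()))))                       // exit the loop
  }

  def unapply(loop: LabelDef): Option[(Tree, Tree)] = loop match {
    case LabelDef(label, Nil,
      If(cond, Block(body :: Nil, Apply(Ident(lbl), Nil)), Literal(Constant(()))))
      if lbl == label => Some((cond, body))
    case _ => None
  }
}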

An important aspect of this implementation strategy is the ability to encode DSL syntax which does not have a natural mapping to macro AST fragments. A good example is the first-class comprehension syntax of Emma Core (see Figure 6.4). Since Scala desugars for-expressions as part of the parsing phase, the corresponding syntax is not available in the macro AST representation. To define a virtual node, in such situations we rely on auxiliary dummy methods. In the case of for-comprehensions, the dummy interface looks as follows.

object ComprehensionSyntax {
  def generator[A, M[_]](in: M[A]): A = ???
  def comprehension[A, M[_]](block: A): M[A] = ???
  def guard(expr: Boolean): Nothing = ???
  def head[A](expr: A): A = ???
}

Emma Core syntax can be mapped to Scala source code fragments utilizing the dummy methods listed above. For example, the Emma Core for-comprehension

for { x <- { xs }; y <- { ys } } yield { val z = (x, y); z }

is encoded by the following Scala source code fragment.

comprehension[(Int, Int), Bag] {
  val x = generator[Int, Bag] { xs }
  val y = generator[Int, Bag] { ys }
  head {
    val z = (x, y)
    z
  }
}

The encoding allows us to define the apply and unapply methods of the virtual nodes corresponding to the syntactic forms of Emma Core for-comprehensions. For example, the Generator node for the x <- { xs } generator will construct and match the macro AST corresponding to the second line of the above Scala encoding.

8.4 Tree Manipulation API

We provide a fluent functional API for transforming and traversing (shortened as transversing) Scala ASTs, inspired by the Traversal Query Language (TQL) which has been recently proposed for Scalameta [BB15]. The API was designed with the following goals. First,

avoid explicit recursion by decoupling the matching rules from the transversal strategy. While the rules are always specific to the concrete transversal, transversal strategies can be abstracted as a finite set of available options supported by the API. Second, avoid use of mutable state. Instead, provide infrastructure for deriving tree attributes and an API to expose those to the matching rules during transversal.

8.4.1 Strategies

The core of the transversal API is built on top of the strategies described in [vdBKV03]. A transversal strategy is uniquely determined as a point in a two-dimensional space.

The first dimension determines the order in which nodes are visited. With a top-down strategy parents are visited before their children. Conversely, with a bottom-up strategy children are visited before their parents.

When a node is visited, the transversal strategy attempts to match it against one of the available rules. The second dimension determines the continuation criteria in the case of a match. The continue strategy continues with the next node in the selected order. The break strategy stops the transversal process after the first rule match. Finally, the exhaust strategy recursively applies all matching rules at a given node and then continues to the next node in the selected order.

For example, the strategy for the anf transformation from Figure 6.2 is (bottom-up, continue), the dscf transformation from Figure 6.3 uses (top-down, continue), and the normalizeM transformation from Figure 6.6 uses (bottom-up, exhaust).

The API offers fluent syntax for transversal construction. For example, the definition of the anf transformation has the following shape (code snippets for attribute and rule declarations are given in the next sections).

val anf = api
  .BottomUp.continue
  // zero or more attribute declarations
  // rule declaration

8.4.2 Attributes

All transversal strategies can operate on attributed trees. Declared attributes are attached to each node in the tree and can be made available to the matching rules during transversal. Depending on the derivation strategy, attributes can be synthesized, inherited, or accumulated. In each of these cases, the attribute is defined in terms of a partial function a : Tree → A and a monoid M = (A, ⊙, 1) with carrier coinciding with the attribute type A.
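For illustration, the monoid abstraction could have the following shape, instantiated with the vector-concatenation and map-merge monoids used by the attribute examples below (names and signatures are assumptions).

trait Monoid[A] {
  val unit: A            // the 1 element
  def op(x: A, y: A): A  // the binary operation ⊙
}

object Monoids {
  // Vector concatenation (used by the ancestor-collecting attribute below).
  implicit def vectorMonoid[A]: Monoid[Vector[A]] = new Monoid[Vector[A]] {
    val unit = Vector.empty[A]
    def op(x: Vector[A], y: Vector[A]) = x ++ y
  }

  // Key-wise merging of count maps (used by the assignment-counting attribute).
  implicit def countMonoid[K]: Monoid[Map[K, Int]] = new Monoid[Map[K, Int]] {
    val unit = Map.empty[K, Int]
    def op(x: Map[K, Int], y: Map[K, Int]) =
      y.foldLeft(x) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0) + v) }
  }
}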


Inherited attributes are derived in a top-down manner along the recursion path of the transversed tree. Let ti, 1 ≤ i ≤ n, be the current path from the root of the tree t1 to the currently visited node tn. Set xi = a(ti) if a is defined at ti, or xi = 1 otherwise. The value of the inherited attribute at node tn is defined by the following equation.

inh_a^M⟦tn⟧ = x1 ⊙ ... ⊙ x_{n-1}    (Attr-Inh)

For example, consider a : Tree → Option[Tree] to be the (total) function t ↦ Some(t). If 1 = None and the rule selects the left-most element of the evaluated term, inh_a^M⟦tn⟧ wraps the root of the traversed tree and is None if and only if tn is the root. Conversely, if the rule selects the right-most element, inh_a^M⟦tn⟧ denotes the parent of tn and is None if and only if tn is the root. To illustrate the associated API, consider the following code which declares an inherited attribute collecting all ancestors of the current node (the vector concatenation monoid is passed as implicit argument and is not shown).

.inherit(Attr.collect[Vector, Tree] { case ancestor => ancestor })

Synthesized attributes are derived in a bottom-up manner from the current subtree. A synthesized attribute is conceptually similar to a catamorphism. Let tn be the current node and ti, 1 ≤ i < n, be its children. As before, set xi = a(ti) if a is defined at ti, or xi = 1 otherwise. The value of the synthesized attribute at node tn is defined by the following equation.

syn_a^M⟦tn⟧ = x1 ⊙ ... ⊙ xn    (Attr-Syn)

Synthesized attributes are often maps of key-value pairs. The associated monoid operation merges two maps in a suitable way, e.g. by summing up values with the same key. For example, the following code snippet declares a synthesized attribute which counts the number of assignments for each variable in the associated subtree (as above, the monoid is passed implicitly to the synthesize function call).

.synthesize(Attr.group { case VarMut(sym, _) => sym -> 1 })

Finally, accumulated attributes are derived along the visiting trace determined by the selected transversal strategy. Let ti, 1 ≤ i < n, be the trace of nodes visited so far and tn be the current node. Set xi = a(ti) if a is defined at ti, or xi = 1 otherwise. The value of the accumulated attribute at node tn is defined by the following equation.

acc_a^M⟦tn⟧ = x1 ⊙ ... ⊙ xn    (Attr-Acc)

For example, in conjunction with a top-down strategy the following code snippet will keep track of all method parameters seen so far.

.accumulate { case DefDef(_, _, paramss, _) =>
  for (ParDef(sym, _) <- paramss.flatten) yield sym
}

The lists of parameter symbols emitted by the supplied partial function are concatenated by the default list monoid (passed implicitly to accumulate as with the examples above).

The attribute API is typed. The type of the transversal strategy is parametric, with type parameters A, I, and S denoting heterogeneous lists of its accumulated, inherited, and synthesized attributes. This makes it possible to expose the declared attributes to the transversal rules in a type-safe manner.

8.4.3 Rules

Transversal rules are defined as a partial callback function and attached to a transversal declaration using a suitable method call. The canonical forms traverse and transform accept a callback function of type Tree → Unit and Tree → Tree, respectively. Alternatively, the API also offers the forms traverseWith and transformWith. The argument type in these variants is changed from Tree to Attr[A, I, S], where Attr is defined as follows.

case class Attr[A, I, S](tree: Tree, acc: A, inh: I, syn: S)

Callbacks used with traverseWith and transformWith therefore have access to the attributes associated with the matched tree nodes. In addition, the Attr object provides projections such as Attr.inh that select only one type of attributes along with the matched tree node. A syntactically complete example of a transformation based on the transversal API is shown in Figure 8.1.

8.5 Code Modularity and Testing Infrastructure

One of the key challenges of the macro-based implementation of Emma was to ensure that (i) code is organized in a modular manner and (ii) individual modules could be tested and integrated with off-the-shelf libraries and tools.

To achieve that, we made use of the fact that the macro-based and the reflection-based APIs implement the Universe trait and differ only in the path from which the Universe methods and types are imported (see Section 8.2.2). To abstract from the concrete API implementation, the Emma compiler structure is based on the cake pattern [Hun13]. At the top of the hierarchy is a trait which defines an abstract Universe member.


trait Common {
  val u: Universe
  ...
}

Emma compiler modules are defined as traits inheriting from Common. Smaller modules can be aggregated into bigger ones using intermediate traits. For example, the Comprehension trait aggregates the logic for re- and desugaring, normalization, and combination of comprehensions, and is therefore defined as follows.

private[compiler] trait Comprehension extends Common
  with Combination
  with Normalize
  with ReDeSugar { this: Core =>
  ...
}

The this : Core type constraint indicates that the Comprehension implementation depends on methods and types provided by the Core module. At the top level, the modules are aggregated by a Compiler trait which has two implementations. The MacroCompiler is used as a base for the onFlink and onSpark macro definitions outlined in Section 6.5. The RuntimeCompiler is used for testing, as discussed below.

The RuntimeCompiler facilitates writing tests for specific transformations against snippets of code which are directly defined in the source code of the enclosing test class. The general layout of a test class looks as follows. First, construct a test pipeline and a reference pipeline using the API exposed by the RuntimeCompiler instance. Second, reify a code snippet representing the test input and pass the resulting Scala AST to the test pipeline. Third, reify a code snippet representing the expected output and pass the resulting Scala AST to the reference pipeline. Fourth, ensure that the results of the two pipelines are equal up to a renaming of the val and var definitions. As an example, consider the following case from the anf test.

// actual AST
val act = anfPipeline(reify {
  15 * t._1
})

// expected AST
val exp = idPipeline(reify {
  val x$1: this.t.type = t
  val x$2 = x$1._1
  val x$3 = 15 * x$2
  x$3
})


// check for equality
act shouldBe alphaEqTo(exp)

This design provides a flexible foundation for future research based on the DSL repre- sentations discussed in this thesis. For example, the RuntimeCompiler can be used in conjunction with the MacroCompiler in order to explore data-dependent optimizations such as cost-based join-order estimation in the comprehension combination phase.


val anf = api.BottomUp
  // Prepend owner symbol to inherited attributes
  .withOwner
  // Prepend a Boolean flag which marks type trees
  // to inherited attributes
  .inherit { case tree => tree.isType }
  // Transform the attributed tree
  .transformWith {
    // Bypass type trees
    case Attr.inh(tree, true :: _) => tree

    // Simplify receiver & arguments
    case Attr.inh(
      call @ src.DefCall(rcv, m, tps, argss),
      _ :: owner :: Nil) =>

      // Unnest subexpressions from receiver
      val (init, rcv1) = rcv match {
        case Some(src.Block(stats, expr)) => (stats, Some(expr))
        case Some(_) => (Seq.empty, rcv)
        case None => (Seq.empty, None)
      }

      // Unnest subexpressions from arguments
      val stats = init ++ argss.flatten.flatMap {
        case src.Block(stats, _) => stats
        case _ => Seq.empty
      }

      val argss1 = argss map (_ map {
        case src.Block(_, arg) => arg
        case arg => arg
      })

      // Assign the final result to a fresh val
      val nme = api.TermName.fresh(m.name)
      val lhs = api.ValSym(owner, nme, call.tpe)
      val rhs = core.DefCall(rcv1, m, tps, argss1)
      val dfn = core.ValDef(lhs, rhs)
      val ref = core.ValRef(lhs)

      // Wrap modified code in a block and return
      src.Block(stats :+ dfn)(ref)
  }

Figure 8.1: A simplified transformation example that brings method calls to ANF form – subexpressions in the method receiver rcv and the argument terms argss are assigned to fresh vals.


9 Evaluation

To assess the benefits of the optimizations from Chapter 7 we designed and conducted a set of experiments which we present and discuss in this chapter.

We ran the experiments on a local cluster consisting of a dedicated master and 8 worker nodes. Each worker was equipped with two AMD Opteron 6128 CPUs (a total of 16 cores running at 2.0 GHz), 32 GiB of RAM, and an Intel 82576 gigabit Ethernet adapter. The machines were connected with a Cisco 2960S switch. As dataflow backends we used Spark 2.2.0 and Flink 1.4.0 – the latest versions as of the date of execution. Each backend was configured to allocate 18 GiB of heap memory per worker and reserve 50% of this memory for its managed runtime. Input and output data were stored in an HDFS 2.7.1 instance running on the same set of nodes.

Each of the experiments discussed in Section 9.1 through Section 9.4 was executed five times. The associated bar charts in Figure 9.1 through Figure 9.4 indicate the median run and the error bars denote the second fastest and second slowest runs. The experiments discussed in Section 9.5 were executed three times and the bars in Figure 9.5 indicate the median run.

9.1 Effects of Fold-Group Fusion

The first experiment demonstrates the effects of the fold-group fusion (FGF) optimization presented in Section 7.2. To assess those, we executed one iteration of the k-means clustering algorithm [For65]. As input data, we used synthetic datasets consisting of points sampled from one of k multivariate Gaussian distributions. The data generator was parameterizable in the centroid distribution function and in the dimensionality of the generated points. In total, we ran four experiments, using both uniform and Zipf distribution on each of the two backends. In each experiment, we scaled the dimensionality of the data points from 10 to 40 in a geometric progression. For every dataset, we compared the runtime of two Emma-based implementations with fold-group-fusion


[Bar charts: Spark (top) and Flink (bottom) runtimes in seconds for the uniform and Zipf centroid distributions over 10, 20, and 40 dimensions, comparing Emma (-FGF), Emma (+FGF), and the RDD / Dataset / DataSet baselines.]

Figure 9.1: Effects of fold-group fusion (FGF) in Flink and Spark.

turned off (-FGF) and on (+FGF). As a baseline, we used a DataSet implementation for Flink and Dataset and RDD implementations for Spark.

The experiment results are presented in Figure 9.1. In the Emma (-FGF) version, the k means are computed naïvely with a reduceByKey ∘ groupBy operator chain in Flink and a map ∘ groupBy chain in Spark. Consequently, all points associated with the same centroid must be shuffled to a single machine where their mean is then computed. The total runtime therefore is determined by the size of the largest group. In contrast, when FGF is enabled, the sum and the count of all points associated with the same centroid are computed in parallel, using a reduceByKey operator in Spark and a reduce ∘ groupBy operator chain in Flink. In the associated shuffle step we only need to transfer one partial result per group and per worker. The total runtime therefore does not depend on the group size. This effect is demonstrated by the experiment results. In both backends, the runtime of the Emma (-FGF) implementation grows as we increase the dimensionality of the data. For the Emma (+FGF) and the baseline variants, on the other hand, the runtime is not affected by the underlying centroid distribution and is only marginally influenced by changes in data dimensionality. The code generated by Emma (+FGF) therefore performs on par with the code written directly against the Flink and the Spark APIs. The speedup of Emma (+FGF) with respect to Emma (-FGF) varies. In Flink, it ranges from 37% to 65% (Uniform) and from 72% to 88% (Zipf). In Spark, the ranges are from 14% to 26% (Uniform) and from 44% to 70% (Zipf). The effect grows stronger if the underlying centroid distribution is skewed, as this skew is reflected in the cardinality of the aggregated groups.


[Bar charts: Flink (left) and Spark (right) runtimes in seconds, comparing Emma (-CCI), Emma (+CCI), and the DataSet / RDD and Dataset baselines.]

Figure 9.2: Effects of cache-call insertion (CCI) in Flink and Spark.

9.2 Effects of Cache-Call Insertion

The second experiment demonstrates the benefits of the cache-call insertion (CCI) optimization proposed in Section 7.3.

As input data, we used a snapshot of the Internet Movie Database (IMDb)1 which was subsequently parsed and saved as structured collections of JSON objects. The workload performs the following computations. In the first step, we perform a three-way join between movies, countries, and technical information, and select information about German titles categorized as “motion picture” which were released in the 1990s. In the second step, we filter six subsets of these titles based on different criteria (e.g., titles with aspect ratio 16:9 or titles shot on an Arri film camera) and collect the qualifying entries on the workload driver. In the Emma (+CCI) and the baseline variants, the collection obtained after the first step is cached, and in the Emma (-CCI) variant it is not.

The results are depicted on Figure 9.2. As in the previous experiment, the optimized Emma version is comparable with the baseline versions implemented directly on top of the backend APIs. Compared to the naïve version, the optimized variants achieve a speedup of 26% for Flink and 45% for Spark. The difference is due to the underlying caching mechanism. Spark has first-class support for caching and keeps cached collections directly in memory. Flink, on the other hand, does not support first-class caching. Consequently, the FlinkOps.cache primitive inserted by the Emma compiler is implemented naïvely by simply writing the cached distributed collection to HDFS. Subsequent reads of cached collections are therefore more expensive in Flink than in Spark. Nevertheless, the CCI optimization results in a significant improvement for both backends.

9.3 Effects of Relational Algebra Specialization

The next experiment investigates the benefits of relational algebra specialization (RAS) – specializing map, withFilter, and join calls in terms of the relational algebra operators select, project, and join provided by the Spark Dataset API (see Section 7.1.3).

1ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata/


[Bar charts: runtimes in seconds for the gender-year-credits (left) and sharing-roles (right) workloads over JSON and Parquet inputs, comparing Emma (-RAS), Emma (+RAS), RDD, Dataset, and Spark SQL.]

Figure 9.3: Effects of relational algebra specialization (RAS) in Spark.

As before, the experiments are based on the IMDb snapshot. To quantify the performance improvement of RAS we use two different workloads. The ‘gender-year-credits’ workload represents a simple three-way join where people and movies are connected via credits with credit type ‘actor’. We emit pairs of (person-gender, movie-year) values. The ‘sharing-roles’ workload looks for pairs of actors who have played the same character in two different movies and co-starred in a third movie. For example, Michael Caine (in “Sherlock Holmes, Without a Clue”) and Roger Moore (in “Sherlock Holmes in New York”) have both played Sherlock Holmes and acted together in “New York, Bullseye!”. We include Spark SQL next to the RDD and Dataset baseline implementations as well as a more efficient columnar format (Parquet) next to the string-based JSON representation.

The results for the two workloads are depicted on Figure 9.3. In all four experiments, the Emma (-RAS) variant performs on par with the RDD implementation, and the optimized Emma (+RAS) variant is comparable with the Dataset implementation. Notably, the speedup for Parquet files (32% and 48%) is higher than the one for JSON (16% and 19%). The difference is explained by the more aggressive optimizations performed by Spark in the first case. Dataset dataflows which read data from Parquet can utilize Parquet’s columnar format and push adjacent select and project operators directly to the Parquet reader. In Emma, as a result of the combine translation scheme from Figure 7.1, local predicates are pushed directly on top of the base collections. A subsequent RAS therefore enables the selection push-down performed by Spark. However, the current combine scheme does not automatically insert projections. Consequently, in the Parquet experiments the compiled for-comprehensions in the Emma (+RAS) variants are respectively 15% and 20% slower than the Spark SQL implementation, which enables both selection and projection push-down. To narrow this gap, the combine translation scheme has to be augmented with a suitable projection rule.

9.4 Effects of Native Iteration Specialization

The last optimization which we investigate in isolation is the Flink-specific native iterations specialization (NIS) proposed in Section 7.4.

Like the CCI and RAS experiments, the NIS experiment is also based on the IMDb snapshot.


[Bar charts: Figure 9.4 shows Flink runtimes in seconds for Emma (-NIS), Emma (+NIS), and the DataSet baseline; Figure 9.5 shows Flink and Spark runtimes for the +ALL, -RAS, -FGF, -CCI, -NIS, and -ALL variants.]

Figure 9.4: Effects of native iterations specialization (NIS) in Flink.

Figure 9.5: Cumulative optimization effects for the NOMAD use case.

The workload first selects pairs of IDs identifying directors billed for the same movie for titles released between 1990 and 2010. The resulting relation is treated as a set of edges, and a subsequent iterative dataflow computes the first five steps of the connected components algorithm proposed by Ewen et al. in [ETKM12] (the variant we use is Fixpoint-CC from Table 1). The algorithm initializes each vertex with its own component ID. In every iteration, each vertex first sends a message with its current component ID to all its neighbors, and then updates its own component ID to the minimum value of all received messages.

The Emma (-NIS) variant does not specialize the connected components loop as a Flink native iteration, but still performs the CCI optimization. The loop-independent collection of edges and the component assignments at the end of each iteration are consequently saved to HDFS by the inserted FlinkOps.cache calls. In the Emma (+NIS) variant, CCI is not needed as the Flink runtime manages the iteration state and loop-invariant dataflows in memory. Consequently, the Emma (+NIS) variant and the baseline DataSet implementation are 75% faster than the Emma (-NIS) variant.

9.5 Cumulative Effects

Finally, we investigate the cumulative effects of all optimizations using an end-to-end data analytics pipeline from a real-world use case.

The workload for this experiment is based on data obtained from the NOMAD repository2. The NOMAD repository contains a large archive of output data from computer simulations for material science in a common hierarchical format [GCL`16]. For the purposes of our experiment, we downloaded the complete NOMAD archive and normalized the original hierarchical structure as a set of CSV files. The normalized files contain data about (1) the simulated physical systems and (2) the positions of the simulated atoms, as well as meta-information about (3) periodic dimensions and (4) simulation cells. The workload pipeline looks as follows. In the first step, we join information from the four

2https://nomad-repository.eu/


CSV sources listed above and apply a Radial Distribution Function (RDF) conversion which yields a collection of dense vectors characterizing the result of each simulation. In the second step, we execute n runs of the first m iterations of a k-means clustering algorithm. We keep track of the optimal solution obtained at the end of each run and save it to HDFS at the end of the pipeline. To obtain sufficiently small numbers for a single experiment run, for the purposes of the presented experiment we choose n = 2, m = 2 and k = 3. In practice, however, the values for n and m will likely be higher. The workload is encoded as an Emma program and compiled in 5 different variants for each of the two supported backends. The +ALL (-ALL) variant denotes a compilation where all optimizations are enabled (disabled). The -OPT variant denotes a compilation where only the OPT optimization is disabled.

The results of the experiment are depicted in Figure 9.5. The Spark runtimes vary between 346s for the +ALL variant and 3421s for -ALL. In Flink, +ALL achieves 413s, while -CCI is slowest with 1186s (the -ALL variant did not finish successfully). For both scenarios, the largest penalty comes from omitting the CCI optimization – 88% for Spark and 66% for Flink. With FGF disabled, the slowdown is 21% for Spark and 40% for Flink. Finally, omitting RAS results in an 18% slowdown for Spark, and omitting NIS in a 9% slowdown for Flink.

The results suggest that in terms of performance gain the most important optimization is CCI. We believe that this is characteristic of all data analytics pipelines where feature conversion and vectorization are performed by a CPU-intensive computation in a map operator. In such scenarios, feature conversion is usually the last step before an iterative part of the program which performs cross-validation, grid search, an iterative ML method, or a nested combination of those. If the resulting collection of feature vectors is not cached, feature conversion is re-computed for each inner iteration. In the NOMAD pipeline, for example, this results in n · m = 4 repeated computations.
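The following Scala sketch shows this nesting pattern in miniature, with made-up helper names: the feature collection is referenced inside both loops, so without a cache call a lazily evaluated dataflow would repeat the expensive conversion once per inner iteration, which is exactly the re-computation that CCI avoids.

```scala
// Illustrative skeleton of the nested-loop structure discussed above.
// All names are invented for the sketch; the point is only that `features`
// is used in every one of the n * m inner iterations.
def expensiveConversion(row: String): Vector[Double] =
  Vector.tabulate(8)(i => (row.length * (i + 1)).toDouble) // stand-in for the RDF conversion

def nestedRuns(rawRows: Seq[String], n: Int, m: Int): Double = {
  val features = rawRows.map(expensiveConversion) // the collection CCI would cache
  var best = Double.MaxValue
  for (_ <- 1 to n; _ <- 1 to m) {
    // Placeholder for one k-means iteration: any use of `features` here is
    // where a lazily evaluated dataflow would re-run the conversion.
    val score = features.map(_.sum).sum
    best = math.min(best, score)
  }
  best
}
```

In plain Scala the collection is materialized eagerly, so the sketch does not actually repeat the work; the repetition arises only when `features` denotes a lazily evaluated dataflow, as in Spark or Flink without caching.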

10 Related Work

This chapter reviews work related to the concepts and ideas presented in this thesis. Section 10.1 discusses work related to the mathematical foundations presented in Chapter 4. Section 10.2 discusses related DSLs.

10.1 Formal Foundations

The use of monads to structure and reason about computer programs dates back to Moggi [Mog91], who suggests them as a referentially transparent framework for modeling computations with effects. Comprehensions – a convenient, declarative syntax that can be defined in terms of a monad (essentially the MC scheme from Section 4.1.6) – were introduced by Wadler [Wad92, Wad95]. Using comprehensions as a unifying foundation for database query languages for different bulk types (i.e. the types discussed in Section 4.1.4) can be traced back to the work of Trinder [TW89, DAW91, Tri91]. Notably, following unpublished work by Wadler [Wad90], Trinder suggests extending the monad with functions zero and combine to a structure called ringad. While Trinder’s definition requires only that zero is a unit of combine, additionally requiring combine to be associative and commutative yields the structure used as the formal foundation for the API presented in Section 5.3.
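The structure described here can be sketched as a Scala type class. The trait and method names below are illustrative rather than Emma's actual API, and the additional laws are stated as comments, since they cannot be enforced by the type system.

```scala
// Minimal sketch of a "ringad"-like structure: a monad extended with
// `zero` and `combine`. Names are stand-ins, not Emma's interface.
trait CollectionMonad[T[_]] {
  def unit[A](a: A): T[A]                         // singleton constructor
  def flatMap[A, B](ta: T[A])(f: A => T[B]): T[B]

  def zero[A]: T[A]                               // empty collection
  def combine[A](x: T[A], y: T[A]): T[A]          // union of two collections
  // Laws (not checked by the compiler):
  //   combine(zero, x) == x == combine(x, zero)          (zero is a unit)
  //   combine(combine(x, y), z) == combine(x, combine(y, z))  (associativity)
  //   combine(x, y) == combine(y, x)                     (commutativity, bag semantics)
}
```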

Buneman and Tannen start from the basic notion of catamorphisms (i.e. structural recursion). They advocate that query languages should be constructed from the primitive notion of set catamorphisms [TBN91] and show that existing set-valued query languages can be formalized based on that notion and generalized to other collection types such as lists and bags [BNTW95]. These ideas are demonstrated by the Comprehension Language (CL) – a functional programming language for collection types based on comprehensions [BLS`94]. Notably, the IR proposed for CL does not make explicit use of a collection type monad – comprehension syntax in CL is defined directly in terms of catamorphisms on collections in union representation.

Similarly, Fegaras starts with the basic notion of monoids and proposes a core calculus

which defines comprehension syntax directly in terms of monoid catamorphisms [Feg94]. Fegaras and Mayer then show that the monoid calculus can be used to define the Object Query Language (OQL) – a standardized language for object-oriented DBMSs [FM95].

Despite some naming and notational differences, the formal development suggested in these two lines of work is quite similar. For example, the collection types associated with the sr_comb structural recursion scheme in [BNTW95] and the free monoids used in [Feg94] coincide. In addition, the catamorphic definitions of ext (Section 2.3 in [BNTW95]) and hom (Definition 5 in [Feg94]) both correspond to the higher-order function flatmap. Using the notation from Chapter 4, for a collection type T in union representation and a function f : A → T B this definition looks as follows.

(Uni-Flatmap)    flatmap f = ⦅ emp_B, f, uni_B ⦆

The development closest to the exposition in Chapter 4 of this thesis is given by Grust [GS99, Gru99]. Similar to both Buneman and Fegaras, he starts from the basic notion of catamorphisms. Compared to the work discussed above, however, the work of Grust differs in the following aspects. First, he relies on collections in insert representation (although the union representation is discussed briefly in [Gru99]). Second, he explicitly derives a monad with zero from the associated algebra and uses it to define comprehension syntax using a translation scheme similar to the one suggested by Wadler. However, in contrast to the monad comprehension scheme from [Wad92], the one given by Grust supports generators ranging over multiple collection types, employing an implicit type coercion approach similar to the one proposed by Fegaras in [Feg94]. Third, Grust argues that comprehensions are a useful representation for defining and reasoning about optimizing program transformations. As such, he suggests that comprehensions should be part of the abstract syntax of an optimizing query compiler. Finally, he also suggests a compilation strategy based on rule-based translation of comprehensions using comprehension combinators.
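A minimal Scala rendering of this construction may help: the data type below encodes a bag in union representation, fold is the associated structural recursion scheme, and flatmap is obtained by folding, mirroring the shape of the Uni-Flatmap definition. All constructor and function names are illustrative stand-ins for the notation of Chapter 4, not Emma's implementation.

```scala
// A bag in union representation and its fold (structural recursion scheme).
sealed trait Bag[+A]
case object Emp extends Bag[Nothing]                          // empty bag
final case class Sng[A](x: A) extends Bag[A]                  // singleton bag
final case class Uni[A](l: Bag[A], r: Bag[A]) extends Bag[A]  // bag union

// Substitute e for Emp, s for Sng, and u for Uni.
def fold[A, B](bag: Bag[A])(e: B, s: A => B, u: (B, B) => B): B = bag match {
  case Emp       => e
  case Sng(x)    => s(x)
  case Uni(l, r) => u(fold(l)(e, s, u), fold(r)(e, s, u))
}

// flatmap as a fold: empty maps to empty, each singleton is replaced by f(x),
// and unions are rebuilt with Uni.
def flatmap[A, B](bag: Bag[A])(f: A => Bag[B]): Bag[B] =
  fold(bag)(Emp: Bag[B], f, Uni(_, _))
```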

The formal foundations used in this thesis follow Grust in all but the first aspect, where we opt for the union representation similar to Buneman and Fegaras. Our choice is motivated by the parallel nature of the underlying execution architectures. The intricate connection between Uni-Sign and its associated recursion scheme Uni-Fold for structuring parallel programs has already been highlighted by Skillicorn [Ski93a, Ski93b] and Steele [Jr.09]. Our contribution is in identifying the relevance of this methodology for the design of APIs and DSLs targeting parallel dataflow engines. In addition, extending a comprehension-based IR such as Emma Core with support for control-flow fills the semantic gap between previous work and typical use-cases for engines such as Spark or Flink.

Recently, Gibbons brought back attention to [Wad90] in a survey article [Gib16]. He argues that ringads and ringad comprehensions represent a better foundation and query

notation language than monads. Although we do not follow the ringad nomenclature, the work in this thesis obviously supports this claim. In addition, we highlight the connection between Uni-Asso and Uni-Comm in the ringad definition and data-parallel execution.

10.2 Related DSLs

DSLs related to Emma can be categorized in a two-dimensional space. The first dimension denotes the implementation strategy according to the classification scheme from Figure 2.1. The second dimension classifies DSLs based on their execution backend – a parallel dataflow engine, an RDBMS, or a custom runtime. In this section, we review related DSLs with Manhattan distance at most one – that is, stand-alone DSLs with a parallel dataflow backend and embedded DSLs with an arbitrary backend. To the best of our knowledge, Emma is the first quotation-based eDSL that targets parallel dataflow engines.

10.2.1 sDSLs Targeting Parallel Dataflow Engines

Pig [ORS`08] and Jaql [BEG`11] are stand-alone scripting DSLs that compile to a cascade of Hadoop MapReduce jobs. Hive [TSJ`09] provides warehousing capabilities on top of Hadoop or Spark using a SQL-like DSL, and SparkSQL [AXL`15] is a SQL layer on top of Spark developed as part of the Spark project. SCOPE [ZBW`12] is another SQL-like DSL developed by Microsoft which runs on a modified version of the Dryad dataflow engine [IBY`07]. Stand-alone DSLs such as the ones listed above provide automatic optimization (such as join order optimization and algorithm selection) at the cost of more limited expressive power. In particular, they lack first-class support for control flow and do not treat UDFs as first-class citizens. Optimizations related to these syntactic elements are therefore designed in an ad-hoc manner. For example, PeriSCOPE [GFC`12] optimizes SCOPE UDFs, but relies on ILSpy1 for bytecode decompilation and Cecil2 for code inspection and code synthesis. In contrast, the Emma Core IR presented in this thesis integrates both control flow and UDFs as first-class citizens. This enables defining and reasoning about optimizations related to these constructs in a unified methodological framework. At the same time, SQL-like optimizations can be integrated on top of the first-class comprehension syntax used in Emma Core.

10.2.2 eDSLs Targeting RDBMS Engines

The most popular example of an eDSL targeting RDBMS engines is Microsoft’s LINQ [MBB06]. Database-Supported Haskell (DSH) [GGSW10] is an eDSL that facilitates database-supported execution of Haskell programs through the Ferry programming language [GMRS09]. As with stand-alone DSLs, the main difference between those

1http://wiki.sharpdevelop.net/ilspy.ashx 2http://www.mono-project.com/Cecil

languages and Emma is the scope of their syntax and IR. LINQ’s syntax and IR are based on chaining of methods defined by an IQueryable interface. DSH is based on Haskell list comprehensions desugared by the method suggested by Jones and Wadler [JW07]. In particular, neither LINQ nor DSH lift control-flow constructs from the host language into their respective IRs. In addition, because they target RDBMS engines, these eDSLs restrict the set of host language expressions that can be used in selection and projection clauses to a subset that can be mapped to SQL. In contrast, Emma does not enforce such a restriction, as host-language UDFs are natively supported by the targeted parallel dataflow engines. Nevertheless, the similarity between SQL-based eDSLs and Emma deserves further investigation. In particular, transferring avalanche-safety [GRS10, UG15] and normalization [CLW13] results obtained in this space to Emma Core is likely to further improve the runtime performance of compiled Emma programs.

10.2.3 eDSLs Targeting Parallel Dataflow Engines

The eDSLs exposed by the Spark and Flink systems and their problems are discussed in detail in Section 2.3. A number of similar system-independent eDSLs have also been proposed. FlumeJava [CRP`10] and Cascading3 provide an abstraction API for dataflow graph assembly with pluggable dataflow engines and a dedicated execution planner. Similarly, Summingbird [BROL14] and Apache Beam4 (an open-source descendant of the Dataflow Model proposed by Google [ABC`15]) provide a unified API for stream and batch data processing which is likewise decoupled from the execution backend. In all of the above examples, DSL terms are delimited based on their type. Consequently, they suffer from the deficiencies associated with the Flink and Spark eDSLs illustrated in Section 2.3.

Jet [AJRO12] is an LMS-based eDSL which supports multiple backends (e.g. Spark, Hadoop) and performs optimizations such as operator fusion and projection insertion. However, the Jet API is based on a distributed collection (DColl) which resembles Spark’s RDD more than Emma’s Bag interface. In particular, the DColl relies on explicit join and cache operators and lacks optimizations which introduce those automatically.

The Data Intensive Query Language (DIQL) [FI17] is a SQL-like Scala eDSL. DIQL is based on monoids and monoid homomorphisms and therefore seems closest to the ideas presented in this thesis. A notable difference between DIQL and Emma is in their control-flow model. DIQL relies on a custom repeat construct, while Emma supports general-purpose while and do-while loops. In addition, DIQL’s frontend is based on a custom string interpolator. Consequently, DIQL programs are specified as strings and therefore do not enjoy the linguistic reuse and tooling benefits of the quotation-based delimiting advocated by Emma.

3https://www.cascading.org/ 4https://beam.apache.org/


10.2.4 eDSLs with Custom Runtimes

Delite [SBL`14] is a compiler framework for the development of data analytics eDSLs targeting parallel heterogeneous hardware. Delite’s IR is based on functional primitives such as zipWith, map and reduce, and Delite eDSLs are defined in terms of these primitives. The framework compiles eDSL programs to executable kernels using an LMS-like staging approach, and schedules these kernels with a purpose-built runtime. Implementing Emma on top of Delite requires (a) defining Emma Core in terms of Delite’s IR and (b) adding support for Flink and Spark kernels to the Delite runtime. Since Delite is based on LMS, however, such an implementation would suffer from the issues outlined in Section 8.2.1.

Another Scala-based eDSL for unified data analytics is the AL language proposed by Luong et al. [LHL17]. AL programs are translated to a comprehension-based IR and executed by a dedicated runtime employing just-in-time (JIT) compilation and parallel for-loop generation for the comprehensions found in the IR. Similar to AL, the Emma IR uses monad comprehensions as a starting point for compiler optimizations. However, Emma Core also emphasizes the importance of control-flow primitives that cannot be translated to comprehensions. In addition, as with DIQL, AL’s frontend is based on a custom string interpolator and suffers from the same limitations.


11 Conclusions and Future Work

State-of-the-art parallel dataflow engines such as Flink and Spark expose various eDSLs for distributed collection processing, e.g. the DataSet DSL in Flink and the RDD DSL in Spark. We identified and showcased a number of limitations shared between these eDSLs. A critical look at their design revealed that the common cause of these limitations is that DSL terms are delimited in the enclosing host language program based on their type. Consequently, IRs constructed from type-delimited eDSLs can only reflect host language method calls on these types. The declarativity and the optimization potential attained by type-delimited eDSLs are thereby heavily restricted.

As a solution, we proposed an eDSL design where DSL terms are delimited using quasi-quotation. DSLs embedded in this manner can reuse more host language constructs in their concrete syntax and reflect those in their IR. As a result, quotation-based eDSLs can realize declarative syntax and optimizations traditionally associated with sDSLs such as SQL.
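As a rough illustration of the difference, consider the following Scala sketch. Coll and quoted are made-up stand-ins for a type-delimited collection API and a quotation macro respectively, not the actual Emma interface; in a real implementation the quoted block would be reified and optimized at compile time rather than simply evaluated.

```scala
// Illustrative contrast between the two delimiting strategies, with made-up names.
case class Person(name: String, age: Int)

final case class Coll[A](data: Seq[A]) {
  // (a) Type-delimited: the eDSL's IR can only record these method calls;
  //     the UDFs passed to them and the surrounding control flow stay opaque.
  def filter(p: A => Boolean): Coll[A] = Coll(data.filter(p))
  def map[B](f: A => B): Coll[B]       = Coll(data.map(f))
}

// (b) Quotation-delimited: the entire argument block, including host-language
//     for-comprehensions and control flow, is visible to the DSL compiler.
def quoted[A](block: => A): A = block // stub standing in for a macro

val people = Coll(Seq(Person("Ada", 36), Person("Tim", 12)))

val typeDelimited  = people.filter(_.age >= 18).map(_.name)
val quoteDelimited = quoted { for (p <- people.data if p.age >= 18) yield p.name }
```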

To support our claim, we proposed Emma – a quotation-based DSL embedded in Scala which targets Flink or Spark as co-processors for its distributed collection abstraction. We presented and discussed different aspects of the design and implementation of Emma. As a formal foundation, reflecting the operational semantics of the targeted parallel dataflow engines, we promoted bags in union representation and their associated structural recursion scheme and monad. As a syntactic construct, we promoted bag comprehensions and their realization using Scala’s native for-comprehension syntax. As a basis for compilation, we proposed Emma Core – an IR which extends ANF with first-class comprehensions. To showcase the utility of Emma Core we developed a series of optimizations which solve the issues identified with state-of-the-art eDSLs in the beginning of the thesis. The performance impact of these optimizations for both backends was demonstrated with a range of optimization-specific experiments and an end-to-end data analytics pipeline.

The proposed design can therefore be seen as a first step towards reconciling the utility

of state-of-the-art eDSLs with the declarativity and optimization potential of sDSLs such as SQL. Nevertheless, in addition to collections, modern data analytics applications increasingly rely on data streams and tensors. In current and future work, we plan to extend the Emma API with types and APIs reflecting these abstractions. The primary goals thereby are twofold. First, ensure that different APIs can be composed and nested in an orthogonal manner. For example, a bag can be converted into a tensor (composition), or we can process a stream of tensors (nesting). Second, ensure that the degrees of freedom resulting from this orthogonality do not affect the performance of the compiled program. In particular, this entails designing and implementing optimizing program transformations that target DSL terms representing a mix of the available APIs.

Bibliography

[ABC`15] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. PVLDB, 8(12):1792–1803, 2015.

[AJRO12] Stefan Ackermann, Vojin Jovanovic, Tiark Rompf, and Martin Odersky. Jet: An embedded dsl for high performance big data processing. In International Workshop on End-to-end Management of Big Data (BigData 2012), number EPFL-CONF-181673, 2012.

[AKK`15] Alexander Alexandrov, Andreas Kunft, Asterios Katsifodimos, Felix Schüler, Lauritz Thamsen, Odej Kao, Tobias Herb, and Volker Markl. Implicit parallelism through deep language embedding. In SIGMOD Conference, pages 47–61, 2015.

[AKKM16] Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, and Volker Markl. Implicit parallelism through deep language embedding. SIGMOD Record, 45(1):51–58, 2016.

[AKL`17] Alexander Alexandrov, Georgi Krastev, Bernd Louis, Andreas Salzmann, and Volker Markl. Emma in action: Deklarative datenflüsse für skalierbare datenanalyse. In Bernhard Mitschang, Daniela Nicklas, Frank Leymann, Harald Schöning, Melanie Herschel, Jens Teubner, Theo Härder, Oliver Kopp, and Matthias Wieland, editors, Datenbanksysteme für Business, Technologie und Web (BTW 2017), 17. Fachtagung des GI-Fachbereichs „Datenbanken und Informationssysteme" (DBIS), 6.-10. März 2017, Stuttgart, Germany, Proceedings, volume P-265 of LNI, page 609, 2017.

[App98] Andrew W. Appel. SSA is functional programming. SIGPLAN Notices, 33(4):17–20, 1998.

[ASK`16] Alexander Alexandrov, Andreas Salzmann, Georgi Krastev, Asterios Katsifodimos, and Volker Markl. Emma in action: Declarative dataflows for scalable data analysis. In Fatma Özcan, Georgia Koutrika, and Sam Madden, editors, Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, pages 2073–2076, 2016.

[AXL`15] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. Spark SQL: relational data processing in spark. In Timos K. Sellis, Susan B. Davidson, and Zachary G. Ives, editors, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1383–1394, 2015.

[Bac88] Roland C Backhouse. An exploration of the Bird-Meertens formalism. 1988.

[BB15] Eric Béguet and Eugene Burmako. Traversal query language for scala.meta. Technical report, EPFL, 2015.

[BdM97] Richard S. Bird and Oege de Moor. Algebra of programming. Prentice Hall International series in computer science. 1997.

[BEG`11] Kevin Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Eltabakh, Carl-Christian Kanne, Fatma Ozcan, and Eugene J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. PVLDB, 2011.

[BEH`10] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In Joseph M. Hellerstein, Surajit Chaudhuri, and Mendel Rosenblum, editors, Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC 2010, Indianapolis, Indiana, USA, June 10-11, 2010, pages 119–130, 2010.

[Bir87] Richard S. Bird et al. An introduction to the theory of lists. Logic of programming and calculi of discrete design, 36:5–42, 1987.

[BLS`94] Peter Buneman, Leonid Libkin, Dan Suciu, Val Tannen, and Limsoon Wong. Comprehension Syntax. SIGMOD Record, 1994.

[BNTW95] Peter Buneman, Shamim A. Naqvi, Val Tannen, and Limsoon Wong. Principles of programming with complex objects and collection types. Theor. Comput. Sci., 149(1):3–48, 1995.

[BROL14] P. Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB, 7(13):1441–1451, 2014.


[Bur13] Eugene Burmako. Scala macros: let our powers combine!: on how rich syntax and static types work with metaprogramming. In Proceedings of the 4th Workshop on Scala, SCALA@ECOOP 2013, Montpellier, France, July 2, 2013, pages 3:1–3:10, 2013.

[CB74] Donald D. Chamberlin and Raymond F. Boyce. SEQUEL: A structured english query language. In Randall Rustin, editor, Proceedings of 1974 ACM-SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, Michigan, May 1-3, 1974, 2 Volumes, pages 249–264, 1974.

[CLW13] James Cheney, Sam Lindley, and Philip Wadler. A practical theory of language-integrated query. In Greg Morrisett and Tarmo Uustalu, editors, ACM SIGPLAN International Conference on Functional Programming, ICFP’13, Boston, MA, USA - September 25 - 27, 2013, pages 403–416, 2013.

[Cod70] E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, 1970.

[COD08] Yohann Coppel, Martin Odersky, and Gilles Dubochet. Reflecting scala. Semester project report, Laboratory for Programming Methods. Ecole Polytechnique Federale de Lausanne, Lausanne, Switzerland, 2008.

[CRP`10] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. Flumejava: easy, efficient data-parallel pipelines. In Benjamin G. Zorn and Alexander Aiken, editors, Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2010, Toronto, Ontario, Canada, June 5-10, 2010, pages 363–375, 2010.

[DAW91] David A. Watt and Phil Trinder. Towards a theory of bulk types, 1991.

[DG04] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

[DS00] Olivier Danvy and Ulrik Pagh Schultz. Lambda-dropping: transforming recursive equations into programs with block structure. Theor. Comput. Sci., 248(1-2):243–287, 2000.

[EM85] Hartmut Ehrig and Bernd Mahr. Fundamentals of Algebraic Specification 1: Equations and Initial Semantics, volume 6 of EATCS Monographs on Theoretical Computer Science. 1985.

[ETKM12] Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. Spinning fast iterative data flows. PVLDB, 5(11):1268–1279, 2012.

[Feg94] Leonidas Fegaras. A uniform calculus for collection types. Technical report, Oregon Graduate Institute, 1994.


[FI17] Leonidas Fegaras and Ashiq Imran. Compile-time code generation for embedded data-intensive query languages. Under submission, July 2017.

[FLG12] Leonidas Fegaras, Chengkai Li, and Upa Gupta. An optimization framework for map-reduce queries. In Elke A. Rundensteiner, Volker Markl, Ioana Manolescu, Sihem Amer-Yahia, Felix Naumann, and Ismail Ari, editors, 15th International Conference on Extending Database Technology, EDBT ’12, Berlin, Germany, March 27-30, 2012, Proceedings, pages 26–37, 2012.

[FM95] Leonidas Fegaras and David Maier. Towards an effective calculus for object query languages. In Michael J. Carey and Donovan A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, San Jose, California, May 22-25, 1995, pages 47–58, 1995.

[Fok92] Maarten M. Fokkinga. Law and order in algorithmics. 1992.

[Fok96] Maarten M. Fokkinga. Datatype laws without signatures. Mathematical Structures in Computer Science, 6(1):1–32, 1996.

[For65] E. Forgy. Cluster analysis of multivariate data: Efficiency versus inter- pretability of classification. Biometrics, 21(3):768–769, 1965.

[Fre87] Johann Christoph Freytag. A rule-based view of query optimization. In Umeshwar Dayal and Irving L. Traiger, editors, Proceedings of the Association for Computing Machinery Special Interest Group on Management of Data 1987 Annual Conference, San Francisco, California, May 27-29, 1987, pages 173–180, 1987.

[FSDF93] Cormac Flanagan, Amr Sabry, Bruce F. Duba, and Matthias Felleisen. The essence of compiling with continuations. In Robert Cartwright, editor, Proceedings of the ACM SIGPLAN’93 Conference on Programming Language Design and Implementation (PLDI), Albuquerque, New Mexico, USA, June 23-25, 1993, pages 237–247, 1993.

[GCL`16] Luca M Ghiringhelli, Christian Carbogno, Sergey Levchenko, Fawzi Mohamed, Georg Huhs, Martin Lüders, Micael Oliveira, and Matthias Scheffler. Towards a common format for computational material science data. arXiv preprint arXiv:1607.04738, 2016.

[GFC`12] Zhenyu Guo, Xuepeng Fan, Rishan Chen, Jiaxing Zhang, Hucheng Zhou, Sean McDirmid, Chang Liu, Wei Lin, Jingren Zhou, and Lidong Zhou. Spotting code optimizations in data-parallel pipelines through periscope. In Chandu Thekkath and Amin Vahdat, editors, 10th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2012, Hollywood, CA, USA, October 8-10, 2012, pages 121–133, 2012.


[GGSW10] George Giorgidze, Torsten Grust, Tom Schreiber, and Jeroen Weijers. Haskell boards the ferry - database-supported program execution for haskell. In Jurriaan Hage and Marco T. Morazán, editors, Implementation and Application of Functional Languages - 22nd International Symposium, IFL 2010, Alphen aan den Rijn, The Netherlands, September 1-3, 2010, Revised Selected Papers, volume 6647 of Lecture Notes in Computer Science, pages 1–18, 2010.

[Gib94] Jeremy Gibbons. An introduction to the Bird-Meertens Formalism. Presented at ‘New Zealand Formal Program Development Colloquium’, Hamilton, November 1994.

[Gib16] Jeremy Gibbons. Comprehending ringads - for Phil Wadler, on the occasion of his 60th birthday. In Sam Lindley, Conor McBride, Philip W. Trinder, and Donald Sannella, editors, A List of Successes That Can Change the World - Essays Dedicated to Philip Wadler on the Occasion of His 60th Birthday, volume 9600 of Lecture Notes in Computer Science, pages 132–151, 2016.

[GLSW93] Peter Gassner, Guy M. Lohman, K. Bernhard Schiefer, and Yun Wang. Query optimization in the IBM DB2 family. IEEE Data Eng. Bull., 16(4):4–18, 1993.

[GMRS09] Torsten Grust, Manuel Mayr, Jan Rittinger, and Tom Schreiber. FERRY: database-supported program execution. In Ugur Çetintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, pages 1063– 1066, 2009.

[GRS10] Torsten Grust, Jan Rittinger, and Tom Schreiber. Avalanche-safe LINQ compilation. PVLDB, 3(1):162–172, 2010.

[Gru99] Torsten Grust. Comprehending Queries. PhD thesis, Universität Konstanz, 1999.

[GS99] Torsten Grust and Marc H. Scholl. How to comprehend queries functionally. J. Intell. Inf. Syst., 12(2-3):191–218, 1999.

[GW14] Jeremy Gibbons and Nicolas Wu. Folding domain-specific languages: deep and shallow embeddings (functional pearl). In ICFP, pages 339–347, 2014.

[Har13] Joseph J. Harjung. Reducing formal noise in pact programs. Master’s thesis, TU Berlin, 2013.

[HPS`12] Fabian Hueske, Mathias Peters, Matthias Sax, Astrid Rheinländer, Rico Bergmann, Aljoscha Krettek, and Kostas Tzoumas. Opening the black boxes in data flow optimization. PVLDB, 5(11):1256–1267, 2012.


[Hun13] John Hunt. Cake pattern. In Scala Design Patterns, pages 115–119. 2013.

[HW16] Ralf Hinze and Nicolas Wu. Unifying structured recursion schemes - an extended study. J. Funct. Program., 26:e1, 2016.

[HWG13] Ralf Hinze, Nicolas Wu, and Jeremy Gibbons. Unifying structured recursion schemes. In Greg Morrisett and Tarmo Uustalu, editors, ACM SIGPLAN International Conference on Functional Programming, ICFP’13, Boston, MA, USA - September 25 - 27, 2013, pages 209–220, 2013.

[IBY`07] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Paulo Ferreira, Thomas R. Gross, and Luís Veiga, editors, Proceedings of the 2007 EuroSys Conference, Lisbon, Portugal, March 21-23, 2007, pages 59–72, 2007.

[JGS93] Neil D. Jones, Carsten K. Gomard, and Peter Sestoft. Partial evaluation and automatic program generation. Prentice Hall international series in computer science. 1993.

[Jr.09] Guy L. Steele Jr. Organizing functional code for parallel execution or, foldl and foldr considered slightly harmful. In Graham Hutton and Andrew P. Tolmach, editors, Proceeding of the 14th ACM SIGPLAN international conference on Functional programming, ICFP 2009, Edinburgh, Scotland, UK, August 31 - September 2, 2009, pages 1–2, 2009.

[JS86] Ulrik Jørring and William L. Scherlis. Compilers and staging transformations. In Conference Record of the Thirteenth Annual ACM Symposium on Principles of Programming Languages, St. Petersburg Beach, Florida, USA, January 1986, pages 86–96, 1986.

[JW07] Simon L. Peyton Jones and Philip Wadler. Comprehensive comprehensions. In Gabriele Keller, editor, Proceedings of the ACM SIGPLAN Workshop on Haskell, Haskell 2007, Freiburg, Germany, September 30, 2007, pages 61–72, 2007.

[Kre15] Aljoscha Krettek. Using meta-programming to analyze and rewrite domain-specific program code. Master’s thesis, TU Berlin, 2015.

[Lam93] Joachim Lambek. Least fixpoints of endofunctors of cartesian closed categories. Mathematical Structures in Computer Science, 3(2):229–257, 1993.

[LB17] Fengyun Liu and Eugene Burmako. Two approaches to portable macros. Technical report, EPFL, 2017.

[LHL17] Johannes Luong, Dirk Habich, and Wolfgang Lehner. AL: unified analytics in domain specific terms. In Tiark Rompf and Alexander Alexandrov, editors, Proceedings of The 16th International Symposium on Database Programming Languages, DBPL 2017, Munich, Germany, September 1, 2017, pages 7:1–7:9, 2017.

[MA86] Ernest G. Manes and Michael A. Arbib. Algebraic Approaches to Program Semantics. Texts and Monographs in Computer Science. 1986.

[MAB`10] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In Ahmed K. Elmagarmid and Divyakant Agrawal, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2010, Indianapolis, Indiana, USA, June 6-10, 2010, pages 135–146, 2010.

[MBB06] Erik Meijer, Brian Beckman, and Gavin M. Bierman. LINQ: reconciling object, relations and XML in the .net framework. In Surajit Chaudhuri, Vagelis Hristidis, and Neoklis Polyzotis, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, page 706, 2006.

[MN06] Guido Moerkotte and Thomas Neumann. Analysis of two existing and one new dynamic programming algorithm for the generation of optimal bushy join trees without cross products. In Umeshwar Dayal, Kyu-Young Whang, David B. Lomet, Gustavo Alonso, Guy M. Lohman, Martin L. Kersten, Sang Kyun Cha, and Young-Kuk Kim, editors, Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006, pages 930–941, 2006.

[Mog91] Eugenio Moggi. Notions of computation and monads. Inf. Comput., 93(1):55– 92, 1991.

[ORS`08] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099–1110, 2008.

[Pie91] Benjamin C. Pierce. Basic category theory for computer scientists. Foundations of computing. 1991.

[PLO17] Dmitry Petrashko, Ondrej Lhoták, and Martin Odersky. Miniphases: compilation using modular and efficient tree transformations. In Albert Cohen and Martin T. Vechev, editors, Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, pages 201–216, 2017.

[PSK17] Lionel Parreaux, Amir Shaikhha, and Christoph E. Koch. Quoted staged rewriting: a practical approach to library-defined optimizations. In Matthew Flatt and Sebastian Erdweg, editors, Proceedings of the 16th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences, GPCE 2017, Vancouver, BC, Canada, October 23-24, 2017, pages 131–145, 2017.

[PVSK18] Lionel Parreaux, Antoine Voizard, Amir Shaikhha, and Christoph E. Koch. Unifying analytic and statically-typed quasiquotes. PACMPL, 2(POPL):13:1– 13:33, 2018.

[RAM`12] Tiark Rompf, Nada Amin, Adriaan Moors, Philipp Haller, and Martin Odersky. Scala-virtualized: linguistic reuse for deep embeddings. Higher-Order and Symbolic Computation, 25(1):165–207, 2012.

[RO10] Tiark Rompf and Martin Odersky. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled dsls. In Eelco Visser and Jaakko Järvi, editors, Generative Programming And Component Engineering, Proceedings of the Ninth International Conference on Generative Programming and Component Engineering, GPCE 2010, Eindhoven, The Netherlands, October 10-13, 2010, pages 127–136, 2010.

[RO12] Tiark Rompf and Martin Odersky. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled dsls. Commun. ACM, 55(6):121–130, 2012.

[SAC`79] Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. Access path selection in a relational database management system. In Philip A. Bernstein, editor, Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, May 30 - June 1., pages 23–34, 1979.

[SBL`14] Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Tiark Rompf, Hassan Chafi, Martin Odersky, and Kunle Olukotun. Delite: A compiler architecture for performance-oriented embedded domain-specific languages. ACM Trans. Embedded Comput. Syst., 13(4s):134:1–134:25, 2014.

[Ski93a] D. B. Skillicorn. Structuring data parallelism using categorical data types. In Proc. Workshop Programming Models for Massively Parallel Computers, pages 110–115, September 1993.

[Ski93b] David B Skillicorn. The Bird-Meertens formalism as a parallel model. In Software for Parallel Computation, pages 120–133. 1993.

[SW95] Dan Suciu and Limsoon Wong. On two forms of structural recursion. In Georg Gottlob and Moshe Y. Vardi, editors, Database Theory - ICDT’95, 5th International Conference, Prague, Czech Republic, January 11-13, 1995, Proceedings, volume 893 of Lecture Notes in Computer Science, pages 111– 124, 1995.


[TBN91] Val Tannen, Peter Buneman, and Shamim A. Naqvi. Structural recursion as a query language. In Paris C. Kanellakis and Joachim W. Schmidt, editors, Database Programming Languages: Bulk Types and Persistent Data. 3rd International Workshop, August 27-30, 1991, Nafplion, Greece, Proceedings, pages 9–19, 1991.

[Tri91] Philip W. Trinder. Comprehensions, a query notation for dbpls. In Paris C. Kanellakis and Joachim W. Schmidt, editors, Database Programming Languages: Bulk Types and Persistent Data. 3rd International Workshop, August 27-30, 1991, Nafplion, Greece, Proceedings, pages 55–68, 1991.

[TS91] Val Tannen and Ramesh Subrahmanyam. Logical and computational aspects of programming with sets/bags/lists. In Javier Leach Albert, Burkhard Monien, and Mario Rodríguez-Artalejo, editors, Automata, Languages and Programming, 18th International Colloquium, ICALP91, Madrid, Spain, July 8-12, 1991, Proceedings, volume 510 of Lecture Notes in Computer Science, pages 60–75, 1991.

[TS00] Walid Taha and Tim Sheard. MetaML and multi-stage programming with explicit annotations. Theor. Comput. Sci., 248(1-2):211–242, 2000.

[TSJ`09] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626– 1629, 2009.

[TW89] Phil Trinder and Philip Wadler. Improving list comprehension database queries. In TENCON’89. Fourth IEEE Region 10 International Conference, pages 186–192. IEEE, 1989.

[UG15] Alexander Ulrich and Torsten Grust. The flatter, the better: Query compilation based on the flattening transformation. In Timos K. Sellis, Susan B. Davidson, and Zachary G. Ives, editors, Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, pages 1421–1426, 2015.

[vdBKV03] Mark van den Brand, Paul Klint, and Jurgen J. Vinju. Term rewriting with traversal functions. ACM Trans. Softw. Eng. Methodol., 12(2):152–190, 2003.

[Wad89] Philip Wadler. Theorems for free! In Joseph E. Stoy, editor, Proceedings of the fourth international conference on Functional programming languages and computer architecture, FPCA 1989, London, UK, September 11-13, 1989, pages 347–359, 1989.

[Wad90] Philip Wadler. Notes on monads and ringads. Internal document, Computing Science Dept. Glasgow University, September 1990.


[Wad92] Philip Wadler. Comprehending monads. Mathematical Structures in Computer Science, 1992.

[Wad95] Philip Wadler. How to declare an imperative. In John W. Lloyd, editor, Logic Programming, Proceedings of the 1995 International Symposium, Portland, Oregon, USA, December 4-7, 1995, pages 18–32, 1995.

[ZBW`12] Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Åke Larson, Ronnie Chaiken, and Darren Shakib. SCOPE: parallel databases meet mapreduce. VLDB J., 21(5):611–636, 2012.

[ZCF`10] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In HotCloud, 2010.

List of Acronyms

ADT Algebraic Data Type.
ANF Administrative Normal Form.
API Application Programming Interface.
AST Abstract Syntax Tree.

CCI cache-call insertion.
CL Comprehension Language.
CSE Common Subexpression Elimination.

DBMS Database Management System.
DIQL Data Intensive Query Language.
DSH Database-Supported Haskell.
DSL Domain Specific Language.

eDSL Embedded Domain Specific Language.

FGF fold-group fusion.

GPL General-purpose Programming Language.

IDE Integrated Development Environment.
IMDb Internet Movie Database.
IR Intermediate Representation.

JIT just-in-time.
JVM Java Virtual Machine.

LINQ Language-Integrated Query.
LMS Lightweight Modular Staging.

ML Machine Learning.

NIS native iterations specialization.

OQL Object Query Language.

RAS relational algebra specialization.
RDBMS Relational Database Management System.
RDF Radial Distribution Function.

sDSL Stand-alone Domain Specific Language.
SQL Structured Query Language.
SSA Static Single Assignment.

TQL Traversal Query Language.

UDA User-Defined Aggregate.
UDF User-Defined Function.
UDT User-Defined Type.
