Programming Abstractions, Compilation, and Execution Techniques for Massively Parallel Data Analysis
Programming Abstractions, Compilation, and Execution Techniques for Massively Parallel Data Analysis

submitted by Dipl.-Inf. Stephan Ewen from Mainz to Faculty IV - Electrical Engineering and Computer Science of Technische Universität Berlin in fulfillment of the requirements for the academic degree Doktor der Ingenieurwissenschaften (Dr.-Ing.), accepted dissertation.

Doctoral committee:
Chair: Prof. Dr. Peter Pepper
Reviewers: Prof. Dr. Volker Markl, Prof. Dr. Odej Kao, Prof. Mike Carey, Ph.D.
Date of the scientific defense: October 22, 2014
Berlin 2015
D 83

Acknowledgments

During the process of creating this thesis, I have been accompanied by a great many people. Many of them offered valuable discussions and helped to shape the outcome of this thesis in one way or another.

First and foremost, I want to thank my adviser Volker Markl. He has not only provided the great work environment and the collaborations under which this thesis was created, but also helped with countless comments and valuable insights. Throughout my entire academic life, Volker has mentored me actively and given me the opportunity to work on amazing projects. I would like to thank my advisers Odej Kao and Mike Carey for their helpful comments and the discussions we had at various points, such as project meetings, conferences, and academic visits.

I worked with many great people on a day-to-day basis, among whom I want to give special thanks to Alexander Alexandrov, Fabian Hüske, Sebastian Schelter, Kostas Tzoumas, and Daniel Warneke. Thank you for the great work atmosphere, your ideas, your sense of humor, and for making the frequent long working hours not feel that bad.

In addition, I had the pleasure of supervising several Master students who did amazing work in the context of our research project. I want to highlight the contributions made by Joe Harjung and Moritz Kaufmann; the collaboration with them yielded insights that evolved into aspects of this thesis. I also want to thank the current generation of Master students, especially Ufuk Celebi, Aljoscha Krettek, and Robert Metzger. They are doing great work in stabilizing the code base and infrastructure of the project, improving usability, and adding new ideas to the research system. With their help, many of the results of this thesis will live on under the umbrella of an open source project at the Apache Software Foundation.

Last but not least, I would like to thank my girlfriend Anja for having my back while I wrote this thesis. Thank you for your kindness and understanding during this time. This thesis would not have been possible without the wondrous drink called Club Mate.

Abstract

We are witnessing an explosion in the amount of available data. Today, businesses and scientific institutions have the opportunity to analyze empirical data at unprecedented scale. For many companies, the analysis of their accumulated data has become a key strategic activity. Today's analysis programs consist not only of traditional relational-style queries; they increasingly use complex data mining and machine learning algorithms to discover hidden patterns or build predictive models. With growing data volumes and increasingly complex questions to answer, there is a need for new systems that scale both with the size of the data and with the complexity of the queries.
Relational Database Management Systems have been the workhorses of large-scale data analytics for decades. Their key enabling feature was arguably the declarative query language, which brought physical data independence and automatic optimization of queries. However, their fixed data model and closed set of possible operations have rendered them unsuitable for many advanced analytical tasks. This observation made way for a new breed of systems with generic abstractions for data-parallel programming, the most famous of which is arguably MapReduce. While bringing large-scale analytics to new applications, these systems still lack the ability to express complex data mining and machine learning algorithms efficiently, or they specialize in very narrow domains and give up applicability to a wide range of other problems. Compared to relational databases, MapReduce and the other parallel programming systems sacrifice the declarative query abstraction and require programmers to implement low-level imperative programs and to optimize them manually.

This thesis discusses techniques that realize several of the key aspects that enabled the success of relational databases in the new context of data-parallel programming systems. The techniques are instrumental in building a system for generic and expressive, yet concise, fluent, and declarative analytical programs. Specifically, we present three new methods. First, we provide a programming model that is generic and can deal with complex data models, but retains many declarative aspects of the relational algebra. Programs written against this abstraction can be optimized automatically with techniques similar to those used for relational queries. Second, we present an abstraction for iterative data-parallel algorithms. It supports incremental (delta-based) computations and transparently handles state. We give techniques that make the optimizer iteration-aware and deal with aspects such as loop-invariant data. The optimizer can produce execution plans that correspond to well-known hand-optimized versions of such programs. That way, the abstraction subsumes dedicated systems (such as Pregel) and offers competitive performance. Third, we present and discuss techniques to embed the programming abstraction into a functional language. The integration allows for the concise definition of programs and supports the creation of reusable components for libraries or domain-specific languages. We describe how to integrate the compilation and optimization of the data-parallel programs into a functional language compiler to maximize optimization potential.
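To give a flavor of the kind of program these abstractions are meant to enable, the following sketch shows a delta-iterative connected-components computation written against the Scala API of Apache Flink, the open source system that grew out of this work. The sketch is illustrative only: the operator names (iterateDelta, join, groupBy) and the toy input follow today's Flink Scala API and are assumptions for this example rather than code quoted from the thesis.

// Illustrative sketch (assumed Flink-style Scala API, not code from the thesis):
// connected components as a delta iteration over (vertexId, componentId) pairs.
import org.apache.flink.api.scala._

object ConnectedComponentsSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy input: every vertex starts in its own component; edges are made symmetric.
    val vertices: DataSet[(Long, Long)] =
      env.fromElements(1L, 2L, 3L, 4L).map(id => (id, id))
    val edges: DataSet[(Long, Long)] =
      env.fromElements((1L, 2L), (2L, 3L)).flatMap(e => Seq(e, (e._2, e._1)))

    // Delta iteration: the solution set is keyed on the vertex id (field 0);
    // the workset holds only vertices whose component id changed in the last round.
    val components = vertices.iterateDelta(vertices, 10, Array(0)) {
      (solution, workset) =>
        val candidates = workset
          .join(edges).where(0).equalTo(0) { (v, e) => (e._2, v._2) } // push component id to neighbors
          .groupBy(0).min(1)                                          // smallest candidate per vertex
        val updates = candidates
          .join(solution).where(0).equalTo(0)                         // keep only real improvements
          .filter(pair => pair._1._2 < pair._2._2)
          .map(pair => pair._1)
        (updates, updates) // delta applied to the solution set, and the next workset
    }

    components.print()
  }
}

In a plan like this, an iteration-aware optimizer can, for example, keep the loop-invariant edges data set partitioned and cached across iterations instead of reshuffling it in every step, which is the kind of optimization the second contribution above is concerned with.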
Zusammenfassung (German Abstract)

Due to falling prices for data storage, we are currently witnessing an explosive growth in the amount of available data. This development gives businesses and scientific institutions the opportunity to analyze empirical data at a previously unknown scale. For many companies, the analysis of the data collected from their operational business has long since become a central strategic activity. In contrast to the business intelligence practiced for many years, these analyses no longer consist only of traditional relational queries. To an increasing degree, they include complex algorithms from the fields of data mining and machine learning that discover hidden patterns in the data or train predictive models. With growing data volumes and increasing complexity of the analyses, however, a new generation of systems is needed that can cope with this combination of query complexity and data volume.

Relational databases were for a long time the workhorse of large-scale data analysis. This was largely due to their declarative query languages, which made it possible to separate the logical and physical aspects of data storage and processing and to optimize queries automatically. Their rigid data model and their limited set of available operations, however, severely restrict the applicability of relational databases to many of the newer analytical problems. This insight ushered in the development of a new generation of systems and architectures characterized by very generic abstractions for parallelizable analytical programs; MapReduce, without doubt the most prominent representative of these systems, may serve as an example here. While this new generation of systems simplified data analysis and opened it up to a variety of new application areas, it cannot express complex data mining and machine learning applications efficiently without specializing heavily in specific applications. Compared to relational databases, MapReduce and comparable systems have moreover given up the declarative abstraction and force users to write low-level programs and to optimize them manually.

This dissertation presents techniques that make it possible to realize many of the central properties of relational databases in the context of this new generation of data-parallel analysis systems. With these techniques it is possible to describe an analysis system whose programs are at the same time generic and expressive as well as concise and declarative. Specifically, we present the following techniques: First, a programming abstraction that is generic and can deal with complex data models, but at the same time retains many of the declarative properties of the relational algebra. Programs developed against this abstraction can be optimized in a similar way as relational queries. Second, we present an abstraction for iterative data-parallel algorithms. The abstraction supports incremental (delta-based) computations and handles stateful computations transparently. We describe how a relational query optimizer can be extended so that it optimizes iterative queries effectively. We show that the optimizer is thereby able to automatically produce execution plans that correspond to well-known hand-optimized versions of such programs.