Programming Abstractions, Compilation, and Execution Techniques for Massively Parallel Data Analysis


Submitted by Dipl.-Inf. Stephan Ewen, from Mainz, to Fakultät IV - Elektrotechnik und Informatik of the Technische Universität Berlin for the academic degree of Doktor der Ingenieurwissenschaften - Dr. Ing. - Approved dissertation

Doctoral committee:
Chair: Prof. Dr. Peter Pepper
Reviewers: Prof. Dr. Volker Markl, Prof. Dr. Odej Kao, Prof. Mike Carey, Ph.D.
Date of the scientific defense: 22 October 2014
Berlin 2015
D 83

Acknowledgments

During the process of creating this thesis, I have been accompanied by a great many people. Many of them offered valuable discussions and helped to shape the outcome of this thesis in one way or another. First and foremost, I want to thank my adviser Volker Markl. He has not only provided the great work environment and the collaborations under which this thesis was created, but also helped with countless comments and valuable insights. Throughout my entire academic life, Volker has mentored me actively and given me the opportunity to work on amazing projects. I would like to thank my advisers Odej Kao and Mike Carey for their helpful comments and the discussions that we had at various points, such as project meetings, conferences, or academic visits. I worked with many great people on a day-to-day basis, among whom I want to give special thanks to Alexander Alexandrov, Fabian Hüske, Sebastian Schelter, Kostas Tzoumas, and Daniel Warneke. Thank you guys for the great work atmosphere, your ideas, your sense of humor, and for making the frequent long working hours not feel that bad. In addition, I had the pleasure of supervising several Master students who did amazing work in the context of our research project.
At this point, I want to highlight the contributions made by Joe Harjung and Moritz Kaufmann. The collaboration with them yielded insights that evolved into aspects of this thesis. I also want to give thanks to the current generation of Master students, especially Ufuk Celebi, Aljoscha Krettek, and Robert Metzger. They are doing great work in stabilizing the code base and infrastructure of the project, improving usability, and adding new ideas to the research system. With their help, many of the results of this thesis will live on under the umbrella of an open source project under the Apache Software Foundation. Last, but not least, I would like to thank my girlfriend Anja for having my back while I wrote this thesis. Thank you for your kindness and understanding during this time. This thesis would not have been possible without the wondrous drink called Club Mate.

Abstract

We are witnessing an explosion in the amount of available data. Today, businesses and scientific institutions have the opportunity to analyze empirical data at unprecedented scale. For many companies, the analysis of their accumulated data is nowadays a key strategic aspect. Today's analysis programs consist not only of traditional relational-style queries, but increasingly use complex data mining and machine learning algorithms to discover hidden patterns or build predictive models. However, with the increasing data volume and increasingly complex questions that people aim to answer, there is a need for new systems that scale to the data size and to the complexity of the queries.

Relational Database Management Systems have been the workhorses of large-scale data analytics for decades. Their key enabling feature was arguably the declarative query language that brought physical schema independence and automatic optimization of queries.
However, their fixed data model and closed set of possible operations have rendered them unsuitable for many advanced analytical tasks. This observation made way for a new breed of systems with generic abstractions for data-parallel programming, among which the arguably most famous one is MapReduce. While bringing large-scale analytics to new applications, these systems still lack the ability to express complex data mining and machine learning algorithms efficiently, or they specialize in very specific domains and give up applicability to a wide range of other problems. Compared to relational databases, MapReduce and the other parallel programming systems sacrifice the declarative query abstraction and require programmers to implement low-level imperative programs and to manually optimize them.

This thesis discusses techniques that realize several of the key aspects enabling the success of relational databases in the new context of data-parallel programming systems. The techniques are instrumental in building a system for generic and expressive, yet concise, fluent, and declarative analytical programs. Specifically, we present three new methods:

First, we provide a programming model that is generic and can deal with complex data models, but retains many declarative aspects of the relational algebra. Programs written against this abstraction can be automatically optimized with techniques similar to those used for relational queries.

Second, we present an abstraction for iterative data-parallel algorithms. It supports incremental (delta-based) computations and transparently handles state. We give techniques to make the optimizer iteration-aware and deal with aspects such as loop-invariant data. The optimizer can produce execution plans that correspond to well-known hand-optimized versions of such programs. That way, the abstraction subsumes dedicated systems (such as Pregel) and offers competitive performance.
Third, we present and discuss techniques to embed the programming abstraction into a functional language. The integration allows for the concise definition of programs and supports the creation of reusable components for libraries or domain-specific languages. We describe how to integrate the compilation and optimization of the data-parallel programs into a functional language compiler to maximize optimization potential.

Zusammenfassung (German Abstract)

Due to falling prices for data storage, we are currently observing an explosive growth in the amount of available data. This development gives companies and scientific institutions the opportunity to analyze empirical data at an unprecedented scale. For many companies, the analysis of the data collected from their operational business has long since become a central strategic aspect. In contrast to business intelligence, which has been practiced for some time, these analyses no longer consist only of traditional relational queries. To an increasing degree, they include complex algorithms from the fields of data mining and machine learning in order to detect hidden patterns in the data or to train predictive models. With growing data volume and analysis complexity, however, a new generation of systems is needed that can cope with this combination of query complexity and data volume.

Relational databases were for a long time the workhorse of large-scale data analysis. This was largely due to their declarative query language, which made it possible to separate the logical and physical aspects of data storage and processing and to optimize queries automatically. However, their rigid data model and limited set of possible operations severely restrict the applicability of relational databases to many of the newer analytical problems.
This insight ushered in the development of a new generation of systems and architectures that are characterized by very generic abstractions for parallelizable analytical programs; MapReduce can be named here as an example, being without doubt the most prominent representative of these systems. Although this new generation of systems simplified data analysis and opened it up to various new application areas, it is not able to express complex applications from the fields of data mining and machine learning efficiently without specializing heavily in specific applications. Compared with relational databases, MapReduce and comparable systems have also given up the declarative abstraction and force the user to write low-level programs and to optimize them manually.

This dissertation presents various techniques that make it possible to realize several of the central properties of relational databases in the context of this new generation of data-parallel analysis systems. With the help of these techniques, it is possible to describe an analysis system whose programs are at once generic and expressive as well as concise and declarative. Specifically, we present the following techniques:

First, a programming abstraction that is generic and can deal with complex data models, yet retains many of the declarative properties of the relational algebra. Programs developed against this abstraction can be optimized in a similar way to relational queries.

Second, we present an abstraction for iterative data-parallel algorithms. The abstraction supports incremental (delta-based) computations and handles stateful computations transparently. We describe how a relational query optimizer can be extended so that it optimizes iterative queries effectively. We show that the optimizer is thereby enabled to automatically