MASSIVE-SCALE PROCESSING OF RECORD-ORIENTED AND GRAPH DATA
A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
Semih Salihoglu
September 2015

Abstract
Many data-driven applications perform computations on large volumes of data that do not fit on a single computer. These applications typically must use parallel shared-nothing distributed software systems to perform their computations. This thesis addresses challenges in large-scale distributed data processing with a particular focus on two primary areas: (i) theoretical foundations for understanding the costs of distribution; and (ii) processing large-scale graph data. The first part of this thesis presents a theoretical framework for the MapReduce system, to analyze the cost of distribution for different problem domains, and to evaluate the “goodness” of different algorithms. We identify a fundamental tradeoff between the parallelism and communication costs of algorithms. We first study the setting when computations are constrained to a single round of MapReduce. In this setting, we capture the cost of distributing a problem by deriving a lower-bound curve on the communication cost of any algorithm that solves the problem for different parallelism levels. We derive lower-bound curves for several problems, and prove that existing or new one-round algorithms solving these problems are optimal, i.e., incur the minimum possible communication cost for different parallelism levels. We then show that by allowing multiple rounds of MapReduce computations, we can solve problems more efficiently than any possible one-round algorithm. The second part of this thesis addresses challenges in systems for processing large-scale graph data, with the goal of making graph computation more efficient and easier to program and debug. We focus on systems that are modeled after Google’s Pregel framework for large-scale distributed graph processing. We begin by describing an open-source version of Pregel we developed, called GPS (for Graph Processing
System). We then describe new static and dynamic schemes for partitioning graphs across machines, and we present experimental results on the performance effects of different partitioning schemes. Next, we describe a set of algorithmic optimizations that address commonly-appearing inefficiencies in algorithms programmed on Pregel-like systems. Because it can be very difficult to debug programs in Pregel-like systems, we developed a new replay-style debugger called Graft. In addition, we defined and implemented a set of high-level parallelizable graph primitives, called HelP (for High-level Primitives), as an alternative to programming graph algorithms using the low-level vertex-centric functions of existing systems. HelP primitives capture several commonly appearing operations in large-scale graph computations. We motivate and describe Graft and HelP using real-world applications and algorithms.
Acknowledgments
I want to start by thanking Prof. Jennifer Widom, for being an excellent PhD advisor. I realize that it takes a lot of effort to take a student with no background in research and prepare him to teach and advise his own students in other institutions. My approach to research, my writing, speaking, and presentation skills all bear deep marks from her. I am very grateful for the unique role she has played in my education. Prof. Jeff Ullman, along with Prof. Foto Afrati, initiated and advised the theory part of this thesis. Some of the most memorable moments of my PhD were the times I would try to solve a problem together with Jeff in front of his whiteboard. I want to thank him for giving me the opportunity to work with him. It was a true privilege for me. Prof. Christopher Ré has advised me in our work on multiround join algorithms. Since he joined the Stanford faculty, he has opened my eyes to amazing problems in databases, from generalized hypertree decompositions to worst-case and beyond worst-case join algorithms. His research will deeply influence the research that I will do after I leave Stanford. I also want to thank him for spreading my work to other database researchers at every opportunity he found. I want to thank Prof. Hector Garcia-Molina for making life in InfoLab a lot of fun with his photography and movie nights. I also want to thank him for being on my orals and quals committees, and helping me with the conference presentations I gave throughout my PhD. I have collaborated very closely with Prof. Foto Afrati on the theory part of my thesis. Foto has been a great mentor and I hope to continue our collaborations in the years to come.
I want to thank six other professors: Prof. Jure Leskovec for the numerous comments and suggestions he has given me on my work on graph processing; Prof. Ram Rajagopal for chairing my defense; Prof. Gio Wiederhold for continuing the tradition and being with us every Friday in InfoLab lunches; and Prof. Ashish Goel, Tim Roughgarden, and Luca Trevisan for inspiring me to learn more theory in the first few years of my PhD. I thank Marianne Siroker and Jam Kiattinant for helping me with countless bureaucratic troubles I had throughout my six years at Stanford. Their presence at Gates brings cheer to the CS department. I also want to acknowledge Andrej Krevl for helping me solve numerous systems issues in my projects. My research was done in collaboration with many exceptional colleagues: Anish Das Sarma, Manas Joglekar, Jaeho Shin, Sungpack Hong, Robert Ikeda, Vikesh Khanna, Ba Qu Truang, Prof. Kunle Olukotun, Firas Abouzaid, and Anand Rajaraman. I thank all of them for all their contributions to this thesis. Most importantly, I want to thank my parents for encouraging me to go to the United States to study and enduring my absence for so long. They, more than anyone else, have shaped me as a person, and I hope I can follow in their footsteps in life. Finally, I want to thank Basak, my sister and brother and their families, and my friends for their love and support.
Contents
Abstract
Acknowledgments
1 Introduction
  1.1 MapReduce Overview
  1.2 Theoretical Foundations For Distributed Data-Processing Systems
    1.2.1 Outline of Thesis Chapters on Theoretical Foundations
  1.3 Pregel Overview
  1.4 Distributed Graph Processing Systems
    1.4.1 Efficiency
    1.4.2 Ease of Programming and Debugging
    1.4.3 Outline of Thesis Chapters on Distributed Graph Processing Systems
2 A Theoretical Framework For MapReduce
  2.1 Introduction
    2.1.1 Reducer Size As a Measure of Parallelism Level
    2.1.2 An Example of the Tradeoff: All-to-All Problem (A2A)
    2.1.3 How the Tradeoff Can Be Used
    2.1.4 Outline of the Chapter
  2.2 The Framework
    2.2.1 Problem Representation: Bipartite Input-Output Graphs
    2.2.2 Algorithm Representation: Mapping Schemas
    2.2.3 Cost Model
    2.2.4 A Proof Template for Deriving Tradeoff Curves
    2.2.5 Criteria for Good MR Algorithms
    2.2.6 Summary of Our Results
  2.3 Dense Matrix Multiplication
    2.3.1 The Tradeoff Curve
    2.3.2 Algorithms
  2.4 Hamming Distance 1
    2.4.1 The Tradeoff Curve
    2.4.2 Algorithms
  2.5 Multiway Equijoins
    2.5.1 Fractional Edge Coverings
    2.5.2 Vertex Packing and Max-Out Distribution
    2.5.3 The Tradeoff Curve
    2.5.4 Algorithms
  2.6 Related Work
    2.6.1 Models of MapReduce
    2.6.2 MapReduce Algorithms
    2.6.3 Other Related Work
3 Fuzzy Joins With Hamming and Edit Distance d > 1
  3.1 Problem Definition
  3.2 The Naive Algorithm
  3.3 Generalized Anchor Points Algorithm and Covering Codes
    3.3.1 Cost Analysis
  3.4 Constructing Explicit Codes By Cross Product
  3.5 Edit-Distance Covering Codes
    3.5.1 Elementary Lower Bounds
    3.5.2 Summary of Results
  3.6 Explicit Construction of Edit-Distance Covering Codes
    3.6.1 Insertion-1 Covering Codes
    3.6.2 O(1/a^2)-size Deletion-1 Covering Codes
    3.6.3 O(1/a^3)-size Deletion-2 Covering Code For Shorter Strings
  3.7 Existence of O(log(n)/n)-size Deletion-1 Covering Codes
    3.7.1 Step 1: Run Patterns
    3.7.2 Step 2: Converting Run Patterns into Binary Strings
    3.7.3 Step 3: Partitioning Strings Based on Safe Bit Counts
    3.7.4 Step 4: Existence of a Deletion-1 Code Covering (1-1/n) Fraction of HS
    3.7.5 Proof of Lemmas
    3.7.6 Steps 5 and 6: Existence of O(log(n)/n)-size Deletion-1 Codes
  3.8 Related Work
4 Multiround MapReduce
  4.1 Dense Matrix Multiplication
  4.2 Multiway Equijoins
    4.2.1 Summary of GYM and Main Results
    4.2.2 Outline for the Rest of the Chapter
  4.3 Generalized Hypertree Decompositions
  4.4 Distributed Yannakakis
    4.4.1 Serial Yannakakis Algorithm
    4.4.2 DYM-n
    4.4.3 DYM-d
  4.5 GYM
    4.5.1 Overview of GYM
    4.5.2 Analysis of GYM
    4.5.3 Example Execution of GYM
  4.6 Constructing O(log(n))-depth GHDs
    4.6.1 Extending D0
    4.6.2 Unique-Child-And-Grandchild Vertices
    4.6.3 Two Transformation Operations
    4.6.4 Log-GTA
  4.7 C-GTA
  4.8 Example of Evaluating a Query with Different GHDs
  4.9 Related Work
5 GPS: A Graph Processing System
  5.1 GPS System
    5.1.1 Overall Architecture
    5.1.2 Input Graph Partitioning Across Workers
    5.1.3 Master and Worker Implementation
    5.1.4 API
  5.2 Static Graph Partitioning
    5.2.1 Experimental Setup
    5.2.2 Performance Effects of Partitioning
    5.2.3 Network I/O
    5.2.4 Run-time
    5.2.5 Workload Balance
    5.2.6 Large Adjacency-List Partitioning
  5.3 Dynamic Repartitioning
    5.3.1 Picking Vertices to Reassign
    5.3.2 Moving Reassigned Vertices to New Workers
    5.3.3 Locating Reassigned Vertices
    5.3.4 Dynamic Repartitioning Experiments
  5.4 Other System Optimizations
  5.5 Related Work
6 Optimizing Graph Algorithms For Pregel-like Systems
  6.1 Experimental Setup
  6.2 Algorithms
    6.2.1 Strongly Connected Components (SCC)
    6.2.2 Minimum Spanning Forest (MSF)
    6.2.3 Graph Coloring (GC)
    6.2.4 Approximate Maximum Weight Matching (MWM)
    6.2.5 Weakly Connected Components (WCC)
  6.3 Finishing Computations Serially (FCS)
    6.3.1 High-level Description
    6.3.2 Implementation
    6.3.3 Cost Analysis
    6.3.4 FCS for Backward-Traversal Phase of SCC (FCS-BT)
    6.3.5 FCS for SCC (FCS-SCC)
    6.3.6 FCS for GC (FCS-GC)
  6.4 Storing Edges At Subvertices (SEAS) in MSF
    6.4.1 High-level Description
    6.4.2 Implementation
    6.4.3 Cost Analysis
    6.4.4 Experiments
  6.5 Edge Cleaning On Demand
    6.5.1 High-level Description
    6.5.2 Implementation for MWM
    6.5.3 Cost Analysis for MWM
    6.5.4 Implementation for MSF
    6.5.5 Cost Analysis for MSF
    6.5.6 Experiments
  6.6 Single Pivot Optimization (SP)
    6.6.1 High-level Description and Implementation
    6.6.2 Cost Analysis
    6.6.3 Experiments
  6.7 Related Work
7 Debugging Graph Algorithms For Pregel-like Systems
  7.1 The Graft Debugging Tool
    7.1.1 Capture: The DebugConfig File and Graft Instrumenter
    7.1.2 Visualize: Graft GUI
    7.1.3 Reproduce: Context Reproducer
    7.1.4 Other Features
  7.2 Debugging Scenarios
    7.2.1 Graph Coloring Scenario
    7.2.2 Random Walk Scenario
    7.2.3 Maximum-Weight Matching Scenario
  7.3 Performance
  7.4 Limitation: External Dependencies
  7.5 Related Work
    7.5.1 Remote Debuggers
    7.5.2 Replay Debuggers
    7.5.3 Debuggers of Dataflow Systems
8 Higher-level Primitives For Distributed Graph Algorithms
  8.1 Background
    8.1.1 Spark
    8.1.2 GraphX
  8.2 Vertex-Centric Updates
    8.2.1
    8.2.2
    8.2.3
  8.3 Topology Modifications
    8.3.1 Filtering
    8.3.2 Forming Supervertices
  8.4 Global Aggregations
    8.4.1 Examples of Use
    8.4.2 Implementation
  8.5 Supplementing Graph Primitives with Data Primitives
    8.5.1 Using Data Primitives Instead of Graph Primitives
    8.5.2 Using Data Primitives Together With Graph Primitives
  8.6 Related Work
    8.6.1 APIs of Existing Distributed Systems
    8.6.2 Higher-level Data Analysis Languages
    8.6.3 Domain-Specific Languages
    8.6.4 MPI-based Graph Libraries
9 Summary and Future Work
  9.1 Summary of Contributions on Theoretical Foundations
  9.2 Summary of Contributions on Distributed Graph Processing Systems
  9.3 Future Directions
    9.3.1 Theoretical Foundations
    9.3.2 Distributed Graph Processing Systems
Bibliography
List of Tables
2.1 Inputs and outputs of the A2A problem.
2.2 Summary of one-round MR results.
3.1 Hamming distance covering codes.
3.2 Summary of edit-distance covering codes.
4.1 Example queries and their edge covering and edge packing numbers.
5.1 Data sets.
6.1 Algorithms.
6.2 Optimization Techniques.
6.3 Graph datasets.
7.1 Graph datasets for demonstration.
7.2 Graph datasets for performance experiments.
7.3 DebugConfig configurations.
8.1 HelP Primitives.
8.2 Implemented Algorithms.
8.3 Inputs to aggregateNeighborValues.
8.4 Inputs to propagateAndAggregate.
8.5 Inputs to updateSelfUsingOneOtherVertex.
8.6 Inputs to formSupervertices.
List of Figures
1.1 MapReduce Overview.
1.2 WordCount in MapReduce.
1.3 Tradeoff curves.
1.4 Algorithms.
1.5 Pregel Overview.
1.6 PageRank in Pregel.
2.1 Bipartite input-output graphs for representing problems.
2.2 Tradeoff curve and algorithms for the A2A problem.
2.3 Input/output relationship for the matrix-multiplication problem.
2.4 Tradeoff curve and algorithms for matrix multiplication.
2.5 Input/output relationship for HD-1.
2.6 Tradeoff curve and algorithms for HD-1.
2.7 Fractional edge coverings and vertex packings of queries.
2.8 Example queries and their edge covering numbers.
3.1 Algorithms for Hamming distance d > 1.
3.2 Input/output relationship for ED-2.
3.3 Simulation of insertions of symbols 0 and (a − 1) into strings.
3.4 Pairs that sum to i modulo 6.
4.1 Performance of GYM on Chain-n with different GHDs of Chain-n.
4.2 The two-round Partial-sums algorithm.
4.3 Matrix multiplication algorithms and the one-round lower bound.
4.4 Fractional edge packing.
4.5 The hypergraph and the GHD for the join in Example 4.3.1.
4.6 Acyclic queries with different depth GHDs.
4.7 GYM.
4.8 Width-3 GHD of Chain-16.
4.9 After Computing IDBs of the GHD of Figure 4.8.
4.10 Width-1 GHD of Chain-16.
4.11 Unique-child-and-grandchild vertices.
4.12 Leaf Inactivation.
4.13 Unique-c-gc inactivation.
4.14 Log-GTA.
4.15 Log-GTA simulation.
4.16 Performance of GYM on Chain-n with different GHDs of Chain-n.
5.1 GPS Architecture.
5.2 Connected components in GPS.
5.3 A simple k-means like graph clustering algorithm.
5.4 Counting the number of crossing edges with vertex.compute().
5.5 Clustering algorithm using master.compute().
5.6 Network I/O, different partitioning schemes.
5.7 Run-time, different partitioning schemes.
5.8 Slowest worker, number of messages.
5.9 Fixing workload imbalance of METIS-default.
5.10 Performance of LALP.
5.11 Performance of PageRank with and without dynamic repartitioning.
6.1 SCCMaster’s code for switching phases.
6.2 Skeleton code for SCCVertex.compute().
6.3 Original Coloring algorithm for computing SCCs [100].
6.4 SCCVertex subroutines for the Forward-Traversal phase.
6.5 Parallel version of Boruvka’s MSF algorithm [33].
6.6 Example of a conjoined-tree.
6.7 FCS-BT and FCS-SCC, Medium-EC2(90, 90).
6.8 FCS on GC, Local(26, 104).
6.9 SEAS and ECOD on MSF, Large-EC2(76, 152).
6.10 Example graph for ECOD cost analysis.
6.11 Runtime of ECOD on MWM, Local(20, 80).
6.12 Network I/O of ECOD on MWM, Local(20, 80).
6.13 SP on WCC, Large-EC2(100, 100).
6.14 SP on SCC, Medium-EC2(90, 90).
7.1 Graft architecture.
7.2 A DebugConfig file.
7.3 Node-link View.
7.4 Tabular View.
7.5 Violations and Exceptions View.
7.6 JUnit test case generated by Graft Context Reproducer.
7.7 Buggy implementation of GC.
7.8 Buggy implementation of RW-n.
7.9 Graph with asymmetric edge weights.
7.10 Graft’s performance overhead.
8.1 Hierarchy of primitives.
8.2 Implementation of propagateAndAggregate (to out neighbors).
8.3 Example of a conjoined-tree.
8.4 Implementation of updateSelfUsingOneOtherVertex.
Chapter 1
Introduction
Performing complex analyses on large-scale data is becoming one of the core challenges of many application domains such as web search, social networks, and genetic analysis. As the volume of data grows, and the speed with which new data is generated increases, these applications need to perform their computations on large computer clusters using highly-parallel and shared-nothing distributed software systems. Starting with the advent of Google’s MapReduce system [36] and its open-source version Hadoop [9], the last decade has seen a proliferation of new distributed systems for processing large-scale data. Examples include distributed dataflow systems [67, 98], new distributed relational database systems [14, 121], graph processing systems [8, 87], machine learning systems [50, 90], and others. This thesis makes contributions in two primary areas:
• The first part of the thesis (Chapters 2–4) starts by providing a theoretical foundation for large-scale distributed systems, with the goal of being able to answer two questions: (i) How costly is it to parallelize different problems? (ii) How should we measure the “goodness” of existing algorithms for a given problem? We present a theoretical framework to answer these two questions rigorously within the context of the MapReduce system. Using our framework, we answer these two questions for several problems, including fuzzy and equi-joins of record-oriented data, and multiplying two dense matrices, and we present new algorithms that have provable guarantees.
• The second part of the thesis (Chapters 5–8) studies systems for processing large-scale graph data, such as social networks and the web graph. We describe an open-source platform for scalable graph processing, algorithms for distributing graphs across machines, and extensions to existing graph processing systems with the goal of making them more efficient and easier to program. The material in these chapters mainly focuses on systems that are modeled after Google’s proprietary Pregel system [87] for processing large graphs.
In the remainder of this introductory chapter, we first give an overview of MapReduce (Section 1.1) and provide a summary of our contributions on theoretical foundations for large-scale data-processing systems (Section 1.2). We then give an overview of Pregel (Section 1.3) and provide a summary of our contributions on distributed graph processing systems (Section 1.4). Related work on the topics of this thesis is discussed at the end of each chapter.
1.1 MapReduce Overview
MapReduce [36] was introduced by Google in 2004 to support several of Google’s data-intensive applications that process data in volumes too large for a single machine, such as indexing and sorting of web pages, and clustering of Google News articles and Google Shopping products. Soon after in 2005, the Hadoop system [9] was developed as an open-source version of MapReduce and saw wide adoption amongst enterprises and researchers. Figure 1.1 shows a high-level overview of MapReduce. The system runs on a cluster of machines that are grouped logically into mapper machines and reducer machines. For simplicity, we will refer to these machines as mappers and reducers, respectively. The input data, represented as a set of records r1, ..., rm in the figure, is initially partitioned across the mapper machines. The programmer expresses the desired computation by implementing two functions: a map() and a reduce() function. The system performs the computation in two stages:
• Map Stage: The map() function is applied to each input record in parallel and generates a set of intermediate key-value pairs. We refer to the keys generated by the map() function as reduce-keys. The system sends all pairs with the same reduce-key to the same reducer and groups their values. For simplicity, we assume that each reducer gets only one reduce-key.

Figure 1.1: MapReduce Overview.
• Reduce Stage: Each reducer r applies the reduce() function to the reduce-key and bucket of values that were sent to r, and produces some outputs.
In the most general setting, a MapReduce algorithm can consist of multiple rounds of map and reduce stages, each executing possibly different map() and reduce() functions, and using the output of the previous round as its input.
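The two-stage execution model above, and its multiround generalization, can be sketched with a small in-memory simulation. The names run_round and run_rounds and their signatures are this sketch’s own, not part of any real MapReduce API:

```python
# A minimal simulation of the MapReduce execution model described above:
# each round applies map() to every record, shuffles the emitted pairs by
# reduce-key, applies reduce() once per key, and (in the multiround case)
# feeds the outputs to the next round as its input.
from collections import defaultdict

def run_round(records, map_fn, reduce_fn):
    """Execute one map stage, the shuffle, and one reduce stage."""
    buckets = defaultdict(list)
    for record in records:                 # map stage, conceptually parallel
        for key, value in map_fn(record):
            buckets[key].append(value)     # shuffle: group values by reduce-key
    outputs = []
    for key, values in buckets.items():    # reduce stage: one key per reducer
        outputs.extend(reduce_fn(key, values))
    return outputs

def run_rounds(records, stages):
    """Chain rounds: each round consumes the previous round's output."""
    for map_fn, reduce_fn in stages:
        records = run_round(records, map_fn, reduce_fn)
    return records
```

Each (map_fn, reduce_fn) pair in run_rounds plays the role of one MapReduce job, with the previous round’s output standing in for the intermediate storage a real deployment would use.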
Example 1.1.1. We illustrate the classic WordCount problem in a one-round MapReduce algorithm. We are given as input a corpus of words. The goal is to find the number of times each word occurs in the corpus. Figure 1.2 shows the map() and reduce() functions that solve the WordCount problem. The map() function simply takes a word w and produces the key-value pair <w, 1>. The reduce() function receives each word w together with the bucket of values sent to it, and outputs the pair <w, count>, where count is the size of the bucket.

// Maps each word w to <w, 1>
// Outputs <w, count> for each word w
Figure 1.2: WordCount in MapReduce.
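Assuming a simple in-memory shuffle, the WordCount map() and reduce() functions of Example 1.1.1 can be run end-to-end as follows; the driver code here is a sketch, not Hadoop’s actual API:

```python
# WordCount in one simulated MapReduce round: map each word w to (w, 1),
# shuffle the pairs by word, then reduce each bucket to (w, count).
from collections import defaultdict

def word_count_map(word):
    return [(word, 1)]           # maps each word w to the pair (w, 1)

def word_count_reduce(word, ones):
    return [(word, len(ones))]   # outputs (w, count) for each word w

corpus = ["to", "be", "or", "not", "to", "be"]
buckets = defaultdict(list)      # shuffle: group values by reduce-key
for word in corpus:
    for key, value in word_count_map(word):
        buckets[key].append(value)
counts = dict(pair for key, values in buckets.items()
              for pair in word_count_reduce(key, values))
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```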
Three properties of MapReduce have made it very attractive to programmers as a system for processing large-scale data:
• Transparent Parallelism: The system automatically parallelizes the work to mapper and reducer machines in the cluster, and guarantees that the intermediate key-value pairs are routed correctly to reducer machines. The programmer does not write any code to parallelize the computation.
• Transparent Fault Tolerance: The system automatically recovers if a mapper or reducer machine fails during the computation. MapReduce’s fault recovery is based on simple checkpointing of intermediate outputs to disk and rescheduling the tasks of the failed machines to healthy machines.
• Simple Programming API: The map() and reduce() functions focus the user’s attention on the computational problem, rather than how the computation will be executed on the actual machines of the cluster. The MapReduce API has also proved to be capable of expressing numerous data-processing tasks across many application domains, including web search [36], graph algorithms [36], natural language processing [83], biology [89], and many others.
These three fundamental properties of MapReduce have been a guideline for many systems that were built subsequently. Most of these systems [8, 14, 27, 45, 50, 134, 135] have adopted MapReduce’s transparent fault-tolerance and parallelism. Many of them [8, 27, 50, 87, 135] also have MapReduce-like APIs that are tailored for specific application domains. For example, as we will describe in Section 1.3, similar to MapReduce’s map() and reduce() functions, Pregel’s API is based on implementing a vertex.compute() function that is tailored for graph computations.
1.2 Theoretical Foundations For Distributed Data-Processing Systems
A routine decision that programmers of distributed data-processing systems need to make is to pick a parallelism level for their computations. We will make the notion of “parallelism level” concrete in Chapter 2. For the purpose of this introductory chapter, parallelism level intuitively corresponds to the number of machines that are used in the computation. For example, a MapReduce algorithm could map each input record of a problem P to the same reduce-key, thus effectively solving P serially inside a single machine. In contrast, another algorithm could use thousands of reduce-keys, solving problem P using thousands of machines in parallel. One natural question that arises when picking an appropriate parallelism level for solving a problem is: What is the cost of parallelizing a computation across more machines? More specifically, the two central questions motivating the first part of this thesis are:
1. How costly is it to parallelize different problems?
2. How should we measure the “goodness” of different parallel algorithms solving a particular problem?
In general, the cost of parallelizing problems is increased communication between the machines in the cluster. In MapReduce, the communication corresponds to the amount of data that is sent from the mappers to the reducers. Thus, there is an inherent tradeoff between the parallelism and communication costs of algorithms when solving problems in MapReduce. Driven by this observation, we present a theoretical framework to answer our two questions in the limited setting when computations are constrained to a single round of MapReduce.

Figure 1.3: Tradeoff curves.
1. We represent the cost of parallelizing a problem P by deriving a tradeoff curve between parallelism and communication for solving P. Specifically, this tradeoff curve is a lower-bound curve that shows the minimum amount of communication that is needed to solve P by any algorithm for different parallelism levels. Figure 1.3 shows an abstract example of what these tradeoff curves might look like for two hypothetical problems 1 and 2, shown by the red and orange curves, respectively. The X-axis is the parallelism level, decreasing to the right, and the Y-axis is the communication cost. For both problems, as we increase the parallelism of algorithms, i.e., move left on the X-axis, the communication cost increases. In this hypothetical example, problem 2 is more costly to parallelize than problem 1 because problem 2’s curve is higher than problem 1’s. In other words, for each parallelism level, algorithms need to incur more communication cost to solve problem 2 than problem 1.
Figure 1.4: Algorithms.
2. Similarly, we measure how “good” an algorithm is for solving a problem by calculating its parallelism level and communication cost. We depict the performance of algorithms as points on the same parallelism level versus communication cost graphs. For example, in Figure 1.4, a hypothetical algorithm A for problem 1 is depicted as a single rectangular point. If an algorithm is parameterized and can run at multiple different parallelism levels, we depict its performance as a scatter plot. For example, algorithm B, which can run at multiple parallelism levels, is depicted as a scatter plot of round points in the figure. Ideally we want to design algorithms whose communication costs match the lower-bound tradeoff curve exactly, such as algorithm C, shown as the triangular point in Figure 1.4. We call algorithm C “Pareto optimal” because no other algorithm can incur less communication cost than C for the parallelism level at which C executes. In this framework, the ultimate goal of algorithm designers is to have a parameterized algorithm that is Pareto optimal at every parallelism level.
Using our framework, we derive tradeoff curves and analyze the performance of algorithms that execute in a single round of MapReduce for several problems, including dense matrix multiplication, fuzzy-joins of a set of strings, and equi-joins of multiple
relational tables. For some of the problems we either show that existing algorithms are Pareto optimal, or we describe new algorithms that are Pareto optimal or better than existing ones. We also show that by running multiple rounds of map and reduce stages, we can solve problems more efficiently than in a single round.
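As a concrete illustration of such a lower-bound curve, consider the all-to-all (A2A) problem of Section 2.1.2, in which every pair of n inputs must meet at some reducer. A standard counting argument, that a reducer receiving q inputs can cover at most q(q−1)/2 of the n(n−1)/2 output pairs, gives a communication lower bound of n(n−1)/(q−1). The exact form used here is this sketch’s reconstruction; Chapter 2 derives the bound formally:

```python
# Numeric sketch of a lower-bound tradeoff curve of the kind plotted in
# Figure 1.3, for the A2A problem: total communication must be at least
# n*(n-1)/(q-1), where q is the reducer size (larger q = less parallelism).
def a2a_lower_bound(n, q):
    return n * (n - 1) / (q - 1)

n = 1000
# Moving right on the X-axis of Figure 1.3 (less parallelism), the
# required communication drops, exactly the tradeoff described above.
curve = {q: a2a_lower_bound(n, q) for q in (10, 100, 1000)}
print(curve)
```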
1.2.1 Outline of Thesis Chapters on Theoretical Foundations
The thesis chapters covering our work on theoretical foundations of distributed data-processing systems are organized as follows:
• Chapter 2 describes our theoretical framework. We present formalisms to represent computational problems and one-round MapReduce algorithms, and a proof template for deriving tradeoff curves between communication and parallelism for different problems. We then use our framework to analyze three different problems, and algorithms for these problems: (i) dense matrix multiplication; (ii) fuzzy-joins of bit strings with Hamming distance 1, i.e., finding pairs of bit strings from a corpus that are at Hamming distance 1; and (iii) equi-joins of multiple relational tables.
• Chapter 3 presents a one-round MapReduce algorithm called the Generalized Anchor Points (GAP) algorithm for computing fuzzy-joins with Hamming and edit distance greater than 1. Although we cannot derive a tradeoff curve for solving fuzzy-joins with Hamming and edit distance greater than 1, we show that the GAP algorithm is provably better than existing algorithms. The GAP algorithm relies on finding a set of strings called a covering code for a corpus of strings. The main technical contribution of this chapter is the construction of the first covering codes for the edit-distance problem.
• Chapter 4 studies multiround MapReduce algorithms. We show that by running a sequence of MapReduce jobs, we can solve problems more efficiently than any possible one-round algorithm. We give examples from the problems of dense matrix multiplication and equi-joins of multiple relational tables. The main technical contribution of this chapter is an algorithm called GYM, for
Generalized Yannakakis in MapReduce, for the problem of equi-joins of multiple relational tables.
1.3 Pregel Overview
The second part of this thesis studies distributed systems for processing large-scale graphs: graphs that contain up to billions of vertices and edges, such as the web graph and social networks. Soon after the advent of MapReduce and Hadoop, several systems and libraries were built on top of Hadoop [27, 74, 135] specifically for processing large-scale graphs. However, the MapReduce setting was not found suitable for processing graphs, for two reasons:
1. Many graph algorithms, such as PageRank [25], iteratively compute a value for each vertex of the graph. In MapReduce, these algorithms need to be executed in multiple rounds by running a separate MapReduce job for each iteration of the algorithm. The inputs of each MapReduce job are the entire graph data and the output of the previous job; the output is the latest values of vertices. MapReduce implementations store data in a distributed fashion, such as in network file systems or distributed databases. Reading the inputs from and writing the output to distributed storage in between each MapReduce job incurs very high I/O costs.
2. Many graph algorithms are naturally expressed as vertices communicating with each other. Expressing these algorithms in MapReduce is difficult and yields clumsy map() and reduce() functions to simulate vertex-to-vertex communications.
Google’s Pregel system addresses these two shortcomings while maintaining the three fundamental properties of MapReduce (Section 1.1). Similar to MapReduce, Pregel is transparently parallel, fault-tolerant, and has an API based on implementing simple user-specified functions. Unlike MapReduce, Pregel is inherently iterative, and it has a graph-specific API that is based on passing messages between vertices.
Figure 1.5: Pregel Overview.
Figure 1.5 shows the high-level overview of the system. The computational framework of Pregel is based on Valiant’s Bulk Synchronous Parallel (BSP) model [123]. The input is a directed graph. Initially the vertices of the graph are distributed across the machines in the cluster. Computation consists of iterations called supersteps. In each superstep, a user-specified vertex.compute() function is applied to each vertex in parallel. Inside vertex.compute(), each vertex reads its incoming messages, updates its value, sends messages to other vertices to be used in the next superstep, and sets a flag indicating whether this vertex is ready to stop computation. At the end of each superstep, all machines synchronize before starting the next superstep. The computation terminates when all vertices are ready to stop the computation.
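The superstep loop described above can be condensed into a short sketch. The following is a single-process simulation, not Pregel's actual distributed implementation; the ComputeFn interface and all other names are illustrative:

```java
import java.util.*;

// Minimal single-process sketch of Pregel's superstep loop. This is an
// illustration, not Pregel's distributed implementation: all names
// (ComputeFn, run, inboxes) are ours. Vertices are ids 0..n-1, each holding
// a double value; messages are doubles keyed by destination vertex.
public class BspSketch {
    public interface ComputeFn {
        // Reads the vertex's inbox, may update values[vertex], and returns
        // the messages to deliver in the next superstep.
        Map<Integer, Double> compute(int vertex, double[] values,
                                     List<Double> inbox, int superstep);
    }

    // Runs a fixed number of synchronous supersteps and returns final values.
    public static double[] run(int n, double[] init, int supersteps, ComputeFn fn) {
        double[] values = init.clone();
        List<List<Double>> inboxes = emptyInboxes(n);
        for (int s = 0; s < supersteps; s++) {
            List<List<Double>> next = emptyInboxes(n);
            for (int v = 0; v < n; v++) {    // conceptually applied in parallel
                Map<Integer, Double> out = fn.compute(v, values, inboxes.get(v), s);
                for (Map.Entry<Integer, Double> m : out.entrySet())
                    next.get(m.getKey()).add(m.getValue()); // delivered next superstep
            }
            inboxes = next;                  // synchronization barrier between supersteps
        }
        return values;
    }

    private static List<List<Double>> emptyInboxes(int n) {
        List<List<Double>> l = new ArrayList<>();
        for (int i = 0; i < n; i++) l.add(new ArrayList<>());
        return l;
    }

    public static void main(String[] args) {
        // Toy computation on a 3-cycle: each vertex forwards its value to the
        // next vertex and, from superstep 1 on, adopts the sum of its inbox.
        double[] result = run(3, new double[]{1, 2, 3}, 2, (v, values, inbox, s) -> {
            double sum = 0;
            for (double m : inbox) sum += m;
            if (s > 0) values[v] = sum;
            Map<Integer, Double> msgs = new HashMap<>();
            msgs.put((v + 1) % 3, values[v]);
            return msgs;
        });
        System.out.println(Arrays.toString(result));
    }
}
```

The sketch omits vote-to-halt termination for brevity: it simply runs a fixed number of supersteps, whereas Pregel terminates when all vertices have voted to halt.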
Example 1.3.1. Consider computing the PageRank of vertices of a web graph. In PageRank, each vertex is assigned an initial PageRank value. In each iteration, the PageRank value of each vertex is updated by aggregating the PageRank values of its incoming neighbors. Figure 1.6 shows the vertex.compute() code implementing the PageRank algorithm that executes for 50 iterations. In the first superstep, each vertex sends its initial PageRank value to its outgoing neighbors as a message. In later
supersteps, each vertex first aggregates the messages that it has received to update its own PageRank value, and then sends its new PageRank value to its outgoing neighbors as a message. Computation terminates in superstep 50, when all vertices indicate that they are ready to terminate computation by calling the voteToHalt() function.

Figure 1.6: PageRank in Pregel.
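The compute() code of Figure 1.6 is truncated in this copy of the text. The following is a hedged reconstruction of the vertex-centric PageRank logic, simulated in a single process: the class and method names are ours, and the standard 0.85 damping factor is an assumption that the simplified figure may omit.

```java
import java.util.*;

// Hedged reconstruction of the PageRank logic of Figure 1.6, simulated in a
// single process. Names are illustrative; this is not the Pregel API itself.
public class PageRankSketch {
    // Standard damping factor (an assumption; the simplified figure may omit it).
    static final double DAMPING = 0.85;

    // adj[v] lists v's outgoing neighbors; runs the given number of supersteps.
    public static double[] run(int[][] adj, int supersteps) {
        int n = adj.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                  // initial PageRank value
        double[] incoming = new double[n];
        for (int s = 0; s < supersteps; s++) {
            // In supersteps s > 0, each vertex aggregates its received
            // messages and updates its own PageRank value.
            if (s > 0)
                for (int v = 0; v < n; v++)
                    rank[v] = (1 - DAMPING) / n + DAMPING * incoming[v];
            // Each vertex then sends its value, split over its outgoing
            // neighbors, as messages for the next superstep.
            Arrays.fill(incoming, 0);
            for (int v = 0; v < n; v++)
                for (int w : adj[v])
                    incoming[w] += rank[v] / adj[v].length;
        }
        // After the last superstep, every vertex would call voteToHalt().
        return rank;
    }

    public static void main(String[] args) {
        int[][] cycle = {{1}, {2}, {0}};             // a 3-cycle: ranks stay uniform
        System.out.println(Arrays.toString(run(cycle, 50)));
    }
}
```

On the 3-cycle every vertex has exactly one incoming and one outgoing edge, so the ranks remain 1/3 throughout, which makes the fixed point easy to check by hand.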
Pregel is more efficient than MapReduce for iterative graph computations because the graph and vertex values remain in memory throughout the computation, avoiding large I/O costs. It is also easier to express graph algorithms inside Pregel’s graph- specific vertex.compute() function than the map() and reduce() functions of MapReduce.
1.4 Distributed Graph Processing Systems
Our first contribution is the description of an open-source version of Pregel we built from scratch, called GPS, for Graph Processing System. During our initial development of GPS and experience with implementing algorithms on the system, we observed two main problems with the Pregel framework: (1) Algorithms running on Pregel can be inefficient, incurring high communication costs or executing for a very large number of supersteps. (2) The vertex.compute() functions implementing some algorithms can be very difficult to program and debug. In response to these shortcomings, we make contributions to both efficiency and ease of programming and debugging. Our contributions have sometimes led to new features that we integrated into our GPS
system, or new systems projects that we built on other distributed graph systems, including another open-source version of Pregel called Apache Giraph [8]. For simplicity of terminology, we refer to systems that are modeled after Pregel, such as GPS and Apache Giraph, as Pregel-like systems.
1.4.1 Efficiency
Our work to address inefficiencies in graph algorithms on Pregel-like systems is in two categories: (1) sophisticated graph partitioning schemes; and (2) specific algorithmic optimizations.
Sophisticated Graph Partitioning Schemes
By default Pregel, GPS, and Apache Giraph distribute the input graph across the compute nodes randomly, or using a round-robin scheme. For many graph algorithms the messages between vertices are sent along the edges of the graph. As a result, under random partitioning, a large fraction of the messages are sent across the network, incurring high network I/O costs. Our first contribution under the theme of efficiency is a study of the graph partitioning question: Can algorithms perform better if we partition the graph using a more sophisticated scheme before the computation starts? Will they perform better if we dynamically repartition the graph during the computation? We have developed two new partitioning schemes, and we have experimented with domain-based partitioning of web pages and the popular METIS graph partitioning algorithm [91]. We demonstrate that schemes that are able to maintain a good workload balance across the machines can improve the performance of algorithms significantly. Our first new partitioning scheme, called LALP, for large adjacency-list partitioning, partitions the edges of high-degree vertices across all compute nodes. Our second new partitioning scheme starts with an initial partitioning of the graph and dynamically repartitions the vertices of the graph to decrease the number of messages sent across the network. Both schemes have been fully implemented and integrated into our GPS system.
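The effect of partitioning on network I/O can be made concrete with a toy sketch (not GPS code): if messages flow along edges, only edges whose endpoints sit on different machines cost network traffic, so we can count those edges for a given vertex-to-machine assignment.

```java
/**
 * Toy illustration (not GPS code) of why partitioning matters: counts the
 * edges whose endpoints are hosted on different machines, i.e., the messages
 * per superstep that must cross the network if each vertex messages each of
 * its neighbors once.
 */
public class PartitionCost {
    // edges[i] = {u, v}; part[v] = machine hosting vertex v.
    public static int networkMessages(int[][] edges, int[] part) {
        int cross = 0;
        for (int[] e : edges)
            if (part[e[0]] != part[e[1]]) cross++;  // message crosses the network
        return cross;
    }

    public static void main(String[] args) {
        // A 4-cycle 0-1-2-3-0 placed on 2 machines.
        int[][] edges = {{0, 1}, {1, 2}, {2, 3}, {3, 0}};
        int[] roundRobin = {0, 1, 0, 1};  // round-robin: every edge crosses machines
        int[] clustered  = {0, 0, 1, 1};  // neighbors co-located: only 2 edges cross
        System.out.println(networkMessages(edges, roundRobin));  // 4
        System.out.println(networkMessages(edges, clustered));   // 2
    }
}
```

With k machines and random placement, an edge crosses machines with probability (k − 1)/k, which is why nearly all messages go over the network under the default schemes.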
Algorithmic Optimizations
Graph algorithms running on Pregel-like systems sometimes incur unnecessary inefficiencies such as slow convergence or high communication or computation costs, typically due to structural properties of the input graphs, such as large diameters or skew in component sizes. For example, in the standard Pregel implementation of breadth-first traversal of a graph, a source vertex sends a message to its neighbors in the first superstep, which is then propagated by the receiving vertices to their neighbors, until all of the vertices receive a message. This computation can take a large number of superstep executions on graphs with large diameters. Can we avoid some of these superstep executions? Our second contribution under the theme of efficiency is a set of algorithmic optimization techniques to address such commonly appearing inefficiencies in algorithms.
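The dependence of breadth-first traversal on the graph's diameter can be seen in a single-process sketch of the superstep-by-superstep frontier expansion (names and structure are ours, not Pregel's):

```java
import java.util.*;

/**
 * Single-process sketch of Pregel-style breadth-first traversal: in each
 * superstep the frontier vertices message their unvisited neighbors, so the
 * number of supersteps grows with the distance from the source. Illustrates
 * why large-diameter graphs need many supersteps in this model.
 */
public class BfsSupersteps {
    // Returns the number of supersteps executed until no vertex receives a
    // message (the final superstep delivers messages that reach no new vertex).
    public static int supersteps(int[][] adj, int source) {
        int n = adj.length;
        boolean[] visited = new boolean[n];
        visited[source] = true;
        List<Integer> frontier = new ArrayList<>(List.of(source));
        int steps = 0;
        while (!frontier.isEmpty()) {        // one loop iteration = one superstep
            List<Integer> next = new ArrayList<>();
            for (int v : frontier)
                for (int w : adj[v])
                    if (!visited[w]) { visited[w] = true; next.add(w); }
            frontier = next;
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        // A directed path 0 -> 1 -> 2 -> 3 (eccentricity 3) takes 4 supersteps;
        // a star with center 0 takes only 2, regardless of its size.
        System.out.println(supersteps(new int[][]{{1}, {2}, {3}, {}}, 0));
        System.out.println(supersteps(new int[][]{{1, 2, 3}, {}, {}, {}}, 0));
    }
}
```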
1.4.2 Ease of Programming and Debugging
Our initial implementation of a suite of algorithms on GPS revealed that it can be very difficult to program and debug some algorithms that are written inside vertex.compute(). This thesis makes three contributions to tackle the challenge of programming and debugging algorithms on Pregel-like systems: (1) an extension to Pregel’s original API called master.compute(); (2) a new debugger called Graft; and (3) a set of high-level parallelizable graph primitives, called HelP, as an alternative to programming algorithms within compute() functions.
Master.compute()
Implementing a graph computation inside vertex.compute() is ideal for algorithms that can be performed in a fully “vertex-centric” and parallel fashion. Examples include the PageRank algorithm from Figure 1.6, finding shortest paths in a graph, or finding connected components. However, many algorithms consist of both vertex-centric (parallel) and global (sequential) computations. Such algorithms cannot be easily and efficiently implemented within vertex.compute(). Our first contribution is an extension to the original Pregel API called master.compute()
that enables efficient implementation of algorithms composed of one or more vertex-centric computations, combined with sequential computations. The ability to easily express sequential computations inside master.compute() has also been critical in our most applicable algorithmic optimization, called Finishing Computations Serially. Master.compute() has also been adopted by Apache Giraph.
Graft Debugging Tool
Debugging programs written for Pregel-like systems is very challenging. A series of interviews we conducted with several Apache Giraph and GPS programmers revealed that programmers typically go through three manual steps in the debugging process: (1) adding print statements to compute() functions; (2) examining the printed statements; and (3) inspecting the compute() functions to infer the lines of code that must have executed to generate the printed statements. In distributed systems, reasoning about the printed information produced by possibly hundreds of machines can be very tedious and time-consuming. To simplify programmers’ debugging cycles, we developed a new replay-style debugger called Graft, implemented for the Apache Giraph system. Graft’s features are motivated by programmers’ three manual steps, which we refer to as capture, visualize, and reproduce. In Graft:
• Capture: Programmers programmatically describe the vertices they are interested in gathering information about. Typically these are expected to be a relatively small number of “suspicious” vertices.
• Visualize: Programmers visualize the behavior of the captured vertices in a graph-specific and superstep-based GUI. The GUI allows the programmers to replay how the vertex values and messages are changing across supersteps.
• Reproduce: Programmers do line-by-line debugging, using any local debug- ger, to see the lines of the vertex.compute() function that executed for a captured vertex in a specific superstep.
Graft similarly enables debugging of the master.compute() function. We built Graft for Apache Giraph because it was the most widely used Pregel-like system at
the time of our development. However, with only minor modifications Graft could be used for other Pregel-like systems, such as GPS.
HelP Primitives
The APIs of Pregel-like systems have drawn philosophically from MapReduce. As we discussed earlier, the benefit of these APIs is that programmers concentrate on implementing a few specific functions, and the framework automatically parallelizes the computation across machines. However, this approach has two shortcomings in the Pregel setting. First, the compute() function can be too low-level or restricted for certain algorithmic operations, such as grouping vertices together to form “supervertices”—an operation that appears in several algorithms as we will discuss in Chapter 8. Second, custom code is required for commonly appearing operations across algorithms, such as initializing vertex values or checking for convergence of computations. Similar shortcomings of the MapReduce API were addressed by widely-used languages such as Pig Latin [98] and Hive [121], which express data analysis tasks using higher-level constructs and primitives, such as filters and grouping. Programs written in these languages then compile to the lower-level map() and reduce() functions of the Hadoop open-source MapReduce system. In this thesis, we take a similar approach for graph processing and introduce a set of six high-level graph primitives, called HelP (for High-Level Primitives), which capture the commonly appearing operations in distributed graph computations. We will describe the primitives we identified, our implementation of these primitives on the GraphX graph processing system [130], and the implementation of a large suite of graph algorithms using these primitives. Our experience has been that implementing algorithms using our primitives is more intuitive and faster than using the APIs of existing distributed systems.
1.4.3 Outline of Thesis Chapters on Distributed Graph Pro- cessing Systems
The thesis chapters covering our work on distributed graph processing systems are organized as follows:
• Chapter 5 describes our GPS system. We describe the implementation of GPS, GPS’s master.compute() API extension, LALP partitioning, and dynamic repartitioning. We also present extensive experiments on the effects of different ways of partitioning the graph on the performance of algorithms.
• Chapter 6 describes a set of algorithmic optimizations to address some of the commonly appearing inefficiencies in algorithms that execute on Pregel-like sys- tems.
• Chapter 7 describes the Graft debugging tool for the Apache Giraph system.
• Chapter 8 describes our HelP primitives and their implementation on the GraphX system.

Chapter 2
A Theoretical Framework For MapReduce
2.1 Introduction
Two decisions that users of highly-parallel shared-nothing systems routinely make when processing their data are which algorithm and which parallelism level to pick for their computations. We will make the notion of “parallelism level” concrete momentarily, but intuitively it corresponds to the number of machines that are used in the computation. There are two benefits to processing data at increasing parallelism levels. First, each machine usually gets a smaller fraction of the data. So, once the data is partitioned appropriately, each machine does less work and completes its computation in less time. Second, applications can use cheaper machines with less RAM and disk space, as each machine is processing less data. On the other hand, the costs of parallelization are less obvious. In this chapter, we present a theoretical framework for answering the following two questions in a mathematically rigorous way:
1. How costly is it to parallelize different problems in a MapReduce (MR) setting?
2. What are the appropriate metrics to compare the “goodness” of different MR algorithms that solve a particular problem?
The main theme of this chapter is that there is an inherent tradeoff between parallelism and communication cost when solving problems in distributed systems. Specifically, the cost of parallelization in MR is an increased communication cost between mappers and reducers. In light of this observation, we will measure how costly it is to parallelize a problem P in MR by calculating the minimum amount of communication that any MR algorithm has to incur to solve P at different parallelism levels. Similarly, we will measure how good an algorithm that solves P is by measuring the algorithm’s communication cost and level of parallelism. We note that the focus of this chapter and Chapter 3 is one-round MR computations. In Chapter 4, we will consider multiround computations. In the remainder of this section we first make the notion of “parallelism level” concrete. We then demonstrate the tradeoff between communication cost and parallelism on a running example called the all-to-all problem from reference [2]. We will use this problem to motivate our framework in Section 2.2. We then discuss how we can use this tradeoff in practice to pick an appropriate parallelism level for solving problems.
2.1.1 Reducer Size As a Measure of Parallelism Level
One way to characterize the parallelism level of a MR algorithm is to look at the number of reducers that it uses. Recall from Chapter 1 that a reducer, in the sense we use the term in this thesis, is a reducer machine that gets a single reduce-key, i.e., one of the keys that can appear in the output of the mappers, together with its list of associated values. However, the number of reducers can be a deceptive measure of an algorithm’s parallelism level. An algorithm can generate a very large number of reducers, yet intentionally or unintentionally solve a problem effectively in a serial fashion in only one of the reducers. For example, suppose we want to join two 1 TB-size relational tables R(A, B) and S(B, C) by hashing on their B attribute. That is, we use, say, 1000 reducers and randomly hash each R and S tuple on its B value to one of the reducers. At first, this hash-join algorithm might seem like a highly parallel algorithm. However, if the input data is skewed and contains a single
B value, then a single reducer would receive the entire 2TB of input data and solve the problem in a purely serial fashion. A better way to characterize the parallelism level of an algorithm is to look at its reducer size—the maximum amount of data mapped to a single reducer. This is a more robust measure of parallelism, because by getting less data, each reducer solves a smaller part of the problem and therefore algorithms need more reducers to solve a problem: decreasing the reducer size forces algorithms to increase their parallelism. The hash-join algorithm above for skewed inputs would have a reducer size of 2TB, and therefore is essentially a serial algorithm. In Section 2.5, we will present an algorithm called the Grouping algorithm that can perform the same join with a reducer size of 2GB on any input, but that uses one million reducers. It is therefore a much more parallel algorithm than the hash-join algorithm. We also demonstrate a sample execution of the Grouping algorithm in Section 2.2.2 when we describe our framework. With this justification in mind, we will use maximum reducer size as the measure of parallelism of algorithms, and study a fundamental tradeoff that exists between reducer size (or parallelism) and communication cost when solving many problems in MR.
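The hash-join scenario above is easy to simulate: the sketch below (illustrative, with tuples reduced to their B values and sizes measured in tuples) computes the maximum reducer size under hash-by-B, showing that a skewed input makes one reducer receive everything.

```java
import java.util.*;

/**
 * Illustration of why reducer count is a deceptive measure of parallelism:
 * a hash-join on attribute B sends every tuple with the same B value to one
 * reducer, so on a skewed input the maximum reducer size equals the whole
 * input. Tuples are reduced to their B values; sizes are counted in tuples.
 */
public class ReducerSize {
    // Maximum number of tuples any of the `reducers` reducers receives when
    // R and S tuples are hashed to reducers by their B value.
    public static int maxReducerSize(int[] bValuesR, int[] bValuesS, int reducers) {
        int[] load = new int[reducers];
        for (int b : bValuesR) load[Math.floorMod(Integer.hashCode(b), reducers)]++;
        for (int b : bValuesS) load[Math.floorMod(Integer.hashCode(b), reducers)]++;
        return Arrays.stream(load).max().getAsInt();
    }

    public static void main(String[] args) {
        // Skewed input: every tuple in R and S has the same B value (0), so
        // despite using 1000 reducers, one reducer gets all 2000 tuples.
        int[] skewedR = new int[1000], skewedS = new int[1000];
        System.out.println(maxReducerSize(skewedR, skewedS, 1000));  // 2000
    }
}
```

On a uniform input (every B value distinct), the same schema gives each reducer only a couple of tuples, which is exactly why reducer size, not reducer count, is the robust measure.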
2.1.2 An Example of the Tradeoff: All-to-All Problem (A2A)
Suppose we are given a set of n inputs, e1, . . . , en, and we want to run a user-defined function (UDF) on all pairs of inputs. This problem has been called the all-to-all (A2A) problem in reference [2]. Table 2.1 shows the inputs and outputs for the A2A problem. As an example, consider the drug interactions problem from reference [122]. In this problem, each input is a record consisting of information about the medical history of patients who had taken a particular drug. The records for each pair of drugs need to be compared using a sophisticated UDF in order to determine whether a particular side effect is more common among patients using both drugs than those using only one or neither. More generally, the A2A problem appears in similarity or fuzzy join problems, in which the goal is to find pairs of inputs that have
    (a) Input        (b) Output
    Inputs           Input1   Input2   UDF(Input1, Input2)
    e1               e1       e2       UDF(e1, e2)
    e2               e1       e3       UDF(e1, e3)
    ...              ...      ...      ...
    en               en−1     en       UDF(en−1, en)
Table 2.1: Inputs and outputs of the A2A problem.

a distance score less than a threshold. When the distance function is a black-box UDF, then similarity join problems also reduce to the A2A problem. Here are two one-round MR algorithms solving this problem at two extreme parallelism levels. In our analysis of these algorithms, we assume that each input is of the same size, and we take the number of inputs as our unit of communication and memory. The first algorithm, PS, for purely serial, solves this problem in a purely serial fashion by using a single reducer and sending the entire input to that reducer. The communication cost of PS is equal to n, and the sole reducer of the algorithm gets all of the n inputs. The second algorithm, MP, for max parallelism, uses n(n−1)/2 reducers r1,2, r1,3, . . . , rn−1,n, one for each pair of inputs. MP maps each input ei to all n − 1 reducers that contain i in their index. The communication cost of MP is n(n − 1), but each reducer now gets only 2 inputs. Notice that there is a tradeoff between communication cost (C) and reducer size (M) as we switch from PS to MP for solving the A2A problem: C increases from n to n(n − 1), and M decreases from n to 2. This tradeoff is actually an inherent property of the A2A problem and not due to a poor design of the PS or MP algorithms. On the contrary, one can show that both algorithms incur the minimum communication cost that is necessary to solve the A2A problem for their respective reducer sizes. Specifically, we will show in Section 2.2 that any algorithm that uses reducers of size M needs to incur a communication cost of at least f(M) = n(n−1)/(M−1). Substituting the reducer sizes of the PS and MP algorithms, we can see that both algorithms incur the optimal communication cost for the parallelism levels at which they execute. The goal of the framework we present in this chapter is to derive such tradeoff
functions between C and M for different problems and to analyze the performance of algorithms in light of these tradeoff curves.
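The PS and MP endpoints of the A2A tradeoff can be checked mechanically against the lower bound C ≥ n(n − 1)/(M − 1) stated above. The following sketch (our own, with costs counted in number of inputs) does exactly that:

```java
/**
 * Communication cost (C) and reducer size (M) of the PS and MP algorithms
 * for the A2A problem, checked against the lower bound C >= n(n-1)/(M-1).
 * Costs are measured in "number of inputs", as in the text.
 */
public class A2ACosts {
    // Minimum communication any algorithm with reducer size m must incur.
    public static long lowerBound(long n, long m) { return n * (n - 1) / (m - 1); }

    // Each algorithm is summarized as {M, C}.
    public static long[] ps(long n) { return new long[]{n, n}; }            // purely serial
    public static long[] mp(long n) { return new long[]{2, n * (n - 1)}; }  // max parallelism

    public static void main(String[] args) {
        long n = 1000;
        // PS: M = n, so the bound is n(n-1)/(n-1) = n, which PS meets exactly.
        System.out.println(ps(n)[1] == lowerBound(n, ps(n)[0]));
        // MP: M = 2, so the bound is n(n-1)/1 = n(n-1), which MP meets exactly.
        System.out.println(mp(n)[1] == lowerBound(n, mp(n)[0]));
    }
}
```

Both checks succeed for any n, which is the numerical face of the claim that PS and MP are optimal at their respective parallelism levels.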
2.1.3 How the Tradeoff Can Be Used
Suppose we have determined that the tradeoff for a problem P between C and M is C ≥ f(M) for some function f. This tradeoff function is essentially a lower bound on the communication cost of any MR algorithm solving P at different parallelism levels M. Readers can look ahead to Figure 2.2 for an example of what such a function f might look like. In particular, the best algorithms solving P will have a communication cost of f(M) at parallelism level M, where f(M) is typically a decreasing function of M: larger reducers require less communication. Suppose such optimal algorithms exist for each parallelism level M. When we try to solve an instance of P on a particular machine cluster, we must determine the true costs of execution. For example, if we are running on EC2 [7], we pay particular amounts for communication and for rental of virtual processors. The price we pay for communication, pc(C), is a function of C that depends on the rate EC2 charges for communication. The price we pay for renting processors, pm(M), is some function of M. Notice that we need to rent f(M)/M processors at each parallelism level, because the communication C = f(M) is the total amount of data sent to reducers, and each reducer holds M of this data. Then, in order to minimize our budget for solving P, we would try to pick an algorithm at a parallelism level M that minimizes pc(C) + pm(M)·f(M)/M, which is equal to the following expression by substituting f(M) for C:
    pc(f(M)) + pm(M) · f(M)/M    (2.1)

In a more detailed analysis we can also add another term z(M) to Expression 2.1 for the amount of computation done by each reducer, to estimate how long we will rent each processor. In general, different problems will have different tradeoff functions f(M), and they will also have different functions of M that measure the computation cost. Therefore, depending on the price of communication pc, the price of processors pm, the tradeoff function f(M), and the computation cost z(M), we might optimize our
budgets differently by picking a different parallelism level to solve P . Sometimes there may not be algorithms designed for each parallelism level. Then we would simply pick the algorithm that minimizes Expression 2.1 among the available parallelism levels M.
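Picking a parallelism level by minimizing Expression 2.1 amounts to a one-dimensional scan over M. The sketch below does this for the A2A tradeoff f(M) = n(n − 1)/(M − 1); the prices are entirely hypothetical, and the cubic per-machine price (machines with more memory cost superlinearly more) is chosen only so that the scan has an interior optimum rather than a boundary one.

```java
/**
 * Sketch of using a tradeoff curve to pick a parallelism level: scan reducer
 * sizes M and minimize p_c(f(M)) + p_m(M) * f(M)/M (Expression 2.1), using
 * the A2A tradeoff f(M) = n(n-1)/(M-1). All prices here are hypothetical.
 */
public class BudgetPicker {
    static final double COMM_PRICE = 1.0;      // hypothetical price per unit of communication
    static final double MACHINE_COEFF = 1e-4;  // hypothetical p_m(M) = MACHINE_COEFF * M^3

    // Returns the M in [2, n] minimizing Expression 2.1 for the A2A problem.
    public static long bestM(long n) {
        long best = 2;
        double bestCost = Double.MAX_VALUE;
        for (long m = 2; m <= n; m++) {
            double f = (double) n * (n - 1) / (m - 1);  // lower-bound communication f(M)
            double machines = f / m;                    // we rent f(M)/M reducers
            double cost = COMM_PRICE * f + MACHINE_COEFF * m * m * m * machines;
            if (cost < bestCost) { bestCost = cost; best = m; }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(bestM(1000));
    }
}
```

With these constants the total cost is proportional to (1 + 10⁻⁴M²)/(M − 1), whose minimum sits near M = 1 + √(1 + 10⁴) ≈ 101; a different pricing model would of course move the optimum.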
2.1.4 Outline of the Chapter
In the rest of this chapter we first describe our framework (Section 2.2). Then, we demonstrate how to use our framework to analyze three different problems and algorithms for these problems: (i) dense matrix multiplication (Section 2.3); (ii) fuzzy joins with Hamming distance-1 (Section 2.4); and (iii) multiway equijoins (Section 2.5). In Chapter 3 we study the fuzzy joins problem with Hamming and edit distance d > 1 separately, because our algorithm there is significantly more complex than the algorithms presented in this chapter and, we believe, is interesting in its own right.
2.2 The Framework
Our framework is simple, yet powerful enough to develop interesting and realistic insights into the range of possible MR algorithms for a problem. It consists of: (1) formal representations of a computational problem (Section 2.2.1) and an MR algorithm (Section 2.2.2); (2) a cost model for algorithms (Section 2.2.3); and (3) a proof template for deriving tradeoff curves between communication cost and reducer sizes for solving a problem in one round of MR (Section 2.2.4). In Section 2.2.5 we also discuss the criteria for good MR algorithms in light of the tradeoff curve of a problem, and introduce the notion of Pareto optimality of algorithms. We end this section with a summary of our results (Section 2.2.6), elaborated in subsequent sections.
2.2.1 Problem Representation: Bipartite Input-Output Graphs
A problem in our framework consists of:
1. A set of inputs and outputs.
Figure 2.1: Bipartite input-output graphs for representing problems. (a) A2A problem. (b) R1(A1, A2) ⋈ R2(A2, A3) ⋈ R3(A3, A4).
2. A mapping from outputs to sets of inputs. The intent is that each output depends on the set of inputs it is mapped to.

A problem is essentially represented by a bipartite graph of inputs and outputs. For some problems, the inputs and outputs represent the actual inputs and outputs. In this case, each input i will have some associated data that is required to generate the outputs that depend on i. For other problems, inputs and outputs represent the possible inputs and outputs. In this case, each instance of the problem will contain a subset of the inputs. The corresponding output for that instance will be the subset of possible outputs that have all of their dependencies present in the input. We give two examples to demonstrate each case.
Example 2.2.1. Figure 2.1a shows the representation of the A2A problem. Each ei is an actual input. Each output is a triple ⟨ei, ej, UDF(ei, ej)⟩, which depends on the two inputs ei and ej.
Example 2.2.2. Figure 2.1b shows the representation of the natural join of three relations R1(A1, A2), R2(A2, A3), and R3(A3, A4), which we call the Chain-3 join problem. The inputs are possible tuples in R1, R2, and R3, and the outputs are tuples with schema (A1, A2, A3, A4). For simplicity, we assume that each attribute Ai is from a finite domain with size 100. Each output tuple (a1,i, a2,j, a3,k, a4,l) depends on the (a1,i, a2,j) tuple from R1, the (a2,j, a3,k) tuple from R2, and the (a3,k, a4,l) tuple from R3.
When inputs represent possible as opposed to actual inputs, we will sometimes be interested in solving a distribution of instances. That is, rather than solving the problem for any instance, we will identify a subset of the instances. For example, for the Chain-3 join, we might be interested in solving the problem over the distribution of instances that contain, say, X tuples from each relation.
2.2.2 Algorithm Representation: Mapping Schemas
An MR algorithm for a problem in our framework is represented as a mapping schema. A mapping schema is an assignment of the inputs to a set of reducers, such that for every output, there is at least one reducer that is assigned all of the inputs for that output. We say such a reducer covers the output. This reducer need not be unique, and it is, of course, permitted that these same inputs are assigned also to other reducers. We note that there is no restriction on the number of reducers a mapping schema can use.
Example 2.2.3. The PS algorithm for the A2A problem from Section 2.1.2 is a mapping schema that uses a single reducer and maps each input to that reducer.
Trivially, the inputs of every output are assigned to this single reducer, so the reducer covers every output.
Example 2.2.4. Consider the following algorithm, called the Grouping algorithm, for solving the Chain-3 problem over all instances that contain X tuples. We will generalize the Grouping algorithm in Section 2.5.4. The algorithm divides each relation into g groups of X/g tuples each. For simplicity, we assume that with each tuple there is an index between 1 and X that allows us to perform this grouping. Otherwise, we can randomly assign each tuple to one of the g groups. The algorithm uses g³ reducers r1,1,1, . . . , rg,g,g. The tuples in group i of relation Rj are assigned to all reducers that contain i in their jth index, for i ∈ {1, 2, . . . , g} and j ∈ {1, 2, 3}. Now consider any output (a, b, c, d) of the given instance and suppose that (a, b) belongs to group i of R1, (b, c) belongs to group j of R2, and (c, d) belongs to group k of R3. Then (a, b, c, d) is covered by reducer ri,j,k.
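The Grouping algorithm's mapping schema can be written down directly; the sketch below (our own, with 0-based groups and tuples identified by their indices) checks that the reducer determined by the three groups of an output's constituent tuples indeed receives all three tuples.

```java
/**
 * Sketch of the Grouping algorithm's mapping schema for the Chain-3 join
 * (Example 2.2.4): reducer r_{i,j,k} receives group i of R1, group j of R2,
 * and group k of R3, so every output built from tuples in groups (i, j, k)
 * is covered. Groups are 0-based; tuples are identified by their index.
 */
public class GroupingSchema {
    // Group of the tuple with index t when X tuples are split into g groups.
    public static int group(int t, int X, int g) { return t / (X / g); }

    // Does reducer (i, j, k) receive group gr of relation rel (1, 2, or 3)?
    public static boolean assigned(int i, int j, int k, int rel, int gr) {
        return (rel == 1 ? i : rel == 2 ? j : k) == gr;
    }

    public static void main(String[] args) {
        int X = 100, g = 5;  // g^3 = 125 reducers; each holds 3 * (100/5) = 60 tuples
        // The output joining R1-tuple 42, R2-tuple 7, and R3-tuple 99 is
        // covered by reducer (group(42), group(7), group(99)) = (2, 0, 4).
        int i = group(42, X, g), j = group(7, X, g), k = group(99, X, g);
        System.out.println(assigned(i, j, k, 1, i)
                && assigned(i, j, k, 2, j)
                && assigned(i, j, k, 3, k));  // true: the output is covered
    }
}
```

The comment in main also previews the costs derived in Example 2.2.6: each reducer holds 3X/g tuples, and the g³ reducers together receive 3Xg² tuples.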
2.2.3 Cost Model
Our framework focuses on two costs of mapping schemas:

1. Reducer size (M): Suppose a mapping schema A uses p reducers and assigns Mi inputs to reducer ri. If the inputs are possible and not actual, then we suppose that A assigns at most Mi inputs to ri on any instance of the problem. The reducer size cost of A is defined to be M = max_{i∈[1,p]} Mi. We note that if A solves a distribution of problem instances, Mi is taken as the worst case, or maximum, number of inputs assigned to ri in any of the instances in the distribution. As we discussed in Section 2.1.1, the reducer size characterizes the parallelism level of the algorithm, where algorithms with smaller reducer sizes are more parallel. The reducer size also captures the amount of memory that is required in the machines of a cluster to execute the algorithm. In practice, if the machines of a cluster have less memory than the reducer size cost of an algorithm, then one of the machines would run out of main memory while executing the algorithm on some problem instance.

2. Communication (C): The communication cost of a mapping schema is the total amount of data that it maps to its p reducers in the worst case, or C = Σ_{i=1}^{p} Mi. The communication cost captures the amount of data that is shuffled between mappers and reducers during the computation. For simplicity, we take each input to be of the same length, and therefore our unit of data is not bits but the number of inputs.
Example 2.2.5. For the PS algorithm both the reducer size and communication cost are equal to n. In contrast, the MP algorithm has a reducer size cost of 2 and a communication cost of n(n − 1), since each input is mapped to n − 1 reducers.
Example 2.2.6. In the Grouping algorithm from Example 2.2.4, no matter what the actual input is, each reducer ri,j,k gets X/g tuples from each relation. Therefore the reducer size cost of the Grouping algorithm is 3X/g. The communication cost, which is equal to the reducer size times the number of reducers, is g³ · (3X/g) = 3Xg². For example, if g is 1, then both the reducer size and the communication cost are 3X. If g is X, then the reducer size is 3 and the communication cost is 3X³.
2.2.4 A Proof Template for Deriving Tradeoff Curves
While upper bounds on the communication cost for all problems are derived using constructive algorithms, there is a generic technique for deriving tradeoff curves, i.e., lower bounds on the communication cost for different values of M. We next outline the argument that we use to derive all of the lower bounds in this chapter. 1. Deriving g(M): First, find an upper bound, g(M), on the number of outputs a reducer can cover if M is the number of inputs it is given. 2. Number of Inputs and Outputs: Count the total numbers of inputs |I| and outputs |O|. If we are solving a distribution of problem instances, we take this to be the maximum number of inputs and outputs. However, to simplify our proofs in this thesis, we will consider distributions where each instance has the same number of inputs and outputs.
3. The Inequality: Assume there are p reducers, each receiving Mi ≤ M inputs
and covering at most g(Mi) outputs. Together they must cover all of the outputs. That is:
p X g(Mi) ≥ |O| (2.2) i=1 4. Replication Rate: Manipulate the inequality from Equation 2.2 to get a lower Pp bound on the communication cost, which is i=1 Mi. Note that the last step above may require clever manipulation to factor out the communication cost. We have noticed that the following “trick” is effective in Step (4) for all problems we have considered. First, arrange to isolate a single factor Mi from g(Mi); that is: p p X X g(Mi) g(M ) ≥ |O| ⇒ M ≥ |O| (2.3) i i M i=1 i=1 i CHAPTER 2. A THEORETICAL FRAMEWORK FOR MAPREDUCE 27
g(Mi) Assuming is monotonically increasing in Mi, we can use the fact that ∀Mi : Mi Mi ≤ M to obtain from Equation 2.3:
   Σ_{i=1}^p Mi · (g(M)/M) ≥ |O|   (2.4)
Now, we move g(M)/M to the right side of Equation 2.4 to get a formula with the communication cost on the left:
   C = Σ_{i=1}^p Mi ≥ M|O| / g(M)   (2.5)
Equation 2.5 gives us a lower bound on C.
Example 2.2.7. We apply the above 4-step proof template to derive a lower bound on the communication cost of algorithms solving the A2A problem.

• Deriving g(M): With M inputs, a reducer can cover at most g(M) = (M choose 2) = M(M − 1)/2 pairs of outputs.

• Number of Inputs and Outputs: There are n inputs, i.e., |I| = n, and (n choose 2) outputs, i.e., |O| = n(n − 1)/2.

• The Inequality Σ_{i=1}^p g(Mi) ≥ |O|: Substituting for g(Mi) and |O| from above:
   Σ_{i=1}^p Mi(Mi − 1)/2 ≥ n(n − 1)/2
• Lower Bound On C: Finally we employ the manipulation trick to derive a lower bound on C.
   Σ_{i=1}^p Mi(Mi − 1)/2 ≥ n(n − 1)/2  ⇒  Σ_{i=1}^p Mi(M − 1)/2 ≥ n(n − 1)/2
   C = Σ_{i=1}^p Mi ≥ n(n − 1)/(M − 1)

The orange line in Figure 2.2 shows this tradeoff curve graphically.
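As a quick sanity check, the curve can be evaluated at the two extreme algorithms discussed for the A2A problem. The sketch below (the function name is ours, for illustration only) confirms that the purely serial and maximally parallel algorithms sit exactly on the bound:

```python
def a2a_lower_bound(n, M):
    """Tradeoff curve for the A2A problem: C >= n(n-1)/(M-1)."""
    return n * (n - 1) / (M - 1)

n = 10
# Purely serial (PS): one reducer holds all n inputs, so M = C = n,
# and the bound evaluates to n(n-1)/(n-1) = n.
assert a2a_lower_bound(n, M=n) == n
# Maximally parallel (MP): one reducer per output pair, so M = 2 and
# C = 2 * (n choose 2) = n(n-1), matching the bound at M = 2.
assert a2a_lower_bound(n, M=2) == n * (n - 1)
```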
Figure 2.2: Tradeoff curve and algorithms for the A2A problem.
2.2.5 Criteria for Good MR Algorithms
There are two desired properties of algorithms in light of the lower bounds on the communication costs of algorithms for different reducer sizes:

• Pareto Optimality: We want algorithms that incur the minimum possible communication cost for different parallelism levels. Borrowing from the economics literature, we call an algorithm Pareto optimal if it incurs the minimum communication cost for its reducer size. In economics, an allocation of resources is Pareto optimal if it is not possible to make any individual better off without making someone else worse off. Similarly, we call an algorithm Pareto optimal if it is not possible to reduce its communication cost without increasing its reducer size, or vice-versa. For example, both the PS and MP algorithms, shown as two black dots in Figure 2.2, are Pareto optimal for the A2A problem.

• Flexibility in Parallelism: Although the PS and MP algorithms are Pareto optimal, they may not be very useful for users solving the A2A problem because each algorithm executes at only one parallelism level. A better algorithm would be flexible in its parallelism levels, i.e., could be parameterized to run at different levels of M, to give more options to users when optimizing their budgets. In particular, the best one-round MR algorithm would be flexible enough to execute at every parallelism level and also be Pareto optimal (or almost Pareto optimal)
Problem | Input Size | Tradeoff Curve | Parallelism Levels for Pareto Optimal Algorithms | Section
Dense Matrix Multiplication (Square Tiling) | 2n^2 (two n × n matrices) | C ≥ 4n^4/M | ∀M = 2sn, where s divides n | 2.3
Hamming-Distance-1 (Generalized Splitting) | 2^n (n-bit strings) | C ≥ n2^n/log2 M | ∀M = 2^{n/s}, where s divides n | 2.4
Multiway Equijoins (Shares [4] or Grouping) | IN (n tables of size IN/n) | C ≥ IN^{ρ*}/M^{ρ*−1} | approximately for all M, either over a specific distribution of inputs on any query (Shares) or for any instance on specific queries (Grouping) | 2.5
Table 2.2: Summary of one-round MR results.
for each level. Such an algorithm for the A2A problem would be plotted as the squares in Figure 2.2.
2.2.6 Summary of Our Results
Table 2.2 summarizes our results. The table enumerates for each problem the total number of inputs |I|, the lower bound on the communication cost we derived, and the parallelism levels for which Pareto optimal algorithms exist for the problem. For all problems listed, we have algorithms that are Pareto optimal for a wide range of parallelism levels. For the multiway equijoin problem, the algorithms we discuss are Pareto optimal either over a specific distribution of inputs but for any query, or for an arbitrary instance on a set of specific queries. For this problem, the term IN specifies the number of tuples in the input for the problem, and the parameter ρ* will be described in Section 2.5.
Figure 2.3: Input/output relationship for the matrix-multiplication problem.
2.3 Dense Matrix Multiplication
We start with the problem of dense matrix multiplication. We are given as input two n × n matrices R = [rij] and S = [sjk], and we wish to form their product T = [tik], where tik = Σ_{j=1}^n rij sjk. As shown in Figure 2.3, each output tik depends on an entire row of R and an entire column of S, that is, on 2n inputs.
2.3.1 The Tradeoff Curve
We use our 4-step proof template to derive the tradeoff curve.
• Deriving g(M): Suppose a reducer covers the outputs t14 and t23. Then all of rows 1 and 2 of R are input to that reducer, and all of columns 4 and 3 of S
are also inputs. Thus, this reducer also covers outputs t13 and t24. As a result, the set of outputs covered by any reducer form a “rectangle,” in the sense that
there is some set of rows i1, i2, . . . iw of R and some set of columns k1, k2, . . . , kh
of S that are input to the reducer, and the reducer covers all outputs t_{iu kv}, where 1 ≤ u ≤ w and 1 ≤ v ≤ h. We can assume this reducer has no other inputs, since if an input to a reducer is not part of a whole row of R or column of S, it cannot be used in any output made by the reducer. Thus, the number of inputs to this reducer is n(w + h), which must be less than or equal to M, the upper bound on the number of inputs to a reducer. As the total number of outputs covered is wh, it is easy to show
Figure 2.4: Tradeoff curve and algorithms for matrix multiplication.
that for a given M, the number of outputs is maximized when the rectangle is a square; that is, w = h = M/(2n). In this case, the number of outputs covered by the reducer is g(M) = M^2/(4n^2).

• Number of Inputs and Outputs: There are two matrices, each of size n^2. Therefore |I| = 2n^2 and |O| = n^2.

• The Inequality Σ_{i=1}^p g(Mi) ≥ |O|: Substituting for g(Mi) and |O|:
   Σ_{i=1}^p Mi^2/(4n^2) ≥ n^2
• Lower Bound On C: We first leave one factor of Mi on the left as is, and replace the other factor Mi by M. Then, we manipulate the inequality so the expression on the left is the communication cost and obtain:
   C = Σ_{i=1}^p Mi ≥ 4n^4/M
The tradeoff curve is shown by the curve in Figure 2.4.
2.3.2 Algorithms
Figure 2.4 shows the algorithms that match the lower bound C ≥ 4n^4/M for a wide range of parallelism levels. The purely serial (PS) algorithm that performs the entire multiplication at one reducer would trivially cover every output and have a reducer size and communication cost of 2n^2, which is Pareto optimal. A maximally parallel (MP) algorithm that has one reducer rik for each output cell tik would send one row of R and one column of S to each reducer. Note that this algorithm would by construction cover every output. It would also be Pareto optimal, with a reducer size of 2n and a communication cost of 2n^3. As we show next, we can generalize the PS and MP algorithms and match the lower bound by giving each reducer a set of rows of R and an equal number of columns of S. We call this the Square Tiling algorithm. The technique of computing the result of a matrix multiplication by tiling the output by squares is very old indeed [88]. Let s be an integer that divides n, and let M = 2sn. Partition the rows of R into n/s groups of s rows, and do the same for the columns of S. There is one reducer for each pair (G, H) consisting of a group G of R's rows and a group H of S's columns, so there are (n/s) × (n/s) = n^2/s^2 reducers in total. Each reducer has M = 2sn inputs, and can produce all the outputs tik such that i is one of the rows in group G and k is one of the columns in group H. Since every pair (i, k) has i in some group for
R and has k in some group for S, every tik will be produced by exactly one reducer. Since there are n^2/s^2 reducers and each one gets 2sn inputs, the communication cost of the Square Tiling algorithm is 2n^3/s. If we choose s = M/(2n), we see that the communication cost is exactly 4n^4/M, which matches the lower bound exactly. So for each s that divides n, the Square Tiling algorithm is Pareto optimal with an M value of 2sn. Essentially, if there are a large number of s values that divide n, Square Tiling is the best one-round MR algorithm we can hope for.
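The tiling described above can be simulated directly. The sketch below (names are ours, for illustration) enumerates the reducers for a small n and checks the coverage and cost claims:

```python
from itertools import product

def square_tiling(n, s):
    """Square Tiling sketch: rows of R and columns of S are split into
    n/s groups of s each; reducer (G, H) receives row group G of R and
    column group H of S, and covers the s-by-s block of outputs t_ik."""
    assert n % s == 0
    groups = n // s
    covered = {(G, H): [(i, k)
                        for i in range(G * s, (G + 1) * s)
                        for k in range(H * s, (H + 1) * s)]
               for G, H in product(range(groups), repeat=2)}
    M = 2 * s * n          # s rows of R plus s columns of S, n entries each
    C = len(covered) * M   # communication cost = #reducers * reducer size
    return covered, M, C

n, s = 4, 2
covered, M, C = square_tiling(n, s)
# Every output cell t_ik is produced by exactly one reducer.
all_outputs = [o for outs in covered.values() for o in outs]
assert sorted(all_outputs) == sorted(product(range(n), repeat=2))
# Cost matches both 2n^3/s and the lower bound 4n^4/M.
assert C == 2 * n**3 // s == 4 * n**4 // M
```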
Figure 2.5: Input/output relationship for HD-1.
2.4 Hamming Distance 1
We next study a type of fuzzy or similarity join problem of finding pairs of similar strings from a given corpus. Our setup is similar to the A2A problem. We are given as input a corpus of all bit strings of length n, and with each bit string some unknown associated data. Our goal will be to execute some UDF on the associated data for each pair of strings (s, t) that are at Hamming Distance 1 (HD-1), i.e., s and t differ in exactly one bit. Figure 2.5 shows the input and output representation of the HD-1 problem. Such fuzzy join computations appear in several applications, such as recommendation systems [109], entity recognition among labeled records in the web [127], and clustering large-scale genetics data [92]. For example, consider a friendship recommendation system in a social network. Bit strings could represent a set of features of customers. For example, a 1/0 in the first bit could represent a customer's gender being male or female, and the second bit could represent a customer's age being less than or equal to 30 or greater than 30. In addition, each bit string s could have an associated list of customers who have exactly the features that s represents. A recommendation algorithm can then recommend friendships between customers that have similar features, say those whose feature strings s and t are at some Hamming distance d. We start our analysis of HD-1 with a lemma bounding g(M), the maximum
number of outputs that a reducer can cover with M inputs; we will use this lemma in our proof of the tradeoff curve for HD-1.
Lemma 2.4.1. For HD-1, a reducer that is assigned M inputs can cover no more than (M/2) log2 M outputs. Proof. The proof is an induction on n, the length of the bit strings in the input. The basis is n = 1. Here, there are only two strings, so M is either 1 or 2. If M = 1, the reducer can cover no outputs. (M/2) log2 M is 0 when M = 1, so the lemma holds in this case. If M = 2, the reducer can cover at most one output. (M/2) log2 M is 1 when M = 2, so again the lemma holds. Now let us assume the bound for n and consider the case where the inputs consist of strings of length n + 1. Let X be a set of M bit strings of length n + 1. Let Y be the subset of X consisting of those strings that begin with 0, and let Z be the remaining strings of X—those that begin with 1. Suppose Y and Z have y and z members, respectively, so M = y + z. An important observation is that for any string in Y , there is at most one string in Z at Hamming distance 1. That is, if 0w is in Y , it could be Hamming distance 1 from 1w in Z, if that string is indeed in Z, but there is no other string in Z that could be at Hamming distance 1 from 0w, since all strings in Z start with 1. Likewise, each string in Z can be distance 1 from at most one string in Y . Thus, the number of outputs with one string in Y and the other in Z is at most min(y, z). Let us count the maximum number of outputs that can have their inputs within
X. By the inductive hypothesis, there are at most (y/2) log2 y outputs both of whose inputs are in Y , at most (z/2) log2 z outputs both of whose inputs are in Z, and, by the observation in the paragraph above, at most min(y, z) outputs with one input in each of Y and Z. Assume without loss of generality that y ≤ z. Then the maximum number of strings of length n + 1 that can be covered by a reducer with M inputs is
   (y/2) log2 y + (z/2) log2 z + y
We must show that this function is at most (M/2) log2 M, or, since M = y + z, we
need to show:
   (y/2) log2 y + (z/2) log2 z + y ≤ ((y + z)/2) log2 (y + z)   (2.6)

under the condition that y ≤ z. First, observe that when y = z, Equation 2.6 holds with equality. That is, both sides become y(log2 y + 1). Next, consider the derivatives, with respect to z, of the two sides of Equation 2.6. d/dz of the left side is
   (1/2) log2 z + (log2 e)/2

while the derivative of the right side is
   (1/2) log2 (y + z) + (log2 e)/2
Since 0 ≤ y ≤ z, the derivative of the left side is always less than or equal to the derivative of the right side. Thus, as z grows larger than y, the left side remains no greater than the right. That proves the induction step, and we may conclude the lemma.
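For small n, Lemma 2.4.1 can also be verified exhaustively: enumerate every possible reducer input set X of size M and check that the number of HD-1 pairs inside X never exceeds (M/2) log2 M. A brute-force sketch (names are ours):

```python
from itertools import combinations
from math import log2

def hd1_pairs(strings):
    """Count pairs of equal-length bit strings at Hamming distance exactly 1."""
    return sum(1 for s, t in combinations(strings, 2)
               if sum(a != b for a, b in zip(s, t)) == 1)

n = 3
universe = [format(x, f'0{n}b') for x in range(2 ** n)]
# Exhaustively check the bound g(M) <= (M/2) log2 M over every possible
# input set X of size M (255 subsets in total for n = 3).
for M in range(1, 2 ** n + 1):
    for X in combinations(universe, M):
        assert hd1_pairs(X) <= (M / 2) * log2(M)
```

Note that the full cube (M = 2^n) attains the bound with equality: it contains n·2^n/2 = 12 HD-1 pairs for n = 3, and (M/2) log2 M = 4 · 3 = 12.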
2.4.1 The Tradeoff Curve
We can use Lemma 2.4.1 to derive the tradeoff curve for HD-1. We use our proof template from Section 2.2.4.
• Deriving g(M): By Lemma 2.4.1, g(M) = (M/2) log2 M.

• Number of Inputs and Outputs: There are 2^n bit strings of length n. The total number of outputs is (n/2)2^n. Therefore |I| = 2^n and |O| = (n/2)2^n.

• The Inequality Σ_{i=1}^p g(Mi) ≥ |O|: Substituting for g(Mi) and |O| from above:
   Σ_{i=1}^p (Mi/2) log2 Mi ≥ (n/2) 2^n   (2.7)
Figure 2.6: Tradeoff curve and algorithms for HD-1.
• Lower Bound on C: Finally, we employ the manipulation trick from Section 2.2.4, where we arrange the terms of this inequality so that the left side is the communication cost. Recall we must separate a factor Mi from other factors involving Mi by replacing all other occurrences of Mi on the left by the upper bound M. That is, we replace log2 Mi by log2 M on the left of Equation 2.7. Since doing so can only increase the left side, the inequality continues to hold:
   Σ_{i=1}^p (Mi/2) log2 M ≥ (n/2) 2^n   (2.8)
The communication cost is C = Σ_{i=1}^p Mi. We can move factors in Equation 2.8 to derive the tradeoff curve for HD-1: C = Σ_{i=1}^p Mi ≥ n2^n / log2 M. The tradeoff curve for HD-1 is shown as the line in Figure 2.6. Note that compared to both the matrix multiplication and the A2A problems, HD-1 is easier to parallelize. More specifically, changes to parallelism levels (M) have very little effect on the total communication costs of algorithms, because M is inside the logarithm.
2.4.2 Algorithms
Figure 2.6 shows the algorithms that match the lower bound C ≥ n2^n / log2 M for a wide range of parallelism levels. Similar to the matrix multiplication and A2A problems, we have a purely serial (PS) and a maximally parallel (MP) algorithm
on two extremes. Similarly, we can generalize PS and MP by an algorithm we call Generalized Splitting, which is Pareto optimal for a variety of parallelism levels. PS: We have one reducer that gets all inputs. Therefore, M = C = 2^n, which is Pareto optimal.
MP: We have one reducer r_{s,t} for each pair of strings (s, t) that are at Hamming distance 1. We send each string s to all reducers that contain s in their index. Each reducer r_{s,t} gets exactly two strings, therefore M = 2. There are (n/2)2^n outputs, so there are (n/2)2^n reducers, each getting 2 strings. Therefore the communication cost is n2^n, which is also Pareto optimal.

Generalized Splitting: In reference [3], there is an algorithm called Splitting for solving fuzzy joins with general Hamming distance d. For HD-1, Splitting uses 2^{n/2+1} reducers, assuming n is even. Half of these reducers correspond to the 2^{n/2} possible bit strings that may be the first half of an input string. Call these Group-I reducers. The second half of the reducers correspond to the 2^{n/2} bit strings that may be the second half of an input. Call these Group-II reducers. Thus, each bit string of length n/2 corresponds to two different reducers. An input s is sent to 2 reducers: the Group-I reducer that corresponds to its first n/2 bits, and the Group-II reducer that corresponds to its last n/2 bits. Splitting covers every output pair, because every pair of inputs (s, t) at Hamming distance 1 must either agree in the first half of their bits, in which case they are sent to the same Group-I reducer, or agree in the last half of their bits, in which case they are sent to the same Group-II reducer. Splitting uses reducers of size M = 2^{n/2} and incurs a communication cost of 2^{n+1}, because each input is assigned to exactly two reducers. Therefore, according to our tradeoff curve, Splitting is Pareto optimal.

We can generalize the Splitting algorithm to divide strings into c > 2 pieces for any c that divides n. As we show, with this generalization, we also recover the performances of the PS, MP, and Splitting algorithms. We call this algorithm Generalized Splitting. We have c groups of reducers, numbered 1 through c. Write s = s1s2···sc, where each si is a segment of n/c bits. For i = 1, . . . , c, we map s to the Group-i reducer that corresponds to the bit string s1···s(i−1)s(i+1)···sc, that is, s with the ith segment si removed.
We need to argue that this mapping schema solves the problem. Any two strings s and t at Hamming distance 1 will disagree
in only one of the c segments of length n/c, and will agree in every other segment. If they disagree in their ith segments, then they will be sent to the same Group-i reducer, because we map them to the Group-i reducers ignoring the values in their ith segments. Thus, this Group-i reducer will cover the output pair (s, t). Finally, let us compute the reducer size and communication cost of Generalized Splitting for different values of c. In each group, there are 2^{n−n/c} reducers, corresponding to each of the 2^{n−n/c} bit strings of length n − n/c. Since there are exactly 2^{n/c} strings with a particular fixed n − n/c bits, each reducer gets exactly 2^{n/c} strings. Since each string is sent to exactly c reducers, the communication cost is equal to c·2^n, which is the optimal communication cost for M = 2^{n/c} according to the tradeoff curve.
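The mapping above is simple enough to simulate directly. The sketch below (names are ours) builds the Generalized Splitting assignment for small n and c and checks the reducer size, communication cost, and coverage claims:

```python
from collections import defaultdict
from itertools import combinations

def generalized_splitting(n, c):
    """Map each n-bit string to c reducers; the Group-i reducer index is
    the string with its i-th length-(n/c) segment removed."""
    assert n % c == 0
    seg = n // c
    reducers = defaultdict(list)
    for x in range(2 ** n):
        s = format(x, f'0{n}b')
        for i in range(c):
            key = (i, s[:i * seg] + s[(i + 1) * seg:])
            reducers[key].append(s)
    return reducers

n, c = 4, 2
reducers = generalized_splitting(n, c)
# Reducer size M = 2^(n/c); communication cost C = c * 2^n.
assert all(len(v) == 2 ** (n // c) for v in reducers.values())
assert sum(len(v) for v in reducers.values()) == c * 2 ** n
# Every HD-1 pair is covered by at least one reducer.
strings = [format(x, f'0{n}b') for x in range(2 ** n)]
for s, t in combinations(strings, 2):
    if sum(a != b for a, b in zip(s, t)) == 1:
        assert any(s in v and t in v for v in reducers.values())
```

With c = n the simulation degenerates to MP-like behavior (M = 2, C = n·2^n), and with c = 2 it reproduces the original Splitting algorithm.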
2.5 Multiway Equijoins
In this section, we study the problem of computing the multiway equijoin or natural join of multiple database relations in MR. We are given as input n relations R1, . . . , Rn, each with an equal number, IN/n, of tuples, where IN is some positive number. Let A1, . . . , Am be the m attributes that appear in the schemas of the input relations, each taking values from a large domain D. Let attrs(Rj) be the set of attributes that appear in the schema of Rj. Our goal is to compute the equijoin query Q(A1, . . . , Am) = R1 ⋈ R2 ⋈ · · · ⋈ Rn. As an example, recall the Chain-3 join from Section 2.2.1, whose possible inputs and outputs with |D| = 100 were given in Figure 2.1b. Compared to the A2A, matrix multiplication, and HD-1 fuzzy join problems we studied, the version of the multiway equijoins we study here has two differences:

1. The inputs and outputs are not actual but possible inputs and outputs. As a result, when deriving the tradeoff curve, we will constrain ourselves to a distribution of instances called the max-out distribution (Section 2.5.2).

2. There is not a single tradeoff curve applicable to all queries. Each query has a separate tradeoff curve that depends on the attribute structures of the input relations. Specifically, each query has a property called an edge covering number [15]
Fractional Edge Covering:
   ∀j ∈ [1, n], uj ≥ 0
   ∀i ∈ [1, m], Σ_{j s.t. Ai ∈ attrs(Rj)} uj ≥ 1   (2.9)
   minimize Σ_{j ∈ [1,n]} uj

Fractional Vertex Packing:
   ∀i ∈ [1, m], vi ≥ 0
   ∀j ∈ [1, n], Σ_{i s.t. Ai ∈ attrs(Rj)} vi ≤ 1   (2.10)
   maximize Σ_{i ∈ [1,m]} vi

Figure 2.7: Fractional edge coverings and vertex packings of queries.
which defines the tradeoff curve for the query. Before we derive the tradeoff curve, we first review two concepts, fractional edge coverings and fractional vertex packings of a query, and then define the max-out distribution.
2.5.1 Fractional Edge Coverings
A fractional edge cover of Q is any feasible solution of the linear program (LP) shown on the left of Figure 2.7. A fractional edge cover assigns a non-negative real number uj to each relation Rj such that every attribute Ai is "covered", i.e., Σ_{j s.t. Ai ∈ attrs(Rj)} uj ≥ 1. The value of a fractional edge cover is the sum of the uj over all j's. The edge covering number of a query Q, ρ*, is the solution to this LP, i.e., the value of the fractional edge cover with the minimum value. The name "edge" covering, as opposed to, say, "relation" covering, will become clear in Chapter 4 when we define the hypergraph representations of queries. In a remarkable result [15], Atserias, Grohe, and Marx have shown a tight bound on the maximum number of tuples in the output of Q, |Q|, that is a function of only the sizes of the input tables and ρ*. We call this the AGM bound after the names of the authors of reference [15]. In our setting, when joining n tables of size IN/n, their result implies that |Q| ≤ (IN/n)^{ρ*}. Figure 2.8 gives several example queries and their edge covering numbers. The values under each relation represent the uj value assigned to the relation in an optimal fractional edge covering of the query. For example, consider the
Chain-n query: R1(A1,A2) ⋈ R2(A2,A3) ⋈ R3(A3,A4) ⋈ · · · ⋈ Rn(An,An+1). One
Query | Edge Covering (uj values) | ρ*
Chain-n: R1(A1,A2) ⋈ R2(A2,A3) ⋈ R3(A3,A4) ⋈ · · · ⋈ Rn(An,An+1) | 1, 0, 1, . . . , 1 | ⌈(n+1)/2⌉
Cycle-n: R1(A1,A2) ⋈ R2(A2,A3) ⋈ R3(A3,A4) ⋈ · · · ⋈ Rn(An,A1) | 1/2, 1/2, 1/2, . . . , 1/2 | n/2
Snowflake-n: S(B1,...,Bn) ⋈ R1(A1,B1) ⋈ R2(A2,B2) ⋈ · · · ⋈ Rn(An,Bn) | 0, 1, 1, . . . , 1 | n
Star-n: R1(A1,B) ⋈ R2(A2,B) ⋈ · · · ⋈ Rn(An,B) | 1, 1, . . . , 1 | n
Figure 2.8: Example queries and their edge covering numbers.

can show that the minimum value of any edge covering is obtained by assigning 1 to R1 and Rn and alternating between 0 and 1 when assigning values to R2 through R(n−1). Therefore ρ* for Chain-n is equal to ⌈(n+1)/2⌉, and the maximum size of the output of the Chain-n query when each table has IN/n tuples is equal to (IN/n)^{⌈(n+1)/2⌉}. We note that the optimal fractional edge covering need not be unique.
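For the queries in Figure 2.8 whose relations are binary (Chain, Cycle, Star), the edge-cover LP is over a graph and its optimum is attained at half-integral points, so ρ* can be found by brute force over the grid {0, 1/2, 1}. A small sketch (function and query names are ours):

```python
from itertools import product

def rho_star(relations, attrs, grid=(0, 0.5, 1)):
    """Brute-force the fractional edge covering number over a small grid.

    `relations` maps each relation name to its attribute set. For queries
    with binary relations (graph edges), the edge-cover LP has a
    half-integral optimum, so the grid {0, 1/2, 1} suffices."""
    names = list(relations)
    best = float('inf')
    for u in product(grid, repeat=len(names)):
        # Feasible iff every attribute is covered by total weight >= 1.
        if all(sum(uj for name, uj in zip(names, u) if a in relations[name]) >= 1
               for a in attrs):
            best = min(best, sum(u))
    return best

chain3 = {'R1': {'A1', 'A2'}, 'R2': {'A2', 'A3'}, 'R3': {'A3', 'A4'}}
assert rho_star(chain3, {'A1', 'A2', 'A3', 'A4'}) == 2       # ceil((3+1)/2)
cycle4 = {'R1': {'A1', 'A2'}, 'R2': {'A2', 'A3'},
          'R3': {'A3', 'A4'}, 'R4': {'A4', 'A1'}}
assert rho_star(cycle4, {'A1', 'A2', 'A3', 'A4'}) == 2       # n/2
star3 = {'R1': {'A1', 'B'}, 'R2': {'A2', 'B'}, 'R3': {'A3', 'B'}}
assert rho_star(star3, {'A1', 'A2', 'A3', 'B'}) == 3         # n
```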
2.5.2 Vertex Packing and Max-Out Distribution
A fractional vertex packing of a query, shown on the right side of Figure 2.7, is a solution to the dual of the edge covering LP. A vertex packing associates a non-negative real number vi with each attribute Ai such that no relation Rj is "over packed", i.e., Σ_{i s.t. Ai ∈ attrs(Rj)} vi ≤ 1. The value of a fractional vertex packing is the sum of the vi. Due to strong duality of LP, the edge covering number of a query is equal to the optimal (or maximum) value of its vertex packing. Atserias et al. have used an optimal vertex packing to construct an input for Q whose output size is exactly equal to (IN/n)^{ρ*}, thereby showing that the (IN/n)^{ρ*} bound is tight. We review a modified version of their construction to define a distribution of instances, which we call the max-out distribution, where each instance of the query has exactly (IN/n)^{ρ*} outputs. Pick an optimal vertex packing v̄* = (v*1, . . . , v*m) of the query Q. Pick any subset of (IN/n)^{v*i} values from domain D for each attribute Ai, and let Rj consist of the Cartesian product of the values in attrs(Rj). Then:
   |Rj| = Π_{i ∈ attrs(Rj)} (IN/n)^{v*i} = (IN/n)^{Σ_{i ∈ attrs(Rj)} v*i} ≤ IN/n
The last inequality above follows from the fact that Σ_{i ∈ attrs(Rj)} v*i ≤ 1 by the definition of a vertex packing. If |Rj| is strictly less than IN/n, we add dummy null tuples to it to make it of size IN/n. Let us next calculate the size of the output of Q. Note that by construction, two relations Rj and Rk that have the same attribute Ai have exactly the same values. Let γ* be the value of v̄*. Then, |Q| = Π_{i ∈ [1,m]} (IN/n)^{v*i} = (IN/n)^{γ*}. Note that since we picked an optimal vertex packing, by strong duality of LP, γ* = Σi v*i = ρ*. Therefore the total size of Q = R1 ⋈ · · · ⋈ Rn is exactly equal to (IN/n)^{ρ*}. The max-out distribution of Q with input size IN and vertex packing v̄* consists of all of the instances where the inputs correspond to all combinations of picking the attribute values Ai according to the construction we described above. When obvious from the context, we omit the vertex packing v̄* that the max-out distribution comes from. We next give an example.
Example 2.5.1. Consider the Chain-2 query R1(A1,A2) ⋈ R2(A2,A3). The optimal vertex packing for Chain-2 is v̄* = (1, 0, 1). That is, v̄* associates 1 with A1, 0 with A2, and 1 with A3, and it is also the unique vertex packing with value γ* = 2. Let |D| = 4. Let us describe the max-out distribution for Chain-2 with input size 6. There are (4 choose (6/2)^1) = (4 choose 3) = 4 possible subsets for A1, (4 choose (6/2)^0) = (4 choose 1) = 4 possible subsets for A2, and (4 choose (6/2)^1) = 4 possible subsets for A3. Therefore there are 4 × 4 × 4 = 64 possible instances in the max-out distribution of Chain-2 with input size 6. Here is one example that corresponds to the choices of A1 = {a1,1, a1,2, a1,4}, A2 = {a2,1}, and A3 = {a3,1, a3,2, a3,3}. Then, R1 contains the three tuples (a1,1, a2,1), (a1,2, a2,1), (a1,4, a2,1), and R2 contains the three tuples (a2,1, a3,1), (a2,1, a3,2), (a2,1, a3,3). The output of this instance contains 9 tuples, formed from the Cartesian product {a1,1, a1,2, a1,4} × {a2,1} × {a3,1, a3,2, a3,3}. Finally, we note that according to the AGM bound, the maximum output for the Chain-2 join with 3 tuples in R1 and 3 tuples in R2 is equal to 3^{ρ*} = 3^{γ*} = 3^2 = 9.
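The max-out instance of Example 2.5.1 can be built and joined directly; the sketch below (variable names follow the example, the join code is ours) confirms that the output size hits the AGM bound exactly:

```python
from itertools import product

# Max-out instance of Chain-2 from Example 2.5.1:
# A1 = {a11, a12, a14}, A2 = {a21}, A3 = {a31, a32, a33}.
A1 = ['a11', 'a12', 'a14']
A2 = ['a21']
A3 = ['a31', 'a32', 'a33']
R1 = list(product(A1, A2))   # 3 tuples over (A1, A2)
R2 = list(product(A2, A3))   # 3 tuples over (A2, A3)

# Equijoin R1(A1,A2) with R2(A2,A3) on the shared attribute A2.
Q = [(a1, a2, a3) for (a1, a2) in R1 for (b2, a3) in R2 if a2 == b2]
assert len(R1) == len(R2) == 3
# Output size equals the AGM bound 3^(rho*) = 3^2 = 9 exactly.
assert len(Q) == 9
```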
2.5.3 The Tradeoff Curve
We derive a tradeoff curve for the multiway equijoin problem for solving the max-out distribution instances of size IN. We use our proof template from Section 2.2.4. To simplify our analysis, we assume that n and m are constants and omit them in big-oh notation unless they appear in the exponent of some term.

• Deriving g(M): If the reducer gets M/n tuples from each relation, then by the AGM bound the maximum number of outputs it can cover is (M/n)^{ρ*} = O(M^{ρ*}). Now assume that the reducer gets rj tuples from each Rj instead, where each rj ≤ M. Notice that even if each rj is equal to M, by the AGM bound the reducer can cover at most M^{ρ*} outputs, which is also O(M^{ρ*}). Therefore a reducer that gets M tuples, no matter how the M inputs are distributed across the relations, can cover Θ(M^{ρ*}) outputs.

• Number of Inputs and Outputs: By construction of the max-out distribution, there are IN inputs and (IN/n)^{ρ*} = O(IN^{ρ*}) outputs.

• Lower Bound on C: Finally, we employ the manipulation trick to derive a lower bound on C. For clarity of presentation, we omit the big-oh notation.
   Σ_{i=1}^p Mi^{ρ*} ≥ IN^{ρ*}  ⇒  Σ_{i=1}^p Mi · M^{ρ*−1} ≥ IN^{ρ*}
   C = Σ_{i=1}^p Mi ≥ IN^{ρ*} / M^{ρ*−1}   (2.11)

We have derived a lower bound on the communication cost of algorithms that solve the instances in the max-out distribution of a query Q. Note that this lower bound is trivially a valid lower bound for algorithms solving all instances as well. We also note that, to the best of our knowledge, there is no known better tradeoff curve that is defined over all instances of size IN.
2.5.4 Algorithms
Trivially, there is the purely serial algorithm that sends all of the input to one reducer, which is Pareto optimal on an arbitrary instance. However, there is no known algorithm that is close to being Pareto optimal for any other parallelism level on an arbitrary input. Instead, we describe two algorithms, the Shares algorithm from reference [4] and the Grouping algorithm, which can be Pareto optimal or close to it in two different settings: (1) if we limit ourselves to solving instances of the max-out distribution, then Shares can be parameterized to be Pareto optimal for a wide range of parallelism levels; and (2) on an arbitrary input, if the edge covering number of Q is close to n, i.e., ρ* ≈ n, then Grouping can be close to Pareto optimal for a wide range of parallelism levels. We describe each algorithm next.

Shares: We give a brief description of the algorithm here and refer the reader to reference [4] for details. The Shares algorithm takes as input a number of reducers p and a share si for each attribute Ai such that p = s1s2···sm. The reducers are indexed from r_{1,1,...,1} to r_{s1,s2,...,sm}. The algorithm uses m hash functions h1, h2, . . . , hm, where hi hashes values from D to [1, si]. Suppose relation Rj has k attributes Aj1, Aj2, . . . , Ajk. The algorithm maps each tuple t = (aj1, aj2, . . . , ajk) of Rj to all reducers whose index in the dimension of attribute Aji equals hji(aji), for every attribute of Rj; the remaining dimensions of the index range over all possible values. This guarantees that the input tuples that make up each output tuple o = (a1, a2, . . . , am) meet at reducer r_{h1(a1),h2(a2),...,hm(am)}. For example, consider the Chain-2 join, and suppose p is 30 and s1, s2, and s3 are 5, 3, and 2, respectively. Consider the R1 tuple (7, 9) and the R2 tuple (9, 6). Further suppose h1(7) = 4, h2(9) = 1, and h3(6) = 2. Tuple (7, 9) will be sent to reducers r_{4,1,1} and r_{4,1,2}. Tuple (9, 6) will be sent to the five reducers r_{1,1,2}, r_{2,1,2}, . . . , r_{5,1,2}. Therefore, both tuples will meet at reducer r_{4,1,2}, which can produce the output (7, 9, 6).

Notice that the reducer size and communication cost of Shares depend on the actual inputs and how the shares are picked. Consider the max-out distribution of a query Q with input size IN and vertex packing v̄* = (v*1, . . . , v*m). Let γ* be the value of v̄* as usual. Recall that γ* is equal to the edge covering number ρ* of Q. Our next lemma shows that Shares can be parameterized to be Pareto optimal for a wide range of parallelism levels on this max-out distribution.
Lemma 2.5.1. On the max-out distribution with v̄* = (v*1, . . . , v*m), Shares with each si = p^{v*i/ρ*}, with high probability, assigns M = O(IN/p^{1/ρ*}) inputs per reducer and incurs a communication cost of O(IN^{ρ*}/M^{ρ*−1}).

Proof. Note that with si = p^{v*i/ρ*}, s1s2···sm is equal to p, since Σi v*i = ρ*. Recall that by construction, on the max-out distribution, there are Π_{i ∈ attrs(Rj)} IN^{v*i} = IN^{Σ_{i ∈ attrs(Rj)} v*i} tuples in Rj. For simplicity, we assume that we have perfect hash functions; otherwise, our results can be restated to hold with high probability by a standard application of Chernoff bounds. Then any reducer r on average will get a fraction 1/Π_{i ∈ attrs(Rj)} si of the tuples in Rj. Replacing si by p^{v*i/ρ*}, we see that each reducer on average gets
   IN^{Σ_{i ∈ attrs(Rj)} v*i} / Π_{i ∈ attrs(Rj)} p^{v*i/ρ*} = IN^{Σ_{i ∈ attrs(Rj)} v*i} / p^{Σ_{i ∈ attrs(Rj)} v*i / ρ*} = (IN / p^{1/ρ*})^{Σ_{i ∈ attrs(Rj)} v*i}

tuples of Rj. Then the total number of tuples that reducer r gets is equal to:
   Σ_{j ∈ [1,n]} (IN / p^{1/ρ*})^{Σ_{i ∈ attrs(Rj)} v*i}   (2.12)
Notice that for each Rj, Σ_{i ∈ attrs(Rj)} v*i ≤ 1 by the definition of a vertex packing. However, because we assume that v̄* is an optimal vertex packing, for at least one of the relations Σ_{i ∈ attrs(Rj)} v*i has to be exactly 1, which then dominates the sum in Equation 2.12. Therefore, M is equal to IN/p^{1/ρ*}. Since there are p reducers, C, i.e., the communication cost, equals pM = p^{(ρ*−1)/ρ*} IN. Solving for p from M = IN/p^{1/ρ*}, we get p = (IN/M)^{ρ*}, which gives us C = IN^{ρ*}/M^{ρ*−1}. This is exactly the tradeoff curve we derived in Equation 2.11, completing our proof.
Grouping: We next describe the general version of the Grouping algorithm we demonstrated on the Chain-3 join in Section 2.2.2. Let p be the number of reducers. The algorithm divides each relation into p^{1/n} groups of size IN/(n·p^{1/n}). The reducers are indexed from r_{1,1,...,1} to r_{p^{1/n},...,p^{1/n}}. Group k of relation Rj is sent to all reducers whose jth index is equal to k. Consider an output o that depends on tuples t1, . . . , tn
from each relation Rj, which are assigned to groups g1, . . . , gn. Then o will be covered by reducer r_{g1,...,gn}, so the Grouping algorithm is correct. The reducer size is M = IN/p^{1/n}, because each reducer gets IN/(n·p^{1/n}) tuples from each of the n relations. Therefore p = IN^n/M^n, and the communication cost is equal to C = pM = IN^n/M^{n−1}. Compared to the tradeoff curve C = IN^{ρ*}/M^{ρ*−1}, we can see that the Grouping algorithm is Pareto optimal for essentially all parallelism levels when ρ* = n. Examples of queries whose edge covering numbers are equal to n include the Chain-2 join, the Star-n join, and any Cartesian product of n relations. In general, as long as ρ* ≈ n, the Grouping algorithm performs well on every parallelism level. We emphasize that the Grouping algorithm's optimality is on arbitrary instances, and not just on a max-out distribution of the query when ρ* = n. However, as we will discuss in Chapter 4, if we assume a different, potentially easier to solve distribution of instances, the Grouping algorithm's performance can be suboptimal. Finally, we note that in reference [19], Beame et al. have also shown that the shares of the Shares algorithm can be picked to yield a communication cost of IN^m/M^{m−1} on arbitrary instances, which performs well on every parallelism level for queries with ρ* values equal or close to m. For example, consider a query Q that consists of all of the (m choose 2) binary relations one can form out of m attributes. The edge covering number of this query is equal to m/2, while n = O(m^2). Therefore Shares parameterized as described in reference [19] would perform much better than the Grouping algorithm for this query.
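The Grouping algorithm can be sketched for the simplest ρ* = n case, the Cartesian product of two relations. The code below (names are ours, for illustration) checks that every output pair is covered by exactly one reducer and that the reducer size matches M = IN/p^{1/n}:

```python
from itertools import product

def grouping(relations, p):
    """Grouping sketch: each relation R_j is split into p^(1/n) equal
    groups; reducer (g1,...,gn) receives group g_j of each R_j."""
    n = len(relations)
    g = round(p ** (1 / n))           # number of groups per relation
    def grp(R, t):                    # group of tuple t within relation R
        return R.index(t) // (len(R) // g)
    reducers = {}
    for idx in product(range(g), repeat=n):
        reducers[idx] = [t for j, R in enumerate(relations)
                         for t in R if grp(R, t) == idx[j]]
    return reducers

# Cartesian product of two relations (rho* = n = 2), IN = 8 tuples, p = 16.
R1 = [('a', i) for i in range(4)]
R2 = [('b', i) for i in range(4)]
reducers = grouping([R1, R2], p=16)
# Reducer size M = IN / p^(1/n) = 8 / 4 = 2, and C = p*M = IN^n / M^(n-1).
assert all(len(v) == 2 for v in reducers.values())
assert 16 * 2 == 8 ** 2 // 2 ** 1
# Every output pair is covered by exactly one reducer.
for t1, t2 in product(R1, R2):
    assert sum(t1 in v and t2 in v for v in reducers.values()) == 1
```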
2.6 Related Work
We divide related work into models of MR, MR algorithms, and other work that relates to the background material for this chapter.
2.6.1 Models of MapReduce
An interesting model of computation that is inspired by MR is the Massively Unordered Distributed (MUD) model from reference [43]. In this model, algorithms are
only allowed to solve a computational problem by performing binary merge operations between two arbitrary inputs in iterations. MUD is more constrained than general MR algorithms, as in MR, reducers are not constrained to binary merge operations and can perform any computation on all of the inputs they receive. MUD can be thought of as a distributed streaming model, where instead of one machine that can process the entire stream, there are multiple machines that process different partitions of the stream. The focus of reference [43] is to compare the power of the streaming model with that of the MUD model. Karloff et al. [77] have proposed the first theoretical model directly for analyzing algorithms in MR. They define a computational complexity class called $\mathcal{MRC}^i$ of problems that can be computed in $O(\log^i n)$ rounds of MR. In addition, the size and the number of reducers that algorithms are allowed to use are sublinear in the size of the input to problems. They focus on giving concrete algorithms to show that several problems are in $\mathcal{MRC}^0$ or $\mathcal{MRC}^1$, i.e., can be executed in a constant or logarithmic number of rounds. The work closest to our setting is the MPC model, for massively parallel communication, introduced by Beame, Koutris, and Suciu to study the complexity of multiway equijoins in a series of papers [18, 19, 80]. In the MPC model, there are p machines. Each machine initially holds IN/p of the tuples of the input relations. Each machine performs some local computation on its partition and sends messages to other machines in synchronous rounds. The constraint is that each machine is allowed to receive at most $\frac{IN}{p^{1-\varepsilon}}$ tuples, where $\varepsilon$ is called the space exponent of the model. Proving bounds on the space exponent for evaluating different queries implies lower bounds on the communication cost of algorithms for different reducer sizes, i.e., a tradeoff curve.
However, their model is slightly stronger than our model because each machine is allowed to look at its entire partition of the data before the first round starts. Algorithms can use this extension to estimate the frequency of each attribute value. We next review each of these papers:
• In reference [80], the authors characterize queries that can be evaluated in an “embarrassingly parallel” fashion, i.e., with C = IN for any parallelism level, in one round.
• In reference [18], the authors study the tradeoff for evaluating queries in one round when the instances of the query come from a distribution called matching databases. Matching databases represent instances without any skew. They also show how to pick the shares of the Shares algorithm to execute in a Pareto opti- mal fashion over matching databases for different parallelism levels. Our results were derived for the max-out distribution, which essentially represents inputs with maximum skew. For example, for the Chain-2 join, each instance of the
max-out distribution contains a single A2 value. While our tradeoff curves relate to the optimal edge coverings of queries, the tradeoff curves derived in this work relate to the optimal edge packings of queries. A query's edge packings are always smaller than its edge coverings. Therefore a good interpretation of these results is that systems require more communication to solve skewed instances of queries than instances without skew, i.e., skew makes problems “harder” to parallelize. In this paper, the authors also derive tradeoff curves for multiround algorithms for a subclass of queries called tree-like queries under matching databases.
• In reference [19], the authors study the tradeoff of evaluating queries under any skew in the input distribution in one round. Essentially, between the matching databases and the max-out distribution we defined, there is a spectrum of skew assumptions one can make, and the authors rigorously study the effects of different amounts of skew.
We note that no prior work derives tradeoff curves for fuzzy joins or matrix multiplication for one-round MR computations.
2.6.2 MapReduce Algorithms
• Matrix Multiplication: Reference [102] studies the problem of matrix multiplication in MR. The authors derive lower bounds on the number of MR rounds required to multiply two dense or two sparse matrices as a function of different reducer sizes and the cumulative memory available in the cluster. Although their results yield lower bounds for a constant number of rounds, they are not at the granularity to yield lower bounds for one-round MR algorithms.
• Fuzzy Joins: A number of recent works have explored MR algorithms for fuzzy joins. Usually, the notion of similarity is that the two elements are within distance d according to some distance measure. Reference [3] presents several one-round algorithms for solving fuzzy joins under Hamming, Edit, and Jaccard distances. The Generalized Splitting algorithm we described in Section 2.4.2 is a generalization of the Splitting algorithm from reference [3]. Reference [125] presents a one-round MR algorithm that identifies similar records based on the Jaccard similarity of sets, using the length/prefix-based methods of reference [30], combined with the positional and suffix filtering techniques of reference [129]. Reference [17] shows improvements over reference [125] by using two MR jobs rather than one. Finally, reference [113] gives multiround algorithms for fuzzy joins.
• Multiway Equijoins: The one-round Shares algorithm for evaluating multiway equijoins was introduced in reference [4]. In the original description of the algorithm, the shares of each attribute were picked to minimize the communication cost assuming that input tuples were drawn uniformly at random from domain D and the sizes of the input tables were not necessarily equal. There have also been several other studies that design multiround query plans that break a multiway equijoin into multiple binary joins [49, 128]. In Chapter 4, where we study multiround algorithms for multiway equijoins, we review related work on multiround join algorithms in more detail.
2.6.3 Other Related Work
We have covered two structural properties of equijoin queries called their edge covers and vertex packings. References [51, 58] provide more details on these notions and other structural properties of queries.
Chapter 3
Fuzzy Joins With Hamming and Edit Distance d > 1
Recall from Chapter 2 that in the fuzzy-joins problem, we are given a corpus of strings, and with each string some unknown associated data. Our goal is to execute a user-defined function (UDF) for each pair of strings that are at distance at most d for some distance measure. In Chapter 2, we looked at the tradeoff curve and one-round MapReduce (MR) algorithms for “HD-1”, where the distance measure was Hamming distance and d was 1. In this chapter, we study fuzzy joins for larger Hamming distances, and also for edit distances. Unfortunately, for larger Hamming distances and for any edit distance, we cannot derive tight tradeoff curves. However, we will describe an algorithm, called the Generalized Anchor Points algorithm (GAP), to solve fuzzy joins under large Hamming and edit distances. GAP improves and generalizes the Anchor Points (AP) algorithm from reference [3], which relies on finding a set of strings, called anchor points, with the property that all strings in the input corpus have small distance to some anchor point. The problem of finding a set of anchor points for Hamming distance has been studied under the term “covering codes.” Although there are existing covering codes for Hamming distance, no covering codes for edit distance have been proposed. The main technical contribution of this chapter is to show the construction of the first edit-distance covering codes. Computing fuzzy joins under larger Hamming and edit distances in one round of
Figure 3.1: Algorithms for Hamming distance d > 1.
MapReduce has been studied previously in reference [3]. Figure 3.1 shows the reducer size and communication costs of existing algorithms for Hamming distance, subject to some assumptions we will explain later. As usual, the x-axis (M) is the reducer size (or parallelism level) and the y-axis (C) is the communication cost. The different points drawn are the communication costs of different algorithms at different reducer sizes. The Naive, Splitting, Ball-Hashing-2, and Anchor Points algorithms in the figure are from reference [3]. The Splitting algorithm extends the one we described in Chapter 2 for Hamming distance 1 to distances greater than 1. All of these algorithms, including the Generalized Anchor Points algorithm we will describe in this chapter, can also run under edit distance with slightly different performance. So, a figure with the same overall structure as Figure 3.1 can be drawn for edit distance. The main shortcoming of previous algorithms is that, aside from the Naive algorithm, they are designed for a single parallelism level, thus the single point shown in Figure 3.1. Recall from Chapter 2 that an important feature of good MR algorithms is flexibility in parallelism levels: algorithms that can be parameterized to execute at many parallelism levels allow users to pick different parallelism levels for different clusters. The Naive algorithm provides this flexibility, but is very inefficient.
The GAP algorithm provides flexibility in parallelism in an efficient way. Specifically, GAP supports n different parallelism levels, where n is the length of the input strings, and is more efficient than the Naive algorithm at each parallelism level at which it can execute. We begin this chapter by defining the problem of fuzzy joins under larger Hamming and edit distances in Section 3.1. In Section 3.2, we give an overview of the Naive algorithm as a comparison point against the AP and GAP algorithms. In Section 3.3, we review the notion of covering codes and describe GAP, which generalizes and improves the AP algorithm. Section 3.4 discusses a method to explicitly construct covering codes for longer strings if we already have codes for shorter strings. This method can be used to construct new codes for both Hamming and edit distance. The bulk of this chapter is devoted to Sections 3.5-3.7, where we describe our covering-code constructions for edit distance. Finally, in Section 3.8, we review related work on covering codes.
3.1 Problem Definition
Hamming Distance d (HD-d): The problem setup is similar to that of the HD-1 problem from Chapter 2. We are given as input a corpus of all bit strings of length n, and with each bit string some unknown associated data. Our goal is to execute a UDF on the associated data for each pair of strings (s, t) that are at Hamming distance at most d, i.e., s and t differ in at most d bits.
Edit Distance d (ED-d): We are given as input a corpus of all strings of length n over some alphabet Γ, and with each string some unknown associated data. Throughout our analyses we assume that |Γ| = a and, without loss of generality, that the letters in Γ are the integers from 0 to (a − 1). Our goal is to execute a UDF for all pairs of strings u and w that are at edit distance d for some even d, i.e., u can be turned into w by a combination of d/2 insertions and d/2 deletions. Figure 3.2 shows the bipartite input-output representation of the ED-2 problem for strings of length 6 over an alphabet of size 4. For example, the output contains the pair w = 013011 and u = 030112, because we could delete the first 1 in w and get z = 03011 and then
add the letter 2 to the end of z to construct u. A natural application of edit-distance fuzzy joins is in genetics: the four letters 0-3 could correspond to the four nucleobases A, C, G, T in DNA, and each string w could correspond to the possible code segments in a particular location of the genome. Finally, the associated data for string w could be the list of species that contain w in their genomes.
Figure 3.2: Input/output relationship for ED-2.
3.2 The Naive Algorithm
The Naive algorithm is very similar to the Grouping algorithm we described for the multiway equijoins problem in Chapter 2. Since |Γ| = a, there are $a^n$ input strings of length n. We divide the input strings into $g \in [2, a^n]$ groups of size $\frac{a^n}{g}$. We have $\binom{g}{2}$ reducers $r_{i,j}$ s.t. $1 \le i < j \le g$ and an additional g reducers $s_1, \ldots, s_g$. We send each group i to all reducers that contain i in their index. This mapping guarantees that every pair of input strings meets in one reducer. Trivially, it also guarantees that every pair of strings that are at HD-d or ED-d meet in one reducer. We next calculate the reducer size and communication cost of the Naive algorithm:
• M: Reducers $r_{1,2}, \ldots, r_{g-1,g}$ get two groups of inputs of size $\frac{a^n}{g}$ and reducers $s_1, \ldots, s_g$ get only a single group. Therefore, the reducer size M is $\frac{2a^n}{g}$.
• C: The communication cost is $\frac{g(g-1)}{2}\frac{a^n}{g} + g\frac{a^n}{g} = \frac{(g+1)a^n}{2}$. By substituting $\frac{2a^n}{M}$ for g, we see that the communication cost of the Naive algorithm is $\approx \frac{a^{2n}}{M}$.
n | k | |A| | Explicit/Existence
n ∈ N | 0 | $2^n$ | Explicit: contains all strings of length n.
n ∈ N | n | 1 | Explicit: contains an arbitrary string.
$n = 2^r − 1$ for r ∈ N | 1 | $\frac{2^n}{n+1}$ | Explicit: the Hamming code.
n ∈ N | k ≤ n | $\frac{n2^n}{B(k)}$ | Existence
Table 3.1: Hamming distance covering codes.
For example, if we use $a^n$ groups, i.e., $g = a^n$, then $M = 2$ and $C = \frac{a^{2n}}{2}$.
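The group-pair mapping above can be sketched as follows; this is a toy illustration, assuming a hash-based assignment of strings to groups (any balanced assignment works).

```python
import itertools

def naive_reducers(strings, g):
    """Sketch of the Naive algorithm: partition the input into g groups,
    send groups i and j to pair reducer r_{i,j} for every i < j, and
    group i alone to singleton reducer s_i."""
    groups = {i: [] for i in range(g)}
    for s in strings:
        groups[hash(s) % g].append(s)  # hypothetical group assignment
    reducers = {}
    for i, j in itertools.combinations(range(g), 2):
        reducers[("r", i, j)] = groups[i] + groups[j]
    for i in range(g):
        reducers[("s", i)] = list(groups[i])
    return reducers
```

Two strings in the same group meet at a singleton reducer; two strings in different groups i < j meet at pair reducer $r_{i,j}$, so every pair of input strings meets somewhere.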
3.3 Generalized Anchor Points Algorithm and Covering Codes
The Naive algorithm is very inefficient because it compares every pair of input strings regardless of their distance. GAP does much better by comparing significantly fewer strings. The main building block of GAP is a covering code. An (n, k) covering code is a set A of strings (called codewords) over an alphabet of size a such that every string of length n is within distance k of some codeword in A. We call k the radius of code A. The length of the codewords can be different for different distance measures. Under Hamming distance, the codewords are also of length n. However, in edit distance the codewords can be shorter or longer than n, as we allow insertions and deletions to the strings.
Example 3.3.1. Consider Hamming distance when the strings are constructed from the binary alphabet {0, 1}, i.e., a is 2. Then, A = {00000, 11111} is a (5, 2) covering code. Notice that any bit string of length 5 either has at most two 1’s, in which case it is distance at most 2 from 00000, or it has at most two 0’s, in which case it is at distance 2 or less from 11111.
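The covering property of Example 3.3.1 can be checked exhaustively with a few lines of Python:

```python
from itertools import product

def hamming(u, v):
    """Hamming distance between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

code = ["00000", "11111"]
# every length-5 bit string is within Hamming distance 2 of a codeword,
# so A = {00000, 11111} is a (5, 2) covering code
assert all(min(hamming(w, c) for c in code) <= 2
           for w in ("".join(bits) for bits in product("01", repeat=5)))
```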
The GAP algorithm generalizes the AP algorithm from reference [3]. We first give a very brief overview of the AP algorithm. Let A be an (n, d) covering code. AP
uses A to find pairs of strings at distance d as follows. There is one reducer for each codeword in A. The mappers send each string w to the reducer for each codeword at distance at most 2d from w. Each reducer then searches for strings at distance up to d from each other, among the strings it has received. Reference [3] has shown that sending each string to all codewords that are at distance at most 2d guarantees that each pair of strings that are at distance d is covered by at least one codeword. GAP improves on AP in four ways:
1. We generalize the algorithm to work with any (n, k) code, as opposed to fixed (n, d) codes. This generalization gives the algorithm the flexibility to execute at n different parallelism levels.
2. We show that it is possible to guarantee that two strings meet at one reducer by sending each string to all codewords at distance 3d/2 as opposed to 2d. This optimization decreases the communication cost of the algorithm for each reducer size.
3. We describe a general method to explicitly construct small codes for radius k > 1 and length n for any distance measure, rather than relying on the nonconstructive existence proof in reference [3]. Our method, called the cross-product method, assumes codes of smaller radius $k' < k$ and shorter length $n' < n$ are known. For Hamming distance, we can construct near-optimal sets of codes for radius k > 1 and any n by using the Hamming codes for shorter length strings and the cross-product method.
4. We show the existence and explicit construction of the first non-trivial edit distance covering codes for radius values 1 and 2. Using our cross-product method, we can use these codes to construct edit-distance codes for larger radius values. Our edit distance covering codes make it possible to use GAP to solve the ED-d problem efficiently.
We next explain the GAP algorithm. Let A be an (n, k) covering code. As before, there is one reducer for each codeword in A. The mappers now send each input string
w to all the codewords at distance at most k +d/2 from w. The reducers find all pairs of received strings that are at distance up to d. The proof that all pairs of distance d are covered by one reducer follows from the triangle inequality, as we next prove:
Theorem 3.3.1. If A is an (n, k) covering code, then any two strings that are within distance d are within distance k+d/2 from some codeword in A, assuming the triangle inequality holds for the given distance measure.
Proof. Let w and x be the two strings at distance d. Then, we may find a y at distance d/2 from both w and x. For example, under Hamming distance we can do the following: we start with w, choose any d/2 bits where w and x disagree, and flip those bits in w to agree with x, producing a string y that is at distance d/2 from both w and x. Since A is a covering code, there is a member of A, say z, at distance at most k from y. By the triangle inequality, w and x are each within k + d/2 distance from z.
Example 3.3.2. Let n be 5, k and d be 2, and the distance measure be Hamming distance. Then, we can use the (5, 2) code A = {00000, 11111} from Example 3.3.1. Let w = 01010 and x = 11000. Notice that w and x differ in their first and fourth bits; so they are at distance 2. Notice that in this case, there are two possible middle strings y that are at distance 1 from both w and x: (1) 11010; and (2) 01000. Let y be 11010, which is at distance 2 from codeword 11111. Then we get a codeword that is at distance at most 2 + 1 = 3 from both w and x. If we pick y as 01000, which is at distance 1 from codeword 00000, then we get a codeword that is at distance at most 2 from w and x.
We note that the triangle inequality holds for both Hamming and edit distance, the two distance measures we study in this chapter.
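A minimal sketch of this mapping under Hamming distance follows; it is an illustration, not the thesis's implementation, and the reducer-side search is a plain pairwise scan.

```python
from itertools import combinations, product

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def gap_find_pairs(strings, code, k, d):
    """Sketch of GAP: one reducer per codeword; each string is sent to
    every codeword within distance k + d/2; each reducer reports the
    received pairs at distance at most d."""
    reducers = {c: [] for c in code}
    for w in strings:
        for c in code:
            if hamming(w, c) <= k + d // 2:
                reducers[c].append(w)
    pairs = set()
    for received in reducers.values():
        for u, v in combinations(received, 2):
            if hamming(u, v) <= d:
                pairs.add(frozenset((u, v)))
    return pairs
```

With the (5, 2) code of Example 3.3.1 and d = 2, this recovers exactly the pairs of length-5 bit strings at Hamming distance at most 2, as Theorem 3.3.1 guarantees.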
3.3.1 Cost Analysis
Notice that GAP has the flexibility to use an (n, k) code for k = 0, . . . , n. Thus, GAP can run at n different levels of parallelism. For a fixed value of k, the reducer size of GAP is M = B(k + d/2). The communication cost of GAP depends on the size of
the code. Since each codeword gets all the strings within its radius of k + d/2, the communication cost of GAP is equal to C = |A|B(k + d/2).
3.4 Constructing Explicit Codes By Cross Product
In order to use GAP in practice, we need to construct small (n, k) codes for different values of k, alphabet sizes a, and distance measures. Recall from Table 3.1 that for Hamming distance, we have perfect (n, 1) codes for certain values of n. In Section 3.5, we will also show how to construct small (n, 1) and (n, 2) codes for edit distance. In this section, we show how to construct codes with larger radius values k by using codes for shorter strings and smaller radius values. We call our method the cross-product method. Using the cross-product method, we can build on these codes and construct small (n, k) codes for both Hamming and edit distance. We start with the smallest possible (n/x, k/x) covering code over the given alphabet and extend it as follows.
Theorem 3.4.1. Let A be any (n/x, k/x) covering code over an alphabet of size a. Let $A' = A^x$ be the set of strings that are formed by taking the cross-product of A with itself x times. That is, we concatenate the strings in A x times in all combinations allowing repetition. Then $A'$ is an (n, k) covering code over the same alphabet.
Proof. Given a string w of length n, write $w = w_1w_2 \cdots w_x$, where each $w_i$ is of length n/x. Then, by definition of A, there is a codeword $c_i$ in A that is at distance at most k/x from $w_i$. Notice that the concatenation $c_1c_2 \cdots c_x$ is a string in $A'$, and is at distance at most k from w.
We note that Theorem 3.4.1 holds for any distance measure in which, given two strings $w = w_1w_2 \cdots w_x$ and $u = u_1u_2 \cdots u_x$ of length n, where each $w_i$ and $u_j$ is of length n/x, dist(w, u) is at most the sum of the distances $dist(w_i, u_i)$, i.e., $dist(w, u) \le \sum_i dist(w_i, u_i)$. This property holds for both Hamming and edit distance.
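Theorem 3.4.1's construction is mechanical enough to sketch and verify by brute force for Hamming distance; `cross_product_code` below is an illustrative helper, not code from the thesis.

```python
from itertools import product

def hamming(u, v):
    """Hamming distance between equal-length strings."""
    return sum(a != b for a, b in zip(u, v))

def cross_product_code(code, x):
    """x-fold concatenation of an (n/x, k/x) code's codewords in all
    combinations (Theorem 3.4.1); the result is an (n, k) code."""
    return ["".join(parts) for parts in product(code, repeat=x)]

# starting from the (5, 2) code of Example 3.3.1, build a (10, 4) code
code_10_4 = cross_product_code(["00000", "11111"], 2)
```

Splitting any length-10 string into two halves, each half is within distance 2 of a base codeword, so the concatenation of those two codewords is within distance 4.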
Example 3.4.1. Let n = 28, k = 4, and a = 2, and the distance measure be Hamming distance. Let x be equal to 4 as well. Then $n/x = 7 = 2^3 − 1$, so m = 3. There is a Hamming code of length $2^3 − 1 = 7$, with $2^{(n/x)−m} = 16$ members. Thus, there is a covering code for n = 28 and k = 4 with $16^4 = 2^{16}$ members. That is, a fraction $2^{−km} = 2^{−12}$, or 1/4096, of the $2^{28}$ bit strings of length 28 is in the covering code constructed by Theorem 3.4.1. In comparison, the bound implied by a hypothetical (28, 4) perfect code, which is not necessarily attainable, states that only one in $\sum_{i=0}^{4}\binom{28}{i}$, or 1 in 24,158, of the binary strings of length 28 must be in a covering code for n = 28 and k = 4.
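The arithmetic in Example 3.4.1 can be checked directly:

```python
import math

# a Hamming code of length 7 = 2**3 - 1 has 2**(7-3) = 16 codewords;
# its 4-fold cross product is a (28, 4) covering code of 16**4 = 2**16
# codewords, i.e. a 2**-12 = 1/4096 fraction of all 2**28 strings
assert 16 ** 4 == 2 ** 16 and 2 ** 28 // 2 ** 16 == 4096

# the sphere-covering bound for a hypothetical perfect (28, 4) code
assert sum(math.comb(28, i) for i in range(5)) == 24158
```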
In general, if we start with a perfect (n/k, 1) covering code, the code constructed by the cross-product method would be of size $\left(\frac{a^{n/k}}{B_{n/k}(1)}\right)^k = \frac{a^n}{B_{n/k}(1)^k}$. In contrast, a perfect (n, k) code would have a size of $\frac{a^n}{B_n(k)}$, where $B_{n/k}(1)$ is the size of the ball of radius 1 for strings of length n/k and $B_n(k)$ is the size of the ball of radius k for strings of length n. Therefore the ratio of the size of the code constructed by the cross-product method to that of a hypothetical perfect code is $\frac{B_n(k)}{B_{n/k}(1)^k}$.
Example 3.4.2. Assume we are looking at Hamming distance over binary strings and k = 2. A hypothetical (n, 2) code would have a size of $\frac{2^n}{B_n(2)} = \frac{2^n}{(n^2+n+2)/2}$. In contrast, the code constructed by the cross-product method starting with an (n/2, 1) perfect code would have a size of $\frac{2^n}{B_{n/2}(1)^2} = \frac{2^n}{(n/2+1)^2} = \frac{2^n}{(n^2+4n+4)/4}$. This quantity is less than twice the size of the hypothetical (n, 2) code.
3.5 Edit-Distance Covering Codes
We can use GAP to solve the ED-d problem if we can construct covering codes for edit distance. Unlike Hamming distance, with edit distance, we can cover strings by using insertions, deletions, or a combination of these. We shall focus on covering codes that cover strings of a fixed length, using only insertions or only deletions, so the covering code itself has strings of a fixed length.
Definition 3.5.1. (Insertion-k Covering Code): A set A of strings of length n + k is an insertion-k covering code for length n, distance k, and alphabet Γ if for every string w of length n over Γ we can insert k characters from Γ into w and produce some string in A. Equivalently, for every w of length n we can find some string x in CHAPTER 3. GENERALIZED ANCHOR POINTS ALGORITHM 58
A such that it is possible to delete k positions from x and produce w. We say that x covers w in this case.
Definition 3.5.2. (Deletion-k Covering Code): Similarly, we say a set A of strings of length n − k is a deletion-k covering code for length n, distance k, and alphabet Γ if for every string w of length n over Γ we can delete k positions from w and produce some string in A. Again, we say that x covers w if so.
Recall that we assume that |Γ| = a and w.l.o.g. the letters in Γ are the integers from 0 to (a − 1). Finding covering codes for edit distance is harder than for Hamming distance, since there is no convenient “perfect” code like the Hamming codes to build from. One tricky aspect of working with edit distance is that certain deletions and insertions have the same effect. For instance, deleting from any of the three middle positions of 01110 yields 0110. When we want to develop a covering code, this phenomenon actually works against us. For example, under Hamming distance, when n = 5 and k = 1 and we have the binary alphabet, each string can be covered by n other strings. However, in edit distance, if we want a deletion-1 code for n = 5, then 00000 requires us to have 0000 in the code, since every deletion of one position from 00000 yields 0000. Likewise, the code must have 1111; there are no options.
3.5.1 Elementary Lower Bounds
There are simple arguments that say a covering code cannot be too small; these are obtained by giving an upper bound on the number of strings one codeword can cover. For example, [3] shows that a string of length n − 1 over an alphabet of size a yields exactly n(a − 1) + 1 strings by a single insertion. That observation gives us a lower bound on the size of a deletion-1 code for strings of length n: such a code must contain at least $\frac{a^n}{n(a-1)+1}$ strings. For insertion-1 codes, different strings of length n + 1 can cover different numbers of strings of length n. The number of strings covered is the number of runs in the
Insertion/Deletion | |A| | Explicit/Existence
insertion-1 | $a/(n+1)$ | explicit
deletion-1 | $O(\log(n)/n)$ for $\log(n) \ge 48a$ | existence
deletion-1 | $O(1/a^2)$ for $n \ge 3a\log(a)$ | explicit
deletion-2 | $O(1/a^3)$ for $n \ge \frac{a}{2} + \log(a)$ | explicit
Table 3.2: Summary of edit-distance covering codes.
string, where a run is a maximal sequence of identical symbols. For example, the string 00000, which has only one run, can cover only one string, 0000. In contrast, the string 01010, which has five runs, can cover five different strings: 1010, 0010, 0110, 0100, and 0101. Surely, a string of length n + 1 can have no more than n + 1 runs. Thus, an insertion-1 code for strings of length n must have at least $\frac{a^n}{n+1}$ strings.
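The run-counting argument can be checked mechanically; a small illustrative sketch:

```python
from itertools import groupby

def covered_by_one_deletion(x):
    """Distinct strings obtained by deleting a single position of x."""
    return {x[:i] + x[i + 1:] for i in range(len(x))}

def runs(x):
    """Number of maximal blocks of identical symbols in x."""
    return sum(1 for _ in groupby(x))
```

Deleting anywhere inside one run gives the same result, and deletions from distinct runs give distinct results, so a codeword of length n + 1 covers exactly `runs(x)` strings of length n.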
3.5.2 Summary of Results
Our results are summarized in Table 3.2. In the table and the rest of this chapter, we specify code sizes as fractions of the number of strings of length n. For example, the a/(n + 1)-size insertion-1 code of the first row of Table 3.2 contains an a/(n + 1) fraction of all strings of length n, i.e., exactly $a \cdot a^n/(n + 1) = a^{n+1}/(n + 1)$ codewords. Section 3.6 begins by summarizing our proof template for explicitly constructing covering codes. In Section 3.6.1, we describe our explicit construction of insertion-1 covering codes. In Sections 3.6.2 and 3.6.3 we give explicit constructions of deletion codes for distances 1 and 2 that are of size $O(1/a^2)$ and $O(1/a^3)$, respectively. Finally, in Section 3.7, we prove the existence of $O(\log(n)/n)$-size deletion-1 codes, a major improvement over our result from Section 3.6.2 for long strings. However, note that the existential upper bound we offer is greater by a factor of $O(a \log n)$ than the lower bound from Section 3.5.1. We note that depending on the actual values of a and n, any of the codes we present here can be the smallest and yield the best performance for GAP.
3.6 Explicit Construction of Edit-Distance Covering Codes
Let $w = w_nw_{n−1} \cdots w_1$ be a string of length n over an alphabet Γ of size a, and let A be the edit-distance covering code we are constructing. We first outline the proof template we use to construct A:
1. Sum value: Assign each string w a sum value sum(w), the value of applying
some function to w, treating each of its positions wi as an integer (recall we assume the symbols of the alphabet are integers 0, 1, . . . , a − 1).
2. Modulus value: Pick an appropriate integer c and let score(w) = sum(w) mod c.
3. Residues: Pick one or more residues modulo c. Put into A all strings of appropriate length (e.g., n + 1 for insertion-1 codes or n − 1 for deletion-1 codes) whose score values are equal to one of the residues.
We then count the strings in A and prove that A covers all strings of length n. In some cases, we do not cover all strings with A. Rather, we show that the number of strings not covered (called outliers) is small compared to the size of A. We can then argue that by adding one codeword into A for each outlier string, we can construct an extended code $A'$ that covers all strings and that has the same asymptotic size as A.
3.6.1 Insertion-1 Covering Codes
We follow the proof template above to construct an insertion-1 covering code:
• Sum value: $sum(w) = \sum_{i=1}^{n} w_i \times i$
• Modulus value: $c = (n + 1) \times (a − 1)$
• Residues: Any a − 1 consecutive residues, $\{i \bmod c, (i + 1) \bmod c, \ldots, (i + a − 2) \bmod c\}$. For example, if a = 4 and n = 5, then c = 18. As residues, we can pick 2, 3, 4 or 17, 0, 1.
Before we prove that the code we constructed covers every string of length n, we give an example.
Example 3.6.1. Let a = 4, n = 5, and assume we pick 8, 9, and 10 as our residues. Then our code consists of all strings of length 6 whose score values equal 8, 9, or 10. Consider the string 23010. We can insert 0 between the fourth and fifth digits (3 and 2) and produce 203010, which is a codeword since its sum value is 0 × 1 + 1 × 2 + 0 × 3 + 3 × 4 + 0 × 5 + 2 × 6 = 26 and its score value is 8. Similarly, consider the string of all zeros: 00000. We can insert 3 between the second and third digits and produce 000300, which is also a codeword as it has a score of 9. It is not a coincidence that we were able to take a string w of length 5 and generate a codeword by inserting a 0 or a 3 into w. As we prove momentarily, our code has the property that every string w of length n is covered by inserting one of the symbols 0 or a − 1 somewhere in w.
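The construction can be implemented and, for small parameters, exhaustively verified. The sketch below uses the parameters of Example 3.6.1 (a = 4, n = 5, residues 8, 9, 10); the function names are illustrative.

```python
from itertools import product

def sum_value(w):
    """sum(w) = sum_i w_i * i, positions numbered 1..n from the right
    (the chapter's convention w = w_n ... w_1)."""
    n = len(w)
    return sum(int(w[n - i]) * i for i in range(1, n + 1))

def insertion1_code(n, a, first_residue):
    """All strings of length n + 1 whose score falls in the a - 1
    consecutive residues starting at first_residue, mod c = (n+1)(a-1)."""
    c = (n + 1) * (a - 1)
    residues = {(first_residue + t) % c for t in range(a - 1)}
    alphabet = [str(s) for s in range(a)]
    return {w for w in ("".join(t) for t in product(alphabet, repeat=n + 1))
            if sum_value(w) % c in residues}

def is_covered(w, codeset, a):
    """True if inserting one symbol somewhere in w yields a codeword."""
    return any(w[:i] + str(s) + w[i:] in codeset
               for i in range(len(w) + 1) for s in range(a))
```

For a = 4 and n = 5, iterating over all $4^5$ strings of length 5 confirms that each one is covered by the resulting code.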
Consider a string w of length n. Let $sumX_j$ and $scoreX_j$, for j = n + 1, . . . , 1, be the sum and score values, respectively, of the string that is constructed by adding 0 to the left of $w_{j−1}$. If j = 1, we add 0 at the right end. Similarly, let $sumY_j$ and $scoreY_j$ be the sum and score values, respectively, of the string constructed by adding (a − 1) to the left of $w_{j−1}$, or at the right end if j = 1. For example, for the string 23010, $sumX_3$ is the sum value of the string 230010 (the second 0 is the one inserted) and is equal to 29; $scoreX_3$ is then 29 mod 18 = 11. Similarly, $sumY_1$ is the sum value of the string 230103 and is equal to 33, and $scoreY_1$ is 33 mod 18 = 15.
Lemma 3.6.1. (i) $sumY_{n+1} − sumX_{n+1} = (n + 1)(a − 1)$; (ii) $sumY_1 − sumX_1 = a − 1$.
Proof. (i) Let u = (a − 1)wn ··· w1 and v = 0wn ··· w1. u and v differ only in the (n + 1)st digit. Therefore the difference between sum(u) and sum(v) is exactly (n + 1) × (a − 1).
(ii) Let z = wn ··· w1(a − 1) and t = wn ··· w10. z and t differ only in the first digit. Therefore the difference between sum(z) and sum(t) is exactly a − 1.
Consider the sequences $sumX_{n+1}, sumX_n, \ldots, sumX_1$ and $sumY_{n+1}, sumY_n, \ldots, sumY_1$ of the sum values produced by inserting a 0 and an (a − 1), respectively, to the left of each digit in w. We can visualize these sequences as two walkers, an X walker and a Y walker, taking an n-step walk on the number line. Figure 3.3 shows the walk for the string 23010. In the figure, the top labels of the lines are the sum values and the bottom labels are the score values.
[Figure 3.3: a number line of sum values 23 through 43 (score values 5, 6, ..., 17, 0, 1, ..., 7), repeated once per step. At each step, the X walker sits at the string produced by inserting 0 and the Y walker at the string produced by inserting 3: (023010, 323010), (203010, 233010), (230010, 233010), (230010, 230310), (230100, 230130), (230100, 230103).]
Figure 3.3: Simulation of insertions of symbols 0 and (a − 1) into strings.
Note that the X (Y) walker being on a position with a particular sum value s and score value r corresponds to constructing a string of length six from 23010 by a single insertion of 0 (a − 1) with sum value s and score value r. We know from Lemma 3.6.1 that $sumY_{n+1} − sumX_{n+1} = (n + 1)(a − 1)$ and $sumY_1 − sumX_1 = a − 1$: the walkers start (n + 1)(a − 1) positions away from each other and finish exactly a − 1 positions away from each other. We will next prove that the walkers always walk in opposite directions in steps of size at most a − 1.
Lemma 3.6.2. sumXj − sumXj+1 = i and sumYj − sumYj+1 = −(a − 1 − i), for some i ∈ {0, . . . , a − 1}.
Proof. Let wj be i. (Recall that in the string defining sumXj, the inserted 0 occupies position j, so the digit immediately to its left is wj.) Then

sumXj = sum(wn . . . wj+1 i 0 wj−1 . . . w1)

sumXj+1 = sum(wn . . . wj+1 0 i wj−1 . . . w1)

Notice that the inputs to the sum function differ only in the jth and (j + 1)st digits. Subtracting one from the other, sumXj − sumXj+1 = i(j + 1) − ij = i. Similarly,

sumYj = sum(wn . . . wj+1 i (a − 1) wj−1 . . . w1)

sumYj+1 = sum(wn . . . wj+1 (a − 1) i wj−1 . . . w1)

Therefore,

sumYj − sumYj+1 = i(j + 1) + (a − 1)j − [(a − 1)(j + 1) + ij] = −(a − 1 − i)
In other words, the sum values never decrease for walker X and never increase for walker Y. Moreover, at each step the sum value of each walker changes by at most (a − 1), and cumulatively the two walkers travel a distance of exactly (a − 1) per step. In Figure 3.3, this can be visualized as two walkers at the two ends of a line walking towards each other synchronously: at each step, if walker X moves i to the right, then walker Y moves (a − 1 − i) to the left.
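The two walks of Figure 3.3 can be simulated directly. The sketch below (helper names are ours; insertion positions follow the right-to-left digit indexing of the text) verifies the conclusions of Lemmas 3.6.1 and 3.6.2 on the running example w = 23010 with a = 4:

```python
def sum_value(w):
    # Digit at right-to-left position i contributes i * w_i.
    return sum(i * int(d) for i, d in enumerate(reversed(w), start=1))

def walks(w, a):
    # For j = n+1, ..., 1, insert 0 (walker X) or a-1 (walker Y) so that
    # the inserted symbol lands at right-to-left position j.
    n = len(w)
    X, Y = [], []
    for j in range(n + 1, 0, -1):
        k = n + 1 - j                      # insertion index from the left
        X.append(sum_value(w[:k] + "0" + w[k:]))
        Y.append(sum_value(w[:k] + str(a - 1) + w[k:]))
    return X, Y

a, w = 4, "23010"
n = len(w)
X, Y = walks(w, a)
assert Y[0] - X[0] == (n + 1) * (a - 1)    # Lemma 3.6.1(i): start (n+1)(a-1) apart
assert Y[-1] - X[-1] == a - 1              # Lemma 3.6.1(ii): finish (a-1) apart
for (x0, x1), (y0, y1) in zip(zip(X, X[1:]), zip(Y, Y[1:])):
    i = x1 - x0                            # X steps right by i (Lemma 3.6.2)
    assert 0 <= i <= a - 1
    assert y1 - y0 == -(a - 1 - i)         # Y steps left by a-1-i
```

For w = 23010 the simulation produces the sum sequences X = [24, 26, 29, 29, 30, 30] and Y = [42, 41, 41, 38, 36, 33], matching the walk depicted in Figure 3.3.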
Theorem 3.6.1. Fix any (a − 1) consecutive residues R = {i mod c, (i + 1) mod c, . . . , (i + a − 2) mod c}, where c = (n + 1)(a − 1). The code A constructed by taking all strings of length n + 1 whose score values are in R covers all strings of length n by a single insertion.
Proof. Consider any string w of length n and the corresponding X and Y walkers for it. We know from Lemma 3.6.1 that the walkers start exactly (n + 1)(a − 1) sum values apart. Therefore the score values of the numbers between their initial positions cover exactly a full residue cycle modulo c = (n + 1)(a − 1). We also know that the walkers move in opposite directions (Lemma 3.6.2) and finish the walk exactly (a − 1) sum values apart (Lemma 3.6.1). Since the step sizes of the walkers are at most (a − 1) (Lemma 3.6.2), neither walker can skip over all of the (a − 1) consecutive residues in R in a single step, which implies that at least one of the walkers must step on one of the residues in R. In other words, we can insert 0 or (a − 1) into some position j of w and generate a codeword.
Corollary 3.6.1. We can construct an a/(n + 1)-size insertion-1 covering code C for strings of length n.

Proof. Let Aj be the code we construct by selecting the (a − 1) residues between j(a − 1) and (j + 1)(a − 1), for j ∈ {0, . . . , n}. Note that the Aj are disjoint, and every string of length n + 1 belongs to some Aj. We thus have n + 1 disjoint codes whose union has size a^{n+1} (all strings of length n + 1). Therefore one of the codes must contain at most a^{n+1}/(n + 1) strings and is an a/(n + 1)-size code.
3.6.2 O(1/a^2)-size Deletion-1 Covering Codes
We next use our proof template to construct an O(1/a^2)-size deletion-1 code, for large enough n.

• Sum value: sum(w) = Σ_{i=1}^{n} w_i. That is, the sum value of w is the sum of the integer values of its digits.
• Modulus value: c = a
• Residues: 0

This code covers nearly all strings of length n. Consider a string w of length n. Let score(w) = i. If w has any occurrence of the symbol i, delete it, and you get a codeword. Thus, our code covers all strings that contain their modulus. To make it a covering code, we take any string that is not covered, remove its first digit, and add the result to the code. Then any string w of length n will either be covered by the original code, or it will be covered by the codeword that we added specifically for it. To determine the size of our code, we first observe that induction on n shows that there are a^{n−1} strings of length n with score r for each residue r ∈ {0, . . . , a − 1}. Thus, in particular, there are a^{n−2} strings of length n − 1 with score 0, making the original code a 1/a^2-size code. We show that the number of strings of length n that are missing their modulus is O(1/a^2). To do so, we exhibit a bound on the size of the set S of strings that are missing at least one symbol, which certainly contains every string that is missing its modulus. Observe that S = ∪_i S_i, where S_i is the set of strings of length n that do not contain symbol i. By the union bound, we have that |S| ≤ Σ_i |S_i|, and thus it suffices to show that each |S_i| represents an O(1/a^3) fraction of the strings of length n. The number of strings that do not contain the symbol i is exactly (a − 1)^n, which is exactly a (1 − 1/a)^n fraction of all strings. This quantity is at most e^{−n/a} and is bounded above by 1/a^3 for n ≥ 3a log(a), proving the following result:
Theorem 3.6.2. For n ≥ 3a log(a), there is an O(a^{n−2})-size deletion-1 code.
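The covering step of this construction is easy to express in code. The sketch below (function names are ours) takes a string that contains its modulus and produces a codeword, i.e. a string of length n − 1 whose digit sum is 0 modulo a:

```python
def score(w, a):
    # Sum-of-digits modulo a (the Section 3.6.2 construction).
    return sum(w) % a

def codeword_by_deletion(w, a):
    # If w contains its score i, deleting one occurrence of i yields a
    # string whose digit sum is 0 modulo a, i.e. a codeword.
    i = score(w, a)
    if i in w:
        v = list(w)
        v.remove(i)
        return tuple(v)
    return None          # w is an outlier; covered separately in the text

w, a = (1, 3, 0, 2, 2), 4      # digit sum 8, so score(w, 4) = 0
v = codeword_by_deletion(w, a)
assert v == (1, 3, 2, 2) and score(v, a) == 0
```

Strings for which codeword_by_deletion returns None are exactly the strings missing their modulus, which the argument above shows form an O(1/a^2) fraction.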
3.6.3 O(1/a^3)-size Deletion-2 Covering Code For Shorter Strings
For our deletion-2 code we use the following scheme.

• Sum value: sum(w) = Σ_{i=1}^{n} w_i, as in Section 3.6.2
• Modulus value: c = a
• Residues: 0

Suppose we have a string x of length n and score(x) = i. We need to find a pair of positions of x that sum to i modulo a and delete them both. To start, we assume that a is even; the situation for odd a is very similar, and we discuss it at the end. We can group the integers from 0 to a − 1 into pairs that sum to i modulo a. There is a special case where for some integer j, we have 2j = i mod a. In that case there are actually two such integers, j1 and j2. To see why, assume that 2j1 = i mod a, and let j2 = j1 + a/2 mod a. Then 2j2 is also equal to i modulo a. In this special case, we group those two integers into one group.
Example 3.6.2. Let a = 6. Figure 3.4 shows the pairs that sum to i modulo 6. For example, if i = 1, then the three groups are {0, 1}, {2, 5}, and {3, 4}. If i = 2, then the three groups are {0, 2}, {1, 4}, and {3, 5}. Note that 1 + 1 and 4 + 4 are both equal to 2 mod 6, so we put 1 and 4 into one group.
i = 0: 0-0, 1-5, 2-4, 3-3
i = 1: 0-1, 2-5, 3-4
i = 2: 0-2, 1-1, 3-5, 4-4
i = 3: 0-3, 1-2, 4-5
i = 4: 0-4, 1-3, 2-2, 5-5
i = 5: 0-5, 1-4, 2-3

Figure 3.4: Pairs that sum to i modulo 6.
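The grouping of Figure 3.4 can be generated programmatically. The sketch below (the groups helper is ours) partitions {0, . . . , a − 1}, for even a, into a/2 groups as described: pairs of distinct symbols summing to i modulo a, plus, when it exists, the special group of the two symbols j with 2j ≡ i (mod a):

```python
def groups(a, i):
    # Partition {0, ..., a-1} (a even) into a/2 groups. Each group either
    # holds two distinct symbols summing to i mod a, or the two symbols
    # j1, j2 with 2*j1 = 2*j2 = i (mod a), where j2 = j1 + a/2 (mod a).
    doubles = sorted(j for j in range(a) if (2 * j) % a == i % a)
    out = [set(doubles)] if doubles else []
    used = set(doubles)
    for j in range(a):
        if j in used:
            continue
        k = (i - j) % a        # the partner of j: j + k = i (mod a)
        used |= {j, k}
        out.append({j, k})
    return out

# Reproduces the i = 2 row of Figure 3.4: groups {0,2}, {1,4}, {3,5}.
gs = {frozenset(g) for g in groups(6, 2)}
assert gs == {frozenset({0, 2}), frozenset({1, 4}), frozenset({3, 5})}
```

For i = 1 (no symbol doubles to 1 modulo 6) the function returns {0,1}, {2,5}, {3,4}, matching Example 3.6.2.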
In general, if a is even, then the pairs that sum to 0 modulo a are 0 + 0, a/2 + a/2, 1 + (a − 1), 2 + (a − 2), 3 + (a − 3), and so on, until (a/2 − 1) + (a/2 + 1). If we want the pairs that sum to i, where i is even, then we add i/2 to every integer in this list of pairs. The integers i/2 and (a + i)/2, when added to themselves, make i modulo a, while the other a/2 − 1 pairs of two different integers also sum to i modulo a. If we want the pairs of integers that sum to 1 modulo a, we note that these are 0 + 1, 2 + (a − 1), 3 + (a − 2), and so on, until (a/2) + (a/2 + 1). That is, there are a/2 pairs of distinct integers. If we want to find the pairs that sum to i, for odd i, then we add (i − 1)/2 to each of the integers, and again we get a/2 pairs of distinct integers. The important point is that regardless of the desired sum i, we can divide the integers modulo a into a/2 groups. Each group either consists of two distinct integers that sum to i modulo a, or consists of the two integers that, when added to themselves, yield i modulo a.

If there are k positions in the string holding members of the same group, then the probability is at least 1 − 2^{−(k−1)} that these positions hold two symbols that sum to i modulo a. First, look at groups composed of two different values that sum to i modulo a, such as {3, 5} for a = 6 and i = 2. All positions belonging to the group are independent (assuming we have chosen a string x randomly), so each position after the first has probability 1/2 of disagreeing with the first. That is, the probability that all k positions hold the same symbol is 2^{−(k−1)}. For a group that is composed of two symbols each of which, added to itself, makes i, such as the group {1, 4} for a = 6 and i = 2, the situation is even better. If k = 2, the probability is 1/2 that the two positions for that group hold the same symbol, but if k ≥ 3, then we are certain to find two positions that sum to i modulo a. If the length of x is n, then there are at least n − (a/2) positions of x that are not the first in their group. Thus, the probability that we are unable to find any pair of positions of x that sum to i modulo a is at most 2^{−(n−(a/2))}. If n is bigger than a/2 + log(a), then the number of outliers is at most 1/a of the total number of strings of length n − 2. Thus, we can expand C to include one codeword for each outlier, proving the following result:
Theorem 3.6.3. For n ≥ a/2 + log(a), there is an O(a^{n−3})-size deletion-2 code.
3.7 Existence of O(log(n)/n)-size Deletion-1 Covering Codes
We next show that for sufficiently long strings there are deletion-1 covering codes that are much smaller than the O(1/a^2)-size code from Section 3.6.2. The proof of the existence of such codes is much more involved than our previous constructions. Instead of showing the existence of an edit-distance-1 covering code directly, we convert the strings of length n and alphabet size a into binary strings of lengths ≤ n. We then show the existence of a Hamming-distance-1 covering code H for the converted binary strings and convert H into a deletion-1 covering code C for the original strings. We begin with a roadmap and proof outline. All the terminology we use in the outline, e.g., “run patterns,” “bits of runs,” or “safe bits,” will be defined in the specified sections.
1. Convert each string w of length n to its run pattern, runs(w) (Section 3.7.1).
2. Convert run patterns of w to a binary string, which we refer to as bits of runs of w (Section 3.7.2).
3. Partition the strings of length n into two groups based on the number of safe bits their bits of runs have: LS (low-safe-bit) and HS (high-safe-bit). The strings
in LS will be the first set of outlier strings for the final code we construct and will be covered separately at the end of our construction (Section 3.7.3).
4. Show the existence of a deletion-1 code A that covers all but 1/n fraction of the strings in HS. The remaining 1/n fraction of the strings in HS will be the second set of outlier strings. We will construct A from a Hamming-Distance-1 code H for binary strings, which covers the bits of runs of the strings in HS on their safe bits (Section 3.7.4).
5. For each outlier string s, put into A the string that we get by removing the first symbol of s, and construct a deletion-1 covering code (Section 3.7.6).
6. Count the number of outliers and the total number of strings in A (Section 3.7.6).
3.7.1 Step 1: Run Patterns
We view strings as sequences of runs: maximal blocks of consecutive positions that hold the same character. The length of a run is the number of consecutive positions it spans. A run pattern (or just “pattern”) is a list of positive integers. Every string w of length n corresponds to exactly one pattern P, which is the list of positive integers the ith of which is the length of the ith run of w. We denote this pattern P by runs(w). Note that the length of the run pattern of a string equals the number of runs in that string.
Example 3.7.1. String w = 002111100 consists of four runs, 00, 2, 1111, 00, in that order. The lengths of these runs are 2, 1, 4, and 2, respectively, so runs(w) = [2, 1, 4, 2].
3.7.2 Step 2: Converting Run Patterns into Binary Strings
For the second step of our proof, we need to convert run patterns into bit strings of the same length. Define bits(P) to be the bit string whose ith position holds 0 if the ith integer on the list P is even, and holds 1 if the ith integer is odd.
Example 3.7.2. If w = 002111100, then runs(w) = [2, 1, 4, 2] and bits(runs(w)) = bits([2, 1, 4, 2]) = 0100.
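Steps 1 and 2 of the conversion are straightforward to implement. A minimal Python sketch (helper names are ours), which reproduces Examples 3.7.1 and 3.7.2:

```python
from itertools import groupby

def runs(w):
    # Run pattern: lengths of the maximal runs of w, left to right.
    return [len(list(g)) for _, g in groupby(w)]

def bits(pattern):
    # ith bit is 0 if the ith run length is even, 1 if it is odd.
    return "".join(str(p % 2) for p in pattern)

assert runs("002111100") == [2, 1, 4, 2]        # Example 3.7.1
assert bits(runs("002111100")) == "0100"        # Example 3.7.2
```

itertools.groupby groups equal adjacent characters, which is exactly the decomposition of a string into runs.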
3.7.3 Step 3: Partitioning Strings Based on Safe Bit Counts
Deletion of a symbol from a string w can, in general, generate a string v with a shorter run pattern, and hence bits(runs(v)) can be shorter than bits(runs(w)). For example, deletion of the symbol 2 from 002111100, whose run pattern is [2, 1, 4, 2], generates 00111100, whose run pattern is [2, 4, 2]. However, if we delete a symbol from w that belongs to a run of length 2 or more, we get a string v with the following properties:

• |bits(runs(v))| = |bits(runs(w))|: v has the same number of runs as w. Since the size of the bits of runs of a string is equal to the number of runs it has, |bits(runs(v))| is equal to |bits(runs(w))|.
• bits(runs(v)) and bits(runs(w)) differ in exactly one bit and hence have Hamming distance 1. The bit in which they differ corresponds to the run from which we removed the symbol.
Example 3.7.3. If we remove one of the 1’s in w = 002111100, we get v = 00211100. bits(runs(v)) = 0110, which is at Hamming distance one from bits(runs(w)) = 0100. Note that because we removed a symbol from the third run of w, the two bit strings differ in the third bit.
We call a bit in bits(runs(w)) a safe bit for w if it corresponds to a run of length ≥ 2. Consider again the string w = 002111100 as an example. Every 0 in bits(runs(w)) = 0100 is safe, and the bit 1, which corresponds to the second run of w, is unsafe because it corresponds to a run of length 1. Different strings have different numbers of safe bits. For example, a string composed of an alternating sequence of different symbols, such as 212121, has no safe bits, since it has no runs of length ≥ 2.
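The safe-bit definition can be captured in a few lines. The sketch below (our own helper, returning 0-indexed bit positions) identifies which bits of bits(runs(w)) are safe:

```python
from itertools import groupby

def safe_bits(w):
    # A bit of bits(runs(w)) is safe iff its run has length >= 2.
    # Returns the 0-indexed positions of the safe bits.
    lengths = [len(list(g)) for _, g in groupby(w)]
    return [i for i, length in enumerate(lengths) if length >= 2]

assert safe_bits("002111100") == [0, 2, 3]   # bits 1, 3, 4 (1-indexed) are safe
assert safe_bits("212121") == []             # alternating string: no safe bits
```

For w = 002111100 the unsafe position is index 1, the bit of the length-1 run holding the single 2, matching the discussion above.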
We partition the set of strings we want to cover into two groups based on the number of safe bits they have. Let LS (for low safe-bit strings) be the set of strings of length n that have fewer than n/6a safe bits. Similarly, let HS (for high safe bit strings) be the set of strings with at least n/6a safe bits. Furthermore, we partition
HS into HS1, . . . , HSn, where HSi is the set of high safe bit strings with i runs. We finish this section with a key definition. Consider a Hamming covering code H that covers all bit strings of length i, and a string w with i runs. We say that H covers w on a safe bit if there is a codeword h ∈ H such that:
1. h and bits(runs(w)) are at Hamming distance 1, and

2. the bit on which h and bits(runs(w)) differ corresponds to a safe bit of w.
We note that two strings w1 and w2 can have the same bits of runs, yet it is possible that a Hamming covering code covers only one of them on a safe bit. We give an example.
Example 3.7.4. Let w1 = 22111300 and w2 = 33022211. The bits of runs of both strings is 0110. Consider a Hamming covering code H containing the string 0010, which is at Hamming distance 1 from 0110. Then H covers both w1 and w2 on the second bit from the left, which is a safe bit of w1 but not of w2. If there is no other string in H that covers 0110, then H covers w1 on a safe bit but not w2.
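The covering condition of Example 3.7.4 can be checked mechanically. In the sketch below (helper names are ours), H is a hypothetical one-codeword fragment of a Hamming covering code, not a full code:

```python
from itertools import groupby

def runs_and_bits(w):
    # Run lengths of w and the corresponding bits-of-runs string.
    lengths = [len(list(g)) for _, g in groupby(w)]
    return lengths, "".join(str(length % 2) for length in lengths)

def covered_on_safe_bit(w, H):
    # Does some codeword of H differ from bits(runs(w)) in exactly one
    # position that corresponds to a run of w of length >= 2?
    lengths, b = runs_and_bits(w)
    for h in H:
        if len(h) != len(b):
            continue
        diff = [i for i in range(len(b)) if h[i] != b[i]]
        if len(diff) == 1 and lengths[diff[0]] >= 2:
            return True
    return False

H = {"0010"}                                     # hypothetical code fragment
assert covered_on_safe_bit("22111300", H)        # w1: second bit is safe
assert not covered_on_safe_bit("33022211", H)    # w2: second bit is unsafe
```

Both strings have bits of runs 0110, yet only w1 is covered on a safe bit, exactly as the example argues.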
In the next section, we will construct an edit covering code A that covers all but a 1/n fraction of the strings in HS, using Hamming covering codes that cover the bits of runs of strings in HS on safe bits.
3.7.4 Step 4: Existence of a Deletion-1 Code Covering (1-1/n) Fraction of HS
We start this section by explaining how we can take a Hamming covering code and turn it into a deletion-1 code (not necessarily a covering code). Let HCCi be any covering code for Hamming distance 1 and bit strings of length i. We construct a particular deletion-1 code EC_{r=i} from HCCi as follows.