A distributed execution engine supporting data-dependent control flow

Derek Gordon Murray

University of Cambridge
Computer Laboratory
King's College

July 2011

This dissertation is submitted for the degree of Doctor of Philosophy.

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This dissertation does not exceed the regulation length of 60,000 words, including tables and footnotes.

Summary

In computer science, data-dependent control flow is the fundamental concept that enables a machine to change its behaviour on the basis of intermediate results. This ability increases the computational power of a machine, because it enables the machine to execute iterative or recursive algorithms. In such algorithms, the amount of work is unbounded a priori, and must be determined by evaluating a fixpoint condition on successive intermediate results. For example, in the von Neumann architecture—upon which almost all modern computers are based—these algorithms can be programmed using a conditional branch instruction.

A distributed execution engine is a system that runs on a network of computers, and provides the illusion of a single, reliable machine with a large aggregate amount of computational and I/O performance. Although each individual computer in these systems is a von Neumann machine capable of data-dependent control flow, the effective computational power of a distributed execution engine is determined by the expressiveness of the execution model that describes distributed computations.

In this dissertation, I present a new execution model for distributed execution engines that supports data-dependent control flow. The model is based on dynamic task graphs, in which each vertex is a sequential computation that may decide, on the basis of its input, to spawn additional computation and hence rewrite the graph. I have developed a prototype system that executes dynamic task graphs, and discuss details of its design and implementation, including the fault tolerance mechanisms that maintain reliability throughout dynamic task graph execution. Dynamic task graphs support a variety of programming models, and I introduce a model based on multiple distributed threads of execution that synchronise deterministically using futures and continuations. To demonstrate the practicality of dynamic task graphs, I have evaluated the prototype's performance on several microbenchmarks and realistic applications; it achieves performance that is similar to or better than an existing, less-powerful execution engine. (An illustrative sketch of the dynamic task graph model appears after the List of Tables below.)

Acknowledgements

Foremost, I would like to thank my supervisor, Steve Hand, for his help and encouragement over the past four years. Through many hours of meetings, and his comments on countless drafts of this document, Steve's feedback has been vital in helping me to shape my thesis.

The system at the centre of this dissertation, CIEL, has grown from one student's thesis project to become a thriving collaborative effort. The other members of the CIEL team are Malte Schwarzkopf, Chris Smowton, Anil Madhavapeddy and Steven Smith, and I am indebted to all of them for their contribution to the project's success. CIEL also spawned a Part II undergraduate project, and I thank Seb Hollington for being such an enthusiastic supervisee.
In addition to the team members, several current and former colleagues in the Computer Laboratory have commented on drafts of this dissertation. I am grateful to Jon Crowcroft, Stephen Kell, Amitabha Roy, Eiko Yoneki and Ross McIlroy for their comments and suggestions, which have greatly improved the clarity of my writing. I would also like to thank Andy Warfield of the University of British Columbia for his frequent invitations to give talks in Vancouver; as well as being enjoyable, the feedback that I have received has been useful in developing the ideas that I present in this dissertation.

The genesis of CIEL can be traced to the summer internship that I spent at Microsoft Research in Silicon Valley. I thank Michael Isard and Yuan Yu for giving me the opportunity to work on the Dryad project. Working with Dryad gave me useful experience in using a real-world distributed system, and helped me to realise that there was an opportunity to develop a more powerful system.

Contents

1 Introduction
  1.1 Contributions
  1.2 Outline
  1.3 Related publications
2 Background and related work
  2.1 Scales of parallelism
  2.2 Parallel programming models
  2.3 Coordinating distributed tasks
  2.4 Summary
3 Dynamic task graphs
  3.1 Definitions
  3.2 Executing a dynamic task graph
  3.3 Relation to other execution models
  3.4 Summary
4 A universal execution engine
  4.1 Distributed coordination
  4.2 A simple distributed store
  4.3 Scheduling
  4.4 Fault tolerance
  4.5 Summary
5 Parallel programming models
  5.1 Implementing existing models
  5.2 First-class executors
  5.3 Distributed thread programming model
  5.4 Summary
6 Evaluation
  6.1 Experimental configuration
  6.2 Task handling
  6.3 MapReduce
  6.4 Iterative k-means
  6.5 Fault tolerance
  6.6 Streaming
  6.7 Summary
7 Conclusions and future work
  7.1 Extending dynamic task graphs
  7.2 Alternative system architectures
  7.3 Separating policy from mechanism
  7.4 Summary

List of Figures

2.1 An outline of the topics that have influenced dynamic task graphs
2.2 Models of MIMD parallelism
2.3 A message-passing version of Conway's Game of Life
2.4 A simple data-flow graph
2.5 Basic task farm architecture
2.6 Data-flow graphs for various models of task parallelism
2.7 Example of acyclic data-flow in make
3.1 Symbols representing a concrete object and a future object
3.2 Illustration of a store with concrete and future objects
3.3 Illustration of a task with dependencies and expected outputs
3.4 Illustration of task output delegation
3.5 Lazy evaluation algorithm for evaluating an object in a dynamic task graph
3.6 Dynamic task graphs for recursively calculating the nth Fibonacci number
3.7 Dynamic task graph for performing a MapReduce computation
3.8 Dynamic task graph for the first two supersteps of a Pregel computation
3.9 Implementation of mutable state in a dynamic task graph
3.10 Dynamic task graph for two iterations of a while loop
4.1 The reference lifecycle
4.2 A CIEL cluster has one or more clients, a single master and many workers
4.3 Example job descriptor for a π-estimation job
4.4 Pseudocode for the CIEL master and worker
4.5 Representation of a Java example task and its code dependency
4.6 Expected load distribution using two different replica placement strategies
4.7 Proportion of data-local tasks as the cluster size is varied
4.8 Distribution of input object replication factors as an iterative job executes
4.9 Non-local tasks in the first 20 iterations of k-means clustering
4.10 Execution time for the first 20 iterations of k-means clustering
4.11 Mechanisms for providing master fault tolerance
5.1 Pseudocode for a MapReduce map task
5.2 Rules for computing the result of the executor function for a first-class task
5.3 Dynamic task graph for a distributed thread, comprising three tasks
5.4 A continuation in a distributed thread
5.5 Dynamic task graphs for Skywriting tasks
5.6 Dynamic task graph for a Skywriting script that blocks on a child task
6.1 A synthetic sequential workload
6.2 A synthetic parallel workload
6.3 A synthetic iterative workload
6.4 Comparison of Hadoop and CIEL absolute performance for π estimation
6.5 Comparison of Hadoop and CIEL parallel speedup for π estimation
6.6 Comparison of Hadoop and CIEL absolute performance for Grep
6.7 Comparison of Hadoop and CIEL execution time for k-means clustering
6.8 Comparison of MPI and optimised CIEL execution time of k-means clustering
6.9 Distribution of master overhead with and without fault tolerance
6.10 The overhead of master fault tolerance under randomly-induced failures
6.11 Streaming throughput in TCP and HTTP modes
6.12 Illustration of the data dependencies in a BOPM calculation
6.13 Speedup of BOPM for various problem sizes
7.1 Possible representation for non-deterministic select in a dynamic task graph

List of Tables

3.1 Definition of a dynamic task graph for calculating the nth Fibonacci number
3.2 Definition of a dynamic task graph for performing a MapReduce computation
3.3 Definition of a dynamic task graph for performing a Pregel computation
3.4 Definition of a dynamic task graph for executing a while program
4.1 Rules for combining an existing reference with an incoming reference
4.2 Rules for combining an existing reference with an incoming stream reference
4.3 Rules for combining an existing reference with an incoming sweetheart reference
4.4 Rules for combining an existing reference with an incoming tombstone reference
5.1 Rules for combining a fixed reference with an incoming tombstone reference
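
The Summary's central idea is that a task may decide, on the basis of its input, to spawn additional computation and hence rewrite the graph. What follows is a minimal, single-process sketch of that idea in Python, using the Fibonacci example of Figure 3.6 and Table 3.1. Every name in it (Future, Task, Engine, spawn and so on) is an assumption made for illustration; this is not CIEL's actual API, and it omits distribution, scheduling and fault tolerance entirely.

# A minimal, single-process sketch of dynamic task graph execution.
# All names here are illustrative assumptions, not CIEL's real API.

class Future:
    """A placeholder for an object that some task is expected to produce."""
    def __init__(self, name):
        self.name = name
        self.ready = False
        self.value = None

    def fill(self, value):
        self.value = value
        self.ready = True

class Task:
    """A sequential computation with input dependencies and one expected output."""
    def __init__(self, fn, deps, output):
        self.fn = fn
        self.deps = deps
        self.output = output

class Engine:
    """Runs tasks whose inputs are all concrete. A running task may spawn
    further tasks, rewriting the graph: this is data-dependent control flow."""
    def __init__(self):
        self.pending = []

    def spawn(self, fn, deps=(), output=None):
        output = output if output is not None else Future('anon')
        self.pending.append(Task(fn, list(deps), output))
        return output

    def run(self):
        while self.pending:
            runnable = [t for t in self.pending
                        if all(d.ready for d in t.deps)]
            assert runnable, "no task has all-concrete inputs"
            for task in runnable:
                self.pending.remove(task)
                task.fn(self, [d.value for d in task.deps], task.output)

def fib(engine, n, output):
    """Task body: inspects its input and decides whether to spawn more work."""
    if n < 2:
        output.fill(n)  # base case: produce a concrete object directly
    else:
        a = engine.spawn(lambda e, _, o: fib(e, n - 1, o))
        b = engine.spawn(lambda e, _, o: fib(e, n - 2, o))
        # Delegate this task's expected output to a continuation task that
        # sums the two children once both of their outputs are concrete.
        engine.spawn(lambda e, vals, o: o.fill(vals[0] + vals[1]),
                     deps=(a, b), output=output)

engine = Engine()
result = engine.spawn(lambda e, _, o: fib(e, 10, o))
engine.run()
print(result.value)  # prints 55

The final spawn in fib imitates output delegation (Figure 3.4): the parent task terminates without producing its expected output, and the continuation it spawns inherits the obligation to fill it. This is what allows the graph to grow by a data-dependent amount while each individual task remains a finite, sequential computation.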