
Chromatic Scheduling of Dynamic Data-Graph Computations by Tim Kaler. Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of

Master of Engineering in Electrical Engineering and Computer Science at the

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 2013. Copyright 2013 Tim Kaler. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole and in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, May 24, 2013

Certified by: Charles E. Leiserson, Professor of Computer Science and Engineering, Thesis Supervisor

Accepted by: Dennis M. Freeman, Chairman, Department Committee on Graduate Students

Chromatic Scheduling of Dynamic Data-Graph Computations by Tim Kaler

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2013, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Data-graph computations are a parallel-programming model popularized by programming systems such as Pregel, GraphLab, PowerGraph, and GraphChi. A fundamental issue in parallelizing data-graph computations is the avoidance of races between computations occurring on overlapping regions of the graph. Common solutions such as locking protocols and bulk-synchronous execution often sacrifice performance, update atomicity, or determinism. A known alternative is chromatic scheduling, which uses a vertex coloring of the conflict graph to divide data-graph updates into sets which may be parallelized without races. To date, however, only static data-graph computations, which do not schedule updates at runtime, have employed chromatic scheduling. I introduce PRISM, a work-efficient scheduling algorithm for dynamic data-graph computations that uses chromatic scheduling. For a collection of four application benchmarks on a modern multicore machine, chromatic scheduling approximately doubles the performance of the lock-based GraphLab implementation and triples the performance of GraphChi's update-execution phase when enforcing determinism. Chromatic scheduling motivates the development of efficient deterministic parallel coloring algorithms. New analysis of the Jones-Plassmann message-passing algorithm shows that only O(Δ + ln Δ ln V / ln ln V) rounds are needed to color a graph G = (V, E) with maximum vertex degree Δ, generalizing previous results for bounded-degree graphs. A new log-degree ordering heuristic is described which can reduce the number of colors used in practice, while only increasing the number of rounds by a logarithmic factor. An efficient implementation for the shared-memory setting is described and analyzed using the CRQW contention model, showing that this algorithm performs Θ(V + E) work and has expected span O(Δ ln Δ + ln² Δ ln V / ln ln V). Benchmarks on a set of real-world graphs show that, in practice, these parallel algorithms achieve modest speedup over optimized serial code (around 4x on a 12-core machine).

Thesis Supervisor: Charles E. Leiserson Title: Professor of Computer Science and Engineering

Acknowledgements

Charles E. Leiserson, my advisor, has provided damned good advice and guidance which has influenced the contents of this thesis and my decision to remain in academia.

His writing-related suggestions have been especially helpful and, at times, amusing.

My coauthors William Hasenplaugh and Tao B. Schardl have made significant contributions to the content of this thesis. In particular, the main results on parallel graph coloring (Chapters 6-9) would not have been possible without their work.

The Supertech research group as a whole has provided a variety of resources. It has provided an engaged audience to listen to my presentations and provide valuable feedback. It has provided a community of effective researchers to serve as role models and provide examples of high-quality research. Finally, it has provided wonderful collaborators who have contributed their time and energy to help improve my research. Thanks!

Contents

1 Introduction

I Chromatic Scheduling

2 Data-graph Computations
3 Analysis of Parallel Data-Graph Computations
4 Dynamic Chromatic Scheduling
5 Performance Evaluation of Prism

II Deterministic Parallel Coloring

6 Graph Coloring
7 Jones-Plassmann Coloring Algorithm
8 Log Degree Ordering Heuristic
9 Multicore Implementation of Jones-Plassmann

10 Conclusion

Chapter 1

Introduction

Graphs provide a natural representation for many systems from physics, artificial intelligence, and scientific computing. Physical systems can often be decomposed into a finite number of elements whose interactions induce a graph. Probabilistic graphical models in AI are often used to represent the dependency structure of a set of random variables. Sparse matrices may be interpreted as graphs for the purpose of scientific computing. Generally, these representations are data-graphs, which have data associated with their vertices and edges.

A data-graph computation is an algorithm implemented as a sequence of local updates on a data-graph. Each of these updates operates on the data at a vertex, its incident edges, and its neighboring vertices, providing new data values for the vertex. A data-graph computation typically proceeds in rounds in which a set of vertices in the graph are updated. Data-graph computations may be either static, where all vertices are updated in a round, or dynamic, where the set of vertices for the next round is determined during the execution of the current round. For example, a dynamic data-graph computation may schedule an update at a vertex only if the values of its neighbors have changed by some threshold. Dynamic data-graph computations avoid unnecessary work, which has been shown to improve the practical performance of many applications [45, 47]. Static data-graph computations include Gibbs sampling [18, 19], iterative graph coloring [14], and n-body problems such as the fluidanimate PARSEC benchmark [7]. Dynamic data-graph computations include the Google PageRank algorithm [10], loopy belief propagation [52, 56], coordinate descent [15], co-EM [54], alternating least-squares [35], singular value decomposition [26], and matrix factorization [59].

Parallelizing data-graph computations

The prevalence of data-graph computations in World Wide Web analysis, machine learning, numerical computation, and other areas has inspired several systems for parallelizing such applications, such as Pregel [49], GraphLab [45, 47], PowerGraph [27], and GraphChi [41]. These systems use three common methods to parallelize data-graph computations: bulk-synchronous updates, lock-synchronized updates, and chromatic scheduling. These methods vary in the types of data-graph computations they can express and in the semantic guarantees they provide the programmer. These differences stem mostly from each method's strategy for resolving race conditions between parallel updates.

Before discussing these three methods, let us review the two types of races that can affect a data-graph computation. A determinacy race [16] (also called a general race [53]) occurs if two parallel updates access the same memory location and one of those accesses is a write. A data race [53] exists if there is a determinacy race between two updates which hold no common mutual-exclusion locks. Data races can cause the result of a parallel data-graph computation to fail to correspond to any sequential application of updates, and a determinacy race can cause nondeterminism.
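As a contrived illustration (not an example from the thesis), the following C++ fragment contains a determinacy race between two parallel updates on adjacent vertices: each writes its own vertex's value while reading the other's, so the result depends on how the two threads interleave.

#include <thread>
#include <vector>

// Toy data-graph with two adjacent vertices.
std::vector<double> value = {1.0, 2.0};

// Update vertex v by averaging its value with its neighbor's value.
void update(int v, int neighbor) {
    // Reads value[neighbor] and writes value[v].
    value[v] = 0.5 * (value[v] + value[neighbor]);
}

int main() {
    std::thread t1(update, 0, 1);  // writes value[0], reads value[1]
    std::thread t2(update, 1, 0);  // writes value[1], reads value[0]: a determinacy race
    t1.join();
    t2.join();
    // The final values depend on the interleaving, so the computation is
    // nondeterministic, and because no locks are held the race is also a data
    // race that can break update atomicity.
    return 0;
}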

A parallel data-graph computation guarantees update atomicity if the final result of the computation corresponds to some sequential application of updates. It is deterministic if all executions are guaranteed to produce the same result. Any determinacy race can cause a program to become nondeterministic, but only data races can compromise update atomicity.

It is desirable for parallel data-graph computations to be both deterministic and guarantee update atomicity. Determinism is valuable because it makes the execution of a data-graph computation repeatable. Repeatability allows data-graph algorithms to be debugged more effectively and makes it easier to implement fault-tolerance mechanisms, such as replicated state machines, when working with distributed systems. Update atomicity is valuable because it allows for the expression of a wider class of data-graph algorithms that require updates to be applied in some sequential order to guarantee correctness or good convergence properties.

Bulk-synchronous   The bulk-synchronous method executes a data-graph computation in parallel as a sequence of rounds. During each round, the bulk-synchronous method avoids nondeterminism by maintaining two copies of the data-graph: one read-only and the other write-only. Updates applied during the round only modify the write-only copy of the data-graph. After all updates in the round have been applied, the changes to the write-only data-graph are propagated to the read-only copy, and the next round is ready to execute.

The bulk-synchronous method has been used to parallelize both static and dynamic data-graph computations. Pregel [49], for example, utilizes the bulk-synchronous method and also supports dynamic scheduling. Row 1 of Figure 1-1 notes that the bulk-synchronous method provides determinism, but not atomicity. For any pair of updates executed in parallel, neither sees the effect of applying the other, whereas the second update would see the effect of the first if they were applied in some sequential order.
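A minimal C++ sketch of the double-buffering idea just described (the graph representation and the update rule here are illustrative assumptions, not Pregel or GraphLab code): every update in a round reads only the previous round's copy and writes only the next round's copy, and the two copies are swapped at the round boundary.

#include <utility>
#include <vector>

struct Graph {
    std::vector<std::vector<int>> neighbors;  // adjacency lists
};

// One bulk-synchronous round: every vertex is updated from the read copy only.
void bulk_synchronous_round(const Graph& g,
                            const std::vector<double>& read_copy,
                            std::vector<double>& write_copy) {
    for (std::size_t v = 0; v < g.neighbors.size(); ++v) {
        double sum = 0.0;
        for (int u : g.neighbors[v]) sum += read_copy[u];  // reads old values only
        write_copy[v] = 0.5 + 0.5 * sum;                    // writes new values only
    }
}

void run(const Graph& g, std::vector<double>& values, int rounds) {
    std::vector<double> next(values.size());
    for (int r = 0; r < rounds; ++r) {
        bulk_synchronous_round(g, values, next);
        std::swap(values, next);  // propagate the writes at the round boundary
    }
}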

Lock-synchronized   The lock-synchronized method executes a set of updates in parallel without violating update atomicity by using a locking protocol to guarantee that any two updates which access the same memory location hold a common lock. Atomicity is provided by acquiring all necessary locks before applying the update function and then releasing them afterwards. Deadlock is avoided by requiring that each update function acquire its necessary locks in an order based upon a total ordering of all locks.
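A minimal sketch of such a locking protocol (the data layout and names are assumptions, not GraphLab's implementation): each update gathers the locks for the vertex and its neighbors, acquires them in a globally sorted order to avoid deadlock, applies the update, and releases them.

#include <algorithm>
#include <mutex>
#include <vector>

constexpr int kNumVertices = 1000;                 // assumed graph size
std::vector<std::mutex> vertex_locks(kNumVertices); // one lock per vertex, ordered by id

// Apply update f(v) while holding the locks of v and its neighborhood.
// Assumes `neighbors` contains no duplicates and does not contain v itself.
template <typename UpdateFn>
void locked_update(int v, const std::vector<int>& neighbors, UpdateFn f) {
    std::vector<int> ids(neighbors);
    ids.push_back(v);
    std::sort(ids.begin(), ids.end());             // total order on locks avoids deadlock
    for (int id : ids) vertex_locks[id].lock();
    f(v);                                          // update is atomic w.r.t. conflicting updates
    for (auto it = ids.rbegin(); it != ids.rend(); ++it) vertex_locks[*it].unlock();
}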

Locks can be used to parallelize both static and dynamic data-graph computations. GraphLab and PowerGraph, for example, utilize a locking protocol to guarantee update atomicity while supporting dynamic scheduling. Row 2 of Figure 1-1 notes that this method provides atomicity, but not determinism. The use of locks provides atomicity by eliminating all data races in the program. Using locks, however, does not necessarily resolve determinacy races that can cause a data-graph computation to be nondeterministic. Indeed, races to acquire locks may cause updates to execute in different orders in different executions.

Method                    Atomicity   Determinism
1. Bulk-synchronous       No          Yes
2. Lock-synchronized      Yes         No
3. Chromatic scheduling   Yes         Yes

Figure 1-1: Comparison of three methods used in practice to parallelize data-graph computations. Atomicity refers to the property that the data-graph computation corresponds to some sequential application of local updates. Determinism refers to the property that the computation is guaranteed to produce the same result for all possible executions.

Chromatic scheduling   Chromatic scheduling [1, 6] can be used to parallelize data-graph computations while guaranteeing both update atomicity and determinism. A chromatic scheduler executes updates in parallel based on the computation's conflict graph: a graph which contains an edge between two vertices if updating them in parallel could violate the atomicity of either update. For many data-graph algorithms, the conflict graph is simply the undirected version of the data-graph. The conflict graph is used to assign each vertex a color such that no two vertices of the same color share an edge in the conflict graph. A chromatic scheduler sequences through the colors, applying updates to all vertices of the same color in parallel. Since the number of colors used in chromatic scheduling is inversely proportional to the parallelism, it is desirable to color the conflict graph using as few colors as possible.

Row 3 of Figure 1-1 notes that chromatic scheduling is deterministic and provides update atomicity. Chromatic scheduling provides both of these guarantees because vertices of the same color form an independent set in the conflict graph, and hence they can be updated in parallel without violating the atomicity of any update. To date, however, only static data-graph computations have employed chromatic scheduling. GraphLab, for example, supports chromatic scheduling for static data-graph computations. It does not support chromatic scheduling of dynamic data-graph computations. Indeed, the GraphLab team [46] notes, "While the chromatic engine satisfies the distributed GraphLab abstraction . . ., it does not provide sufficient scheduling flexibility for many interesting applications. In addition, it presupposes the availability of a graph coloring, which may not always be readily available."

Version                            T1 (s)   T12 (s)   Total updates
Bulk-synchronous PageRank          37.26    6.32      18,691,620
Chromatically scheduled PageRank   12.54    2.24      7,347,401

Figure 1-2: Performance comparison of the Jacobi (bulk-synchronous) and Gauss-Seidel (chromatic scheduling) algorithms for computing PageRank. These data result from running the described algorithm on a "power-law" graph with 1 million vertices and 10 million edges. All tests were run on an Intel Xeon 2.67 GHz 12-core machine with 48 GB of memory.

Price of atomicity and determinism

The guarantees provided by the bulk-synchronous and lock-synchronized methods come at the expense of performance. The bulk-synchronous method can require up to twice as much memory to store modifications to the data-graph during a round. In addition, the bulk-synchronous method's lack of update atomicity can impact the convergence speed of some algorithms. For example, Figure 1-2 shows that the chromatically scheduled PageRank converges approximately 2x faster than when implemented using the bulk-synchronous method. The lock-synchronized method provides atomicity, but requires the use of a locking protocol which introduces a significant synchronization overhead. A comparison of the Not Atomic and Locks columns of Figure 1-3 reveals that the use of locks can have a significant impact on performance. For example, turning on GraphLab's locking protocol increased the runtime of the PageRank benchmark from 3.7 seconds to 5.2 seconds.

The performance cost of obtaining determinism using the bulk-synchronous method and atomicity using lock-synchronized updates suggests that the overhead of performing chromatic scheduling should be even higher.

This thesis shows, however, that chromatic scheduling often performs better than the bulk-synchronous and lock-synchronized methods, even though these methods provide weaker guarantees. A comparison of the Locks and PRISM columns of Figure 1-3 shows that on a set of four benchmarks, chromatic scheduling improves the performance of GraphLab by a factor of 1.5 to 2. When the conflict graph can be colored using few colors, the cost of providing atomicity and determinism using chromatic scheduling is small: the time required to generate a coloring of the computation's conflict graph.

Benchmark                                    Not Atomic   Locks   PRISM (this thesis)
1. Loopy belief propagation                  2.6          2.8     1.5
2. PageRank                                  3.7          5.2     3.7
3. Alternating least squares                 83.8         138.6   73.2
4. Alternating least squares with sparsity   108.1        142.8   93.3

Figure 1-3: Comparison of the relative performance of using locks and chromatic scheduling to guarantee update atomicity for four dynamic data-graph computations. All measurements are in seconds. Not Atomic is GraphLab with its locking protocol turned off so that it does not guarantee atomicity. Locks is GraphLab using its locking protocol to guarantee atomicity. PRISM is GraphLab using a new parallel engine and scheduling algorithm to support dynamic chromatic scheduling. The runtime for PRISM includes the time required to color the graph at runtime using a serial greedy coloring algorithm. Benchmarks were run on an Intel Xeon X5650 with a total of 12 2.67-GHz processing cores (hyperthreading disabled), 49 GB of DRAM, two 12-MB L3-caches each shared between 6 cores, and private L2- and L1-caches with 128 KB and 32 KB, respectively.

Deterministic parallel coloring

The semantic and performance advantages of PRISM motivate the development of efficient deterministic parallel coloring algorithms. Although vertex coloring is NP-complete [17], linear-time "greedy" heuristics exist which achieve at most Δ + 1 colors, where Δ is the maximum degree of any vertex in the conflict graph. In practice, these linear-time heuristics tend to produce even fewer colors.
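For concreteness, a minimal serial version of such a greedy heuristic might look as follows (a sketch, not the optimized implementation evaluated later in this thesis): each vertex, visited in some order, takes the smallest color not already used by a colored neighbor, which never requires more than Δ + 1 colors.

#include <vector>

// Greedy coloring: returns a color per vertex using at most (max degree + 1) colors.
std::vector<int> greedy_color(const std::vector<std::vector<int>>& adj) {
    const int n = static_cast<int>(adj.size());
    std::vector<int> color(n, -1);
    std::vector<char> forbidden(n, 0);   // scratch: colors used by already-colored neighbors
    for (int v = 0; v < n; ++v) {
        for (int u : adj[v])
            if (color[u] >= 0) forbidden[color[u]] = 1;
        int c = 0;
        while (forbidden[c]) ++c;        // smallest color not used by a neighbor
        color[v] = c;
        for (int u : adj[v])             // reset the scratch array for the next vertex
            if (color[u] >= 0) forbidden[color[u]] = 0;
    }
    return color;
}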

The simplicity of the greedy coloring algorithm, however, makes it challenging to develop a parallel algorithm that can achieve speedup relative to optimized serial code. For example, speculative coloring algorithms [8], commonly used in practice, are not suitable for PRISM. They produce a nondeterministic coloring, which eliminates one of the semantic advantages of chromatic scheduling. In addition, speculative coloring algorithms do not support the use of vertex-ordering heuristics, which are commonly used in conjunction with greedy coloring algorithms to reduce the size of the coloring in practice.

We shall show, however, that the Jones-Plassmann coloring algorithm [37] admits a work-efficient multicore implementation that achieves good speedup in practice. Although the algorithm is randomized, it is deterministic when the randomness is produced using a pseudorandom number generator. Pseudorandomness suffices to reap many of the benefits of deterministic programs: debuggability and reproducibility. Additionally, we shall show that the Jones-Plassmann algorithm is amenable to vertex-ordering heuristics that produce better colorings without sacrificing significant parallelism.

Summary of contributions

This thesis includes the following contributions, which were developed jointly with William Hasenplaugh, Charles E. Leiserson, and Tao B. Schardl [30], all of the MIT Computer Science and Artificial Intelligence Laboratory.

" Design and analysis of PRISM, a work-efficient parallel algorithm for executing

dynamic data-graph computations using chromatic scheduling.

" Implementation and performance evaluation of chromatic scheduling in the

shared and external memory setting by modifying two existing data-graph com-

putation libraries: GraphLab and GraphChi.

" Analysis of the Jones-Plassmann parallel greedy vertex-coloring algorithm for

arbitrary degree graphs. We prove that the algorithm colors a graph with max-

imum degree A in O(A + log A log V/log log V) rounds in the message passing model.

" A new work-efficient implementation of the Jones-Plassmann greedy coloring algorithm for the multi-core setting with O(log A) contention and O(A log A + log 2 log V/ log log V) span. A performance comparison with an optimized se-

rial greedy coloring algorithm shows that the parallel algorithm is practical: it

achieves modest speedup over the optimized serial code even when run on a

small number of cores.

" A new vertex-ordering heuristic called log-degree ordering which can, in prac-

tice, decrease the number of colors used by the Jones-Plassmann coloring al-

gorithm in exchange for a modest (logarithmic) increase in span. We prove

15 that the Jones-Plassmann algorithm runs in 0 (A +log 2 log V/ log log V) rounds

when using the log-degree ordering heuristic.

Organization of thesis

This thesis is broken into two parts.

Part I explores how chromatic scheduling can be used to efficiently parallelize data-graph computations. Chapter 2 provides a precise description of data-graph computations, demonstrating the importance of update atomicity for the performance of certain data-graph computations. Chapter 3 describes a simple static chromatic scheduling algorithm and analyzes its parallel performance using work/span analysis. Chapter 4 describes and analyzes PRISM, a chromatic scheduling algorithm for dynamic data-graph computations. Chapter 5 evaluates the performance of PRISM when implemented within two data-graph computation frameworks: GraphLab and GraphChi.

Part II describes an efficient parallel coloring algorithm that can be used by PRISM to perform chromatic scheduling. Chapter 6 introduces graph coloring and provides context for our contributions. Chapter 7 introduces the Jones-Plassmann parallel greedy coloring algorithm and analyzes its performance for arbitrary-degree graphs. Chapter 9 describes a work-efficient implementation of the Jones-Plassmann algorithm for a modern multicore machine which achieves 3-5x speedup over optimized serial code. Chapter 8 introduces the log-degree vertex-ordering heuristic and shows that in practice it can significantly reduce the number of colors while only decreasing the parallelism of Jones-Plassmann by a logarithmic factor.

The thesis ends with Chapter 10, which provides some concluding remarks.

Part I

Chromatic Scheduling

Chapter 2

Data-graph Computations

This chapter provides a formal introduction to data-graph computations and demonstrates the performance benefits of both dynamic scheduling and update atomicity. PageRank provides an illustrative example of an application whose performance benefits significantly from both dynamic scheduling and atomic updates.

Defining data-graph computations

Data-graph computations are a convenient way of specifying many parallel algorithms and applications. A data graph is an undirected graph G = (V, E) with data associated with vertices and edges. The data-graph computation applies a user-defined update function f to a vertex v ∈ V to update values in v's neighborhood: its incident edges and adjacent vertices. The updates are typically partitioned into a series of sequentially executed rounds such that each vertex v is updated at most once per round. We characterize a data-graph computation as static if every vertex in V is updated in every round, and as dynamic otherwise. The computation typically continues either for a fixed number of rounds or, more usually, until a convergence criterion is met.

The updates to two distinct vertices u, v ∈ V are said to conflict if executing f(u) and f(v) in parallel results in a determinacy race [16]: f(u) and f(v) access a common memory location, and at least one of them writes to that location. The conflict graph for a data-graph computation is the graph Gf = (V, Cf) which contains an edge (u, v) ∈ Cf if the updates f(u) and f(v) conflict. Within a round, many non-conflicting updates can be safely applied in parallel.

Figure 2-1: Illustration of one computation round of PageRank with damping factor d = 1/2 on a 4-node undirected graph. The update function computes a new PageRank for v as a function of its neighbors: PR(v) = (1 - d) + d Σᵢ Nᵢ/Dᵢ, where Nᵢ and Dᵢ are the PageRank and degree of the ith neighbor of v. The graph colorings 1 and 2 illustrate the two possible parallelizations of this data-graph computation.

Formulating PageRank as a data-graph computation

The PageRank algorithm provides an illustrative example of a data-graph computation which benefits from both dynamic scheduling and update atomicity. The PageRank algorithm can be formulated as a data-graph computation in which there is a vertex for each webpage, and an edge (u, v) when site u links to site v. The vertex data, in this case, contains a single number representing the associated webpage's PageRank, which is iteratively updated until convergence. The data-graph computation's update function is applied to each vertex, modifying its PageRank according to the formula PR(v) = (1 - d) + d Σᵢ Nᵢ/Dᵢ, where d ∈ (0, 1) is a damping factor, Nᵢ is the PageRank of v's ith neighbor, and Dᵢ is the ith neighbor's degree.

The conflict graph for PageRank can be determined by considering the sets of updates which read and write shared data. The update function at vertex v reads the PageRank of all neighboring nodes in order to update its own PageRank. Figure 2-1 highlights the regions of the graph which are read and written by the PageRank application during one computation round. Note that there is a read/write conflict between any two update functions operating on adjacent vertices. Therefore, the conflict graph for PageRank is simply the undirected version of the data-graph.

Figure 2-2: Number of updates performed per iteration for a dynamically scheduled PageRank application run on the non-synthetic web-Google and soc-LiveJournal graphs, and a synthetic power-law graph of 1 million vertices and 10 million edges.

A coloring of the conflict graph reveals sets of updates which can be executed in parallel. For the example in Figure 2-1 there are two possible colorings which correspond to running updates 1 and 4 in parallel or updates 1 and 3 in parallel.

Static versus dynamic scheduling

PageRank can be implemented as either a static or dynamic data-graph computation. A static version of PageRank iteratively applies its update function to all vertices in the graph for a fixed number of iterations or until some global termination condition is satisfied. This version of PageRank may perform unnecessary work, however, because it may update vertices whose neighbors' values have not changed, in which case the update would have no effect. A dynamic version of PageRank, however, can determine at runtime whether it is worthwhile to apply an update at a given vertex. For example, the update function for GraphLab's dynamic PageRank algorithm in Figure 2-3 schedules an update at a vertex only if it has a neighbor whose PageRank changed by some threshold.

Dynamic scheduling can significantly improve the performance of the PageRank application. Figure 2-2 plots the number of updates performed each iteration for the dynamic PageRank application on three different graphs. For each graph, the number of updates performed each iteration decreases rapidly. By performing fewer updates, the dynamic version of PageRank runs faster than the static variant. On a power-law graph of 1 million vertices and 10 million edges, the static version of PageRank performs approximately 15 million updates, whereas the dynamic version performs approximately 7 million updates. This translates into a reduction in total runtime: the static version runs in 3.39 seconds and the dynamic version runs in 2.24 seconds.

void pagerank_update(gl_types::iscope &scope,
                     gl_types::icallback &scheduler) {
  // Get the data associated with the vertex
  vertex_data& vdata = scope.vertex_data();

  // Sum the incoming weights; start by adding the
  // contribution from a self-link.
  double sum = vdata.value * vdata.self_weight;
  const gl_types::edge_list& in_edges = scope.in_edge_ids();

  for (int i = 0; i < in_edges.size(); i++) {
    graphlab::edge_id_t eid = in_edges[i];
    // Get the neighbor vertex value
    const vertex_data& neighbor_vdata =
      scope.const_neighbor_vertex_data(scope.source(eid));
    double neighbor_value = neighbor_vdata.value;

    // Get the edge data for the neighbor
    edge_data& edata = scope.edge_data(eid);
    // Compute the contribution of the neighbor
    double contribution = edata.weight * neighbor_value;

    // Add the contribution to the sum
    sum += contribution;

    // Remember this value as last read from the neighbor
    edata.old_source_value = neighbor_value;
  }

  // Compute the jump weight
  sum = random_reset_prob / scope.num_vertices() +
        (1 - random_reset_prob) * sum;
  vdata.value = sum;
  const gl_types::edge_list& out_edges = scope.out_edge_ids();

  // Schedule the neighbors as needed
  for (int i = 0; i < out_edges.size(); i++) {
    graphlab::edge_id_t eid = out_edges[i];
    edge_data& outedgedata = scope.edge_data(eid);

    // Compute edge-specific residual by comparing the new value of this
    // vertex to the previous value seen by the neighbor vertex.
    double residual = outedgedata.weight *
      std::fabs(outedgedata.old_source_value - vdata.value);

    // If the neighbor changed sufficiently add to scheduler.
    if (residual > termination_bound) {
      gl_types::update_task task(scope.target(eid), pagerank_update);
      scheduler.add_task(task, residual);
    }
  }
}

Figure 2-3: The update function for the PageRank application in GraphLab.

Effect of atomicity on convergence

Update atomicity also improves the performance of the PageRank application by improving its rate of convergence. Strategies for parallelizing data-graph computations that do not provide update atomicity may perform additional work due to slower convergence rates. The bulk-synchronous method, for example, is a method of parallelizing data-graph computations that does not provide update atomicity. The PageRank data-graph computation can be implemented correctly in the bulk-synchronous model by maintaining two copies of each webpage's PageRank: a read copy that was computed the previous round, and a write copy which will store the updated value for the current round. This approach, however, sacrifices the atomicity of the PageRank update function. The effect on convergence rates causes the bulk-synchronous version of PageRank to be up to 2x slower than the version with atomic updates.

This phenomenon applies more generally to iterative solvers for linear systems, which are known to have better convergence properties when each iteration utilizes the most "up to date" values when updating a variable. The bulk-synchronous version of PageRank corresponds to the Jacobi method for iteratively solving a linear system, and the atomic-update version corresponds to the Gauss-Seidel approach. Although the Jacobi algorithm has more parallelism, the Gauss-Seidel algorithm is known to converge up to 2x faster [38]. Indeed, Figure 2-4 demonstrates that the Jacobi version of both the static and dynamic versions of PageRank performs more than twice as many updates as the Gauss-Seidel variant.
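To make the correspondence concrete, the two iterations can be written side by side (a standard formulation in LaTeX notation, using the PageRank update rule from this chapter, where the superscript denotes the round and D_u is the degree of neighbor u):

% Jacobi (bulk-synchronous): all neighbor values come from the previous round.
PR^{(t+1)}(v) = (1-d) + d \sum_{u \in N(v)} \frac{PR^{(t)}(u)}{D_u}

% Gauss-Seidel (atomic updates): neighbors updated earlier in the current round
% contribute their new values.
PR^{(t+1)}(v) = (1-d)
  + d \sum_{\substack{u \in N(v) \\ u\ \text{updated before}\ v}} \frac{PR^{(t+1)}(u)}{D_u}
  + d \sum_{\substack{u \in N(v) \\ u\ \text{not yet updated}}} \frac{PR^{(t)}(u)}{D_u}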

Version                    T1 (s)   T12 (s)   Total updates
1. Static Jacobi           42.02    9.71      33,500,033
2. Static Gauss-Seidel     16.26    3.39      15,000,015
3. Dynamic Jacobi          37.26    6.32      18,691,620
4. Dynamic Gauss-Seidel    12.54    2.24      7,347,401

Figure 2-4: Performance comparison of the Jacobi and Gauss-Seidel (chromatic scheduling) algorithms for computing PageRank using both static and dynamic scheduling. These data result from running the described algorithm on a "power-law" graph with 1 million vertices and 10 million edges. All tests were run on an Intel Xeon 2.67 GHz 12-core machine with 48 GB of memory.

Dynamic data-graph computations with atomic updates

The PageRank example motivates the development of techniques to parallelize dynamic data-graph computations that guarantee update atomicity. Figure 2-4 illustrates four variants of PageRank with and without dynamic scheduling and update atomicity. The Gauss-Seidel versions of PageRank (rows 2 and 4), which require update atomicity, are approximately 3x faster than the Jacobi versions (rows 1 and 3) when using either static or dynamic scheduling. Dynamic scheduling provides another 2x reduction in total updates. A comparison of rows 2 and 4 of Figure 2-4 shows that the dynamically scheduled Gauss-Seidel version of PageRank runs approximately 30% faster than the static variant. The remainder of Part I of this thesis develops a scheduling algorithm which uses chromatic scheduling to efficiently parallelize dynamic data-graph computations while guaranteeing update atomicity.

Chapter 3

Analysis of Parallel Data-Graph Computations

In this chapter, we describe how to analyze the parallel performance of data-graph computations using work/span analysis. In particular, we analyze the performance of CHROMATIC, a chromatic scheduling algorithm for static data-graph computations. We will see that a parallel data-graph computation using CHROMATIC on a graph G = (V, E) with a coloring of size χ performs Θ(V + E) work with O(χ(log(V/χ) + log Δ)) span when the update function f(v) performs O(deg(v)) work in O(log deg(v)) span. This implies that CHROMATIC achieves near-linear speedup when the number of processors P is much smaller than the parallelism (V + E)/(χ(log(V/χ) + log Δ)).

Work/span analysis

Work/span analysis [13, Ch. 27] is a technique for analyzing the theoretical performance of parallel algorithms. The performance of a parallel algorithm depends on the total number of operations it performs and on the length of the longest path in the program's "computation dag". The work T1 of a parallel program is its total running time on a single processor. The span T∞ is its running time on an infinite number of processors, assuming an ideal scheduler with no overhead.

The work and span of an algorithm can be used to accurately predict its parallel speedup when a "greedy scheduler" is used to assign work to processors. A greedy scheduler is a scheduler which assigns as much work as possible during each timestep. Using a greedy scheduler, a program with work T1 and span T∞ can always be executed in time TP on P processors, where max{T1/P, T∞} ≤ TP ≤ T1/P + T∞. The parallelism T1/T∞ is the greatest speedup T1/TP possible for any number P of processors. When the parallelism of an algorithm is much larger than P, the greedy-scheduler bounds on TP guarantee near-linear speedup.
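For example, instantiating the upper bound with the work and span claimed for CHROMATIC at the start of this chapter gives, for one round (in LaTeX notation),

T_P \le \frac{T_1}{P} + T_\infty
    = \frac{\Theta(V + E)}{P} + O\bigl(\chi(\log(V/\chi) + \log \Delta)\bigr),

so the work term dominates and the speedup is nearly linear whenever P is much smaller than the parallelism (V + E)/(χ(log(V/χ) + log Δ)).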

A static chromatic scheduling algorithm

Work/span analysis can be applied to theoretically analyze the parallel performance of data-graph computations performed using CHROMATIC, a chromatic scheduling algorithm for static data-graph computations.

The CHROMATIC algorithm executes a round of a data-graph computation via a sequence of subrounds that each apply updates to a monochromatic set of vertices. The CHROMATIC algorithm begins by dividing the set of vertices into χ sets A0, ..., Aχ−1, where χ is the number of colors used to color the graph. This step can be performed in Θ(V) work and Θ(log V) span with a parallel integer-sort of the vertex set using each vertex's color as its key [21]. To execute a round of the data-graph computation, CHROMATIC updates each set A0, ..., Aχ−1 in series. The updates within a given set Ai may be executed in parallel, since two vertices of the same color cannot be adjacent in the conflict graph. Figure 3-1 provides pseudocode for CHROMATIC, which executes computation rounds iteratively until a termination condition is satisfied.

In general, the work and span of CHROMATIC during a particular computation round depend on the work and span of the update function. It is common, however, for the update functions of data-graph computations to have similar parallel structure. In particular, we refer to an update function f(v) as a standard update function if it can be broken up into deg(v) updates f1(v), ..., fdeg(v)(v) which each perform O(1) work and may be evaluated in parallel. A standard update function can be parallelized in O(deg(v)) work and O(log deg(v)) span by performing a parallel loop over the index set i = 1, ..., deg(v), evaluating fi(v) in the body of the loop.

CHROMATIC(V, f, χ)
1   sort the elements of V by color into A0, A1, ..., Aχ−1
2   repeat
3       done = TRUE
4       for k = 0 to χ − 1
5           parallel for i = 0 to |Ak| − 1 reducing done
6               done ∧= f(Ak[i])
7   until done

Figure 3-1: Parallel pseudocode for CHROMATIC, a chromatic scheduler for static data-graph computations. CHROMATIC takes as input the vertex set V, the update function f, and the number of colors χ used to color V. The reducing clause in the parallel for on line 5 indicates that all of the operations on the variable done should be combined using a parallel reduction tree. The update function f(v) returns TRUE if a termination condition has been satisfied at v. CHROMATIC terminates once the termination condition has been satisfied by all v ∈ V.
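As an illustration of how the loop structure of Figure 3-1 maps onto a shared-memory runtime, here is a minimal C++ sketch using OpenMP for the parallel loop and reduction (the thesis's implementation uses Intel Cilk Plus; the color_sets layout and the update signature are assumptions):

#include <vector>

// color_sets[k] holds the vertices of color k; update(v) applies f to vertex v
// and returns true if the termination condition is satisfied at v.
template <typename UpdateFn>
void chromatic(const std::vector<std::vector<int>>& color_sets, UpdateFn update) {
    bool done = false;
    while (!done) {
        done = true;
        for (const std::vector<int>& A_k : color_sets) {     // colors run in series
            #pragma omp parallel for reduction(&& : done)
            for (long i = 0; i < static_cast<long>(A_k.size()); ++i) {
                done = update(A_k[i]) && done;               // same-color vertices run in parallel
            }
        }
    }
}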

It turns out, however, that the exact distribution of vertices amongst the χ color sets has a limited impact on the parallel performance of CHROMATIC. The following theorem proves bounds on the work and span of CHROMATIC for standard update functions. It shows that CHROMATIC has parallelism that is inversely proportional to the number of colors χ used to color the graph.

Theorem 1. CHROMATIC executes a round of a data-graph computation that applies a standard update function to every vertex in V in Θ(V + E) work and O(χ(log(V/χ) + log Δ)) span.

PROOF. The total work in one round of CHROMATIC is T1 = Σ_{i=0}^{χ−1} Σ_{v∈Ai} Θ(deg(v)) when f is a standard update function. Since the color sets Ai are nonintersecting and their union is V, the total work is equal to the sum of all vertex degrees in the graph. The handshaking lemma implies, therefore, that the work is T1 = Θ(V + E).

The span of one round of CHROMATIC is T∞ = Σ_{i=0}^{χ−1} (log |Ai| + max_{v∈Ai}{log deg(v)}). The log |Ai| term in the span is due to the parallel loop on line 5 over the index set 0, ..., |Ai| − 1. The second term, max_{v∈Ai}{log deg(v)}, is due to the evaluation of the update function in the body of the parallel loop.

The expression for the span can be simplified to remove its dependence on the distribution of vertices amongst the color sets A0, ..., Aχ−1. The contribution of the first span term, Σ_{i=0}^{χ−1} log |Ai|, can be simplified by noting that the sum of a pair of logarithms log(a − ε) + log(a + ε) = log(a² − ε²) is maximized when ε = 0. The maximum value of Σ_{i=0}^{χ−1} log |Ai|, therefore, occurs when each set Ai contains an equal number of vertices. This allows us to bound the contribution of the first span term by χ log(V/χ). The second term can be simplified by using the maximum degree Δ of the graph to bound max_{v∈Ai}{log deg(v)} by log Δ. An upper bound on the span of a round of CHROMATIC, therefore, is T∞ = O(χ(log(V/χ) + log Δ)). ∎
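Equivalently, the pairing argument bounding the first span term is an instance of Jensen's inequality applied to the concave logarithm (in LaTeX notation):

\sum_{i=0}^{\chi-1} \log |A_i|
  \;\le\; \chi \log\!\left(\frac{1}{\chi}\sum_{i=0}^{\chi-1} |A_i|\right)
  \;=\; \chi \log(V/\chi).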

The analysis of CHROMATIC for general update functions closely matches the proof of Theorem 1, but requires more cumbersome notation. Let T1(f(v)) and T∞(f(v)) be the work and span of the update f(v). Then one can define the work and span of f applied to a set Ai of vertices in parallel to be T1(Ai) = Σ_{v∈Ai} T1(f(v)) and T∞(Ai) = max_{v∈Ai} T∞(f(v)). The following corollary uses this notation to state the general bound on the work and span of a round of CHROMATIC.

Corollary 2. A round of CHROMATIC applies a general update function f to every vertex v ∈ V in work T1 = Θ(T1(V)) and span T∞ = O(χ log(V/χ) + χ T∞(V)).

Chapter 4

Dynamic Chromatic Scheduling

In this chapter, we describe PRISM, a work-efficient algorithm for the chromatic scheduling of dynamic data-graph computations. Using the techniques developed and applied in Chapter 3 when analyzing CHROMATIC, we analyze the parallel performance of PRISM. We will first derive bounds on the work and span of PRISM for standard update functions, for which f(v) performs O(deg(v)) work in O(log deg(v)) span. In this case, for a graph G = (V, E) with maximum vertex degree Δ, PRISM updates a set Q of vertices with EQ incident edges in Θ(Q + EQ) work and O(χ(log(Q/χ) + log Δ)) span. This analysis can be generalized to provide bounds on the work and span of PRISM for arbitrary update functions. We will see that PRISM performs as well asymptotically as CHROMATIC in the case in which Q = V.

Complexity of dynamic scheduling

The primary challenge in implementing a provably good dynamic scheduling algorithm is guaranteeing that the work to update a set of vertices Q ⊆ V depends only on the size of Q. This guarantee is important because, in practice, Q may be much smaller than V. A given round of a data-graph computation may, for example, update only a single perturbed vertex. It is straightforward to devise dynamic chromatic scheduling algorithms that fail to provide this guarantee. For example, the static chromatic scheduling algorithm CHROMATIC can be modified to check a scheduled bit associated with each vertex before executing an update. The scheduled bit for a vertex is set to TRUE when an update should be applied to it in the next round. This modified version of CHROMATIC is not work-efficient for two reasons. The first reason is that Ω(V) work is required to check whether an update is scheduled at each vertex. The second reason is due to possible write contention on the scheduled bit, which can increase the work of a parallel algorithm on a multicore machine by requiring a processor to wait additional timesteps to obtain exclusive write access to a memory location. This chapter describes PRISM, a provably good dynamic chromatic scheduling algorithm.

Analyzing memory contention

Work/span analysis does not usually consider the effect of memory contention. In fact, the effect of memory contention can be safely ignored as long as the number of parallel writes which contend for any given memory location is bounded by a constant, since a constant delay in memory operations will not change the asymptotic work and span of an algorithm. Unbounded contention, however, can have a large impact on a parallel algorithm's performance when it is implemented on a multicore machine. Unbounded contention can arise in dynamic data-graph computations when scheduling an update at a vertex in the next computation round. For example, if a single byte is used to indicate whether a vertex has been scheduled, then up to Δ contention may occur if multiple neighbors of that vertex attempt to schedule its update at the same time.

All but one of the parallel phases of PRISM have bounded contention. The one exception is the procedure used to deduplicate sets of scheduled vertices. For this phase, we shall analyze contention in the CRQW model [21, 22], where concurrent reads to a variable can be accomplished in constant time, but concurrent writes to the same memory location are serviced in FIFO order by the memory system. In the CRQW model, the deduplication of scheduled vertices can be accomplished by invoking the randomized integer-sorting algorithm from [21], which sorts n nonnegative integers less than n using Θ(n) work and Θ(lg n) span in expectation.

PRISM(V, f, Q, χ)
 1   parallel for v ∈ Q
 2       INSERT(A[v.color][wid], v)
 3   repeat
 4       done = TRUE
 5       for k = 0 to χ − 1
 6           X = COLLECT-SET(Ak)
 7           done ∧= (X == ∅)
 8           parallel for i = 0 to |X| − 1
 9               if i == |X| − 1 or X[i] ≠ X[i+1]      // Vk = Vk ∪ {X[i]}
10                   S = f(X[i])
11                   parallel for u ∈ S
12                       INSERT(A[u.color][wid], u)
13   until done

Figure 4-1: Pseudocode for PRISM, a chromatic scheduler for dynamic data-graph computations. PRISM takes as input the vertex set V, the update function f, an initial set Q of vertices to process, and the number of colors χ used to color V. Each update f(v) returns a subset of the vertices neighboring v to process in a subsequent color step. The variable wid stores the ID of the current worker, which is used to access worker-local storage.

COLLECT-SET(A)
 1   X = GATHER(A)
 2   fail = TRUE
 3   while fail
 4       randomly choose h : V → {0, 1, ..., |X|³ − 1}
 5       RADIX-SORT-BY-HASH(X, h)
 6       fail = FALSE
 7       parallel for i = 0 to |X| − 2 reducing fail
 8           fail ∨= (h(X[i]) == h(X[i+1])) ∧ (X[i] ≠ X[i+1])
 9   return X

Figure 4-2: Pseudocode for COLLECT-SET, which collects the activations in the P worker-local arrays Ak[0], ..., Ak[P−1] to produce a version of the activation set Ak in which all duplicate vertices are stored adjacent to each other. GATHER is assumed to empty the vectors storing Ak as a side effect.
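To make the activation-set mechanics of Figures 4-1 and 4-2 concrete, here is a simplified serial C++ sketch of the worker-local arrays and the duplicate-skipping pass (the container layout and names are assumptions; the real COLLECT-SET deduplicates by hashing and radix-sorting, as analyzed below, rather than by sorting vertex ids directly):

#include <algorithm>
#include <vector>

// One activation set Ak, split into per-worker vectors so that INSERT is contention-free.
struct ActivationSet {
    std::vector<std::vector<int>> per_worker;
    explicit ActivationSet(int workers) : per_worker(workers) {}

    void insert(int worker_id, int v) { per_worker[worker_id].push_back(v); }

    // Gather all worker-local arrays and group duplicates by sorting on vertex id.
    std::vector<int> collect() {
        std::vector<int> X;
        for (std::vector<int>& local : per_worker) {
            X.insert(X.end(), local.begin(), local.end());
            local.clear();                       // GATHER empties the worker-local arrays
        }
        std::sort(X.begin(), X.end());           // duplicates become adjacent
        return X;
    }
};

// Update each distinct vertex once, as on lines 8-10 of PRISM.
template <typename UpdateFn>
void process_color_step(ActivationSet& A_k, UpdateFn f) {
    std::vector<int> X = A_k.collect();
    for (std::size_t i = 0; i < X.size(); ++i)
        if (i + 1 == X.size() || X[i] != X[i + 1])   // skip all but one copy of each vertex
            f(X[i]);
}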

Design and analysis of PRISM

Figure 4-1 gives the pseudocode of PRISM. Conceptually, PRISM operates much like CHROMATIC, except that, rather than process vertices in a static array in color step k, PRISM maintains an activation set Ak for each color k: a set of activations for vertices of color k that occurred since color step k in the previous round. After PRISM loads the initial set of active vertices Q into the activation sets (on lines 1-2), PRISM executes the rounds of the data-graph computation by looping over the colors 0, ..., χ − 1 (on lines 5-12) until it finds no vertices to update in a round. Within a single color step k, PRISM executes all updates f(v) on distinct activated vertices v in Ak (on lines 8-12). PRISM stores an activation set Ak using worker-local storage. For a P-processor execution, PRISM allocates P vectors Ak[0], ..., Ak[P−1], one for each worker. Each worker-local set Ak[p] is implemented as a vector that uses incremental resizing to support INSERT operations in Θ(1) worst-case time. Each update returns a set S of activations (line 10), and then PRISM distributes the set of activations to its activation sets in parallel (on lines 11-12). The use of activation sets avoids the need to perform Ω(V) checks each round to determine which vertices should be updated.

By using worker-local storage, no contention occurs when inserting a vertex into an activation set (on line 12). However, it is possible for a vertex to appear in the activation sets of multiple workers because a vertex may be activated by any of its neighbors. A given worker can ensure that no duplicate vertices appear in its local activation set by performing lookups in a worker-local hash table that contains the set of previously inserted vertex ids. Preventing a vertex from appearing in multiple activation sets could be accomplished by having all workers utilize a shared data structure supporting atomic lookups of a vertex's activated state. For example, an array of V bytes could be used to avoid duplicates by having each worker perform a compare-and-swap on the ith byte in the array before adding the vertex with id i to its activated set. This approach, however, requires a data synchronization operation for every activation. Furthermore, when the number of processors is large there could be significant contention on the byte array as multiple processors attempt to write to the same cache line.

The alternate strategy used by PRISM is to allow a vertex to appear in multiple activation lists, but prevent it from being updated more than once. If all of the workers' activation lists are merged and sorted based on the vertex id, then a duplicate vertex can be identified by checking whether its predecessor in the activation list shares its vertex id. To process the distinct vertices in activation set Ak, PRISM calls COLLECT-SET, depicted in Figure 4-2, to collect the activations in the worker-local activation sets for Ak and sort them into a single array X in which, for each distinct vertex v in Ak, all copies of v are adjacent in X. The appendix contains a proof that, with probability 1 − 2/|Ak|, COLLECT-SET operates in Θ(|Ak|) expected work and O(lg |Ak| + lg P) expected span in the CRQW contention model.

To bound the work, span, and contention of PRISM, we first analyze the work, span, and contention of COLLECT-SET.

Lemma 3. With probability 1 − 2/n, the function COLLECT-SET collects the set of n activated vertices Ak of color k from all P worker-local arrays and groups duplicates together in expected Θ(n) work and expected O(lg n + lg P) span in the CRQW contention model.

PROOF. Let us first analyze one iteration of COLLECT-SET. First, GATHER on line 1 collects the contents of the P worker-local copies of Ak into a single array X in Θ(n) work, Θ(lg n + lg P) span, and no contention. To do this, GATHER first performs a parallel prefix sum over the sizes of the worker-local arrays Ak[0], ..., Ak[P − 1] to determine a location in X for each worker-local array, followed by P parallel copies to copy the worker-local arrays into X in parallel. Next, line 4 chooses a random hash function h : V → [|X|³] from the vertices V to 3 lg n-bit numbers, and line 5 calls RADIX-SORT-BY-HASH to sort X by h. RADIX-SORT-BY-HASH may be implemented with 3 passes of the stable linear-work sorting algorithm from [21], in which case RADIX-SORT-BY-HASH executes in expected Θ(n) work and expected Θ(lg n) span with probability 1 − 1/n in the CRQW contention model. Finally, lines 7-8 check the sorted X to verify that all adjacent elements in X are either duplicates or have distinct hashes, requiring Θ(n) work, Θ(lg n) span, and bounded contention to compute.

We now verify that, with probability 1 − 1/n, a constant number of iterations suffices. The probability that one iteration of lines 3-8 fails is precisely the probability that any two distinct vertices u and v in X hash to the same value. If h is pairwise independent, then for any two distinct vertices u and v in X, we have Pr{h(u) = h(v)} = 1/n³. By a union bound, the probability of a collision between any two vertices in X is at most Σ_{u,v∈X} Pr{h(u) = h(v)} ≤ 1/n. COLLECT-SET thus runs in the stated work, span, and contention bounds with probability (1 − 1/n)² ≥ 1 − 2/n. ∎

We now bound the expected work and expected span to compute a single color step, i.e. to perform one iteration of lines 5-12 of Figure 4-1.

Theorem 4. Consider the execution of color step k of PRISM with P workers. Let Ek = Σ_{v∈Vk} deg(v). With probability 1 − 2/|Ak|, color step k of PRISM executes with expected work T1 = Θ(|Ak| + Ek) and expected span T∞ = O(lg |Ak| + lg P + lg Δ) in the CRQW contention model, where Δ is the maximum degree of G.

PROOF. We analyze the code in Figure 4-1. By Lemma 3, with probability 1 − 2/|Ak|, the call to COLLECT-SET on line 6 performs expected Θ(|Ak|) work, expected Θ(lg |Ak| + lg P) span, and bounded contention. The result of line 6 is the array X, a version of Ak where duplicate vertices are stored next to each other. The loop over X on lines 8-12 touches every element of X a constant number of times in Θ(|Ak|) work, Θ(lg |Ak|) span, and no contention. Next, line 10 calls f on each unique element in X once, incurring Ek work and lg Δ span.

Each update f(X[i]) activates some set S of vertices for future color steps, which lines 11-12 insert into the worker-local arrays in parallel. Because each worker-local array is implemented as an incrementally resizing vector, each insertion takes Θ(1) worst-case time. Moreover, because the number of vertices activated by some f(X[i]) is bounded by the work of f(X[i]), lines 11-12 execute in Θ(deg(X[i])) work and Θ(lg deg(X[i])) = O(lg Δ) span per update f(X[i]). Summing the work and span over all vertices in Vk completes the proof. ∎

From Theorem 4, we conclude that, with high probability, PRISM achieves at worst the same theoretical bounds as a chromatic scheduler for static data-graph computations. Consequently, PRISM is work-efficient, incurring no additional overhead (with high probability) over a chromatic scheduler for static data-graph computations. The following corollary formalizes this observation.

Corollary 5. Consider a data-graph computation that applies the update function f to a graph G = (V, E) with maximum degree Δ. Let P be the number of workers executing the data-graph computation. Suppose that, in one round of the data-graph computation, the set V' ⊆ V of vertices is updated, where |V'| ≥ P, and let A denote the activation set of V', the union of Ak over all color steps k in the round. Then, with probability 1 − 2/|A|, the expected work and span of PRISM match the work and span of CHROMATIC.

PROOF. For didactic simplicity, we assume that an update f(v) executes in work Θ(deg(v)) and span Θ(lg deg(v)). A chromatic scheduler for static data-graph computations can execute a round of this data-graph computation in Θ(V + E) work and O(χ lg(V/χ) + χ lg Δ) span with no contention. Let E' be the set of outgoing edges from all vertices in V'. By Theorem 4, with probability 1 − 2/|A|, PRISM executes all updates to vertices in V' in a round in expected work Θ(|A| + |V'| + |E'|), expected span O(χ lg(|A|/χ) + χ lg P + χ lg Δ), and bounded contention. Because f must perform work to activate a vertex v ∈ V', we have |A| ≤ |E'|. Finally, because |A| ≤ |E| ≤ |V|², we have χ lg(|A|/χ) = O(χ lg(V/χ)). Hence, the expected work and expected span bounds for PRISM are bounded above by the work and span of CHROMATIC. ∎

Chapter 5

Performance Evaluation of Prism

In this chapter, we evaluate the performance of PRISM by implementing it in GraphLab and GraphChi. We will see that PRISM improves the performance of GraphLab by a factor between 1.5 and 2 on a set of four benchmarks and that PRISM improves the performance of GraphChi's execute update phase by a factor between 2 and 3 when it guarantees determinism. The impact of coloring size on the parallelism and runtime of chromatically scheduled data-graph computations is also analyzed empirically.

Chromatic scheduling in GraphLab

We implemented PRISM in GraphLab v1.0 to add support for chromatic scheduling of dynamic data-graph computations. The scheduler interface was modified to expose the method getActivationSet(), which returns update tasks on a monochromatic set of vertices. Each activation set is implemented using worker-local storage as a dynamic array. We evaluated PRISM in GraphLab by comparing its performance on a set of four benchmarks to the "edge consistency" locking protocol that provides the same data consistency guarantees as chromatic scheduling by acquiring locks on updated vertices and their incident edges.

The GraphLab applications for loopy belief propagation, PageRank, and alternating least squares were used to compare the relative overheads of chromatic scheduling and lock-based synchronization. Loopy belief propagation was run on a subset of the Cora dataset of computer science paper citations [50, 58], which has an MRF of 160K vertices and 480K edges. PageRank was run on the web-Google dataset of 87K websites and 5M links [42]. Two versions of alternating least squares, one of which enforces a sparsity constraint on computed factors, were run on the NPIC500 dataset consisting of 88K noun phrases, 99K contexts, and 20M occurrences [51].

                                           Seconds                            # Updates (10^6)
Benchmark                                  Sweep-VC  Sweep-EC  PRISM-EC       Sweep-VC  Sweep-EC  PRISM-EC
Loopy belief propagation                   2.6       2.8       1.5 (0.1)      2.19      2.22      2.20
PageRank                                   3.7       5.2       3.7 (1.3)      10.21     10.18     9.75
Alternating least squares                  83.8      138.6     73.2 (1.0)     4.17      3.99      3.85
Alternating least squares with sparsity    108.1     142.8     93.3 (1.0)     4.03      3.88      3.85

Figure 5-1: Performance comparison of PRISM (PRISM-EC), GraphLab's edge consistency locking protocol (Sweep-EC), and GraphLab's vertex consistency locking protocol (Sweep-VC). All benchmarks were run using 12 cores. GraphLab was run using its "sweep" scheduler, which provides scheduling semantics equivalent to PRISM. The time to color the graph serially is included in the PRISM-EC runtime, and is also provided separately in parentheses.

All benchmarks were run on an Intel Xeon X5650 with a total of 12 2.67-GHz processing cores (hyperthreading disabled), 49 GB of DRAM, two 12-MB L3-caches each shared between 6 cores, and private L2- and L1-caches with 128 KB and 32 KB, respectively.

The benchmark results in Figure 5-1 show that PRISM is approximately 1.5-2 times faster than GraphLab's edge consistency locking protocol. Furthermore, PRISM is often faster than GraphLab's much weaker vertex consistency locking protocol, which only serializes updates occurring at the same vertex. The latter observation can be attributed to our use of the Intel Cilk Plus runtime system, and a lower per-task scheduling overhead. This suggests that PRISM provides the properties of data consistency and determinism at low cost. The performance comparison becomes even more favorable when we consider that PRISM's runtime includes the time required to color the graph using a serial greedy coloring algorithm. The time required to color the graph can be reduced through the use of parallel graph coloring algorithms, and may not be necessary at all if a coloring has been previously computed.

The scalability of PRISM was measured by comparing its speedup to GraphLab's dynamic scheduler with and without its edge consistency locking protocol. Figure 5-2 compares the speedup achieved on a PageRank benchmark for these three programs relative to the fastest serial program. The speedup curve for PRISM is noticeably steeper than the speedup curve for Locks, demonstrating that PRISM not only outperforms Locks in terms of raw performance, but can also exhibit superior parallel scalability.

Figure 5-2: Speedup plot of Not Atomic, Locks, and Chromatic on 1-12 cores for the PageRank benchmark run on a random power-law graph of 1 million vertices and 10 million edges. Speedup is measured relative to the fastest serial program. Not Atomic refers to GraphLab with its locking protocol turned off, Locks refers to GraphLab with its edge consistency locking protocol used to provide atomicity, and Chromatic refers to the version of GraphLab utilizing PRISM to provide atomicity using chromatic scheduling.

Dataset GraphChi PRISM GraphChi PRISM (execute updates) (execute updates) (total time) (total time) cit-Patents 2.38 1.93 22.24 21.98 Mediawiki 33.13 11.86 239.72 209.87 Figure 5-3: Benchmark results for 4 iterations of PageRank in GraphChi. All measure- ments are in seconds. Performance comparison of two ways to guarantee determinism in GraphChi. Serialize Conflicts provides determinism by serializing all conflicting updates within a round. Chromatic provides determinism by using PRISM. The runtime for Chro- matic includes the time required to color the graph at runtime. steeper than the speedup curve for Locks demonstrating that PRISM not only out- performs Locks in terms of raw performance, but can also exhibit superior parallel scalability.

Chromatic scheduling in GraphChi

GraphChi [41] is an extension of GraphLab to the external memory setting which can provide both determinism and atomicity. It divides a graph into shards stored on disk, and applies updates to each shard in series. GraphChi can guarantee atomicity and determinism by serializing all updates which conflict with another update in the same shard. This approach often sacrifices a large amount of parallelism. For example, on a PageRank application 30% of all updates were serialized on the cit-Patents dataset and 70% were serialized on the Mediawiki dataset.

Figure 5-4: Effect of the number of colors used on the parallelism and runtime of PageRank on the cit-Patents, web-Google, and as-skitter graphs.

To perform chromatic scheduling in GraphChi, the sub-graph associated with each of its shards is colored at runtime. Once a shard is colored, the color assignment is stored on disk with the other vertex data. Figure 5-3 demonstrates that chromatic scheduling can improve the performance of GraphChi's execute update phase by a factor of 3 when using GraphChi's deterministic engine.
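As an illustration of this approach, the following is a minimal serial sketch of executing one shard's updates color by color once its vertices have been colored. The function names and data layout are illustrative, not GraphChi's actual engine code; within a single color class no two updates conflict, so a real engine could apply the class in parallel while preserving determinism.

#include <cstddef>
#include <vector>

// Groups the vertices of a shard into color classes.  colors[i] is the color of
// the shard's i-th vertex; the result's c-th entry lists the vertices of color c.
std::vector<std::vector<std::size_t>>
group_by_color(const std::vector<int>& colors, int num_colors) {
  std::vector<std::vector<std::size_t>> classes(num_colors);
  for (std::size_t v = 0; v < colors.size(); ++v)
    classes[colors[v]].push_back(v);
  return classes;
}

// Applies an update function to every vertex of the shard, one color class at a
// time.  Determinism follows because the order of the color classes is fixed.
template <typename UpdateFn>
void execute_shard(const std::vector<int>& colors, int num_colors, UpdateFn update) {
  for (const auto& cls : group_by_color(colors, num_colors))
    for (std::size_t v : cls)   // "parallel for" in a real engine
      update(v);
}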

Parallelism as a function of the graph coloring

The parallelism present in a data-graph computation when using PRISM can depend on the size of the graph coloring. To explore this effect, we used the Cilkview scalability analyzer [31] to analyze the parallelism of PRISM on the PageRank application when the size of the coloring is varied. To obtain larger graph colorings, we modified the greedy coloring algorithm to select the smallest available color greater than r, where r is chosen uniformly at random from the range [0, R) for each vertex. Larger colorings were then obtained by coloring the graph for R = 16, 32, ..., 256. The results in Figure 5-4 indicate that the size of the graph coloring can have an impact on the scalability of data-graph computations when chromatic scheduling is used. When parallelism is scarce, larger colorings can also translate into an increase in runtime.
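The modification used to inflate the coloring can be sketched as follows. This is illustrative serial code assuming the graph is given as adjacency lists; it is not the instrumented GraphLab code used in the experiment.

#include <cstddef>
#include <random>
#include <vector>

// Greedy coloring that, for each vertex, picks the smallest available color
// strictly greater than a random offset r drawn uniformly from [0, R).
std::vector<int> greedy_color_with_offset(
    const std::vector<std::vector<std::size_t>>& adj, int R, unsigned seed) {
  std::mt19937 gen(seed);
  std::uniform_int_distribution<int> offset(0, R - 1);     // r in [0, R)
  std::vector<int> color(adj.size(), -1);
  for (std::size_t v = 0; v < adj.size(); ++v) {
    int r = offset(gen);
    // The chosen color is at most r + 1 + deg(v), so this scratch array suffices.
    std::vector<bool> taken(static_cast<std::size_t>(R) + adj[v].size() + 2, false);
    for (std::size_t u : adj[v])
      if (color[u] >= 0 && static_cast<std::size_t>(color[u]) < taken.size())
        taken[color[u]] = true;
    int c = r + 1;                       // smallest candidate greater than r
    while (taken[c]) ++c;
    color[v] = c;
  }
  return color;
}

Running this routine with R = 16, 32, ..., 256 produces progressively larger colorings of the same graph, which is how the colorings for Figure 5-4 were generated.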

Part II

Deterministic Parallel Coloring

Chapter 6

Graph Coloring

Chromatic scheduling utilizes a vertex coloring of the conflict graph in order to identify sets of nonconflicting updates that can be applied in parallel. In the next four chapters, we explore how such a vertex coloring can be generated in parallel on a multicore machine.

This chapter provides a brief overview of related work on vertex-coloring algorithms. We review the classic serial greedy coloring algorithm and discuss a common variant, the Welsh-Powell algorithm, which utilizes a vertex-ordering heuristic to reduce the number of colors used in practice. We then review previous work on parallel greedy coloring algorithms, including the Jones-Plassmann coloring algorithm, which will be analyzed in greater detail in Chapter 7.

Greedy graph coloring

Optimally coloring the vertices of a graph is a known NP-complete problem [17]. Although polynomial-time approximation algorithms are known [61], linear-work greedy algorithms are often used in practice due to their practical efficiency - typically they perform only one or two passes over the data.

The lexicographically first [12], [28, Appendix A] greedy coloring of a graph G = (V, E) can be computed by processing each vertex in an order defined by a permutation π. A processed vertex v is assigned a color greedily by picking the lowest numbered color that has not been assigned to any of v's neighbors. For a graph with max vertex degree Δ, this algorithm runs in Θ(V + E) time and uses at most Δ + 1 colors. Figure 6-1 gives pseudocode for this greedy algorithm based on an arbitrary permutation π of V.

Vertex ordering heuristics

Practitioners often use heuristics to pick an ordering of the vertices that may allow GREEDY to use fewer colors. The degree-ordering heuristic [3, 23, 37], for example, orders the vertices by decreasing degree so that larger-degree vertices are colored first. The degree-ordering heuristic seems to allow GREEDY to use fewer colors in practice. The classic Welsh-Powell greedy coloring algorithm [60], in fact, can be viewed as an execution of GREEDY utilizing the degree-ordering heuristic.

One disadvantage of the degree-ordering heuristic, however, is that it is vulnerable to adversarial input graphs which force GREEDY to produce a coloring with Δ + 1 colors. For instance, although the "crown" graph on |V| vertices is 2-colorable, GREEDY produces a |V|/2 coloring using degree ordering [36]. The random-ordering heuristic, which utilizes a random permutation of the vertices, can use fewer colors in practice for certain adversarial graphs.

Parallel graph coloring

Luby explored parallel randomized greedy coloring [48] and inspired many papers [4, 24, 25, 43] on the topic of parallel coloring, including methods for derandomization. Several deterministic parallel coloring algorithms [5, 39, 40], based on the algebraic construction of Linial [44], have been shown to be theoretically fast. Unfortunately, none of these algorithms is work-efficient, that is, uses as little asymptotic work as the greedy algorithm.

Given a random permutation π on the vertices of a graph G, Jones and Plassmann demonstrated that a parallel implementation of GREEDY runs deterministically in expected O(ln V / ln ln V) rounds in the message-passing model [11, 20], assuming that the graph has bounded degree. Of course, the assumption of bounded degree also circumvents the issue of contention, which is a real issue when implementing graph algorithms on modern multicore machines. We shall both analyze the Jones-Plassmann algorithm for large degree and demonstrate that it can be made to operate work-efficiently and with small span using the CRQW contention model.

GREEDY(G = (V, E), π)
1  i = 0
2  while there are uncolored vertices
3      let v be the ith vertex in V by the ordering π
4      v.color = lowest numbered color not taken by any u ∈ Adj[v]
5      i = i + 1

Figure 6-1: The serial greedy graph-coloring algorithm in pseudocode, based on the Welsh-Powell [60] coloring algorithm. GREEDY colors all vertices of a graph G = (V, E) in the order dictated by the permutation π.
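For concreteness, the following is a direct serial C++ rendering of GREEDY, a minimal sketch assuming the graph is given as adjacency lists and the ordering π is given as a list of vertex indices. It is illustrative code, not the optimized serial implementation evaluated later in this thesis.

#include <cstddef>
#include <vector>

std::vector<int> greedy_color(const std::vector<std::vector<std::size_t>>& adj,
                              const std::vector<std::size_t>& pi) {
  const std::size_t n = adj.size();
  std::vector<int> color(n, -1);
  std::vector<bool> taken(n + 1, false);   // at most degree + 1 <= n + 1 colors
  for (std::size_t v : pi) {               // process vertices in the order pi
    for (std::size_t u : adj[v])
      if (color[u] >= 0) taken[color[u]] = true;
    int c = 0;
    while (taken[c]) ++c;                  // lowest color not taken by a neighbor
    color[v] = c;
    for (std::size_t u : adj[v])           // reset the scratch array for reuse
      if (color[u] >= 0) taken[color[u]] = false;
  }
  return color;
}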

Chapter 7

Jones-Plassmann Coloring Algorithm

This chapter examines JP, the Jones-Plassmann parallel greedy coloring algorithm [37]. Given a random permutation π on the vertices of a graph G = (V, E) of bounded degree, Jones and Plassmann demonstrated that JP runs deterministically and work-efficiently in expected O(ln V / ln ln V) rounds in the message-passing model [11, 20]. We extend this analysis to arbitrary-degree graphs, showing that JP runs in expected O(Δ + ln Δ ln V / ln ln V) rounds in the message-passing model.

Induced priority dag

The Jones-Plassmann algorithm colors a graph according to a "priority dag" that is induced by an ordering π of the vertices. We say that a priority dag G_π = (V, E_π) is induced by π on the undirected graph G = (V, E) when G_π and G share the same vertices and there is a directed edge (u, v) in E_π for some (u, v) ∈ E if π(u) > π(v). We call π(v) the priority of vertex v.

When selecting a color for a vertex v ∈ V, GREEDY only considers colored neighbors, which are precisely those that come before v in the permutation π, or equivalently Pred[v]. So, once every predecessor u ∈ Pred[v] has been colored, v is also free to be colored. Jones and Plassmann [37] exploited this observation to produce an asynchronous parallel coloring algorithm, referred to here as JP, which produces the same deterministic output as GREEDY given the same inputs. JP operates by first finding the priority dag G_π = (V, E_π) induced on a graph G = (V, E) by permutation π. Then, a message travels along the directed edge (u, v) ∈ E_π only after vertex u has been successfully colored, notifying vertex v. Once v has received |Pred[v]| messages, v may be colored and send all successors of v similar messages.

JP(G = (V, E), π)
1  parallel for all v ∈ V
2      Pred[v] = {u ∈ Adj[v] | π(u) > π(v)}
3  parallel for all v ∈ V such that Pred[v] == ∅
4      COLOR(v)

COLOR(v)
1  v.color = lowest available color
2  parallel for all v_succ ∈ Succ[v]
3      if v is the last of v_succ's predecessors to be colored
4          COLOR(v_succ)

Figure 7-1: JP implements the parallel greedy graph-coloring algorithm of Jones and Plassmann [37] using the recursive method COLOR. JP produces the same coloring of a graph G = (V, E) given an arbitrary permutation π on the vertices as GREEDY given the same inputs.
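To make the recursion concrete, here is a serial C++ sketch of the same structure, assuming the priorities pi form a permutation (all values distinct) and the graph is given as adjacency lists. A worklist stands in for the parallel recursive calls, so this is illustrative code rather than the parallel implementation developed in Chapter 9; given the same priorities it produces the same coloring as processing vertices in decreasing priority with GREEDY.

#include <cstddef>
#include <queue>
#include <vector>

std::vector<int> jp_color(const std::vector<std::vector<std::size_t>>& adj,
                          const std::vector<std::size_t>& pi) {  // pi[v] = priority of v
  const std::size_t n = adj.size();
  std::vector<std::size_t> pending(n, 0);   // uncolored predecessors remaining
  std::vector<int> color(n, -1);
  std::queue<std::size_t> ready;
  for (std::size_t v = 0; v < n; ++v) {
    for (std::size_t u : adj[v])
      if (pi[u] > pi[v]) ++pending[v];      // u is in Pred[v]
    if (pending[v] == 0) ready.push(v);     // no predecessors: color immediately
  }
  while (!ready.empty()) {
    std::size_t v = ready.front(); ready.pop();
    std::vector<bool> taken(adj[v].size() + 1, false);
    for (std::size_t u : adj[v])            // colored neighbors are exactly Pred[v]
      if (color[u] >= 0 && color[u] <= static_cast<int>(adj[v].size()))
        taken[color[u]] = true;
    int c = 0;
    while (taken[c]) ++c;                   // lowest available color
    color[v] = c;
    for (std::size_t u : adj[v])            // "notify" successors
      if (pi[u] < pi[v] && --pending[u] == 0) ready.push(u);
  }
  return color;
}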

Jones and Plassmann showed that O(ln V / ln ln V) expected rounds in the message-passing model suffice to greedily color a graph G = (V, E) of bounded degree Δ = Θ(1) given a random vertex permutation π. We will show that, in fact, JP needs only O(Δ + ln Δ ln V / ln ln V) expected rounds to color any general graph G with degree Δ given a random permutation π.

Bounding the longest directed path in a random-induced priority dag

We show that the length of the longest directed path in the priority dag G_π induced on a graph G by a random permutation π is likely to be bounded and consequently that JP runs in the same number of rounds in the message-passing model. We bound the longest directed path in G_π by first demonstrating that any particular sufficiently long path in G is unlikely to be a directed path in G_π. We then use the union bound to show that no such path is likely to exist in G_π. However, first we will prove a useful lemma.

Lemma 6 For α, β > 1, define the function g(α, β) as

    g(α, β) = (e² ln α / ln β) · ln(e β ln α / (α ln β)) .

For all β > e², α ≥ 2, and β ≥ α, we have g(α, β) > 1.

PROOF. We consider the cases when α ≥ e² and when α < e² separately.

First, we consider the case when α ≥ e². The partial derivative of g(α, β) with respect to β, ∂g(α, β)/∂β, is

    ∂g(α, β)/∂β = (e² ln α / (β ln²β)) · ln(α ln β / (e² ln α)) ≥ 0 ,

since α ln β / (e² ln α) ≥ 1 when α ≥ e² and β ≥ α. Thus, g(α, β) is a nondecreasing function in its second argument when α ≥ e² and β ≥ α. Since we have

    g(α, α) = (e² ln α / ln α) · ln(e α ln α / (α ln α)) = e² ln e = e² > 1 ,

it follows that g(α, β) > 1 for α ≥ e² and β ≥ α.

Next, we consider the case where e² > α ≥ 2. We make use of the fact that 2β/(e ln β) ≥ √β for all β ≥ e² to bound

    g(α, β) = (e² ln α / ln β) · ln(e β ln α / (α ln β))
            ≥ (e² ln 2 / ln β) · ln(β ln 2 / (e ln β))
            ≥ (e² ln 2 / ln β) · (ln β / 2)
            = (e² ln 2) / 2
            > 1 .                                                        □

Theorem 7 Let G = (V, E) be an undirected graph with n = |V| vertices and degree Δ. Let G_π be a priority dag induced on G by a random permutation π of V such that all vertex orderings are equally likely. For any constant ε > 0 and sufficiently large n, there exists a directed path with e²(1 + ε)(Δ + ln Δ ln n / ln ln n) vertices in G_π with probability at most n^(-ε).

PROOF. This proof begins by demonstrating, in the same manner as Jones and Plassmann [37], that the probability that a k-vertex path exists in G_π is at most n(eΔ/k)^k.

Let p = (v_1, v_2, ..., v_k) be a k-vertex path in G. For p to be a path in G_π, we must have that π(v_1) < π(v_2) < ... < π(v_k). Of the k! permutations of the vertices in p, exactly one is so ordered, and thus the probability that p is a path in G_π is at most 1/k!. We now count the number of k-vertex paths in G. There are n choices for v_1, and then at most Δ choices for each v_i given the path (v_1, v_2, ..., v_{i-1}), for i = 2, 3, ..., k. Thus, the total number of k-vertex paths is at most nΔ^k. By the union bound, the probability that a k-vertex path exists in G_π is at most nΔ^k/k! ≤ n(eΔ/k)^k, which follows from Stirling's approximation [13, p. 57].

For the choice k = e²(1 + ε)(Δ + ln Δ ln n / ln ln n), we now bound the probability that a k-vertex path exists in G_π. We consider the cases when Δ ≤ ln n and when ln n < Δ separately. For the case when Δ ≤ ln n, we assume that Δ ≥ 2, since the theorem is trivially true when Δ ∈ {0, 1}. We apply Lemma 6 with α = Δ and β = ln n and diminish the magnitude of the negative exponent to conclude

    n(eΔ/k)^k = n exp(−k ln(k/(eΔ)))
              ≤ n exp(−e²(1 + ε)(ln Δ ln n / ln ln n) · ln(e ln Δ ln n / (Δ ln ln n)))
              = n exp(−(1 + ε)(ln n) g(Δ, ln n))
              ≤ n^(-ε) .

For the case Δ > ln n, we use the facts that k ≥ (1 + ε) ln n and k ≥ e²Δ, whence we have

    n(eΔ/k)^k ≤ n(1/e)^k
              ≤ n^(-ε) .                                                 □

Corollary 8 Let G = (V, E) be an undirected graph with degree Δ, and let G_π be a priority dag induced on G by a random permutation π of V. Then, the expected length of the longest directed path in G_π is O(Δ + ln Δ ln V / ln ln V).  □

Lemma 9 Let G = (V, E) be a graph with degree Δ. For any random permutation π of V, the number of rounds in the message-passing model for Algorithm JP to vertex-color G is equal to the length of the longest directed path in the priority dag G_π induced on G by π.

PROOF. This proof is paraphrased from [37]. Algorithm JP requires at least as many rounds to color the graph G as the length k of the longest directed path p in the priority dag G_π. This is clear, since the processing of each vertex in p is dependent on its predecessor. Furthermore, Algorithm JP requires at most k rounds to color G. Suppose not. Then some vertex v_{k+1} is colored in round k + 1. Construct a sequence p' = (v_1, v_2, ..., v_{k+1}) backward by letting each v_i be a predecessor of v_{i+1} that is colored in round i, whenever such a predecessor exists. If every vertex in p' has such a predecessor, then p' is a (k + 1)-vertex directed path in G_π, contradicting the maximality of p. Otherwise, there is some vertex v' ∈ p' such that no predecessor of v' is colored in the round immediately preceding the round in which v' is colored. However, Algorithm JP colors every vertex v in the round immediately following the round of the last predecessor of v, which is again a contradiction.  □

Corollary 10 Let G = (V, E) be a graph with degree Δ. For any random permutation π of V, the expected number of rounds in the message-passing model for Algorithm JP to vertex-color G is O(Δ + ln Δ ln V / ln ln V).  □

Chapter 8

Log Degree Ordering Heuristic

This chapter introduces the log-degree ordering heuristic for greedy coloring algorithms. We prove that the log-degree ordering heuristic can be used by JP while increasing the span by only a logarithmic factor. The effectiveness of the ordering heuristic is evaluated empirically on a set of large real-world graphs by comparing the number of colors obtained when using different vertex-ordering heuristics.

Practitioners use ordering heuristics to reduce the number of colors that greedy coloring algorithms use [2, 9]. Ordering heuristics permute the vertices of a graph to help greedy coloring algorithms find colorings using a small number of colors.

Although for every graph G = (V, E) there exists a permutation of the vertices π* such that GREEDY(G = (V, E), π*) produces a coloring using the minimum number of colors, finding such a π* is well known to be NP-hard [29]. Several ordering heuristics have been developed with much success on real-world graphs, however. A particularly popular ordering heuristic is the degree-ordering heuristic [3, 23, 37], in which vertices are ordered in decreasing order of degree. More precisely, the degree-ordering heuristic permutes the graph G by π such that, for two vertices u, v ∈ V, we have π(v) < π(u) if deg(v) < deg(u). Because it is deterministic, the degree-ordering heuristic is vulnerable to adversarial input graphs which force GREEDY to produce a coloring with Δ + 1 colors. For instance, although the "crown" graph on |V| vertices is 2-colorable, GREEDY produces a |V|/2 coloring using degree ordering [36].

We introduce the log-degree ordering heuristic, which exhibits nearly the same asymptotic span as random ordering with MULTICORE-JP and generates colorings with approximately as few colors as degree ordering on real-world graphs, while maintaining robustness in the face of adversarial graphs. For the log-degree heuristic priority function π, we have π(v) < π(u) for two vertices u, v ∈ V if ⌈lg(deg(v))⌉ < ⌈lg(deg(u))⌉, or if ⌈lg(deg(v))⌉ = ⌈lg(deg(u))⌉ and RAND(v) < RAND(u), for some random permutation function RAND.
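The comparison underlying this priority function can be sketched as follows; the function names and the representation of RAND as a precomputed array of random values are illustrative assumptions, not the thesis's implementation.

#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest i such that 2^i >= d, i.e. ceil(lg d); defined as 0 for d <= 1.
static int ceil_lg(std::size_t d) {
  int i = 0;
  while ((std::size_t{1} << i) < d) ++i;
  return i;
}

// Returns true if vertex v has lower priority than vertex u under the
// log-degree heuristic, given each vertex's degree and a random permutation rand.
bool log_degree_less(std::size_t v, std::size_t u,
                     const std::vector<std::size_t>& degree,
                     const std::vector<std::uint64_t>& rand) {
  int lv = ceil_lg(degree[v]), lu = ceil_lg(degree[u]);
  if (lv != lu) return lv < lu;     // coarse bucket by ceil(lg(degree))
  return rand[v] < rand[u];         // random tie-break within a bucket
}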

Theorem 11 Using the log-degree heuristic priority function π, Algorithm MULTICORE-JP colors all vertices of a graph G = (V, E) with degree Δ in expected O(Δ ln Δ + ln³Δ ln V / ln ln V) span.

PROOF. Let V_i ⊆ V be the set of vertices such that ⌈lg(deg(v))⌉ = i for all v ∈ V_i. The log-degree heuristic guarantees that for every vertex v ∈ V_i and u ∈ V_j, we have π(v) < π(u) when i < j. Thus, the longest directed path p in the priority dag G_π induced on G by π must traverse some subset of vertices in V_{⌈lg Δ⌉}, then V_{⌈lg Δ⌉−1}, and so on down to V_1. The log-degree heuristic also guarantees that for two vertices v_1, v_2 ∈ V_i, the priorities π(v_1) and π(v_2) are uniformly random. It then follows from Corollary 8 that the expected length of the longest path through V_i is O(2^i + i ln V / ln ln V). By linearity of expectation, it follows that the expected length of the longest path p through V = {V_1, V_2, ..., V_{⌈lg Δ⌉}} is

    E[|p|] = Σ_{i=1}^{⌈lg Δ⌉} O(2^i + i ln V / ln ln V)
           = O(Δ + ln²Δ ln V / ln ln V) .

Finally, the theorem follows by Theorem 13.  □

Figure 8-1 illustrates the effect of different ordering heuristics on the depth of the priority dag G_π and on the number of colors used by MULTICORE-JP. For some graphs, such as web-Google and cit-Patents, the use of vertex-ordering heuristics appears to have little beneficial impact on the depth of the priority dag. For soc-LiveJournal and com-orkut, however, the input ordering induces a priority dag that is about double the depth of that induced by the random ordering. This highlights the importance of incorporating randomness into the vertex ordering to handle adversarial graphs. The log-degree ordering heuristic consistently induces a priority dag of similar depth to a random order. Furthermore, the number of colors used with the log-degree ordering is comparable to that of the degree-ordering heuristic. These results demonstrate that the log-degree ordering is an appealing compromise between the random- and degree-ordering heuristics. The log-degree ordering provides theoretical guarantees on the parallelism of JP while producing colorings of similar quality to the degree-ordering heuristic.

Graph              |V|          |E|           Δ        # Colors (IO / RO / DO / LDO)   Depth of G_π (IO / RO / DO / LDO)
soc-LiveJournal1   4,847,571    85,702,474    20,333   352 / 330 / 324 / 327           76 / 41 / 43 / 40
web-Google         916,428      8,644,102     6,332    44 / 44 / 45 / 44               24 / 30 / 29 / 28
com-orkut          3,072,627    234,370,166   33,313   175 / 129 / 87 / 99             98 / 41 / 47 / 45
cit-Patents        16,518,949   33,037,900    793      17 / 20 / 14 / 15               19 / 40 / 45 / 44
as-skitter         11,095,299   22,190,604    35,455   103 / 83 / 71 / 71              53 / 34 / 42 / 39

Figure 8-1: In practice, the number of colors needed by a greedy coloring algorithm can be reduced by applying vertex-ordering heuristics. We report the number of colors used by MULTICORE-JP and the depth of the induced priority dag G_π under four different ordering heuristics: Input order (IO), Random order (RO), Degree order (DO), and Log-degree order (LDO).

Chapter 9

Multicore Implementation of Jones-Plassmann

This chapter extends the Jones-Plassmann coloring algorithm to the shared-memory setting. While JP is fast in the message-passing model, there are two challenges which must be overcome to implement it efficiently on a modern shared-memory machine. The first challenge arises on line 1 of COLOR, when a vertex is to be colored, in finding the lowest available color quickly, efficiently, and in parallel. The second challenge arises in line 3 of COLOR when multiple workers contend for shared-memory locations in an attempt to discover whether a vertex is ready to be colored.

We address both of these challenges and present an implementation, MULTICORE-JP, which colors all vertices of a graph G = (V, E) given a random permutation π of V in expected O(Δ ln Δ + ln²Δ ln V / ln ln V) span and Θ(V + E) work in the CRQW contention model. Empirical results show that the multicore version of JP achieves 3-5x speedup on 12 cores over the optimized serial code when both use the log-degree ordering heuristic.

Finding the smallest available color

A simple method for coloring a vertex v scans its neighbors in parallel and marks the ith entry of a byte-array of size deg(v) if the ith color has been assigned to a neighbor. The byte-array can then be scanned to identify the smallest available color. This approach, however, can suffer from write contention if multiple processors attempt to write to the same memory location in the byte-array.

In MULTICORE-JP, an alternative approach which suffers from less memory contention is used to identify the smallest available color. In line 1 of MULTICORE-COLOR we assign v.color the lowest numbered color currently not taken by any vertex u ∈ Pred[v]. The procedure GET-COLOR solves the problem of quickly and efficiently assigning the color using a parallel integer sort [21] and a parallel MIN reduction [13, Ch. 27]. colors[i] and candidates[i] correspond to C_i and S_i, respectively, in Lemma 12.

Lemma 12 To assign the lowest available color to a vertex with degree Δ requires Θ(Δ) work and Θ(log Δ) span in the CRQW contention model.

PROOF. The lowest available color must lie in the interval [0, Δ], since there are at most Δ neighboring vertices, which can take at most Δ of the Δ + 1 possible colors. Let C_i be the minimum of Δ + 1 and the color of the ith predecessor in ascending sorted order. If C is not already sorted, it may be sorted using the method of Gibbons et al. in O(Δ) work and O(log Δ) span in the CRQW contention model [21]. If C_0 does not equal 0, then color 0 is available and we can select it. Otherwise, we proceed. We define the ith available color S_i as C_i + 1 if C_{i+1} > C_i + 1, and Δ + 1 otherwise, which represents a gap between the colors taken by the predecessors. Each available color can be computed in parallel in Θ(Δ) work and Θ(ln Δ) span, and the lowest available color is the least among them. Finally, we find the minimum color in parallel using a binary tree, collecting the S array pairwise with a MIN reduction [13, Ch. 27] taking Θ(Δ) work and Θ(log Δ) span.  □
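A serial sketch of the same idea follows: sort the predecessors' colors and return the first gap (or 0, or one past the largest color). It assumes nonnegative color values and uses an ordinary sort; the thesis's implementation instead uses a parallel integer sort and a parallel MIN-reduction.

#include <algorithm>
#include <cstddef>
#include <vector>

int lowest_available_color(std::vector<int> pred_colors) {
  if (pred_colors.empty()) return 0;
  std::sort(pred_colors.begin(), pred_colors.end());      // C_0 <= C_1 <= ...
  if (pred_colors.front() > 0) return 0;                   // color 0 is free
  for (std::size_t i = 0; i + 1 < pred_colors.size(); ++i)
    if (pred_colors[i + 1] > pred_colors[i] + 1)           // gap between C_i and C_{i+1}
      return pred_colors[i] + 1;
  return pred_colors.back() + 1;                           // no gap: one past the maximum
}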

Assigning responsibility for vertex coloring

The second implementation challenge occurs in line 3 of MULTICORE-COLOR(v), where the caller detects whether or not all predecessors Pred[v_succ] of a vertex v_succ ∈ Succ[v] have been colored and, crucially, whether or not the caller of IS-LAST-TO-ARRIVE(v_succ) is responsible for coloring v_succ. For correctness, exactly one call to the method IS-LAST-TO-ARRIVE(v) must return TRUE for each vertex v ∈ V, which implies that the method must appear to take place atomically [34, Ch. 2]. A naive way to implement IS-LAST-TO-ARRIVE is to use an atomic counter v.counter for each vertex v ∈ V, which is initialized to |Pred[v]|. Then, v.counter may be decremented via an atomic FETCH-AND-DECREMENT(v.counter) call upon each call of IS-LAST-TO-ARRIVE(v). If the return value of FETCH-AND-DECREMENT(v.counter) is 1, then the caller is indeed the last to arrive and is responsible for coloring vertex v. This implementation has two drawbacks. First, it requires the use of an atomic read-modify-write instruction (and therefore a strictly more powerful machine [33]), since it is not possible to provide mutual exclusion to an arbitrary number of threads with Θ(1) state without such an instruction [34]. Second, an atomic counter occupying a single location in memory creates the opportunity for memory contention, as each worker calling IS-LAST-TO-ARRIVE(v) must obtain exclusive ownership [55] over the memory location containing v.counter prior to executing FETCH-AND-DECREMENT(v.counter), which creates an O(Δ) delay due to memory contention.
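The naive counter-based variant described above can be sketched in a few lines using std::atomic; this is an illustrative sketch of the approach the tournament is designed to replace, not the thesis's implementation.

#include <atomic>
#include <cstddef>

struct VertexState {
  std::atomic<std::size_t> counter{0};   // set to |Pred[v]| before coloring begins
};

void init_vertex(VertexState& v, std::size_t num_predecessors) {
  v.counter.store(num_predecessors, std::memory_order_relaxed);
}

// Returns true for exactly one caller per vertex: the one whose atomic
// fetch-and-decrement observes the value 1, i.e. the last predecessor to arrive.
bool is_last_to_arrive(VertexState& v) {
  return v.counter.fetch_sub(1, std::memory_order_acq_rel) == 1;
}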

MULTICORE-JP(G = (V, E), π)
1  parallel for all v ∈ V
2      INIT-VERTEX(v)
3  parallel for all v ∈ V such that Pred[v] == ∅
4      MULTICORE-COLOR(v)

MULTICORE-COLOR(v)
1  v.color = GET-COLOR(v)
2  parallel for all v_succ ∈ Succ[v]
3      if IS-LAST-TO-ARRIVE(v_succ, v)
4          MULTICORE-COLOR(v_succ)

Figure 9-1: MULTICORE-JP is an implementation of the basic recursive structure of Algorithm JP adapted for a modern shared-memory multicore computer. That is, algorithms that comprehend memory contention and computations local to each vertex are developed to make the methods IS-LAST-TO-ARRIVE and GET-COLOR, respectively, fast and efficient. INIT-VERTEX creates the binary tournament structures which are used by IS-LAST-TO-ARRIVE.

We implement IS-LAST-TO-ARRIVE(v) using a tournament, which quickly and efficiently decides if the caller should color vertex v ∈ V. Figure 9-2 depicts a tournament, where the dark circles denote a loss and an arrow denotes a win. A tournament resembles a binary tournament barrier [32], except that for our purposes only one worker needs to know that all predecessors of a vertex have been colored, whereas a barrier forces all participating workers to wait. We define the winner of a node in the tournament as the last to arrive. Any worker which finds that it is first to arrive loses and is free to do other work, satisfied that some other predecessor will be responsible for coloring the vertex. The winner moves toward the root of the tree, competing at each node until it loses or wins the root node, in which case it is responsible for coloring the vertex. We determine the ordering of the two workers at a node in the tournament using Peterson's algorithm for two-thread mutual exclusion [57], which does not require the use of an atomic read-modify-write instruction. The single shared variable in Peterson's algorithm is the only memory location in the tournament that is written concurrently and by exactly two workers. Furthermore, the critical section of each node in the tournament is held for Θ(1) time, yielding bounded contention.

IS-LAST-TO-ARRIVE(v, v_pred)
1  i = v.leaf-index[v_pred]
2  while i is not the root of the tournament tree
3      i = i.parent
4      if I am first to arrive at node i
5          return FALSE
6  return TRUE

GET-COLOR(v)
1  if Pred[v] == ∅
2      return 0
3  colors = PARALLEL-INTEGER-SORT(colors of Pred[v])
4  if colors[0] > 0
5      return 0
6  parallel for i = 0 to |Pred[v]| − 2
7      if colors[i + 1] > colors[i] + 1
8          candidates[i] = colors[i] + 1
9      else
10         candidates[i] = |Pred[v]| + 1
11 candidates[|Pred[v]| − 1] = colors[|Pred[v]| − 1] + 1
12 return PARALLEL-FIND-MIN(candidates)

In line 1 of IS-LAST-TO-ARRIVE(v, v_pred), the local variable i is assigned the value v.leaf-index[v_pred], an assignment which maps each predecessor of v to a unique leaf in the tournament tree. During the initialization phase in line 2 of MULTICORE-JP, the mapping v.leaf-index[u] is defined for each vertex v ∈ V and u ∈ Pred[v] via a hash table which is constructed in Θ(|Pred[v]|) work and span.

Figure 9-2: A tournament before and after the last predecessor calls IS-LAST-TO-ARRIVE. The legend specifies the order in which each predecessor arrived. Dark circles indicate that the predecessor arrived first at the node in question, whereas arrows indicate that the predecessor arrived last and proceeded on toward the root.

Theorem 13 Given any arbitrary permutation π of the vertices V, if Algorithm JP requires O(R) rounds to color a graph G = (V, E) with degree Δ in the message-passing model, then Algorithm MULTICORE-JP has O(R ln Δ) span in the CRQW contention model.

PROOF. By Lemma 9, the number of rounds in the message-passing model required by JP to color the graph G equals the length of the longest directed path through the priority dag G_π induced on G by π. There are at most Δ participants in the tournament of any particular vertex. A tournament is a balanced binary tree, thus the number of steps through any such tournament is O(log Δ), each of which takes Θ(1) time. The winner of the tournament for a vertex v assigns a color using the method GET-COLOR in O(ln Δ) span by Lemma 12. Thus, there is at most O(ln Δ) delay through any vertex, and Algorithm MULTICORE-JP colors G with permutation π with O(R ln Δ) span in the CRQW contention model.  □

Corollary 14 Given a random permutation π of the vertices in a graph G = (V, E), Algorithm MULTICORE-JP colors all vertices in expected O(Δ ln Δ + ln²Δ ln V / ln ln V) span in the CRQW contention model.
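Corollary 14 follows by substituting the expected round bound R = O(Δ + ln Δ ln V / ln ln V) of Corollary 10 into Theorem 13; the one-line calculation, written out in LaTeX, is:

\mathrm{span} = O(R \ln \Delta)
  = O\!\left(\left(\Delta + \frac{\ln \Delta \,\ln V}{\ln\ln V}\right)\ln \Delta\right)
  = O\!\left(\Delta \ln \Delta + \frac{\ln^{2}\!\Delta \,\ln V}{\ln\ln V}\right).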

Theorem 15 The expected work required for MULTICORE-JP to color all vertices of a graph G = (V, E) is Θ(V + E), and thus MULTICORE-JP is work-efficient.

PROOF. Consider the tournament tree for a vertex v with |Pred[v]| participants and exactly 2(|Pred[v]| − 1) internal edges (it is a complete binary tree). As exactly one winner follows each edge in the tournament, the work associated with it is Θ(|Pred[v]|). As well, the coloring of vertex v requires Θ(|Pred[v]|) work by Lemma 12. Every edge in G participates in exactly one tournament, thus the total work in all tournaments and vertex colorings is Θ(E). Because every vertex in G is processed exactly once, it follows that the expected total work required for MULTICORE-JP to color every vertex of a graph G is Θ(V + E).  □

Empirical evaluation

The performance of our multicore implementation of the Jones-Plassmann algorithm was evaluated on a set of five benchmark graphs and four vertex-ordering heuristics. The algorithm was compared against a serial greedy coloring algorithm that utilized the same ordering heuristics. The vertex ordering is provided to each algorithm in a convenient format: the serial code is provided a sorted list of vertex identifiers, and MULTICORE-JP is provided a mapping of vertex id to a priority. Providing the serial code a presorted list is, in a sense, giving it an advantage, since in practice it may be necessary for it to perform a sort. This experimental decision, however, allows the results to be more easily interpreted.

Both implementations were optimized to obtain the best possible performance on our experimental machine. In particular, the GET-COLOR routine for both MULTICORE-JP and GREEDY was optimized to track the assignment of colors to neighbors using a 64-bit bit-vector, which produces a color in Θ(1) time if the color is in {0, ..., 63}. In addition, MULTICORE-JP makes use of software prefetching to parallelize access to the successors in line 2 of MULTICORE-COLOR. Finally, each leaf of the tournament structure in IS-LAST-TO-ARRIVE resolves Θ(1) predecessors, allowing us to merely hash the vertex number to a leaf rather than store the mapping in v.leaf-index. This optimization reduces overhead while still bounding the contention. Figure 8-1 suggests that the log-degree ordering heuristic indeed delivers a number of colors comparable to degree ordering without compromising performance relative to random ordering.
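The 64-bit bit-vector fast path mentioned above can be sketched as follows. This is an illustrative, simplified version of the optimization rather than the exact routine used in the implementation; a real GET-COLOR would fall back to the general procedure when all 64 low colors are taken.

#include <cstdint>
#include <vector>

// Returns the lowest color in {0, ..., 63} not used by any colored neighbor,
// or -1 if every color in that range is taken (caller falls back to the slow path).
int lowest_color_via_bitvector(const std::vector<int>& neighbor_colors) {
  std::uint64_t taken = 0;
  for (int c : neighbor_colors)
    if (c >= 0 && c < 64) taken |= (std::uint64_t{1} << c);
  for (int c = 0; c < 64; ++c)
    if (((taken >> c) & 1u) == 0) return c;   // lowest zero bit = lowest free color
  return -1;
}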

                                  Input (IO)        Random (RO)       Degree (DO)       Log-degree (LDO)
Graph                             Ts      T1        Ts      T1        Ts      T1        Ts      T1
  |V|  |E|  Δ                     Ts/T12  T1/T12    Ts/T12  T1/T12    Ts/T12  T1/T12    Ts/T12  T1/T12

soc-LiveJournal1                  0.89    2.79      1.72    2.9       1.3     2.46      1.71    2.5
  4,847,571  85,702,474  20,333   1.8     5.61      3.51    5.92      3.29    6.23      4.35    6.35

web-Google                        0.11    0.27      0.17    0.29      0.14    0.25      0.17    0.25
  916,428  8,644,102  6,332       1.7     4.21      4.27    7.07      3.81    6.94      4.64    6.89

com-orkut                         2.32    9.19      3.28    10.09     2.89    8.27      3.25    8.62
  3,072,627  234,370,166  33,313  1.18    4.67      1.96    6.03      2.03    5.83      2.3     6.11

cit-Patents                       0.61    1.57      1.78    1.73      0.85    1.64      1.65    1.66
  16,518,949  33,037,900  793     2.58    6.65      7.68    7.48      3.79    7.37      7.17    7.22

as-skitter                        0.22    0.74      0.88    0.77      0.29    0.64      0.71    0.64
  11,095,299  22,190,604  35,455  1.5     5.09      6.57    5.74      2.28    5.1       5.65    5.13

Figure 9-3: Empirical comparison of an optimized serial greedy coloring algorithm with MULTICORE-JP. For each graph and ordering heuristic, the top row gives the serial runtime Ts and the one-core runtime T1 of MULTICORE-JP in seconds, and the bottom row gives the corresponding 12-core speedups Ts/T12 and T1/T12.

The results of this evaluation are summarized in Figure 9-3. The time Ts provides the runtime of an optimized serial greedy coloring algorithm, T1 provides the runtime of the MULTICORE-JP algorithm when run on a single core, and T12 provides the runtime of MULTICORE-JP when run on 12 cores. For the random, degree, and log-degree orderings, MULTICORE-JP achieves 3-5x speedup on 12 cores over the optimized serial code. It appears to perform a bit worse on the input ordering, obtaining only 1.5-3x speedup. This difference is likely due to the serial code's more predictable memory access pattern when using the input order.

Chapter 10

Conclusion

In this thesis, I have shown how chromatic scheduling and deterministic parallel coloring algorithms can be used to implement dynamic data-graph computations that provide update atomicity and guarantee determinism. The latter benefit of determinism can be obtained at low cost through the use of the MULTICORE-JP algorithm. These techniques, therefore, can be used to improve both the performance and semantic guarantees of existing systems including GraphLab and GraphChi.

Several avenues remain to be explored more deeply in future work. There are opportunities to further improve the performance of data-graph computations in the multicore setting. It is likely, for example, that data-graph computations would benefit from a scheduling algorithm that attempts to use each processor's local cache optimally.

Another interesting area for future work is the use of chromatic scheduling for dynamic data-graph computations in distributed systems. Scaling data-graph computations to a cluster environment requires that one implement fault-tolerance mechanisms to maintain correctness when individual nodes fail. The algorithms presented in this thesis may be utilized to guarantee that each node in such a cluster is deterministic, allowing for simpler fault-tolerance strategies. A replicated state machine, for example, could be used for each node in the cluster. Alternatively, logs could be maintained for each node recording its communication with other nodes in the cluster, allowing a node to be "restored" by simply playing back the communication logs to a new machine.

Bibliography

[1] L. Adams and J. Ortega. A multi-color SOR method for parallel computation. In

International Conference on Parallel Processing, pages 53-56, 1982.

[2] Md. Mostofa Ali Patwary, Assefaw H. Gebremedhin, and Alex Pothen. New multithreaded ordering and coloring algorithms for multicore architectures. In

Proceedings of the 17th international conference on Parallel processing - Volume Part II, Euro-Par'11, pages 250-262, Berlin, Heidelberg, 2011. Springer-Verlag.

[3] J. R. Allwright, R. Bordawekar, P. D. Coddington, K. Dincer, and C. L. Martin.

A comparison of parallel graph coloring algorithms, 1995.

[4] Noga Alon, László Babai, and Alon Itai. A fast and simple randomized parallel

algorithm for the maximal independent set problem. J. Algorithms, 7:567-583, December 1986.

[5] Leonid Barenboim and Michael Elkin. Distributed (Δ + 1)-coloring in linear (in Δ) time. In 41st Annual ACM Symposium on Theory of Computing, STOC '09, pages 111-120, New York, NY, USA, 2009. ACM.

[6] Dimitri P. Bertsekas and John N. Tsitsiklis. Parallel and Distributed

Computation: Numerical Methods. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1989.

[7] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PAR-

SEC benchmark suite: Characterization and architectural implications. In

Proceedings of the 17th International Conference on Parallel Architectures and

Compilation Techniques, October 2008.

[8] Doruk Bozdağ, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, and

Ümit V. Çatalyürek. A framework for scalable greedy coloring on distributed-

memory parallel computers. Journal of Parallel and Distributed Computing, 68(4):515-535, 2008.

[9] Doruk Bozdağ, Assefaw H. Gebremedhin, Fredrik Manne, Erik G. Boman, and

Ümit Çatalyürek. A framework for scalable greedy coloring on distributed

memory parallel computers. Journal of Parallel and Distributed Computing, 68(4):515-535, 2008.

[10] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Comput. Netw. ISDN Syst., 30(1-7):107-117, April 1998.

[11] Richard Cole and Ofer Zajicek. The APRAM: Incorporating asynchrony into

the PRAM model. In Proceedings of the 1989 ACM Symposium on Parallel

Algorithms and Architectures, pages 169-178, Santa Fe, New Mexico, June 1989.

[12] Stephen A. Cook. A taxonomy of problems with fast parallel algorithms. Inf.

Control, 64:2-22, March 1985.

[13] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.

Introduction to Algorithms. The MIT Press, third edition, 2009.

[14] Joseph C. Culberson. Iterated greedy graph coloring and the difficulty landscape.

Technical report, University of Alberta, 1992.

[15] J E Dennis Jr. and Trond Steihaug. On the successive projections approach to

least-squares problems. SIAM J. Numer. Anal., 23(4):717-733, August 1986.

[16] Mingdong Feng and Charles E. Leiserson. Efficient detection of determinacy

races in Cilk programs. In Proceedings of the Ninth Annual ACM Symposium

on Parallel Algorithms and Architectures (SPAA), pages 1-11, June 1997.

[17] Michael R. Garey and David S. Johnson. Computers and Intractability. W.H. Freeman and Company, 1979.

[18] Alan E. Gelfand and Adrian F. M. Smith. Sampling-based approaches to cal-

culating marginal densities. Journal of the American Statistical Association, 85(410):398-409, 1990.

[19] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distribu-

tions, and the Bayesian restoration of images. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-6(6):721 -741, November 1984.

[20] Phillip B. Gibbons. A more practical PRAM model. In ACM Symposium on Parallel Algorithms and Architectures, pages 158-168, June 1989.

[21] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. Efficient low-contention parallel algorithms. In Proceedings of the sixth annual ACM

symposium on Parallel algorithms and architectures, SPAA '94, pages 236-247, New York, NY, USA, 1994. ACM.

[22] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The QRQW

PRAM: Accounting for contention in parallel algorithms. In Proceedings of the

Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 638-648, Arlington, Virginia, January 1994.

[23] Robert K. Gjertsen Jr., Mark T. Jones, and Paul E. Plassmann. Parallel heuris-

tics for improved, balanced graph colorings. Journal of Parallel and Distributed Computing, 37:171-186, 1996.

[24] Andrew V. Goldberg, Serge A. Plotkin, and Gregory E. Shannon. Parallel

symmetry-breaking in sparse graphs. In SIAM J. Disc. Math, pages 315-324, 1987.

[25] Mark Goldberg and Thomas Spencer. A new parallel algorithm for the maximal

independent set problem. SIAM Journal on Computing, 18:419-427, April 1989.

[26] Gene H. Golub and William Kahan. Calculating the singular values and pseudo-

inverse of a matrix. Journal of the Society of Industrial and Applied Mathematics

Series B: Numerical Analysis, 1964.

[27] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Car-

los Guestrin. Powergraph: distributed graph-parallel computation on nat-

ural graphs. In USENIX Conference on Operating Systems Design and Implementation, OSDI, pages 17-30, Berkeley, CA, USA, 2012. USENIX As-

sociation.

[28] Raymond Greenlaw, James H. Hoover, and Walter L. Ruzzo. Limits to Parallel

Computation: P-Completeness Theory. Oxford University Press, USA, April 1995.

[29] Frank Harary. Graph Theory. Addison-Wesley, 1972.

[30] William Hasenplaugh, Tim Kaler, Charles E. Leiserson, and Tao B. Schardl.

Chromatic scheduling of dynamic data-graph computations, including parallel

coloring of arbitrary-degree graphs. Test, 2013.

[31] Yuxiong He, Charles E. Leiserson, and William M. Leiserson. The cilkview scalability analyzer. In SPAA, pages 145-156, 2010.

[32] Debra Hensgen, Raphael Finkel, and Udi Manber. Two algorithms for barrier

synchronization. Int. J. Parallel Program., 17(1):1-17, February 1988.

[33] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1):124-149, January 1991.

[34] Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Mor-

gan Kaufmann Publishers Inc., San Francisco, CA, USA, 2008.

[35] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products.

Journal of Mathematical Physics, 1927.

[36] David S. Johnson. Worst case behavior of graph coloring algorithms. In

Proceedings of the 5th Southeastern Conference on Combinatorics, Graph

Theory, and Computing, Winnipeg, Ontario, Canada, 1974.

[37] Mark T. Jones and Paul E. Plassmann. A parallel graph coloring heuristic. SIAM Journal on Scientific Computing, 14(3):654-669, May 1993.

[38] D. P. Koester, S. Ranka, and G. C. Fox. A parallel Gauss-Seidel algorithm for

sparse power system matrices. In SuperComputing '94, pages 184-193, 1994.

[39] Fabian Kuhn. Weak graph colorings: distributed algorithms and applications. In Proceedings of the Twenty-First Annual Symposium on Parallelism in Algorithms and Architectures, SPAA '09, pages 138-144, 2009.

[40] Fabian Kuhn and Roger Wattenhofer. On the complexity of distributed graph coloring. In PODC, pages 7-15, 2006.

[41] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph computation on just a PC. In USENIX Conference on Operating Systems Design and Implementation, OSDI, pages 31-46, Berkeley, CA, USA, 2012. USENIX Association.

[42] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney.

Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. CoRR, abs/0810.1355, 2008.

[43] Nathan Linial. Distributive graph algorithms global solutions from local data. In

Proceedings of the 28th Annual Symposium on Foundations of Computer Science, SFCS '87, pages 331-335, Washington, DC, USA, 1987. IEEE Computer Society.

[44] Nathan Linial. Locality in distributed graph algorithms. SIAM J. Comput., 21(1):193-201, February 1992.

[45] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: a framework for machine

learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8):716-727, April 2012.

[46] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716-727, April 2012.

[47] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new parallel framework for machine

learning. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2010.

[48] Michael Luby. A simple parallel algorithm for the maximal independent set problem. SIAM Journal on Computing, 15(4):1036-1053, 1986.

[49] Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-

scale graph processing. In Proceedings of the 2010 ACM SIGMOD International

Conference on Management of data, SIGMOD '10, pages 135-146, New York, NY, USA, 2010. ACM.

[50] Andrew McCallum. Cora data set. Available from http://people.cs.umass.edu/~mccallum/data.html.

[51] Tom Mitchell. NPIC500 data set. Available from http://www.cs.cmu.edu/~tom/10709_fall09/NPIC500.pdf, 2009.

[52] Kevin P. Murphy, Yair Weiss, and Michael I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of Uncertainty in AI, pages 467-475, 1999.

[53] Robert H. B. Netzer and Barton P. Miller. What are race conditions? ACM Letters on Programming Languages and Systems, 1(1):74-88, March 1992.

[54] Kamal Nigam and Rayid Ghani. Analyzing the effectiveness and applicability of co-training. In Proceedings of the 9th International Conference on Information

and Knowledge Management, CIKM '00, pages 86-93, 2000.

[55] Mark S. Papamarcos and Janak H. Patel. A low-overhead coherence solution for multiprocessors with private cache memories. In Proceedings of the 11th annual

international symposium on Computer architecture, ISCA '84, pages 348-354, New York, NY, USA, 1984. ACM.

[56] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible

inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[57] G. L. Peterson. Myths about the mutual exclusion problem. Information Processing Letters, 12(3):115-116, June 1981.

[58] Parag Singla and Pedro Domingos. Entity resolution with markov logic. In In ICDM, pages 572-582. IEEE Computer Society Press, 2006.

[59] A. M. Turing. Rounding-off errors in matrix processes. The Quarterly Journal of Mechanics and Applied Mathematics, 1(1):287-308, 1948.

[60] D. J. A. Welsh and M. B. Powell. An upper bound for the chromatic number

of a graph and its application to timetabling problems. The Computer Journal, 10(1):85-86, 1967.

[61] David P. Williamson and David B. Shmoys. The Design of Approximation

Algorithms. Cambridge University Press, New York, 2011.
