Understanding and Improving Graph Algorithm Performance
Scott Beamer
Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2016-153
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-153.html

September 8, 2016

Copyright © 2016, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Understanding and Improving Graph Algorithm Performance

by

Scott Beamer III

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor Krste Asanović, Co-chair
Professor David Patterson, Co-chair
Professor James Demmel
Professor Dorit Hochbaum

Fall 2016

Understanding and Improving Graph Algorithm Performance
Copyright 2016 by Scott Beamer III

Abstract

Understanding and Improving Graph Algorithm Performance
by Scott Beamer III
Doctor of Philosophy in Computer Science
University of California, Berkeley
Professor Krste Asanović, Co-chair
Professor David Patterson, Co-chair

Graph processing is experiencing a surge of renewed interest as applications in social networks and their analysis have grown in importance. Additionally, graph algorithms have found new applications in speech recognition and the sciences. In order to deliver the full potential of these emerging applications, graph processing must become substantially more efficient, as graph processing's communication-intensive nature often results in low arithmetic intensity that underutilizes available hardware platforms.

To improve graph algorithm performance, this dissertation characterizes graph processing workloads on shared memory multiprocessors in order to understand graph algorithm performance. By querying performance counters to measure utilizations on real hardware, we find that contrary to prevailing wisdom, caches provide great benefit for graph processing and the systems are rarely memory bandwidth bound. Leveraging the insights of our workload characterization, we introduce the Graph Algorithm Iron Law (GAIL), a simple performance model that allows for reasoning about tradeoffs across layers by considering algorithmic efficiency, cache locality, and memory bandwidth utilization. We also provide the Graph Algorithm Platform (GAP) Benchmark Suite to help the community improve graph processing evaluations through standardization.

In addition to understanding graph algorithm performance, we make contributions to improve graph algorithm performance. We present our direction-optimizing breadth-first search algorithm that is advantageous for low-diameter graphs, which are becoming increasingly relevant as social network analysis becomes more prevalent. Finally, we introduce propagation blocking, a technique to reduce memory communication on cache-based systems by blocking graph computations in order to improve spatial locality.
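As a rough sketch of the GAIL decomposition summarized in the abstract (Chapter 6 gives the precise formulation; the factor names below are descriptive rather than the chapter's notation), the model follows the spirit of the processor iron law and factors a graph kernel's execution time as

\[
  \frac{\text{time}}{\text{kernel}}
  = \frac{\text{traversed edges}}{\text{kernel}}
  \times \frac{\text{memory requests}}{\text{traversed edge}}
  \times \frac{\text{time}}{\text{memory request}}
\]

where the first factor captures algorithmic efficiency, the second cache locality, and the third memory bandwidth utilization.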
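The direction-optimizing search admits a compact sketch as well. The sequential C++ below is illustrative only, not the GAP Benchmark Suite reference code: Chapter 3's heuristic weighs the frontier's outgoing edges against those of the still-unexplored vertices, whereas this sketch simplifies the test to the total edge count, and all identifiers are hypothetical. The default thresholds mirror the α = 15 and β = 18 selections noted in the list of figures below.

// Illustrative direction-optimizing BFS sketch (not the GAPBS implementation).
// A vertex v is unvisited while parents[v] == -1, as in Figure 3.2.
#include <cstdint>
#include <iostream>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // undirected adjacency lists

std::vector<int> DOBFS(const Graph& g, int source,
                       int64_t alpha = 15, int64_t beta = 18) {
  const int n = static_cast<int>(g.size());
  int64_t m = 0;  // total directed edge count (each undirected edge counted twice)
  for (const auto& nbrs : g) m += static_cast<int64_t>(nbrs.size());

  std::vector<int> parents(n, -1);
  parents[source] = source;
  std::vector<int> frontier{source};

  while (!frontier.empty()) {
    // Count edges leaving the frontier to decide which direction to run.
    int64_t mf = 0;
    for (int u : frontier) mf += static_cast<int64_t>(g[u].size());
    bool bottom_up = mf * alpha > m;  // frontier holds more than 1/alpha of all edges
    if (static_cast<int64_t>(frontier.size()) * beta < n)
      bottom_up = false;              // frontier is tiny: stay (or go back) top-down

    std::vector<int> next;
    if (bottom_up) {
      // Bottom-up step: each unvisited vertex scans its neighbors and stops
      // at the first parent it finds in the frontier.
      std::vector<bool> in_frontier(n, false);
      for (int u : frontier) in_frontier[u] = true;
      for (int v = 0; v < n; v++) {
        if (parents[v] != -1) continue;
        for (int u : g[v]) {
          if (in_frontier[u]) { parents[v] = u; next.push_back(v); break; }
        }
      }
    } else {
      // Top-down step: each frontier vertex claims its unvisited neighbors.
      for (int u : frontier) {
        for (int v : g[u]) {
          if (parents[v] == -1) { parents[v] = u; next.push_back(v); }
        }
      }
    }
    frontier.swap(next);
  }
  return parents;
}

int main() {
  // Tiny 6-vertex undirected example graph.
  Graph g = {{1, 2}, {0, 3}, {0, 3, 4}, {1, 2, 5}, {2}, {3}};
  for (int p : DOBFS(g, 0)) std::cout << p << " ";
  std::cout << "\n";
}

The top-down step does work proportional to the edges leaving the frontier, while the bottom-up step lets each unvisited vertex stop at the first parent it finds, which is why the bottom-up direction wins once the frontier spans a large fraction of the graph.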
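Propagation blocking can be sketched just as briefly. The dissertation develops it in the context of PageRank (Chapter 7); the version below is a minimal illustration under assumed names (PBUpdate, bin_width), not the dissertation's implementation. Each contribution is first appended to a bin chosen by its destination's vertex range, and only afterwards are the bins drained, so the irregular writes of the scatter phase are confined to a cache-sized window of the output array.

// Illustrative propagation-blocking sketch for a PageRank-style update
// (hypothetical names; not the dissertation's code).
#include <iostream>
#include <utility>
#include <vector>

using Graph = std::vector<std::vector<int>>;  // out-neighbor lists

std::vector<double> PBUpdate(const Graph& g, const std::vector<double>& scores,
                             int bin_width = 1024) {
  const int n = static_cast<int>(g.size());
  const int num_bins = (n + bin_width - 1) / bin_width;
  std::vector<std::vector<std::pair<int, double>>> bins(num_bins);

  // Phase 1 (binning): stream over the graph and append (destination, value)
  // propagations to the bin that owns the destination's vertex range.
  for (int u = 0; u < n; u++) {
    if (g[u].empty()) continue;
    const double contribution = scores[u] / static_cast<double>(g[u].size());
    for (int v : g[u])
      bins[v / bin_width].push_back({v, contribution});
  }

  // Phase 2 (accumulation): drain each bin; all writes land in one
  // bin_width-sized slice of sums, giving good spatial locality.
  std::vector<double> sums(n, 0.0);
  for (const auto& bin : bins)
    for (const auto& [v, value] : bin)
      sums[v] += value;
  return sums;
}

int main() {
  // Tiny example: three vertices, uniform initial scores.
  Graph g = {{1, 2}, {2}, {0}};
  std::vector<double> scores = {1.0, 1.0, 1.0};
  for (double s : PBUpdate(g, scores, /*bin_width=*/2)) std::cout << s << " ";
  std::cout << "\n";
}

The bins themselves are filled sequentially, so both phases stream through memory rather than scattering across it.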
To my family

Contents

Contents
List of Figures
List of Tables

1  Introduction
   1.1  Focus on Shared Memory Multiprocessor Systems
   1.2  Thesis Overview

2  Background on Graph Algorithms
   2.1  The Graph Abstraction
   2.2  Topological Properties of Graphs
   2.3  Graphs in this Work
   2.4  Graph Algorithms in this Work
   2.5  Other Types of Graph Algorithms

3  Direction-Optimizing Breadth-First Search
   3.1  Introduction
   3.2  Conventional Top-Down BFS
   3.3  Bottom-Up BFS
   3.4  Direction-Optimizing BFS Algorithm
   3.5  Evaluation
   3.6  Discussion
   3.7  Related Work
   3.8  Conclusion

4  GAP Benchmark Suite
   4.1  Introduction
   4.2  Related Work
   4.3  Benchmark Specification
   4.4  Reference Implementation
   4.5  Conclusion

5  Graph Workload Characterization
   5.1  Introduction
   5.2  Methodology
   5.3  Memory Bandwidth Potential
   5.4  Single-Core Analysis
   5.5  Parallel Performance
   5.6  NUMA Penalty
   5.7  Limited Room for SMT
   5.8  Related Work
   5.9  Conclusion

6  GAIL: Graph Algorithm Iron Law
   6.1  Introduction
   6.2  Graph Algorithm Iron Law
   6.3  Case Studies Using GAIL
   6.4  Using GAIL to Guide Development
   6.5  Frequently Asked Questions
   6.6  Conclusion

7  Propagation Blocking
   7.1  Introduction
   7.2  Background on PageRank
   7.3  Locality Challenges for PageRank
   7.4  Propagation Blocking
   7.5  Communication Model
   7.6  Evaluation
   7.7  Implementation Considerations
   7.8  Related Work
   7.9  Conclusion

8  Conclusion
   8.1  Summary of Contributions
   8.2  Future Work

Bibliography

A  Most Commonly Evaluated Graph Kernels
B  GBSP: Graph Bulk Synchronous Parallel

List of Figures

1.1  Performance per core versus core count for June 2015 Graph 500 rankings [66]. Performance measures traversed edges per second (TEPS) executing breadth-first search.
3.1  Conventional BFS algorithm
3.2  Single step of top-down approach. Vertex v unvisited if parents[v] = −1.
3.3  Classification of neighbors of frontier in terms of absolute totals (left) or normalized per depth (right) for a BFS traversal on kron graph (synthetically-generated social network of 128M vertices and 2B undirected edges). A single undirected edge can result in two neighbors of the frontier if both endpoints are in the frontier.
3.4  Categorization of the neighbors of the highlighted vertex for an example BFS traversal. The vertices are labelled with their depths from the BFS, and the arrows point from child to parent from the BFS.
3.5  Single step of bottom-up approach
3.6  Example search on kron graph (synthetically-generated social network of 128M vertices and 2B undirected edges) on IVB platform from same source vertex as Figure 3.3. The left subplot is the time per depth if done top-down or bottom-up and the right subplot is the size of the frontier per depth in terms of either vertices or edges. The time for the top-down approach closely tracks the number of edges in the frontier, and depths 2 and 3 constitute the majority of the runtime since they constitute the majority of the edges.
3.7  Finite-state machine to control the direction-optimizing algorithm. When transitioning between approaches, the frontier must be converted from a queue (top-down) to a bitmap (bottom-up) or vice versa. The growing condition indicates the number of active vertices in the frontier (n_f) increased relative to the last depth.
3.8  Speedups on IVB relative to top-down-bitmap.
3.9  Performance of hybrid-heuristic on each graph relative to its best performance on that graph for the range of α examined. We select α = 15.
3.10  Performance of hybrid-heuristic on each graph relative to its best performance on that graph for the range of β examined. We select β = 18.
3.11  Comparison of edges examined by top-down and bottom-up on example graph when frontier is at depth 1. Vertices labelled with their final depth.
3.12  Types of edge examinations for top-down versus direction-optimizing for a single search on flickr graph. The direction-optimizing approach switches to the bottom-up direction at depth 5 and back to the top-down direction at depth 9. Due to extremely small frontier sizes, depths 0–4 and 9–17 are difficult to distinguish; however, for these depths the direction-optimizing approach traverses in the top-down direction, so the edge examinations are equivalent to the top-down baseline.
3.13  Categorization of edge examinations by hybrid-heuristic
3.14  Categorization of execution time by hybrid-heuristic
3.15  Speedups on IVB versus reductions in edges examined for hybrid-heuristic relative to top-down-bitmap baseline.
5.1  Parallel pointer-chase microbenchmark for k-way MLP
5.2  Memory bandwidth achieved by parallel pointer chase microbenchmark (random) in units of memory requests per second (left axis) or equivalent effective MLP (right axis) versus the number of parallel chases (application MLP). Single core using 1 or 2 threads and differing memory allocation locations (local, remote, and interleave).
5.3  Memory bandwidth achieved by parallel pointer chase microbenchmark with varying number of nops inserted (varies IPM). Using a single thread with differing numbers of parallel chases (application MLP).
5.4  Impact of 2 MB and 1 GB page sizes on memory bandwidth achieved by single-thread parallel pointer chase for array sizes of small (1 GB) and large (16 GB).
5.5  Random bandwidth of a socket or the whole system with different memory allocations.
5.6  Single-thread performance in terms of instructions per cycle (IPC) of full workload colored by: codebase (top), kernel (middle), and input graph (bottom).
5.7  Single-thread performance of full workload relative to branch misprediction rate colored by memory bandwidth utilization.
5.8  Single-thread achieved memory bandwidth of full workload relative to instructions per miss (IPM). Note: some points from road & web are not visible due to IPM > 1000, but the model continues to serve as an upper bound.
5.9  Histogram of MPKI (in terms of LLC misses) of full workload specified in Section 5.2 executing on a single thread. Most executions have great locality (low MPKI), especially those processing the web or road input graphs.
5.10  Single-thread achieved memory bandwidth of GAPBS for all kernels and graphs varying the operating system page size.