UNIVERSITY OF CALIFORNIA, IRVINE

On Software Infrastructure for Scalable Graph Analytics

DISSERTATION

submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in Computer Science

by

Yingyi Bu

Dissertation Committee:
Professor Michael J. Carey, Chair
Professor Michael T. Goodrich
Professor Tyson Condie

2015

Portions of Chapter 2 © 2013 ACM, doi 10.1145/2464157.2466485
Portions of Chapter 3 © 2015 VLDB Endowment, doi 10.14778/2735471.2735477
All other materials © 2015 Yingyi Bu

TABLE OF CONTENTS

Page

LIST OF FIGURES v

LIST OF TABLES vii

ACKNOWLEDGMENTS viii

CURRICULUM VITAE x

ABSTRACT OF THE DISSERTATION xiii

1 Introduction 1

2 A Bloat-Aware Design for Big Data Applications 5
  2.1 Overview ...... 5
  2.2 Memory Analysis of Big Data Applications ...... 11
    2.2.1 Low Packing Factor ...... 11
    2.2.2 Large Volumes of Objects and References ...... 16
  2.3 The Bloat-Aware Design Paradigm ...... 18
    2.3.1 Data Storage Design: Merging Small Objects ...... 19
    2.3.2 Data Processor Design: Access Buffers ...... 22
  2.4 Programming Experience ...... 27
    2.4.1 Case Study 1: In-Memory Sort ...... 27
    2.4.2 Case Study 2: Print Records ...... 29
    2.4.3 Big Data Projects using the Bloat-Aware Design ...... 31
    2.4.4 Future Work ...... 33
  2.5 Performance Evaluation ...... 33
    2.5.1 Effectiveness On Overall Scalability: Using PageRank ...... 34
    2.5.2 Effectiveness On Packing Factor: Using External Sort ...... 36
    2.5.3 Effectiveness On GC Costs: Using Hash-Based Grouping ...... 39
  2.6 Related Work ...... 41
  2.7 Summary ...... 43

3 Pregelix: Big(ger) Graph Analytics on A Dataflow Engine 45
  3.1 Overview ...... 45
  3.2 Background and Problems ...... 48
    3.2.1 Pregel Semantics and Runtime ...... 48
    3.2.2 Apache Giraph ...... 49
    3.2.3 Issues and Opportunities ...... 50
  3.3 The Pregel Logical Plan ...... 52
  3.4 The Runtime Platform ...... 57
  3.5 The Pregelix System ...... 59
    3.5.1 Parallelism ...... 59
    3.5.2 Data Storage ...... 60
    3.5.3 Physical Query Plans ...... 61
    3.5.4 Memory Management ...... 66
    3.5.5 Fault-Tolerance ...... 66
    3.5.6 Job Pipelining ...... 67
    3.5.7 Pregelix Software Components ...... 67
    3.5.8 Discussion ...... 69
  3.6 Pregelix Case Studies ...... 69
  3.7 Experiments ...... 71
    3.7.1 Experimental Setup ...... 72
    3.7.2 Execution Time ...... 73
    3.7.3 System Scalability ...... 75
    3.7.4 Throughput ...... 78
    3.7.5 Plan Flexibility ...... 79
    3.7.6 Software Simplicity ...... 81
    3.7.7 Summary ...... 81
  3.8 Related Work ...... 83
  3.9 Summary ...... 85

4 Rich(er) Graph Analytics in AsterixDB 87
  4.1 Overview ...... 87
  4.2 Temporary Datasets In AsterixDB ...... 92
    4.2.1 Storage Management in AsterixDB ...... 92
    4.2.2 Motivation ...... 95
    4.2.3 Temporary Dataset DDL ...... 96
    4.2.4 Runtime Implementation ...... 97
    4.2.5 Temporary Dataset Lifecycle ...... 98
  4.3 The External Connector Framework ...... 100
    4.3.1 Motivation ...... 100
    4.3.2 The Connector Framework ...... 101
    4.3.3 The AsterixDB Connector ...... 103
  4.4 Running Pregel from AsterixDB ...... 106
  4.5 Experiments ...... 111
    4.5.1 Experimental Setup ...... 111
    4.5.2 Evaluations of Temporary Datasets ...... 112
    4.5.3 Evaluation of GQ1 (Long-Running Analytics) ...... 113
    4.5.4 Evaluation of GQ2 (Interactive Analytics) ...... 115
    4.5.5 Evaluation of GQ3 (Complex Graph ETL) ...... 116
    4.5.6 Discussions ...... 116
  4.6 Related Work ...... 118
  4.7 Summary ...... 119

5 Conclusions and Future Work 120
  5.1 Conclusion ...... 120
  5.2 Future Work ...... 121

Bibliography 123

LIST OF FIGURES

Page

2.1 Giraph object subgraph rooted at a vertex. ...... 14
2.2 The compact layout of a vertex. ...... 15
2.3 Vertices aligning in a page (slots at the end of the page are to support variable-sized vertices). ...... 20
2.4 A heap snapshot of the example. ...... 26
2.5 PageRank Performance on Pregelix. ...... 34
2.6 ExternalSort Performance. ...... 37
2.7 Hash-based Grouping Performance. ...... 38

3.1 Giraph process-centric runtime. ...... 50
3.2 Implementing message-passing as a logical join. ...... 53
3.3 The basic logical query plan of a Pregel superstep i which reads the data generated from the last superstep (e.g., Vertexi, Msgi, and GSi) and produces the data (e.g., Vertexi+1, Msgi+1, and GSi+1) for superstep i + 1. Global aggregation and synchronization are in Figure 3.4, and vertex addition and removal are in Figure 3.5. ...... 55
3.4 The plan segment that revises the global state. ...... 56
3.5 The plan segment for vertex addition/removal. ...... 56
3.6 The parallelized join for the logical join in Figure 3.2. ...... 60
3.7 The four physical group-by strategies for the group-by operator which combines messages in Figure 3.3. ...... 62
3.8 Two physical join strategies for forming the input to the compute UDF. On the left is an index full outer join approach. On the right is an index left outer join approach. ...... 63
3.9 The implementation of the single source shortest paths algorithm on Pregelix. ...... 70
3.10 Overall execution time (32-machine cluster). Neither Giraph-mem nor Giraph-ooc can work properly when the ratio of dataset size to the aggregated RAM size exceeds 0.15; GraphLab starts failing when the ratio of dataset size to the aggregated RAM size exceeds 0.07; Hama fails on even smaller datasets; GraphX fails to load the smallest BTC dataset sample BTC-Tiny. ...... 75
3.11 Average iteration execution time (32-machine cluster). ...... 76
3.12 Scalability (run on 8-machine, 16-machine, 24-machine, 32-machine clusters). ...... 77
3.13 Throughput (multiple PageRank jobs are executed in the 32-machine cluster with different sized datasets). ...... 80
3.14 Index full outer join vs. index left outer join (run on an 8 machine cluster) for Pregelix. ...... 82
3.15 Pregelix left outer join plan vs. other systems (SSSP on BTC datasets). GraphX fails to load the smallest dataset. ...... 83

4.1 The current Big Graph analytics flow. ...... 88
4.2 The query plan of a scan-insert query. ...... 94
4.3 The query plan of a secondary index search query. ...... 95
4.4 The query plan of a scan-insert query for temporary datasets. ...... 98
4.5 The query plan of a secondary index lookup on a temporary dataset. ...... 99
4.6 The connector runtime dataflow. ...... 103
4.7 The parallel speedup and scale-up of GQ1 (insert). ...... 114
4.8 The parallel speedup and scale-up of GQ1 (upsert). ...... 114
4.9 The parallel speedup and scale-up of GQ2 (insert). ...... 115
4.10 The parallel speedup and scale-up of GQ2 (upsert). ...... 116
4.11 The parallel speedup and scale-up of GQ3 (insert). ...... 117
4.12 The parallel speedup and scale-up of GQ3 (upsert). ...... 117

LIST OF TABLES

Page

2.1 Numbers of objects per vertex and their space overhead (in bytes) in PageRank in the Sun 64-bit HotSpot JVM. ...... 14
2.2 The line-of-code statistics of Big Data projects which use the bloat-aware design. ...... 32

3.1 Nested relational schema that models the Pregel state. ...... 52
3.2 UDFs used to capture a Pregel program. ...... 54
3.3 The Webmap dataset (Large) and its samples. ...... 73
3.4 The BTC dataset (X-Small) and its samples/scale-ups. ...... 73

4.1 The tweet dataset and its scale-downs. ...... 111
4.2 AsterixDB per-machine settings used throughout these experiments. ...... 112
4.3 Insert/Upsert performance: temporary datasets vs. regular datasets. ...... 113

ACKNOWLEDGMENTS

I am deeply indebted to my advisor, Professor Michael J. Carey, for supporting me throughout the past five years. Professor Carey is a serious system researcher, a knowledgeable data management veteran, a smart, critical thinker, a patient teacher, and a humorous supervisor. His deep knowledge and insightful vision of the area have guided me to work in the right direction, solve the right problem, build the right system, and do the right experiments. Professor Carey's "BMW" (build, measure, write) research approach fundamentally changed my mindset and my way of doing research, and shaped me into an independent researcher and a serious system builder. I will never forget his care for every issue and every user inquiry about the systems that we have been building, e.g., AsterixDB, Hyracks, and Pregelix. I will never forget his enthusiasm for every potential performance or usability improvement of the systems that we have been building. I will never forget his patience in educating academic kids like me. All of his efforts have made my doctoral study at UC Irvine the most rewarding and memorable experience, one that I will cherish forever. Thanks, Professor Carey!

I would like to thank Professor Tyson Condie for joining my dissertation committee. Professor Condie has guided my research projects throughout my entire doctoral study and has provided me with numerous suggestions on my work, from system architectures to implementations. He helped me tremendously in learning how to look at a system from a fundamental perspective and how to think about a research problem at a higher level. He spent a lot of time helping me improve the presentation of Chapter 3 of this thesis, from high-level pictures to low-level wording.

I would like to thank Professor Michael T. Goodrich for joining my dissertation committee as well. Professor Goodrich and his Ph.D. students were early adopters of the Pregelix system; they implemented real-world graph analysis problems on top of it and provided us with invaluable feedback on how the system could be improved for solving real-world graph problems.

I would like to thank Professor Chen Li and Professor Xiaohui Xie for developing the genome assembly application on top of Pregelix. Professor Li and Professor Xie, as well as their students, gave us many invaluable suggestions on how Pregelix could be improved to solve the genome assembly problem in a simple and efficient way.

I would like to thank Professor Harry Guoqing Xu for his tremendous help with Chapter 2 of this thesis, from formulating the research problem to writing up the paper. Professor Xu and his Ph.D. students continue to work in the direction of that chapter, and the exciting results they have obtained will greatly raise its impact.

I would like to particularly thank Vinayak Borkar for his invaluable suggestions and discussions during my thesis work. Vinayak is not only a real system builder with tons of experience but also an intelligent thinker. From him, I learned how to build a real system in the right way, how to design good abstractions, and how to analyze and improve performance. I will never forget the days and nights when we debugged and analyzed the performance of Pregelix and Hyracks together on a large cluster.

I would like to thank Jianfeng Jia for benchmarking Pregelix against distributed GraphLab and GraphX, and thank Markus Holzemer for building the early prototype of the AsterixDB/Pregelix integration. I would like to thank Young-Seok Kim and Murtadha Hubail for guiding and reviewing my implementation of temporary datasets in AsterixDB. With their help, the quality of Chapter 3 and Chapter 4 has been greatly improved.

I would like to thank all other AsterixDB team members and alumni: Dr. Till Westmann, Dr. Nicola Onose, Dr. Alex Behm, Dr. Raman Grover, Dr. Sattam Alsubaiee, Dr. Rares Vernica, Pouria Pirzadeh, Ian Maxon, Abdullah Alamoudi, Inci Cetindil, Taewoo Kim, Zach Heilbron, Preston Carman, Ildar Absalyamov, Steven Jacobs, Madhusudan C.S, Khurram Faraaz, Kereno Ouaknine, Professor Vassilis Tsotras, Professor Heri Ramampiaro, and Professor Wenhai Li, for working hard together to make the systems stable and reliable. The results in this thesis stand on everyone's contributions, large and small.

I would like to thank the following people who tried Pregelix at an early stage, reported bugs, hardened the system, and gave us feedback: Jacob Biesinger, Da Yan, Anbang Xu, Nan Zhang, Vishal Patel, Joe Simons, Nicholas Ceglia, Khanh Nguyen, Yuzhen Huang, Professor Hongzhi Wang, Professor James Cheng, Professor Chen Li, Professor Xiaohui Xie, Professor Harry Guoqing Xu, and Professor Michael T. Goodrich. Without their contributions, neither the system nor this thesis could have become real.

I would like to thank Dr. Hongree Lee, Dr. Christopher Olston, Dr. Jayant Madhaven, Dr. Peter Hawkins, Dr. Vuk Ercegovac, and Dr. Alon Halevy for their wonderful mentoring during my summer internships at Google. From them, I learned Google's hybrid research approach that aligns research visions well with real-world products.

I would like to thank Google for providing me with a generous Google Ph.D. Fellowship during the last two years of my Ph.D. program. The work reported in this thesis has also been supported by a UC Discovery grant, by NSF IIS award 0910989, and by NSF CNS awards 1059436, 1305430, and 1305253. In addition, the AsterixDB project has benefited from generous industrial support from Amazon, eBay, Google, HTC, InfoSys, Microsoft, Oracle Labs, and Yahoo.

I would like to thank my dear friends Zhijing Qin, Liyan Zhang, Wei Zhang, Zhi Chen, Yi Wang, Di Wu, Moulin Xiong, Zhenqiu Huang, Yongjie Zheng, Fang Deng, Jingwen Zhang, Yongxue Li, and Le An for helping me and enriching my life in Irvine and beyond.

Finally, I would like to thank my parents Guilin and Jiling, my wife Yan, and my daughter Grace. Without their love and continuous encouragement, I could not have completed this thesis.

CURRICULUM VITAE

Yingyi Bu

EDUCATION

Doctor of Philosophy in Computer Science                    2015
University of California, Irvine                            Irvine, California

Master of Philosophy in Computer Science                    2008
The Chinese University of Hong Kong                         Hong Kong, China

Bachelor of Science in Computer Science                     2005
Nanjing University                                          Nanjing, Jiangsu, China

REFEREED JOURNAL PUBLICATIONS

Pregelix: Big(ger) Graph Analytics on A Dataflow Engine     2014
Proceedings of the Very Large Database Endowment (PVLDB)

AsterixDB: A Scalable, Open Source BDMS     2014
Proceedings of the Very Large Database Endowment (PVLDB)

Pregel Algorithms for Graph Connectivity Problems with Performance Guarantees     2014
Proceedings of the Very Large Database Endowment (PVLDB)

The HaLoop Approach to Large-Scale Iterative Data Analysis     2012
The VLDB Journal (VLDBJ)

HaLoop: Efficient Iterative Data Processing on Large Clusters     2010
Proceedings of the Very Large Database Endowment (PVLDB)

Privacy Preserving Serial Data Publishing By Role Composition     2008
Proceedings of the Very Large Database Endowment (PVLDB)

REFEREED CONFERENCE PUBLICATIONS

Algebricks: A Data Model-Agnostic Compiler Backend for Big Data Languages     August 2015
The 2015 ACM SIGMOD/SIGOPS Symposium on Cloud Computing (SOCC)

Facade: A Compiler and Runtime for (Almost) Object-Bounded Big Data Applications     March 2015
The 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)

A Bloat-Aware Design for Big Data Applications     June 2013
The 2013 ACM SIGPLAN International Symposium on Memory Management (ISMM)

Efficient Anomaly Monitoring Over Moving Object Trajectory Streams     June 2009
The 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

WAT: Finding Top-K Discords in Time Series Database     April 2007
The 2007 SIAM International Conference on Data Mining (SDM)

SOFTWARE

Pregelix (http://pregelix.ics.uci.edu)
An open source distributed graph processing system that is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads.

AsterixDB (http://asterixdb.ics.uci.edu)
An open source Big Data management system, with new technologies for ingesting, storing, managing, indexing, querying, analyzing, and subscribing to intensive semi-structured data.

ABSTRACT OF THE DISSERTATION

On Software Infrastructure for Scalable Graph Analytics

By

Yingyi Bu

Doctor of Philosophy in Computer Science

University of California, Irvine, 2015

Professor Michael J. Carey, Chair

Recently, there has been a growing need for distributed graph processing systems that are capable of gracefully scaling to very large datasets. At the same time, in real-world applications, it is highly desirable to reduce the tedious, inefficient ETL (extract, transform, load) gap between tabular data processing systems and graph processing systems. Unfortunately, those challenges have not been easily met due to the intense memory pressure imposed by the process-centric, message-passing designs that many graph processing systems follow, as well as the separation of tabular data processing runtimes and graph processing runtimes.

In this thesis, we explore the application of programming techniques and algorithms from the database systems world to the problem of scalable graph analysis. We first propose a bloat-aware design paradigm towards the development of efficient and scalable Big Data applications in object-oriented, GC enabled languages and demonstrate that programming under this paradigm does not incur significant programming burden but obtains remarkable performance gains (e.g., 2.5×+).

Based on the design paradigm, we then build Pregelix, an open source distributed graph processing system which is based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15× speedup compared to Apache Giraph and up to 35× speedup compared to distributed GraphLab).

Finally, we integrate Pregelix with the open source Big Data management system AsterixDB to offer users a mix of a vertex-oriented programming model and a declarative query language for richer forms of Big Graph analytics with reduced ETL pains.

Chapter 1

Introduction

Over the past decade, the increasing demands for data-driven business intelligence have led to the proliferation of large-scale, data-intensive applications that often have huge amounts of data (at terabyte or petabyte scale) to process.

In particular, there are more and more demands today to process Big Graphs for applications in social networking (e.g., friend recommendations), the web (e.g., ranking pages), and human genome assembly (e.g., extracting gene sequences). Statistics show that Google has indexed about 50 billion web pages [23]; Facebook has about 1.44 billion monthly active users [22]; the de Bruijn graph of human genomes has 3 billion+ nodes and 10 billion+ edges [109]; the knowledge graph already used by Google Search has 570 million objects and more than 18 billion facts [16].

Even with those large numbers, the sizes of Big Graphs are still likely being underestimated, because many huge, real-world graphs are hidden in data that look irrelevant to graphs. In machine-generated logs from web servers, sensors, and wearable devices, there are usually causal relationships between different log lines [6] from which we can build a graph. In Twitter, the relationships of "reply" and "retweet" among billions of tweets (e.g., 200 billion per year [24]) are edges of a gigantic tweet graph. In Google Hangouts [7], huge volumes of chats and talks contain many, many objects and relationships in the form of text and voice that can also form a huge knowledge graph. And the list goes on...

Analyzing graphs at such large scales exceeds the capability of a single computer and requires distributed graph processing platforms that can run on large-scale commodity machine clusters. The required platforms should not only be efficient, scalable, robust, and reliable, but should also offer a simple yet expressive programming model with support for easy graph ETL (extract, transform, load) operations.

Motivated by real-world use cases, this thesis focuses on issues and challenges in the design and implementation of Big Graph analytics platforms and studies the following problems:

• What design principles should we follow to build efficient and scalable Big Data platforms? Because of today's rapid development cycles, Java Virtual Machine-based languages like Java and Scala have become "de facto" programming languages for building Big Data processing platforms [10, 12, 92, 81, 108, 36, 42, 29]. However, the overhead of automatic garbage collection provided by those languages is often oppressive for performance-critical Big Data applications. To address the problem, Chapter 2 proposes a bloat-aware design paradigm for the development of efficient and scalable Big Data applications in managed object-oriented languages. Our key observation is that the best way to collect garbage is not to generate it in the first place, i.e., to keep the number of created objects bounded or approximately bounded at runtime. Our experimental results show that this new design paradigm is extremely effective in improving performance — even for the moderate-size data sets processed, we have observed 2.5×+ performance gains, and the improvement grows substantially with the size of the data set.

• How should we build a Big Graph analytics platform that always works with its best effort regardless of cluster memory constraints? Most current open-source distributed graph computation platforms [9, 11, 73] follow a process-centric, message-passing design, which results in a "work if memory is enough, otherwise fail" situation. This approach wastes a lot of human effort in order to resize clusters and restart graph computation jobs, and it prevents many budget-limited organizations from analyzing Big Graphs. To solve this problem, Chapter 3 explores an architectural alternative that expresses message-passing semantics as iterative dataflows that run on a general-purpose data-parallel runtime. Our key observation is that we can build graph computation platforms using database-style query evaluation techniques that have stood the test of time for working in memory-constrained environments. Based on this new architecture, we have built an open source distributed graph processing system called Pregelix. In our experiments, while other graph computation platforms fail jobs in under-resourced settings, Pregelix always works gracefully. In addition, we have seen that Pregelix offers improved performance characteristics and scaling properties over other current open source systems (e.g., up to 35× speedup).

• How should we support integrated, simple analytical pipelines for implicit Big Graphs? More and more application scenarios [13] require Big Graph computation platforms to be able to consume tabular data from a data warehouse and to save results back into the data warehouse as well. In this context, Big Graph analytics are no longer only about graphs, but also involve complex pipelines such as graph extraction and post-processing. Unfortunately, most existing solutions require data scientists to manually glue together tabular data processing systems and graph processing systems, and to manage data format transformations between them, which distracts the data scientists from the logical analytical task itself. To reduce this burden, we have coupled AsterixDB, an open source Big Data management system for ingesting and more richly querying semi-structured data, with our Pregelix system to support such data science use cases. To do so, we added a non-ACID storage option (i.e., temporary datasets) into AsterixDB, and we also added a graph computation job entrance capability to AQL, the query language of AsterixDB. Chapter 4 presents the design and implementation of the resulting coupled system. The integration not only improves the user experience for complex graph analytics, but also enables interactive graph analytics by leveraging the secondary indexing capability of AsterixDB. Our experimental evaluation results demonstrate that the resulting system has very good scaling properties.

In this thesis, the design paradigm proposed in Chapter 2 serves as a foundation for the system implementations in Chapters 3 and 4. Chapter 3 describes the internals of the Pregelix system itself, and Chapter 4 presents how Pregelix is integrated with AsterixDB. In addition to providing an integrated, data-centric platform for Big Data management and graph analytics — since "one man's data is another man's graph" — the resulting architecture can serve as one potential model for coupling Big Data management with other types of Big Data analytics as well. In particular, this approach enables a data scientist (or a team of data scientists) to do end-to-end analysis using a combination of paradigms (e.g., AQL and Pregel) while doing so with a minimum of "gluing pain".

Chapter 2

A Bloat-Aware Design for Big Data Applications

2.1 Overview

Modern computing has entered the era of Big Data. The massive amounts of information available on the Internet enable computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists, and many others to discover interesting properties about people, things, and their interactions. Analyzing information from Twitter, Google, Facebook, Wikipedia, or the Human Genome Project requires the development of scalable platforms that can quickly process massive-scale data. Such frameworks often utilize large numbers of machines in a cluster or in the cloud to process data in a parallel manner. Typical data-processing frameworks include data-flow and message-passing runtime systems. A data-flow system (such as MapReduce [51], Hadoop [10], Hyracks [42], Spark [108], or Storm [95]) uses a distributed file system to store data and computes results by pushing data through a processing pipeline, while a message-passing system (such as Pregel [75] or Giraph [9]) often loads one partition of data per processing unit (machine, process, or thread) and sends/receives messages among different units to perform computations. High-level languages (such as Hive [92], Pig [81], FlumeJava [46], or AsterixDB [29]) are designed to describe data processing at a more abstract level.

An object-oriented programming language such as Java is often the developer’s choice for implementing data-processing frameworks. In fact, the Java community has already been the home of many data-intensive computing infrastructures, such as Hadoop [10], Hyracks [42], Storm [95], and Giraph [9]. Spark [108] is written in Scala, but it relies on a Java Virtual Machine (JVM) to execute. Despite the many development benefits provided by Java, these applications commonly suffer from severe memory bloat—a situation where large amounts of memory are used to store information not strictly necessary for the execution—that can lead to significant performance degradation and reduced scalability.

Bloat in such applications stems primarily from a combination of the inefficient memory usage inherent in the run time of a managed language as well as the processing of huge volumes of data that can exacerbate the already-existing inefficiencies by orders of magnitude. As such, Big Data applications are much more vulnerable to runtime bloat than regular Java applications. As an interesting reference point, our experience shows that the latest (Indigo) release of the Eclipse framework with 16 large Java projects loaded can successfully run (without any noticeable lag) on a 2GB heap; however, a moderate-size application on Giraph [9] with 1GB input data can easily run out of memory on a 12GB heap. Due to the increasing popularity of Big Data applications in modern computing, it is important to understand why these applications are so vulnerable, how they are affected by runtime bloat, and what changes should be made to the existing design and implementation principles in order to make them scalable.

Note that the version of Giraph that we studied throughout this chapter is svn revision 1232166, a version from 2012 when we were doing the work of this chapter. A later version of Giraph (described in Section 4.5.1), with substantially reduced object creation, will be experimentally evaluated in Chapter 3.

In this chapter, we describe a study of memory bloat using two real-world Big Data applications: Hive [92] and Giraph [9], where Hive is a large-scale data warehouse system (an Apache top-level project powering Facebook's data analytics) built on top of Hadoop, and Giraph is an Apache open source graph analytics framework initiated by Yahoo!. Our study shows that freely creating objects (as encouraged by object-orientation) is the root cause of the performance bottleneck that prevents these applications from scaling up to large data sets.

To gain a deep understanding of the bottleneck and how to effectively optimize it away, we break down the problem of excessive object creation into two different aspects: (1) what is the space overhead if all data items are represented by Java objects? and (2) given all these objects, what are the memory management (i.e., GC) costs in a typical Big Data application? These two questions are related, respectively, to the spatial impact and the temporal impact that object creation can have on performance and scalability.

On the one hand, each Java object has a fixed-size header space to store its type and the information necessary for garbage collection. What constitutes the space overhead is not just object headers; the other major component comes from the pervasive use of object-oriented data structures that commonly have multiple layers of delegation. Such delegation patterns, while simplifying development tasks, can easily lead to wasteful memory space that stores pointers to form data structures, rather than the actual data needed for the forward execution. Based on a study reported in [79], the fraction of the actual data in an IBM application is only 13% of the total used space. This impact can be significantly magnified in a Big Data application that contains a huge number of (relatively small) data item objects. For such small objects, the space overhead cannot be easily amortized by the actual data content. The problem of inefficient memory usage becomes increasingly painful for highly-parallel data-processing systems because each thread consumes excessive memory resources, leading to increased I/O costs and reduced concurrency.

On the other hand, a typical tracing garbage collection (GC) algorithm periodically traverses the entire live object graph to identify and reclaim unreachable objects. For non-allocation-intensive applications, efficient garbage collection algorithms such as a generational GC can quickly mark reachable objects and reclaim memory from dead objects, causing negligible interruptions of the main execution threads. However, once the heap grows to be large (e.g., a few dozen GBs) and most objects in the heap are live, a single GC call can become exceedingly long. In addition, because the amount of used memory in a Big Data application is often close to the heap size, GC can be frequently triggered and would eventually become the major bottleneck that prevents the main threads from making satisfactory progress. We observe that in most Big Data applications, a huge number of objects (representing data to be processed in the same batch) often have the same lifetime, and hence it is highly unnecessary for the GC to traverse each individual object every time to determine whether or not it is reachable.

Switch back to an unmanaged language? Switching back to an unmanaged language such as C++ appears to be a reasonable choice. However, our experience with many Big Data applications (such as Hive, Pig, Jaql, Giraph, or Mahout) suggests that a Big Data application often exhibits clear distinction between a control path and a data path. The control path organizes tasks into the pipeline and performs optimizations while the data path represents and manipulates data. For example, in a typical Big Data application that runs on a shared-nothing cluster, there is often a driver at the client side that controls the data flow and there are multiple run-time data operators executing data processing algorithms on each machine. The execution of the driver in the control path does not touch any actual data. Only the execution of the data operators in the data path manipulates data items. While the data path creates most of the run-time objects, its development often takes a very small amount of coding effort, primarily because data processing algorithms (e.g., joining, grouping, sorting, etc.) can be easily shared and reused across applications.

One study we have performed on seven open source Big Data applications shows that the data flow path takes an average of 36.8% of the lines of source code but creates more than 90% of the objects during execution. Details of this study can be found in Section 2.4.3. Following the conventional object-oriented design for the control path is often harmless; it is the data path that needs a non-conventional design and an extremely careful implementation. As the control path takes the majority of the development work, it is unnecessary to force developers to switch to an unmanaged language for the whole application, where they would have to face the (old) problems of low productivity, fewer community resources, manual memory management, and error-prone implementations.

Although the inefficient use of memory in an object-oriented language is a known problem and has been studied before (e.g., in [79]), there does not exist any systematic analysis of its impact on Big Data applications. In this chapter, we study this impact both analytically and empirically. We argue that the designers of a Big Data application should strictly follow the following principle: the number of data objects in the system has to be bounded and cannot grow proportionally with the size of the data to be processed. To achieve this, we propose a new design paradigm that advocates merging small objects in storage and accessing the merged objects using data processors. The idea is inspired by an old memory management technique called page-based record management, which has been used widely to build database systems. We adopt the proposed design paradigm in our "build-from-scratch" general-purpose Big Data processing framework (called Hyracks) at the application (Java) level, without requiring modification of the JVM or the Java compiler. We demonstrate, using both examples and experience, that writing programs using the proposed design paradigm does not create much burden for developers.

We have implemented several common data processing tasks by following both this new design paradigm and the conventional object-oriented design principle. Our experimental results demonstrate that the implementations following the new design can scale to much larger data sizes than those following the conventional design. We believe that this new design paradigm is valuable in guiding the future implementations of Big Data applications using managed object-oriented languages. The observations made in this chapter strongly call for novel optimization techniques targeting Big Data applications. For example, optimizer designers can develop automated compiler and/or runtime system support (e.g., within a JVM) to remove the identified inefficiency patterns in order to promote the use of object-oriented languages in developing Big Data applications. Furthermore, future development of benchmark suites should consider the inclusion of such applications to measure JVM performance.

Contributions of this chapter include:

• an analytical study on the memory usage of common Big Data processing tasks such as graph link analysis and the relational join. We find that the excessive creation of objects to represent and process data items is the bottleneck that prevents Big Data applications from scaling up to large datasets (Section 2.2);

• a bloat-aware design paradigm for the development of highly-efficient Big Data applications; instead of building a new memory system to solve the memory issues from scratch, we propose two application-level optimizations, including (1) merging (inlining) a chunk of small data item objects with the same lifetime into a few large objects (e.g., a few byte arrays) and (2) manipulating data by directly accessing merged objects (i.e., at the binary level), in order to mitigate the observed memory bloat patterns (Section 2.3);

• a set of experimental results (Section 2.5) that demonstrate significant memory and time savings using the design. We report our experience of programming for real-world data-processing tasks in the Hyracks platform (Section 2.4). We compare the performance of several Big Data applications with and without using the proposed design; the experimental results show that our optimizations are extremely effective (Section 2.5).

2.2 Memory Analysis of Big Data Applications

In this section, we study two popular data-intensive applications, Giraph [9] and Hive [92], to investigate the impact of creating Java objects to represent and process data on performance and scalability. Our analysis drills down to two fundamental problems, one in space and one in time: (1) large space consumed by object headers and object references, leading to a low packing factor of the memory, and (2) massive amounts of objects and references, leading to poor GC performance. We analyze these two problems using examples from Giraph and Hive, respectively.

2.2.1 Low Packing Factor

In the Java runtime, each object requires a header space for type and memory management purposes. Additional space is needed by an array to store its length. For instance, in the Oracle 64-bit HotSpot JVM, the header spaces for a regular object and for an array take 8 and 12 bytes, respectively. In a typical Big Data application, the heap often contains many small objects (such as Integers representing record IDs), in which the overhead incurred by headers cannot be easily amortized by the actual data content. Space inefficiencies are exacerbated by the pervasive utilization of object-oriented data structures. These data structures often use multiple levels of delegation to achieve their functionality, leading to large amounts of space storing pointers instead of the actual data. In order to measure the space inefficiencies introduced by the use of objects, we employ a metric called packing factor, which is defined as the maximal amount of actual data that can be accommodated in a fixed amount of memory. While a similar analysis [79] has been conducted to understand the health of Java collections, our analysis is specific to Big Data applications, where a huge amount of data flows through a fixed amount of memory in a batch-by-batch manner.
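To make the packing factor concrete, the following back-of-the-envelope sketch (not from the original chapter; the list size, the ArrayList constant, and the class name are illustrative assumptions) applies the header, pointer, and payload sizes quoted above to a list of boxed 8-byte IDs:

public class PackingFactorExample {
    public static void main(String[] args) {
        long k = 1_000_000;              // number of IDs stored in the list (assumed)
        long actualData = 8 * k;         // the k long values themselves
        long boxedIds   = (8 + 8) * k;   // one boxed ID object per value: 8-byte header + 8-byte payload
        long refArray   = 12 + 8 * k;    // backing Object[]: 12-byte array header + 8-byte references
        long total      = boxedIds + refArray + 16;  // plus a rough allowance for the list object itself
        System.out.printf("packing factor: %.2f%n", (double) actualData / total);
        // Prints roughly 0.33, i.e., only about a third of the consumed memory is actual data.
    }
}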

To analyze the packing factor for the heap of a typical Big Data application, we use the PageRank algorithm [83] (i.e., an application built on top of Giraph [9]) as a running example. PageRank is a link analysis algorithm that assigns weights (ranks) to each vertex in a graph by iteratively computing the weight of each vertex based on the weight of its inbound neighbors. This algorithm is widely used to rank web pages in search engines.

We ran PageRank on different open source Big Data computing systems, including Giraph [9], Spark [108], and Mahout [34], using a 6-rack, 180-machine research cluster. Each machine has 2 quad-core Intel Xeon E5420 processors and 16GB RAM. We used a 70GB web graph dataset that has a total of 1,413,511,393 vertices. We found none of the three systems could successfully process this dataset. They all crashed with java.lang.OutOfMemoryError, even though the data partitioned for each machine (i.e., less than 500MB) should easily fit into its physical memory.

We found that many real-world developers experienced similar problems. For example, we saw a number of complaints on OutOfMemoryError from Giraph's user mailing list, and there were 13 bloat-related threads on Giraph's mailing list (http://mail-archives.apache.org/mod_mbox/giraph-user/) from January 2012 to September 2012.

In order to locate the bottleneck, we perform a quantitative analysis using PageRank. Giraph contains an example implementation of the PageRank algorithm. Part of its data representation implementation is shown below.

public abstract class EdgeListVertex<
        I extends WritableComparable, V extends Writable,
        E extends Writable, M extends Writable>
        extends MutableVertex<I, V, E, M> {

    private I vertexId = null;

    private V vertexValue = null;

    /** indices of its outgoing edges */
    private List<I> destEdgeIndexList;

    /** values of its outgoing edges */
    private List<E> destEdgeValueList;

    /** incoming messages from the previous iteration */
    private List<M> msgList;

    ......

    /** return the edge indices starting from 0 */
    public List<I> getEdegeIndexes(){ ... }
}

Graphs handled in Giraph are labeled (i.e., both their vertices and edges are annotated with values) and their edges are directional. Class EdgeListVertex represents a graph vertex. Among its fields, vertexId and vertexValue store the ID and the value of the vertex, respectively. Fields destEdgeIndexList and destEdgeValueList reference, respectively, a list of IDs and a list of values of its outgoing edges. msgList contains incoming messages sent to the vertex from the previous iteration. Figure 2.1 visualizes the Java object subgraph rooted at an EdgeListVertex object.

Figure 2.1: Giraph object subgraph rooted at a vertex.

In Giraph's PageRank implementation, the concrete types for I, V, E, and M are LongWritable, DoubleWritable, FloatWritable, and DoubleWritable, respectively. Each edge in the graph is equi-weighted, and thus the list referenced by destEdgeValueList is always empty. Assume that each vertex has an average of m outgoing edges and n incoming messages. Table 2.1 shows the memory consumption statistics of a vertex data structure in the Oracle 64-bit HotSpot JVM. Each row in the table reports a class name, the number of its objects needed in this representation, the number of bytes used by the headers of these objects, and the number of bytes used by the reference-typed fields of these objects.

Class            #Objects     Header (bytes)    Pointer (bytes)
Vertex           1            8                 40
List             3            24                24
List$Array       3            36                8(m + n)
LongWritable     m + 1        8m + 8            0
DoubleWritable   n + 1        8n + 8            0
Total            m + n + 9    8(m + n) + 84     8(m + n) + 64

Table 2.1: Numbers of objects per vertex and their space overhead (in bytes) in PageRank in the Sun 64-bit HotSpot JVM.

It is easy to calculate that the space overhead for each vertex in the current implementation is 16(m + n) + 148 bytes (i.e., the sum of the header size and the pointer size in Table 2.1). On the contrary, Figure 2.2 shows an ideal memory layout that stores only the necessary information for each vertex (without using objects). In this representation, a vertex requires m + 1 long values (the vertex ID and the destination IDs of its outgoing edges), n + 1 double values (the vertex value and its incoming messages), and two 32-bit int values (specifying the number of outgoing edges and the number of messages, respectively), which consume a total of 8(m + 1) + 8(n + 1) + 8 = 8(m + n) + 24 bytes of memory. This memory consumption is even less than half of the space used for object headers and pointers in the object-based representation. Clearly, the space overhead of the object-based representation is greater than 200%.

Figure 2.2: The compact layout of a vertex (vertex ID: 8 bytes; vertex value: 8 bytes; number of outgoing edges: 4 bytes; edge destination IDs: 8m bytes; number of incoming messages: 4 bytes; incoming messages: 8n bytes).
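As a small concrete illustration, the sketch below (illustrative only; the class and method names are ours, not part of Giraph or Hyracks) plugs example values of m and n into the two space estimates derived above:

public class VertexSpaceEstimate {
    // Header + pointer overhead of the object-based vertex representation (Table 2.1).
    static long objectBasedOverhead(long m, long n) {
        return 16 * (m + n) + 148;
    }

    // Total bytes needed by the compact layout of Figure 2.2 (actual data only).
    static long compactLayoutBytes(long m, long n) {
        return 8 * (m + n) + 24;
    }

    public static void main(String[] args) {
        long m = 30, n = 30;  // an example vertex with 30 outgoing edges and 30 messages
        System.out.println("object-based overhead: " + objectBasedOverhead(m, n)); // 1108 bytes
        System.out.println("compact layout size:   " + compactLayoutBytes(m, n));  // 504 bytes
    }
}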

2.2.2 Large Volumes of Objects and References

In a JVM, the GC threads periodically iterate over all live objects in the heap to identify and reclaim dead objects. Suppose the number of live objects is n and the total number of edges in the object graph is e; then the asymptotic computational complexity of a tracing garbage collection algorithm is O(n + e). For a typical Big Data application, its object graph consists of a great number of isolated object subgraphs, each of which represents either a data item or a data structure created for processing items. As such, there often exists an extremely large number of in-memory data objects, and both n and e can be orders of magnitude larger than those of a regular Java application.

We use an exception example from Hive's user mailing list to analyze the problem. This exception was found in a discussion thread named "how to deal with Java heap space errors" (http://mail-archives.apache.org/mod_mbox/Hive-user/201107.mbox/):

FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.Text.setCapacity(Text.java:240)
    at org.apache.hadoop.io.Text.set(Text.java:204)
    at org.apache.hadoop.io.Text.set(Text.java:194)
    at org.apache.hadoop.io.Text.<init>(Text.java:86)
    ......
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:263)
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.next(RowContainer.java:74)
    at org.apache.hadoop.hive.ql.exec.CommonJoinOperator.checkAndGenObject(CommonJoinOperator.java:823)
    at org.apache.hadoop.hive.ql.exec.JoinOperator.endGroup(JoinOperator.java:263)
    at org.apache.hadoop.hive.ql.exec.ExecReducer.reduce(ExecReducer.java:198)
    ......
    at org.apache.hadoop.hive.ql.exec.persistence.RowContainer.nextBlock(RowContainer.java:397)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

We inspected the source code of Hive and found that the top method Text.setCapacity() in the stack trace is not the cause of the problem. In Hive's join implementation, its JoinOperator holds all the Row objects from one of the input branches in the RowContainer. In cases where a large number of Row objects is stored in the RowContainer, a single GC run can become very expensive. For the reported stack trace, the total size of the Row objects exceeds the heap upper bound, which causes the OutOfMemory error.

Even in cases where no OutOfMemory error is triggered, the large number of Row objects can still cause severe performance degradation. Suppose the number of Row objects in the RowContainer is n; then the GC time for traversing the internal structure of the RowContainer object is at least O(n). For Hive, n grows proportionally with the size of the input data, which can easily drive up the GC overhead substantially. The following is an example obtained from a user report at StackOverflow (http://stackoverflow.com/questions/11387543/performance-tuning-a-Hive-query). Although this problem has a different manifestation, the root cause is the same.

"I have a Hive query which is selecting about 30 columns and around 400,000 records and inserting them into another table. I have one join in my SQL clause, which is just an inner join. The query fails because of a Java GC overhead limit exceeded."

In fact, complaints about large GC overhead can be commonly seen on either Hive's mailing list or the StackOverflow website. What makes the problem even worse is that there is not much that can be done from the developer's side to optimize the application, because the inefficiencies are inherent in the design of Hive. All the data processing-related interfaces in Hive require the use of Java objects to represent data items. To manipulate the data contained in a Row, for example, we have to wrap it into a Row object, as designated by the interfaces. If we wished to completely solve this performance problem, we would have to re-design and re-implement all the related interfaces from scratch, a task that no user could afford to do. This example motivates us to look for solutions at the design level, so that we would not be limited by the many (conventional) object-oriented guidelines and principles.

2.3 The Bloat-Aware Design Paradigm

The fundamental reason for the performance problems discussed in Section 2.2 is that the two Big Data applications were designed and implemented the same way as regular object-oriented applications: everything is an object. Objects are used to represent both data processors and data items to be processed. While creating objects to represent data processors may not have a significant impact on performance, the use of objects to represent data items creates a big scalability bottleneck that prevents the application from processing large data sets. Since a typical Big Data application does similar data processing tasks repeatedly, a group of related data items often has similar liveness behavior. They can be easily managed together in large chunks of buffers, so that the GC does not have to traverse each individual object to test its reachability. For instance, all vertex objects in the Giraph example have the same lifetimes; so do Row objects in the Hive example. A natural idea is to allocate them in the same memory region, which is reclaimed as a whole when the contained data items are no longer needed.

Based on this observation, we propose a bloat-aware design paradigm for developing highly efficient Big Data applications. This paradigm includes the following two important components: (1) merging and organizing related small data record objects into few large objects (e.g., byte buffers) instead of representing them explicitly as one-object-per-record, and (2) manipulating data by directly accessing buffers (e.g., at the byte chunk level as opposed to the object level). The central goal of this design paradigm is to bound the number of objects in the application, instead of letting it grow proportionally with the cardinality of the input data. It is important to note that these guidelines should be considered explicitly at the early design stage of a Big Data processing system, so that the resulting APIs and implementations comply with the principles. We have built our own Big Data processing framework, Hyracks, from scratch by strictly following this design paradigm. We will use Hyracks as a running example to illustrate these design principles.

2.3.1 Data Storage Design: Merging Small Objects

As described in Section 2.2, storing data in Java objects adds much overhead in terms of both memory consumption as well as CPU cycles. As such, we propose to store a group of data items together in Java memory pages. Unlike a system-level memory page, which deals with virtual memory, a Java memory page is a fixed-length contiguous block of memory in the (managed) Java heap. For simplicity of presentation, we will use "page" to refer to "Java memory page" in the rest of the chapter. In Hyracks, each page is represented by an object of type java.nio.ByteBuffer. Arranging records into pages can reduce the number of objects created in the system from the total number of data items to the number of pages. Hence, the packing factor in such a system can be much closer to that in the ideal representation where data are explicitly laid out in memory and no bookkeeping information needs to be stored. Note that grouping data items into a binary page is just one of many ways to merge small objects; other merging (inlining) approaches may also be considered in the future to achieve the same goal.

Figure 2.3: Vertices aligning in a page (slots at the end of the page are to support variable-sized vertices).

Multiple ways exist to put records into a page. The Hyracks system takes an approach called "slot-based record management" [84], used widely in existing DBMS implementations. To illustrate, consider again the PageRank algorithm. Figure 2.3 shows how 4 vertices are stored in a page. It is easy to see that each vertex is stored in its compact layout (as in Figure 2.2) and we use 4 slots (each takes 4 bytes) at the end of the page to store the start offset of each vertex. These offsets will be used to quickly locate data items and to support variable-length records. Note that the format of data records is invisible to developers, so that they can still focus on high-level data management tasks. Because pages are of fixed size, there is often a small residual space that is wasted and cannot be used to store any vertex. To understand the packing factor for this design, we assume each page holds p records on average, and the residual space has r bytes. The overhead for this representation of a vertex includes three parts: the offset slot (4 bytes), the amortized residual space (i.e., r/p), and the amortized overhead of the page object itself (i.e., java.nio.ByteBuffer). The page object has an 8-byte header space (in the Oracle 64-bit HotSpot JVM) and a reference (8 bytes) to an internal byte array whose header takes 12 bytes. This makes the amortized overhead of the page object 28/p. Combining this overhead with the result from Section 2.2.1, we need a total of (8m + 8n) + 24 + 4 + (r + 28)/p bytes to represent a vertex, where (8m + 8n) + 24 bytes are used to store the necessary data and 4 + (r + 28)/p is the overhead. Because r is the size of the residual space, we have:

r ≤ 8m + 8n + 24

and thus the space overhead of a vertex is bounded by 4 + (8m + 8n + 52)/p. In Hyracks, we use 32KB (as recommended by the literature [63]) as the size of a page, and p ranges from 100 to 200 (as seen in experiments with real-world data). To calculate the largest possible overhead, consider the worst case where the residual space has the same size as a vertex. The size of a vertex is thus between (32768 − 200 ∗ 4)/(200 + 1) = 159 bytes and (32768 − 100 ∗ 4)/(100 + 1) = 320 bytes. Because the residual space is as large as a vertex, we have 159 ≤ r ≤ 320. This leaves the space overhead of a vertex in the range between 4 bytes (because at least 4 bytes are required for the offset slot) and 4 + (320 + 28)/100 = 7 bytes. Hence, the overall overhead is only 2-4% of the actual data size, much less than the 200% overhead of the object-based representation (as described in Section 2.2.1).
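As a concrete illustration of the slot-based layout just analyzed, the sketch below writes variable-length records into a java.nio.ByteBuffer page, with 4-byte offset slots growing backwards from the end of the page. This is a minimal illustration under stated assumptions, not Hyracks' actual API; the SlottedPage, append, recordOffset, and bytes names are hypothetical. (A record's length is not stored explicitly; in a slot-based design it can be derived from the offsets of neighboring records.)

import java.nio.ByteBuffer;

public class SlottedPage {
    private final ByteBuffer page;
    private int recordEnd = 0;    // next free byte for record data, growing from the front
    private int recordCount = 0;  // number of offset slots written at the end

    public SlottedPage(int pageSize) {
        this.page = ByteBuffer.allocate(pageSize);  // e.g., 32KB as used in Hyracks
    }

    /** Appends a variable-length record; returns false if the page is full. */
    public boolean append(byte[] record) {
        int slotPos = page.capacity() - 4 * (recordCount + 1);
        if (recordEnd + record.length > slotPos) {
            return false;  // not enough room for the record plus its offset slot
        }
        page.position(recordEnd);
        page.put(record);
        page.putInt(slotPos, recordEnd);  // store the record's start offset in its slot
        recordEnd += record.length;
        recordCount++;
        return true;
    }

    /** Returns the start offset of the i-th record, read from its slot. */
    public int recordOffset(int i) {
        return page.getInt(page.capacity() - 4 * (i + 1));
    }

    public int recordCount() {
        return recordCount;
    }

    /** Exposes the underlying byte array so accessors can bind to record regions. */
    public byte[] bytes() {
        return page.array();
    }
}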

2.3.2 Data Processor Design: Access Buffers

To support programming with the proposed buffer-based memory management, we propose an accessor-based programming pattern. Instead of creating a heap data structure to contain data items and represent their logical relationships, we propose to define an accessor structure that consists of multiple accessors, each used to access a different type of data. As such, we need only a very small number of accessor structures to process all data, leading to a significantly reduced number of heap objects. In this subsection, we discuss a transformation that can transform regular data structure classes into their corresponding accessor classes. We will also describe the execution model using a few examples.

2.3.2.1 Design Transformations

For each data item class D in a regular object-oriented design, we transform D into an accessor class Da. Whether a class is a data item class can be specified by the developer. The steps that this transformation includes are as follows.

• S1: for each data item field f of type F in D, we add a field fa of type Fa into Da, where Fa is the accessor class of F. For each non-data-item field f' in D, we copy it directly into Da.

• S2: add a public method set(byte[] data, int start, int length) into Da. The goal of this method is to bind the accessor to a specific byte region where the data items of type D are located. The method can be implemented either by doing eager materialization, which recursively binds the binary regions for all its member accessors, or by doing lazy materialization, which defers such bindings to the point where the member accessors are actually needed.

• S3: for each method M in D, we first create a method Ma which duplicates M in Da. Next, Ma's signature is modified so that all data-item-type parameters are changed to use their corresponding data accessor types. A parameter accessor provides a way for accessing and manipulating the bound data items in the provided byte regions.

Note that transforming a regular object-oriented design into the above design should be done at the early development stage of a Big Data application in order to avoid the potential development overhead of re-designing after the implementation. Future work will develop compiler support that can automatically transform designs to make them compatible with our memory system.
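To make steps S1-S3 concrete, the following sketch applies them to a tiny hypothetical data item class. The Point/PointAccessor names, the fixed field layout (an 8-byte ID followed by two 8-byte doubles), and the big-endian byte decoding are illustrative assumptions rather than part of the Hyracks library. A conventional design might define a class Point { long id; double x; double y; double distance(Point other) {...} }; under the transformation it becomes an accessor that is bound to a byte region instead of holding fields:

public class PointAccessor {
    private byte[] data;
    private int start;

    /** S2: bind this accessor to the byte region holding one point (id, x, y). */
    public void set(byte[] data, int start, int length) {
        this.data = data;
        this.start = start;
    }

    public long getId() { return readLong(start); }
    public double getX() { return Double.longBitsToDouble(readLong(start + 8)); }
    public double getY() { return Double.longBitsToDouble(readLong(start + 16)); }

    /** S3: the data-item-typed parameter becomes an accessor-typed parameter. */
    public double distance(PointAccessor other) {
        double dx = getX() - other.getX();
        double dy = getY() - other.getY();
        return Math.sqrt(dx * dx + dy * dy);
    }

    // Decode 8 big-endian bytes starting at pos into a long.
    private long readLong(int pos) {
        long v = 0;
        for (int i = 0; i < 8; i++) {
            v = (v << 8) | (data[pos + i] & 0xFF);
        }
        return v;
    }
}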

2.3.2.2 Execution Model

At run time, we form a set of accessor graphs, and each accessor graph processes a batch of top-level records. Each node in the graph is an accessor object for a field and each edge represents a "member field" relationship. An accessor graph has the same skeleton as its corresponding heap data structure but does not store any data internally. We let pages flow through the accessor graphs, where an accessor binds to and processes a data item record one at a time. For each thread in the program, the number of accessor graphs needed is equal to the number of data structure types in the program, which is statically bounded. Different instances of a data structure can be processed by the same accessor graph.

If one uses eager materialization in the accessor implementation, the number of accessor objects that need to be created during a data processing task is equal to the total number of nodes in all the accessor graphs. If lazy materialization is chosen to implement accessors, the number of created accessor objects can be significantly reduced, because a member accessor can often be reused for accessing several different data items of the same type. In some cases, additional accessor objects are needed for methods that operate on multiple data items of the same type. For example, a compare method defined in a data item class compares two argument data items, and hence the transformed version of the method needs two accessor objects at run time to perform the comparison. Despite these different ways of implementing accessors, the number of accessor objects needed is always bounded at compile time and does not grow proportionally with the cardinality of the dataset.

2.3.2.3 A Running Example

Following the three steps, we manually transform the vertex example in Section 2.2.1 into the following form.

public abstract class EdgeListVertexAccessor<
        /** by S1: */
        I extends WritableComparableAccessor,
        V extends WritableAccessor,
        E extends WritableAccessor,
        M extends WritableAccessor>
        extends MutableVertexAccessor {

    private I vertexId = null;
    private V vertexValue = null;

    /** by S1: indices of its outgoing edges */
    private ListAccessor destEdgeIndexList = new ArrayListAccessor();

    /** by S1: values of its outgoing edges */
    private ListAccessor destEdgeValueList = new ArrayListAccessor();

    /** by S1: incoming messages from the previous iteration */
    private ListAccessor msgList = new ArrayListAccessor();

    ......

    /**
     * by S2: binds the accessor to a binary region of a vertex
     */
    public void set(byte[] data, int start, int length) {
        /* This may in turn call the set method of its member objects. */
        ......
    }

    /** by S3: replacing the return type */
    public ListAccessor getEdgeIndexes() {
        ...
    }
}

Figure 2.4: A heap snapshot of the example. [The figure shows the vertex data laid out in a page; the EdgeListVertexAccessor and its member accessors (vertexId, vertexValue, destEdgeIndexList, destEdgeValueList, msgList) are bound to the byte region of one vertex at a time via set(byte[] data, int start, int length).]

In the above code snippet, we highlight the modified code and add the transformation steps as comments. Figure 2.4 shows a heap snapshot of the running example. The actual data are laid out in pages while an accessor graph is used to process all the vertices, one at a time.

For each vertex, the set method binds the accessor graph to its byte region.

2.4 Programming Experience

The bloat-aware design paradigm separates the logical data access and the physical data storage to achieve both compact in-memory data layout and efficient memory management. However, it appears that programming with binary data is a daunting task that creates much burden for the developers. To help reduce the burden, we have developed a comprehensive data management library in Hyracks. In this section, we describe case studies and report our own experience with three real-world projects to show the development effort under this design.

2.4.1 Case Study 1: In-Memory Sort

In this case study, we compare the major parts of two implementations of in-memory sort algorithms, one in Hyracks manipulating binary data using our library, and the other manipulating objects using the standard Java library (e.g., Collections.sort()). The code shown below is the method siftDown, which is the core component of JDK's sort algorithm.

private static void siftDown(Accessor acc, int start, int end) {
    for (int parent = start; parent < end; ) {
        int child1 = start + (parent - start) * 2 + 1;
        int child2 = child1 + 1;
        int child = (child2 > end) ? child1
                : (acc.compare(child1, child2) < 0) ? child2 : child1;
        if (child > end)
            break;
        if (acc.compare(parent, child) < 0)
            acc.swap(parent, child);
        parent = child;
    }
}

A similar implementation using our library is a segment in the sort method, shown as follows:

private void sort(int[] tPointers, int offset, int length) {
    ......
    while (true) {
        ......
        while (c >= b) {
            int cmp = compare(tPointers, c, mi, mj, mv);
            if (cmp < 0) {
                break;
            }
            if (cmp == 0) {
                swap(tPointers, c, d--);
            }
            --c;
        }
        if (b > c)
            break;
        swap(tPointers, b++, c--);
    }
    ......
}

The compare method eventually calls a user-defined comparator implementation in which two accessors are bound to the two input byte regions, one comparison at a time, for the actual comparison. The following code snippet is an example comparator implementation.

public class StudentBinaryComparator implements IBinaryComparator {
    private StudentAccessor acc1 = StudentAccessor.getAvailableAccessor();
    private StudentAccessor acc2 = StudentAccessor.getAvailableAccessor();

    public int compare(byte[] data1, int start1, int len1,
                       byte[] data2, int start2, int len2) {
        acc1.set(data1, start1, len1);
        acc2.set(data2, start2, len2);
        return acc1.compare(acc2);
    }
}

Method getAvailableAccessor can be implemented in different ways, depending on the materialization mode. For example, it can directly return a new StudentAccessor object or use an object pool to cache and reuse old objects. As one can tell, the data access code paths in both implementations are well encapsulated within libraries, which allows application developers to focus on the business logic (when to call compare and swap) rather than writing code to access data. As such, the two different implementations have comparable numbers of lines of code.
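For concreteness, a pooled variant of getAvailableAccessor could look like the sketch below; this is only one possible implementation under our own assumptions (the per-thread pool and the release method are illustrative, not the actual library code):

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only: an accessor factory with a simple per-thread object pool.
public class StudentAccessor {
    private static final ThreadLocal<Deque<StudentAccessor>> POOL =
            new ThreadLocal<Deque<StudentAccessor>>() {
                @Override protected Deque<StudentAccessor> initialValue() {
                    return new ArrayDeque<StudentAccessor>();
                }
            };

    public static StudentAccessor getAvailableAccessor() {
        Deque<StudentAccessor> pool = POOL.get();
        // Reuse a cached accessor when possible; otherwise allocate a new one.
        return pool.isEmpty() ? new StudentAccessor() : pool.pop();
    }

    public static void release(StudentAccessor acc) {
        POOL.get().push(acc);   // return the accessor to the pool for later reuse
    }

    public void set(byte[] data, int start, int length) {
        // bind this accessor to the byte region of one student record (omitted)
    }

    public int compare(StudentAccessor other) {
        // compare the two bound byte regions field by field (omitted)
        return 0;
    }
}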

2.4.2 Case Study 2: Print Records

The second case study prints data records from a byte array using a given format. A typical Java implementation is as follows.

private void printData(byte[] data) throws IOException {
    BufferedReader input = new BufferedReader(new InputStreamReader(
            new DataInputStream(new ByteArrayInputStream(data))));
    String line = null;
    while ((line = input.readLine()) != null) {
        String[] fields = line.split(",");
        // print the record
        ......
    }
    input.close();
}

The above implementation reads String objects from the input, splits them into fields, and prints them out. In this implementation, the number of created String objects is proportional to the cardinality of records multiplied by the number of fields per record. The following code snippet is our printer implementation using the bloat-aware design paradigm.

private void printData(byte[] data) {
    PageAccessor pageAcc = PageAccessor.getAvailableAccessor(data);
    RecordPrintingAccessor recordAcc = RecordPrintingAccessor.getAvailableAccessor();
    while (pageAcc.nextRecord()) {
        // print the record in the set call
        recordAcc.set(toBinaryRegion(pageAcc));
    }
}

In our implementation, we build one accessor graph for printing, let binary data flow through the accessor graph, and print every record by traversing the accessor graph and calling the set method on each accessor. It is easy to see that the number of created objects is bounded by the number of nodes of the accessor graph while the programming effort is not significantly increased.

2.4.3 Big Data Projects using the Bloat-Aware Design

The proposed design paradigm has already been used in six Java-based open source Big Data processing systems, listed as follows:

• Hyracks [42] is a data parallel platform that runs data-intensive jobs on a cluster of shared-nothing machines. It executes jobs in the form of directed acyclic graphs that consist of user-defined data processing operators and data redistribution connectors.

• Algebricks [41] is a generalized, extensible, data model-agnostic, and language-independent algebra/optimization layer which is intended to facilitate the implementation of new high-level languages that process Big Data in parallel.

• AsterixDB [29] is a Big Data management system for semi-structured data, which is built on top of Algebricks and Hyracks.

• VXQuery [27] is a data-parallel XQuery processing engine on top of Algebricks and Hyracks as well.

• Pregelix (to be described in Chapter 3) is a Big Graph analytics platform that supports the bulk-synchronous vertex-oriented programming model [75]. It internally uses Hyracks as the run-time execution engine.

• Hivesterix [14] is a SQL-like layer on top of Algebricks and Hyracks; it reuses the modules of the grammar and first-order run-time functions from HiveQL [92].

Project      Overall #LOC   Control #LOC   Data #LOC   Data #LOC Percentage
Hyracks            125930          71227       54803                 43.52%
Algebricks          40116          36033        4083                 10.17%
AsterixDB          140013          93071       46942                 33.53%
VXQuery             45416          19224       26192                 57.67%
Pregelix            18411          11958        6453                 35.05%
Hivesterix          18503          13910        4593                 33.01%
Overall            388389         245323      143066                 36.84%

Table 2.2: The line-of-code statistics of the Big Data projects that use the bloat-aware design.

In these six projects, the proposed bloat-aware design paradigm is used in the data pro- cessing code paths (e.g., the runtime data processing operators such as join, group-by, and sort, as well as the user-defined runtime functions such as plus, minus, sum, and count) while the control code paths (e.g., the runtime control events, the query language parser, the algebra rewriting/optimization, and the job generation module) still use the traditional object-oriented design. Note that for Big Data applications, usually the control path is well isolated from the data path changes (e.g., data format changes) because they execute on different machines — the control path is on the master (controller) machines while the data path is on the slave machines.

We show a comparison of the lines-of-code (LOC) of the control paths and the data paths in those projects, as listed in Table 2.2. On average, the data path takes about 36.84% of the code base, which is much smaller than the size of the control path. Based on the feedback from the development teams, programming and debugging the data processing components takes approximately 2× as much time as doing so with the traditional object-oriented design. However, overall, since the data processing components take only 36.84% of the total development effort, the actual development overhead should be much smaller than 2×.

2.4.4 Future Work

The major limitation of the proposed technique is that it requires developers to manipulate a low-level, binary representation of data, leading to increased difficulty in programming and debugging. We are in the process of adding another level of indirection to address this issue—we are developing annotation and compiler support to allow for declarative specifications of data structures. The resulting system will allow developers to annotate data item classes and provide relational specifications. The compiler will automatically generate code that uses the Hyracks library from the user specifications. We hope this system will enable developers to build high-performance Big Data applications with low human effort.
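To make this direction concrete, the snippet below shows the kind of annotated class the envisioned compiler support might accept; the @DataItem and @Field annotations are purely hypothetical and are defined here only for the sake of the sketch:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotations -- invented for this sketch, not an existing Hyracks feature.
@Retention(RetentionPolicy.SOURCE)
@Target(ElementType.TYPE)
@interface DataItem {}

@Retention(RetentionPolicy.SOURCE)
@Target(ElementType.FIELD)
@interface Field { int order(); }

// A data item class carrying the envisioned declarative specification; the planned
// compiler would generate the corresponding accessor class (set/compare/getters)
// over a compact binary layout implied by the declared field order.
@DataItem
class Student {
    @Field(order = 1) int id;
    @Field(order = 2) double gpa;
}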

2.5 Performance Evaluation

This section presents a set of experiments focusing on performance comparisons between implementations with and without the proposed design paradigm. All experiments were conducted on a cluster of 10 IBM research machines. Each machine has a 4-core Intel Xeon 2.27 GHz processor, 12GB RAM, and four 300GB 10,000 rpm SATA disk drives, and runs CentOS 5.5 and the Java HotSpot(TM) 64-bit server VM (build 17.0-b16, mixed mode). The JVM command line option "-Xmx10g" was used for all our experiments. We use a parallel generational GC in HotSpot, which combines parallel Scavenge (i.e., copying) for the young generation and parallel Mark-Sweep-Compact for the old generation. We collected the application running times and the JVM heap usage statistics on all 10 machines; their averages are reported in this section. The rest of this section presents experimental studies of our design paradigm on the overall scalability (Section 2.5.1), the packing factors (Section 2.5.2), and the GC costs (Section 2.5.3).

Figure 2.5: PageRank Performance on Pregelix. [Plots: (a) Overall Performance, (b) GC Time, and (c) JVM Heap Usage, each as a function of dataset size (GB).]

2.5.1 Effectiveness On Overall Scalability: Using PageRank

The Pregelix system (to be presented in Chapter 3) supports nearly the same user interfaces and functionalities as Giraph, but uses the bloat-aware design paradigm. Internally, Pregelix employs a Hyracks runtime dataflow pipeline that contains several data processing operators (i.e., sort, group-by and join) and connectors (i.e., the m-to-n hash partitioning merging connector) to execute graph processing jobs. Pregelix supports out-of-core computations, as it uses disk-based operators (e.g., external sort, external group-by, and index outer join) in the dataflow to deal with large amounts of data that cannot be accommodated in main memory. In Pregelix, both the data storage design and the data processor design are enabled: it stores all data in pages but employs accessors during data processing. The code shown in Section 2.3.2 is exactly what we used to process vertices.

In order to quantify the potential efficiency gains, we compared the performance of PageRank between Pregelix and Giraph. The goal of this experiment is not to understand the out- of-core performance, but rather to investigate the impact of memory bloat. Therefore, the PageRank algorithm was performed on a subset of Yahoo!’s publicly available AltaVista Web Page Hyperlink Connectivity Graph dataset [105], a snapshot of the World Wide Web from 2002. The subset consists of a total of 561, 365, 953 vertices, each of which represents a web page. The size of the decompressed data is 27.8GB, which, if distributed appropriately, should easily fit into the memory of the 10 machines (e.g., 2.78GB for each machine). Each record in the dataset consists of a source vertex identifier and an array of destination vertex identifiers forming the links among different webpages in the graph.

In the experiment, we also ran PageRank with two smaller (1/100 and 1/10 of the original) datasets to obtain the scale-up trend. Figure 2.5 (a), (b), and (c) show, respectively, the overall execution times for the processing of the three datasets, their GC times, and their heap usages. It is easy to see that the total GC time for each dataset is less than 3% of the overall execution time. The JVM heap size is obtained by measuring the overall JVM memory consumption from the operating system. In general, it is slightly larger than the raw data size for each machine (i.e., 2.78GB). This is because extra memory is needed to (1) store object metadata (as we still have objects) and (2) sort and group messages. We also ran the Giraph PageRank implementation on the same three input datasets, but could not succeed for any of the datasets. Giraph crashed with java.lang.OutOfMemoryError in all the cases. We confirmed from the Giraph developers that the crash was because of the skewness (e.g., www.yahoo.com has a huge number of inbound links and messages) in

the Yahoo! AltaVista Web Page Hyperlink Connectivity Graph dataset, combined with the object-based data representation.

2.5.2 Effectiveness On Packing Factor: Using External Sort

As the second experiment, we compared the performance of two implementations of the standard external sort algorithm [84] on Hyracks. One uses Java objects to represent data records, while the other employs Java memory pages. The goal here is to understand the impact of the low packing factor of the object-based data representation on performance as well as the benefit of the bloat-aware design paradigm.

In the Java object-based implementation, the operator takes deserialized records (i.e., objects) as input and puts them into a list. Once the number of buffered records exceeds a user-defined threshold, method sort is invoked on the list and then the sorted list is dumped to a run file. The processing of the incoming data results in a number of run files. Next, the merge phase starts. This phase reads and merges run files using a priority queue to produce output records. During the merge phase, run files are read into main memory one block (32KB) at a time. If the number of files is too large and one pass cannot merge all of them, multiple merge passes need to be performed.
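The run-generation half of that object-based implementation can be sketched roughly as follows; the String record type, the file naming, and the class structure are simplifying assumptions for illustration rather than the actual Hyracks operator code:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Simplified sketch of object-based run generation for external sort.
class ObjectRunGenerator {
    private final int threshold;                 // max number of buffered records
    private final List<String> buffer = new ArrayList<String>();
    private int runCount = 0;

    ObjectRunGenerator(int threshold) {
        this.threshold = threshold;
    }

    void add(String record) throws IOException {
        buffer.add(record);
        if (buffer.size() >= threshold) {
            flushRun();                          // sort the buffer and dump a run file
        }
    }

    private void flushRun() throws IOException {
        Collections.sort(buffer);                // in-memory sort over record objects
        BufferedWriter out =
                new BufferedWriter(new FileWriter("run-" + runCount++ + ".tmp"));
        try {
            for (String r : buffer) {
                out.write(r);
                out.newLine();
            }
        } finally {
            out.close();
        }
        buffer.clear();                          // run files are later merged with a priority queue
    }
}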

In the page-based implementation, we never load records from pages into objects. Therefore, the page-based storage removes the header/pointer overhead and improves the packing factor. We implemented a quick sort algorithm to sort in-memory records at the binary level. In this experiment, we used the TPC-H5 lineitem table as the input data, where each record represents an item in a transaction order. We generated TPC-H data at 10×, 25×, 50× and 100× scales, which correspond to 7.96GB, 19.9GB, 39.8GB and 79.6GB line-item tables, respectively. Each dataset was partitioned among the 10 nodes in a round-robin manner.

5 The standard data warehousing benchmark, http://www.tpc.org/tpch/

Figure 2.6: ExternalSort Performance. [Plots comparing the page-based and object-based implementations: (a) Overall Performance, (b) GC Time, and (c) JVM Heap Usage, each as a function of dataset size (GB).]

Particularly, it was divided into 40 partitions, one partition per disk drive (recall that each machine has 4 disk drives).

The task was executed as follows: we created a total of 40 concurrent sorters across the cluster, each of which read a partition of data locally, sorted it, and wrote the sorted data back to disk. In this experiment, the page-based implementation used a 32MB sort buffer. The object-based implementation used 5000 as the maximal number of in-memory records. The results of the two different external sort implementations are plotted in Figure 2.6.

Figure 2.7: Hash-based Grouping Performance. [Plots comparing the page-based and object-based implementations: (a) Overall Performance, (b) GC Time, and (c) JVM Heap Usage, each as a function of dataset size (GB).]

The run-time statistics include the overall execution time (Figure 2.6 (a)), the GC overhead (Figure 2.6 (b)), and the JVM heap usage (Figure 2.6 (c)). Note that the Scavenge and Mark- Sweep lines show the GC costs for the young generation and the old generation, respectively. Because sorting a single memory buffer (either 32MB pages or 5000 records) is fast, most data objects are short-lived and small, and their lifetimes are usually limited within the iterations where they are created. They rarely get copied into old generations and thus the nursery scans dominate the GC effort.

From Figure 2.6 (a), it is clear that as the size of the dataset increases, the page-based implementation scales much better (e.g., 2.5× faster on the 79.6GB input) than the object-based implementation, and the improvement factor keeps increasing with the dataset size. The following two factors may contribute to this improvement: (1) due to the low packing factor, the object-based implementation often has more merge passes (hence more I/O) than the page-based implementation, and (2) the use of objects to represent data items leads to an increased JVM heap size and hence the operating system has less memory for the file system cache. For example, in the case where the size of the input data is 79.6GB, the page-based implementation has only one merge pass while the object-based one has two merge passes. The page-based implementation also has much less GC time than the object-based implementation (Figure 2.6 (b)). Figure 2.6 (c) shows that the amount of memory required by the object-based implementation is almost 4× larger than that of the page-based implementation, even though the former holds much less actual data in memory (i.e., as reflected by the need for an additional merge phase).

2.5.3 Effectiveness On GC Costs: Using Hash-Based Grouping

The goal of the experiment described in this subsection is to understand the GC overhead in the presence of large volumes of objects and references. The input data and scales for this experiment are the same as those in Section 2.5.2. For comparison, we implemented a logical SQL query by hand-coding the physical execution plan and runtime operators, in order to count how many items there are in each order:

SELECT l_orderkey, COUNT(*) AS items FROM lineitem GROUP BY l_orderkey;

We created two different implementations of the underlying hash-based grouping as Hyracks operators. They were exactly the same except that one of them used the object-based hash table (e.g., java.util.Hashtable) for grouping and the other used a page-based hash table implementation (e.g., all the intermediate states along with the grouping keys were stored in pages). The hash table size (number of buckets) was 10,485,767 for this experiment, because larger hash tables are often encouraged in Big Data applications to reduce collisions.

The Hyracks job of this task had four operators along the pipeline: a file scan operator (FS), a hash grouping operator (G1), another hash grouping operator (G2), and a file write operator (FW). In the job, FS was locally connected to G1, and then the output of G1 was hash partitioned and fed to G2 (using an m-to-n hash partitioning connector). Finally, G2 was locally connected to FW. Each operator had 40 clones across the cluster, one per partition. Note that G1 was used to group data locally and thereby compress the data before it was sent over the network. G2 performed the final grouping and aggregation.
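To make the object-creation pattern of the object-based variant concrete, a much-simplified local grouping step (in the spirit of G1) might look like the sketch below; the use of boxed Long keys and long[] count holders is an illustrative assumption, not the exact operator code:

import java.util.Hashtable;
import java.util.Map;

// Simplified sketch of object-based local grouping: COUNT(*) per l_orderkey.
class ObjectBasedGrouper {
    // Each distinct key allocates a boxed Long, a long[] holder, and a hash table entry,
    // so the number of long-lived heap objects grows with the number of groups.
    private final Map<Long, long[]> counts = new Hashtable<Long, long[]>(10485767);

    void accumulate(long orderKey) {
        Long key = orderKey;                 // autoboxing allocates a key object
        long[] holder = counts.get(key);
        if (holder == null) {
            holder = new long[] { 0L };      // one more small object per group
            counts.put(key, holder);
        }
        holder[0]++;
    }

    void emit() {
        for (Map.Entry<Long, long[]> e : counts.entrySet()) {
            System.out.println(e.getKey() + "," + e.getValue()[0]);
        }
    }
}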

Figure 2.7 shows the performance of the two different hash-based grouping implementations over the four input datasets. The measurements include the overall performance (Figure 2.7 (a)), the GC time (Figure 2.7 (b)), and the JVM heap size (Figure 2.7 (c)). From Figure 2.7 (a), one can see that the implementation with the page-based hash table scales much better to the size of data than the other one: in the case where the size of input data is 79.6GB, the former is already 3.5× faster than the latter and the improvement factor keeps growing. In Figure 2.7 (b), we can clearly see that the use of the object-based hash table makes the GC costs go up to 47% of the overall execution time while the page-based hash table does not add much overhead (<3% of the overall execution time). A similar observation can be made on the memory consumption: Figure 2.7 (c) shows that the object-based implementation leads to much larger memory consumption than the page-based one.

Summary Our experimental results clearly demonstrate that, for many data-processing tasks, the bloat-aware design paradigm can lead to far better performance. The source code of all the experiments can be found at: http://code.google.com/p/hyracks.

2.6 Related Work

There exists a large body of work on efficient memory management techniques and data processing algorithms. This section discusses only those that are most closely related to the proposed work. These techniques are classified into three categories: data-processing infrastructures implemented in managed languages, software bloat analysis, and region-based memory management systems.

Java-based data-processing infrastructures Telegraph [88] is a data management system implemented in Java. It uses native byte arrays to store data and has its own mem- ory manager for object allocation, deallocation, and memory de-fragmentation. However, because it is built on top of native memory, developers lose various benefits of a managed language and have to worry about low-level memory correctness issues such as dangling pointers, buffer overflow, and memory leaks. Our approach overcomes the problem by build- ing applications on top of the Java managed memory, providing high efficiency and yet retaining most of the benefits provided by a managed runtime.

Hadoop [10] is a widely-used open source MapReduce implementation. Although it provides a certain level of object reuse for sorting algorithms, its (major) map and reduce interfaces are object-based. The user has to create a great number of objects to implement a new algorithm on top of Hadoop. For example, the reduce-side join implementation in Hive [92] uses a Java HashMap in the reduce function to hold the inner branch records, which is very likely to suffer from the same performance problem (i.e., high GC costs) as discussed in Section 2.2.

Other data-processing infrastructures such as Spark [108] and Storm [95] also make heavy use of Java objects in both their core data-processing modules and their application programming interfaces, and thus they are vulnerable to memory bloat as reported in this chapter.

Software bloat analysis Software bloat analysis [80, 79, 102, 100, 77, 89, 103, 99, 31, 78, 101, 104, 98] attempts to find, remove, and prevent performance problems due to inefficiencies in code execution and in the use of memory. Prior work [80, 79] proposes metrics to provide a performance assessment of the use of data structures. Their observation that a large portion of the heap is not used to store data is also confirmed in our study. In addition to measuring memory usage, our work proposes optimizations specifically targeting the problems we found, and our experimental results show that these optimizations are very effective.

Work by Dufour et al. [55] uses a blended escape analysis to characterize and find excessive use of temporary data structures. By approximating object lifetimes, the analysis has been shown to be useful in classifying the usage of newly created objects in the problematic areas. Shankar et al. propose Jolt [89], an approach that makes aggressive method inlining decisions based on the identification of regions that make extensive use of temporary objects. Work by Xu et al. [100] detects memory bloat by profiling copy activities, and their later work [99] looks for high-cost-low-benefit data structures to detect execution bloat. Our work is the first attempt to analyze bloat under the context of Big Data applications and perform effective optimizations to remove bloat.

Region-based memory management Region-based memory management was first used in the implementations of functional languages [94, 28] such as Standard ML [67], and then was extended to Prolog [74], C [65, 68, 60, 61], and real-time Java [39, 72, 43]. More recently, some mark-region hybrid methods such as Immix [40] combine tracing GC with

regions to improve GC performance for Java. Our work uses a region-based approach to manage pages, but in a different context — the emerging Big Data applications. Data are stored in pages in binary form, leading to both an increased packing factor and decreased memory management costs.

Value types Expanded types in Eiffel and value types in C# are used to declare data with simple structures. However, these types cannot solve the entire bloat problem. In these languages, objects still need to be used to represent data with complicated structures, such as hash maps or lists. The scalability issues that we found in Big Data applications are primarily due to inefficiencies inherent in the object-oriented system design, rather than problems with any specific implementation of a managed language.

2.7 Summary

This chapter presents a bloat-aware design paradigm for Java-based Big Data applications. The study starts with a quantitative analysis of memory bloat using real-world examples, in order for us to understand the impact of excessive object creation on the memory consump- tion and GC costs. To alleviate this negative influence, we propose a bloat-aware design paradigm, including: merging small objects and accessing data at the binary level. We have performed an extensive set of experiments, and the experimental results have demon- strated that implementations following the design paradigm have much better performance and scalability than the applications that use regular Java objects. The design paradigm can be applied to other managed languages such as C# as well. We believe that the results of this work demonstrate the viability of implementing efficient Big Data applications in a managed, object-oriented language, and open up possibilities for the programming language and systems community to develop novel optimizations targeting data-intensive computing.

This chapter provides a foundation for the next two chapters of this thesis — both the Big Graph analytics system that we will describe in Chapter 3 and the rich(er) graph analytics support in AsterixDB that we will present in Chapter 4 use the bloat-aware design paradigm in their implementations. The details of where and how the design paradigm is used in those two systems can be found in Section 3.5.4 and Section 4.3.3.

Chapter 3

Pregelix: Big(ger) Graph Analytics on A Dataflow Engine

3.1 Overview

The basic toolkits provided by first-generation “Big Data” Analytics platforms (like Hadoop) lack an essential feature for Big Graph Analytics: MapReduce does not support iteration (or equivalently, recursion) or certain key features required to efficiently iterate “around” a MapReduce program. Moreover, the MapReduce programming model is not ideal for expressing many graph algorithms. This shortcoming has motivated several specialized ap- proaches or libraries that provide support for graph-based iterative programming on large clusters.

Google’s Pregel is a prototypical example of such a platform; it allows problem-solvers to “think like a vertex” by writing a few user-defined functions (UDFs) that operate on ver- tices, which the framework can then apply to an arbitrarily large graph in a parallel fashion. Open source versions of Pregel have since been developed in the systems community [9][11].

Perhaps unfortunately for both their implementors and users, each such platform is a distinct new system that had to be built from the ground up. Moreover, these systems follow a process-centric design, in which a set of worker processes are assigned partitions (containing sub-graphs) of the graph data and scheduled across a machine cluster. When a worker process launches, it reads its assigned partition into memory, and executes the Pregel (message passing) algorithm. As we will see, such a design can suffer from poor support for problems that are not memory resident. Also desirable would be the ability to consider alternative runtime strategies that could offer more efficient executions for different sorts of graph algorithms, datasets, and clusters.

The database community has spent nearly three decades building efficient shared-nothing parallel query execution engines [54] that support out-of-core data processing operators (such as join and group-by [64]), and query optimizers [47] that choose an "optimal" execution plan among different alternatives. In addition, deductive database systems—based on Datalog—were proposed to efficiently process recursive queries [37], which can be used to solve graph problems such as transitive closure. However, there is no scalable implementation of Datalog that offers the same fault-tolerant properties supported by today's "Big Data" systems (e.g., [75, 10, 38, 42, 108]). Nevertheless, techniques for evaluating recursive queries—most notably semi-naïve evaluation—still apply and can be used to implement a scalable, fault-tolerant Pregel runtime.

In this chapter, we present Pregelix, a large-scale graph analytics system that we began building in 2011. Pregelix takes a novel set-oriented, iterative dataflow approach to imple- menting the user-level Pregel programming model. It does so by treating the messages and vertex states in a Pregel computation like tuples with a well-defined schema; it then uses database-style query evaluation techniques to execute the user’s program. From a user’s per- spective, Pregelix provides the same Pregel programming abstraction, just like Giraph [9]. However, from a runtime perspective, Pregelix models Pregel’s semantics as a logical query

plan and implements those semantics as an iterative dataflow of relational operators that treat message exchange as a join followed by a group-by operation that embeds functions that capture the user's Pregel program. By taking this approach, for the same logical plan, Pregelix is able to offer a set of alternative physical evaluation strategies that can fit various workloads and can be executed by Hyracks [42], a general-purpose shared-nothing dataflow engine (which is also the query execution engine for AsterixDB [29]). By leveraging existing implementations of data-parallel operators and access methods from Hyracks, we have avoided the need to build many critical system components, e.g., bulk-data network transfer, out-of-core operator implementations, buffer managers, index structures, and data shuffle.

To the best of our knowledge, Pregelix is the only Pregel-like system that supports the full Pregel API, runs both in-memory workloads and out-of-core workloads efficiently in a transparent manner on shared-nothing clusters, and provides a rich set of runtime choices. This chapter makes the following contributions:

• An analysis of existing Pregel-like systems: We revisit the Pregel programming abstraction and illustrate some shortcomings of typical custom-constructed Pregel-like systems (Section 3.2).

• A new Pregel architecture: We capture the semantics of Pregel in a logical query plan (Section 3.3), allowing us to execute Pregel as an iterative dataflow.

• A system implementation: We first review the relevant building blocks in Hyracks (Section 3.4). We then present the Pregelix system, elaborating the choices of data storage and physical plans as well as its key implementation details (Section 3.5).

• Case studies: We briefly describe several current use cases of Pregelix from our initial user community (Section 3.6).

• Experimental studies: We experimentally evaluate Pregelix in terms of execution time, scalability, throughput, plan flexibility, and implementation effort (Section 3.7).

3.2 Background and Problems

In this section, we first briefly revisit the Pregel semantics and the Google Pregel runtime (Section 3.2.1) as well as the internals of Giraph, an open source Pregel-like system (Section 3.2.2), and then discuss the shortcomings of such custom-constructed Pregel architectures (Section 3.2.3).

3.2.1 Pregel Semantics and Runtime

Pregel [75] was inspired by Valiant’s bulk-synchronous parallel (BSP) model [62]. A Pregel program describes a distributed graph algorithm in terms of vertices, edges, and a sequence of iterations called supersteps. The input to a Pregel computation is a directed graph consisting of edges and vertices; each vertex is associated with a mutable user-defined value and a boolean state indicating its liveness; each edge is associated with a source and destination vertex and a mutable user-defined value. During a superstep S, a user-defined function

(UDF) called compute is executed at each active vertex V , and can perform any or all of the following actions:

• Read the messages sent to V at the end of superstep S − 1;

• Generate messages for other vertices, which will be exchanged at the end of superstep S;

• Modify the state of V and its outgoing edges;

• Mutate the graph topology;

• Deactivate V from the execution.

Initially, all vertices are in the active state. A vertex can deactivate itself by “voting to halt” in the call to compute using a Pregel provided method. A vertex is reactivated immediately if it receives a message. A Pregel program terminates when every vertex is in the inactive state and no messages are in flight.
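As a concrete illustration of these semantics, a simplified PageRank-style compute could be written as in the sketch below; the class layout, the MessageSender interface, and the fixed 30-superstep cutoff are schematic assumptions rather than the API of any particular Pregel implementation:

// Schematic PageRank compute() in a Pregel-style model (types and names are illustrative).
class PageRankVertex {
    double value;                 // mutable user-defined vertex value (the rank)
    long[] outEdges;              // destination vertex ids of the outgoing edges
    boolean halted;               // liveness state of the vertex

    interface MessageSender {
        void send(long destVertexId, double payload);
    }

    void compute(int superstep, Iterable<Double> messages, MessageSender sender) {
        if (superstep > 0) {
            double sum = 0.0;
            for (double m : messages) {      // read messages from superstep S - 1
                sum += m;
            }
            value = 0.15 + 0.85 * sum;       // update the vertex value (simplified PageRank)
        }
        if (superstep < 30) {
            if (outEdges.length > 0) {       // send messages for superstep S + 1
                double share = value / outEdges.length;
                for (long dest : outEdges) {
                    sender.send(dest, share);
                }
            }
        } else {
            halted = true;                   // vote to halt; reactivated if a message arrives
        }
    }
}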

In a given superstep, any number of messages may be sent to a given destination. A user-defined combine function can be used to pre-aggregate the messages for a destination. In addition, an aggregation function (e.g., min, max, sum, etc.) can be used to compute a global aggregate among a set of participating vertices. Finally, the graph structure can be modified by any vertex; conflicts are handled by using a partial ordering of operations such that all deletions go before insertions, and then by using a user-defined conflict resolution function.

The Google Pregel runtime consists of a centralized master node that coordinates superstep executions on a cluster of worker nodes. At the beginning of a Pregel job, each worker loads an assigned graph partition from a distributed file system. During execution, each

worker calls the user-defined compute function on each active vertex in its partition, passing in any messages sent to the vertex in the previous superstep; outgoing messages are ex- changed among workers. The master is responsible for coordinating supersteps and detecting termination. Fault-tolerance is achieved through checkpointing at user-specified superstep boundaries.

3.2.2 Apache Giraph

Apache Giraph [9] is an open source project that implements the Pregel specification in Java on the Hadoop infrastructure. Giraph launches master and worker instances in a Hadoop map-only job1, where map tasks run master and worker instances. Once started, the master and worker map tasks internally execute the iterative computation until completion, in a similar manner to Google’s Pregel runtime. Figure 3.1 depicts Giraph’s process-centric runtime for implementing the Pregel programming model. The vertex data is partitioned across worker tasks (two in this case). Each worker task communicates its control state (e.g., how many active vertices it owns, when it has completed executing a given superstep, etc.)

1 Alternatively, Giraph can use YARN [96] for resource allocation.

to the master task. The worker tasks establish communication channels between one another for exchanging messages that get sent during individual vertex compute calls; some of these messages could be for vertices on the same worker, e.g., messages <2, 3.0> and <3, 1.0> in Figure 3.1.

Figure 3.1: Giraph process-centric runtime. [The figure shows a master at superstep 3 coordinating two workers, each holding a partition of Vertex objects; workers exchange messages with one another and control signals with the master.]

3.2.3 Issues and Opportunities

Most process-centric Pregel-like systems have a minimum requirement for the aggregate RAM needed to run a given algorithm on a particular dataset, making them hard to configure for memory-intensive computations and multi-user workloads. In fact, Google's Pregel only supports in-memory computations, as stated in the original paper [75]. Hama [11] has limited support for out-of-core vertex storage using immutable sorted files, but it requires that the messages be memory-resident. The latest version of Giraph has preliminary out-of-core support; however, as we will see in Section 3.7, it does not yet work as expected. Moreover, in the Giraph user mailing list2 there are 26 cases (among 350 in total) of out-of-memory related issues from March 2013 to March 2014. The users who posted those questions were typically from academic institutes or small businesses that could not afford memory-rich clusters, but who still wanted to analyze Big Graphs. These issues essentially stem from Giraph's ad-hoc, custom-constructed implementation of disk-based graph processing. This leads to our first opportunity to improve on the current state-of-the-art.

2 http://mail-archives.apache.org/mod_mbox/giraph-user/

Opportunity (Out-of-core Support) Can we leverage mature database-style storage management and query evaluation techniques to provide better support for out-of-core workloads?

Another aspect of process-centric designs is that they only offer a single physical layer implementation. In those systems, the vertex storage strategy, the message combination algorithm, the message redistribution strategy, and the message delivery mechanism are each usually bound to one specific implementation. Therefore, we cannot choose between alternative implementation strategies that would offer a better fit to a particular dataset, algorithm, cluster or desktop. For instance, the single source shortest paths algorithm exhibits sparsity of messages, in which case a desired runtime strategy could avoid iterating over all vertices by using an extra index to keep track of live vertices. This leads to our second opportunity.

Opportunity (Physical Flexibility) Can we better leverage data, algorithmic, and cluster/hardware properties to optimize a specific Pregel program?

The third issue is that the implementation of a process-centric runtime for the Pregel model spans a full stack of network management, communication protocol, vertex storage, message delivery and combination, memory management, and fault-tolerance; the result is a complex (and hard-to-get-right) runtime system that implements an elegantly simple Pregel semantics. This leads to our third, software engineering opportunity.

Relation   Schema
Vertex     (vid, halt, value, edges)
Msg        (vid, payload)
GS         (halt, aggregate, superstep)

Table 3.1: Nested relational schema that models the Pregel state.

Opportunity (Software Simplicity) Can we leverage more from existing data-parallel platforms—platforms that have been improved for many years—to simplify the implementation of a Pregel-like system?

We will see how these opportunities are exploited by our proposed architecture and implementation in Section 3.5.8.

3.3 The Pregel Logical Plan

In this section, we model the semantics of Pregel as a logical query plan. This model will guide the detailed design of the Pregelix system (Section 3.5).

Our high level approach is to treat messages and vertices as data tuples and use a join operation to model the message passing between vertices, as depicted in Figure 3.2. Table 3.1 defines a set of nested relations that we use to model the state of a Pregel execution. The input data is modeled as an instance of the Vertex relation; each row identifies a single vertex with its halt, value, and edge states. All vertices with a halt = false state are active in the current superstep. The value and edges attributes represent the vertex state and neighbor list, which can each be of a user-defined type. The messages exchanged between vertices in a superstep are modeled by an instance of the Msg relation, which associates a destination vertex identifier with a message payload. Finally, the GS relation from Table 3.1 models

the global state of the Pregel program; here, when halt = true the program terminates3, aggregate is a global state value, and superstep tracks the current iteration count.

Figure 3.2: Implementing message-passing as a logical join. [The figure shows an example Msg relation (vid, msg) full-outer-joined on vid with an example Vertex relation (vid, halt, value, edges).]

Figure 3.2 models message passing as a join between the Msg and Vertex relations. A full-outer-join is used to match messages with vertices corresponding to the Pregel semantics as follows:

• The inner case matches incoming messages with existing destination vertices;

• The left-outer case captures messages sent to vertices that may not exist; in this case, a vertex with the given vid is created with other fields set to NULL.

• The right-outer case captures vertices that have no messages; in this case, compute still needs to be called for such a vertex if it is active.

3 This global halting state depends on the halting states of all vertices as well as the existence of messages.
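The join semantics just described can be illustrated with the hedged sketch below, which merges a vid-sorted message stream with a vid-sorted vertex stream (assuming at most one combined message per destination); the simplified Msg/Vertex classes and the JoinOutput callback are stand-ins for exposition, not Pregelix code:

import java.util.Iterator;

// Schematic merge-based full outer join of Msg and Vertex on vid, followed by the
// selection (V.halt = false || M.payload != NULL) that prunes inactive vertices.
class MessageVertexJoin {
    static class Msg { long vid; Object payload; }
    static class Vertex { long vid; boolean halt; Object value; Object edges; }

    interface JoinOutput { void accept(long vid, Msg m, Vertex v); }

    // Both inputs are assumed sorted by vid.
    static void join(Iterator<Msg> msgs, Iterator<Vertex> vertices, JoinOutput out) {
        Msg m = msgs.hasNext() ? msgs.next() : null;
        Vertex v = vertices.hasNext() ? vertices.next() : null;
        while (m != null || v != null) {
            if (v == null || (m != null && m.vid < v.vid)) {
                emit(out, m.vid, m, null);           // left-outer: message for a missing vertex
                m = msgs.hasNext() ? msgs.next() : null;
            } else if (m == null || v.vid < m.vid) {
                emit(out, v.vid, null, v);           // right-outer: vertex with no messages
                v = vertices.hasNext() ? vertices.next() : null;
            } else {
                emit(out, v.vid, m, v);              // inner: message matches an existing vertex
                m = msgs.hasNext() ? msgs.next() : null;
                v = vertices.hasNext() ? vertices.next() : null;
            }
        }
    }

    private static void emit(JoinOutput out, long vid, Msg m, Vertex v) {
        boolean active = (v != null && !v.halt) || (m != null && m.payload != null);
        if (active) {
            out.accept(vid, m, v);                   // this tuple is fed to the compute UDF
        }
    }
}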

UDF         Description
compute     Executed at each active vertex in every superstep.
combine     Aggregation function for messages.
aggregate   Aggregation function for the global state.
resolve     Used to resolve conflicts in graph mutations.

Table 3.2: UDFs used to capture a Pregel program.

The output of the full-outer-join will be sent to further operator processing steps that implement the Pregel semantics; some of these downstream operators will involve UDFs that capture the details (e.g., the compute implementation) of the given Pregel program.

Table 3.2 lists the UDFs that implement a given Pregel program. In a given superstep, each active vertex is processed through a call to the compute UDF, which is passed the messages

sent to the vertex in the previous superstep. The output of a call to compute is a tuple that contains the following fields:

• The possibly updated Vertex tuple.

• A list of outbound messages (delivered in the next superstep).

• The global halt state contribution, which is true when the outbound message list is empty and the halt field of the updated vertex is true, and false otherwise.

• The global aggregate state contribution (tuples nested in a bag).

• The graph mutations (a nested bag of tuples to insert/delete to/from the Vertex relation).
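For exposition only, this output can be pictured as a tuple-like object of the following shape; the class and field names below are our own rendering, not Pregelix's internal representation:

import java.util.List;

// Expository rendering of the output of one compute() call.
class ComputeOutput<V, M> {
    V updatedVertex;                       // the possibly updated Vertex tuple
    List<M> outboundMessages;              // delivered at the start of the next superstep
    boolean haltContribution;              // true only if no messages were sent and halt is true
    List<Object> aggregateContributions;   // routed to the global aggregate UDF
    List<V> graphMutations;                // insertions/deletions routed to the resolve UDF
}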

As we will see below, this output is routed to downstream operators that extract (project) one or more of these fields and execute the dataflow of a superstep. For instance, output messages are grouped by the destination vertex id and aggregated by the combine UDF. The global aggregate state contributions of all vertices are passed to the aggregate function, which produces the global aggregate state value for the subsequent superstep. Finally, the resolve UDF accepts all graph mutations—expressed as insertion/deletion tuples against the Vertex relation—as input, and it resolves any conflicts before they are applied to the Vertex relation.

Figure 3.3: The basic logical query plan of a Pregel superstep i, which reads the data generated from the last superstep (e.g., Vertex_i, Msg_i, and GS_i) and produces the data (e.g., Vertex_{i+1}, Msg_{i+1}, and GS_{i+1}) for superstep i + 1. Global aggregation and synchronization are in Figure 3.4, and vertex addition and removal are in Figure 3.5.

We now turn to the description of a single logical dataflow plan; we divide it into three figures that each focus on a specific application of the (shared) output of the compute function. The relevant dataflows are labeled in each figure. Figure 3.3 defines the input to the compute UDF, the output messages, and the updated vertices. Flow D1 describes the compute input for superstep i as being the output of a full-outer-join between Msg and Vertex (as described by Figure 3.2) followed by a selection predicate that prunes inactive vertices. The compute output pertaining to vertex data is projected onto dataflow D2, which then updates the Vertex relation. In dataflow D3, the message output is grouped by destination vertex id and aggregated by the combine function4, which produces flow D7 that is inserted into the Msg relation.

4 The default combine gathers all messages for a given destination into a list.

Figure 3.4: The plan segment that revises the global state. [Flows D4 and D5 carry the global halting state contributions and the values for aggregation, respectively; they feed a bool-and aggregate and the aggregate UDF, producing the global halt state (D8), the global aggregate value (D9), and the incremented superstep (D10) of GS_{i+1}.]

Figure 3.5: The plan segment for vertex addition/removal. [Flow D6 carries Vertex tuples for insertions and deletions from the compute UDF, grouped by vid through the resolve UDF, to Vertex_{i+1}.]

The global state relation GS contains a single tuple whose fields comprise the global state. Figure 3.4 describes the flows that revise these fields in each superstep. The halt state and

global aggregate fields depend on the output of compute, while the superstep counter is simply its previous value plus one. Flow D4 applies a boolean aggregate function (logical AND) to the global halting state contribution from each vertex; the output (flow D8) indicates the global halt state, which controls the execution of another superstep. Flow D5 routes the global aggregate state contributions from all active vertices to the aggregate UDF which then produces the global aggregate value (flow D9) for the next superstep.

56 Graph mutations are specified by a Vertex tuple with an operation that indicates insertion (adding a new vertex) or deletion (removing a vertex)5. Flow D6 in Figure 3.5 groups these mutation tuples by vertex id and applies the resolve function to each group. The output is then applied to the Vertex relation.

3.4 The Runtime Platform

The Pregel logical plan could be implemented on any parallel dataflow engine, including Stratosphere [38], Spark [108] or Hyracks [42]. As we argue below, we believe that Hyracks is particularly well-suited for this style of computation; this belief is supported by Section 3.7’s experimental results (where some of the systems studied are based on other platforms). The rest of this section covers the Hyracks platform [42], which is Pregelix’s target runtime for the logical plan in Section 3.3. Hyracks is a data-parallel runtime in the same general space as Hadoop [10] and Dryad [70]. Jobs are submitted to Hyracks in the form of DAGs (directed acyclic graphs) that are made up of operators and connectors. Operators are responsible for consuming input partitions and producing output partitions. Connectors perform redistributions of data between operators. For a submitted job, in a Hyracks cluster, a master machine dictates a set of worker machines to execute clones of the operator DAG in parallel and orchestrates data exchanges. Below, we enumerate the features and components of Hyracks that we leverage to implement the logical plan described in Section 3.3.

User-configurable task scheduling. The Hyracks engine allows users to express task scheduling constraints (e.g., count constraints, location choice constraints, or absolute lo- cation constraints) for each physical operator. The task scheduler of Hyracks is a con- straint solver that comes up with a schedule satisfying the user-defined constraints. In Section 3.5.3.4, we leverage this feature of Hyracks to implement sticky, iterative dataflows.

5 Pregelix leaves the application-specific vertex deletion semantics in terms of integrity constraints to application programmers.

Access methods. B-trees and LSM B-trees are part of the Hyracks storage library. A B-tree [50] is a commonly used index structure in most commercial databases; it supports efficient lookup and scan operations, but a single tree update can cause random I/Os. In contrast, the LSM B-tree [82] puts updates into an in-memory component (e.g., an in-memory B-tree); it merges the in-memory component with disk components in a periodic manner, which turns random I/Os for updates into sequential ones. The LSM B-tree thus allows fast updates but may result in slightly slower lookups.

Group-by operators. The Hyracks operator library includes three group-by operator implementations: sort-based group-by, which pushes group-by aggregations into both the in-memory sort phase and the merge phase of an external sort operator; HashSort group-by, which does the same thing as the sort-based one except using hash-based group-by for the in-memory processing phase; and preclustered group-by, which assumes incoming tuples are already clustered by the group-by key and hence just applies the aggregation operation in sequence to one group after the other.

Join operators. Merge-based index full outer join and probe-based index left-outer join are supported in the Hyracks operator library. The full outer join operator merges sorted input from the outer relation with an index scan on the inner relation; tuples containing

NULL values for missing fields will be generated for no-matches. The left-outer join operator, for each input tuple in the outer relation, consults an index on the inner relation for matches that produce join results or for no-matches that produce tuples with NULL values for the inner relation attributes.

Connectors. Hyracks connectors define inter-operator data exchange patterns. Here, we focus on the following three communication patterns: an m-to-n partitioning connector repartitions the data based on a user-defined partitioning function from m (sender-side) partitions to n (receiver-side) partitions; the m-to-n partitioning merging connector does the same thing but assumes tuples from the sender-side are ordered and therefore simply merges

58 the input streams at the receiver-side; the aggregator connector reduces all input streams to a single receiver partition.

Materialization policies. We use two materialization policies that Hyracks supports for customizing connectors: fully pipelined, where the data from a producer is immediately pushed to the consumer, and sender-side materializing pipelined, where the data transfer channel launches two threads at the sender side, one that writes output data to a local temporary file, and another that pulls written data from the file and sends it to the receiver- side.

3.5 The Pregelix System

In this section, we describe our implementation of the logical plan (Section 3.3) on the Hyracks runtime (Section 3.4), which is the core of the Pregelix system. We elaborate on data-parallel execution (Section 3.5.1), data storage (Section 3.5.2), physical query plan alternatives (Section 3.5.3), memory management (Section 3.5.4), fault-tolerance (Section 3.5.5), and job pipelining (Section 3.5.6). We conclude by summarizing the software components of Pregelix (Section 3.5.7) and revisiting our three opportunities (Section 3.5.8).

3.5.1 Parallelism

To parallelize the logical plan of the Pregel computation described in Section 3.3 at runtime, one or more clones of a physical plan—that implements the logical plan—are shipped to Hyracks worker machines that run in parallel. Each clone deals with a single data partition. During execution, data is exchanged from the clones of an upstream operator to those of a downstream operator through a Hyracks connector. Figure 3.6 shows an example, where the logical join described in Figure 3.2 is parallelized onto two workers and message tuples

are exchanged from producer partitions (operator clones) to consumer partitions (operator clones) using an m-to-n partitioning connector, where m and n are equal to two.

Figure 3.6: The parallelized join for the logical join in Figure 3.2. [The figure shows the join on vid executed on each of two workers over its local Msg and Vertex partitions, with output message tuples repartitioned by vid between the workers.]

3.5.2 Data Storage

Given a graph analytical job, Pregelix first loads the input graph dataset (the initial Vertex relation) from a distributed file system, i.e., HDFS, into a Hyracks cluster, partitioning it by vid using a user-defined partitioning function6 across the worker machines. After the eventual completion of the overall Pregel computation, the partitioned Vertex relation is scanned and dumped back to HDFS. During the supersteps, at each worker node, one (or more) local indexes—keyed off of the vid field—are used to store one (or more) partitions of the Vertex relation. Pregelix leverages both B-tree and LSM B-tree index structures from the Hyracks storage library to store partitions of Vertex on worker machines. The choice of which index structure to use is workload-dependent and user-selectable. A B-tree index performs well on jobs that frequently update vertex data in-place, e.g., PageRank. An LSM B-tree index performs well when the size of vertex data is changed drastically from superstep to superstep, or when the algorithm performs frequent graph mutations, e.g., the path merging algorithm in genome assemblers [109].

6 By default, we use hash partitioning.
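As a small illustration of the default hash partitioning mentioned in the footnote, the sketch below maps vids to partitions. The class and function are illustrative assumptions, not Pregelix's actual partitioner, but they convey why every vid-keyed relation must use the same function so that co-partitioned joins need no repartitioning.

/**
 * Illustrative sketch of hash-partitioning records by vid (not Pregelix's actual
 * default partitioner): Vertex, Msg, and Vid must all use the same function so
 * that tuples with the same vid always land on the same worker partition.
 */
public class VidPartitioner {
    private final int numPartitions;

    public VidPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Maps a vertex id to a partition in [0, numPartitions). */
    public int partitionOf(long vid) {
        // Mask out the sign bit so negative hash codes still map to a valid partition.
        return (Long.hashCode(vid) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        VidPartitioner p = new VidPartitioner(4);
        System.out.println(p.partitionOf(42L));  // the same vid always lands in the same partition
        System.out.println(p.partitionOf(-17L));
    }
}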

The Msg relation is initially empty; it is refreshed at the end of a superstep with the result of the message combine function call in the (logical) dataflow D7 of Figure 3.3; the physical plan is described in Section 3.5.3.1. The message data is partitioned by destination vertex id (vid) using the same partitioning function applied to the vertex data, and is thus stored (in temporary local files) on worker nodes that maintain the destination vertex data. Furthermore, each message partition is sorted by the vid field.

Lastly, we leverage HDFS to store the global state of a Pregelix job; an access method is used to read and cache the global state at worker nodes when it is referenced by user-defined functions like compute.

3.5.3 Physical Query Plans

In this subsection, we dive into the details of the physical plans for the logical plan described in Figures 3.3, 3.4, and 3.5. Our discussion will cover message combination and delivery, global states, graph mutations, and data redistribution.

3.5.3.1 Message Combination

[Figure 3.7 sketches the four parallel group-by strategies: Sort-Groupby-M-to-N-Partitioning, HashSort-Groupby-M-to-N-Partitioning, Sort-Groupby-M-to-N-Merge-Partitioning, and HashSort-Groupby-M-to-N-Merge-Partitioning. Each strategy pairs sender-side combine operators (sort-based or HashSort) with a receiver-side combine; the merge-partitioning variants use a preclustered receiver-side combine over an m-to-n partitioning merging connector, while the partitioning variants re-group (sort-based or HashSort) at the receiver side over an m-to-n partitioning connector.]

Figure 3.7: The four physical group-by strategies for the group-by operator which combines messages in Figure 3.3.

Figure 3.3 uses a logical group-by operator for message combination. For that, Pregelix leverages the three group-by operator implementations mentioned in Section 3.4. A preclustered group-by can only be applied to input data that is already clustered by the grouping key. A HashSort group-by operator offers better performance (over sort-based group-by) when the number of groups (the number of distinct message receivers in our case) is small; otherwise, these two group-by operators perform similarly. In a parallel execution, the grouping is done in two stages—each producer partitions its output (message) data by destination vid, and the output is redistributed (according to destination vid) to each consumer, which performs the final grouping step.
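The two-stage grouping just described can be sketched as follows; the arrays, the modulo partitioner, and the sum combine function are illustrative stand-ins rather than Hyracks operators.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the two-stage parallel group-by: each sender combines its own messages
 * per destination vid, then routes each partial aggregate to the receiver that owns
 * that vid, where a final combine produces one value per destination.
 */
public class TwoStageCombineSketch {
    public static void main(String[] args) {
        int numReceivers = 2;
        // Messages produced by two senders: (destination vid, payload).
        long[][] sender0 = {{1, 3}, {2, 5}, {1, 4}};
        long[][] sender1 = {{2, 7}, {3, 1}};

        // Stage 1: sender-side pre-aggregation, one partial map per sender.
        List<Map<Long, Long>> partials = new ArrayList<>();
        for (long[][] sender : new long[][][]{sender0, sender1}) {
            Map<Long, Long> partial = new HashMap<>();
            for (long[] msg : sender) {
                partial.merge(msg[0], msg[1], Long::sum);
            }
            partials.add(partial);
        }

        // Stage 2: route each partial aggregate to its receiver and combine again.
        List<Map<Long, Long>> receivers = new ArrayList<>();
        for (int i = 0; i < numReceivers; i++) {
            receivers.add(new HashMap<>());
        }
        for (Map<Long, Long> partial : partials) {
            for (Map.Entry<Long, Long> e : partial.entrySet()) {
                int target = (int) (e.getKey() % numReceivers); // stand-in partitioner
                receivers.get(target).merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        System.out.println(receivers); // receiver 0 owns vid 2; receiver 1 owns vids 1 and 3
    }
}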

Pregelix has four different parallel group-by strategies, as shown in Figure 3.7. The lower two strategies use an m-to-n partitioning merging connector and only need a simple one-pass preclustered group-by at the receiver-side; however, in this case, receiver-side merging needs to coordinate the input streams, which takes more time as the cluster size grows. The upper two strategies use an m-to-n partitioning connector, which does not require such coordination; however, these strategies do not deliver sorted data streams to the receiver-side group-bys, so re-grouping is needed at the receiver-side. A fully pipelined policy is used for the m-to-n partitioning connector in the upper two strategies, while in the lower two strategies, a sender-side materializing pipelined policy is used by the m-to-n partitioning merging connector to avoid possible deadlock scenarios mentioned in the query scheduling literature [64]. The choice of which group-by strategy to use depends on the dataset, graph algorithm, and cluster. We will further pursue this choice in our experiments (Section 3.7).

[Figure 3.8 sketches the two physical join strategies for forming the input to the compute UDF: on the left, an index full outer join of Msg (M) with the Vertex (V) index on vid, followed by the compute UDF call over tuples satisfying V.halt = false || M.payload != NULL; on the right, a merge (with choose()) of Msg (M) and the Vid (I) index followed by an index left outer join against Vertex (V) and the compute UDF call, with dataflows D11 and D12 feeding the Vid index for the next superstep.]

Figure 3.8: Two physical join strategies for forming the input to the compute UDF. On the left is an index full outer join approach. On the right is an index left outer join approach.

3.5.3.2 Message Delivery

Recall that in Figure 3.3, a logical full-outer-join is used to deliver messages to the right vertices and form the data needed to call the compute UDF. For that, we use index-based joins because (a) vertices are already indexed by vid, and (b) all four group-by strategies in Figure 3.7 flow the (combined) messages out of their receiver-side group-bys in vid-sorted order, thereby producing vid-sorted Msg partitions.

Pregelix offers two physical choices for index-based joins—an index full outer join approach and an index left outer join approach, as shown in Figure 3.8. The full outer join plan scans the entire vertex index to merge it with the (combined) messages. This join strategy is suitable for algorithms where most vertices are live (active) across supersteps (e.g., PageRank). The left outer join plan prunes unnecessary vertex scans by first searching the live vertex index for each (combined) incoming message, and it fits cases where messages are sparse and only a few vertices are live in every superstep (e.g., single source shortest paths). A user can control which join approach Pregelix uses for a given job. We now briefly explain the details of the two join approaches.

Index Full Outer Join. As shown on the left side of Figure 3.8, this plan is straightforwardly mapped from the logical plan. The join operator simply merges a partition of Msg and Vertex using a single pass.

Index Left Outer Join. As shown on the right side of Figure 3.8, this plan initially bulk loads another B-tree Vid with null messages (vid, NULL) that are generated by a function NullMsg. This index serves to represent the set of currently active vertices. The dataflows D11 and D12 in Figure 3.8 are (vid, halt) tuples and (vid, NULL) tuples respectively. Note that Vid is partitioned in the same way as Vertex. In the next superstep, a merge operator merges tuples from Msg and Vid based on the equivalence of the vid fields, and the choose function inside the operator selects tuples from Msg to output when there are duplicates. Output tuples of the merge operator are sent to an index left outer join operator that probes the Vertex index. Tuples produced by the left outer join operator are directly sent to the compute UDF. The original filter operator σ(V.halt=false||M.payload!=NULL) in the logical plan is logically transformed to the merge operator, where tuples in Vid satisfy halt=false and tuples in Msg satisfy M.payload!=NULL.
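The merge-with-choose step can be pictured as a two-pointer merge over the vid-sorted Msg and Vid streams. The sketch below is a simplified stand-in for the actual Pregelix merge operator, with String payloads in place of real message values.

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the merge operator feeding the index left outer join: Msg and Vid are
 * both sorted by vid, and when the same vid appears in both streams the choose()
 * rule keeps the Msg tuple (a real message beats a placeholder null message).
 */
public class MergeChooseSketch {
    record Tuple(long vid, String payload) {}

    static List<Tuple> merge(List<Tuple> msg, List<Tuple> vid) {
        List<Tuple> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < msg.size() || j < vid.size()) {
            if (j >= vid.size() || (i < msg.size() && msg.get(i).vid() < vid.get(j).vid())) {
                out.add(msg.get(i++));     // message for a vertex not in the active set
            } else if (i >= msg.size() || vid.get(j).vid() < msg.get(i).vid()) {
                out.add(vid.get(j++));     // active vertex with no incoming message
            } else {
                out.add(msg.get(i++));     // duplicate vid: choose() keeps the Msg tuple
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tuple> msg = List.of(new Tuple(2, "3.0"), new Tuple(5, "1.0"));
        List<Tuple> vid = List.of(new Tuple(2, null), new Tuple(4, null));
        System.out.println(merge(msg, vid)); // vids 2 (from Msg), 4 (from Vid), 5 (from Msg)
    }
}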

To minimize data movement among operators, in a physical plan, we push the filter operator, the UDF call of compute, the update to Vertex, and the extraction (project) of fields in the output tuple of compute into the upstream join operator as Hyracks “mini-operators.”

3.5.3.3 Global States and Graph Mutations

To form the global halt state and aggregate state—see the two global aggregations in Figure 3.4—we leverage a standard two-stage aggregation strategy. Each worker pre-aggregates these state values (stage one) and sends the result to a global aggregator that produces the final result and writes it to HDFS (stage two). The incrementing of the superstep counter is also done by a trivial dataflow.
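As a minimal sketch of the stage-two combine for the global halt state (simplified from the actual plan, and assuming the usual Pregel termination rule that all vertices have halted and no messages remain):

/**
 * Sketch of the global aggregator's decision: each worker contributes a partial
 * state saying whether all of its local vertices voted to halt and whether it
 * produced any messages; the globals are an AND and an OR of those partials.
 */
public class GlobalHaltSketch {
    record PartialState(boolean allLocalVerticesHalted, boolean sentMessages) {}

    static boolean anotherSuperstepNeeded(PartialState[] partials) {
        boolean allHalted = true;
        boolean anyMessages = false;
        for (PartialState p : partials) {
            allHalted &= p.allLocalVerticesHalted();
            anyMessages |= p.sentMessages();
        }
        // The computation ends only when every vertex has halted and no messages are in flight.
        return !allHalted || anyMessages;
    }

    public static void main(String[] args) {
        PartialState[] partials = {
            new PartialState(true, false),
            new PartialState(true, true)   // one worker still produced messages
        };
        System.out.println(anotherSuperstepNeeded(partials)); // true
    }
}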

The additions and removals of vertices in Figure 3.5 are applied to the Vertex relation by an index insert-delete operator. For the group-by operator in Figure 3.5, we only do a receiver-side group-by because the resolve function is not guaranteed to be distributive and the connector for D6 (in Figure 3.3) is an m-to-n partitioning connector in the physical plan.

3.5.3.4 Data Redistribution

In a physical query plan, data redistribution is achieved by either the m-to-n hash partitioning connector or the m-to-n hash partitioning merging connector (mentioned in Section 3.4). With the Hyracks-provided user-configurable scheduling, we let the location constraints of the join operator (in Figure 3.8) be the same as the places where partitions of Vertex are stored across all the supersteps. Also, the group-by operator (in Figure 3.7) has the same location constraints as the join operator, such that in all supersteps, Msg and Vertex are partitioned in the same (sticky) way and the join between them can be done without extra repartitioning. Therefore, the only necessary data redistributions in a superstep are (a) redistributing outgoing (combined) messages from sender partitions to the right vertex partitions, and (b) sending each vertex mutation to the right partition for addition or removal in the graph data.

3.5.4 Memory Management

Hyracks operators and access methods already provide support for out-of-core computations. The default Hyracks memory parameters work for all aggregated memory sizes as long as there is sufficient disk space on the worker machines. To support both in-memory and out-of-core workloads, B-trees and LSM B-trees both leverage a buffer cache that caches partition pages and gracefully spills to disk only when necessary using a standard replacement policy, i.e., LRU. In the case of an LSM B-tree, some number of buffer pages are pinned in memory to hold memory-resident B-tree components.
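The buffer-cache behavior can be illustrated with a tiny LRU sketch built on java.util.LinkedHashMap; this is not the Hyracks buffer cache, just a minimal model of access-order eviction.

import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch of an LRU page buffer cache: pages are kept in access order and the
 * least recently used page is evicted (i.e., "spilled") once the budget is exceeded.
 */
public class LruBufferCacheSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxPages;

    public LruBufferCacheSketch(int maxPages) {
        super(16, 0.75f, true); // access-order iteration gives LRU behavior
        this.maxPages = maxPages;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        boolean evict = size() > maxPages;
        if (evict) {
            // A real buffer cache would write the page back to disk here if it is dirty.
            System.out.println("evicting page " + eldest.getKey());
        }
        return evict;
    }

    public static void main(String[] args) {
        LruBufferCacheSketch<Integer, String> cache = new LruBufferCacheSketch<>(2);
        cache.put(1, "page-1");
        cache.put(2, "page-2");
        cache.get(1);           // touch page 1 so it becomes most recently used
        cache.put(3, "page-3"); // evicts page 2, the least recently used
        System.out.println(cache.keySet()); // [1, 3]
    }
}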

The majority of the physical memory on a worker machine is divided into four parts: the buffer cache for access methods of the Vertex relation; the buffers for the group-by operator clones; the buffers for network channels; and the file system cache for (a) temporary run files generated by group-by operator clones, (b) temporary files for materialized data redistributions, and (c) temporary files for the relation Msg. The first three memory components are explicitly controlled by Pregelix and can be tuned by a user, while the last component is (implicitly) managed by the underlying OS. Although the Hyracks runtime is written in Java, it uses the bloat-aware design that we proposed in Chapter 2 to avoid unnecessary memory bloat and to minimize the performance impact of garbage collection in the JVM.

3.5.5 Fault-Tolerance

Pregelix offers the same level of fault-tolerance as other Pregel-like systems [75, 9, 11] by checkpointing states to HDFS at user-selected superstep boundaries. In our case, the states to be checkpointed at the end of a superstep include Vertex and Msg (as well as Vid if the left outer join approach is used). The checkpointing of Msg ensures that a user program does not need to be aware of failures. Since GS stores its primary copy in HDFS, it need not be part of the checkpoint. A user job can determine whether or not to checkpoint after a superstep. Once a node failure or disk failure happens, the failed machine is added to a blacklist.

During recovery, Pregelix finds the latest checkpoint and reloads the states to a newly selected set of failure-free worker machines. Reloading states includes two steps. First, it kicks off physical query plans to scan, partition, sort, and bulk load the entire Vertex and Vid (if any) from the checkpoint into B-trees (or LSM B-trees), one per partition. Second, it executes another physical query plan to scan, partition, sort, and write the checkpointed Msg data to each partition as a local file.

3.5.6 Job Pipelining

Pregelix can accept an array of jobs and pipeline between compatible contiguous jobs without HDFS writes/reads or index bulk-loads. Two compatible jobs should have a producer-consumer relationship regarding the output/input data and share the same type of vertex—meaning, they interpret the corresponding bits in the same way. This feature was motivated by the genome assembler [109] application, which runs six different graph cleaning algorithms that are chained together for many rounds. A user can choose to enable this option to get improved performance with reduced fault-tolerance.

3.5.7 Pregelix Software Components

Pregelix supports the Pregel API introduced in Section 3.2.1 in Java, which is very similar to the APIs of Giraph [9] and Hama [11]. Internally, Pregelix has a statistics collector, failure manager, scheduler, and plan generator, which run on a client machine after a job is submitted; it also has a runtime context that stays on each worker machine. We describe each component below.

Statistics Collector. The statistics collector periodically collects statistics from the target Hyracks cluster, including system-wide counters such as CPU load, memory consumption, I/O rate, network usage of each worker machine, and the live machine set, as well as Pregel-specific statistics such as the vertex count, live vertex count, edge count, and message count of a submitted job.

Failure Manager. The failure manager analyzes failure traces and recovers from those failures that are indeed recoverable. It only tries to recover from interruption errors (e.g., a worker machine is powered off) and I/O related failures (e.g., disk I/O errors); it just forwards application exceptions to end users. Recovery is done as mentioned in Section 3.5.5.

Scheduler. Based on the information obtained by the statistics collector and the failure manager, the scheduler determines which worker machines to use to run a given job. The scheduler assigns as many partitions to a selected machine as the number of its cores. For each Pregel superstep, Pregelix sets location constraints for operators in the manner mentioned in Section 3.5.3.4. For loading Vertex from HDFS [10], the constraints of the data scanning operator (set by the scheduler) exploit data locality for efficiency.

Plan Generator. The plan generator generates physical query plans for data loading, result writing, each single Pregel superstep, checkpointing, and recovery. The generated plan includes a physical operator DAG and a set of location constraints for each operator.

Runtime Context. The runtime context stores the cached GS tuple and maintains the Pregelix-specific implementations of the Hyracks extensible hooks to customize buffer, file, and index management.

3.5.8 Discussion

Let us close this section by revisiting the issues and opportunities presented in Section 3.2.3 and evaluating their implications in Pregelix:

• Out-of-core Support. All the data processing operators as well as access methods we use have out-of-core support, which allows the physical query plans built on top of them to run disk-based workloads as well as multi-user workloads while retaining good in-memory processing performance.
• Physical Flexibility. The current physical choices spanning vertex storage (two options), message delivery (two alternatives), and message combination (four strategies) allow Pregelix to have sixteen (2 × 2 × 4) tailored executions for different kinds of datasets, graph algorithms, and clusters.
• Software Simplicity. The implementations of all the described functionalities in this section leverage existing operator, connector, and access method libraries provided by Hyracks. Pregelix does not involve modifications to the Hyracks runtime.

3.6 Pregelix Case Studies

In this section, we briefly enumerate several Pregelix use cases, including a built-in graph algorithm library, a study of graph connectivity problems, and research on parallel genome assembly.

The Pregelix Built-in Library. The Pregelix software distribution comes with a library that includes several graph algorithms such as PageRank, single source shortest paths, connected components, reachability query, triangle counting, maximal cliques, and random-walk-based graph sampling. Figure 3.9 shows the single source shortest paths implementation on Pregelix, where hints for the join, group-by, and connector choices are set in the main function.

public class ShortestPathsVertex extends Vertex {
    /** The source id key */
    public static final String SOURCE_ID = "pregelix.sssp.sourceId";
    /** The value to be sent to neighbors */
    private DoubleWritable outputValue = new DoubleWritable();
    /** The source vertex id */
    private long sourceId = -1;

    @Override
    public void configure(Configuration conf) {
        sourceId = conf.getLong(SOURCE_ID, 1);
    }

    @Override
    public void compute(Iterator msgIterator) {
        if (getSuperstep() == 1) {
            getVertexValue().set(Double.MAX_VALUE);
        }
        double minDist = getVertexId().get() == sourceId ? 0.0 : Double.MAX_VALUE;
        while (msgIterator.hasNext()) {
            minDist = Math.min(minDist, msgIterator.next().get());
        }
        if (minDist < getVertexValue().get()) {
            getVertexValue().set(minDist);
            for (Edge edge : getEdges()) {
                outputValue.set(minDist + edge.getEdgeValue().get());
                sendMsg(edge.getDestVertexId(), outputValue);
            }
        }
        voteToHalt();
    }

    public static void main(String[] args) throws Exception {
        PregelixJob job = new PregelixJob(ShortestPathsVertex.class.getSimpleName());
        job.setVertexClass(ShortestPathsVertex.class);
        job.setVertexInputFormatClass(SimpleTextInputFormat.class);
        job.setVertexOutputFormatClass(SimpleTextOutputFormat.class);
        job.setMessageCombinerClass(DoubleMinCombiner.class);

        /** Hints for the Pregelix plan generator */
        job.setMessageVertexJoin(Join.LEFTOUTER);
        job.setMessageGroupBy(GroupBy.HASHSORT);
        job.setMessageGroupByConnector(Connector.UNMERGE);

        Client.run(args, job);
    }
}

Figure 3.9: The implementation of the single source shortest paths algorithm on Pregelix.

Inside compute, the method calls that set a vertex value and send a message internally generate output tuples for the corresponding dataflows.
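For instance, the DoubleMinCombiner named in the job setup reduces all messages bound for the same vertex to their minimum. The sketch below shows that reduction in isolation; the class and method names are illustrative, not the actual Pregelix combiner interface.

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch of min-combining: all messages addressed to the same
 * destination vertex collapse into a single minimum value, which is exactly the
 * reduction that a min-combiner performs for single source shortest paths.
 */
public class MinCombineSketch {
    // Partial aggregates: destination vid -> smallest distance seen so far.
    private final Map<Long, Double> partial = new HashMap<>();

    /** Called once per outgoing message on the sender side. */
    public void combine(long destVid, double distance) {
        partial.merge(destVid, distance, Math::min);
    }

    /** The combined messages that actually need to cross the network. */
    public Map<Long, Double> combinedMessages() {
        return partial;
    }

    public static void main(String[] args) {
        MinCombineSketch c = new MinCombineSketch();
        c.combine(7L, 4.0);
        c.combine(7L, 2.5);   // collapses with the previous message to vertex 7
        c.combine(9L, 1.0);
        System.out.println(c.combinedMessages()); // {7=2.5, 9=1.0}
    }
}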

Graph Connectivity Problems. Using Pregelix, a graph analytics research group in Hong Kong has implemented several graph algorithm building blocks such as BFS (breadth-first search) spanning tree, Euler tour, list ranking, and pre/post-ordering. These building blocks have been used to develop advanced graph algorithms such as bi-connected components for undirected graphs (e.g., road networks) and strongly connected components for directed graphs (e.g., the Twitter follower network) [106]. The group also scale-tested all of their algorithms on a 60-machine cluster with 480 cores and 240 disks, using Pregelix as the infrastructure.

Genome Assembly. Genomix [8] is a data-parallel genome assembler built on top of Pregelix. It first constructs a (very large) De Bruijn graph [109] from billions of genome reads, and then (a) cleans the graph with a set of pre-defined subgraph patterns (described in [109]) and (b) merges available single paths into vertices iteratively until all vertices can be merged to a single (gigantic) genome sequence. Pregelix’s support for the addition and removal of vertices is heavily used in this use case.

3.7 Experiments

This section compares Pregelix with several other popular parallel graph processing systems, including Giraph [9], Hama [11], GraphLab [73], and GraphX [97]. Our comparisons cover execution time (Section 3.7.2), scalability (Section 3.7.3), throughput (Section 3.7.4), plan flexibility (Section 3.7.5), and software simplicity (Section 3.7.6). We conclude this section by summarizing our experimental results (Section 3.7.7).

71 3.7.1 Experimental Setup

We ran the experiments detailed here on a 32-node Linux IBM x3650 cluster with one additional master machine of the same configuration. Nodes are connected with a Gigabit Ethernet switch. Each node has one Intel Xeon processor E5520 2.26GHz with four cores, 8GB of RAM, and two 1TB, 7.2K RPM hard disks.

In our experiments, we leverage two real-world graph-based datasets. The first is the Webmap dataset [105], taken from a crawl of the web in the year 2002. The second is the BTC dataset [49], which is an undirected semantic graph converted from the original Billion Triple Challenge 2009 RDF dataset [1]. Table 3.3 (Webmap) and Table 3.4 (BTC) show statistics for these graph datasets, including the full datasets as well as several down-samples and scale-ups7 that we use in our experiments.

Our evaluation examines the platforms' performance characteristics for three algorithms: PageRank [83], SSSP (single source shortest paths) [56], and CC (connected components) [56]. On Pregelix and Giraph, the graph algorithms were coded in Java and all of their source code can be found in the Pregelix codebase8. The implementations of the three algorithms on Hama, GraphLab, and GraphX are directly borrowed from their built-in examples. The Pregelix default plan, which uses index full outer join, sort-based group-by, an m-to-n hash partitioning connector, and B-tree vertex storage, is used in Sections 3.7.2, 3.7.3, and 3.7.4.

Pregelix's default maximum buffer cache size for access methods is set to 1/4 of the physical RAM size, and its default maximum allowed buffer for each group-by operator instance is set to 64MB. These two default Pregelix memory settings are used in all the experiments. For the local file system for Pregelix, we use the ext3 file system; for the distributed file system, we use HDFS version 1.0.4. In all experiments, we use the Giraph trunk version (git revision 770 in May 2014), Hama version 0.6.4, GraphLab version 2.2 (PowerGraph), and Spark [108] version 0.9.1 for GraphX. Our GraphLab setting has been confirmed by its primary author. We tried our best to let each system use all the CPU cores and all available RAM on each worker machine.

7 We used a random walk graph sampler built on top of Pregelix to create scaled-down Webmap sample graphs of different sizes. To scale up the BTC data size, we deeply copied the original graph data and renumbered the duplicate vertices with a new set of identifiers.
8 https://code.google.com/p/hyracks/source/browse/pregelix

Name     Size     #Vertices      #Edges         Avg. Degree
Large    71.82GB  1,413,511,390  8,050,112,169  5.69
Medium   31.78GB  709,673,622    2,947,603,924  4.15
Small    14.05GB  143,060,913    1,470,129,872  10.27
X-Small  9.99GB   75,605,388     1,082,093,483  14.31
Tiny     2.93GB   25,370,077     318,823,779    12.02

Table 3.3: The Webmap dataset (Large) and its samples.

Name     Size     #Vertices    #Edges         Avg. Degree
Large    66.48GB  690,621,916  6,177,086,016  8.94
Medium   49.86GB  517,966,437  4,632,814,512  8.94
Small    33.24GB  345,310,958  3,088,543,008  8.94
X-Small  16.62GB  172,655,479  1,544,271,504  8.94
Tiny     7.04GB   107,706,280  607,509,766    5.64

Table 3.4: The BTC dataset (X-Small) and its samples/scale-ups.

3.7.2 Execution Time

In this experiment, we evaluate the execution times of all systems by running each of the three algorithms on the 32-machine cluster. As the input data for PageRank we use the Webmap dataset because PageRank is designed for ranking web pages, and for the SSSP and CC algorithms we use the BTC dataset. Since a Giraph user needs to explicitly specify a priori whether a job is in-memory or out-of-core, we measure both of these settings (labeled Giraph-mem and Giraph-ooc, respectively) for Giraph jobs regardless of the RAM size.

Figure 3.10 plots the resulting overall Pregel job execution times for the different sized datasets, and Figure 3.11 shows the average per-iteration execution time for all iterations. In both figures, the x-axis is the input dataset size relative to the cluster's aggregated RAM size, and the y-axis is the execution time. (Note that the volume of exchanged messages can exhaust memory even if the initial input graph dataset can fit into memory.)

Figure 3.10 and Figure 3.11 show that while Pregelix scales to out-of-core workloads, Giraph fails to run the three algorithms once the relative dataset size exceeds 0.15, even with its out-of-core setting enabled. Figures 3.10(a)(c) and 3.11(a)(c) show that when the computation has sufficient memory, Pregelix offers comparable execution time to Giraph for message-intensive workloads such as PageRank and CC. For PageRank, Pregelix runs up to 2× slower than Giraph on relatively small datasets, e.g., Webmap-X-Small. For CC, Pregelix runs up to 1.7× faster than Giraph for relatively small datasets like BTC-Small. Figure 3.10(b) and Figure 3.11(b) further demonstrate that the Pregelix default plan offers 3.5× overall speedup and 7× per-iteration speedup over Giraph for message-sparse workloads like SSSP even for relatively small datasets. All sub-figures in Figure 3.10 and Figure 3.11 show that for in-memory workloads (when the relative dataset size is less than 0.15), Giraph has steeper (worse) size-scaling curves than Pregelix.

Compared to Giraph, GraphLab, GraphX, and Hama start failing on even smaller datasets, with even steeper size-scaling curves. GraphLab has the best average per-iteration execution time on small datasets (e.g., up to 5× faster than Pregelix and up to 12× faster than Giraph, on BTC-Tiny), but performs worse than Giraph and Pregelix on larger datasets (e.g., up to 24× slower than Pregelix and up to 6× slower than Giraph, on BTC-X-Small). GraphX cannot successfully load the BTC-Tiny dataset on the 32-machine cluster; therefore, its results for SSSP and CC are missing.

[Figure 3.10 contains three panels—(a) PageRank for Webmap datasets, (b) SSSP for BTC datasets, and (c) CC for BTC datasets—plotting Execution Time (Secs) against Dataset Size / Aggregated RAM for Pregelix, Giraph-mem, Giraph-ooc, GraphLab, GraphX, and Hama.]

Figure 3.10: Overall execution time (32-machine cluster). Neither Giraph-mem nor Giraph-ooc can work properly when the ratio of dataset size to the aggregated RAM size exceeds 0.15; GraphLab starts failing when the ratio of dataset size to the aggregated RAM size exceeds 0.07; Hama fails on even smaller datasets; GraphX fails to load the smallest BTC dataset sample BTC-Tiny.

3.7.3 System Scalability

Our system scalability experiments run the different systems on varying sized clusters for each of the different dataset sizes. Figure 3.12(a) plots the parallel speedup for PageRank on Pregelix going from 8 machines to 32 machines. The x-axis is the number of machines, and the y-axis is the average per-iteration execution time relative to the time on 8 machines. As the number of machines increases, the message combiner for PageRank becomes less effective and hence the total volume of data transferred through the network gets larger, though the CPU load of each individual machine drops. Therefore, in Figure 3.12(a), the parallel speedups are close to but slightly worse than the “ideal” case in which there are no message overheads among machines. For the other systems, the PageRank implementations for GraphLab and GraphX fail to run Webmap samples beyond the 9.99GB case when the number of machines is 16; Giraph has the same issue when the number of machines is 8. Thus, we were only able to compare the parallel PageRank speedups of Giraph, GraphLab, GraphX, and Pregelix with the Webmap-X-Small dataset. The results of this small data case are in Figure 3.12(b); Hama is not included because it cannot run even the Webmap-X-Small dataset on our cluster configurations. The parallel speedup of Pregelix is very close to the ideal line, while Giraph, GraphLab, and GraphX exhibit even better speedups than the ideal. The apparent super-linear parallel speedups of Giraph, GraphLab, and GraphX are consistent with the fact that they all perform super-linearly worse when the volume of data assigned to a slave machine increases, as can be seen in Figures 3.10 and 3.11.

[Figure 3.11 contains three panels—(a) PageRank for Webmap datasets, (b) SSSP for BTC datasets, and (c) CC for BTC datasets—plotting Avg Iteration Time (Secs) against Dataset Size / Aggregated RAM for Pregelix, Giraph-mem, Giraph-ooc, GraphLab, GraphX, and Hama.]

Figure 3.11: Average iteration execution time (32-machine cluster).

[Figure 3.12 contains three panels—(a) Pregelix speedup for PageRank on the X-Small, Small, Medium, and Large Webmap datasets; (b) speedup for PageRank on Webmap-X-Small for Pregelix, Giraph-mem, GraphLab, and GraphX; and (c) Pregelix scale-up for PageRank, SSSP, and CC—plotting Relative Avg Iteration Time against the number of machines (panels a and b) or the scale (panel c), with an Ideal reference line.]

Figure 3.12: Scalability (run on 8-machine, 16-machine, 24-machine, 32-machine clusters).

Figure 3.12(c) shows the parallel scale-up of the three algorithms for Pregelix. Giraph, GraphLab, GraphX, and Hama results are not shown because they could not run all these cases. In this figure, the x-axis is the ratio of the sampled (or scaled) dataset size over the largest (Webmap-Large or BTC-Large) dataset size. The number of machines is proportional to this ratio and 32 machines are used for scale 1.0. The y-axis is the average per-iteration execution time relative to the time at the smallest scale. In the ideal case, the y-axis value would stay at 1.0 for all the scales, while in reality, the three Pregel graph algorithms all have network communication and thus cannot achieve the ideal. The SSSP algorithm sends fewer messages than the other two algorithms, so it is the closest to the ideal case.

3.7.4 Throughput

In the current version of GraphLab, Hama, and Pregelix, each submitted job is executed immediately, regardless of the current activity level on the cluster. Giraph leverages Hadoop's admission control; for the purpose of testing concurrent workloads, we let the number of Hadoop map task slots in each task tracker be 3. GraphX leverages the admission control of Spark, such that jobs will be executed sequentially if available resources cannot meet the overall requirements of concurrent jobs. In this experiment, we compare the job throughput of all the systems by submitting jobs concurrently. We ran PageRank jobs on the 32-machine cluster using four different samples of the Webmap dataset (X-Small, Small, Medium, and Large) with various levels of job concurrency. Figure 3.13 reports how the number of completed jobs per hour (jph) changes with the number of concurrent jobs. The results for the four Webmap samples represent four different cases respectively:

• Figure 3.13(a) uses Webmap-X-Small. Moving from serial job execution to concurrent job execution, data processing remains in-memory but the CPU resources go from dedicated to shared. In this case, Pregelix achieves a higher jph when there are two or three concurrent jobs than when job execution is serial.
• Figure 3.13(b) uses Webmap-Small. In this case, serial job execution does in-memory processing, but concurrent job execution introduces a small amount of disk I/O due to spilling. When two jobs run concurrently, each job incurs about 1GB of I/O; when three jobs run concurrently, each does about 2.7GB of I/O. In this situation, still, the Pregelix jph is higher in concurrent execution than in serial execution.
• Figure 3.13(c) uses Webmap-Medium. In this case, serial job execution allows for in-memory processing, but allowing concurrent execution exhausts memory and causes a significant amount of I/O for each individual job. For example, when two jobs run concurrently, each job incurs about 10GB of I/O; when three jobs run concurrently, each job does about 27GB of I/O. In this case, jph drops significantly at the boundary where I/O significantly comes into the picture. The point with two concurrent jobs is such a boundary for Pregelix in Figure 3.13(c).
• Figure 3.13(d) uses the full Webmap (Webmap-Large). In this case, processing is always disk-based regardless of the concurrency level. For this case, Pregelix jph once again increases with the increased level of concurrency; this is because the CPU utilization is increased (by about 20% to 30%) with added concurrency.

These results suggest that, as future work, it would be worthwhile to develop intelligent job admission control policies to ensure that Pregelix always runs at the highest possible throughput. In our experiments, the Spark scheduler for GraphX always ran concurrent jobs sequentially due to the lack of memory and CPU resources. Giraph, GraphLab, and Hama all failed to support concurrent jobs in our experiments for all four cases due to limitations regarding memory management and out-of-core support; they each need additional work to operate in this region.

3.7.5 Plan Flexibility

In our final experiment, we compare several different physical plan choices in Pregelix to demonstrate their usefulness. We ran the two join plans (described in Section 3.5.3.2) for the three Pregel graph algorithms. Figure 3.14 shows the results. For message-sparse algorithms like SSSP (Figure 3.14(a)), the left outer join Pregelix plan is much faster than the (default) full outer join plan. However, for message-intensive algorithms like PageRank (Figure 3.14(b)), the full outer join plan is the winner. This is because although the probe-based left outer join can avoid a sequential index scan, it needs to search the index from the root node every time; this is not worthwhile if most data in the leaf nodes will be qualified as join results. The CC algorithm's execution starts with many messages, but the message volume decreases significantly in its last few supersteps, and hence the two join plans result in similar performance (Figure 3.14(c)). Figure 3.15 revisits the relative performance of the systems by comparing Pregelix's left outer join plan performance against the other systems. As shown in the figure, SSSP on Pregelix can be up to 15× faster than on Giraph and up to 35× faster than on GraphLab for the average per-iteration execution time when Pregelix is run with its left outer join plan.

[Figure 3.13 contains four panels—(a) Webmap-X-Small (always in-memory), (b) Webmap-Small (in-memory to minor disk usage), (c) Webmap-Medium (in-memory to disk-based), and (d) Webmap-Large (always disk-based)—plotting Jobs Per Hour (jph) against the number of concurrent jobs for Pregelix, Giraph-mem, GraphLab, GraphX, and Hama.]

Figure 3.13: Throughput (multiple PageRank jobs are executed in the 32-machine cluster with different sized datasets).

In addition to the experiments presented here, an earlier technical report [44] measured the performance difference introduced by the two different Pregelix data redistribution policies (as described in Section 3.5.3.1) for combining messages on a 146-machine cluster in Yahoo! Research. Figure 9 in the report [44] shows that the m-to-n hash partitioning merging connector can lead to slightly faster executions on small clusters, but merging input streams at the receiver side needs to selectively wait for data to arrive from specific senders as dictated by the priority queue, and hence it becomes slower on larger clusters. The tradeoffs seen here and in [44] for different physical choices are evidence that an optimizer is ultimately essential to identify the best physical plan to use in order to efficiently execute Pregel programs.

3.7.6 Software Simplicity

To highlight the benefit of building a Pregel implementation on top of an existing dataflow runtime, as opposed to writing one from scratch, we can count lines of code for both the Pregelix and Giraph systems’ core modules (excluding their test code and comments). The Giraph-core module, which implements the Giraph infrastructure, contains 32,197 lines of code. Its counterpart in Pregelix contains just 8,514 lines of code.

3.7.7 Summary

[Figure 3.14 contains three panels—(a) SSSP for BTC datasets, (b) PageRank for Webmap datasets, and (c) CC for BTC datasets—plotting Avg Iteration Time (Secs) against Dataset Size / Aggregated RAM for the left outer join and full outer join plans.]

Figure 3.14: Index full outer join vs. index left outer join (run on an 8-machine cluster) for Pregelix.

Our experimental results show that Pregelix can perform comparably to Giraph for memory-resident message-intensive workloads (like PageRank), can outperform Giraph by over an order of magnitude for memory-resident message-sparse workloads (like single source shortest paths), can scale to larger datasets, can sustain multi-user workloads, and also can outperform GraphLab, GraphX, and Hama by over an order of magnitude on large datasets for various workloads. In addition to the numbers reported in this chapter, an early technical report [44] gave speedup and scale-up results for the first alpha release of Pregelix on a 146-machine Yahoo! Research cluster in March 2012; that study first showed us how well the Pregelix architectural approach can scale9.

9Unfortunately, our Yahoo! collaborators have since left the company, so we no longer have access to a cluster of that scale.

[Figure 3.15 contains two panels—(a) 24-Machine Cluster and (b) 32-Machine Cluster—plotting Avg Iteration Time (Secs) against Dataset Size / Aggregated RAM for Pregelix-LOJ, Giraph-mem, GraphLab, and Hama.]

Figure 3.15: Pregelix left outer join plan vs. other systems (SSSP on BTC datasets). GraphX fails to load the smallest dataset.

3.8 Related Work

The Pregelix system is related to or built upon previous work in three areas.

Parallel data management systems such as Gamma [53], Teradata [21], and GRACE [59] applied partitioned-parallel processing to SQL query processing over two decades ago. The introduction of Google’s MapReduce system [52], based on similar principles, led to the recent flurry of work in MapReduce-based data-intensive computing. Systems like Dryad [70], Hyracks [42], and Nephele [38] have successfully made the case for supporting a richer set of data operators beyond map and reduce as well as a richer set of data communication patterns. REX [76] integrated user-defined delta functions into SQL to support arbitrary recursions and built stratified parallel evaluations for recursions. The Stratosphere project also proposed an incremental iteration abstraction [57] using working set management and integrated it with parallel dataflows and job placement. The lessons and experiences from all of these systems provided a solid foundation for the Pregelix system.

Big Graph processing platforms such as Pregel [75], Giraph [9], and Hama [11] have been built to provide vertex-oriented message-passing-based programming abstractions for distributed graph algorithms to run on shared-nothing clusters. Sedge [107] proposed an efficient advanced partitioning scheme to minimize inter-machine communications for Pregel computations. Surfer [48] is a Pregel-like prototype using advanced bandwidth-aware graph partitioning to minimize the network traffic in processing graphs. In seeming contradiction to these favorable results on the efficacy of smart partitioning, the study in [69] found that basic hash partitioning works better because of the resulting balanced load and the low partitioning overhead. We seem to be seeing the same with respect to GraphLab, i.e., the pain is not being repaid in performance gain. GPS [86] optimizes Pregel computations by dynamically repartitioning vertices based on message patterns and by splitting high-degree vertices across all worker machines. Giraph++ [93] enhanced Giraph with a richer set of APIs for user-defined partitioning functions so that communication within a single partition can be bypassed. GraphX [97] provides a programming abstraction called Resilient Distributed Graphs (RDGs) to simplify graph loading, construction, transformation, and computations, on top of which Pregel can be easily implemented. Different from Pregel, GraphLab [73] provides a vertex-update-based programming abstraction and supports an asynchronous model to increase the level of pipelined parallelism. Trinity [90] is a distributed graph processing engine built on top of a distributed in-memory key-value store to support both online and offline graph processing; it optimizes message-passing in vertex-centric computations for the case where a vertex sends messages to a fixed set of vertices. Our work on Pregelix is largely orthogonal to these systems and their contributions because it looks at Pregel at a lower architectural level, aiming at better out-of-core support, plan flexibility, and software simplicity.

Iterative extensions to MapReduce like HaLoop [45] and PrIter [110] were the first to extend MapReduce with looping constructs. HaLoop hardcodes a sticky scheduling policy (similar to the one adopted here in Pregelix and to the one in Stratosphere [57]) into the Hadoop task scheduler so as to introduce a caching ability for iterative analytics. PrIter uses a key-value storage layer to manage its intermediate MapReduce state, and it also exposes user-defined policies that can prioritize certain data to promote fast algorithmic convergence. However, those extensions still constrain computations to the MapReduce model, while Pregelix explores more flexible scheduling mechanisms, storage options, operators, and several forms of data redistribution (allowed by Hyracks) to optimize a given Pregel algorithm's computation time.

3.9 Summary

This chapter has presented the design, implementation, early use cases, and evaluation of Pregelix, a new dataflow-based Pregel-like system built on top of the Hyracks parallel dataflow engine. Pregelix combines the Pregel API from the systems world with data-parallel query evaluation techniques from the database world in support of Big Graph Analytics. This combination leads to effective and transparent out-of-core support, scalability, and throughput, as well as increased software simplicity and physical flexibility. To the best of our knowledge, Pregelix is the only open source Pregel-like system that scales to out-of-core workloads efficiently, can sustain multi-user workloads, and allows runtime flexibility. This sort of architecture and methodology could be adopted by parallel data warehouse vendors (such as Teradata [21], Pivotal [18], or Vertica [26]) to build Big Graph processing infrastructures on top of their existing query execution engines. Last but not least, we have made several stable releases of the Pregelix system (http://pregelix.ics.uci.edu) in open source form since the year 2012 for use by the Big Data research community, and we invite others to download and try the system.

This chapter has primarily focused on processing large amounts of explicit graph data, but in the real world, graph topologies are often hidden in large volumes of tabular data that seem unrelated to graphs. In the next chapter, the Pregelix system we have described in this chapter will be used as a building block for adding more complex graph analytical pipeline support in AsterixDB.

Chapter 4

Rich(er) Graph Analytics in AsterixDB

4.1 Overview

Chapter 3 has proposed and evaluated a dataflow approach for building a scalable Big Graph analytics platform. However, Big Graph analytics are not only about graphs. Many real-world applications do not have a pre-loaded graph dataset for analysis; instead, graphs are often constructed by querying the data that are dynamically ingested into an enterprise data lake [4], i.e., a storage repository (e.g., HDFS [10]) that holds a vast amount of raw data. Figure 4.1 shows a dataflow for rich(er) forms of graph analytics, i.e., end-to-end analytical pipelines spanning from the raw data to the final mined insights. As the figure indicates, various data from web, mobile, and IoT (internet of things) applications are continuously ingested into a data lake. In the meantime, data scientists run SQL [20] queries using typical SQL-on-Hadoop systems (e.g., Hive [92], Impala [71], or Tez [36]) to extract graphs (e.g., in the adjacency list representation) of interest from the raw data, put the intermediate graph data onto HDFS, run distributed graph algorithms using typical graph analytical systems (e.g., Pregel [75], Giraph [9], GraphLab [73], or Pregelix [19]), and finally run another set of SQL queries over the graph computation results to generate user-readable reports [13]. In this process, a data scientist needs to figure out the physical locations of the intermediate results, manage their life-cycles, determine their formats, and write customized client scripts to glue SQL-on-Hadoop systems and graph analytical systems together through HDFS. All these tedious ETL (extract, transform, load) tasks draw the data scientist's energy away from thinking about the analytical task at a logical level.

[Figure 4.1 depicts the flow: raw data lands in an HDFS-based data lake, a SQL step extracts a graph, a Pregel step analyzes it, and a final SQL step summarizes the results.]

Figure 4.1: The current Big Graph analytics flow.

Let us use a motivating example to illustrate the problem. The example analytical task we use is to find the top 50 most influential people on Twitter, using their tweets, with Hive [92] and Giraph [9]. We first create an external table in Hive for the tweet dataset as follows:

CREATE EXTERNAL TABLE TwitterMessage
(
    tweetid: int64,
    user: string,
    sender_location: point,
    send_time: datetime,
    reply_from: int64,
    retweet_from: int64,
    referred_topics: array,
    message_text: string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '%'
LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/tw_message';

In this analysis, we treat each reply or retweet as a graph vertex in which either the reply_from field or the retweet_from field is extracted as an outgoing edge. If a given tweet message gets replied to more or retweeted more, we will say that the message has a larger impact. Thus, we create an intermediate table for such a graph and run a SQL query to populate it, as follows.

CREATE EXTERNAL TABLE MsgGraph
(vertexid int64, rank double, dests array)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '%' LINES TERMINATED BY '\n'
STORED AS TEXTFILE LOCATION '/msggraph';

INSERT OVERWRITE TABLE MsgGraph
SELECT
    T.tweetid,
    1.0/10000000000.0, /* Assign a very small initial rank to every tweet. */
    CASE
        /* Each selected tweet can only be one of the cases below. */
        WHEN T.reply_from >= 0 THEN array(T.reply_from)
        ELSE array(T.retweet_from)
    END
FROM TwitterMessage AS T
WHERE T.reply_from >= 0 OR T.retweet_from >= 0 /* Only replies and retweets are retained. */

Note that the insert statement only adds replies and retweets into table MsgGraph without their original tweets — this is because a Pregel runtime (described in Section 3.3) will add those original tweets (vertices) if there are messages sent to them in a Pregel computation superstep. Before running the above query, we should make sure that the HDFS directory /msggraph is not used for other purposes. After loading the graph, we need to implement the InputFormat/OutputFormat interfaces to let Giraph read the CSV (comma separated values) files in directory /msggraph and write result files in the form of CSV text as well.
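The parsing work such an InputFormat must perform can be sketched as follows; the class below is a plain illustrative parser (not a Giraph reader implementation) for the comma- and '%'-delimited lines produced by the Hive query above.

import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the per-line parsing a custom Giraph VertexInputFormat would do for the
 * /msggraph files: each line is "vertexid,rank,dest1%dest2%..." with ',' as the
 * field delimiter and '%' as the collection-item delimiter, matching the Hive DDL.
 */
public class MsgGraphLineParser {
    record GraphVertex(long vertexId, double rank, List<Long> dests) {}

    static GraphVertex parse(String line) {
        String[] fields = line.split(",", -1);
        long vertexId = Long.parseLong(fields[0]);
        double rank = Double.parseDouble(fields[1]);
        List<Long> dests = new ArrayList<>();
        if (fields.length > 2 && !fields[2].isEmpty()) {
            for (String d : fields[2].split("%")) {
                dests.add(Long.parseLong(d));
            }
        }
        return new GraphVertex(vertexId, rank, dests);
    }

    public static void main(String[] args) {
        System.out.println(parse("101,1.0E-10,55%77"));
        // GraphVertex[vertexId=101, rank=1.0E-10, dests=[55, 77]]
    }
}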

Given the InputFormat/OutputFormat classes as well as the input path (i.e., /msggraph) and output path (let us assume that the output path is /results) as the arguments, we next run a Giraph job to do PageRank1. From there, we run yet another SQL query to get the top influencers by considering the aggregated message rank values of each user:

CREATE EXTERNAL TABLE Result
(vertexid int64, rank double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '%'
LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/results';

SELECT T.user, SUM(R.rank) AS influence
FROM Result AS R JOIN TwitterMessage AS T
ON R.vertexid = T.tweetid
GROUP BY T.user
ORDER BY influence DESC
LIMIT 50

Finally, we should decide to either erase the intermediate temporary directories such as /msggraph and /results or keep them for a while for other uses.

As we can see from this example, in order to do such a SQL-Pregel combined analysis, a data scientist has to take ownership of the following physical details that are logically irrelevant to the analytical task:

1In all the queries in this chapter, we set the “damping factor” of the PageRank algorithm [83] to be 1.0 to obtain the relative rank value of each vertex.

• the safety and lifecycle of intermediate files;
• the DDLs and parsers that understand the physical formats of the intermediate data;
• the script that glues the clients of different systems together.

To address these issues, in this chapter, we add Big Graph analytics support into AsterixDB, a Big Data management system for ingesting and querying semi-structured data [29]. We improve the user experience in the example by integrating a Pregelix job entrance into a declarative query language, i.e., AQL (AsterixDB Query Language), and by exposing an AsterixDB input/output converter API in Pregelix that separates out the physical details for the intermediate data. The resulting combination not only allows a data scientist to stay at the logical level for solving analytical problems, but it also enables interactive graph analytics by leveraging the secondary indexing capability of AsterixDB [30]. The contributions of this chapter include:

• We describe and add the support for temporary datasets, which are logging- and locking-free, into AsterixDB (Section 4.2).
• We design and build an external connector framework for Pregelix and implement an AsterixDB connector on top of the framework (Section 4.3).
• We add a Pregel job entrance capability in AQL (Section 4.4).
• We experimentally evaluate the efficiency of temporary datasets as well as the scaling properties of rich(er) graph analytical queries in AsterixDB (Section 4.5).

It is worth noting that the resulting framework, while applied here to graph analytics, is more general. Other analytics platforms can (and will) also be loosely coupled with AsterixDB in this manner.

4.2 Temporary Datasets In AsterixDB

This section describes the motivation, design, and implementation of temporary datasets in AsterixDB. We first revisit storage management in AsterixDB (Section 4.2.1), then we discuss the need for temporary datasets (Section 4.2.2), and finally we dive into the details of temporary datasets, including their DDL (Section 4.2.3), runtime implementation (Section 4.2.4), and lifecycle (Section 4.2.5).

4.2.1 Storage Management in AsterixDB

AsterixDB [30] takes component-based Log-Structured Merge-trees [82] as the storage technology for all of its index structures, including its B+-tree, R-Tree, inverted keyword, and inverted ngram indexes. With LSM-indexes, writes are batched in in-memory components (e.g., in-memory B+-trees) first until they are full. Full in-memory components are flushed to disks and become on-disk components (e.g., on-disk B+-trees). On-disk components are merged at a certain point based on a certain merge policy [30]. AsterixDB provides record-level ACID (atomicity, consistency, isolation, durability) transactions — transactions begin and end implicitly for each inserted, deleted, or searched record when an insert, delete, or query statement is being executed. Transaction read/write locks are only acquired on primary keys for primary index accesses, based on two-phase locking (2PL) [25]. Secondary index accesses are not locked. To guarantee record-level read consistency, secondary index searches are post-validated when accessing records from the primary index. AsterixDB employs a no-steal and no-force update policy according to which in-memory components cannot be flushed until the log records of the corresponding batched modifications are safely on the log disk. Group commits are used to optimize log I/O.
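A greatly simplified sketch of this write/read path is shown below; it is a toy model (an in-memory TreeMap flushed into immutable sorted components), not AsterixDB's LSM implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/**
 * Sketch of the LSM write/read path: writes go to an in-memory component, a full
 * in-memory component is "flushed" into an immutable on-disk component, and a
 * lookup consults the in-memory component first and then the on-disk components
 * from newest to oldest.
 */
public class LsmSketch {
    private static final int MEMORY_BUDGET = 3; // entries per in-memory component

    private TreeMap<String, String> memory = new TreeMap<>();
    private final List<TreeMap<String, String>> diskComponents = new ArrayList<>();

    public void insert(String key, String value) {
        memory.put(key, value);
        if (memory.size() >= MEMORY_BUDGET) {
            flush();
        }
    }

    private void flush() {
        diskComponents.add(memory);   // becomes an immutable on-disk component
        memory = new TreeMap<>();     // start a fresh in-memory component
    }

    public String lookup(String key) {
        if (memory.containsKey(key)) {
            return memory.get(key);
        }
        for (int i = diskComponents.size() - 1; i >= 0; i--) { // newest first
            String v = diskComponents.get(i).get(key);
            if (v != null) {
                return v;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        LsmSketch index = new LsmSketch();
        index.insert("a", "1");
        index.insert("b", "2");
        index.insert("c", "3"); // triggers a flush
        index.insert("a", "4"); // newer value shadows the flushed one
        System.out.println(index.lookup("a")); // 4
        System.out.println(index.lookup("b")); // 2
    }
}

Component merging, which the real system performs according to a merge policy, is omitted here; it would simply fold several of the sorted components into one.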

92 Let us consider two example AQL [29] queries to walk through the mechanisms. The first is an insert query:

use dataverse graph;

insert into dataset Test2
(
    for $t in dataset Test1 return $t
)

In this query, let us assume that datasets Test1 and Test2 have identical record types and identical primary keys. Figure 4.2 shows the resulting AsterixDB query plan, which is best read bottom-up. As annotated in the query plan, a read lock (on a primary key in Test1) will be acquired before reading a record from the source dataset Test1 and a write lock (on the primary key in Test2) will be acquired before inserting the record into the destination dataset Test2. Also, before inserting the record into an in-memory component of dataset Test2, a write counter associated with the component will be incremented and a write-ahead log record will be appended to the log buffer of the log manager. In the commit operator, a commit log record for each inserted record's primary key will be appended to the log buffer too. The log manager writes the log records to the log disk in a batch-based asynchronous manner (a.k.a., group commit). Once a commit log record has reached the log disk, the read and write locks on the primary key contained in the commit log record are both released and the write counter in the corresponding in-memory component is decremented.

If a state is reached when all in-memory components of dataset Test2 are full, further inserts will be blocked until one of the in-memory components is flushed to become an on-disk component — and a component can only be flushed when its write counter is zero (i.e., when it contains no un-committed writes).
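A minimal sketch of the write counter that gates flushing (simplified from AsterixDB's actual component state machine) is:

import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch of the in-memory-component write counter: each in-flight record write
 * increments the counter, each commit decrements it, and the component may only
 * be flushed once the counter reaches zero, i.e., once it holds no uncommitted writes.
 */
public class WriteCounterSketch {
    private final AtomicInteger uncommittedWrites = new AtomicInteger();

    public void beginWrite()  { uncommittedWrites.incrementAndGet(); }
    public void commitWrite() { uncommittedWrites.decrementAndGet(); }

    /** A flush is allowed only when every batched write has committed. */
    public boolean canFlush() { return uncommittedWrites.get() == 0; }

    public static void main(String[] args) {
        WriteCounterSketch component = new WriteCounterSketch();
        component.beginWrite();
        System.out.println(component.canFlush()); // false: one write not yet committed
        component.commitWrite();
        System.out.println(component.canFlush()); // true
    }
}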

[Figure 4.2 shows the parallel plan for the scan-insert query, read bottom-up: empty-tuple-source → data-scan of graph:Test1 (acquiring a read lock per record) → insert into graph:Test2 partitioned by the primary key (acquiring a write lock and incrementing the in-memory component's write counter, with log records sent to the log manager) → commit (releasing the locks and decrementing the write counter); $$26 denotes the primary key and $$2 the payload record.]

Figure 4.2: The query plan of a scan-insert query.

The second query is a search query:

use dataverse graph;

let $p := create-point(46.4208, 76.4114)
let $r := create-circle($p, 0.05)

for $tm in dataset TwitterMessages
where spatial-intersect($tm.sender-location, $r)
return $tm;

Let us assume that there is a secondary R-Tree index on the field sender-location of the dataset TwitterMessages. The query plan is shown in Figure 4.3. As we can see from the query plan, the R-Tree search is not locked at all, but each primary index lookup for a primary key fetched from the R-Tree is protected by a read lock. Also, for consistency, each record retrieved from the primary index will be checked again against the query predicate in the Select operator as a post-validation step before it is added to the search results.

Figure 4.3: The query plan of a secondary index search query (the R-Tree search runs without locks; each primary index lookup acquires and releases a read lock, and a post-validation select re-checks the predicate).

In the distributed case, a dataset has its primary and secondary indexes partitioned by the primary key. The metadata of datasets and indexes is recorded in the system's metadata datasets.
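For intuition, partition assignment by primary key can be pictured as a simple hash over the key, as in the following sketch (the real partitioning function and partition count are internal to AsterixDB; the names below are illustrative):

// Illustrative only: maps a primary key to one of N storage partitions.
public class PrimaryKeyPartitionerSketch {
    private final int numPartitions;

    public PrimaryKeyPartitionerSketch(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partitionOf(long primaryKey) {
        // Both the primary index and all secondary indexes of the dataset
        // live in the partition chosen by the primary key.
        return (Long.hashCode(primaryKey) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        PrimaryKeyPartitionerSketch p = new PrimaryKeyPartitionerSketch(16);
        System.out.println(p.partitionOf(1711777760L)); // e.g., a tweetid -> partition id
    }
}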

4.2.2 Motivation

Record-level ACIDity guarantees that AsterixDB will work correctly in mission-critical environments. However, in real-world data science use cases, data scientists often do experimental and exploratory analysis: temporary, intermediate datasets are created from time to time, to be used repeatedly for visualizations and summarizations or for gluing together different analytical systems. For instance, in the Twitter influence ranking example in Section 4.1, the intermediate tables (datasets) MsgGraph and Result are used as a data exchange medium between a tabular data processing runtime (Hive) and a graph processing runtime (Giraph), and they are only accessed by one data scientist in a serial fashion. The two materialized temporary tables might be accessed again later for other analytical tasks. In this case, ACIDity is overkill: it brings few benefits but hurts performance, both for this data scientist and for others competing for the cluster's physical resources. In order to give users an alternative "unsafe" but useful option for this scenario, we have added AsterixDB support for temporary datasets, which do not have the same ACID guarantees. In addition, regular datasets are logically "heavy": they must be registered in the overall system catalogs and they potentially pollute the system's persistent namespace. Both costs are avoided with the new temporary dataset support.

4.2.3 Temporary Dataset DDL

We add the “temporary” keyword to the create dataset statement of AsterixDB [29]. For instance, similar to the example in Section 4.1, we can create a temporary dataset for the constructed graph from tweets in AsterixDB by executing the following DDL statements:

use dataverse graph;

drop type VertexType if exists;
drop dataset MsgGraph if exists;

create type VertexType as {
    vertexid: int64,
    value: double,
    edges: [int64]
};

create temporary dataset MsgGraph(VertexType) primary key vertexid;

Other DDLs like create index and drop dataset for temporary datasets are then the same as for regular datasets.

4.2.4 Runtime Implementation

At the logical level, a temporary dataset is different from a regular dataset in three aspects:

• it is not locked, either for reads or writes;
• there is no transaction log for its writes, hence there is no restorative crash recovery;
• its indexes (primary or secondary) are not forced to disk after a bulk load (i.e., a load dataset statement, or an internal merge of its on-disk components).

AsterixDB internally already uses a pluggable architecture for locking and logging: the locking and logging behaviors associated with the index point search operator, the index range search operator, and the index insert/delete operator can be customized by plugging in implementations of the logging- and locking-related hook interfaces. Hence, we can use this mechanism to add temporary-dataset-specific implementations that perform no locking and no WAL (write-ahead logging) activity. We must also add a branch in the commit operator for temporary datasets: instead of appending a commit log record to the log buffer of the log manager, it simply decrements the write counter of the in-memory component that stores the insert (or delete).

Figure 4.4: The query plan of a scan-insert query for temporary datasets (no locks are acquired and no log records are written; only the write counter of the in-memory component is incremented and decremented; $$26 is the primary key and $$2 is the payload record).
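The pluggable locking and logging hooks just described can be pictured with the following sketch. The interface and class names here are invented for illustration and are not AsterixDB's actual hook interfaces: a regular dataset plugs in a locking/WAL implementation, while a temporary dataset plugs in no-op implementations.

// Illustrative hook interface; AsterixDB's real interfaces differ in detail.
interface IndexModificationHooks {
    void beforeModify(long primaryKey);   // e.g., acquire lock, append WAL record
    void afterCommit(long primaryKey);    // e.g., append commit record / decrement counter
}

// Hooks used for regular datasets: lock and write-ahead log.
class RegularDatasetHooks implements IndexModificationHooks {
    @Override public void beforeModify(long primaryKey) {
        System.out.println("lock(" + primaryKey + "); appendWAL(" + primaryKey + ")");
    }
    @Override public void afterCommit(long primaryKey) {
        System.out.println("appendCommitLog(" + primaryKey + ")");
    }
}

// Hooks used for temporary datasets: no locking and no logging at all.
class TemporaryDatasetHooks implements IndexModificationHooks {
    @Override public void beforeModify(long primaryKey) { /* intentionally empty */ }
    @Override public void afterCommit(long primaryKey)  { /* intentionally empty */ }
}

public class PluggableHooksSketch {
    public static void main(String[] args) {
        IndexModificationHooks hooks = new TemporaryDatasetHooks(); // chosen per dataset kind
        hooks.beforeModify(42L);
        // ... perform the actual index insert/delete here ...
        hooks.afterCommit(42L);
    }
}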

If the datasets Test1 and Test2 in the example of Section 4.2.1 are replaced with temporary datasets, the query plans of the scan-insert query and the secondary index search query become the ones in Figures 4.4 and 4.5, respectively. (Note that in Figure 4.5, a post-validation step is still necessary to guarantee that the returned results of the R-Tree search query do not contain false positives.)

While locks are avoided, latches are still required for temporary datasets in the current implementation, as temporary datasets share the same disk buffer cache with regular datasets.

4.2.5 Temporary Dataset Lifecycle

A temporary dataset has the following lifecycle properties:

Figure 4.5: The query plan of a secondary index lookup on a temporary dataset (no locks are acquired; the post-validation select is still present).

• It does not survive an AsterixDB instance restart: each slave node clears the directory for temporary datasets at node startup or restart time.
• It will be garbage-collected if it has not been accessed in the last 30 days (or another chosen interval); a background garbage collector thread is in charge of that, since a data scientist may create a large number of temporary datasets and forget to drop some of them from time to time. (A sketch of such a collector thread follows this list.)
• Its metadata (including the metadata of all its secondary indexes) is not stored in the AsterixDB metadata store [29]. Instead, it is only kept in the master machine's memory, so an instance restart will lose the metadata.
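A minimal sketch of such a background collector is shown below. The names and the scheduling interval are illustrative; the real implementation lives inside AsterixDB's master-node code and would drop the dataset's files and in-memory metadata rather than print a message.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative sketch of the temporary-dataset garbage collector thread.
public class TempDatasetGcSketch {
    private static final long MAX_IDLE_MILLIS = TimeUnit.DAYS.toMillis(30);

    // dataset name -> last access timestamp (maintained by the query layer).
    private final Map<String, Long> lastAccess = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void recordAccess(String dataset) {
        lastAccess.put(dataset, System.currentTimeMillis());
    }

    public void start() {
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            for (Map.Entry<String, Long> e : lastAccess.entrySet()) {
                if (now - e.getValue() > MAX_IDLE_MILLIS) {
                    dropTemporaryDataset(e.getKey());
                    lastAccess.remove(e.getKey());
                }
            }
        }, 1, 1, TimeUnit.HOURS);
    }

    private void dropTemporaryDataset(String dataset) {
        // In the real system this would drop the dataset's files and in-memory metadata.
        System.out.println("garbage-collecting temporary dataset " + dataset);
    }
}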

In summary, temporary datasets offer AsterixDB users a more lightweight way of generating, storing, and scanning temporary data, while still supporting flexible access patterns over that data; e.g., full queries, including secondary index searches, are supported. We will investigate their performance benefits in Section 4.5.

4.3 The External Connector Framework

The temporary datasets proposed in Section 4.2 offer an “unsafe” but lightweight option for storing data in AsterixDB for data science use cases. This section explores the other side of rich(er) graph analytics — how to build an extensible architecture for a special-purpose Big Data analytics engine like Pregelix to process data from various data sources, e.g., HDFS, AsterixDB, and many others. Section 4.3.1 explains the motivation, Section 4.3.2 introduces a generalized connector framework for Pregelix, and Section 4.3.3 describes the implementation of an AsterixDB connector for Pregelix on top of the generalized connector framework.

4.3.1 Motivation

Nowadays, there are more and more available storage options for enterprise data lakes. For example, distributed file systems like HDFS [10] are typically used for storing archived raw data, e.g., machine logs, click streams, and browse histories; NoSQL stores like HBase [12], Cassandra [2], MongoDB [17], and Couchbase [3] are usually used to support fast point reads and writes for online applications, such as mobile games, transportation apps, and e-commerce websites; distributed search engines like Elastic Search [5] are often used for storing and searching textual data; AsterixDB is also an option for ingesting and more richly querying semi-structured data. Each storage option fits a particular category of applications.

The Pregelix system we described in Chapter 3 only supports HDFS as its data source and sink. A naïve way to allow Pregelix to interact with different storage systems is to simply adopt HDFS as the glue medium, similar to the example in Section 4.1. However, that approach has two drawbacks:

• it adds extra HDFS read-write round trips;
• simply using the client query API (e.g., a REST API [58]) offered by a storage system may limit the parallelism of loading data into Pregelix.

These shortcomings have motivated us to build an extensible connector framework (Section 4.3.2) for Pregelix to directly read from and write to external storage systems. To demonstrate the usefulness of the framework and allow AsterixDB users to enjoy graph analytics capabilities, we have used it to implement an AsterixDB connector for Pregelix (Section 4.3.3).

4.3.2 The Connector Framework

To support various data sources and sinks for Pregelix, we abstract its data readers, writers, and data transformers as parallel operators. A parallel operator encapsulates two aspects:

• what actual work should be done at a single parallel partition level;
• where its actual worker threads should be run.

Besides parallel operators, we encapsulate how the external data is partitioned and sorted as physical properties. Physical properties can be used to generate partitioners and sorters at runtime for each partition of the corresponding external data.
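As an illustration of how such physical properties can drive runtime behavior, the following self-contained sketch (with invented names, not the actual Pregelix classes) turns a declared partitioning key and sort order into a partitioner and a comparator:

import java.util.Arrays;
import java.util.Comparator;

// Illustrative sketch: physical properties of external data (partitioning key,
// sort order) generate a partitioner and a sorter for each data partition.
public class PhysicalPropertiesSketch {
    static class Record {
        final long key; final String payload;
        Record(long key, String payload) { this.key = key; this.payload = payload; }
    }

    private final int numPartitions;
    private final boolean ascending;

    PhysicalPropertiesSketch(int numPartitions, boolean ascending) {
        this.numPartitions = numPartitions;
        this.ascending = ascending;
    }

    // Partitioner derived from the partitioning key of the data sink.
    int partition(Record r) {
        return (Long.hashCode(r.key) & Integer.MAX_VALUE) % numPartitions;
    }

    // Sorter derived from the declared sort order of the data sink.
    Comparator<Record> sorter() {
        Comparator<Record> c = Comparator.comparingLong(r -> r.key);
        return ascending ? c : c.reversed();
    }

    public static void main(String[] args) {
        PhysicalPropertiesSketch props = new PhysicalPropertiesSketch(4, true);
        Record[] batch = { new Record(7, "b"), new Record(3, "a") };
        Arrays.sort(batch, props.sorter());
        System.out.println(props.partition(batch[0]) + " " + batch[0].payload);
    }
}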

For the parallel operator abstraction, the read connector interface is as follows:

public interface IReadConnector {
    /** Returns the data read parallel operator. */
    public ParallelOperator getReadOperatorDescriptor();

    /** Returns the data transformation (from "external" to Pregelix) parallel operator. */
    public ParallelOperator getReadTransformOperatorDescriptor();
}

Similarly, the write connector interface is as follows. In addition to the write and transform parallel operators, a write connector should also encapsulate information about physical properties required by the sink storage system.

public interface IWriteConnector {
    /** Returns the physical properties of the data sink. */
    public PhysicalProperties getPhysicalProperties();

    /** Returns the data write parallel operator. */
    public ParallelOperator getWriteOperatorDescriptor();

    /** Returns the data transformation (from Pregelix to "external") parallel operator. */
    public ParallelOperator getWriteTransformOperatorDescriptor();
}

Figure 4.6 demonstrates how the generated runtime operators of read and write connectors are articulated with a Pregelix job. In the figure, readers 1-3 and transformers 11-13 are generated from a read connector, while transformers 21-23 and writers 1-3 are generated from a write connector. How data is redistributed from transformers 11-13 to Pregelix partitions is dictated by the Pregelix computation's partitioning requirement. In contrast, how data is redistributed from transformers 21-23 to writers 1-3 is determined by the sink storage system's requirement, i.e., by the physical properties of the write connector. Note that sorting operators may need to be injected before the writers to sort data locally, depending on the requirements of the destination storage system, which are also contained as part of the physical properties of the write connector. The connector framework is flexible enough to generate different dataflows based on the physical properties of different write connectors.

Figure 4.6: The connector runtime dataflow (readers and read transformers feed the Pregelix partitions; write transformers and writers move results back to the storage system).

In the example of Figure 4.6, inside the Pregelix job, the redistributed outputs of transformers 11-13 are sorted and bulk-loaded into partitioned Pregelix vertex B-trees (as described in Section 3.5.2); the inputs of transformers 21-23 come from scanning the partitioned vertex B-trees in Pregelix. Other than the data loading and result writing parts, the main Pregelix computation dataflows (i.e., supersteps) remain the same as those described in Chapter 3.

4.3.3 The AsterixDB Connector

For the purpose of implementing the read and write connectors for AsterixDB, knowledge of where an AsterixDB dataset is partitioned and stored, as well as how the binary representation of a record in the dataset should be parsed (or constructed), must be obtainable from outside of AsterixDB. Thus, we have added a REST API [58] to AsterixDB that enables outside connector implementations to retrieve this meta information for a dataset. The API's input includes a dataverse name (i.e., the AQL equivalent of a database in SQL [20]) and a dataset name (i.e., the AQL equivalent of a table in SQL). The API's output is a JSON [15] object with four fields:

• record type, the type description of records in the dataset;
• primary key, the name(s) of the primary key field(s) of the dataset;
• temporary dataset flag, indicating whether the dataset is temporary or not;
• partitioned files, a list of file descriptors, with each containing an absolute file path and the IP address of its hosting machine.

Using this new API, our AsterixDB/Pregelix read and write connector implementations pull the meta information of a dataset from the REST API and then construct parallel operators and physical properties based on the retrieved information. With the partitioned files, the read (or write) connector is able to create LSM B+-tree scanners (or bulk-loaders) to read (or write) those files. With the record type, the read (or write) connector can create read (or write) transformation operators that parse (or build) AsterixDB records correctly. With the primary key, the write connector can create its physical properties, which can then generate partitioners and sorters for bulk loading the target AsterixDB dataset. In our implementation, both the read and write connectors check the temporary dataset flag and reject accesses to regular datasets; this protects the ACIDity of regular datasets.
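As an example of how a connector might call this metadata API, a plain HTTP GET suffices; the returned JSON can then be handed to the operator-construction code. The endpoint path, query parameter names, and port below are assumptions for illustration, not the documented AsterixDB REST contract.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Illustrative client for the dataset-metadata REST API; endpoint details are assumed.
public class DatasetMetadataClientSketch {
    public static String fetchMetadata(String host, int port,
                                       String dataverse, String dataset) throws Exception {
        URL url = new URL("http://" + host + ":" + port
                + "/connector?dataverseName=" + dataverse + "&datasetName=" + dataset);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        StringBuilder json = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                json.append(line);
            }
        }
        // The returned JSON object is expected to carry the record type, the primary
        // key field(s), the temporary-dataset flag, and the list of partitioned files.
        return json.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchMetadata("localhost", 19002, "graph", "MsgGraph"));
    }
}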

In addition to this new metadata API, AsterixDB exposes another REST API that allows outsiders to request a synchronous flush of the in-memory components of a dataset to the OS file system. The read connector issues such a request before the Pregelix data loading runtime starts, to make sure that the latest in-memory batched updates of the dataset can be read.

From a Pregelix user's perspective, in order to read an input graph from AsterixDB, s/he needs to provide a VertexInputConverter interface implementation (shown below). Similarly, in order to write the output graph states to AsterixDB, s/he needs to provide a VertexOutputConverter interface implementation.

public interface VertexInputConverter {
    /** A user-defined function to convert an AsterixDB record to a vertex. */
    public void convert(ARecordVisitablePointable recordPointable, Vertex vertex);
}

public interface VertexOutputConverter {
    /** A user-defined function to convert a vertex to an AsterixDB record. */
    public void convert(Vertex vertex, RecordBuilder recordBuilder);
}

The convert methods in the two interfaces above are called from the read and write transformation operators, respectively. The two operators follow the design paradigm that we proposed in Chapter 2 to avoid unnecessary Java object creation. A user can implement arbitrary conversions; s/he must therefore guarantee that both the resulting vertex ids and the resulting primary key values are globally unique.

In Pregelix, we provide default implementations of the two converters for user convenience. The default VertexInputConverter can parse an AsterixDB record of the form (vertexid, value, edgelist) into a vertex, and the default VertexOutputConverter can build a record of the form (vertexid, value) from an input vertex. Users with more sophisticated needs can replace these with their own customized implementations.
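The following self-contained sketch mirrors the shape of such an input conversion using simplified stand-in types; the real default converter operates on AsterixDB's binary record pointables (ARecordVisitablePointable) and Pregelix's Vertex class, not on the toy classes below.

import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the AsterixDB record and the Pregelix vertex.
public class DefaultVertexInputConverterSketch {
    static class SimpleRecord {
        long vertexid; double value; List<Long> edges = new ArrayList<>();
    }
    static class SimpleVertex {
        long id; double value; final List<Long> destinations = new ArrayList<>();
    }

    // Mirrors the default conversion: a (vertexid, value, edgelist) record -> vertex.
    static void convert(SimpleRecord record, SimpleVertex vertex) {
        vertex.id = record.vertexid;
        vertex.value = record.value;
        vertex.destinations.clear();        // reuse the vertex object; avoid new allocations
        vertex.destinations.addAll(record.edges);
    }

    public static void main(String[] args) {
        SimpleRecord r = new SimpleRecord();
        r.vertexid = 1L; r.value = 1.0 / 10000000000.0; r.edges.add(2L);
        SimpleVertex v = new SimpleVertex();
        convert(r, v);
        System.out.println(v.id + " -> " + v.destinations);
    }
}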

With the AsterixDB connector, a Pregelix job can take either an AsterixDB temporary dataset or an HDFS path as the input data source, and it can take either of them as the output data sink. If an AsterixDB temporary dataset is taken as the data sink, any pre-existing records in the dataset will be erased as a result of executing the new Pregelix job.

4.4 Running Pregel from AsterixDB

With the infrastructure built in Sections 4.2 and 4.3, we have added a run pregel statement to the DML of AQL [29]. The syntax of the statement is as follows:

run pregel (<jar path>, <full class name>, <argument string>)
from dataset <dataset name>
to dataset <dataset name>

This new statement is compiled into a Pregelix job that takes the dataset after from as the data source and the dataset after to as the data sink. The jar path specifies the location of the job's binary, the full class name points to a user-defined job client entrance class, and the argument string is for the main method in the job client entrance class.
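Conceptually, and as discussed later in Section 4.5.4, the statement is currently translated into a shell command that launches a Pregelix job client. The sketch below illustrates that translation with ProcessBuilder; the launcher script name and the flag names are assumptions for illustration, not Pregelix's actual command-line interface.

import java.util.ArrayList;
import java.util.List;

// Illustrative translation of a "run pregel" statement into a job-submission command.
public class RunPregelLauncherSketch {
    public static Process launch(String jarPath, String clientClass, String argumentString,
                                 String fromDataset, String toDataset) throws Exception {
        List<String> cmd = new ArrayList<>();
        cmd.add("bin/pregelix");       // hypothetical launcher script
        cmd.add(jarPath);              // location of the job binary
        cmd.add(clientClass);          // user-defined job client entrance class
        for (String arg : argumentString.trim().split("\\s+")) {
            cmd.add(arg);              // arguments for the client's main method
        }
        cmd.add("-input-asterixdb-dataset");   // hypothetical flag: the dataset after "from"
        cmd.add(fromDataset);
        cmd.add("-output-asterixdb-dataset");  // hypothetical flag: the dataset after "to"
        cmd.add(toDataset);
        return new ProcessBuilder(cmd).inheritIO().start();
    }

    public static void main(String[] args) throws Exception {
        launch("examples/pregelix-example-jar-with-dependencies.jar",
               "edu.uci.ics.pregelix.example.PageRankVertex",
               "-num-iteration 5", "graph.MsgGraph", "graph.results").waitFor();
    }
}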

Let us revisit the example in Section 4.1. To achieve the same analytical task in AQL, we first run several DDLs to create a TwitterMessages dataset as well as the data types for the two temporary intermediate datasets, as follows.

use dataverse graph;

// Creates the Twitter message type.
create type TwitterMessageType as {
    tweetid: int64,
    user: string,
    sender-location: point,
    send-time: datetime,
    reply-from: int64,
    retweet-from: int64,
    referred-topics: {{ string }},
    message-text: string
};

// Creates the TwitterMessages dataset.
create dataset TwitterMessages(TwitterMessageType) primary key tweetid;

// Loads the TwitterMessages dataset.
......

// Creates the data types for the intermediate temporary datasets.
create type VertexType as {vertexid: int64, value: double, edges: [int64]};
create type Result as {vertexid: int64, rank: double};

With the run pregel statement, we can now implement the example in Section 4.1 as follows in AQL (GQ1).

GQ1: Twitter Impact Ranking

use dataverse graph;

// DDLs that clean up and create temporary datasets.
drop dataset MsgGraph if exists;
drop dataset results if exists;
create temporary dataset MsgGraph(VertexType) primary key vertexid;
create temporary dataset results(Result) primary key vertexid;

// Populates temporary dataset MsgGraph.
insert into dataset MsgGraph
(
    for $tm in dataset TwitterMessages
    let $edge := switch-case($tm.reply-from >= 0, true, $tm.reply-from, false, $tm.retweet-from)
    where $edge >= 0
    return { "vertexid": $tm.tweetid, "value": 1.0/10000000000.0, "edges": [$edge] }
);

// Runs a Pregelix job. The job uses default input/output converters.
run pregel("examples/pregelix-example-jar-with-dependencies.jar",
           "edu.uci.ics.pregelix.example.PageRankVertex", "-num-iteration 5")
from dataset MsgGraph
to dataset results;

// Final query that ranks users.
for $tm in dataset TwitterMessages
for $r in dataset results
where $r.vertexid = $tm.tweetid
group by $un := $tm.user with $r
let $rank := sum(for $i in $r return $i.rank)
order by $rank desc limit 50
return { "name": $un, "rank": $rank };

In this example, the logic of both querying tabular data and running a graph algorithm from a Pregelix library is written in AQL. A data scientist does not need to know how datasets like MsgGraph and results are physically stored, no longer needs to think about gluing different systems together through HDFS, and can just focus on writing the right queries and implementing and then employing the right "think-like-a-vertex" graph algorithms. (The speedup and scale-up properties of GQ1 will be experimentally evaluated in Section 4.5.3.)

In addition to the improved user experience, the secondary index support provided by AsterixDB could make the following graph query use case (GQ2) run at an interactive speed (i.e., within-a-minute response time) if there is an R-Tree secondary index on the sender-location field of tweets.

GQ2: Twitter Impact Ranking in a Given Spatial Region

use dataverse graph;

// DDLs that clean up and create temporary datasets.
drop dataset MsgGraph if exists;
drop dataset results if exists;
create temporary dataset MsgGraph(VertexType) primary key vertexid;
create temporary dataset results(Result) primary key vertexid;

// Populates temporary dataset MsgGraph.
insert into dataset MsgGraph
(
    let $p := create-point(46.4208, 76.4114)
    let $r := create-circle($p, 0.05)

    for $tm in dataset TwitterMessages
    let $edge := switch-case($tm.reply-from >= 0, true, $tm.reply-from, false, $tm.retweet-from)
    where $edge >= 0 and spatial-intersect($tm.sender-location, $r)
    return { "vertexid": $tm.tweetid, "value": 1.0/10000000000.0, "edges": [$edge] }
);

// The Pregel job and the final query are the same as those in GQ1.
......

Since we know upfront that the dataset results would not be very large and that the field tweetid is the primary key of the TwitterMessages dataset, the indexnl hint in the last ranking query lets the AsterixDB query optimizer generate an index nested loop join, which avoids redistributing the entire TwitterMessages dataset for a hash join. The combination of the indexing capability and the Pregel abstraction makes interactive graph analytics possible. (Performance results will be reported in Section 4.5.4.)
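To see why this avoids redistributing the large dataset, here is a toy sketch of an index nested-loop join, using a TreeMap as a stand-in for the TwitterMessages primary B+-tree index (the real operator works against AsterixDB's distributed, partitioned indexes): each record of the small results side probes the index directly instead of re-hashing both inputs.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy index nested-loop join: probe the (large) indexed side once per (small) outer record,
// instead of redistributing both sides for a hash join.
public class IndexNestedLoopJoinSketch {
    public static List<String> join(Map<Long, Double> results,              // small side: vertexid -> rank
                                    TreeMap<Long, String> tweetsByTweetId) { // stand-in primary index
        List<String> output = new ArrayList<>();
        for (Map.Entry<Long, Double> r : results.entrySet()) {
            String user = tweetsByTweetId.get(r.getKey());  // primary-key index lookup
            if (user != null) {
                output.add(user + " " + r.getValue());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> tweets = new TreeMap<>();
        tweets.put(1L, "alice");
        tweets.put(2L, "bob");
        Map<Long, Double> ranks = new TreeMap<>();
        ranks.put(2L, 0.42);
        System.out.println(join(ranks, tweets)); // [bob 0.42]
    }
}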

Last but not least, there is another category of applications where the graph extraction query can be more complex than those in GQ1 and GQ2. The following query (GQ3) is such an example; it analyzes and ranks people's instant influence on Twitter. This query only extracts replies and retweets that are posted within a day of the time when the messages they refer to were posted. With the join capabilities as well as the temporal data types in AsterixDB, this query can also be expressed simply and evaluated efficiently. (Experimental results will be discussed in Section 4.5.5.)

GQ3: Twitter Instant Impact Ranking

use dataverse graph;

// DDLs that clean up and create temporary datasets.
drop dataset MsgGraph if exists;
drop dataset results if exists;
create temporary dataset MsgGraph(VertexType) primary key vertexid;
create temporary dataset results(Result) primary key vertexid;

// Populates temporary dataset MsgGraph.
insert into dataset MsgGraph
(
    for $tm2 in dataset TwitterMessages
    for $tm1 in dataset TwitterMessages
    let $edge := switch-case($tm1.reply-from >= 0, true, $tm1.reply-from, false, $tm1.retweet-from)
    let $d := duration-from-interval(interval-from-datetime($tm1.send-time, $tm2.send-time))
    where $edge >= 0 and $edge = $tm2.tweetid and $d <= day-time-duration("P1D")
    return { "vertexid": $tm1.tweetid, "value": 1.0/10000000000, "edges": [$tm2.tweetid] }
);

// The Pregel job and the final query are the same as those in GQ1.
......

Name    Size      Number of Records
1×      440 GB    1,711,777,760
0.5×    220 GB    855,829,620

Table 4.1: The tweet dataset and its scale-downs.

4.5 Experiments

This section reports an initial experimental evaluation of the techniques developed in this chapter. Section 4.5.1 describes the experimental settings; Section 4.5.2 compares temporary datasets and regular datasets with respect to write performance; Section 4.5.3, Section 4.5.4, and Section 4.5.5 report the scaling properties of a long-running graph analytical query (GQ1, as described in Section 4.4), an interactive graph query (GQ2, as described in Section 4.4), and a graph query with a complex ETL (GQ3, as described in Section 4.4) respectively; finally, Section 4.5.6 summarizes the numbers.

4.5.1 Experimental Setup

We ran the experiments detailed here on a 16-node Linux IBM x3650 cluster with one additional master machine of the same configuration. Nodes are connected with a Gigabit Ethernet switch. Each node has one Intel Xeon processor E5520 2.26GHz with four cores, 8GB of RAM, and two 1TB, 7.2K RPM hard disks.

In the experiments, we used a synthetic tweet dataset in JSON format of the record type TwitterMessageType as described in Section 4.4. In the dataset, 10% of the tweets are retweets and another 10% are replies. If a tweet is an original message, the reply-from field and the retweet-from field are both −1. Table 4.1 shows size statistics for the synthetic datasets, including the full dataset (1×) as well as a scale-down sample (0.5×).

Parameter                                                    Value
Memory budget (for LSM in-memory components) per dataset     256 MB
Data page size                                               128 KB
Disk buffer cache size                                       1 GB
Bloom filter target false positive rate                      1%
Log buffer size (for regular datasets)                       4 MB
Number of data partitions                                    4
JVM maximum heap size                                        6 GB

Table 4.2: AsterixDB per-machine settings used throughout these experiments.

For all the experiments in this section, we used the AsterixDB per-machine settings listed in Table 4.2. Before running the evaluation queries, we loaded the generated tweets into the dataset TwitterMessages (as described in Section 4.4) and created an R-Tree secondary index on the sender-location field for the dataset.

4.5.2 Evaluations of Temporary Datasets

In this experiment, we ran the following insert query on a 16-machine AsterixDB instance with the 1× dataset:

insert into dataset MsgGraph2
(
    for $m in dataset MsgGraph
    return $m
)

In the query, dataset MsgGraph is a temporary dataset that resulted from the insert query in GQ1 (described in Section 4.4); it contains 342,366,948 records and occupies 23.36 GB. Both MsgGraph and MsgGraph2 have identical record types and primary keys. Before the insert, dataset MsgGraph2 is empty. We compared the two cases where dataset MsgGraph2 is a regular dataset versus where it is a temporary dataset. We also compared the two cases where the semantics of insert mean "insert if not exists, else raise an error if the key exists" (the default AsterixDB semantics) versus where the semantics of insert mean "insert if not exists, else update" (via a private change of a flag in the AsterixDB source code). We will use the term insert for the former semantics and the term upsert for the latter semantics. We evaluated the upsert option in addition to the default insert option because we wanted to understand the relative overhead of logging and searching (for duplicates), and because upsert delivers the same semantics as insert for this particular query (the input stream is guaranteed to have no duplicates and a temporary dataset is only used in a serial manner). Table 4.3 shows all of the results.

Metric                       Temporary Dataset   Regular Dataset
Insert Time (ms)             256,224             712,552
Insert TPS (per partition)   20,876              7,507
Insert TPS (on cluster)      1,336,097           480,479
Upsert Time (ms)             127,068             642,369
Upsert TPS (per partition)   42,099              8,327
Upsert TPS (on cluster)      2,694,360           532,953

Table 4.3: Insert/upsert performance: temporary datasets vs. regular datasets.

As we can see from Table 4.3, temporary datasets obtain a 2.78× speedup here for insert and a 5.05× speedup for upsert; these gains come from the reduced logging and locking overheads.

Interestingly, going from insert to upsert, regular datasets do not enjoy significant speedups. One reason is that with upsert, though the search overhead is removed, the system is not bound by CPU time but by the I/Os for forcing the transaction log. The other reason is that in our experiments, we did not use dedicated log disk devices, which consequently caused many more random I/Os.

4.5.3 Evaluation of GQ1 (Long-Running Analytics)

In this experiment, we tested the parallel speedup and scale-up of GQ1 (described in Section 4.4) using two AsterixDB instances: one had 16 slave machines and the other had 8 slave machines. In the speedup experiment, we ran GQ1 using the 1× dataset on the two AsterixDB instances. In the scale-up experiment, GQ1 was run with the 1× dataset on the 16-machine AsterixDB instance and with the 0.5× dataset on the 8-machine AsterixDB instance. We report numbers with both insert and upsert for the graph extraction AQL insert query. Figure 4.7(a) shows the parallel speedup and Figure 4.7(b) plots the parallel scale-up, both with the insert semantics. Figure 4.8(a) and Figure 4.8(b) are the corresponding results for the upsert semantics. The numbers show that the parallel speedup and scale-up of GQ1 are very good.

Figure 4.7: The parallel speedup and scale-up of GQ1 (insert). [(a) Parallel speedup and (b) parallel scale-up; execution time in seconds vs. number of machines (8, 16) for graph extraction, Pregelix, and ranking.]

Figure 4.8: The parallel speedup and scale-up of GQ1 (upsert). [(a) Parallel speedup and (b) parallel scale-up; execution time in seconds vs. number of machines (8, 16) for graph extraction, Pregelix, and ranking.]

Figure 4.9: The parallel speedup and scale-up of GQ2 (insert). [(a) Parallel speedup and (b) parallel scale-up; execution time in seconds vs. number of machines (8, 16) for graph extraction, Pregelix, and ranking.]

4.5.4 Evaluation of GQ2 (Interactive Analytics)

In this experiment, we ran GQ2 (described in Section 4.4) using the same settings as in Section 4.5.3. The spatial-intersect filter predicate in GQ2 selects 3,354 matched tweets, and the Pregel computation results contain 6,708 vertices. With the insert semantics, the parallel speedup and scale-up of the query are shown in Figures 4.9(a) and 4.9(b), respectively. Figures 4.10(a) and 4.10(b) report the numbers for the upsert semantics. As we can see from the results, with the help of the indexing capabilities in AsterixDB, queries like GQ2 can be answered much more quickly by avoiding a scan of the entire TwitterMessages dataset.

While GQ2 is much faster than GQ1, in our current implementation the run pregel statement is compiled to a shell command that kicks off a Pregelix job; this adds a constant client JVM startup time (taking from 10 to 20 seconds) into the picture. In future work, we would like to avoid this overhead by integrating a native Pregelix job client into AsterixDB. Additionally, Pregelix treats the jar file in a submitted job as a new jar, deploys it to the whole cluster when the job starts, and then erases it when the job finishes. Jar deployment adds another 5 to 10 seconds for each job. We can potentially reduce this cost by allowing a Pregelix job to reuse old deployments for cases when a given graph computation is used multiple times.

Figure 4.10: The parallel speedup and scale-up of GQ2 (upsert). [(a) Parallel speedup and (b) parallel scale-up; execution time in seconds vs. number of machines (8, 16) for graph extraction, Pregelix, and ranking.]

4.5.5 Evaluation of GQ3 (Complex Graph ETL)

In this experiment, we ran GQ3 (described in Section 4.4) using the same settings as in Section 4.5.3. GQ3 inserts 65,353 matched tweets into the temporary dataset MsgGraph, and the Pregel computation results contain 130,702 vertices. With the insert semantics, the parallel speedup and scale-up of the query are shown in Figures 4.11(a) and 4.11(b), respectively. Figures 4.12(a) and 4.12(b) report the numbers for the upsert semantics. The numbers show very good speedup and scale-up properties for GQ3. The reason is that the insert statement in GQ3, the bulk of which is a join, dominates the execution time, and AsterixDB supports parallel joins well [29].

4.5.6 Discussions

From the numbers reported for GQ1, GQ2, and GQ3, we can see that the combination of AsterixDB and Pregelix can fit a wide range of use cases. We also find that in these queries, insert and upsert do not make a significant difference in the end-to-end query response time, though upsert is shown to be 2× faster than insert for temporary datasets in Table 4.3.

Figure 4.11: The parallel speedup and scale-up of GQ3 (insert). [(a) Parallel speedup and (b) parallel scale-up; execution time in seconds vs. number of machines (8, 16) for graph extraction, Pregelix, and ranking.]

Figure 4.12: The parallel speedup and scale-up of GQ3 (upsert). [(a) Parallel speedup and (b) parallel scale-up; execution time in seconds vs. number of machines (8, 16) for graph extraction, Pregelix, and ranking.]

The reason is that in each of the three queries, first, the performance of the insert statement is not dominated by the actual inserts or upserts but by other parts in the physical query plan, e.g., scanning, filtering, searching, or joining; second, the end-to-end query response time is not always dominated by the insert statement.

4.6 Related Work

Real-world, large-scale, complex graph analytics has led to significant interest, both in industry and in academia, in building efficient and scalable graph processing platforms that can tightly interact with tabular data processing systems. There are extensions of data-parallel platforms to support graph processing as well as extensions of graph computing platforms to support tabular data processing. On one side, GraphX [97], which is built on top of the data-parallel platform Spark [108], provides a programming abstraction called Resilient Distributed Graphs (RDGs) to simplify graph loading, construction, transformation, and computation. The integration with Spark allows a user to run a GraphX program together with a wide range of Spark ecosystem tools, such as SparkSQL [35], DataFrames [33], SparkR [32], and so on. Similarly, in Aster Data, Pregel-like, user-defined graph analytics functions can be invoked from SQL queries [91]. On the other side, distributed GraphLab [73] is adding more and more tabular data processing capabilities into its runtime [87], and Giraph [9] has built connectors for Hive [92] and HBase [12]. However, those latter two graph processing systems still require users to interact with multiple data processing platforms to achieve a complex analytical task, while the GraphX/Spark ecosystem offers users a single, unified entry at the logical level. This chapter has worked towards a similar convergence too, but starting from the Big Data management system (BDMS) side. Similar to GraphX and Aster Data, the integration of Pregelix (Chapter 3) and AQL [29] (as described in Section 4.4) also offers users a single, logical entry for rich(er) graph analytics. However, the indexing capability of AsterixDB, as well as its built-in support for semi-structured data and its ability to work against data that does not fit into main memory, further expands the landscape of well-supported analytics; GQ2 is such an example.

4.7 Summary

Complementary to Chapter 3, which investigated how to build a scalable Big Graph analytics platform, this chapter has studied the problem that "graph analytics are not only about graphs"; i.e., a complex graph analytical pipeline often needs to extract graphs, execute distributed graph algorithms, and then run further post-processing queries. We first added an "unsafe" but lightweight storage option, temporary datasets, to AsterixDB for data science use cases. We then built an external connector framework for Pregelix to access various data sources, and we implemented an AsterixDB connector on top of it. Finally, we extended AQL, the query language of AsterixDB, with a run pregel statement. These techniques combine to offer users a simple, unified way to run rich(er) graph analytical queries. Preliminary experimental evaluations demonstrated that the resulting system has good speedup and scale-up properties for both long-running, offline graph queries and fast, interactive graph queries.

Chapter 5

Conclusions and Future Work

5.1 Conclusion

In this thesis, we have studied several issues and challenges in building a scalable software infrastructure for end-to-end Big Graph analytics.

We first investigated a foundational issue for building Big Data platforms: how to write efficient and scalable programs in managed languages. The bloat-aware design paradigm that we described in Chapter 2 is used throughout the entire system stack that we have been building, e.g., Hyracks [42], Algebricks [41], Pregelix (Chapter 3), AsterixDB [29], VXQuery [27], and IMRU [85]. By following this design paradigm, these systems all incur a very limited amount of garbage collection overhead and use their memory resources in an economical fashion.

We next explored a new architectural approach to building a Big Graph analytics platform in Chapter 3, where we implemented the Pregel message-passing semantics as iterative dataflows but kept its user-friendly "think-like-a-vertex" client programming abstraction unchanged. Based on this architecture, we built a new open source Big Graph analytics platform called Pregelix. In contrast to other current open source Big Graph analytics platforms, the dataflow-based architecture and the Hyracks data-parallel runtime together give Pregelix the ability to work gracefully with any given level of resources, ranging from a single memory-constrained machine to a large cluster.

Last but not least, we designed and built an integration of Pregelix and AsterixDB in Chapter 4; it provides users a simple, unified logical entry point for richer forms of graph analytics with less ETL pain. With the resulting system, data scientists can work at the logical level of analytical problems without worrying about the physical details of gluing together tabular data processing platforms and graph processing platforms. This exploration also lays a foundation for AsterixDB to incorporate more specialized analytics runtimes to enrich its range of supported analytics.

Based on the design paradigm, architecture, and methodology that we have proposed and evaluated in this thesis, the resulting open-source systems Pregelix (http://pregelix.ics.uci.edu) and AsterixDB (http://asterixdb.ics.uci.edu) are available for download and use in support of efficient, scalable, robust, and simple end-to-end Big Graph Analytics.

5.2 Future Work

Further along the path of this thesis, we believe there are several exciting and challenging directions to be explored:

• Currently, the Pregelix system described in Chapter 3 directly generates Hyracks runtime dataflows. It would be interesting to re-architect the system to be based on Algebricks [41], a data model agnostic algebra layer that is now available on top of the Hyracks runtime. With a recursion extension [44], Algebricks should be able to serve as a substrate for Pregelix. Building the "think-like-a-vertex" programming abstraction on top of a logical algebra layer will further increase the system's physical flexibility, and once Algebricks has a cost-based optimizer, Pregelix can enjoy the benefits as well. Moreover, such an architecture would allow an even tighter and more natural integration with AsterixDB than what Chapter 4 describes; the Pregel semantics could be captured as a function in the AQL query language at the logical level, without requiring materialized temporary datasets.

• The external connector framework (described in Section 4.3) was designed to be general enough to let Pregelix work with any other Hyracks-based systems. Given that, two broadening directions would be interesting. First, we could build Pregelix connector implementations for a broader range of ecosystems, such as HBase [12], Hive [92], Cassandra [2], MongoDB [17], and Couchbase [3]. Those connectors would expand the landscape that Pregelix can work in. Second, we could reuse the current AsterixDB connector for other Hyracks-based systems, e.g., IMRU [85] or the Hadoop-compatibility layer [42]. We could then build a machine learning job entrance capability as well as a Hadoop job entrance capability in AsterixDB, which would further enrich the supported range of analytic tasks in AsterixDB.

• Temporary datasets (described in Section 4.2) are conceptually similar to materialized views [66] and to Spark RDDs (resilient distributed datasets) [108]. On one side, the internal mechanism of temporary datasets could be further extended to support materialized views in AsterixDB. Thanks to the indexing capability of AsterixDB, the materialized views would also be indexable. On the other side, similar to Spark RDDs, we could potentially build fault-tolerance mechanisms for temporary datasets for the case of slave machine failures, e.g., recovering a lost partition using its lineage data.

Bibliography

[1] BTC. http://km.aifb.kit.edu/projects/btc-2009/.

[2] Cassandra. http://cassandra.apache.org/.

[3] Couchbase. https://www.couchbase.com/.

[4] DataLake. https://en.wikipedia.org/wiki/Data_lake.

[5] Elastic Search. https://www.elastic.co/.

[6] Event Causality. https://en.wikipedia.org/wiki/Causality.

[7] Event Causality. https://plus.google.com/hangouts.

[8] Genomix. https://github.com/uci-cbcl/genomix.

[9] Giraph. http://giraph.apache.org/.

[10] Hadoop/HDFS. http://hadoop.apache.org/.

[11] Hama. http://hama.apache.org/.

[12] HBase. http://hbase.apache.org/.

[13] Hive+Giraph. https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920.

[14] Hivesterix. http://hyracks.org/projects/Hivesterix/.

[15] JSON. https://www.json.org/.

[16] Knowledge Graph. https://en.wikipedia.org/wiki/Knowledge_Graph.

[17] MongoDB. https://www.mongodb.org/.

[18] Pivotal. http://www.gopivotal.com/products/pivotal-greenplum-database.

[19] Pregelix. http://pregelix.ics.uci.edu.

[20] SQL. https://en.wikipedia.org/wiki/SQL.

[21] Teradata. http://www.teradata.com.

[22] The size of Facebook. https://en.wikipedia.org/wiki/Facebook/.

[23] The size of WWW. http://www.worldwidewebsize.com/.

[24] Twitter Usage Statistics. http://www.internetlivestats.com/twitter-statistics/.

[25] Two-phase Locking. https://en.wikipedia.org/wiki/Two-phase_locking.

[26] Vertica. http://www.vertica.com.

[27] VXQuery. http://vxquery.apache.org/.

[28] A. Aiken, M. Fähndrich, and R. Levien. Better static memory management: improving region-based analysis of higher-order languages. pages 174–185, 1995.

[29] S. Alsubaiee, Y. Altowim, H. Altwaijry, A. Behm, V. R. Borkar, Y. Bu, M. J. Carey, I. Cetindil, M. Cheelangi, K. Faraaz, E. Gabrielova, R. Grover, Z. Heilbron, Y. Kim, C. Li, G. Li, J. M. Ok, N. Onose, P. Pirzadeh, V. J. Tsotras, R. Vernica, J. Wen, and T. Westmann. Asterixdb: A scalable, open source BDMS. PVLDB, 7(14):1905–1916, 2014.

[30] S. Alsubaiee, A. Behm, V. R. Borkar, Z. Heilbron, Y. Kim, M. J. Carey, M. Dreseler, and C. Li. Storage management in asterixdb. PVLDB, 7(10):841–852, 2014.

[31] E. Altman, M. Arnold, S. Fink, and N. Mitchell. Performance analysis of idle programs. pages 739–753, 2010.

[32] Spark-R. https://amplab-extras.github.io/SparkR-pkg/.

[33] The Distributed DataFrame Project. http://ddf.io/.

[34] The Mahout Project. http://mahout.apache.org/.

[35] The SparkSQL Project. https://spark.apache.org/sql/.

[36] The Tez Project. http://tez.apache.org/.

[37] F. Bancilhon and R. Ramakrishnan. An amateur’s introduction to recursive query processing strategies. In SIGMOD, pages 16–52, 1986.

[38] D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: a programming model and execution framework for web-scale analytical processing. In SoCC, pages 119–130, 2010.

[39] W. S. Beebee and M. C. Rinard. An implementation of scoped memory for real-time java. In International Conference on Embedded Software (EMSOFT), pages 289–305, 2001.

[40] S. M. Blackburn and K. S. McKinley. Immix: a mark-region garbage collector with space efficiency, fast collection, and mutator performance. pages 22–32, 2008.

[41] V. Borkar, Y. Bu, P. Carman, N. Onose, T. Westmann, P. Pirzadeh, M. J. Carey, and V. J. Tsotras. Algebricks: A data model-agnostic compiler backend for big data languages. In SOCC, 2015.

[42] V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, 2011.

[43] C. Boyapati, A. Salcianu, W. Beebee, Jr., and M. Rinard. Ownership types for safe region-based memory management in real-time java. pages 324–337, 2003.

[44] Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, and R. Ramakrishnan. Scaling datalog for machine learning on big data. CoRR, abs/1203.0160, 2012.

[45] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1):285–296, 2010.

[46] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. Flumejava: easy, efficient data-parallel pipelines. pages 363–375, 2010.

[47] S. Chaudhuri. An overview of query optimization in relational systems. In PODS, pages 34–43, 1998.

[48] R. Chen, M. Yang, X. Weng, B. Choi, B. He, and X. Li. Improving large graph processing on partitioned graphs in the cloud. In SoCC, page 3, 2012.

[49] J. Cheng, Y. Ke, S. Chu, and C. Cheng. Efficient processing of distance queries in large graphs: a vertex cover approach. In SIGMOD Conference, pages 457–468, 2012.

[50] D. Comer. The ubiquitous b-tree. ACM Comput. Surv., 11(2):121–137, 1979.

[51] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. pages 137–150, 2004.

[52] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pages 137–150, 2004.

[53] D. J. DeWitt, S. Ghandeharizadeh, D. A. Schneider, A. Bricker, H.-I. Hsiao, and R. Rasmussen. The Gamma database machine project. IEEE Trans. Knowl. Data Eng., 2(1):44–62, 1990.

[54] D. J. DeWitt and J. Gray. Parallel database systems: The future of high performance database systems. Commun. ACM, 35(6):85–98, 1992.

[55] B. Dufour, B. G. Ryder, and G. Sevitsky. A scalable technique for characterizing the usage of temporaries in framework-intensive Java applications. pages 59–70, 2008.

[56] S. Even. Graph Algorithms. Cambridge University Press, New York, NY, USA, 2nd edition, 2011.

[57] S. Ewen, K. Tzoumas, M. Kaufmann, and V. Markl. Spinning fast iterative data flows. PVLDB, 5(11):1268–1279, 2012.

[58] R. T. Fielding and R. N. Taylor. Principled design of the modern web architecture. In Proceedings of the 22nd International Conference on on Software Engineering, ICSE 2000, Limerick Ireland, June 4-11, 2000., pages 407–416, 2000.

[59] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine grace. In VLDB, pages 209–219, 1986.

[60] D. Gay and A. Aiken. Memory management with explicit regions. pages 313–323, 1998.

[61] D. Gay and A. Aiken. Language support for regions. pages 70–80, 2001.

[62] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous parallel algorithms. J. Parallel Distrib. Comput., 22(2):251–267, 1994.

[63] G. Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73–170, 1993.

[64] G. Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73–170, 1993.

[65] D. Grossman, G. Morrisett, T. Jim, M. Hicks, Y. Wang, and J. Cheney. Region-based memory management in cyclone. pages 282–293, 2002.

[66] A. Y. Halevy. Answering queries using views: A survey. VLDB J., 10(4):270–294, 2001.

[67] N. Hallenberg, M. Elsman, and M. Tofte. Combining region inference and garbage collection. pages 141–152, 2002.

[68] M. Hicks, G. Morrisett, D. Grossman, and T. Jim. Experience with safe manual memory-management in cyclone. pages 73–84, 2004.

[69] I. Hoque and I. Gupta. LFGraph: Simple and fast distributed graph analytics. In TRIOS, 2013.

[70] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In EuroSys, pages 59–72, 2007.

[71] M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR, 2015.

[72] S. Kowshik, D. Dhurjati, and V. Adve. Ensuring code safety without runtime checks for real-time control systems. In International Conference on Architecture and Synthesis for Embedded Systems (CASES), pages 288–297, 2002.

[73] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB, 5(8):716–727, 2012.

[74] H. Makholm. A region-based memory manager for prolog. pages 25–34, 2000.

[75] G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135–146, 2010.

[76] S. R. Mihaylov, Z. G. Ives, and S. Guha. REX: Recursive, delta-based data-centric computation. PVLDB, 5(11):1280–1291, 2012.

[77] N. Mitchell, E. Schonberg, and G. Sevitsky. Making sense of large heaps. pages 77–97, 2009.

[78] N. Mitchell, E. Schonberg, and G. Sevitsky. Four trends leading to Java runtime bloat. IEEE Software, 27(1):56–63, 2010.

[79] N. Mitchell and G. Sevitsky. The causes of bloat, the limits of health. pages 245–260, 2007.

[80] N. Mitchell, G. Sevitsky, and H. Srinivasan. Modeling runtime behavior in framework-based applications. pages 429–451, 2006.

[81] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. pages 1099–1110, 2008.

[82] P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured merge-tree (LSM-tree). Acta Inf., 33(4):351–385, 1996.

[83] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.

[84] R. Ramakrishnan and J. Gehrke. Database Management Systems (3. ed.). McGraw- Hill, 2003.

[85] J. Rosen, N. Polyzotis, V. R. Borkar, Y. Bu, M. J. Carey, M. Weimer, T. Condie, and R. Ramakrishnan. Iterative mapreduce for large scale machine learning. CoRR, abs/1303.3517, 2013.

[86] S. Salihoglu and J. Widom. GPS: a graph processing system. In SSDBM, page 22, 2013.

[87] SFrame. https://dato.com/products/create/docs/generated/graphlab.SFrame.html.

[88] M. A. Shah, S. Madden, M. J. Franklin, and J. M. Hellerstein. Java support for data-intensive systems: Experiences building the telegraph dataflow system. SIGMOD Record, 30(4):103–114, 2001.

[89] A. Shankar, M. Arnold, and R. Bodik. JOLT: Lightweight dynamic analysis and removal of object churn. pages 127–142, 2008.

[90] B. Shao, H. Wang, and Y. Li. Trinity: a distributed graph engine on a memory cloud. In SIGMOD Conference, pages 505–516, 2013.

[91] D. E. Simmen, K. Schnaitter, J. Davis, Y. He, S. Lohariwala, A. Mysore, V. Shenoi, M. Tan, and Y. Xiao. Large-scale graph analytics in aster 6: Bringing context to big data discovery. PVLDB, 7(13):1405–1416, 2014.

[92] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE, pages 996–1005, 2010.

[93] Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, and J. McPherson. From "think like a vertex" to "think like a graph". PVLDB, 7(3):193–204, 2013.

[94] M. Tofte and J.-P. Talpin. Implementation of the typed call-by-value lambda-calculus using a stack of regions. pages 188–201, 1994.

[95] Storm: distributed and fault-tolerant realtime computation. https://github.com/nathanmarz/storm.

[96] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. YARN: yet another resource negotiator. In SoCC, page 5, 2013.

[97] R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica. GraphX: a resilient distributed graph system on spark. In GRADES, page 2, 2013.

[98] G. Xu. Finding reusable data structures. pages 1017–1034, 2012.

[99] G. Xu, M. Arnold, N. Mitchell, A. Rountev, E. Schonberg, and G. Sevitsky. Finding low-utility data structures. pages 174–186, 2010.

[100] G. Xu, M. Arnold, N. Mitchell, A. Rountev, and G. Sevitsky. Go with the flow: Profiling copies to find runtime bloat. pages 419–430, 2009.

[101] G. Xu, N. Mitchell, M. Arnold, A. Rountev, and G. Sevitsky. Software bloat analysis: Finding, removing, and preventing performance problems in modern large-scale object-oriented applications. In FSE/SDP Working Conference on the Future of Software Engineering Research (FoSER), pages 421–426, 2010.

[102] G. Xu and A. Rountev. Precise memory leak detection for Java software using container profiling. pages 151–160, 2008.

[103] G. Xu and A. Rountev. Detecting inefficiently-used containers to avoid bloat. pages 160–173, 2010.

[104] G. Xu, D. Yan, and A. Rountev. Static detection of loop-invariant data structures. pages 738–763, 2012.

[105] Yahoo! Webscope Program. http://webscope.sandbox.yahoo.com/.

[106] D. Yan, J. Cheng, K. Xing, W. Ng, and Y. Bu. Practical pregel algorithms for massive graphs. Technical Report, CUHK, 2013.

[107] S. Yang, X. Yan, B. Zong, and A. Khan. Towards effective partition management for large graphs. In SIGMOD Conference, pages 517–528, 2012.

[108] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012.

[109] D. R. Zerbino and E. Birney. Velvet: Algorithms for de novo short read assembly using de bruijn graphs. Genome Research In Genome Research, 18(5):821–829, 2008.

[110] Y. Zhang, Q. Gao, L. Gao, and C. Wang. PrIter: a distributed framework for prioritized iterative computations. In SoCC, page 13, 2011.
