Abstract Real-Time Analytics on Large Dynamic Graphs
ABSTRACT

Title of dissertation: REAL-TIME ANALYTICS ON LARGE DYNAMIC GRAPHS

Jayanta Mondal, Doctor of Philosophy, 2015

Dissertation directed by: Professor Amol Deshpande
Department of Computer Science

In today’s fast-paced and interconnected digital world, the data generated by an increasing number of applications is being modeled as dynamic graphs. The graph structure encodes relationships among data items, while the structural changes to the graphs as well as the continuous stream of information produced by the entities in these graphs make them dynamic in nature. Examples include social networks where users post status updates, images, videos, etc.; phone call networks where nodes may send text messages or place phone calls; road traffic networks where the traffic behavior of the road segments changes constantly; and so on. There is tremendous value in storing, managing, and analyzing such dynamic graphs and deriving meaningful insights in real time. However, a majority of the work in graph analytics assumes a static setting, and there is a lack of systematic study of the various dynamic scenarios, the complexity they impose on the analysis tasks, and the challenges in building efficient systems that can support such tasks at a large scale.

In this dissertation, I design a unified streaming graph data management framework, and develop prototype systems to support increasingly complex tasks on dynamic graphs. In the first part, I focus on the management and querying of distributed graph data. I develop a hybrid replication policy that monitors the read-write frequencies of the nodes to decide dynamically what data to replicate, and whether to do eager or lazy replication in order to minimize network communication and support low-latency querying. In the second part, I study parallel execution of continuous neighborhood-driven aggregates, where each node aggregates the information generated in its neighborhoods.
I build my system around the notion of an aggregation overlay graph, a pre-compiled data structure that enables sharing of partial aggregates across different queries, and also allows partial pre-computation of the aggregates to minimize the query latencies and increase throughput. Finally, I extend the framework to support continuous detection and analysis of activity-based subgraphs, where subgraphs can be specified using both graph structure and activity conditions on the nodes. The query specification tasks in my system are expressed using a set of active structural primitives, which allows the query evaluator to use a set of novel optimization techniques, thereby achieving high throughput. Overall, in this dissertation, I define and investigate a set of novel tasks on dynamic graphs, design scalable optimization techniques, build prototype systems, and show the effectiveness of the proposed techniques through extensive evaluation using large-scale real and synthetic datasets.

REAL-TIME ANALYTICS ON LARGE DYNAMIC GRAPHS

by

Jayanta Mondal

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2015

Advisory Committee:
Professor Amol Deshpande, Chair/Advisor
Professor Louiqa Raschid, Dean’s Representative
Professor V. S. Subrahmanian
Professor Héctor Corrada Bravo
Professor Tudor Dumitraș

© Copyright by Jayanta Mondal 2015

Acknowledgments

I feel greatly humbled in completing this dissertation. I owe my gratitude to all the people who have made this possible through their constant help, guidance, support, and encouragement.

Foremost, I want to convey my heartfelt appreciation to my advisor, Dr. Amol Deshpande. It is only due to his patience and support that I have successfully achieved this landmark. He gave me the freedom to explore on my own, and at the same time the guidance to stay on the right track.
His advice on how to approach problems in a structured fashion and express ideas with clarity has helped me mature as a researcher. I feel very fortunate to have him as my advisor.

Additionally, I would like to acknowledge my committee members, V.S. Subrahmanian, Hector Corrada Bravo, Tudor Dumitras, and Louiqa Raschid, for their constructive feedback and critiques. Special thanks to Tudor for the wonderful time I had collaborating with him. His guidance has helped me garner a lot of knowledge outside my core area of expertise.

I would also like to thank my mentor Dr. Sudipto Das at Microsoft for his guidance during my internships. Working with him has helped me broaden my vision about database research and gain valuable insights about doing research in industry.

I would like to take this opportunity to thank the computer science department staff, Jenny Story, Fatima Bangura, Brenda Chick, Sharron McElroy, Jodie Gray, and Adelaide Findlay, for making my life much simpler by taking care of the dreaded administrative matters.

Life is nothing without friends. I will cherish the time spent with my grad school friends Rajan, Udayan, Hui, Ashwin, Theodoros, Abdul, Souvik, Amit, Rajesh, Varun, Rajibul, Anirban, Shrin. All the technical and non-technical conversations we had, the meals we shared, the trips to Board and Brew will always be in my memory.

This acknowledgement would not be complete without a shout-out to my long-time friends Suman, Sahani, Pravuda, Anushri, Ritwik, Arnab, Kaushani, Moumita, Nagda, Rajarshi, Dibyajyoti, Shaoni. Thank you guys for enriching my life with all the love, memories, conversations, food, and adventures. Special thanks to my girlfriend Jineta for being a great friend, critic, and guide over the years.

Above all, I want to thank my parents Sukla and Jaychand and my brother Joydip, for their selfless love and support. Everything I have achieved in my life, I owe every bit to them.
Table of Contents

List of Tables
List of Figures

1 Introduction
  1.1 Data Components of Dynamic Graphs
    1.1.1 Network Component
    1.1.2 Stream Component
    1.1.3 Dependency between Network and Stream Components
  1.2 Task Categories
    1.2.1 Large Number of Independent Local Queries
    1.2.2 Global Analytics over Content-dynamic Graphs
    1.2.3 Event/pattern Detection on Dynamic Graphs
    1.2.4 Managing a Large Collection of Small Graphs
  1.3 Task/Query Model
    1.3.1 Query Issuer
    1.3.2 Query Scope
      1.3.2.1 Network Traversal Scope
      1.3.2.2 Temporal Scope
    1.3.3 Answering Model
  1.4 System Architecture and Execution Model
  1.5 Overview of the Dissertation Research
    1.5.1 Distributed Graph Database with Adaptive Replication
    1.5.2 Ego-centric Aggregation Framework for Large Dynamic Graphs
    1.5.3 A System for Continuous Analytics of Activity-based Subgraphs
  1.6 Organization of the Dissertation

2 Related Work
  2.1 Systems and Techniques for Managing Large Graphs
  2.2 Data Stream Management
  2.3 Graph Query Languages
  2.4 Subgraph Pattern Matching on Dynamic Graphs

3 Distributed Graph Data Management with Adaptive Replication
  3.1 Overview
    3.1.1 Data and Query Model
    3.1.2 Architecture
    3.1.3 Trade-offs and Requirements
  3.2 Replication Manager
    3.2.1 Overview
    3.2.2 Monitoring Access Patterns
    3.2.3 Clustering
  3.3 Making Replication Decisions
    3.3.1 Problem Definition
    3.3.2 Analysis
    3.3.3 Proposed Algorithm
  3.4 Evaluation
    3.4.1 Dataset
    3.4.2 Experimental Setup
    3.4.3 Evaluation Metric
    3.4.4 Results

4 Ego-centric Aggregation on Large Dynamic Graphs
  4.1 Overview
    4.1.1 Data and Query Model
    4.1.2 Proposed Aggregation Framework
      4.1.2.1 Aggregation Overlay Graph
      4.1.2.2 Execution Model
      4.1.2.3 User-defined Aggregate API
  4.2 Constructing The Overlay
    4.2.1 Preliminaries
    4.2.2 Overlay Construction Algorithms
    4.2.3 Handling Dynamic Changes
  4.3 Making Dataflow Decisions
  4.4 Evaluation
    4.4.1 Experimental Setup
    4.4.2 Overlay Construction
    4.4.3 Dataflow Decisions
    4.4.4 Throughput Comparison

5 Activity-based Subgraph Pattern Matching Queries
  5.1 CASQD Overview
    5.1.1 Background
    5.1.2 Data Model
    5.1.3 Query Model
      5.1.3.1 Specifying Active Subgraph Pattern
      5.1.3.2 Analytical Task
    5.1.4 Answer Reporting Model
    5.1.5 Computation Model
    5.1.6 Correctness Model
      5.1.6.1 Correctness in Single-threaded Model
      5.1.6.2 Correctness in Multi-threaded Model
  5.2 Execution Strategies
    5.2.1 Node Activity-based Execution (TAP)
    5.2.2 Neighbors’ Activity-based Execution (TAN)
    5.2.3 Model-based Execution (TMB)
  5.3 Exploration Algorithms
    5.3.1 Star Primitive
    5.3.2 Clique Primitive
    5.3.3 Biclique Primitive
  5.4 System Architecture
    5.4.1 In-Memory Data Structures
    5.4.2 Controllers
  5.5 Evaluation
    5.5.1 Experimental Setup
    5.5.2 Execution algorithms
      5.5.2.1 Varying activity threshold
      5.5.2.2 Varying pattern size
    5.5.3 Other Experiments

6 Conclusion
  6.1 Insights
  6.2 Future Directions
    6.2.1 Context-aware Stream Processing
      6.2.1.1 Supporting Composition of the Active Primitives
      6.2.1.2 Approximate Identification for Active Primitives