ABSTRACT

Title of dissertation: HISTORICAL GRAPH DATA MANAGEMENT

Udayan Khurana, Doctor of Philosophy, 2015

Dissertation directed by: Professor Amol Deshpande, Department of Computer Science

Over the last decade, we have witnessed an increasing interest in temporal analysis of information networks such as social networks or citation networks. Finding temporal interaction patterns, visualizing the evolution of graph properties, or even simply comparing them across time, has proven to add significant value in reasoning over networks.

However, because of the lack of underlying data management support, much of the work on large-scale graph analytics to date has focused on the study of static properties of graph snapshots. Unfortunately, a static view of interactions between entities is often an oversimplification of several complex phenomena like the spread of epidemics, information diffusion, formation of online communities, and so on. In the absence of appropriate support, an analyst today has to manually navigate the added temporal complexity of large evolving graphs, making the process cumbersome and ineffective.

In this dissertation, I address the key challenges in storing, retrieving, and analyzing large historical graphs. In the first part, I present DeltaGraph, a novel, extensible, highly tunable, and distributed hierarchical index structure that enables compact recording of the historical information, and that supports efficient retrieval of historical graph snapshots. I present analytical models for estimating required storage space and snapshot retrieval times which aid in choosing the right parameters for a specific scenario. I also present optimizations such as partial materialization and columnar storage to speed up snapshot retrieval. In the second part, I present Temporal Graph Index that builds upon DeltaGraph to support version-centric retrieval such as a node’s 1-hop neighborhood history, along with snapshot reconstruction. It provides high scalability, employing careful partitioning, distribution, and replication strategies that effectively deal with temporal and topological skew, typical of temporal graph datasets. In the last part of the dissertation, I present

the Temporal Graph Analysis Framework, which enables analysts to effectively express a variety of complex historical graph analysis tasks using a set of novel temporal graph operators, and to execute them in an efficient and scalable manner in the cloud. My proposed solutions are engineered in the form of a framework called the Historical Graph Store, designed to facilitate a wide variety of large-scale historical graph analysis tasks.

Historical Graph Data Management

by

Udayan Khurana

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2015

Advisory Committee:

Professor Amol Deshpande, Chair/Advisor
Professor Jimmy Lin, Dean’s Representative
Professor Tudor Dumitraş, Committee Member
Dr. Catherine Plaisant, Committee Member
Professor Mihai Pop, Committee Member

© Copyright by Udayan Khurana 2015

Dedication

To my beloved family and friends.

To all the data that my code crunched.

To all the moments that didn’t count in the end.

To the hope of a happier planet.

Acknowledgments

My ability to write this dissertation has been in part due to the help, guidance and support from several people. I am grateful to all of them for being a part of my life.

Foremost, I want to sincerely thank my advisor Amol Deshpande. His guidance over the course of graduate school greatly helped me in all aspects of my doctoral research.

Working with him continuously raised my expectations, which was instrumental in the growth of my skills as a researcher. Also, his patience on several occasions helped me work my way through tough times.

I also acknowledge my defense committee members Jimmy Lin, Catherine Plaisant, Mihai Pop and Tudor Dumitraş for their constructive suggestions and critiques. My discussions with Ben Shneiderman and Nick Roussopoulos over the years were also fruitful in several ways. I am thankful to the department staff, Jenny Story and Fatima Bangura in particular, for their help in all things bureaucratic. My internships at IBM helped me gain a different perspective on research, and for that I would like to thank my mentors Srinivasan Parthasarathy and Deepak Turaga. I cherish the time spent with my database lab-mates Ashwin, Theodoros, Jayanta, Waala, Abdul, Hui, Souvik and Amit. The jokes, caricatures and conversations we shared, apart from all the help in the form of critiques, motivation and suggestions, will be thoroughly missed.

During the long time I spent in Maryland and Washington DC, I met many people who touched my life and made this place feel like home. In particular, I feel fortunate to have met Alice, Hong, Abhishek, Kostas, Sheetal, Sushant, Nakul, Pang, Parul, Kleoniki, Nof and Vikas. I don’t want to miss a shout out to those who were miles or oceans away, but always cared – Sumeet, Mubin, Medha, Joseph, Sandy and Anand.

I want to thank my brother Sukant for always inspiring me with his passion for science. Above all, I want to thank my parents Sarita and Vinod, for being there for me, unconditionally and endlessly. I hope I succeed in imbibing a tiny fraction of the academic quality and personal integrity I see in both of them.

Table of Contents

List of Figures

List of Tables

1 Introduction
  1.1 Examples of Temporal Networks and Analysis
  1.2 Challenges
    1.2.1 Storing and fetching temporal graphs
    1.2.2 Scalable Analytics
  1.3 Contributions
    1.3.1 DeltaGraph: Reconstructing Past States of a Graph
    1.3.2 TGI: Generalized Temporal Graph Indexing at Scale
    1.3.3 Temporal Graph Analytics
  1.4 Graph Data Model
  1.5 Organization of the dissertation

2 Background
  2.1 Temporal Graph Analysis
    2.1.1 Dynamic Social Network Analysis
    2.1.2 Temporal Network Visualization
  2.2 Temporal Data Management
    2.2.1 Temporal Queries and Storage
    2.2.2 Version Management
  2.3 Graph Data Management
    2.3.1 Graph Storage
    2.3.2 Graph Query and Processing Frameworks
    2.3.3 Temporal Graph Data Stores

3 Historical Graph Snapshot Retrieval
  3.1 Introduction
  3.2 Background
  3.3 DeltaGraph
    3.3.1 Anatomy of DeltaGraph
    3.3.2 Snapshot Queries
    3.3.3 Accessibility
  3.4 Snapshot Retrieval
    3.4.1 Single-point Snapshot Retrieval
    3.4.2 Multi-point Snapshot Retrieval
    3.4.3 Composite Time-expression Graph Retrieval
    3.4.4 Speeding up Snapshot Queries using Prefetching
    3.4.5 Extensibility
  3.5 DeltaGraph Analysis
    3.5.1 Model of Graph Dynamics
    3.5.2 Differential Functions
    3.5.3 Space and Time Estimation Models
    3.5.4 Discussion
  3.6 Construction and Maintenance
    3.6.1 Batch Construction
    3.6.2 Configuring DeltaGraph
    3.6.3 Updates
      3.6.3.1 Preliminaries
      3.6.3.2 Update Procedure
  3.7 Experiments
  3.8 Conclusion

4 Generalized Historical Graph Storage in the Cloud
  4.1 Introduction
  4.2 Temporal Graph Index
    4.2.1 Preliminaries
    4.2.2 Prior Techniques
    4.2.3 TGI Description
    4.2.4 Scalable TGI design
    4.2.5 Partitioning and Replication
    4.2.6 Dynamic Graph Partitioning
    4.2.7 Fetching Graph Primitives
  4.3 Evaluation
  4.4 Conclusion

5 Temporal Graph Analytics
  5.1 Introduction
    5.1.1 Contributions
  5.2 Background
  5.3 Memory Efficient Loading of Multiple Versions
    5.3.1 GraphPool
    5.3.2 Clean-up of a graph from memory
    5.3.3 Visual Analytics on Historical Graph Snapshots
      5.3.3.1 Features
    5.3.4 Experiments
  5.4 Temporal Analytics Framework (TAF)
    5.4.1 Temporal Graph Analysis Library
    5.4.2 System Implementation
    5.4.3 Experiments
  5.5 Summary

6 Conclusion
  6.1 Insights
  6.2 Future Directions
    6.2.1 Unified Temporal Graph Processing Framework
    6.2.2 Discovering Graphs of Interest
    6.2.3 A Framework for Finding Temporal Patterns in Evolving Graphs
    6.2.4 Shared Computation Across Graph Versions
    6.2.5 Estimating Changes in Graph Properties

Bibliography

List of Figures

1.1 Dynamic network analysis (e.g., understanding how “communities” evolve in a social network, how centrality scores change in co-authorship networks, etc.) can lend important insights into social, cultural, and natural phenomena.
1.2 The plot was constructed using our system over the DBLP network, and shows the evolution of the nodes ranked in top 25 in 2004.
1.3 Performing a snapshot-based evolutionary analysis with and without the support for efficient snapshot retrieval.
1.4 A temporal graph can be represented across two different dimensions - time and entity. The chart lists retrieval tasks, graph operations, and example queries at different granularities of time and entity size.
1.5 Historical Graph Store.
3.1 Temporal relational database indexing techniques.
3.2 Interval tree is a hierarchical data structure that indexes intervals and can be potentially used to perform snapshot retrieval queries on temporal datasets.
3.3 DeltaGraphs with 4 leaves, leaf-eventlist size L, arity 2. ∆(Si, Sj) denotes the delta needed to construct Si from Sj.
3.4 The (Java) code snippet shows an example program that retrieves several graphs, and operates upon them.
3.5 Example plans for singlepoint and multipoint retrieval on the DeltaGraph shown in Figure 3.3(a).
3.6 Two different options for retrieving difference snapshot queries, t1 − t2 or t2 − t1.
3.7 Query expression tree for t1 ∩ t2 − t3 ∩ t4.
3.8 The extensibility framework of DeltaGraph is meant for users or programmers to execute specific historical queries efficiently by only supplying a logic through user defined functions while the framework takes care of the temporal aspects of the implementation.
3.9 Different valid DeltaGraphs for input E, configurations, df = f(); k = 3; l. All of them have the same number of nodes at each corresponding level, but the structure is different.
3.10 Different Packed DeltaGraphs
3.11 Summary of Dataset 1 over time.
3.12 Summary of Dataset 4 over time.
3.13 Comparing DeltaGraph and Copy+Log. Int and Bal denote the Intersection and Balanced functions, respectively.
3.14 Performance of different DeltaGraph configurations vs. Interval Tree for Dataset 2.
3.15 Benefits due to multi-core parallelism for snapshot retrieval on a single machine with single disk.
3.16 Multipoint query execution vs multiple singlepoint queries.
3.17 Retrieval with and without attributes (Dataset 2)
3.18 Effect of varying arity (Dataset 1)
3.19 Effect of varying eventlist sizes (Dataset 1)
3.20 Effect of varying eventlist sizes on snapshot retrieval times (Dataset 4)
3.21 Snapshot retrieval performance on a DeltaGraph with different eventlist sizes for different time periods (Dataset 2).
3.22 Comparison of snapshot retrieval performance for differential functions on Dataset 1. (a) Intersection and Balanced; (b) Different Mixed function configurations.
3.23 DeltaGraph snapshot retrieval using single machine, local configurations of Kyotocabinet and Cassandra (Dataset 4).
4.1 Temporal Graph Index representation.
4.2 The graph history is divided into non-overlapping periods of time. Such division is based on time intervals after which the locality-based graph partitioning is updated. It is also used as a partial key for data chunking and placement.
4.3 Graph partitioning using min-cut strategy along with 1-hop replication and the use of auxiliary micro-deltas improves 1-hop neighborhood retrieval performance without affecting the performance of snapshot or node retrieval.
4.4 Snapshot retrieval times for varying parallel fetch factor (c), on Dataset 1; m = 4; r = 1, ps = 500.
4.5 Snapshot retrieval times for Dataset 4; m = 6; r = 1, c = 1, ps = 500.
4.6 Snapshot retrieval times across different m and r values on Dataset 1.
4.7 Snapshot retrieval across various parameters.
4.8 Node version retrieval across various parameters for Dataset 1.
4.9 Node version retrieval for Dataset 4; m = 6; r = 1, c = 1, ps = 500.
4.10 Experiments on partitioning type and replication; growing data size.
5.1 (a) Graphs at times tcurrent, t1 and t2; (b) GraphPool consisting of overlaid graphs; (c) GraphID-Bit Mapping Table
5.2 System Architecture: HiNGE, DeltaGraph and GraphPool.
5.3 Temporal Exploration using time-slider
5.4 Temporal Evolution of node 12 in green over three points in time: Evolution of node’s neighborhood and PageRank.
5.5 (a) Node Search (b) Subgraph Pattern Search
5.6 Cumulative GraphPool memory consumption.
5.7 Effect of materialization
5.8 SoN: A set of nodes can be abstracted as a 3 dimensional array with temporal, node and attribute dimensions.
5.9 Examples of analytics using the TAF Python API.
5.10 Incremental computation using different methods. Both examples compute counts of nodes with a specific label in subgraphs over time.
5.11 Pattern matching in temporal subgraphs.
5.12 Using the optional timepoint specification function with evolution and comparison queries.
5.13 A pictorial representation of the parallel fetch operation between the TGI cluster and the analytics framework cluster. The numbers in circles indicate the relative order and the arrowheads indicate the direction of information or data flow.
5.14 TAF computation times for Local Clustering Coefficient on varying graph sizes (N=node count) using different cluster sizes.
5.15 Label counting in several 2-hop neighborhoods through version (NodeComputeTemporal) and incremental (NodeComputeDelta) computation, respectively. We report cumulative time taken (excluding fetch time) over varying version counts; 2 Spark workers were used for Wikipedia dataset.

List of Tables

1.1 Examples of temporal networks.
3.1 Options for node attribute retrieval. Similar options exist for edge attribute retrieval.
3.2 A list of Differential Functions
4.1 Comparison of access costs for different retrieval tasks (snapshot, static vertex, vertex versions, 1-hop neighborhoods and 1-hop neighborhood versions) and index storage on various temporal indexes (Log, Copy, Copy+Log, Node-centric and DeltaGraph).

Chapter 1: Introduction

In recent years, we have witnessed an increasing abundance of observational data describing various types of information networks, including social networks, biological networks, citation networks, financial transaction networks, and communication networks, to name a few. There is much work on analyzing such networks to understand various social and natural phenomena like: “how the entities in a network interact”, “how information spreads”, “what are the most important (central) entities”, “what are the key building blocks of a network”, and so on. Such networks, which model the interconnections and interactions between real-world objects, are mathematically represented as graphs¹.

Analysis of networks is extensively based on graph theory, which provides a wide range of mathematical tools for the study of properties of graphs. Graph theory covers problems such as finding the shortest path between two nodes, determining cliques and motifs, measuring network flow between two points, performing graph coloring, determining reachability, and many others [37, 137]; it also describes metrics such as graph density, diameter, clustering coefficient, and PageRank, to characterize the state of a graph. Graph mining applies the principles and methods of graph theory for useful discovery and prediction through pattern search, anomaly detection, dense subgraph detection, classification, etc. It finds popular application in several domains such as biology, physics, transportation engineering, targeted advertising, drug discovery and cyber security. Aggarwal et al. [19] and Chakrabarti et al. [41] are good references for a detailed coverage of techniques and applications of graph mining.

¹Here on, we use the terms graph and network interchangeably.

With the increasing availability of the digital trace of such networks over time, the topic of network analysis has naturally extended its scope to temporal analysis of networks, which has the potential to lend much better insights into various phenomena, especially those relating to the temporal or evolutionary aspects of the network. For example, we may want to know: “which analytical model best captures network evolution”, “how information spreads over time”, “who are the people with the steepest increase in centrality measures over a period of time”, “what is the average monthly density of the network since 1997”, “how the clusters in the network evolve over time”, etc. Historical queries like, “who had the highest PageRank centrality in a citation network in 1960”, “which year amongst 2001 and 2004 had the smallest network diameter”, and “how many new triangles have been formed in the network over the last year”, also involve the temporal aspect of the network. More generally, a network analyst may want to process the historical² trace of a network in different, usually unpredictable, ways to gain insights into various phenomena. Efforts to perform systematic analysis and visual exploration of such networks over time have prompted work on taxonomies and attempts to define processes of analysis for temporal networks [21, 72].

Numerous graph data management systems have been developed over the last decade, including specialized graph database systems like Neo4j [10] and Titan [15], amongst several others, and large-scale graph processing frameworks such as Pregel [90], Giraph, GraphLab [88], and GraphX [56]. However, much of the work to date, especially on cloud-scale graph data management systems, focuses on managing and analyzing a single (typically, current) static snapshot of the data. In the real world, however, interactions are a dynamic affair and any graph that abstracts a real-world process changes over time. For instance, in online social media, the friendship network on Facebook or the “follows” network on Twitter change steadily over time, whereas the “mentions” or the “retweet” networks change much more rapidly. Dynamic cellular networks in biology, evolving citation networks in publications, and dynamic financial transaction networks are a few other examples of such data. There is recent work on streaming analytics over dynamic graph data [45, 39], but that work typically focuses on analyzing only the recent activity in the network (typically over a sliding window).

²We use the terms temporal graph and historical graph interchangeably, referring to the chronology of a network’s activity from the past, and possibly ongoing at the present. We apply the same lack of distinction between historical data management and temporal data management, unless otherwise specified. However, we refer to historical queries as a subset of temporal queries; the former are static graph queries in the context of a snapshot from the past, whereas the latter are a wider variety including evolution, comparison, and historical queries, amongst many others.

In this work, our focus is on providing the ability to analyze and to reason over the entire history of the changes to a graph. There are many different kinds of relevant historical analyses. For example, an analyst may wish to study the evolution of well-studied static graph properties such as centrality measures, density, conductance, etc., over time. Another approach is through the search and discovery of temporal patterns, where the events that constitute the pattern are spread out over time. Comparative analysis, such as the juxtaposition of a statistic over time, or perhaps computing aggregates such as max or mean over time, gives another style of knowledge discovery into temporal graphs. Most of all, the primitive notion of simply being able to access past states of the graph and perform simple static graph analysis on them empowers a data scientist with the capacity to perform analysis in arbitrary and unconventional patterns.

Supporting such a diverse set of temporal analytics and querying over large volumes of historical graph data requires addressing several data management challenges. Specifically, we must develop techniques for storing the historical information in a compact manner, while allowing a user to retrieve graph snapshots as of any time point in the past or the evolution history of a specific node or a specific neighborhood. Further, the data must be stored and queried in a distributed fashion to handle the increasing scale of the data. We must also develop an expressive, high-level, easy-to-use programming framework that will allow users to specify complex temporal graph analysis tasks, while ensuring that the specified tasks can be executed efficiently in a data-parallel fashion across a cluster. In this chapter, we proceed by providing a few examples of temporal network analytics from different domains (Section 1.1), followed by the motivation for the work presented in this dissertation and the challenges in providing effective data management solutions for temporal network analytics (Section 1.2), the contributions of this dissertation (Section 1.3), the data model for time-evolving graphs (Section 1.4), and finally, an outline of the rest of the dissertation (Section 1.5).

1.1 Examples of Temporal Networks and Analysis

Table 1.1 lists examples of different kinds of temporal networks, such as human interaction networks, biological networks, and others. The temporal component of the network can be based on a change in the relationship between two entities, a change in the degree of a relationship, or a change in the existence of an entity itself. We present some specific instances of temporal network analysis in a variety of application domains below:

• Organizational Sociology: Sociologists have long studied inter-organizational alliances using dynamic network models. For instance, Gulati et al. [60] model the likelihood of a future alliance between industrial organizations, or the stability of an existing one, based on dynamic interconnections in a larger network formed over time. Factors such as mutual alliances and combined centrality over time are important in deciding the strategic search for alliances.

• International Finance: A study of the patterns of flow of Foreign Direct Investment (FDI) in Taiwanese electronic firms [44] shows how certain small firms successfully internationalized themselves by a gradual process of exploiting factors in their networks. The study shows common temporal trends across different firms with respect to network activity, including the strategic growth of their respective networks.

• Epidemiology: Network modeling has been extensively used in the study and prevention of epidemics as a basis for capturing spatial proximity. In fact, the birth of modern epidemiology is credited to John Snow’s investigation of the 1854 cholera epidemic of London. Snow [121] elaborated the role of the water supply network in the city with empirical evidence of the locality and spread of the disease, leading to the discovery that cholera is water-borne. In recent years, there has been a considerable focus on the temporal aspect of networks in epidemiology, in two different ways. The first is the dynamic nature of networks, i.e., structural changes over time. The other is epidemic dynamics on network structures, i.e., the propagation of infections. Gross et al. [59] present a model for the spread of infection based upon a changing model of a people network. It exhibits the change of the epidemic threshold under different rates of rewiring of links in a simple graph model with a constant number of nodes and links. It shows how the adaptive, dynamic nature of a physical contact network (along with infection probability) can lead to the state of the epidemic - healthy, oscillatory, bistable or endemic. There have been several other models that explain epidemic growth based on a random dynamic network [128, 134].

Figure 1.1: Dynamic network analysis (e.g., understanding how “communities” evolve in a social network, how centrality scores change in co-authorship networks, etc.) can lend important insights into social, cultural, and natural phenomena.

• Cellular Networks: Analysis of social group or community evolution determines crucial characteristics about the community itself. By studying temporal overlap in community structures on scientific collaboration networks and cellular call networks, Palla et al. [101] report that large communities with more dynamic membership last longer than the ones with more rigid membership patterns. However, in the case of smaller communities, the ones with less adaptive membership last longer. Also, their analysis provides a correlation between the time commitment of members towards a community and the longevity of the community itself.

Figure 1.2: The plot was constructed using our system over the DBLP network, and shows the evolution of the nodes ranked in top 25 in 2004.

Category | Source | Entity | Relationship | Temporal Aspect
Communication | Email | Person | Email sent | Timestamp of email sent
Communication | Facebook | Person | Friendship | Start (and end) of friendship
Communication | Twitter | Person | Retweet | Retweet timestamp
Communication | Telephone | Person | Conversation | Call duration
Biological | Interactome | Protein molecule | P-P interaction | Order of regulatory or physical relationship
Biological | Metabolism | Enzyme, Substrate | Biochemical reaction | Activity duration
Biological | Cell Signaling Pathway | Protein, Lipid, etc. | Activation, inhibition | Reaction time or order
Biological | Neural | Neuron | Synapse | Spiking / information flow
Transportation | Road map | Intersection | Road segment | Traffic delays
Transportation | Postal routes | Post-office, destination | Transfer | Transfer time; load variance; frequency
Financial transactional | Bank | Account | Transfer | Time of transfer
Financial transactional | Retail | Merchant, customer | Payment | Purchase or payment time
Computer | WWW | Webpage | Hyperlink | Editing of references
Computer | P2P/Cluster | Computer | Client-server | Transfer time

Table 1.1: Examples of temporal networks.

• Scientific Collaboration: Temporal network analysis of the history of bibliographic datasets provides interesting insights into the evolution of collaborative networks and areas of research, amongst other things. Erten et al. [50] analyzed a subset of the ACM’s digital library to find interesting facts such as: while the typical size of the giant connected component in most domains of science is 80-90% of the collaborative graph, in computing it is just 49%; the average number of co-authors in mathematics has only grown from 1 in 1935 to 1.5 in 1995, exhibiting slower growth than other fields; in computing, since the mid-nineties, there has been a steady rise of collaboration networks on topics like data encryption, database management, and pattern recognition, but a decline for topics like data structures and symbolic and algebraic manipulation.

• Biology: Several biological processes involving interactions between entities are being increasingly modeled as networks. In functional genomics, the classical view of protein functioning focused on the actions of a single protein molecule; with recent advances in the field, the “post-genomic” view of protein functioning, focusing on the protein as a component of a network of interacting proteins, has gained prominence [49]. Przytycka et al. [105] provide an overview of the recent progress in dynamic protein network modeling and interesting experimental observations. Other areas involving a shift of paradigm towards dynamic networks in biology include cell signaling pathways, gene regulation networks, protein substrate networks and several others. All biological processes are dynamic and involve ongoing change of entity states and their interactions. Analysis of temporal network activity benefits the study and discovery of various biochemical phenomena. For instance, Han et al. [61] study the role of hubs in yeast protein-protein interaction networks for different cellular properties such as the robustness of the network. Given the temporal interaction patterns, i.e., whether a hub interacts with other proteins at the same time or at different times, the hubs are classified into two different categories. Each of the two categories is found to contribute towards different functions in the protein-protein interaction network.

Taylor et al. [127] show that a dynamic model of the human protein interaction network (interactome) helps predict breast cancer outcome. They show that the biochemical changes, observed through the changing structure of the interactome, convey the relevant information about the disease outcome, which may not be possible in the case of a static protein interaction network.

• Online Social Networks: Online social networks such as Facebook, Twitter, Digg, Flickr and others are useful in studying the nature of several aspects of information diffusion, as their digital trace is recorded and can be subsequently analyzed. Lerman et al. [84] study the spread of interest in news stories by analyzing the patterns of voting for a story on Digg and retweeting on Twitter. Bakshy et al. [28] study the role of structural components, such as a few strong links compared to many weak links, in the spread of a story on Facebook; they also discuss the differences in the patterns of sharing a story between those who have been exposed to it by their friends and those who haven’t been. In both of these works, temporal clustering and aggregation were performed along with structural network analysis, in order to study the flow or spread of information.

1.2 Challenges

In the previous section, we saw that temporal network modeling and analysis finds utility across a wide variety of domains. In order to support a broad range of network analysis tasks, we require a graph data management system at the backend capable of low-cost storage and efficient retrieval of the historical network information, in addition to providing the correct framework for expressing and executing analytical queries, temporal or otherwise. However, the existing solutions for graph data management lack adequate techniques for temporal annotation, or for storage and retrieval of large-scale historical changes to the graph. Hence, it is challenging, often prohibitively so, to perform historical graph analysis at scale due to the lack of supporting temporal graph management techniques.

For instance, a typical analysis of an evolving graph property through snapshot-based computation over different time points would require access to several past states of the graph. Figure 1.3(a) illustrates the (simplified) process as of today, where, in the absence of a temporal graph store, the analyst is responsible for iteratively filtering the temporal graph data to extract relevant snapshot subsets and ingest them into a “static graph store”. This is both burdensome for the analyst and time-consuming, due to repetitive, often overlapping, ingestion of data. In contrast, while using a “temporal graph store”, the analyst gains access to snapshots promptly on the call of a function, as illustrated in Figure 1.3(b), saving effort and time in the analysis process.

Figure 1.3: Performing a snapshot-based evolutionary analysis with and without support for efficient snapshot retrieval. (a) Analysis process using a static graph store; (b) analysis process using a temporal graph store.
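As a concrete illustration of the loop in Figure 1.3(b), the following minimal Python sketch computes the evolution of a graph statistic over a series of timepoints. The store.get_snapshot interface and the graph object's methods are hypothetical stand-ins for whatever snapshot-retrieval call a temporal graph store exposes; this is not the actual API of the system described later.

    # Hypothetical snapshot-based evolutionary analysis loop (illustrative only).
    # 'store' is assumed to expose get_snapshot(t) -> a static graph object with
    # node_count() and edge_count(); a real system's API will differ.

    def density(graph):
        n = graph.node_count()
        m = graph.edge_count()
        return 0.0 if n < 2 else 2.0 * m / (n * (n - 1))

    def evolution_of_density(store, timepoints):
        series = []
        for t in timepoints:                  # repeat for t in {t1, t2, ...}
            snapshot = store.get_snapshot(t)  # fetched directly from the temporal index
            series.append((t, density(snapshot)))
        return series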

The work presented in this dissertation is motivated by the broad goal of enabling an analyst³ to perform a wide variety of temporal graph data analytics for large evolving networks. Specifically, we found the following aspects of historical graph data management to be essential abstractions that needed efficient solutions - (a) compact storage of large graph histories, (b) efficient query-time reconstruction of past states of the graph or version histories of nodes or neighborhoods, (c) being able to systematically express a variety of temporal graph analysis tasks, hidden from the physical handling of the temporal aspects of the data, and (d) a scalable and elastic platform to run such analysis tasks.

³Analysts or domain experts are typically not equipped with computational skills such as those related to big data management.

Next in this section, we discuss some of the challenges in building an effective historical graph datastore. We first talk about the challenging aspects related to storage and retrieval, followed by issues in performing analytical tasks.

1.2.1 Storing and fetching temporal graphs

The fundamental challenge in storing large evolving graphs is to deal with the complexities of temporal change and the large size of the graphs, taken together. The primary objective for indexing graph histories is to be able to efficiently fetch (sub-)graph primitives such as snapshots and node or neighborhood versions while processing analytical queries. Let us first briefly discuss some of the characteristics of static graph storage and temporal indexing, separately as well as when put together.

• Compact storage versus fast access:

The goal of temporal indexing is twofold. Firstly, it is desired that the storage be compact, i.e., it should incur only a small amount of redundancy, at worst. Secondly, the index should provide fast reconstruction of graph primitives at query time, such as past snapshots, or versions of nodes or neighborhoods. In the known literature on temporal indexing, we see a natural trade-off between compactness and reconstruction efficiency. While a detailed survey of these techniques is presented in Chapter 2 and Chapter 3, two contrasting approaches highlight this duality. First is a “Log” approach, which stores all the changes to a dataset in chronological order. Without redundancy, this approach occupies minimal space. However, reconstruction of, let us say, a past state of the dataset (a snapshot) is expensive, because there is only chronological access to any state of the dataset, and no direct access. This implies that all changes that occurred to the dataset, say attribute values, which are no longer reflected in the snapshot of interest, are information that must be read even though it is not required to construct the snapshot. On the other hand, a “Copy” approach maintains a distinct copy of the dataset at each change point. This provides direct access to a past snapshot with optimal resource consumption. However, the amount of space consumed is very high, because an object that doesn’t change across versions is stored multiple times. Conquering this trade-off is the foremost challenge in temporal indexing. We state this challenge as, “designing a temporal index structure whose storage consumption is comparable to the Log approach and whose access efficiency is comparable to the Copy approach.”
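To make the two extremes concrete, the following Python sketch (illustrative only; the class and method names are assumptions, not part of any system described in this dissertation) contrasts snapshot reconstruction under the Log approach with direct lookup under the Copy approach.

    import bisect
    from copy import deepcopy

    class LogStore:
        """Log approach: every change is stored exactly once, in chronological order."""
        def __init__(self):
            self.events = []                     # (timestamp, kind, payload), in time order

        def append(self, t, kind, payload):
            self.events.append((t, kind, payload))

        def snapshot(self, t):
            """Replay all events up to t: compact storage, but retrieval reads the whole prefix."""
            graph = {"nodes": set(), "edges": set()}
            for ts, kind, payload in self.events:
                if ts > t:
                    break
                if kind == "add_node":
                    graph["nodes"].add(payload)
                elif kind == "add_edge":
                    graph["edges"].add(payload)
                elif kind == "del_edge":
                    graph["edges"].discard(payload)
            return graph

    class CopyStore:
        """Copy approach: a full materialized copy at every change point."""
        def __init__(self):
            self.times = []                      # sorted change points
            self.copies = []                     # one full graph copy per change point

        def record(self, t, graph):
            self.times.append(t)
            self.copies.append(deepcopy(graph))  # heavy redundancy across versions

        def snapshot(self, t):
            """Direct lookup of the latest copy at or before t: fast access, large storage."""
            i = bisect.bisect_right(self.times, t) - 1
            return self.copies[i] if i >= 0 else {"nodes": set(), "edges": set()}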

• Time-centric versus entity-centric indexing:

When put together in the same context, temporal data indexing and graph indexing pose further challenges. A temporal graph dataset is based around two dimensions, i.e., time and structural entity, where the latter represents nodes or edges. If we were to use a storage structure that is primarily indexed by time, such as the Copy approach, it is efficient to fetch (snapshots or subsets of) states of the dataset at a desired point in time. However, if the query is based around the history of a node, an index based on direct access to states of the dataset is not very helpful. Using it as such implies evaluating all states of the dataset, rendering it highly inefficient, much worse than the Log approach. Alternatively, consider an entity-based temporal index, which separately stores the changes for each entity in chronological order. While it is suitable for a “node version” type of retrieval, it becomes highly inefficient to use it for snapshot retrieval or even neighborhood retrieval. Such a duality of the primacy of time versus the entities in indexing is another fundamental challenge to the problem we address.

Figure 1.4: A temporal graph can be represented across two different dimensions - time and entity. The chart lists retrieval tasks (snapshot, multipoint snapshot, subgraph, subgraph versions, static vertex, vertex history), graph operations, and example queries at different granularities of time and entity size.
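The entity-centric side of this duality can be sketched similarly (again an illustrative toy with assumed names): a per-node version chain answers node-history queries by touching a single chain, but assembling a full snapshot forces a pass over every chain.

    from collections import defaultdict

    class EntityCentricIndex:
        """Entity-centric temporal index: one chronological change list per node."""
        def __init__(self):
            self.chains = defaultdict(list)      # node_id -> [(timestamp, change), ...]

        def record(self, node_id, t, change):
            self.chains[node_id].append((t, change))
            self.chains[node_id].sort(key=lambda e: e[0])

        def node_history(self, node_id, t_start, t_end):
            """Efficient: only this node's chain is examined."""
            return [(ts, c) for ts, c in self.chains[node_id] if t_start <= ts <= t_end]

        def snapshot(self, t):
            """Inefficient: every node's chain must be consulted to assemble one time point."""
            state = {}
            for node_id, chain in self.chains.items():
                applicable = [c for ts, c in chain if ts <= t]
                if applicable:
                    state[node_id] = applicable[-1]   # latest change wins
            return state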

• Coping with changing topology in a dynamic graph:

Storing large static graphs in a distributed fashion involves careful graph partitioning and replication. However, partitioning graphs does not necessarily provide an ideal solution to distributed graph storage. This is due to the inherently interconnected nature of graphs and a skewed distribution of these connections, i.e., edges, across the set of node pairs. For reasonably dense networks, even optimal partitioning incurs a large number of edge cuts. The poor locality means that for retrieving typical graph patterns such as k-hop neighborhoods, such distributed indexes do not scale very well. In recent years, a few systems such as Titan [15] have focused on the development of systems for storage of large (static) graphs. Graph partitioning remains an open, hard problem without many effective solutions that work with a wide range of graphs.

The problem of graph partitioning is compounded in the case of changing graphs. As the structure of the graph evolves, so does the best available partitioning over time. While the graph partitioning may be frequently updated to maintain a good quality of partitioning, this implies heavy amounts of book-keeping, i.e., keeping track of the node–partition mappings, which considerably increases the work in query processing. Because the issue of graph partitioning is essential to any distributed graph management system, the problem of “efficient graph partitioning over evolving graphs” is a fundamental challenge in the context of distributed historical graph data management.

• Optimal granularity of storage for different queries:

Another question pertains to the strategy for physical storage of different components of a graph snapshot. If the snapshot is stored as a single chunk (ignoring the distributed context for now), it is perhaps the ideal scenario for snapshot retrieval, as the entire data to be fetched is stored in one chunk on disk. However, when this style of storage is used for retrieving individual nodes or 1-hop neighborhoods, it will require fetching the entire snapshot, the bulk of which is not needed. An alternative strategy is to store the snapshot in smaller chunks. While this makes fetching smaller portions easier, it also means that fetching a snapshot requires looking up multiple chunks, i.e., multiple seeks, which makes it inefficient. Also, small partitions imply replication of edges (cut during partitioning) in two partitions. The overall increase in the space of stored objects also implies higher retrieval cost. Hence, finding the appropriate physical storage granularity is another challenge in graph indexing.

1.2.2 Scalable Analytics

State-of-the-art static network analysis routines cover a wide range of tasks, from analyzing graph structure to node properties. Temporal network analytics aims to extend that notion to a temporal dimension. For instance, there is a desire amongst analysts to find the evolution of a certain graph or node metric over time. Additionally, performing temporal aggregates, comparison, or search over time are other operations of prime interest in the realm of temporal graph analytics. In order to define an appropriate framework for it, the following aspects need to be explored:

• Appropriate abstractions for distributed, scalable in-memory analytics:

The major challenge in executing scalable analytics is the parallelization of those tasks, both for the benefit of fitting the required data into main memory and for reducing analysis latencies. Large-scale (static) graph processing frameworks such as Pregel or GraphX do this in the case of static graphs by partitioning the graph by vertices and using a particular message-passing technique underneath to communicate between different nodes. Graph computations typically involve accessing a neighbor's information at 1-hop or further distance, which implies a lot of network communication and synchronization. There is no single model or paradigm of in-memory distributed graph computing accepted as a standard, and it remains an open systems problem; “vertex-centric, message-passing” frameworks are considered the current state of the art.

The added temporal factor makes this more complicated in the context of temporal graph analytics. The primary question regarding the analytical model is, “what is/are the appropriate abstraction(s) to logically and physically divide a large temporal graph for scalable in-memory analytics?” Whether we partition the data by nodes or by time, or both, needs to be answered appropriately before describing operations on temporal graph data.

• Systematically expressing temporal graph analytics:

A wide variety of analytics on temporal graphs is of interest to analysts and domain specialists. The basis of such analytical tasks lies in graph theory for static graphs. Evolution, comparison, or search over time are some of the temporal extension flavors of the static graph quantities. It needs to be determined, “What is the appropriate set of operations that provides sufficient expressive power to analysts/programmers to write various temporal graph analytical tasks?” In other words, “what is the appropriate union of temporal logic and graph algebra?” An additional engineering challenge is to be able to utilize the tools and technology built over the years for static graph analysis, i.e., libraries such as JUNG, SNAP etc., and cloud computing frameworks such as Spark or MapReduce.

1.3 Contributions

In this dissertation, we present a set of novel techniques to address the different problems in historical graph data management, with the goal of enabling large-scale temporal graph analytics. Broadly, we address the issues of compact storage, reconstruction of past states or histories of nodes, and an appropriate framework for large-scale cluster analysis. The contributions are summarized under three different categories in Sections 1.3.1 (DeltaGraph), 1.3.2 (Temporal Graph Index) and 1.3.3 (Temporal Graph Analytics). The different concepts thus developed have been employed to engineer a system called the Historical Graph Store (HGS). A high-level diagrammatic organization of HGS can be seen in Figure 1.5. It consists of two different components. The Temporal Graph Index is a persistent and distributed index. It indexes the entire trace of a given historical network (Index Manager) and provides fast access to snapshots, node versions, etc., at query time (Query Manager). The Temporal Graph Analytics Framework provides a library to specify a variety of analytical tasks, on top of an in-memory cluster-based execution platform. Alternatively, a visual analytics tool, HiNGE, based on GraphPool (an in-memory, compact placeholder for hundreds of graphs simultaneously), may be used. While the cluster-based system is more scalable, HiNGE is memory efficient and can take advantage of traditional graph libraries such as JUNG [99].

Figure 1.5: Historical Graph Store.

1.3.1 DeltaGraph: Reconstructing Past States of a Graph

In order to compactly store the entire historical trace of a network and efficiently reconstruct past states of a network at query time, we present a novel index structure called DeltaGraph [73]. It is a hierarchical, distributed, highly tunable and extensible structure geared towards fetching 100s of graph snapshots from disk to memory in milliseconds.

The flexibility of DeltaGraph enables us to control the access latencies and storage consumption by tuning simple parameters. While its primary use is in snapshot retrieval, we also developed an extensibility framework which helps us conveniently create different temporal indexes. These temporal indexes can be used to answer specific queries on graph snapshots directly, bypassing the need for explicit snapshot retrieval. Evolutionary analysis requires fetching multiple graph snapshots from different points in time. DeltaGraph's hierarchical structure lends itself to optimizing the simultaneous retrieval of multiple snapshots, exploiting the commonality between them. Our comprehensive experimental evaluation shows that our system can retrieve historical snapshots containing up to millions of nodes and edges in several 100's of milliseconds or less, often an order of magnitude faster than prior techniques like interval trees, and that the execution time penalties of our in-memory data structure are minimal.

1.3.2 TGI: Generalized Temporal Graph Indexing at Scale

Although DeltaGraph accomplishes the task of compact storage of historical networks and provides optimal snapshot retrieval, it does not support efficient means of answering version-centric queries, and its design using snapshot-centric deltas is not entirely suitable for a cloud-oriented elastic store. We developed the Temporal Graph Index (TGI) [75], an index that builds on DeltaGraph and addresses those two concerns. It compactly stores the entire history of a graph by appropriately partitioning and encoding the differences over time (smaller deltas). These deltas are organized to optimize the retrieval of several temporal graph primitives such as neighborhood versions, node histories, and graph snapshots. TGI is designed to use a distributed key-value store to store the partitioned deltas, and can thus leverage the scalability afforded by those systems (our implementation uses the Apache Cassandra key-value store). TGI is a tunable index structure, and we investigate the impact of tuning the different parameters through an extensive empirical evaluation. TGI provides elastic scalability and a wider variety of graph primitive fetching compared to DeltaGraph. We also show a qualitative comparison between the different indexes, including these two, for retrieval of various types of graph primitives.

1.3.3 Temporal Graph Analytics

Since typical evolutionary analysis queries require fetching 100s of graph snapshots, we need an efficient way to store all of them in memory simultaneously. We developed an in-memory structure called GraphPool to store 100s of those graphs in an overlaid manner, exploiting the overlap due to temporal consistency. GraphPool makes it possible to store many snapshots in memory which would otherwise be infeasible. We present a distributed implementation of GraphPool to demonstrate its effectiveness in performing evolutionary and comparative analysis over large evolving graphs. The system exposes a generic programmatic interface as well as an interactive visual extension called HiNGE.
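A rough sketch of the overlaid-storage idea follows (a toy illustration under assumed names, not the actual GraphPool implementation described in Chapter 5): every loaded snapshot is assigned a bit position, and each edge carries a bitmask recording which snapshots contain it, so that structure shared across snapshots is stored only once.

    class OverlaidGraphPool:
        """Toy sketch: many snapshots stored as one overlaid edge set plus membership bits."""
        def __init__(self):
            self.edge_bits = {}       # (u, v) -> bitmask of snapshots containing the edge
            self.snapshot_bit = {}    # snapshot timestamp -> assigned bit position

        def load_snapshot(self, t, edges):
            bit = self.snapshot_bit.setdefault(t, len(self.snapshot_bit))
            for e in edges:
                # an edge shared with earlier snapshots only gets one more bit set,
                # instead of being stored again in a second full copy
                self.edge_bits[e] = self.edge_bits.get(e, 0) | (1 << bit)

        def edges_at(self, t):
            bit = self.snapshot_bit[t]
            return [e for e, mask in self.edge_bits.items() if mask & (1 << bit)]

Two snapshots that differ by only a few events thus cost a few extra bitmask updates rather than a second full in-memory copy.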

Secondly, we present a Temporal Graph Analysis Framework (TAF), which provides an expressive library to specify a wide range of temporal graph analysis tasks and to execute them at scale in a cluster environment. The library is based on a novel set of temporal graph operators that enable a user to analyze the history of a graph in a variety of manners.

The execution engine itself is based on Apache Spark [142], a large-scale in-memory cluster computing framework.

1.4 Graph Data Model

In this section, we describe a general data model for evolving graphs used in this dissertation. We also consider derivatives of this data model at certain places, which are explicitly specified in that context. The most basic model of a graph over a period of time is as a collection of graph snapshots, one corresponding to each time instance (assuming discrete time). Each such graph snapshot contains a set of nodes and a set of edges. The nodes and edges are assigned unique ids at the time of their creation, which are not re-assigned after deletion of the components (a deletion followed by a re-insertion results in the assignment of a new id). A node or an edge may be associated with a list of attribute-value pairs; the list of attribute names is not fixed a priori and new attributes may be added at any time. Additionally, an edge contains information about whether it is a directed or an undirected edge.

An event is defined as the record of an atomic activity in the network. An event could pertain to either the creation or deletion of an edge or node, or a change in an attribute value of a node or an edge. Alternatively, an event can express the occurrence of a transient edge or node that is valid only for that time instance instead of an interval (e.g., a “message” from a node to another node). Being atomic refers to the fact that the activity cannot be logically broken down further into smaller events. Hence, an event always corresponds to a single timepoint. So, the valid time interval of an edge, [ts, te], is expressed by two different events, edge addition and deletion events at ts and te respectively. The exact contents of an event depend on the event type; an example of a new-edge event (NE) and an update node-attribute event (UNA) can be seen below.

(a) {NE, N:23, N:4590, directed:no, 11/29/03 10:10}

(b) {UNA, N:23, ‘job’, old:‘..’, new:‘..’, 11/29/07 17:00}

We treat events as bidirectional, i.e., they could be applied to a database snapshot in either direction of time. For example, say that at times tk−1 and tk, the graph snapshots are Gk−1 and Gk respectively. If E is the set of all events at time tk, we have that:

Gk = Gk−1 + E,    Gk−1 = Gk − E

28 where the + and − operators denote application of the events in E in the forward and the backward direction. All events are recorded in the direction of evolving time, i.e., going ahead in time. A list of chronologically organized events is called an eventlist.
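The event model and the bidirectional application of an eventlist can be illustrated with a short Python sketch (a toy under the data model above, not the system's implementation; the field and function names are assumptions):

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Event:
        kind: str       # "NE" (new edge), "DE" (delete edge), "UNA" (update node attribute)
        payload: tuple  # e.g., (23, 4590) for an edge, or (23, "job", old, new) for an attribute
        t: str          # timestamp of the atomic activity

    def apply(graph, eventlist, forward=True):
        """Apply an eventlist E to a snapshot: forward gives Gk = Gk-1 + E,
        backward gives Gk-1 = Gk - E."""
        for ev in (eventlist if forward else reversed(eventlist)):
            if ev.kind == "NE":
                (graph["edges"].add if forward else graph["edges"].discard)(ev.payload)
            elif ev.kind == "DE":
                (graph["edges"].discard if forward else graph["edges"].add)(ev.payload)
            elif ev.kind == "UNA":
                node, attr, old, new = ev.payload
                graph["attrs"][(node, attr)] = new if forward else old
        return graph

    # Gk = apply(Gk_minus_1, E, forward=True);  Gk_minus_1 = apply(Gk, E, forward=False)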

1.5 Organization of the dissertation

The rest of the dissertation is organized as follows:

• Chapter 2 discusses references to the relevant background work related to temporal network analysis, graph datastores, and temporal data management. We talk about the cross-disciplinary demand for temporal graph analysis, followed by the limitations and the lack of convergence in existing graph datastores and temporal data management techniques.

• Chapter 3 describes snapshot retrieval on historical graphs using the DeltaGraph. It contains the design, rationale and analysis of DeltaGraph, a hierarchical persistent index structure that captures temporal redundancy for compact storage and provides almost direct access to network states at one or more points in the past.

• Chapter 4 talks about the Temporal Graph Index, which provides elastic storage of temporal graphs for a wide variety of graph primitive retrieval. It builds on the DeltaGraph, but also provides support for version-style retrieval, such as neighborhood evolution. It presents a novel distributed index design that scales under the temporal or topological skew of a dataset, which is typical of temporal graphs.

• Chapter 5 covers two essential aspects of temporal graph analysis. It talks about a compact way to simultaneously retain hundreds of graphs in memory for analysis using the GraphPool. The latter part of the chapter describes the design and specification of the Temporal Graph Analysis Framework, which provides a library and an execution platform for large-scale temporal graph analytics.

• Chapter 6 concludes the dissertation through a summary of the observations and contributions, and a description of the future problems related to the topic.

Chapter 2: Background

2.1 Temporal Graph Analysis

There has been an increasing interest in dynamic network analysis over the last decade, fueled by the increasing availability of large volumes of temporally annotated network data. The social network analysis (SNA) and information visualization (InfoViz) communities in particular have shown a great deal of interest in temporal graphs. Following is an account of the recent work in these areas.

2.1.1 Dynamic Social Network Analysis

Many works have focused on designing analytical models that capture how a network evolves, with a primary focus on social networks and the Web (see, e.g., [18, 85, 79]). There is also much work on understanding how communities evolve, identifying key individuals, and locating hidden groups in dynamic networks. Berger-Wolf et al. [32, 125], Tang et al. [123] and Greene et al. [58] address the problem of community evolution in dynamic networks. McCulloh and Carley [92] present techniques for social change detection. Asur et al. [26] present a framework for characterizing the complex behavioral patterns of individuals and communities over time. In a recent work, Ahn et al. [21] present an exhaustive taxonomy of temporal visualization tasks. Biologists are interested in discovering the historical events leading to a known state of a biological network (e.g., [96]). The evolution of shortest paths in dynamic graphs has been studied by Huo et al. [67], Ren et al. [110], and Xuan et al. [62]. Change in PageRank with evolving graphs [48, 27], and the study of change in centrality of vertices, path lengths of vertex pairs, etc. [103], also lie under the larger umbrella of temporal graph analysis. Barrat et al. [29] provide a good reference for studying several dynamic processes modeled over graphs. Holme et al. [66] provide a description of several networks from different domains that are temporal or dynamic in nature. Our goal in this work is to build a graph data management system that can efficiently and scalably support these types of dynamic network analysis tasks over large volumes of data in real time.

2.1.2 Temporal Network Visualization

Visualization is a principal tool in the exploration and analysis of networks. Several network visualization tools, such as Gephi [30] and NodeXL [64], are widely used for exploration and analysis of graph datasets. In recent years there has been an emphasis on the development of techniques to aid the analysis of temporal graphs. These have found application in studying the evolution of social networks, the Web graph, biological networks, transmission networks, etc. Hence, assessing the general data management requirements for such tools is a key motivation in abstracting the correct problems in historical graph data management. Erten et al. [50] present a visual analysis of an evolving dataset from the computing literature through properties like diameter, connected components, collaborators, co-authors, and topic trends over time. Using visualization of different snapshots, Toyoda et al. [131] study events such as the appearance of pages and the relationships between pages in a historical Web archive dataset. More such examples can be found in a survey by Kerracher et al. [72].

Different visualization techniques are useful in the process of gaining insights into the temporal nature of graphs. For instance, NetEvViz [76] extends NodeXL to study evolving networks; it uses an edge color-coding scheme for comparative study of network growth. Ahn et al. [22] focus on the different states of graph components in order to visualize the temporal dynamics of a social network. Alternate representations to node-link diagrams have been proposed in order to study the temporal nature of social networks, such as TimeMatrix [141], which uses a matrix-based visualization to better capture temporal aggregations and overlays. A temporal evolution analysis technique based on capturing visual consistency across snapshots can be seen in the work of Xu et al. [140]. A survey by Beck et al. [31] lists several other approaches for visual analysis of dynamic graphs.

2.2 Temporal Data Management

There is a vast body of literature on temporal relational databases, starting with the early work in the 80's on developing temporal data models and temporal query languages. We will not attempt to present an exhaustive survey of that work, but instead refer the reader to several surveys and books on the topic [36, 119, 100, 124, 47, 120, 112]. The most basic concepts that a relational temporal database is based upon are valid time and transaction time, considered orthogonal to each other. Valid time denotes the time period during which a fact is true with respect to the real world. Transaction time is the time when a fact is stored in the database. A valid-time temporal database permits correction of a previously incorrectly stored fact [119], unlike transaction-time databases, where an inquiry into the past may yield the previously known, perhaps incorrect, version of a fact. Under this nomenclature, our data management system is based on valid time: for us, the time the information was entered into the database is not critical; our focus is on the time period when the information is true.

Salzberg and Tsotras [112] present a comprehensive survey of temporal data indexing techniques, and discuss two extreme approaches to supporting snapshot retrieval queries, referred to as the Copy and Log approaches. While the Copy approach relies on storing new copies of a snapshot upon every point of change in the database, the Log approach relies on storing only the changes, annotated by time. Their hybrid is often referred to as the Copy+Log approach. Other data structures, such as interval trees [25] and segment trees [34], can also be used for storing temporal information. Temporal aggregation in scientific array databases [122] and time-series indexing [97, 40, 136] are other related topics of interest, but the challenges there are significantly different.

2.2.1 Temporal Queries and Storage

From a querying perspective, both valid-time and transaction-time databases can be treated as simply collections of intervals; however, indexing structures that assume transaction times can often be simpler, since they do not need to support arbitrary inserts or deletes into the index. Salzberg and Tsotras [112] present a comprehensive survey of indexing structures for temporal databases. They also present a classification of the different queries that one may ask over a temporal database. Under their notation, our focus in this work is on both the valid timeslice query, where the goal is to retrieve all the entities and their attribute values that are valid as of a specific time point, and the valid pure-key range query, where the goal is to fetch an entity's states over an interval of time. The snapshot index [132] is an I/O-optimal solution to the problem of snapshot retrieval for transaction-time databases. There has also been work on performing time-travel operations such as computing temporal joins, computing temporal aggregates, retrieving version history, etc. [70, 35, 71]. Recently, there was a proposal to add a temporal dimension to the TPC-H benchmark [24]. Many industrial systems, such as Microsoft's ImmortalDB [87], Oracle's Flashback feature [108], and most recently IBM's DB2 [113], have incorporated one or more of these temporal operations for relational databases.

2.2.2 Version Management

Most scientific datasets are semistructured in nature and can be effectively represented in XML format [38]. Lam and Wong [83] use complete deltas, which can be traversed in either direction of time, for efficient retrieval. Other systems store the current version as a snapshot and the historical versions as deltas from the current version [91]; for such a system, the deltas only need to be unidirectional. Ghandeharizadeh et al. [54] provide a formalism on deltas, which includes a delta arithmetic. All these approaches assume unique node identifiers to merge deltas with other deltas or snapshots. Buneman et al. [38] propose merging all the versions of the database into one single hierarchical data structure for efficient retrieval. In a recent work, Seering et al. [115] presented a disk-based versioning system that uses efficient delta encoding to minimize space consumption and retrieval time in array-based systems. Bhattacharjee et al. [33] systematically explore the trade-off between storage and recreation costs using compressed delta storage for different versions of a dataset. However, none of that prior work focuses on snapshot retrieval in general graph databases, or proposes techniques that can flexibly exploit memory-resident information.

2.3 Graph Data Management

In recent years, there has been much work on graph storage and graph processing systems, and numerous commercial and open-source systems have been designed to address various aspects of graph data management. Some examples include Neo4J, AllegroGraph [17], Titan [15], GBase [68], Pregel [90], Giraph [1], GPS [111], Pegasus [69], GraphChi [80], GraphX [56], GraphLab [88], Cassovary [5], and Trinity [117]. These systems use a variety of different models for representation, storage, and querying, and there is a lack of standardized or widely accepted models for the same.

2.3.1 Graph Storage

The massive scale of graph datasets and the computational complexity of graph algorithms make graph partitioning and cluster computing a common theme across modern graph databases.

Different graph stores use varying representations for persistent storage of graphs. For instance, GBase stores a distributed sparse adjacency matrix representation of the graph in HDFS files in order to facilitate query processing using Hadoop [2]. Titan, on the other hand, uses a different representation utilizing distributed key-value stores such as Cassandra [82], HBase [3], or BerkeleyDB [98]. Other examples of key-value store based graph stores are CloudGraph [6], VertexDB [16], and Redis Graph [11]. Certain stores such as Neo4J and Titan also feature main-memory caching to speed up graph retrieval from disk.

The query-time graph fetch performance depends, on the one hand, on the nature of storage, partitioning, chunking, etc.; on the other hand, it also depends on the nature of the queries served by the graph store. In general, graph queries follow fairly arbitrary co-access patterns, and hence none of the data stores optimize their storage layouts for a broad spectrum of graph queries. For instance, GraphChi ensures serial I/O for a specific pattern of graph mining queries. Columnar storage is another characteristic increasingly being adopted by various graph stores. It is particularly helpful in decoupling the structural aspects of a graph from the attributes, thereby ensuring lighter fetch payloads as well as a greater degree of compression. Columnar storage also fits naturally with several key-value stores that are designed to handle it effectively.

Relational databases are generally deemed unsuitable for storing graph data: the inherently traversal-oriented nature of graph queries leads to a large number of expensive joins if the underlying graph data is kept in relational format. However, certain semi-structured data formats, such as XML (Extensible Markup Language) and RDF (Resource Description Framework), are able to capture graphs through their data models. RDF stores such as AllegroGraph are able to provide an effective storage base for certain kinds of graph queries, with the support of the query languages described later. The two prominent reasons for the partial success of RDF and XML stores for graph data are, respectively, more effective joins in the context of graph traversal and the capability to handle a fluid schema.

2.3.2 Graph Query and Processing Frameworks

Most graph querying happens through programmatic access to graphs in languages such as Java, Python, or C++. Graph libraries such as Blueprints [4] and SNAP [86] provide a rich set of implementations of graph-theoretic algorithms. SPARQL [104, 13] is a language used to search for patterns in linked data; it works on an underlying RDF representation of graphs. T-SPARQL [57] is a temporal extension of SPARQL. He et al. [65] provide a language for finding subgraph patterns using a graph as a query primitive.

Gremlin [8] is a graph traversal language over the property graph data model, and has been adopted by several open-source systems. For large-scale graph analysis, perhaps the most popular framework is the vertex-centric programming framework, adopted by Giraph, GraphLab, GraphX, and several other systems; there have also been several proposals for richer and more expressive programming frameworks in recent years. However, most of these prior systems largely focus on analyzing a single snapshot of the graph data, with very little support for handling dynamic graphs, if any.

2.3.3 Temporal Graph Data Stores

There is also prior work on temporal RDF data and temporal XML data. Motik [95] presents a logic-based approach to representing valid time in RDF and OWL. Several works (e.g., [106, 126]) have considered the problems of subgraph pattern matching or SPARQL query evaluation over temporally annotated RDF data. There has been some work on expressing graph queries in a Datalog-based syntax [116, 129]. There is also much work on version management in XML data stores.

A few recent papers address the issues of storage and retrieval in dynamic graphs. G* [81] stores multiple snapshots compactly by utilizing commonalities. Chronos [63, 94] is an in-memory system for processing dynamic graphs, with the objective of shared storage and computation for overlapping snapshots. Ghrab et al. [55] provide a system of network analytics through labeling graph components. Gedik et al. [53] describe a block-oriented and cache-enabled system that exploits spatio-temporal locality for solving temporal neighborhood queries. Koloniari et al. [77] also utilize caching to fetch selective portions of temporal graphs, which they refer to as partial views. LLAMA [89] uses multi-versioned arrays to represent a mutating graph, but its focus is primarily on in-memory representation.

Chapter 3: Historical Graph Snapshot Retrieval

3.1 Introduction

Snapshot retrieval, or the reconstruction of a past state of the dataset, is perhaps the single most important retrieval primitive for historical network analytics. An evolutionary analysis task such as tracing the evolution of the PageRank of the central entities in a network involves fetching hundreds of snapshots from different time points in the past. Once they are in memory, we need to run a PageRank routine to compute the required values and plot them. Our ability to perform this style of historical analysis efficiently depends on our ability to quickly fetch those hundreds of snapshots from a persistent datastore into memory.

We observe that this crucial task is prevalent across a wide variety of temporal analyses, such as community evolution and the growth of network characteristics like density and diameter; it also serves as the basis for answering specific historical queries, such as finding the graph clustering coefficient one year ago. An abstract solution to this problem provides a foundation for the task of storage and retrieval for large historical networks.

In this chapter, we present a novel, user-extensible, highly tunable, and distributed hierarchical index structure called DeltaGraph that compactly stores the entire historical trace of a network and supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Along with the original graph data, DeltaGraph can also maintain and index auxiliary information; this functionality can be used to extend the structure to efficiently execute queries like subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right construction parameters for a specific scenario. We expose a general programmatic API to process and analyze the snapshots retrieved from DeltaGraph.

Graph Data Model: In this chapter, we follow the data model described in Section 1.4 of the introduction. To recap, we view an evolving graph as a set of snapshots, one per time point, where time is assumed to be a discrete quantity. A change between two different snapshots is described by an ordered (chronological) set of events, each of which describes an atomic change in the network.
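To make this model concrete, the following is a minimal sketch in Java (the language of our prototype) of an atomic event and of snapshot reconstruction by naive event replay; the class names (Event, EventType, SimpleSnapshot) and their fields are purely illustrative rather than the system's actual API, and attribute events are elided.

import java.util.*;

/* Illustrative sketch of the event-based model: an evolving graph is a chronologically
   ordered list of atomic events, and the snapshot at time t is obtained by applying all
   events with timestamp <= t. Names and fields are for illustration only. */
enum EventType { ADD_NODE, DELETE_NODE, ADD_EDGE, DELETE_EDGE, SET_ATTRIBUTE }

class Event {
    final long timestamp;             // discrete time point at which the change is valid
    final EventType type;             // the kind of atomic change
    final long nodeId;                // node concerned (source node for edge events)
    final long otherNodeId;           // target node for edge events; unused otherwise
    final String attrKey, attrValue;  // payload for SET_ATTRIBUTE events; null otherwise

    Event(long timestamp, EventType type, long nodeId, long otherNodeId,
          String attrKey, String attrValue) {
        this.timestamp = timestamp; this.type = type; this.nodeId = nodeId;
        this.otherNodeId = otherNodeId; this.attrKey = attrKey; this.attrValue = attrValue;
    }
}

class SimpleSnapshot {
    final Set<Long> nodes = new HashSet<>();
    final Set<String> edges = new HashSet<>();   // edges encoded as "u-v" for brevity

    /* Replays a chronological event list up to (and including) time t. */
    static SimpleSnapshot asOf(List<Event> events, long t) {
        SimpleSnapshot s = new SimpleSnapshot();
        for (Event e : events) {
            if (e.timestamp > t) break;                      // events are in chronological order
            switch (e.type) {
                case ADD_NODE:    s.nodes.add(e.nodeId); break;
                case DELETE_NODE: s.nodes.remove(e.nodeId); break;
                case ADD_EDGE:    s.edges.add(e.nodeId + "-" + e.otherNodeId); break;
                case DELETE_EDGE: s.edges.remove(e.nodeId + "-" + e.otherNodeId); break;
                default: break;                              // attribute events omitted in this sketch
            }
        }
        return s;
    }
}

Naive replay of this kind is exactly the Log approach discussed in Section 3.2; the rest of this chapter is about avoiding it.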

Chapter Outline: In the rest of the chapter, we begin by providing a background on existing alternatives for snapshot retrieval from the temporal relational database, geospatial, and computational geometry literature in Section 3.2. In Section 3.3, we describe the details of the DeltaGraph design and give an overview of the retrieval queries and the accessibility library. Next, we describe the algorithms for performing various snapshot queries in a principled and planned manner in Section 3.4. We also describe query speedup through prefetching and the auxiliary framework for extending DeltaGraph beyond snapshot queries. In Section 3.5, we discuss the impact of different parameters on the behavior of DeltaGraph, and present the analytical models for estimating space and query times for given index configurations. In Section 3.6, we discuss the details of the construction and update algorithms for the index. We present experiments supporting the efficiency of DeltaGraph in Section 3.7 and conclude the chapter in Section 3.8.

3.2 Background

An optimal solution to answering snapshot retrieval queries is the external interval tree, presented by Arge and Vitter [25]. Their proposed index structure uses optimal space on disk, and supports updates in optimal (logarithmic) time. Segment trees [34] can also be used to solve this problem, but may store some intervals in a duplicated manner and hence use more space. Tsotras and Kangelaris [132] present the snapshot index, an I/O-optimal solution to the problem for transaction-time databases. Salzberg and Tsotras [112] also discuss two extreme approaches to supporting snapshot retrieval queries, called the Copy and Log approaches. In the Copy approach, a snapshot of the database is stored at each transaction state; the primary benefit is fast retrieval times, but the space requirements make this approach infeasible in practice. The other extreme is the Log approach, where only and all the changes are recorded in the database, annotated by time. While this approach is space-optimal and supports O(1)-time updates (for transaction-time databases), answering a query may require scanning the entire list of changes and can take a prohibitive amount of time. A mix of the two approaches, called Copy+Log, where a subset of the snapshots is explicitly stored, is often a better idea. These are illustrated in Figure 3.1.

Figure 3.1: Temporal relational database indexing techniques: (a) Copy — snapshots capturing "copies" of the graph at each time point; (b) Log — events describing the changes in the graph; (c) Copy+Log — a combination of stored snapshots and events.

Figure 3.2: An interval tree holds intervals and can potentially be used to index a temporal dataset using valid-time durations. Each node consists of: (1) a "center point"; (2) a list of all elements (intervals) whose end points lie across the center point, sorted by start times and by end times; (3) pointers to left and right subtrees that correspond to intervals that end before the node's center point and those that start after it, respectively.

We found these (and other prior) approaches to be insufficient and inflexible for our needs for several reasons. First, they do not efficiently support multipoint queries, which we expect to be very commonly used in evolutionary analysis and which need to be optimized by avoiding duplicate reads and repeated processing of the events. Second, to cater to the needs of a variety of different applications, we need the index structure to be highly tunable, and to allow trading off different resources and user requirements (including memory, disk usage, and query latencies). Ideally, we would also like to control the distribution of average snapshot retrieval times over the history, i.e., we should be able to reduce the retrieval times for more recent snapshots at the expense of increasing them for the older snapshots (while keeping the utilization of the other resources the same), or vice versa. For achieving low latencies, the index structure should support flexible prefetching of portions of the index into memory and should avoid processing any events that are not needed by the query (e.g., if only the network structure is needed, then we should not have to process any events pertaining to node or edge attributes). Finally, we would like the index structure to be able to support different persistent storage options, ranging from a hard disk to the cloud; most of the previously proposed index structures are optimized primarily for disks.

3.3 DeltaGraph

Our proposed index structure, DeltaGraph, is a rooted, directed, largely hierarchical graph whose lowest level corresponds to the snapshots of the network over time (these are not explicitly stored), and whose interior nodes correspond to graphs constructed by combining the lower-level graphs (these are typically not valid graphs as of any specific time point). The information stored with each edge, called an edge delta, is sufficient to construct the graph corresponding to the target node from the graph corresponding to the source node, and thus a specific snapshot can be created by traversing any path from the root to that snapshot. While conceptually simple, DeltaGraph is a very powerful, extensible, general, and tunable index structure that enables trading off the different resources and user requirements as per a specific application's needs. By appropriately tuning the DeltaGraph construction parameters, we can trade decreased snapshot retrieval times for increased disk storage requirements. One key parameter of the DeltaGraph enables us to control the distribution of average snapshot retrieval times over history. Portions of the DeltaGraph can be prefetched, allowing us to trade increased memory utilization for reduced query times. Many of these decisions can be made at run-time, enabling us to adaptively respond to changing resource characteristics or user requirements. DeltaGraph is highly extensible, giving the user the opportunity to define additional indexes to be stored on edge deltas in order to perform specific operations more efficiently. Finally, DeltaGraph utilizes several other optimizations, including column-oriented storage to minimize the amount of data that needs to be fetched to answer a query, and multi-query optimization to simultaneously retrieve many snapshots.

DeltaGraph naturally enables distributed storage and processing to scale to large graphs. The edge deltas can be stored in a distributed fashion through horizontal partitioning, and the historical snapshots can be loaded in parallel onto a set of machines in a partitioned fashion; in general, the two partitionings need not be aligned, but for computational efficiency, we currently require that they be. Horizontal partitioning also results in lower snapshot retrieval latencies, since the different deltas needed for reconstruction can be fetched in parallel.

We have built a DeltaGraph prototype in Java, along with a Python interface. We use Kyoto Cabinet [9], a disk-based key-value store, as the back-end engine to store the deltas and eventlists; in the distributed mode, we run one instance on each machine. Our design decision to use a key-value store at the back end was motivated by the flexibility, fast retrieval times, and scalability afforded by such systems; since we only require a simple get/put interface from the storage engine, we can easily plug in other cloud-based, distributed key-value stores like HBase [3] or Cassandra [82].
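Because only a get/put interface is required of the storage engine, the back end can be hidden behind a very thin abstraction. The sketch below shows one plausible shape of such an abstraction; the interface and class names are hypothetical (not the prototype's actual classes), and a simple in-memory map stands in for Kyoto Cabinet, HBase, or Cassandra.

import java.util.concurrent.*;

/* Minimal key-value contract assumed by the index: byte[] values keyed by strings. */
interface DeltaStore {
    void put(String key, byte[] value);
    byte[] get(String key);
}

/* An in-memory stand-in; a disk-based or distributed store would implement the
   same two calls, which is what makes the back end pluggable. */
class InMemoryDeltaStore implements DeltaStore {
    private final ConcurrentMap<String, byte[]> map = new ConcurrentHashMap<>();
    public void put(String key, byte[] value) { map.put(key, value); }
    public byte[] get(String key) { return map.get(key); }
}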

Finally, we note that our proposed index generalizes beyond graph data and can be used for efficient snapshot retrieval in temporal relational databases as well. In fact, the DeltaGraph treats the network as a collection of objects and does not exploit any properties of the graphical structure of the data.

3.3.1 Anatomy of DeltaGraph

Figure 3.3(a) shows a simple DeltaGraph, where the nodes S1, ..., S4 correspond to four historical snapshots of the graph, spaced L events apart (equal spacing is not a requirement, but simplifies analysis). We call these nodes leaves, even though there are bidirectional edges between them as shown in the figure. The interior nodes of the DeltaGraph correspond to graphs that are constructed from their children by applying what we call a differential function, denoted f(). For an interior node Sp with children Sc1, ..., Sck, we have that Sp = f(Sc1, ..., Sck) (we abuse the notation somewhat to let Sp denote both the interior node and the graph corresponding to it). The simplest differential function is perhaps the Intersection function. We discuss other differential functions in the next section.

The graphs Sp are not explicitly stored in the DeltaGraph. Rather, we only store the delta information with the edges. Specifically, the directed edge from Sp to Sci is associated with a delta ∆(Sci, Sp) which allows construction of Sci from Sp. It contains the elements that should be deleted from Sp (i.e., Sp − Sci) and those that should be added to Sp (i.e., Sci − Sp). The bidirectional edges among the leaves also store similar deltas; here the deltas are simply the eventlists (denoted E1, E2, E3 in Figure 3.3), called leaf-eventlists. For a leaf-eventlist E, we denote by [E^start, E^end) the time interval that it corresponds to. For convenience, we add a special root node, called the super-root, at the top of the DeltaGraph that is associated with a null graph (S8 in Figure 3.3). We call the children of the super-root roots.
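The role of an edge delta can be seen in the following sketch: a delta records the elements to delete and the elements to add, and applying it to the source node's graph yields the target node's graph. The Delta class and its set-of-elements view are illustrative; the actual prototype stores deltas in a columnar, partitioned form rather than as plain sets.

import java.util.*;

/* Sketch: ∆(Sc, Sp) stores Sp − Sc (to delete) and Sc − Sp (to add), so that
   Sc = apply(Sp, ∆(Sc, Sp)). E stands for any graph element: a node, an edge,
   or an attribute entry. */
class Delta<E> {
    final Set<E> toDelete = new HashSet<>();  // elements in Sp but not in Sc
    final Set<E> toAdd    = new HashSet<>();  // elements in Sc but not in Sp

    /* Builds the delta that transforms 'from' (Sp) into 'to' (Sc). */
    static <T> Delta<T> between(Set<T> to, Set<T> from) {
        Delta<T> d = new Delta<>();
        for (T e : from) if (!to.contains(e)) d.toDelete.add(e);
        for (T e : to)   if (!from.contains(e)) d.toAdd.add(e);
        return d;
    }

    /* Applies this delta to a copy of 'from', producing 'to'. */
    Set<E> apply(Set<E> from) {
        Set<E> result = new HashSet<>(from);
        result.removeAll(toDelete);
        result.addAll(toAdd);
        return result;
    }
}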

A DeltaGraph can simultaneously have multiple hierarchies that use different differential functions (Figure 3.3(b)); this can be used to improve query latencies at the expense of a higher space requirement.

Figure 3.3: DeltaGraphs with 4 leaves, leaf-eventlist size L, and arity 2: (a) a single hierarchy built with one differential function f; (b) two hierarchies over the same leaves, built with differential functions f1 and f2. ∆(Si, Sj) denotes the delta needed to construct Si from Sj.

The deltas and the leaf-eventlists are given unique ids in the DeltaGraph structure, and are stored in a columnar fashion, by separating out the structure information from the attribute information. For simplicity, we assume here a separation of a delta ∆ into three components: (1) ∆struct, (2) ∆nodeattr, and (3) ∆edgeattr. For a leaf-eventlist E, we have an additional component, Etransient, where the transient events are stored.

Finally, the deltas and the eventlists are stored in a persistent, distributed key-value store, the key being ⟨partition id, delta id, c⟩, where c ∈ {∆struct, ∆nodeattr, ∆edgeattr, Etransient} specifies which of the components is being fetched or stored, and partition id specifies the partition. Each delta or eventlist is typically partitioned and stored in a distributed manner. Based on the node id of the concerned node(s), each event, edge, node, and attribute is assigned a partition using a hash function hp, such that partition id = hp(node id). In a setup with k distributed storage units, all the deltas and eventlists are likely to have k partitions each.
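A minimal sketch of how such storage keys might be formed is shown below; the exact key encoding and the hash function used by the prototype are not specified here, so hp and the string format should be treated as placeholders.

/* Sketch of key construction for the distributed key-value store:
   key = <partition_id, delta_id, component>, with partition_id = hp(node_id). */
class StorageKey {
    enum Component { STRUCT, NODEATTR, EDGEATTR, TRANSIENT }

    /* hp(node_id): any stable hash works; modulo over the k storage units is assumed here. */
    static int partitionOf(long nodeId, int numPartitions) {
        return Math.floorMod(Long.hashCode(nodeId), numPartitions);
    }

    static String encode(int partitionId, long deltaId, Component c) {
        return partitionId + ":" + deltaId + ":" + c.name();
    }
}

// Example: the STRUCT component of delta 42, for the partition owning node 9001, on 8 units
// String key = StorageKey.encode(StorageKey.partitionOf(9001L, 8), 42L, StorageKey.Component.STRUCT);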

3.3.2 Snapshot Queries

We differentiate between a singlepoint snapshot query and a multipoint snapshot query. An example of the first is: "Retrieve the graph as of January 2, 1995". A multipoint snapshot query, on the other hand, requires us to simultaneously retrieve multiple historical snapshots (e.g., "Retrieve the graphs as of every Sunday between 1994 and 2004"). We also support more complex snapshot queries where a TimeExpression or a time interval is specified instead. Any snapshot query can specify whether it requires only the structure of the graph, or a specified subset of the node or edge attributes, or all attributes.

3.3.3 Accessibility

The system can be used through a programmatic API, an interactive visual interface for analytics, or a declarative high-level query language. The programmatic API is the best way to interact with the system in a tightly coupled manner. Specifically, the following is a list of some of the retrieval functions that we support in our programmatic API.

GetHistGraph(Time t, String attr options): In the basic singlepoint graph retrieval call, the first parameter indicates the time; the second parameter indicates the attribute information to be fetched, as a string formed by concatenating the sub-options listed in Table 3.1. For example, to specify that all node attributes except salary, and the edge attribute name, should be fetched, we would use: attr options = "+node:all-node:salary+edge:name".

GetHistGraphs(List⟨Time⟩ t list, String attr options): The multipoint counterpart of the above call, where t list specifies a list of time points.

GetHistGraph(TimeExpression tex, String attr options): This is used to retrieve a hypothetical graph using a multipoint expression over time points. For example, the expression (t1 − t2) specifies the components of the graph that were valid at time t1 but not at time t2. The TimeExpression data structure consists of a list of k time points, {t1, t2, . . . , tk}, and binary set operators, union (∪), intersection (∩), and subtract (−), over them.

/* Loading the index */
GraphManager gm = new GraphManager(. . . );
gm.loadDeltaGraphIndex(. . . );
...
/* Retrieve the historical graph structure along with node names on Jan 2, 1985 */
HistGraph h1 = gm.GetHistGraph("1/2/1985", "+node:name");
...
/* Traversing the graph */
List nodes = h1.getNodes();
List neighborList = nodes.get(0).getNeighbors();
HistEdge ed = h1.getEdgeObj(nodes.get(0), neighborList.get(0));
...
/* Retrieve the historical graph structure alone on Jan 2, 1986 and Jan 2, 1987 */
listOfDates.add("1/2/1986");
listOfDates.add("1/2/1987");
List h1 = gm.getHistGraphs(listOfDates, "");
...

Figure 3.4: The (Java) code snippet shows an example program that retrieves several graphs, and operates upon them.

GetHistGraphInterval(Time ts, Time te, String attr options): This is used to retrieve a graph over all the elements that were added during the time interval [ts, te). It also fetches the transient events, which are not fetched (by definition) by the above calls.

Option               Explanation
-node:all (default)  None of the node attributes
+node:all            All node attributes
+node:attr1          Fetch the node attribute named "attr1"; overrides "-node:all" for that attribute
-node:attr1          Exclude the node attribute named "attr1"; overrides "+node:all" for that attribute

Table 3.1: Options for node attribute retrieval. Similar options exist for edge attribute retrieval.

3.4 Snapshot Retrieval

3.4.1 Single-point Snapshot Retrieval

Given a singlepoint snapshot query at time t1, there are many ways to answer it from the DeltaGraph. Let E denote the leaf-eventlist such that t1 ∈ [E^start, E^end) (found through a binary search at the leaf level). Any (directed) path from the super-root to the two leaves adjacent to E represents a valid solution to the query. Hence we can find the optimal solution by finding the path with the lowest weight, where the weight of an edge captures the cost of reading the associated delta (or the required subset of it) and applying it to the graph constructed so far. We approximate this cost by using the size of the delta retrieved as the weight. Note that each edge is associated with three or four weights, corresponding to the different attr options; in the distributed case, we have a set of weights for each partition. We also add a new virtual node (node St1 in Figure 3.5(a)), and add edges to it from the adjacent leaves as shown in the figure. The weights associated with these two edges are set by estimating the portion of the leaf-eventlist E that must be processed to construct St1 from those leaves.

We use the standard Dijkstra's shortest path algorithm to find the optimal solution for a specific singlepoint query, using the appropriate weights. Although this algorithm requires us to traverse the entire DeltaGraph skeleton for every query, we use it for two reasons. First, it handles the continuously changing DeltaGraph skeleton, especially in response to memory materialization (discussed later). Second, the weights associated with the edges differ across queries and are highly skewed, so the shortest paths can be quite different for the same timepoint under different attr options. Further, the sizes of the DeltaGraph skeletons (i.e., the number of nodes and edges) are usually small, even for very large historical traces, and the running time of the shortest path algorithm is dwarfed by the cost of retrieving the deltas from persistent storage. In ongoing work, we are developing an algorithm based on incrementally maintaining single-source shortest paths to handle very large DeltaGraph skeletons.
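The plan search itself is ordinary single-source shortest path over the small skeleton graph. The sketch below shows a straightforward Dijkstra over an adjacency list whose edge weights would be the (attr options-specific) delta sizes; the SkeletonPlanner and Edge types are assumed for illustration only.

import java.util.*;

/* Sketch: Dijkstra over the DeltaGraph skeleton. Nodes are skeleton node ids
   (super-root, interior nodes, leaves, plus the virtual query node); edge weights
   approximate the cost of fetching and applying the corresponding delta. */
class SkeletonPlanner {
    static class Edge {
        final int to; final long weight;
        Edge(int to, long weight) { this.to = to; this.weight = weight; }
    }

    /* Returns the minimum total delta-fetch cost from 'superRoot' to 'target'. */
    static long cheapestPlanCost(List<List<Edge>> adj, int superRoot, int target) {
        long[] dist = new long[adj.size()];
        Arrays.fill(dist, Long.MAX_VALUE);
        dist[superRoot] = 0;
        PriorityQueue<long[]> pq = new PriorityQueue<>(Comparator.comparingLong(a -> a[0]));
        pq.add(new long[]{0, superRoot});
        while (!pq.isEmpty()) {
            long[] top = pq.poll();
            long d = top[0]; int u = (int) top[1];
            if (d > dist[u]) continue;            // stale queue entry
            if (u == target) return d;            // cheapest path to the requested snapshot
            for (Edge e : adj.get(u)) {
                if (d + e.weight < dist[e.to]) {
                    dist[e.to] = d + e.weight;
                    pq.add(new long[]{dist[e.to], e.to});
                }
            }
        }
        return Long.MAX_VALUE;                    // unreachable (should not happen in a valid skeleton)
    }
}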

3.4.2 Multi-point Snapshot Retrieval

Similar to singlepoint snapshot queries, a multipoint snapshot query can be reduced to finding a Steiner tree in a weighted directed graph. We illustrate this through an example. Consider a multipoint query over three timepoints t1, t2, t3 over the DeltaGraph shown in Figure 3.3(a). We first identify the leaf-level eventlists that contain the three time points, and add virtual nodes St1, St2, St3, shown in Figure 3.5(b) using shaded nodes. The optimal solution to construct all three snapshots is then given by the lowest-weight Steiner tree that connects the super-root and the three virtual nodes (using appropriate weights depending on the attributes that need to be fetched). A possible Steiner tree is depicted in the figure using thicker edges. As we can see, the optimal solution to the multipoint query does not use the optimal solutions for each of the constituent singlepoint queries. Finding the lowest-weight Steiner tree is unfortunately NP-Hard (and much harder for directed graphs than for undirected graphs), so we instead use the standard 2-approximation algorithm [78] for undirected Steiner trees. We first construct a complete undirected graph over the set of nodes comprising the root and the virtual nodes, with the weight of an edge between two nodes set to the weight of the shortest path between them in the skeleton. We then compute the minimum spanning tree over this graph, and "unfold" it to get a Steiner tree over the original skeleton. This algorithm does not work for general directed graphs; however, we can show that, because of the special structure of a DeltaGraph, it not only results in valid Steiner trees but retains the 2-approximation guarantee as well.

Figure 3.5: Example plans for singlepoint and multipoint retrieval on the DeltaGraph shown in Figure 3.3(a): (a) a singlepoint query at t1; (b) a multipoint query over {t1, t2, t3}.
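For reference, the 2-approximation reduces to a minimum spanning tree over the metric closure of the terminals. A compact sketch is below; it assumes the all-pairs shortest-path distances among the terminals have already been computed over the (small) skeleton, e.g., by running the Dijkstra routine above from each terminal, and it returns only the tree cost rather than the unfolded plan.

import java.util.*;

/* Sketch: 2-approximate Steiner tree cost over the skeleton's metric closure.
   terminals = {super-root, virtual query nodes}; dist[i][j] = shortest-path
   weight between terminal i and terminal j in the skeleton. Prim's algorithm
   computes an MST of this complete graph; unfolding each MST edge back into its
   skeleton path yields the actual multipoint retrieval plan. */
class SteinerApprox {
    static long approxSteinerCost(long[][] dist) {
        int n = dist.length;
        boolean[] inTree = new boolean[n];
        long[] best = new long[n];
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;                       // start Prim's from the super-root terminal
        long total = 0;
        for (int iter = 0; iter < n; iter++) {
            int u = -1;
            for (int v = 0; v < n; v++)
                if (!inTree[v] && (u == -1 || best[v] < best[u])) u = v;
            inTree[u] = true;
            total += best[u];
            for (int v = 0; v < n; v++)
                if (!inTree[v] && dist[u][v] < best[v]) best[v] = dist[u][v];
        }
        return total;
    }
}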

3.4.3 Composite Time-expression Graph Retrieval

A composite time-expression can be used to retrieve a graph more complex than a snapshot (or a set of snapshots). For instance, "changes to a network that happened between January and July of 2013" can be written symbolically as G(t1) − G(t2), or simply G(t1 − t2).

One way to perform composite time-expression retrieval queries is to execute a multipoint snapshot query to fetch the required snapshots into memory and then operate upon them to find the components that satisfy the TimeExpression.

However, we present more efficient ways to address such composite graph queries. We provide a set-based formulation of complex time expressions using the operators Union, Intersection, and Difference.

For two valid time points in a graph's history, t1 and t2, where t1 ≠ t2:

• t1 − t2: The difference of two snapshots, the one at t2 subtracted from the one at t1, contains the (portions of) nodes in t1 that are not present in t2. For example, if a node existed in both snapshots, but had a few edgelist elements or attribute key-value pairs that were only in t1 and not in t2, that node, with those components exclusive to t1, appears in the result. However, if the state of a node is present as-is in both snapshots, that node does not appear in the difference. In order to perform the difference of two snapshots, we traverse from one time point to the other, t1 to t2, and read the events on the way. If t1 < t2, then we take all the deletion events and construct the graph using those events, ignoring the deletion instruction; we ignore the addition events in this case. In the case of t1 > t2, we consider the addition events instead of the deletion events. The other way of performing this operation is to fetch the two snapshots at t1 and t2, respectively, and perform the set difference operation; the two-point snapshot retrieval can be made using the Steiner tree formulation. Both alternatives are illustrated in Figure 3.6 (a minimal sketch of the event-replay alternative appears after this list). Depending on the case, one may be more efficient than the other; in terms of query planning, we evaluate the cost of both plans and then choose the cheaper one.

• t1 ∪ t2: The straightforward way of performing a union of two snapshots at t1 and t2 is to fetch both snapshots and perform an element-by-element union. The other way is to retrieve one snapshot, say at t1, traverse the events from there to t2, and add all the addition events to the retrieved snapshot. The fetch cost involved in the second case can be equal to or more than the fetch cost involved in the first; the two are equal when the Steiner tree involves all the events from t1 to t2 as well. However, the advantage of the second approach is that it usually involves less work in performing the union. The amount of work done in performing the union in the first case, (|G(t1)| + |G(t2)|), can be up to two times the size of the output (|G(t1 ∪ t2)|). The work done in the second case is the number of addition events from t1 to t2, which is at most the total number of events between the two, Et1→t2. Depending on the cost estimates of these bounds, the planner decides how to execute the query. Typically, for snapshots close to each other, the second way is preferable, due to small eventlists; for snapshots far apart, it depends on the composition of the eventlist.

• t1 ∩ t2: Similar to the procedures described for the union above, the intersection of two snapshots corresponding to t1 and t2 (t1 < t2) can also be performed in two different ways: first, by fetching the snapshots corresponding to t1 and t2 and then performing the intersection operation; second, by fetching the snapshot at t1, followed by traversing from t1 to t2 and applying all the negative events (note the difference from the union here). If t1 > t2, we apply the addition events as deletion events instead. The query planner decides which plan to choose based upon the estimated costs.
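The following is a minimal sketch of the event-replay alternative for t1 − t2 (assuming t1 < t2). It reuses the illustrative Event type sketched in Section 3.1, simplifies elements to string keys, ignores attribute-level differences, and nets out elements that are deleted and later re-added within the interval: an element belongs to G(t1) − G(t2) exactly when its first structural event in (t1, t2] is a deletion (so it existed at t1) and its last structural event in the interval is also a deletion (so it is absent at t2).

import java.util.*;

/* Sketch: computing t1 − t2 (t1 < t2) by replaying the events in (t1, t2]
   instead of fetching both snapshots. Illustrative only. */
class DifferenceByReplay {
    static Set<String> difference(List<Event> events, long t1, long t2) {
        Map<String, EventType> first = new HashMap<>();
        Map<String, EventType> last = new HashMap<>();
        for (Event e : events) {
            if (e.timestamp <= t1 || e.timestamp > t2) continue;   // only events in (t1, t2]
            if (e.type == EventType.SET_ATTRIBUTE) continue;       // structural sketch only
            String key = elementKey(e);
            first.putIfAbsent(key, e.type);
            last.put(key, e.type);
        }
        Set<String> result = new HashSet<>();
        for (String key : first.keySet())
            if (isDeletion(first.get(key)) && isDeletion(last.get(key)))
                result.add(key);
        return result;
    }

    private static boolean isDeletion(EventType t) {
        return t == EventType.DELETE_NODE || t == EventType.DELETE_EDGE;
    }

    private static String elementKey(Event e) {
        return (e.type == EventType.ADD_EDGE || e.type == EventType.DELETE_EDGE)
                ? "edge:" + e.nodeId + "-" + e.otherNodeId
                : "node:" + e.nodeId;
    }
}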

For more complex expressions that may use a combination of these operators, we sketch a query expression tree and perform the first-level (in the tree) operations using the methods specified above; the higher-level operators are then evaluated by direct computation on the resulting operands from the first-level operations, as illustrated in Figure 3.7.

3.4.4 Speeding up Snapshot Queries using Prefetching

For improving query latencies, some nodes in the DeltaGraph are typically prefetched and materialized in memory. The process of "materialization" also involves an in-memory structure for storing multiple graphs in an overlaid manner, called the GraphPool (explained in Chapter 5). However, at this point it is not essential to know the specifics of GraphPool to understand the benefits of materialization; for now, materialization simply refers to prefetching a (leaf or interior) snapshot into memory and using it while processing snapshot retrieval queries. In particular, the highest levels of the DeltaGraph should be materialized, because that benefits retrieval of snapshots across the board. Further, the "rightmost" leaf (which corresponds to the current graph) should also be considered materialized. The task of materializing one or more DeltaGraph nodes is equivalent to running a singlepoint or a multipoint snapshot retrieval query, and we can use the algorithms discussed above for that purpose. After a node is materialized, we modify the in-memory DeltaGraph skeleton by adding a directed edge with weight 0 from the super-root to that node. Any further snapshot retrieval queries will automatically benefit from the materialization.

Figure 3.6: Two different options for retrieving difference snapshot queries, t1 − t2 or t2 − t1: using multipoint retrieval through the Steiner tree, or using the events from t1 to t2.

Figure 3.7: Query expression tree for t1 ∩ t2 − t3 ∩ t4, with the two ∩ operators at the first level and the − operator at the second level.

The option of memory materialization enables fine-grained runtime control over the query latencies and the memory consumption, without the need to reconstruct the DeltaGraph. For instance, if we know that a specific analysis task may need access to snapshots from a specific period, then we can materialize the lowest common ancestor of the snapshots from that period to reduce the query latencies. One extreme case is what we call total materialization, where all the leaves are materialized in memory. This reduces to the Copy+Log approach, with the difference that the snapshots are stored in memory in an overlaid fashion (in the GraphPool). For mostly-growing networks (which see few deletions), such materialization can be done cheaply, resulting in very low query latencies.

3.4.5 Extensibility

To efficiently support specific types of queries or tasks, it is beneficial to maintain and index auxiliary information in the DeltaGraph and use it during query execution. We extend the DeltaGraph functionality through user-defined modules and functions for this purpose. In essence, the user can supply functions that compute auxiliary information for each snapshot, which will be automatically indexed along with the original graph data. The user may also supply a different differential function to be used for this auxiliary information. The basic retrieval functionality (i.e., retrieving snapshots as of specified time points) is thus naturally extended to such auxiliary information, which can also be loaded into memory and operated upon. In addition, the user may supply functions that operate on the auxiliary information deltas during retrieval, which can be used to directly answer specific types of queries.

The extensibility framework involves the AuxiliarySnapshot and AuxiliaryEvent structures, which are similar to the graph snapshot and event structures, respectively. An AuxiliarySnapshot consists of a hashtable of string key-value pairs, whereas an AuxiliaryEvent consists of the event timestamp, a flag indicating the addition, deletion, or change of a key-value pair, and the key-value pair itself. Given the very general nature of the auxiliary structures, it is possible for the consumer (programs using this API) to define a wide variety of graph indexing semantics whose historical indexing will be done automatically by the HistoryManager. For a historical graph, any number of auxiliary indexes may be used, each one defined by extending the AuxIndex abstract class and implementing the following: (a) a method CreateAuxEvent that generates an AuxiliaryEvent corresponding to a plain Event, based upon the current graph and the latest AuxiliarySnapshot; (b) a method CreateAuxSnapshot that creates a leaf-level AuxiliarySnapshot based upon the previous AuxiliarySnapshot and the AuxiliaryEventList in between the two; and (c) a method AuxDF, a differential function that computes the parent AuxiliarySnapshot given a list of k AuxiliarySnapshots and the corresponding k − 1 AuxEventlists.

Any number of queries may be defined on an auxiliary index by extending one of the three auxiliary query abstract classes, namely AuxHistQueryPoint, AuxHistQueryInterval, or AuxHistQuery, depending on the temporal nature of the query: point, range, or across the entire time span, respectively. An implementation of any of these classes involves attaching a pointer to the AuxiliaryIndex that the queries are supposed to operate upon (in addition to the plain DeltaGraph index itself), and implementing the particular query method, AuxQuery. While doing so, the programmer typically makes use of methods like GetAuxSnapshot provided by the extensibility API.
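The shape of these extension points can be sketched as follows. The method signatures below are a plausible rendering of the description above rather than the exact API; AuxiliarySnapshot is reduced to a string hashtable as described, the Event type is the illustrative one from Section 3.1, and the current graph is left as a plain Object (it would be a HistGraph in the prototype).

import java.util.*;

/* Sketch of the extensibility hooks described above. Signatures are illustrative. */
class AuxiliarySnapshot {
    final Map<String, String> entries = new HashMap<>();   // string key-value pairs
}

class AuxiliaryEvent {
    long timestamp;
    int op;                       // addition, deletion, or change of a key-value pair
    String key, value;
}

/* A user-defined auxiliary index supplies three pieces of logic; the framework
   handles their temporal indexing inside the DeltaGraph. */
abstract class AuxIndex {
    /* (a) effect of a plain graph Event on the auxiliary information */
    abstract AuxiliaryEvent createAuxEvent(Event e, Object currentGraph, AuxiliarySnapshot latest);

    /* (b) next leaf-level auxiliary snapshot from the previous one plus the events in between */
    abstract AuxiliarySnapshot createAuxSnapshot(AuxiliarySnapshot previous,
                                                 List<AuxiliaryEvent> eventsInBetween);

    /* (c) differential function: parent auxiliary snapshot from k children and k−1 eventlists */
    abstract AuxiliarySnapshot auxDF(List<AuxiliarySnapshot> children,
                                     List<List<AuxiliaryEvent>> eventlists);
}

/* A point query over an auxiliary index; the range and whole-history variants look similar. */
abstract class AuxHistQueryPoint<R> {
    protected AuxIndex index;     // the auxiliary index this query operates upon
    abstract R auxQuery(long timePoint);
}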

We illustrate this extensibility through the example of a subgraph pattern matching index. Techniques for subgraph pattern matching are very well studied in the literature (see, e.g., [65]). Say we want to support finding all instances of a node-labeled query graph (called the pattern graph) in a node-labeled data graph. One simple way to efficiently support such queries is to index all paths of, say, length 4 in the data graph [118]. The pattern index here takes the form of a key-value data structure, where a key is a quartet of labels, and the value is the set of all paths in the data graph over 4 nodes that match it. To find all matching instances of a pattern, we decompose the pattern into paths of length 4 (there must be at least one such path in the pattern), use the index to find the sets of paths that match those decomposed paths, and do an appropriate join to construct the entire match.

Figure 3.8: The extensibility framework of DeltaGraph is meant for users or programmers to execute specific historical queries efficiently by only supplying the logic through user-defined functions, while the framework takes care of the temporal aspects of the implementation.

We can extend the basic DeltaGraph structure to support such an index as follows.

The auxiliary information maintained is precisely the pattern index at each snapshot, which is naturally stored in a compact fashion in the DeltaGraph (by exploiting the commonalities over time). It involves implementing the AuxIndex and AuxHistQueryInterval classes accordingly. For example, CreateAuxEvent creates an AuxEvent, defining the addition or deletion of a path of four nodes, by finding the effect of a plain Event (in terms of paths) in the context of the current graph. Instead of using the standard differential function, we use one that achieves the following effect: a specific path containing 4 nodes, say ⟨n1, n2, n3, n4⟩, is present in the pattern index associated with an interior node if and only if it is present in all the snapshots below that interior node. This means that, if the path is associated with the root, it is present throughout the history of the network. Such auxiliary information can now be directly used to answer a subgraph pattern matching query, finding all matching instances over the history of the graph. We evaluated our implementation on Dataset 1 (details in Section 3.7), and assigned labels to each node by randomly picking one from a list of ten labels. We built the index as described above (by indexing all paths of length 4). We were able to run a subgraph pattern query in 148 seconds to find all occurrences of a given pattern query, returning a total of 14109 matches over the entire history of the network.

This extensibility framework (Figure 3.8) enables us to quickly design and deploy reasonable strategies for answering many different types of queries. We note that, for specific queries, it may be possible to design more efficient strategies. In ongoing work, we are investigating the issues in answering specific types of queries over historical graphs, including subgraph pattern matching, reachability, etc.

3.5 DeltaGraph Analysis

Next we develop analytical models for the storage requirements, memory consumption, and query latencies for a DeltaGraph.

3.5.1 Model of Graph Dynamics

Let G0 denote the initial graph as of time 0, and let G|E| denote the graph after |E| events. To develop the analytical models, we make some simplifying assumptions, the most critical being that we assume a constant rate of inserts and deletes. Specifically, we assume that a δ∗ fraction of the events result in the addition of an element to the graph (i.e., inserts), and a ρ∗ fraction of the events result in the removal of an existing element from the graph (deletes). An update is captured as a delete followed by an insert. Thus, we have |G|E|| = |G0| + |E| × δ∗ − |E| × ρ∗. We have δ∗ + ρ∗ ≤ 1, but not necessarily equal to 1, because of transient events that do not affect the graph size. Typically, we have δ∗ > ρ∗. If ρ∗ = 0, we call the graph a growing-only graph.

Note that, the above model does not require that the graph change at a constant rate over time. In fact, the above model (and the DeltaGraph structure) don’t explicitly reason about time but rather only about the events. To reason about graph dynamics over time, we need a model that captures event density, i.e., number of events that take place over a period of time. Let g(t) denote the total number of events that take place from time 0 to time t. For most real-world networks, we expect g(t) to be a super-linear function of t, indicating that the rate of change over time itself increases over time.

3.5.2 Differential Functions

Recall that a differential function specifies how the snapshot corresponding to an interior node should be constructed from the snapshots corresponding to its children. The simplest differential function is intersection. However, for most networks, intersection does not lead to desirable behavior. For a growing-only graph, intersection results in a left-skewed DeltaGraph, where the delta sizes are lower on the part corresponding to the older snapshots. In fact, the root is exactly G0 for a strictly growing-only graph.

Table 3.2 shows several other differential functions with better and tunable behavior. Let p be an interior node with children a and b, let ∆(a, p) and ∆(b, p) denote the corresponding deltas, and let b = a + δab − ρab.

Skewed: For the two extreme cases, r = 0 and r = 1, we have f(a, b) = a and f(a, b) = b, respectively. By using an appropriate value of r, we can control the sizes of the two deltas. For example, for r = 0.5, we get p = a + (1/2)δab. Here (1/2)δab means that we randomly choose half of the events that comprise δab (by using a hash function that maps the events to 0 or 1). So |∆(a, p)| = (1/2)|δab|, and |∆(b, p)| = (1/2)|δab| + |ρab|.

Balanced: This differential function, a special case of Mixed, ensures that the delta sizes are balanced across a and b, i.e., |∆(a, p)| = |∆(b, p)| = (1/2)|δab| + (1/2)|ρab|. Note that here we make the assumption that a + (1/2)δab − (1/2)ρab is a valid operation. A problem may occur because an event in ρab may be selected for removal but may not exist in a + (1/2)δab. We can ensure that this does not happen by using the same hash function for choosing both (1/2)δab and (1/2)ρab.

65 Table 3.2: A list of Differential Functions

Name          Description
Intersection  f(a, b, c, ...) = a ∩ b ∩ c ∩ ...
Union         f(a, b, c, ...) = a ∪ b ∪ c ∪ ...
Skewed        f(a, b) = a + r·(b − a), 0 ≤ r ≤ 1
Right Skewed  f(a, b) = a ∩ b + r·(b − a ∩ b), 0 ≤ r ≤ 1
Left Skewed   f(a, b) = a ∩ b + r·(a − a ∩ b), 0 ≤ r ≤ 1
Mixed         f(a, b, c, ...) = a + r1·(δab + δbc + ...) − r2·(ρab + ρbc + ...), 0 ≤ r2 ≤ r1 ≤ 1
Balanced      f(a, b, c, ...) = a + (1/2)·(δab + δbc + ...) − (1/2)·(ρab + ρbc + ...)
Empty         f(a, b, c, ...) = ∅

Empty: This special case makes the DeltaGraph approach identical to the Copy+Log approach.

The other functions shown in Table 3.2 can be used to expose more subtle trade-offs, but our experience with these functions suggests that, in practice, the Intersection, Union, and Mixed functions are likely to be sufficient for most situations.
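As a concrete illustration of how the Skewed and Balanced functions pick a fraction of δab (and, for Balanced, of ρab), the sketch below selects elements using a stable hash so that the choices are deterministic and, because the same hash is applied to both sets, an event chosen from ρab acts on an element that is also present in a + (1/2)δab, as noted above. The element-to-hash mapping shown is an assumption, not the prototype's.

import java.util.*;

/* Sketch: hash-based selection used by the Skewed/Balanced differential functions.
   f(a, b) = a + r·δab (Skewed) or a + (1/2)δab − (1/2)ρab (Balanced); the subsets of
   δab and ρab are chosen by a deterministic hash so that deltas stay consistent. */
class HashedSelection {
    /* Keeps an element iff its stable hash falls below the fraction r. */
    static <E> Set<E> selectFraction(Set<E> elements, double r) {
        Set<E> chosen = new HashSet<>();
        for (E e : elements) {
            int h = Math.floorMod(e.hashCode(), 1_000_000);
            if (h < r * 1_000_000) chosen.add(e);
        }
        return chosen;
    }

    /* Balanced: p = a + (1/2)δab − (1/2)ρab, with the same hash used for both halves. */
    static <E> Set<E> balanced(Set<E> a, Set<E> deltaAb, Set<E> rhoAb) {
        Set<E> p = new HashSet<>(a);
        p.addAll(selectFraction(deltaAb, 0.5));
        p.removeAll(selectFraction(rhoAb, 0.5));
        return p;
    }
}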

3.5.3 Space and Time Estimation Models

Next, we develop analytical models for various quantities of interest in the DeltaGraph, including the space required to store it, the distribution of the delta sizes across levels, and the snapshot retrieval times. We focus our analysis on the Balanced and Intersection differential functions.

We make several simplifying assumptions in the analysis below. As discussed above, we assume constant rates of inserts and deletes. Let L denote the leaf-eventlist size, and let E denote the complete eventlist corresponding to the historical trace. Thus, we have N = |E|/L + 1 leaf nodes. We denote by k the arity of the graph, and assume that N is a power of k (resulting in a complete k-ary tree). We number the DeltaGraph levels from the bottom, starting with 1 (i.e., the bottommost level is called the first level).

Balanced Function: Although it appears somewhat complex, the Balanced differential function is the easiest to analyze. Consider an interior node p with k children, c1, . . . , ck.

If p is at level 2 (i.e., if the ci's are leaves), then Sp = Sc1 + (1/2)δc1c2 − (1/2)ρc1c2 + (1/2)δc2c3 − · · · . It is easy to show that:

|∆(p, ci)| = (1/2)(|δc1c2| + |ρc1c2| + · · · ) = (1/2)(k − 1)(δ∗ + ρ∗)L, for all i.

The number of edges between the nodes at the first and second levels is N; thus the total space required by the deltas at this level is N · (1/2)(k − 1)(δ∗ + ρ∗)L = (1/2)(k − 1)(δ∗ + ρ∗)|E|.

If p is an interior node at level 3, then the distance between c1 and c2 in terms of the number of events is exactly k(δ∗ + ρ∗)L. This is because c2 contains: (1) the insert and delete events from c1's children that c1 does not contain (amounting to ((k − 1)/2)(δ∗ + ρ∗)L events); (2) the (δ∗ + ρ∗)L events that occur between c1's last child and c2's first child; and (3) a further ((k − 1)/2)(δ∗ + ρ∗)L events from its own children. Using similar reasoning to the above, we can see that:

|∆(p, ci)| = (1/2)(k − 1)k(δ∗ + ρ∗)L, for all i.

Surprisingly, because the number of edges at this level is N/k, the total space occupied by the deltas at this level is the same as that at the first level, and this is true for the higher levels as well.

Thus, the total amount of space required to store all the deltas (excluding the one from the empty super-root node to the root) is:

((log_k N − 1)/2) · (k − 1)(δ∗ + ρ∗)|E|

The size of the snapshot corresponding to the root itself can be seen to be |G0| + (1/2)(δ∗ − ρ∗)|E| (independent of k). Although this may seem high, we note that the size of the current graph, G|E|, is |G0| + (δ∗ − ρ∗)|E|, which is larger than the size of the root.

Further, there is a significant overlap between the two, especially if |G0| is large, making it relatively cheap to materialize the root.

Finally, using the symmetry, we can show that the total weight of the shortest path between the root and any leaf is (1/2)(δ∗ + ρ∗)|E|, resulting in balanced query latencies across the snapshots (for specific timepoints corresponding to the same leaf-eventlist, there are small variations because of the different portions of the leaf-eventlist that need to be processed).
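These Balanced-function estimates translate directly into a small calculator. The sketch below computes the total delta space, the root size, and the root-to-leaf retrieval weight from |G0|, |E|, δ∗, ρ∗, L, and k under the same simplifying assumptions; the class and method names are illustrative, and the formulas simply mirror the expressions above.

/* Sketch: analytical estimates for the Balanced differential function.
   totalDeltaSpace = ((log_k N − 1)/2)(k − 1)(δ* + ρ*)|E|, with N = |E|/L + 1;
   rootSize        = |G0| + (1/2)(δ* − ρ*)|E|;
   rootToLeafCost  = (1/2)(δ* + ρ*)|E|. */
class BalancedEstimates {
    static double totalDeltaSpace(double numEvents, double leafEventlistSize,
                                  int arity, double deltaStar, double rhoStar) {
        double leaves = numEvents / leafEventlistSize + 1;
        double levels = Math.log(leaves) / Math.log(arity);      // log_k N
        return (levels - 1) * 0.5 * (arity - 1) * (deltaStar + rhoStar) * numEvents;
    }

    static double rootSize(double initialGraphSize, double numEvents,
                           double deltaStar, double rhoStar) {
        return initialGraphSize + 0.5 * (deltaStar - rhoStar) * numEvents;
    }

    static double rootToLeafCost(double numEvents, double deltaStar, double rhoStar) {
        return 0.5 * (deltaStar + rhoStar) * numEvents;
    }
}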

Intersection: On the other hand, the Intersection function is much trickier to analyze. In fact, just calculating the size of the intersection for a sequence of snapshots is non-trivial in the general case. As above, consider a graph containing |E| events. The root of the DeltaGraph contains all events that were not deleted from G0 during that event trace.

We state the following analytical formulas for the size of the root for some special cases without full derivations.

ρ∗ = 0: For a growing-only graph, the root snapshot is exactly G0.

δ∗ = ρ∗: In this case, the size of the graph remains constant (i.e., |G|E|| = |G0|). We can show that |root| = |G0| · e^(−|E|δ∗/|G0|).

δ∗ = 2ρ∗: In this case, |root| = |G0|² / (|G0| + ρ∗|E|).

The last two formulas both confirm our intuition that, as the total number of events increases, the size of the root goes to zero. Similar expressions can be derived for the sizes of any specific interior node or the deltas, by plugging in appropriate values of |E| and |G0|.

The Intersection function does have a highly desirable property: the total weight of the shortest path between the super-root and any leaf is exactly the size of that leaf.

Since an interior node contains a subset of the events in each of its children, we only need to fetch the remaining events to construct the child. However, this means that the query latencies are typically skewed, with the older snapshots requiring less time to construct than the newer snapshots (that are typically larger).

3.5.4 Discussion

We briefly discuss the impact of different construction parameters and suggest strategies for choosing the right parameters. We then briefly present a qualitative comparison with interval trees, segment trees, and the Copy+Log approach.

Effect of different construction parameters: The parameters involved in the construc- tion of the DeltaGraph give it high flexibility, and must be chosen carefully. The optimal choice of the parameters is highly dependent on the application scenario and require- ments. The effect of arity is easy to quantify in most cases: higher arity results in lower query access times, but usually much higher disk space utilization (even for the Balanced function, the query access time becomes dependent on k for a more realistic cost model where using a higher number of queries to fetch the same amount of information takes more time). Parameters such as r (for Skewed function) and r1, r2 (for Mixed function)

can be used to control the query retrieval times over the span of the eventlist. For instance, if we expect a larger number of queries to be accessing newer snapshots, then we should choose higher values for these parameters.

The choice of differential function itself is quite critical. Intersection typically leads to lower disk space utilization, but also highly skewed query latencies that cannot be tuned except through memory materialization. Most other differential functions lead to higher disk utilization but provide better control over the query latencies. Thus if disk utilization is of paramount importance, then Intersection would be the preferred option, but other- wise, the Mixed function (with the values of r1 and r2 set according to the expected query workload) would be the recommended option.

Fine-tuning the values of these parameters also requires knowledge of g(t), the event density over time. The analytical models that we have developed reason about the retrieval times for the leaf snapshots, but these must be weighted using g(t) to reason about retrieval times over time. For example, the Balanced function does not lead to uniform query latencies over time for graphs that show super-linear growth. Instead, we must choose r1, r2 > 0.5 to guarantee uniform query latencies over time in that case.

3.6 Construction and Maintenance

3.6.1 Batch Construction

Besides the graph itself, represented as a list of all events in a chronological order, E, the DeltaGraph construction algorithm accepts four other parameters: (1) L, the size of a leaf-level eventlist; (2) k, the arity of the graph; (3) f(), the differential function that com-

putes a combined delta from a given set of deltas; and (4) a partitioning of the node ID space. The DeltaGraph is constructed in a bottom-up fashion, similar to how a bulkloaded

B+-Tree is constructed. We scan E from the beginning, creating the leaf snapshots and corresponding eventlists (containing L events each). When k of the snapshots are created, a parent interior node is constructed from those snapshots. Then the deltas correspond- ing to the edges are created, those snapshots are deleted, and we continue scanning the eventlist.

The entire DeltaGraph can thus be constructed in a single pass over E, assuming sufficient memory is available. At any point during the construction, we may have up to k − 1 snapshots for each level of the DeltaGraph constructed so far. For higher values of k, this can lead to very high memory requirements. However, we have the option of using the GraphPool data structure (explained in Chapter 5) to maintain these snapshots in an overlaid fashion, decreasing the total memory consumption. The memory savings in that case come at the cost of a slight increase in construction time. We were able to scale to reasonably large graphs using this technique. Further scalability is achieved by making multiple passes over E, processing one partition in each pass. DeltaGraph construction can be sped up using parallel processing through one of the following two techniques:

1. By dividing the dataset into multiple partitions by nodeid, as explained above, and

processing each partition in parallel.

2. Besides or instead of the above, the set of events E is divided into contiguous, uniform chunks, maintaining the relative order of events. For each of the smaller event sets, we compute the correct initial graph G0 corresponding to the time of its first event. Using each G0 and its event set, a DeltaGraph is constructed separately, in parallel. Upon completion, the DeltaGraphs are stitched together in the given order to form the overall DeltaGraph.
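A minimal single-pass sketch of the bottom-up construction described above is given below. The function and variable names are illustrative, not the system's actual API: it scans the event list once, cuts leaf snapshots every L events, and emits a parent together with its child deltas whenever k sibling snapshots have accumulated at a level. The leaf eventlists, which the real index also stores, are omitted for brevity.

    def build_deltagraph(events, L, k, f, delta):
        # Single-pass, bottom-up DeltaGraph construction sketch (illustrative only).
        # events : chronological list of ('+', item) or ('-', item) pairs
        # L      : leaf-eventlist size;  k : arity
        # f      : differential function mapping a list of k sibling snapshots to their parent
        # delta  : function computing the delta needed to go from a parent to a child
        # Returns the DeltaGraph edges as (parent_id, child_id, delta) triples.
        pending = {}                        # level -> up to k-1 (id, snapshot) pairs awaiting a parent
        edges, snapshot, counter = [], set(), [0]

        def new_id():
            counter[0] += 1
            return counter[0]

        def add_node(level, node):
            pending.setdefault(level, []).append(node)
            if len(pending[level]) == k:    # k siblings ready: create their parent, emit deltas
                children, pending[level] = pending[level], []
                parent = f([s for _, s in children])
                pid = new_id()
                for cid, child in children:
                    edges.append((pid, cid, delta(parent, child)))
                add_node(level + 1, (pid, parent))

        for i, (op, item) in enumerate(events, 1):
            if op == '+':
                snapshot.add(item)
            else:
                snapshot.discard(item)
            if i % L == 0:                  # leaf boundary: materialize a leaf snapshot
                add_node(1, (new_id(), frozenset(snapshot)))
        return edges

For instance, with graph elements represented as plain sets, f = lambda snaps: frozenset.intersection(*snaps) and delta = lambda parent, child: (child - parent, parent - child) would play the role of the Intersection differential function in this sketch.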

3.6.2 Configuring DeltaGraph

As discussed before, a DeltaGraph is defined by the specification of different parameters: l, df, and k. The user of the index can specify the desired values at the time of construction. A discussion of the effect of different parameter values can be found in Section 3.5.4. The default differential function, Intersection, results in maximum temporal compression, and hence the smallest storage size and retrieval times; other functions, such as the Balanced function, should be used only when there is a specific requirement. Given the set of events E and a particular choice of the arity k, we observed that the average time of snapshot retrieval depends on two different factors: the height (h) and the eventlist size (l) of the DeltaGraph. Both quantities are proportional to the average snapshot retrieval time. This can be justified as follows: the bigger the eventlist size, the higher the number of events an average query reads, and hence the higher the query time; the greater the height of the DeltaGraph, the higher the number of deltas read, and hence the higher the query latency.

We model the average query latency in terms of the two parameters as follows, where a, b, c are empirically determined constants.

Ta = a·h + b·l + c

Using h = log_k(|E|/l) + 1 ≈ log_k(|E|/l), we get

Ta = a·log_k(|E|/l) + b·l + c

Differentiating Ta with respect to l (assuming a differentiable Ta) and equating to 0, we get the following value of l:

l = a / (b·ln k)

Given k and E, the optimal value of l should be set as per this formula. Here a, b, and c are constants that account for the machine hardware, network speed, underlying datastore, and the relative cost of reading an event compared to reading a node of the graph.
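As a concrete illustration of this formula, the short sketch below computes the optimal eventlist size and the resulting modeled latency for a given event count and arity. The numeric constants a, b, c are made-up placeholders; in practice they would be fit empirically on the target hardware and datastore.

    import math

    def optimal_eventlist_size(a, b, k):
        # l that minimizes Ta = a*log_k(|E|/l) + b*l + c, i.e. l = a / (b * ln k)
        return a / (b * math.log(k))

    def modeled_latency(a, b, c, k, num_events, l):
        # The modeled average snapshot retrieval latency Ta for a given l.
        h = math.log(num_events / l, k) + 1          # DeltaGraph height
        return a * h + b * l + c

    # Placeholder constants (would be measured empirically): cost per delta read (a),
    # cost per event read (b), and fixed overhead (c).
    a, b, c, k, num_events = 120.0, 0.004, 15.0, 3, 2_000_000
    l_opt = optimal_eventlist_size(a, b, k)
    print(round(l_opt), round(modeled_latency(a, b, c, k, num_events, l_opt), 1))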

The value of k specifies the maximum arity, or fan-out, of the DeltaGraph. Increasing k for a fixed l and |E| decreases the height h of the structure, and a lower height means fewer delta lookups per snapshot retrieval, which in turn decreases the retrieval time. Empirically, there is a noticeable dip in retrieval times from k = 2 to k = 3, after which the benefit is much more gradual. The downside of a higher k is less temporal abstraction and a higher amount of space required by the DeltaGraph. Consider, for instance, k ≈ |E|/l, which implies an arity as high as the total number of base snapshots. The only temporal abstraction that takes place then is for the elements of the graph that were consistently present throughout its time span. In the likely absence of a non-empty parent snapshot, this essentially reduces the DeltaGraph to a Copy+Log approach, which may be good for snapshot retrieval, but terrible for storage. Given that there is only a marginal benefit in retrieval latencies for arity values higher than 3 or 4, and a steady increase in space requirements, we recommend restricting the arity to k = 3 ± 1. The supporting empirical evidence can be seen in Section 3.7.

3.6.3 Updates

There exist multiple valid DeltaGraphs for a single dataset and configuration. For a chronological set of events E, a differential function f(), eventlist size l, and arity k, we can obtain a variety of DeltaGraphs, such as those illustrated in Figure 3.9.

3.6.3.1 Preliminaries

A Packed DeltaGraph has the following properties for a particular configuration and event set:

• At any level, only one node can have fewer than k children.

• All but one pair of adjacent leaf nodes must be exactly l events apart.

• All nodes use the default differential function.

A Right Packed DeltaGraph is a specific case of the packed DeltaGraph where the node with fewer than k children, or the leaf node pair with fewer than l events in between, is the rightmost one at its respective level. There exists one and only one right packed DeltaGraph for a given configuration and input (the proof is trivial and hence omitted).

An α-Packed DeltaGraph is a right packed DeltaGraph with the following relaxations:

• The node with fewer than k children can have a non-default, empty (φ) differential function.

• A node with a child that has an empty differential function can itself have an empty differential function.

Figure 3.9: Different valid DeltaGraphs for input E and configuration df = f(), k = 3, l. All of them have the same number of nodes at each corresponding level, but the structure is different.

The number of non-default empty differential functions is called the α-difference

(αD) of an α-Packed DeltaGraph. Note that there exists one and only one α-Packed

DeltaGraph for one set of configurations. The β-difference for a level r > 0 (not defined for the leaf level), βr,D, is the minimum number of children its rightmost node is short of k. Finally, the Symmetry Deficiency, SD, is the minimum number of events required for an α-Packed DeltaGraph to attain an α-difference of 0:

SD = k^(⌊log_k(|E|/l + 1)⌋ + 1) · l − |E|

3.6.3.2 Update Procedure

Our strategy to update a DeltaGraph is based upon continuously maintaining an α-Packed

DeltaGraph. Given an α-Packed DeltaGraph D1 constructed over events E, using a differential function df = f, arity k, and eventlist size l, for new events E′ that chronologically follow E, we determine the α-Packed DeltaGraph for the set E + E′. When l new events are pushed to an α-Packed DeltaGraph, they form another snapshot (leaf node), and depending on the prior value of β1,D, it is adjusted into the DeltaGraph. If previously,

• β1,D = 0, a new parent at r = 1 is created, and this is rolled up recursively for the

next level;

• β1,D = 1, the rightmost child at r = 1 is assigned as its parent, the differential function is changed from df = φ to df = f, and the deltas are changed appropriately;

• 1 < β1,D < k, the snapshot is added as a new leaf node and the rightmost node at r = 1 is set as its parent.

In all cases, the following operation is performed at the end: βr+1,D = (βr+1,D + 1) mod k, for all levels r where a node has been added.

Figure 3.10: Different Packed DeltaGraphs for input E and configuration df = f(), k = 3, l. (a) Packed DeltaGraph; (b) Right Packed DeltaGraph; (c) α-Packed DeltaGraph.

The procedure for continuously maintaining an updated α-Packed DeltaGraph works with adding events in batches of multiples of l. While the update is happening, all the new deltas from a parent with less than k children are preferably kept in memory because they are likely to be modified with the change of differential function to f. At the end of up- date, all deltas are flushed to disk. During an update, when a delta is flushed to disk, we give it an alias id, without overwriting any existing deltas. At the end of update, during a lockout period, the old delta-ids are assigned to the aliases, and the old deltas are lost.

This ensures a minimal down time (lockout period) from the DeltaGraph index during update. The procedure trivially extends to non-multiples of l. However, due to reasons of efficiency, we recommend updating in the intervals of l, and storing the remaining events

(fewer than l at any time) in memory. It can be seen that the work required to update a batch of events is always less than or equal to that of applying the same events in smaller batches, because a single batch update reduces extraneous disk reads and writes. Finally, since the updates are coordinated centrally, we ensure data integrity in case of a local update failure.
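The alias-and-swap step of the update can be illustrated with a small sketch. The class and method names below are invented for illustration, and a toy in-memory dictionary stands in for the real delta store: new deltas are first written under alias ids so that nothing is overwritten, and the swap happens during a brief lockout.

    import threading

    class DeltaStore:
        # Toy in-memory key-value store standing in for the real delta store (illustrative only).
        def __init__(self):
            self.kv, self.lock = {}, threading.Lock()
        def put(self, key, value):   self.kv[key] = value
        def get(self, key):          return self.kv.get(key)
        def delete(self, key):       self.kv.pop(key, None)

    def update_with_aliases(store, new_deltas):
        # Write updated deltas under alias ids first, then swap them in during a short
        # lockout so that readers never see a partially updated index.
        aliases = {}
        for delta_id, value in new_deltas.items():        # phase 1: no existing delta is overwritten
            alias_id = ("alias", delta_id)
            store.put(alias_id, value)
            aliases[delta_id] = alias_id
        with store.lock:                                  # phase 2: brief lockout period
            for delta_id, alias_id in aliases.items():
                store.put(delta_id, store.get(alias_id))  # old delta replaced under the original id
                store.delete(alias_id)

    store = DeltaStore()
    store.put("delta:7", b"old bytes")
    update_with_aliases(store, {"delta:7": b"new bytes"})
    print(store.get("delta:7"))                           # b'new bytes'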

3.7 Experiments

In this section, we present the results of a comprehensive experimental evaluation of the performance of our prototype system, implemented in Java using the Kyoto Cabinet key-value store as the underlying persistent storage. The system provides the programmatic API discussed in Section 3.3.3; in addition, we have implemented a Pregel-like iterative framework for distributed processing, and the subgraph pattern matching index presented in Section 3.4.5.

Datasets: We used four datasets in our experimental study.

(1) Dataset 1 is a growing-only, co-authorship network extracted from the DBLP dataset, with 2M edges in all. The network starts empty and grows over a period of seven decades.

The nodes (authors) and edges (co-author relationships) are added to the network, and no nodes or edges are dropped. At the end, the total number of unique nodes present in the graph is around 330,000, and the number of edges with unique end points is 1.04M.

Each node was assigned 10 attribute key-value pairs, generated at random. The growth of the network over time can be seen in Figure 3.11. The growth (node count and edge count) of the network is exponential. The density constantly decreases over time, while the clustering coefficient increases. The effective network diameter2 decreases over time.

(2) Dataset 2 is a randomly generated historical trace with Dataset 1 as the starting snapshot, followed by 2M events in which 1M edges are added and 1M edges are deleted over time.

(3) Dataset 3 is a randomly generated historical trace with a starting snapshot containing

2Effective network diameter is defined as the minimum diameter for at least 90% of the nodes [102].

Figure 3.11: Summary of Dataset 1 over time. (a) Graph growth (events, edges, nodes); (b) network density; (c) effective network diameter; (d) clustering coefficient.

10 million (10M) edges and 3M nodes (from a patent citation network), followed by

100M events, 50M edge additions and 50M edge deletions.

(4) Dataset 4 is a Wikipedia citation network consisting of 110 million edge addition events from January 2001 to October 2007. Its characteristics over time can be seen in

Figure 3.12. Compared to the DBLP network (Dataset 1), the Wikipedia network exhibits a smaller diameter and a lower clustering coefficient.

Experimental Setup: We created a partitioned index for Dataset 3 and deployed a paral- lel framework for PageRank computation using 7 machines, each with a single Amazon

EC2 core and approximately 1.4GB of available memory. Each DeltaGraph partition was approximately 2.2GB. Note that the index is stored in a compressed fashion (using built-in

compression in Kyoto Cabinet). On average, it took us 23.8 seconds to calculate PageRank for a specific graph snapshot, including the snapshot retrieval time. This experiment illustrates the effectiveness of our framework at scalably handling large historical graphs.

Figure 3.12: Summary of Dataset 4 over time. (a) Graph growth (events, edges, nodes); (b) network density; (c) effective network diameter; (d) clustering coefficient.

For the rest of the experimental study, we report results for Datasets 1 and 2; the techniques we compare against are centralized, and further the cost of constructing the index makes it hard to run experiments that evaluate the effect of the construction param- eters. Unless otherwise specified, the experiments were run on a single Amazon EC2 core

(with 1.4GB memory).

Comparison with other storage approaches: We begin by comparing our approach with in-memory interval trees and the Copy+Log approach.

Figure 3.13: Comparing DeltaGraph and Copy+Log snapshot retrieval times. (a) Dataset 1; (b) Dataset 2; (c) Dataset 4. Int and Bal denote the Intersection and Balanced functions, respectively.

Figure 3.13 shows the results of our comparison between Copy+Log and Delta-

Graph approaches for time taken to execute 25 uniformly spaced queries on Datasets 1,

2 and 4. The leaf-eventlist sizes were chosen so that the disk storage space consumed by both the approaches was about the same. For similar disk space constraints (450MB and

950MB for Datasets 1 and 2, respectively), the DeltaGraph could afford a smaller value of L and hence a higher number of snapshots than the Copy+Log approach. As we can see, the best DeltaGraph variation outperformed the Copy+Log approach by a factor of at least 4, and by orders of magnitude in several cases.

Figure 3.14: Performance of different DeltaGraph configurations vs. Interval Tree for Dataset 2. (a) Snapshot retrieval time; (b) index memory consumption.

Figure 3.14 shows the comparison between an in-memory interval tree and two

DeltaGraph variations: (1) with low materialization, (2) with all leaf nodes materialized.

We compared these configurations for time taken to execute 25 queries on Dataset 2, us- ing k = 4 and L = 30000. We can see that both the DeltaGraph approaches outperform the interval tree approach, while using significantly less memory than the interval tree

(even with total materialization). The largely disk-resident DeltaGraph with root’s grand- children materialized is more than four times as fast as the regular approach, whereas the total materialization approach, a more fair comparison, is much faster.

We also evaluated a naive approach similar to the Log technique, with raw events being read from input files directly (not shown in the graphs). The average retrieval times were worse than DeltaGraph's by factors of 20 and 23 for Datasets 1 and 2, respectively.

Multicore Parallelism: Figure 3.15 shows the advantage of concurrent query processing on a multi-core processor using a partitioned DeltaGraph approach, where we retrieve the graph in parallel using multiple threads. We observe high speedups up to 4-6 cores on a single machine, after which the performance saturates, further validating our parallel design.

Figure 3.15: Benefits due to multi-core parallelism for snapshot retrieval on a single machine with a single disk. (a) Dataset 1; (b) Dataset 4.

Multipoint queries: Figure 3.16 shows the time taken to retrieve multiple graphs using our multipoint query retrieval algorithm, and using multiple invocations of the single query retrieval algorithm, on Datasets 1 and 4. The x-axis represents the number of queries, which were chosen to be 1 month apart. As we can see, the advantages of multipoint snapshot retrieval are quite significant because of the high overlap between the retrieved snapshots.

Figure 3.16: Multipoint query execution vs. multiple singlepoint queries. (a) Dataset 1; (b) Dataset 4.

Advantages of columnar storage: Figure 3.17 shows the performance benefits of our columnar storage approach for Dataset 1. As we can see, if we are only interested in the network structure, our approach can improve query latencies by more than a factor of 3.

Effect of DeltaGraph construction parameters: We measured the average query times and storage space consumption for different values of the arity (k) and leaf-eventlist sizes

(L) for Dataset 1. Figure 3.18 shows that with an increase in the arity of the DeltaGraph, the average query time decreases rapidly in the beginning, but quickly flattens. On the other hand, the space requirement generally increases, with small exceptions when the height of the tree does not change with increasing arity. We omit a detailed discussion of the exceptions as their impact is minimal. Again, referring back to the discussion in Section 3.5.4, the results corroborate our claim that higher arity leads to smaller DeltaGraph heights and hence smaller query times, but results in higher space requirements.

Figure 3.17: Snapshot retrieval with and without attributes (Dataset 2).

The effect of the leaf-eventlist size is also as expected. As the leaf-eventlist size increases, the total space consumption decreases (since there are fewer leaves), but the query times also increase dramatically because of a linear increase in the average number of events to be read and processed. This can be seen in Figure 3.19.

Figure 3.21 shows the snapshot retrieval time for two DeltaGraphs on Dataset 2.

One has a uniform eventlist size (15,000), while the other has eventlist sizes of 30,000 and 7,000 for the first and second halves of the query time period, respectively. The difference in performance for the two regions of the second DeltaGraph can be seen as a direct effect of the choice of eventlist size.

Figure 3.18: Effect of varying arity (Dataset 1) on average query time and storage space.

Figure 3.19: Effect of varying eventlist sizes (Dataset 1) on average query time and storage space.

Figure 3.20: Effect of varying eventlist sizes on snapshot retrieval times (Dataset 4).

Figure 3.21: Snapshot retrieval performance on a DeltaGraph with different eventlist sizes for different time periods (Dataset 2).

Differential functions: In Section 3.5.4, we discussed how the choice of an appropriate differential function can help achieve desired distributions of retrieval times for a given network. Figure 3.22(a) compares the behavior of the Intersection and Balanced functions, with and without root materialization, on Dataset 1. Using the Intersection function on the growing-only graph results in skewed query times, with the larger (newer) snapshots taking longer to load. The Balanced function, on the other hand, provides a more uniform access pattern, although the average time taken is higher. By materializing the root node of the Intersection, the average becomes comparable, yielding a uniform access pattern. A few different configurations of the Mixed function are shown in Figure 3.22(b), with r1 = 0.5, r2 = 0.5 denoting the Balanced function. As we can see, by choosing an appropriate differential function, we can exercise fine-grained control over the query retrieval times, thus validating one of our main goals with the DeltaGraph approach.

Figure 3.22: Comparison of snapshot retrieval performance for differential functions on Dataset 1. (a) Intersection and Balanced; (b) different Mixed function configurations.

Backend Storage: While we have reported all experiments so far using Kyoto Cabinet as the backend key-value store for the deltas, the system implementation is easily customizable to use any other key-value store, such as Cassandra or HBase. Our choice to employ Kyoto Cabinet in this prototype was based upon the fact that it provides very fast read and write performance on a local disk. Figure 3.23 shows the snapshot retrieval query latencies for Kyoto Cabinet and Cassandra on a small sample of queries on Dataset 4. We used a single-machine setup with local databases for both, and Cassandra was used without replication. Kyoto Cabinet supports a slightly faster reconstruction process. Cassandra, however, comes with other benefits, which we explore in Chapter 4.

Figure 3.23: DeltaGraph snapshot retrieval time vs. the size of the retrieved snapshot (node count), using single-machine, local configurations of Kyoto Cabinet and Cassandra (Dataset 4).

3.8 Conclusion

In this chapter, we addressed the problem of compactly storing historical data for large information networks and optimally reconstructing past snapshots from it. We presented

DeltaGraph, a distributed hierarchical index structure that exploits temporal redundancy to store the history of a network in the form of snapshot differences, called deltas. It is a

flexible and highly tunable data structure through which we can trade decreased snap- shot retrieval times for increased disk storage requirements. Our experimental evaluation

shows that DeltaGraph is superior to the existing alternatives. We showed both analytically and empirically that the flexibility of DeltaGraph helps control the distribution of query access times through appropriate parameter choices at construction time, and through snapshot prefetching at runtime. Our experimental evaluation demonstrated the impact of many of our optimizations, including multi-query retrieval and columnar storage.

DeltaGraph solves a core problem in historical graph indexing and sets the stage for addressing further retrieval objectives, such as version retrieval, which are presented in

Chapter 4.

Chapter 4: Generalized Historical Graph Storage in the Cloud

4.1 Introduction

In Chapter 3, we studied the problem of compact indexing of historical graphs with the objective of efficient snapshot retrieval using DeltaGraph. Reconstructing past states of a historical graph is perhaps the single most important primitive task required of a historical graph index. However, snapshot retrieval is inefficient for fetching version-centric primitives. For example, in order to analyze the history of a node or neighborhood, we require all the changes that occurred to a particular node or subgraph. Recall that, in snapshot retrieval, we are given specific time point(s) at which all the components of the graph are required to be fetched. On the other hand, for retrieval of versions or history, we are given the node(s) to fetch, but the specific time points at which to fetch them are not a part of the query specification (although the query may contain a constraint on the bounding time interval). Hence, version retrieval calls for a fetch task of an inherently different nature than snapshot retrieval.

The types of retrieval queries of interest in temporal graph analytics can be summarized into three categories: (a) fetching one or more historical snapshots of the graph as of specified time points in the past, (b) retrieving all the changes to a single node over the lifetime of the node, and (c) retrieving the evolution history of a specified neighborhood or connected subgraph over a specified period of time. In order to support all types of temporal analysis tasks over graphs, we need an index that can perform both snapshot- and version-style retrieval efficiently, over different granularities, from a single node to the entire graph. The two fundamentally different approaches to storing and indexing time-evolving data are snapshot indexing (time-centric) and indexing of changes to individual items (entity-centric). These represent two extreme design points, with the former being more suitable for snapshot retrieval queries, and the latter being more suitable for single-node evolution queries (neither supports the third type of query well). One of our motivations behind a new data store for historical graphs is to generalize these two notions into a unified, tunable, and scalable index.

In this chapter, we present Temporal Graph Index (TGI), a highly scalable and elas- tic data store, that compactly indexes the entire history of a graph. It encodes various forms of differences (called deltas) in the graph, such as atomic events, changes in sub- graphs over intervals of time, etc. It uses specific choices of graph partitioning, data replication, temporal compression and data placement to optimize the retrieval of several temporal graph primitives such as neighborhood versions, node histories, and graph snap- shots. TGI is designed to use a distributed key-value store to persist the partitioned deltas, and can thus leverage the scalability afforded by those systems (our implementation uses

Apache Cassandra key-value store). TGI is a tunable index structure, and we investigate the impact of tuning the different parameters through an extensive empirical evaluation.

Since TGI builds upon DeltaGraph, which is optimized for retrieving individual snapshots efficiently, we discuss the differences between the two in more detail in Section 4.2.

Data model: In this chapter, we employ a different data model for evolving graphs than the one used in the dissertation so far. So far, we have considered that, under a discrete notion of time, a time-evolving graph G^T = (V^T, E^T) may be expressed as a collection of graph snapshots over different time points, {G^0 = (V^0, E^0), G^1, ..., G^t}. The vertex set V^i for a snapshot consists of a set of vertices (nodes), each of which has a unique identifier and an arbitrary number of key-value attribute pairs. The edge set E^i consists of edges, each of which contains references to two valid nodes in the corresponding vertex set V^i, information about the direction of the edge, and an arbitrary list of key-value attribute pairs.

In this chapter, our approach to temporal processing is best described using a node-centric logical model, i.e., the historical graph is seen as a collection of evolving vertices over time; the edges are considered as attributes of the nodes. A node contains its own property attributes (key-value pairs), as well as the property attributes of the edges incident on it. These attributes are temporal and change over time. It can be seen that a physical translation of this data model has to handle redundant information in the form of edges, which, although structurally fundamental to the graph, are reduced to a secondary element here. While such modeling may seem counter-intuitive and counter-productive at first, it best mirrors the basis of storage and retrieval in TGI, as well as the further processing of temporal graphs (discussed in Chapter 5). The equivalence between this data model and the one described in Section 1.4 can be established trivially.

Background and Outline: We refer the reader to the prior work in graph indexing dis- cussed in Section 2.3.3 and an account of snapshot retrieval in databases in Section 3.2.

The rest of this chapter is organized as follows. In Section 4.2, we describe the details of the TGI design, construction, and graph fetching; we demonstrate the empirical efficacy of TGI in

Section 4.3, followed by a summary in Section 4.4.

4.2 Temporal Graph Index

As we discussed previously in Section 1.2, it is fairly inefficient to retrieve either snapshots or versions of specific nodes from a Log-based representation, because it requires reading a large volume of information even if the result is small. More efficient indexes on time-evolving datasets can be built along either of the two principal dimensions: time or entity. Time-based indexing methods, such as DeltaGraph or Copy+Log, provide fast access to the graph at specific time points, i.e., snapshots. On the other hand, entity-based indexing approaches, e.g., a node-centric temporal index, provide direct access to different versions of a specific node, but are fairly inefficient at fetching snapshots or subgraphs.

In this section, we investigate the issue of indexing temporal graphs and describe a delta arithmetic to define any temporal index as a set of different changes or deltas.

We then present the Temporal Graph Index (TGI), which stores the entire history of a large evolving network in the cloud and facilitates efficient parallel reconstruction of different graph primitives. TGI is a generalization of both entity-based and time-based indexing approaches and can be tuned to suit specific workload needs. We also describe the various access methods for graph primitives and compare their costs with alternatives using the delta arithmetic.

4.2.1 Preliminaries

We start with a few preliminary definitions that help us formalize the notion of delta arithmetic.

Definition 1 (Static node). A static node refers to the state of a vertex in a network at a specific time, and is defined as a set containing: (a) node-id, denoted I (an integer), (b) an edge-list, denoted E (captured as a set of node-ids), (c) attributes, denoted A, a map of key-value pairs.

A static edge is defined analogously, and contains the node-ids for the two endpoints in addition to a map of key-value pairs. Finally, a static graph component refers to either a static edge or a static node.

Definition 2 (Delta). A Delta (∆) refers to either: (a) a static graph component (including the empty set), or (b) a difference, sum, union or intersection of two deltas.

Definition 3 (Cardinality and Size). The cardinality and the size of a ∆ are the unique and total number of static node or edge descriptions within it, respectively.

Definition 4 (∆ Sum). A sum (+) over two deltas, ∆1 and ∆2, i.e., ∆s = ∆1 + ∆2 is defined over graph components in the two deltas as follows: (1) ∀gc1 ∈ ∆1, if ∃gc2 ∈

∆2 s.t. gc1.I = gc2.I, then we add gc2 to ∆s, (2) ∀gc1 ∈ ∆1 s.t. ∄gc2 ∈ ∆2 s.t. gc1.I = gc2.I, we add gc1 to ∆s, and (3) analogously, the components present only in ∆2 are added to ∆s.

Note that: ∆1 +∆2 is not always equal to ∆2 +∆1. We also note that: ∆1 +∅ = ∆1, and (∆1 + ∆2) + ∆3 = ∆1 + (∆2 + ∆3).

Definition 5 (∆ Difference). A difference (−) is defined as a set difference over the components of the two deltas. ∆1 − φ = ∆1 and ∆1 − ∆1 = φ always hold, while ∆1 − ∆2 = ∆2 − ∆1 does not necessarily hold.

Definition 6 (∆ Intersection). An intersection of two ∆s is defined as a set intersection over the components of the two deltas. ∆1 ∩ φ = φ is true for any delta.

Definition 7 (∆ Union). A union of two deltas ∆∪ = ∆1 ∪ ∆2, consists of all elements from ∆1 and ∆2. The following is true for any delta: ∆1 ∪ φ = ∆1.
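To make these operator semantics concrete, the sketch below models a delta as a map from a component id to its static description and implements the sum, difference, and intersection accordingly. It is an illustration following one reasonable reading of the definitions above, not the system's internal delta representation.

    class Delta:
        # A delta as a map: component id -> static node/edge description (illustrative sketch).
        def __init__(self, components=None):
            self.components = dict(components or {})

        def __add__(self, other):
            # Sum: components present in both keep the right-hand description (clause (1) of Def. 4).
            merged = dict(self.components)
            merged.update(other.components)
            return Delta(merged)

        def __sub__(self, other):
            # Difference: drop components that appear identically in the other delta,
            # so that D - D = empty and D - empty = D.
            return Delta({i: c for i, c in self.components.items()
                          if other.components.get(i) != c})

        def __and__(self, other):
            # Intersection: keep components that appear identically in both deltas.
            return Delta({i: c for i, c in self.components.items()
                          if other.components.get(i) == c})

        def cardinality(self):
            return len(self.components)

    # A snapshot delta plus an eventlist-style delta yields the next snapshot:
    g0 = Delta({1: {"attrs": {"name": "a"}, "edges": []},
                2: {"attrs": {"name": "b"}, "edges": []}})
    changes = Delta({2: {"attrs": {"name": "b"}, "edges": [1]},   # edge (2,1) appears
                     3: {"attrs": {"name": "c"}, "edges": []}})   # node 3 appears
    g1 = g0 + changes
    print(g1.cardinality(), (g1 - g0).cardinality())              # 3 2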

Next we discuss and define some specific types of ∆s:

Example 1 (Event). An event is the smallest change that happens to a graph, i.e., addition or deletion of a node or an edge, or a change in an attribute value. An event is described around one time point. As a ∆, an event concerning a graph component c, at time point te, is defined as the difference of state of c at and before te, i.e., ∆event(c, te) = c(te) − c(te − 1).

Example 2 (Eventlist). An eventlist delta is a chronologically sorted set of event deltas.

An eventlist’s scope may be defined by the time duration, (ts, te], during which it defines all the changes that happened to the graph.

Example 3 (Partitioned Eventlist). A partitioned eventlist delta is an eventlist constrained by the scope of a set of nodes (say N = {N1, N2, ...}), apart from the time range constraint (ts, te].

Example 4 (Snapshot). A snapshot, G^ta, is the state of a graph G at a time point ta. As a ∆, it is defined as the difference of the state of the graph at ta from an empty set: ∆snapshot(G, ta) = G(ta) − G(−∞).

Example 5 (Partitioned Snapshot). A partitioned snapshot is a subset of a snapshot. It is identified by a subset P of all nodes in graph G at time ta. It consists of the state of all nodes in P at time ta, and all the edges at least one of whose endpoints lies in P at time ta.

4.2.2 Prior Techniques

The prior techniques for temporal graph indexing use changes or differences in various forms to encode time-evolving datasets. We can express them in the ∆ framework as follows.

Log: The Log index is equivalent to a set of all event deltas (equivalently, a single eventlist delta encompassing the entire history).

Copy+Log: The Copy+Log index can be represented as combination of: (a) a finite num- ber of distinct snapshot deltas, and (b) eventlist deltas to capture the evolution between successive snapshots.

Vertex-centric: Although we are not aware of a specific proposal for vertex-centric index, a natural approach here would be to maintain a set of partitioned eventlist deltas, one for each node (with edge information replicated with the endpoints).

DeltaGraph: The DeltaGraph index, proposed in our prior work, is a tunable index with several parameters. For a typical setting of parameters, it can be seen as equivalent to taking a Copy+Log index, and replacing the snapshot deltas in it with another set of deltas constructed hierarchically as follows: for every k successive snapshot deltas, replace them with a single delta that is the intersection of those deltas and a set of difference deltas from

the intersection to the original snapshots, and recursively apply this till you are left with a single delta.

Table 4.1: Comparison of access costs for different retrieval queries and index storage on various temporal indexes. In the table, C+L refers to Copy+Log and NC to node-centric indexes; |G| = number of changes in the graph; |S| = size of a snapshot; h = height and |E| = eventlist size in C+L, DG, or TGI; |V| = number of changes to a node; |R| = number of neighbors of a node; p = number of partitions in TGI. The metrics used are the sum of delta cardinalities (Σ_∆ |∆|) and the number of deltas (Σ_∆ 1). Finally, X1* = |G|(h+1); X2** = |G|(2h+3).

Index size:              Log |G|;      Copy |G|²/|E|;      C+L 2|G|;       NC 2|G|;      DG X1*;            TGI X2**
Snapshot, Σ|∆|:          Log |G|;      Copy |S|;           C+L |S|+|E|;    NC 2|G|;      DG h·|S|+|E|;      TGI h·|S|+|E|
Snapshot, Σ1:            Log |G|/|E|;  Copy 1;             C+L 2;          NC N;         DG 2h;             TGI 2h
Static vertex, Σ|∆|:     Log |G|;      Copy |S|;           C+L |S|+|E|;    NC |C|;       DG h·|S|+|E|;      TGI (h·|S|+|E|)/p
Static vertex, Σ1:       Log |G|/|E|;  Copy 1;             C+L 2;          NC 1;         DG 2h;             TGI 2h
Vertex versions, Σ|∆|:   Log |G|;      Copy |S|·|G|/|E|;   C+L |G|;        NC |C|;       DG |G|;            TGI |V|(1+|S|/p)
Vertex versions, Σ1:     Log |G|/|E|;  Copy |G|/|E|;       C+L |G|/|E|;    NC 1;         DG |G|/|E|;        TGI |V|+1
1-hop, Σ|∆|:             Log |G|;      Copy |S|;           C+L |S|+|E|;    NC |R|·|V|;   DG h·(|S|+|E|);    TGI h·(|S|+|E|)/p
1-hop, Σ1:               Log |G|/|E|;  Copy 1;             C+L 2;          NC |R|;       DG 2h;             TGI 2h
1-hop versions, Σ|∆|:    Log |G|;      Copy |S|·|G|/|E|;   C+L |G|;        NC |R|·|V|;   DG |G|;            TGI |V|(1+|S|/p)
1-hop versions, Σ1:      Log |G|/|E|;  Copy |G|/|E|;       C+L |G|/|E|;    NC |R|;       DG |G|/|E|;        TGI |V|+1

Table 4.1 estimates the cost of fetching different graph primitives as the number and the cumulative size of deltas that need to be fetched for the different indexes. The first column shows an estimate of the total storage space, which varies considerably across the techniques.

4.2.3 TGI Description

Given the above formalism, a Temporal Graph Index for a graph G over a time period

T = [0, tc] is described by a collection of different ∆s as follows:

(a) Partitioned Eventlists: A set of partitioned eventlist ∆s, {Etp}, where Etp captures

the changes during the time interval t belonging to partition p.

(b) Partitioned Snapshots: Consider r distinct time points ti, where 1 ≤ i ≤ r and ti ∈ T. For each ti, we consider l partition ∆s, P^i_j, 1 ≤ j ≤ l, such that ∪_j P^i_j = G^ti. There exists a function that maps any node-id I in G^ti to a unique partition-id P^i_j, f_i : I → P^i_j. In practice, these snapshot deltas are not stored in their precise form, but in a hierarchically compressed form, as described below in (c).

(c) Derived Partitioned Snapshots: Given a set of partitioned snapshots, we create a

hierarchical structure similar to the DeltaGraph. Based on the partitioned snapshots,

create parent snapshots per partition, using an intersection function. Further, we

compute the difference between a snapshot and its parent, and call these the derived

partitioned snapshot deltas. These are the deltas that we actually store in the index.

(d) Version Chain: For all nodes N in the graph G, we maintain a chronologically

sorted list of pointers to all the references for that node in the delta sets described

above (a and b). For a node I, this is called a version chain (VC_I).
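Conceptually, the version chain turns a node-history query into a sequence of targeted delta look-ups. The sketch below is simplified and in-memory; the structures and names are invented stand-ins for the stored index, not the actual TGI data layout.

    def fetch_node_versions(node_id, version_chain, delta_store, t_start=None, t_end=None):
        # Follow a node's version chain and return its states in chronological order.
        # version_chain : node id -> chronologically sorted list of (time, delta_key) pointers
        # delta_store   : delta_key -> delta payload (here a dict: node id -> static node description)
        # The optional time-range arguments restrict the retrieval to [t_start, t_end].
        versions = []
        for t, delta_key in version_chain.get(node_id, []):
            if t_start is not None and t < t_start:
                continue
            if t_end is not None and t > t_end:
                break
            delta = delta_store[delta_key]          # one targeted key-value lookup per pointer
            versions.append((t, delta[node_id]))
        return versions

    # Toy data: node 42 changes at times 3 and 9; its chain points straight at the two deltas.
    delta_store = {("ts0", "e1"): {42: {"edges": [7], "attrs": {}}},
                   ("ts0", "e4"): {42: {"edges": [7, 8], "attrs": {}}}}
    version_chain = {42: [(3, ("ts0", "e1")), (9, ("ts0", "e4"))]}
    print(fetch_node_versions(42, version_chain, delta_store))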

In short, the TGI stores deltas or changes in different forms, with a degree of repetition. The first one is the changes in chronological order, organized in clusters through graph partitioning. This facilitates direct access to the changes happening to a cluster or to the graph as a whole. Secondly, the state of a cluster of nodes at different points in time is stored indirectly, in the form of the derived snapshot deltas. This facilitates direct access to the state of a cluster or of the entire graph at a given time. Thirdly, a meta index stores node-wise pointers to the list of chronological changes for each node. This gives us fairly direct access to the changes to individual nodes. Figure 4.1a shows the arrangement of eventlist, snapshot, and derived snapshot partitioned deltas. Figure 4.1b shows a sample version chain.

Figure 4.1: Temporal Graph Index representation. (a) TGI deltas: partitioned eventlists, snapshots, and derived snapshots (the dotted bounded deltas are not stored); (b) version chains; (c) architecture.

4.2.4 Scalable TGI design

So far, we have explained the core ideas behind TGI – exploiting temporal consistency using a hierarchical index of finely partitioned snapshots, finely partitioned eventlists and a map for tracking the positions of nodes’ changes, chronologically. We now discuss the key design and implementation aspects for scaling up TGI through cloud storage:

1. The entire history of the graph is divided into time spans, keeping the number of

changes to the graph consistent across different time spans, ft : e.time → tsid,

where e is the event and tsid is the unique identifier for the time span. This is

illustrated in Figure 4.2.

2. A graph at any point is horizontally partitioned into a fixed number (throughout

graph’s history) of horizontal partitions based upon a random function of the node-

id, fh : nid → sid, where nid is the node-id and sid is unique identifier of for the

horizontal partition.


Figure 4.2: The graph history is divided into non-overlapping periods of time. Such division is based on time intervals after which the locality-based graph partitioning is updated. It is also used as a partial key for data chunking and placement.

3. The micro-deltas (including eventlists) are stored as a key-value pairs, where the

delta-key is composed of {tsid, sid, did, pid}, where did is a delta-id, and pid is

the partition-id of the micro-delta.

4. The placement-key is defined as a subset of the composite deltas key described

above, as {tsid, sid}, which defines the chunks in which data is placed across a set

of machines on a cluster. A combination of the tsid and sid ensure that a large fetch

task, whether snapshot or version oriented, seeks data distributed across the cluster

and not just one machine.

5. The micro-deltas are clustered by the delta key. The order of the delta-key elements besides the placement key (did before pid) means that all the micro-partitions of a delta are

stored contiguously, which makes it efficient to scan and read all micro-partitions

belonging to a delta in a snapshot query. On the other hand, if the order of did

and pid is reversed, it makes fetching a micro-partition across different deltas more

efficient.
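The key scheme in items 3 and 4 can be written down directly. The sketch below uses illustrative names and a placeholder hash-based partitioning function (standing in for the real f_h): it composes a micro-delta's clustering key and projects out the placement key that decides where the chunk lands.

    from collections import namedtuple

    DeltaKey = namedtuple("DeltaKey", ["tsid", "sid", "did", "pid"])

    def delta_key(event_time, node_id, did, pid, ts_of, num_horizontal_partitions):
        # Compose the full clustering key {tsid, sid, did, pid} for a micro-delta.
        # ts_of maps an event time to its time-span id; the horizontal partition id here is a
        # simple hash of the node id, a stand-in for the real partitioning function f_h.
        tsid = ts_of(event_time)
        sid = hash(node_id) % num_horizontal_partitions
        return DeltaKey(tsid, sid, did, pid)

    def placement_key(key):
        # Only {tsid, sid} decide placement, so a large snapshot or version query is spread
        # over many storage machines instead of hitting a single one.
        return (key.tsid, key.sid)

    k = delta_key(event_time=1700, node_id=42, did=5, pid=2,
                  ts_of=lambda t: t // 1000, num_horizontal_partitions=8)
    print(k, placement_key(k))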

103 Irrespective of a temporal or a topological skew in the graph, the index is spread out across a cluster in a balanced manner. This also makes it possible to fetch the graph primitives of large sizes in a naturally parallel manner. For instance, a snapshot query would demand all micro-partitions for a specific set of deltas in a particular timespan across all horizontal partitions. Given an equitable distribution of the deltas across all machines of a cluster, we retrieve the data in parallel on each storage machine, without a considerable skew.

Implementation: TGI uses Cassandra [82] for its delta storage. There are 5 tables that contain TGI data and metadata:

(1) Deltas(tsid, sid, did, pid, dval) table stores the deltas as described above, where dval contains serialized value of the micro-delta as a binary string.

(2) Versions(nid, vchain) consists of each node’s version chain as a hash-table with keys for each timespan.

(3) Timespans(tsid, start, end, checkpts, k, df) stores, for each times- pan, start and end times, a list of snapshot checkpoints, and arity.

(4) Graph(start, end, events, tscount, gtype) contains global infor- mation about the graph and TGI.

(5) Micropartitions(nid, tsid, pid) contains micro-delta partitioning in- formation about nodes. It is not utilized in case of random partitioning.

The graph construction and fetching modules are written in Python, using Pickle and

Twisted libraries for serialization and communication.
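As a concrete, simplified illustration of the Deltas table, the snippet below uses the Python cassandra-driver against a locally running Cassandra instance. The keyspace name, replication settings, and sample values are placeholders for illustration, not the system's actual schema or data; the composite primary key mirrors the clustering and placement scheme described above.

    import pickle
    from cassandra.cluster import Cluster   # pip install cassandra-driver

    cluster = Cluster(["127.0.0.1"])         # placeholder contact point
    session = cluster.connect()
    session.execute("CREATE KEYSPACE IF NOT EXISTS tgi_demo "
                    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
    session.set_keyspace("tgi_demo")

    # Partition key (tsid, sid) acts as the placement key; (did, pid) cluster the
    # micro-partitions of a delta together on disk.
    session.execute("""
        CREATE TABLE IF NOT EXISTS deltas (
            tsid int, sid int, did int, pid int, dval blob,
            PRIMARY KEY ((tsid, sid), did, pid))""")

    insert = session.prepare(
        "INSERT INTO deltas (tsid, sid, did, pid, dval) VALUES (?, ?, ?, ?, ?)")
    micro_delta = {42: {"edges": [7, 8], "attrs": {"label": "x"}}}
    session.execute(insert, (0, 3, 5, 2, pickle.dumps(micro_delta)))

    # All micro-partitions of delta 5 in one placement chunk, read with a single range scan:
    rows = session.execute("SELECT pid, dval FROM deltas WHERE tsid = 0 AND sid = 3 AND did = 5")
    for row in rows:
        print(row.pid, pickle.loads(row.dval))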

Architecture: The TGI architecture can be seen in Figure 4.1c, where the Query Manager (QM) is responsible for planning, dividing, and delegating the query to one or more Query Processors (QPs). The QPs query the datastore in parallel and process the raw deltas into the required result. Depending on the query specification, the distributed result is either aggregated at a particular QP (the QM) or returned to the client that made the request without aggregation. The Index Manager is responsible for the construction and maintenance activities of the index. The cloud represents the distributed datastore.

Construction and Update: The construction process involves three different stages. First, we analyze the input data using the index construction parameters including the timespan length (ts), number of horizontal partitions (ns), number of likely datastore nodes (m), eventlist size (l), and micro-delta partition size (psize). In the second stage, the input data is split into horizontal partitions. In the third stage, parallel construction workers of the

IM work on separate horizontal partitions, and build the index, a time span at a time. The process of construction of each timespan is similar to that of DeltaGraph, albeit more

fine-grained due to delta partitioning and version chain construction as well. The TGI accepts updates of events in batches of timespan length. The update process involves cre- ating an independent TGI with the new events, and merging it with the original TGI. The merger of TGIs involves updates of corresponding deltas, VC index and the metadata.

We refer the reader to Section 3.6 for details on DeltaGraph construction and updates.

4.2.5 Partitioning and Replication

The key to the scalability of TGI is partitioning, and it employs several types of partitioning that are orthogonal to each other. These include horizontal partitioning by nodes (performed at two different stages during construction), temporal partitioning across time, and columnar partitioning within nodes to store different attributes separately. The principal reason for partitioning into small, fine-grained deltas is to speed up subgraph or node-centric retrieval queries. The original DeltaGraph design [73] used bigger deltas, which lead to wasteful I/O for such queries that target a small portion of the graph (as reflected in Table 4.1). Smaller delta sizes significantly improve the performance of such queries. On the other hand, smaller delta sizes reduce the performance of snapshot retrieval queries, both because of the larger number of queries required and the additional effort in putting the deltas together into the snapshot. The different construction parameters of TGI can be used to control the performance of one class of queries at the expense of the other.

TGI also performs replication at several different places (on top of the replication done by the distributed datastore for persistence). First, information about an edge is replicated in two different deltas if its endpoints are split. In addition, we may choose to replicate node information to speed up node or neighborhood retrieval queries. More specifically, consider a 1-hop neighborhood retrieval query for a given node u. If the nodes are randomly partitioned across the deltas, then fetching each of the neighbors of the node u is likely to require a separate query to fetch a separate delta. We address this problem through two mechanisms. First, we allow using a graph partitioning algorithm

to create the partitions rather than creating them randomly (i.e., using a hash function).

Although the problem of finding optimal k-way partitioning of a graph is NP-complete, several efficient heuristics are known that can scale to large graph sizes. The disadvantage of this approach is that we need to maintain a node-to-partition mapping explicitly to keep track of where a node is located.

Second, we replicate the data of the so-called boundary nodes, i.e., nodes that have neighbors in other partitions, into those partitions [107]. Such replication helps speed up 1-hop retrieval, as each node's 1-hop neighborhood is now contained in exactly one micro-delta. However, it affects the performance of snapshot retrieval and single-node retrieval. In order to avoid such extra penalties and still be able to perform fast

1-hop retrieval, we use the following strategy. The entire replicated portion for a micro-delta (i.e., the parts that have been replicated from other micro-deltas) is stored as a separate micro-delta, called an auxiliary micro-delta. This simply means that each (static) 1-hop retrieval now requires two micro-delta look-ups, and there is no negative impact on the performance of snapshot or node retrieval. This is illustrated in Figure 4.3.
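The two-look-up retrieval can be sketched as follows; the store interface and names are illustrative assumptions, not the actual TGI API.

    def fetch_one_hop(node_id, time_point, partition_of, load_micro_delta):
        # Fetch a node's static 1-hop neighborhood with exactly two micro-delta look-ups.
        # partition_of     : node id -> micro-partition id at the given time
        # load_micro_delta : (partition id, time, aux flag) -> {node id: static node description}
        pid = partition_of(node_id)
        primary = load_micro_delta(pid, time_point, aux=False)    # node plus co-located neighbors
        auxiliary = load_micro_delta(pid, time_point, aux=True)   # replicated boundary neighbors
        center = primary[node_id]
        neighborhood = {node_id: center}
        for nbr in center["edges"]:
            neighborhood[nbr] = primary.get(nbr) or auxiliary.get(nbr)
        return neighborhood

    # Toy usage with in-memory stand-ins for the store:
    stored_micro_deltas = {("p1", 10, False): {1: {"edges": [2, 3], "attrs": {}},
                                               2: {"edges": [1], "attrs": {}}},
                           ("p1", 10, True):  {3: {"edges": [1], "attrs": {}}}}
    load = lambda pid, t, aux: stored_micro_deltas.get((pid, t, aux), {})
    print(fetch_one_hop(1, 10, lambda n: "p1", load))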

The amount of replication needed to ensure that all neighborhoods are fully contained within some partition unfortunately tends to be quite high. We partitioned 3 different graphs using Metis for the same partition size, N/k = 1000: a road network, the DBLP co-authorship network, and the Orkut friends network; those three required 8%, 60%, and 352% replication, respectively, to ensure that every node's 1-hop neighborhood is located in the same partition as the node. To combat this, we also allow partial replication, e.g., replicating the edgelists of neighbors that form triangles around a node, in order to answer 1.5-degree ego network1 queries effectively without replicating entire nodes. The choice of whether to employ these strategies can also be made based on the workload and the frequency of neighborhood retrieval queries.

Figure 4.3: Graph partitioning using a min-cut strategy, along with 1-hop replication and the use of auxiliary micro-deltas, improves 1-hop neighborhood retrieval performance without affecting the performance of snapshot or node retrieval. (a) Graph snapshot; (b) random partitioning of the graph snapshot with a high number of edge cuts; (c) min-cut partitioning of the snapshot; (d) min-cut partitioning of the snapshot with edge-cut replication and the auxiliary storage strategy.

1A 1.5 degree ego network of a node is an extension of the 1-hop ego network with mutual connections between the node’s neighbors.

4.2.6 Dynamic Graph Partitioning

We mentioned previously that the horizontal partitioning of snapshots and eventlists enables efficient retrieval of smaller portions of the graph, such as nodes or neighborhoods, without the need to fetch the entire snapshot. The key idea is to keep the size of each micro-delta (and each micro-eventlist) about the same and bounded by a number that dictates the latency for fetching a node or neighborhood. In a time-evolving graph, however, the size and topology of the graph change with time. So, a partitioning scheme suitable for a snapshot at one time point may not be suitable at a later time. Ideally, we could repartition or update a partitioning frequently so as to maintain a good partitioning2 as per a particular graph partitioning heuristic. However, we also require a map for tracking partitions each time we change the partitioning scheme, and maintaining and looking up that map as frequently as the changes in the graph is highly inefficient. Hence, we divide the history of the graph into time spans, where we keep the partitioning consistent within each time span, but perform it afresh at the beginning of each new time span. This gives rise to two problems, described briefly as follows. Firstly, given a graph over a time span T ∈ [ts, te), find the graph partitioning that minimizes the edge cuts across all time points combined. Secondly, determine the appropriate points for the end of a time span and the beginning of a new one, with respect to overall query performance. We discuss these problems below.

Static graph partitioning of an undirected and unweighted graph G = (V, E) into k partitions is defined as follows. Each node vi ∈ V is assigned to a partition Pr such that 0 ≤ r < k. The constraint is that ⌊|V|/k⌋ ≤ |Pr| ≤ ⌈|V|/k⌉, i.e., the partitions are roughly equal in size. The objective is to minimize the number of edge cuts across partitions, i.e., the count of all edges whose end points lie in different partitions. For a weighted graph, the edge-cut cost is counted as the sum of the cut edge weights, which pushes stronger relationships (with higher edge weights) to be preferred for placement within the same partition over weaker ones. Also, in the case of a node-weighted graph, the partition size constraint can be expressed using node weights instead of node counts. Different graph partitioning algorithms work under these constraints using one heuristic or another, as described before.

²Optimal graph partitioning is NP-complete. We use the adjective "good" in reference to the quality of a partitioning, with respect to the graph snapshot partitioning heuristic used in a particular algorithm.
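For reference, the unweighted objective just described can be written compactly as follows (a restatement of the definition above in the same notation):

\[
\min_{P_0,\ldots,P_{k-1}} \;\Big|\{\, (u,v) \in E \;:\; u \in P_r,\; v \in P_{r'},\; r \neq r' \,\}\Big|
\quad \text{s.t.} \quad
\left\lfloor \tfrac{|V|}{k} \right\rfloor \;\le\; |P_r| \;\le\; \left\lceil \tfrac{|V|}{k} \right\rceil
\quad \text{for all } 0 \le r < k .
\]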

For dynamic graph partitioning, we consider an edge- and node-weighted, undirected, time-evolving graph, without loss of generality. Consider the graph G^T = (V^T, E^T, W_E^T, W_N^T), where T = [ts, te) is the time range for which we find a single partitioning, and V^T, E^T, W_E^T, W_N^T are the sets of vertices, edges, edge weights, and node weights over time T, respectively. Our partitioning strategy involves projecting the graph over the time range T to a single point in time using a time-collapsing function Ω, thereby reducing the graph G^T to a static graph, G_T = Ω(G^T). The constraint on the function Ω is that G_T must contain all the vertices that existed at least once in G^T. Using G_T, we can employ static graph partitioning to find a suitable partitioning in the following manner.

The choice of the Ω function determines how well G_T represents G^T. Let us consider a few different options for the edge weights. (1) Median: consider the time point t that is the median of the end points of T; the edges and their weights in G_T are those in G^t. (2) Union-Max: for an edge that existed at any time in G^T, we include it in G_T with its weight set to the maximum value it attained over all time points in G^T. (3) Union-Mean: for an edge that existed at any time in G^T, we include it in G_T with its weight set to the time-weighted average of its edge weights in G^T; non-existence of an edge during a time period counts as a zero contribution for that period. For any of the cases above, the node weight wn can be defined independently of the edge set and edge weights. We consider three options: (1) wn = 1 for each node n in G_T; (2) wn = degree(n), the degree of node n in G_T; (3) wn equal to the average degree of n over T.

Given these different heuristic combinations, we plan to study their empirical behavior and use the most suitable one for TGI partitioning. The default TGI partitioning uses Union-Max for edge weights and uniform node weights.
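As an illustration of the default choice, the following Java sketch collapses a list of timestamped edge observations into a static weighted graph using Union-Max edge weights (node weights are left uniform); the EdgeObservation record and the string edge keys are simplified assumptions, not TGI internals.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Union-Max time-collapsing function: for every
// edge seen at any time in G^T, keep the maximum weight it ever had; every
// node implicitly gets a uniform weight of 1, the default described above.
class UnionMaxCollapse {
    // A simplified timestamped edge observation (u, v, weight at that time).
    record EdgeObservation(long u, long v, double weight) {}

    static Map<String, Double> collapse(Iterable<EdgeObservation> observations) {
        Map<String, Double> staticEdges = new HashMap<>();
        for (EdgeObservation obs : observations) {
            // Canonical undirected edge key "min-max".
            String key = Math.min(obs.u(), obs.v()) + "-" + Math.max(obs.u(), obs.v());
            staticEdges.merge(key, obs.weight(), Math::max);
        }
        // The result can be handed to any static k-way partitioner, e.g., Metis.
        return staticEdges;
    }
}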

We argue that this style of partitioning, which first projects a temporal graph onto a static one and then applies conventional static graph partitioning, is better than other conceivable alternatives. One such alternative is to determine a partitioning at each of the different time points in T, say P^T, and then reduce P^T to a single partitioning scheme P_T. This approach has two major disadvantages. Firstly, the output partitions of a static graph partitioning algorithm for two versions of the graph, say G^1 and G^2, are not aligned, even when the two snapshots are similar to a large extent; this is attributed to some degree of randomness in graph partitioning algorithms, and it makes it infeasible to combine P^1 and P^2 into a single result. Secondly, this approach is much more expensive than ours, because it involves computing partitionings at all time points in T. Another alternative is to use one of the online graph partitioning algorithms, which update a partition set upon a small change in the graph. However, the output of such an approach only gives us partitioning schemes at different time points. The partitions across time are better aligned with each other than in the previous approach, but we would still need to compute a combined partitioning from all the available partitionings, so the notion of time collapsing is inevitable. Moreover, the results of incremental graph partitioning are often inferior to those of batch-mode partitioning.

Determining the appropriate number and the exact boundaries of the time spans is another important issue. The motivation for creating a larger number of time spans, and hence reducing the duration of each, is to maintain a healthier partitioning. Let the hit taken on query latencies (assuming a certain query load Q) due to a subpar snapshot partitioning be f(T). This hit is generally incurred by k-hop queries without replication, due to a higher number of micro-delta seeks; with replication across partitions, the degree of replication increases as the partitioning degrades, which indirectly impacts query latencies. On the other hand, there is a need to create longer time spans because version queries touch multiple micro-deltas at different time points: the more the partitioning changes over a query's time interval, the higher the query latency. Let the gain due to longer time spans, for an average query time interval (again, per a specific query load), be g(T). The appropriate average time-span length is hence the one that maximizes g(T) − f(T). In practice, a uniform time-span length measured in the number of events is perhaps the most convenient; while the models of f and g are complex, a good value for the size of T can be determined empirically.
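In symbols, with f and g as defined above for the assumed query load, the target time-span length is

\[
T^{*} = \arg\max_{T} \; \bigl( g(T) - f(T) \bigr),
\]

which, as noted, is in practice located empirically rather than by modeling f and g exactly.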

4.2.7 Fetching Graph Primitives

Now, we briefly describe how the different types of retrieval queries are executed.

Snapshot Retrieval: In snapshot retrieval, the state of a graph at a time point is retrieved.

Given a time ts, the query manager locates the appropriate time span T such that ts ∈ T, within which it determines the path from the root of the TGI to the leaf closest to the given time point. All the snapshot deltas along that path, ∆s1, ∆s2, ..., ∆sm (i.e., all their micro-partitions), and the eventlists from the leaf node to the time point, ∆e1, ∆e2, ..., ∆en, are fetched and merged appropriately as Σ_{i=1}^{m} ∆si + Σ_{i=1}^{n} ∆ei (notice the order). This is performed across different query processors covering the entire set of horizontal partitions.
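The order-sensitive merge above can be sketched as follows; the Delta interface and its applyTo method are hypothetical placeholders for the TGI delta representation.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of snapshot reconstruction: apply the snapshot deltas
// along the root-to-leaf path first, then the eventlist deltas from the leaf
// up to the requested time point, in that order.
class SnapshotAssembler {
    interface Delta {
        // Applies this delta's additions/deletions to the partial snapshot.
        void applyTo(Map<Long, long[]> partialSnapshot);
    }

    static Map<Long, long[]> reconstruct(List<Delta> snapshotDeltas, List<Delta> eventlistDeltas) {
        Map<Long, long[]> snapshot = new HashMap<>();
        for (Delta d : snapshotDeltas) {   // Δs_1 ... Δs_m along the tree path
            d.applyTo(snapshot);
        }
        for (Delta d : eventlistDeltas) {  // then Δe_1 ... Δe_n up to time ts
            d.applyTo(snapshot);
        }
        return snapshot;
    }
}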

Node's history: Retrieving a node's history during a time interval [ts, te) involves finding the state of the node at point ts and all changes during the range (ts, te). The first part is done in a manner similar to snapshot retrieval, except that we look up only the specific micro-partition, in the specific horizontal partition, that the node belongs to. The second part fetches the node's version chain to determine its points of change during the given range; the respective eventlists are then fetched and filtered for the given node.

k-hop neighborhood (static): In order to retrieve the k-hop neighborhood of a node, we can proceed in two possible ways. One is to fetch the whole graph snapshot and filter out the required subgraph. The other is to fetch the given node, determine its neighbors, fetch them, and so on. It is easy to see that the performance of the second method deteriorates quickly with growing k. However, for lower values, typically k ≤ 2, the latter is faster or at least as good, especially if we are using neighborhood replication as discussed in a previous subsection. In the case of a neighborhood fetch, the query manager automatically fetches the auxiliary portions of the deltas (if they exist), and if the required nodes are found there, further lookup is terminated.

Neighborhood evolution: Neighborhood evolution queries can be posed in two different ways. The first is a request for all changes to a specified neighborhood, in which case the query manager fetches the initial state of the neighborhood followed by the events indicating the changes. The second is a request for the state of the neighborhood at multiple specific time points, which translates into multiple single-neighborhood fetch tasks.
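The iterative strategy for small k (the second method above) can be sketched as follows; the NodeStore interface and frontier bookkeeping are simplified, hypothetical stand-ins for the actual TGI query manager.

import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of iterative k-hop retrieval: fetch the seed node, then
// repeatedly fetch the current frontier's unseen neighbors, k times. As noted
// above, this is only competitive for small k (typically k <= 2).
class KHopFetcher {
    interface NodeStore {
        long[] fetchNeighbors(long nodeId); // one micro-delta (plus auxiliary) lookup
    }

    static Set<Long> fetchKHop(NodeStore store, long seed, int k) {
        Set<Long> visited = new HashSet<>();
        Set<Long> frontier = new HashSet<>();
        visited.add(seed);
        frontier.add(seed);
        for (int hop = 0; hop < k; hop++) {
            Set<Long> next = new HashSet<>();
            for (long node : frontier) {
                for (long neighbor : store.fetchNeighbors(node)) {
                    if (visited.add(neighbor)) {
                        next.add(neighbor); // newly discovered at this hop
                    }
                }
            }
            frontier = next;
        }
        return visited;
    }
}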

4.3 Evaluation

The efficacy of a system such as HGS is determined by two factors: first, whether the functionality provided by the system can be easily used to achieve a set of meaningful objectives, in this case a wide range of analytical tasks on historical graphs; and second, whether the system accomplishes these objectives efficiently in terms of time and other resources. In this section, we evaluate TGI on the second factor.

To recap, TGI is a persistent store for the entire histories of large graphs that enables fast retrieval of snapshots, subgraphs, versions of nodes or neighborhoods, etc. The range of different retrieval primitives poses two distinct trade-offs in indexing and retrieval. First, the size of a subgraph can vary from a single node to the entire graph. Second, the history of a subgraph to be retrieved can range from a single time point to the graph's entire time span.

We show how TGI captures these trade-offs. Note that we do not provide experimental results on the internals of snapshot retrieval which have been thoroughly explored in the context of DeltaGraphs (Section 3.7). Additionally, we report results for demonstrating its scalability over large data and heavily parallel fetching.

Datasets and Notation: We use four datasets: (1) Dataset 1 is the Wikipedia citation network consisting of 266,769,613 edge addition events from Jan 2001 to Sept 2010. At its largest point, the graph consists of 21,443,529 nodes and 122,075,026 edges; (2) We augment Dataset 1 by adding around 333 million synthetic events which randomly add new edges or delete existing edges over a period of time, making a total of 600 million events; (3) Dataset 3 is obtained similarly from Dataset 1 by adding 733 million events, making the total around 1 billion events; (4) Dataset 4 is derived from a snapshot of the

Friendster gaming network by adding synthetic dates at uniform intervals to 500 million events with a total of approximately 37.5 million nodes and 500 million edges.

Some of the key parameters that are varied in the experiments include: data store machine count (m), replication across dataset (r), number of parallel fetching clients (c), eventlist size (l), snapshot or eventlist partition size (ps), and size of the Spark analytics cluster (ma).

We conducted all experiments on an Amazon EC2 cluster. Cassandra ran on machines with 4 cores and 15GB of available memory. We did not use row caching, and the actual memory consumption was much lower than the available limit on those machines. Each fetch client ran on a single core with up to 7.5GB of available memory.

Snapshot retrieval: Figure 4.4 shows the snapshot retrieval times for Dataset 1 for dif- ferent values of the parallel fetch factor, c. As we can see, the retrieval cost is directly proportional to the size of the output. Further, using multiple clients to retrieve the snap- shots in parallel gives near-linear speedup, especially with low parallelism. This demon- strates that TGI can exploit available parallelism well. We expect that with higher values of m (i.e., if the index were distributed across a larger number of machines), the linear speedup would be seen for larger values of c (this is also corroborated by the next set of experiments). The snapshot retrieval times for dataset 4 can be seen in Figure 4.5.

[Plot: retrieval time (secs) vs. snapshot size (node count), for c = 1, 2, 4, 8, 16, 32.]

Figure 4.4: Snapshot retrieval times for varying parallel fetch factor (c), on Dataset 1; m = 4; r = 1, ps = 500.

[Plot: retrieval time (secs) vs. snapshot size (node count) for the Friendster dataset.]

Figure 4.5: Snapshot retrieval times for Dataset 4; m = 6; r = 1, c = 1, ps = 500.

Figure 4.6 shows snapshot retrieval performance for three different sets of values of m and r. While there is no considerable difference in performance across the configurations, using two storage machines slightly decreases the query latency over using one machine in the case of a single query client, c = 1. For higher c values, m = 2 has a slight edge over m = 1. Also, the behavior of the m = 1 and the m = 2, r = 2 configurations is quite similar for the same c values; however, we observed that the latter sustains higher values of c, whereas the former peaks out at a lower c value.

Further on snapshot retrieval, Figure 4.7a shows that for large snapshots, the bulk of the time is spent in post-processing, mostly deserialization of the deltas, rather than in fetching the binary-encoded deltas from the data store. The net effect of Cassandra compression for deltas is negligible for TGI; Figure 4.7b is representative of the general behavior.

The size (or, equivalently, the number) of the delta partitions affects snapshot retrieval performance only to a small degree, as seen in Figure 4.7c. This is due to the TGI design, which ensures that all the partitions of a delta (micro-deltas) are stored contiguously in the cluster. This demonstrates that TGI is a superset of DeltaGraph, in that we are able to handle other queries along with efficient snapshot retrieval.

Figure 4.6: Snapshot retrieval times across different m and r values on Dataset 1. Each panel plots retrieval time (sec) against retrieved snapshot size (node count) for varying c: (a) m = 1, r = 1, ps = 500; (b) m = 2, r = 1, ps = 500; (c) m = 2, r = 2, ps = 500.

Figure 4.7: Snapshot retrieval across various parameters: (a) fetching and postprocessing times; (b) compressed vs. uncompressed delta storage (m = 2, c = 8, r = 1); (c) effect of partition sizes (m = 4, c = 8; ps = 1000, 2000, 4000).

Node History Retrieval: Smaller eventlists and smaller partition sizes provide lower latency for retrieving different versions of a node, as can be seen in Figures 4.8a and 4.8c, respectively, for Dataset 1. This is primarily due to the reduction in the work required for fetching and deserialization. A higher parallel fetch factor is effective in reducing the latency of version retrieval (Figure 4.8b). Note that the impact of varying partition sizes on version retrieval performance is opposite to its impact on snapshot retrieval performance; hence, we are presented with a trade-off in selecting the appropriate partition size. However, smaller eventlist sizes benefit both version retrieval and snapshot retrieval. Node version retrieval for Dataset 4 shows similar behavior, as seen in Figure 4.9.

Neighborhood Retrieval: We compared the performance of retrieving 1-hop neighborhoods, both static and for specific versions, using different graph partitioning and replication choices. A topological, flow-based partitioning accesses fewer graph partitions than a random partitioning scheme, and 1-hop neighborhood replication restricts the access to a single partition. This can be seen in the 1-hop neighborhood retrieval latencies in Figure 4.10a. As discussed in Section 4.2, the 1-hop replication does not affect other queries involving snapshots or individual nodes, as the replicated portion is stored separately from the original partition. For 2-hop neighborhood retrieval, there are similar performance benefits over random partitioning, which follow from the speed-ups observed for 1-hop neighborhoods.

Figure 4.8: Node version retrieval across various parameters for Dataset 1: (a) effect of eventlist size, l; (b) speedups due to parallel fetch factor, c; (c) effect of partition size.

Figure 4.9: Node version retrieval for Dataset 4; m = 6; r = 1, c = 1, ps = 500.

Increasing Data Over Time: We observed the fetch performance of TGI with an increasing size of the index. We measured the latencies for retrieving certain snapshots while varying the time duration of the graph dataset, as shown in Figure 4.10b. Datasets 2 and 3 contain an additional 333 million and 733 million events over Dataset 1, respectively. The only marginal difference in snapshot retrieval performance demonstrates the scalability of our system for large datasets.

4.4 Conclusion

Most real-world networks are large and highly dynamic. This leads to the creation of very large histories, which are challenging to store, query, and analyze. In this chapter, we presented a novel structure, the Temporal Graph Index (TGI), that enables compact storage of very large historical graph traces in a distributed fashion, supporting a wide range of retrieval queries to access and analyze only the required portions of the history. TGI is designed for elastic storage and query processing, suitable for most real-world graphs.

Figure 4.10: Experiments on partitioning type and replication, and on growing data size: (a) retrieval times for 1-hop neighborhoods with different partitioning and replication strategies (random, max-flow, max-flow with replication); (b) snapshot retrieval for varying sizes of datasets.

TGI builds on top of DeltaGraph, which is optimized to handle snapshot retrieval. TGI, on the other hand, provides efficient access not only to snapshots but also to version queries, and is comparably efficient in fetching graph components of varying sizes, from a single node to an entire snapshot. Certain aspects of the TGI design, such as the different forms of partitioning in dynamic graphs, enable it to handle the topological and temporal skew typical of evolving graph datasets, and make it suitable for balanced distributed computing and elastic scalability.

We briefly summarize a couple of the implicit and explicit assumptions made by the Temporal Graph Index. First, it is assumed that the input temporal graph data is presented entirely in the form of time-annotated events; missing time points or missing events are not acceptable in a valid input dataset. This also implies that a dataset provided in the form of timestamped snapshots is best interpreted as a set of events occurring at the time of the corresponding snapshot, derived by computing a delta from the previous (or an empty) snapshot. Second, TGI only supports "appends" and not modifications, i.e., changes to a state later than the latest one, but not changes to time points already covered by the index.
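A snapshot-to-events conversion of the kind implied by the first assumption could look like the following sketch, assuming a simplified edge-set snapshot representation; the types and names are hypothetical.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: derive time-annotated events for a timestamped snapshot
// by diffing it against the previous (possibly empty) snapshot, as suggested
// by the first assumption above.
class SnapshotToEvents {
    record Edge(long u, long v) {}
    record Event(long timestamp, String type, Edge edge) {}

    static List<Event> toEvents(Set<Edge> previous, Set<Edge> current, long timestamp) {
        List<Event> events = new ArrayList<>();
        for (Edge e : current) {
            if (!previous.contains(e)) {
                events.add(new Event(timestamp, "ADD_EDGE", e));    // newly appeared
            }
        }
        for (Edge e : previous) {
            if (!current.contains(e)) {
                events.add(new Event(timestamp, "DELETE_EDGE", e)); // disappeared
            }
        }
        return events;
    }
}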

Chapter 5: Temporal Graph Analytics

5.1 Introduction

Historical graphs present a rich set of analytical opportunities. Distinct notions of static graph analysis and temporal data analysis can be applied individually, or in novel com- binations, to derive various insights from time-evolving graphs. However, current batch analytics frameworks like Map-Reduce or Spark lack built-in support for temporal reason- ing, and thus an analyst is burdened with figuring out how to express a temporal analysis task using the primitives supported by those frameworks, leading to inefficiencies and duplication of functionality. Our objective is to provide a principled framework for con- veniently expressing such tasks over large volumes of data. In this chapter, we look at some systematic aspects concerning temporal graph analytics.

Consider the example of snapshot-based evolution analysis of a graph, such as computing PageRank over hundreds of snapshots. Using DeltaGraph or TGI, we are able to efficiently fetch multiple snapshots from a disk-resident index into memory. However, storing hundreds of (potentially) very large graphs requires a great amount of primary memory. Selectively fetching portions of a graph into memory on demand is not a solution, as most graph algorithms do not benefit from locality caching due to the inherent structure of the graph. We could process one snapshot at a time, but the desire to carry out multiple analytical tasks simultaneously only generalizes the memory bottleneck. One of the questions we address in this chapter is: does the overlap between different versions of the graph snapshots offer an opportunity for compressed storage of the snapshots, without compromising direct access to the graphs for processing?

In order to scale temporal graph analytics to large volumes, it is imperative that we move towards the cluster computing paradigm. Due to the iterative nature of graph analysis algorithms, a MapReduce-style framework such as Hadoop is not well suited for this computation. Apache Spark [142] is a popular in-memory cluster computing framework that exposes a functional-programming-style library. In this chapter, we explore the constructs of a system for performing temporal graph analysis tasks on a framework built over Spark.

5.1.1 Contributions

In this chapter, we present two novel components:

• GraphPool is an in-memory data structure that can store multiple graphs together in a compact way by overlaying the graphs on top of each other. At any time, the GraphPool contains: (1) the current graph, reflecting the current state of the network; (2) historical snapshots, retrieved from the past using the retrieval commands and possibly modified by an application program; and (3) materialized graphs, which correspond to interior or leaf nodes in the DeltaGraph but may not correspond to any valid graph snapshot. GraphPool exploits the redundancy amongst the different graph snapshots that need to be retrieved, and considerably reduces the memory requirements for historical queries.

• Temporal Graph Analysis Framework (TAF): We provide an expressive library to specify a wide range of temporal graph analysis tasks and to execute them at scale in a cluster environment. The library is based on a novel set of temporal graph operators that enable a user to analyze the history of a graph in a variety of ways. The execution engine itself is based on Apache Spark, a large-scale in-memory cluster computing framework.

Outline: The rest of this chapter is organized as follows. In Section 5.2, we provide background on work relevant to temporal graph analytics systems. In Section 5.3, we present the details of GraphPool, including its design, updates, an application using visual analytics, and finally experiments supporting its efficacy. In Section 5.4, we present the Temporal Graph Analysis Framework, listing the operators and operands provided in the library, followed by the details of the system design and experiments on its performance. We conclude the chapter with a summary and a relevant discussion in Section 5.5.

5.2 Background

In this section, we discuss various network analytics routines, parallel graph processing systems, and temporal data analytics systems. We talk about the opportunities provided by them along with their limitations in order to identify the appropriate constructs of a temporal graph analytical system.

Over the last decade, several network analysis packages have surfaced. Most of these packages implement popular graph-theoretic algorithms such as shortest paths, centrality computation, clustering, decomposition, and flows, which can be programmatically employed to analyze a graph, combined with analytical tools such as graph visualization and curve plotting. JUNG (Java Universal Network/Graph Framework) [99] is a Java-based library which provides visually interactive graph displays and various graph metric computations. SNAP [86] is a C++-based package (with a Python interface) with similar graph mining and analysis features, known for scaling well to large graphs. Blueprints [4] is a graph analysis middleware that provides a common interface between several storage backends (such as Neo4J [10], OrientDB [130], Titan [15], MongoDBGraph [46], and others) and graph analysis packages (such as JUNG, SAIL [12], Furnace [7], and others) or languages (such as Gremlin [8]). Most of these routines, however, are not designed to scale to a cluster environment. Also, none of them provide support for temporal analytics.

Graph processing frameworks such as Pregel [90] (and its open source version, Giraph [1]), GraphLab [88], and more recently GraphX [56], are designed to execute tasks on large graphs. These frameworks are based on a node-centric functional programming paradigm, and the underlying graph processing in the cluster uses specific synchronization mechanisms to communicate across nodes for tasks such as traversal or simply accessing a neighbor's state. While these frameworks are able to scale up to large graphs, they are limited by the absence of rich libraries that can help compute a wide range of algorithms. Part of the reason for this limitation lies in their design, which does not generalize well to parallelizing a wide variety of graph algorithms. None of these frameworks support temporal analytics.

Temporal graph processing also involves the handling of streaming graph data. However, the nature of the queries in that domain is quite different from general temporal queries such as evolution, comparison, or historical ones. In stream-data processing, the focus is on sliding-window queries, i.e., evaluating queries such as aggregates on a fixed-length window of time whose end point is usually the current or latest time. As the updates "stream in", the query result is continuously updated to reflect the new state of the window. Such analysis is applied in continuous monitoring systems, and there usually exists a "single pass" constraint, i.e., the stream is read once and never stored; the query processing in this domain is fairly specific, and the solutions applied are mostly based on approximations. A comprehensive account of graph streaming work is out of the scope of this chapter, so we refer the reader to a few interesting pointers, such as PageRank estimation [114], outlier detection [20], sketches [23], and a recent survey of algorithms on graph streams [93]. The premise of our analytical objectives is quite different: we store the entire history and strive to extend all static network analytics to the temporal dimension.

In order to design a framework for large scale temporal graph analysis, we borrowed concepts and artifacts from many of the different systems alluded to in the discussion above.

5.3 Memory Efficient Loading of Multiple Versions

In this section, we describe the key ideas behind GraphPool, which stores multiple graphs in memory in an overlapping manner, while providing direct access to each individual graph.

5.3.1 GraphPool

GraphPool maintains a single graph that is the union of all the active graphs, including: (1) the current graph, (2) the historical snapshots, and (3) the materialized graphs (Figure 5.1). Each component (node or edge), and each of its possible attribute values for each attribute, is associated with a bitmap string (called BM henceforth) that is used to decide which of the active graphs contain that component or attribute. A GraphID-Bit mapping table is used to maintain the mapping of bits to different graphs; Figure 5.1(c) shows an example of such a mapping. Each historical graph that has been fetched is assigned two consecutive Bits, {2i, 2i + 1}, i ≥ 1. Materialized graphs, on the other hand, are only assigned one Bit.

Figure 5.1: (a) Graphs at times tcurrent, t1 and t2; (b) GraphPool consisting of the overlaid graphs; (c) GraphID-Bit Mapping Table.

Bits 0 and 1 are reserved for the current graph membership. Specifically, Bit 0 tells us whether the element belongs to the current graph or not. Bit 1, on the other hand, is used for elements that may have been recently deleted, but are not part of the DeltaGraph index yet. A Bit associated with a materialized graph is interpreted in a straightforward manner.

Using a single bit for a historical graph misses out on a significant optimization opportunity. Even if a historical graph differs from the current graph or one of the ma- terialized graphs in only a few elements, we would still have to set the corresponding bit appropriately for all the elements in the graph. We can use the bit pair, {2i, 2i + 1}, to eliminate this step. We mark the historical graph as being dependent on a materialized graph (or the current graph) in such a case. For example, in Figure 5.1(c), historical snap- shot 35 is dependent on materialized graph 4. If Bit 2i is true, then the membership of an element in the historical graph is identical to its membership in the materialized graph

(i.e., if present in one, then present in another). On the other hand, if Bit 2i is false, then Bit 2i + 1 tells us whether the element is contained in the historical graph or not

(independent of the materialized graph).
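The membership test implied by this bit-pair scheme can be sketched as follows; the BitSet layout and the dependency lookup are illustrative assumptions, not the actual GraphPool implementation.

import java.util.BitSet;
import java.util.Map;

// Hypothetical sketch of the GraphPool bit-pair membership test. For a
// historical graph assigned bits {2i, 2i+1}: if bit 2i is set, membership
// follows the materialized (or current) graph that the historical graph
// depends on; otherwise bit 2i+1 stores the membership directly.
class BitmapMembership {
    // Maps a historical graph's index i to the bit of its base graph
    // (bit 0 = current graph, or the single bit of a materialized graph).
    private final Map<Integer, Integer> dependencyBit;

    BitmapMembership(Map<Integer, Integer> dependencyBit) {
        this.dependencyBit = dependencyBit;
    }

    boolean inHistoricalGraph(BitSet elementBits, int i) {
        if (elementBits.get(2 * i)) {
            // Same membership as the graph this snapshot depends on.
            return elementBits.get(dependencyBit.get(i));
        }
        return elementBits.get(2 * i + 1); // independent membership bit
    }
}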

When a graph is pulled into the memory either in response to a query or for ma- terialization, it is overlaid onto the current in-memory graph, edge by edge and node by node. The number of graphs that can be overlaid simultaneously depends primarily on the amount of memory required to contain the union of all the graphs. The bitmap size is dynamically adjusted to accommodate more graphs if needed, and overall does not oc- cupy significant space. The determination of whether to store a graph as being dependent on a materialized graph is made at the query time. During the query plan construction,

we count the total number of events that need to be applied to the materialized graph, and if it is small relative to the size of the graph, then the fetched graph is marked as being dependent on the materialized graph.

5.3.2 Clean-up of a graph from memory

When a historical graph is no longer needed, it needs to be cleaned. Cleaning up a graph is logically a reverse process of fetching it into the memory. The naive way would be to go through all the elements in the graph, and reset the appropriate bit(s), and delete the element if no bits are set. However the cost of doing this can be quite high. We instead perform clean-up in a lazy fashion, periodically scanning the GraphPool in the absence of query load, to reset the bits, and to see if any elements should be deleted. Also, in case the system is running low on memory and there are one or more unneeded graphs, the Cleaner thread is invoked and not interrupted until the desired amount of memory is liberated.
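A lazy clean-up pass of this kind might look like the following sketch; the ElementPool interface and the idleness check are hypothetical simplifications.

import java.util.BitSet;
import java.util.Iterator;

// Hypothetical sketch of the lazy clean-up pass: clear the bits of graphs that
// are no longer needed and drop elements whose bitmaps become empty.
class LazyCleaner {
    interface ElementPool {
        Iterator<BitSet> elementBitmaps(); // iterator supporting remove()
        boolean queryLoadIsIdle();
    }

    static void sweep(ElementPool pool, int[] unneededBits) {
        if (!pool.queryLoadIsIdle()) {
            return; // defer the sweep while queries are running
        }
        Iterator<BitSet> it = pool.elementBitmaps();
        while (it.hasNext()) {
            BitSet bits = it.next();
            for (int bit : unneededBits) {
                bits.clear(bit);
            }
            if (bits.isEmpty()) {
                it.remove(); // no active graph references this element
            }
        }
    }
}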

5.3.3 Visual Analytics on Historical Graph Snapshots

We describe HiNGE (Historical Network/Graph Explorer) [74], an analysis and visualization tool for historical networks. HiNGE has been implemented in Java; it uses the DeltaGraph and GraphPool libraries for storage and graph representation, respectively, and the JUNG¹ [99] library for network visualization. While the basic DeltaGraph provides temporal indexing and snapshot retrieval over the evolving network, its auxiliary indexing capabilities support features such as Node Search and Pattern Finding. Figure 5.2 describes the interaction between the various components of the overall system. In the remainder of this section, we describe the features and usability of HiNGE.

¹http://jung.sourceforge.net

Figure 5.2: System Architecture: HiNGE, DeltaGraph and GraphPool. (The HiNGE QueryManager translates user queries into graph retrieval and executes JUNG algorithms on the retrieved graphs; the GraphManager maintains the GraphPool, overlaying historical graphs and cleaning them up; the HistoryManager manages the DeltaGraph over a key-value store, handling query planning and disk reads/writes.)

5.3.3.1 Features

Features supported by HiNGE may be classified as exploration tasks, evolutionary queries and search queries.

• Exploration: As the name suggests, this mode is built for gaining an initial insight into the network. It is meant to provide the user with a basic level of control to "browse" through the network while observing the structure and properties of the graph, and perhaps certain anomalies as well. It is the stepping stone for more specific enquiries into the nature of the network. Exploration of a temporal graph is enabled using (a) a time-slider, (b) an interactive, zoomable snapshot viewer, and (c) a metric calculator. The time-slider is an interactive timeline that the user can adjust to go to a specific time of interest. The snapshot viewer presents a view of the graph at the desired time as indicated by the time-slider; the user may pan, zoom, or rotate the pane with mouse operations to focus on the area of interest in the graph. The layout, color, and other aspects of the graph's appearance can also be changed by customizing the choices in the Settings menu. The metric calculator provides a choice of several metrics, such as PageRank, Betweenness Centrality, and Clustering Coefficient, for the vertices of the network at the time indicated by the time-slider. The metric values may be included in the vertex labels in the snapshot view, or used to style the graph display, and the top/bottom-k valued vertices are displayed on the side. These can be seen in Figure 5.3.

Figure 5.3: Temporal Exploration using the time-slider.

• Query: The Query mode is meant to provide a comparative temporal analysis of vertices of interest. During the exploration phase, a user may find certain vertices of interest; the query interface allows for a detailed evolutionary analysis of those vertices in particular. It shows the structural evolution as well as the change in the scores of interest, such as the clustering coefficient. The components used in the query mode are (a) the query, (b) node evolution, and (c) metric evolution. The query specifies the vertex, the start and end times, the metric of interest, and the number of time points used in the comparison. The node evolution is shown as the evolution of the node and its neighborhood at different points in time. The metric evolution plots the metric value over time. These can be seen in Figure 5.4.

Figure 5.4: Temporal Evolution of node 12 (in green) over three points in time: evolution of the node's neighborhood and PageRank.

• Search: An interesting and slightly different kind of query is Subgraph Pattern finding. The user may specify the query by drawing the structure of a subgraph, assigning labels to the nodes, and specifying the time interval during which to perform the search. The result lists all the matches found for the query, i.e., the subgraph layouts and the times at which the particular subgraph exists. This is implemented in the DeltaGraph through an auxiliary path index. Another very useful feature is Node Search, which helps the user find nodes based upon attribute values; this is implemented using an auxiliary inverted index in DeltaGraph. The user may constrain the search by specifying a time interval. Figure 5.5 shows the node search and subgraph pattern search features. By keeping the time range open, we can specify a search across all times, whereas a query with the same start and end points refers to a search within that particular snapshot. More generally, it can be any time period.

Figure 5.5: (a) Node Search; (b) Subgraph Pattern Search.

Working with HiNGE: The expected input graph specification is as described in [73]. The evolving network is described as a set of chronological events. Each node is required to have a unique identifier, the nodeid. Nodes and edges may carry any number of attributes, for example, name, weight, etc. When specifying a node in a query, we are required to mention the nodeid; Node Search should be used to locate a nodeid when only the attributes of the node are known. Here is a list of the major options/parameters, all of which can be accessed from one of the menus:

1. A new network can be loaded from File > Load Network.

2. The graph appearance may be altered by changing the default layout (Layout > Algorithm), colors (Layout > Color), and the shape and size (Layout > Shape and Size). The latter may be based on the metric scores of the nodes.

3. Switch from the Explore mode to the Query mode via Mode > Query, and vice versa.

4. Set the k and the choice of bottom or top in top/bottom-k by going to Settings > Top/Bottom k.

5. Perform a node search by going to Tools > Search. Perform a subgraph pattern query by going to Tools > Find Patterns.

6. Find the documentation under Help.

5.3.4 Experiments

GraphPool memory consumption:

Figure 5.6 shows the total (cumulative) memory consumption of the GraphPool when a sequence of 100 singlepoint snapshot retrieval queries, uniformly spaced over the life span of the network, is executed against Datasets 1 and 2. By exploiting the overlap between these snapshots, the GraphPool is able to maintain a large number of snapshots in memory.

Figure 5.6: Cumulative GraphPool memory consumption.

For Dataset 2, if the 100 graphs were to be stored disjointly, the total memory consumed would be 50GB, whereas the GraphPool only requires about 600MB. The plot for Dataset 1 is almost a constant because, for this dataset, any historical snapshot is a subset of the current graph.

The minor increase toward the end is due to the increase in the bitmap size, required to accommodate new queries.

Bitmap penalty: We compared the penalty of using the bitmap filtering procedure in

GraphPool by running a PageRank computation with and without the use of bitmaps. We observed that the execution time increases from 1890ms to 2014ms, an increase of less than 7%.

Materialization: Figure 5.7 shows the benefits of materialization for a DeltaGraph and the associated cost in terms of memory, for Dataset 2 with arity = 4 and using the Intersec- tion differential function. We compared four different situations: (a) no materialization,

(b) root materialized, (c) both children of the root materialized, and (d) all four grandchil- dren of the root materialized. The results are as expected – we can significantly reduce the query latencies (up to a factor of 8) at the expense of higher memory consumption.

Figure 5.7: Effect of materialization: (a) average query time; (b) materialization memory requirement.

5.4 Temporal Analytics Framework (TAF)

In this section, we describe our proposed analytics framework, Temporal graph Analysis

Framework (TAF), that enables programmers to express complex analytical tasks on time-evolving graphs. We expose the operators in the framework through Python and Java APIs.

TAF is built on top of Apache Spark, which enables parallel, in-memory execution.

5.4.1 Temporal Graph Analysis Library

In this section, we describe the set of operators for analyzing large historical graphs. At the heart of this library is a data model where we view the historical graph as a set of nodes or subgraphs evolving over time. The choice of temporal nodes as a primitive helps us describe a wide range of fetch and compute operations in an intuitive manner.

More importantly, it provides us with an abstraction to parallelize computation. The temporal nodes and the set of temporal nodes bear a correspondence to the tuples and tables of relational algebra, as the basic unit of data and the prime operand, respectively.

Operands: We begin by defining the two main data types that most operators operate upon.

Definition 8 (Temporal Node). A temporal node (NodeT), $N^T$, is defined as a sequence of all and only the states of a node $N$ over a time range, $T = [t_s, t_e)$. All the $k$ states of the node must have a valid time duration $T_i$, such that $\cup_{i=1}^{k} T_i = T$ and $\cap_{i=1}^{k} T_i = \phi$.

Definition 9 (Set of Temporal Nodes). A SoN is defined as a set of $r$ temporal nodes $\{N_1^T, N_2^T, \ldots, N_r^T\}$ over a time range, $T = [t_s, t_e)$.


Figure 5.8: SoN: A set of nodes can be abstracted as a 3 dimensional array with temporal, node and attribute dimensions.

The NodeT class provides a range of methods to access the state of the node at various time points, including: getVersions() which returns all the different versions of the node as a list of static nodes (NodeS), getVersionAt() which finds a specific version of the node given a time point, getNeighborIDsAt() which returns the IDs of the neighbors as of the specified time point, and so on.
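As a brief illustration of these accessors, the following is a minimal sketch; it assumes a NodeT instance (called nodeT here) has already been fetched, and the variable names and the timepoint are made up for the example rather than taken from the actual API documentation.

versions = nodeT.getVersions()                       # all states, as a list of static NodeS objects
stateAt = nodeT.getVersionAt("July 14, 2002")        # one static version as of a timepoint
nbrIds = nodeT.getNeighborIDsAt("July 14, 2002")     # neighbor IDs as of that timepoint
for nodeS in versions:
    print(nodeS)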

A Temporal Subgraph (SubgraphT) generalizes NodeT and captures a sequence of the states of a subgraph (i.e., a set of nodes and edges among them) over a period of time.

Typically the subgraphs correspond to k-hop neighborhoods around a set of nodes in the graph. An analogous getVersionAt() function can be used to retrieve the state of the subgraph as of a specific time point as an in-memory Graph object (the user program must ensure that any graph object so created can fit in the memory of a single machine).

A Set of Temporal Subgraphs (SoTS) is defined analogously to SoN as a set of temporal subgraphs.

Operators: Below we discuss a subset of the temporal graph algebra operators supported by our system.

1. Selection: This operator accepts an SoN or an SoTS along with a Boolean function

on the nodes or the subgraphs, and returns an SoN or SoTS as output. Selection

performs entity centric filtering on the operand, and does not change temporal or

attribute dimensions of the data.

2. Timeslicing: This operator accepts an SoN or an SoTS along with a timepoint

(or time interval) t, finds the state of each of the individual nodes or subgraphs in

the operand as of t, and returns it as another SoN or SoTS respectively (we use

SoN/SoTS to also represent sets of static nodes or subgraphs as a special case). The

operator can also take a list of timepoints as input, returning a list of SoN or SoTS.

3. Graph: This operator takes in an SoN and returns an in-memory Graph object over

the nodes in the SoN (containing only the edges whose both endpoints are in the

SoN). An optional parameter, tp, may be specified to get a GraphS valid at time

point tp. Note that the returned Graph object is not distributed, and hence the user

must make sure that it can fit in the memory of a single machine.

4. NodeCompute: This operator is analogous to a map operation; it takes as input an

SoN (or an SoTS) and a function, and applies the function to all the individual

nodes (subgraphs) and returns the results as a set.

5. NodeComputeTemporal: Unlike NodeCompute(), this operator takes as input a

function that operates on a static node (or subgraph) in addition to an SoN (or

an SoTS); for each node (subgraph), it returns a sequence of outputs, one for each

different state (version) of that node (or subgraph). Optionally, the user may spec-

ify another function (NodeComputeDelta) that operates on the delta between two

versions of a node (subgraph), which the system can use to compute the output

more efficiently. An optional parameter is a method describing points of time at

which computation needs to be performed; in the absence of it, the method will be

evaluated at all the points of change.

6. NodeComputeDelta: This operator takes as input: (a) a function that operates on a

static node (or subgraph) and produces an output quantity, (b) an SoN (or an SoTS)

like NodeComputeTemporal(), (c) a function that operates on the following:

a static node (or subgraph), some auxiliary information pertaining to that state of

the node (or subgraph), the value of the quantity at that state, and an update (event)

to it. This operator returns a sequence of outputs, one for each state of the node

(or subgraph), similar to NodeComputeTemporal(). However, the method of
computation differs in that it updates the computed quantity for each version
incrementally instead of computing it afresh. An optional parameter is a method
describing the points of time at which computation needs to be performed; in its
absence, the quantity is evaluated at all the points of change.

7. Compare: This operator takes as input two SoNs (or two SoTSs) and a scalar func-

tion (returning a single value), computes the function value over all the individual

components, and returns the differences between the two as a set of (node-id, dif-

ference) pairs. This operator tries to abstract the common operation of comparing

two different snapshots of a graph at different time points. A simple variation of

this operator takes a single SoN (or SoTS) and two timepoints as input, and does

the compare on the timeslices of the SoN as of those two timepoints. An optional

parameter is the method describing points of time at which to base the comparison.

8. Evolution: This operator samples a specified quantity (provided as a function) over

time to return evolution of the quantity over a period of time. An optional parameter

is the method describing points of time at which to base the evolution.

9. TempAggregation: This abstractly represents a collection of temporal aggregation

operators such as Peak, Saturate, Max, Min, and Mean over a scalar time-

series. The aggregation operations are used over the results of temporal evaluation

of a given quantity over an SoN or SoTS. For instance, finding “times at which
there was a peak in the network density” is used to find eventful timepoints of high
interconnectivity such as conversations in a cellular network, or high transactional
activity in a financial network. A short sketch composing some of these operators follows this list.
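For illustration, the following minimal sketch composes Timeslice, Evolution, and a temporal aggregation to find timepoints at which the network density peaked, in the style of the TAF Python API shown later in Figure 5.9. The "snet" graph, the 50-point sample, and the Peak() call form on the evolution result are assumptions made for this sketch, not the documented API.

tgiH = TGIHandler(tgiconf, "snet", sparkcontext)
son = SON(tgiH).Timeslice("t >= Jan 1,2003 and t < Jan 1,2004").fetch()
gm = GraphMetrics()
# Sample the network density at 50 timepoints over the fetched interval.
densityOverTime = son.GetGraph().Evolution(gm.density, 50)
# Temporally aggregate the resulting time-series to locate eventful timepoints.
peakTimes = densityOverTime.Peak()      # assumed call form for the Peak aggregate
print('Density peaks at: %s' % peakTimes)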

5.4.2 System Implementation

The library is implemented in Python and Java and is built on top of the Spark API.

The choice of Spark provides us with an efficient in-memory cluster compute execution

platform, circumventing the issues of data partitioning, communication, synchronization, and fault tolerance.

The key abstraction in Spark is that of an RDD, which represents a collection of objects of the same type, stored across a cluster. SoN and SoTS are implemented as RDDs of NodeT and SubgraphT, respectively (i.e., as RDD<NodeT> and RDD<SubgraphT>).

For constructing the in-memory graph objects, we use the open-source TinkerGraph im- plementation [14], but other in-memory graph implementations could be used as well.

TinkerGraph supports the Blueprints Graph API, and thus the graph objects that are cre- ated can be manipulated or analyzed using the Gremlin query language and other software packages built on top of Blueprints. We now describe in brief the implementation details for NodeT and SubgraphT, followed by details of the incremental computational operator, and the parallel data fetch operation.

Figure 5.9 shows sample code snippets for three different analytical tasks – (a) find- ing the node with the highest clustering coefficient in a historical snapshot; (b) comparing different communities in a network; (c) finding the evolution of network density over a sample of ten points.

NodeT and SubgraphT: A set of temporal nodes is represented with an RDD of NodeT

(temporal node). A temporal node contains the information for a node during a spec- ified time interval. The question of the appropriate physical storage of the NodeT (or

SubgraphT) structure is quite similar to that of storing a temporal graph on disk, such as with a DeltaGraph or a TGI, except in memory instead of on disk. Since NodeT is fetched at query time, it is preferable to avoid creating a complicated index, since

tgiH = TGIHandler(tgiconf, "wiki", sparkcontext)
sots = SOTS(tgiH, k=1).Timeslice("t = July 14, 2002").fetch()
nm = NodeMetrics()
nodeCC = sots.NodeCompute(nm.LCC, append=True, key="cc")
maxlCC = nodeCC.Max(key="cc")

(a) Finding node with highest local clustering coefficient

tgiH = TGIHandler(tgiconf, "snet", sparkcontext) son = SON(tgiH).Timeslice('t >= Jan 1,2003 and t< Jan 1, ' \',2004').Filter("community") sonA=son.Select("community =\"A\" ").fetch() sonB=son.Select("community =\"B\" ").fetch() compAB = SON.Compare(sonA, sonB, SON.count()) print('Average membership in 2003,') print(A=%s\tB=%s'%(mean(compAB[0]), mean(compAB[1])))

(b) Comparing two communities in a network

tgiH = TGIHandler(tgiconf, "wiki", sparkcontext) son = SON(tgiH).Select("id < 5000").Timeslice("t >= OCt" \"24, 2008").fetch() gm = GraphMetrics() evol = son.GetGraph().Evolution(gm.density, 10) print('Graph density over 10 points=%s'%evol)

(c) Evolution of network density

Figure 5.9: Examples of analytics using the TAF Python API.

the cost to create the index at query time is likely to offset any access latency benefits due to the index. An intuitive guess, based upon examination of typical temporal analysis tasks, is that the access pattern is most likely going to be in chronological order, i.e., the query requesting the subsequent versions or changes in order of time. Hence, we store NodeT (and SubgraphT) as an initial snapshot of the node

(or subgraph), followed by a list of chronologically sorted events. It provides methods such as GetStartTime(), GetEndTime(), GetStateAt(), GetIterator(),

Iterator.GetNextVersion(), Iterator.GetNextEvent(), and so on. We omit the details of these methods as their functionality is apparent from the nomenclature.
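As an illustration of this layout, the following is a minimal sketch of how a NodeT might hold an initial snapshot plus a chronologically sorted event list, and of how GetStateAt() could replay events up to a requested time. The field names and the event interface (event.time, event.apply()) are assumptions made for the sketch, not the actual implementation.

class NodeT:
    def __init__(self, start_time, end_time, initial_snapshot, events):
        self.start_time = start_time       # ts of the covered interval [ts, te)
        self.end_time = end_time           # te of the covered interval
        self.snapshot = initial_snapshot   # node state as of start_time
        self.events = events               # events sorted by event.time (ascending)

    def GetStartTime(self):
        return self.start_time

    def GetEndTime(self):
        return self.end_time

    def GetStateAt(self, t):
        # Replay events on a copy of the initial snapshot up to (and including) time t,
        # assuming the snapshot object supports copy() and events know how to apply themselves.
        state = self.snapshot.copy()
        for event in self.events:
            if event.time > t:
                break
            event.apply(state)             # e.g., set an attribute, add or delete an edge
        return state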

NodeComputeDelta: As defined in Section 5.4.1, NodeComputeDelta evaluates a quantity over each NodeT (or SubgraphT) using two supplied methods: f(), which computes the quantity on a state of the node or subgraph, and f∆(), which updates the quantity on a state of the node or subgraph for a given set of event updates. Consider a simple example of finding the fraction of nodes with a specific attribute value in a given

SubgraphT. If this were to be performed using NodeComputeTemporal, the quantity will be computed afresh on each new version of the subgraph, which would cost

$O(N \cdot T)$ operations, where $N$ is the size of the operand (number of nodes) and $T$ is the number of versions. However, using the incremental computation, each new version can be processed in constant time after the first snapshot, which adds up to $O(N + T)$. While performing the incremental computation, the corresponding f∆() method is expected to be defined so as to evaluate the nature of the event – whether it brings about any change in the output quantity or not, i.e., a scalar change value based upon the actual event and the concerned portions of the state of the graph – and also to update the auxiliary structure, if used. Code in Figure 5.10 illustrates the usage of NodeComputeTemporal and

NodeComputeDelta in a similar example.

Consider a somewhat more intricate example, where one needs to find counts of a small pattern over time on an SoTS, such as the one shown in Figure 5.11(a). In order to perform such pattern matching over long sequences of subgraph versions, it is essential to maintain certain inverted indexes which can be looked up to answer in constant time whether an event has caused a change in the answer from a previous state or caused a change in the index itself, or both. Such inverted indexes, quite common to subgraph pattern matching, are required to be updated with every event; otherwise, with every

147 tgiH = TGIHandler(tgiconf, "dblp", sparkcontext) sots = SOTS(k=2, tgiH).Timeslice('t >= Nov 1,2009 and t< Nov 30,'\ '2009').fetch() labelCount = sots.NodeComputeDelta(fCountLabel)

...

def fCountLabel(g):
    labCount = 0
    for node in g.GetNodes():
        if node.GetPropValue('EntityType') == 'Author':
            labCount += 1
    return labCount

(a) Using NodeComputeTemporal

tgiH = TGIHandler(tgiconf, "dblp", sparkcontext) sots = SOTS(k=2, tgiH).Timeslice('t >= Nov 1,2009 and t< Nov 30,'\ '2009').fetch() labelCount = sots.NodeComputeTemporal(fCountLabel, fCountLabelDel)

...

def fCountLabelDel(gPrev, valPrev, event):
    valNew = valPrev
    if event.Type == EType.AttribValAlter:
        if event.AttribKey == 'EntityType':
            if event.PrevVal == 'Author':
                valNew = valPrev - 1
            elif event.NextVal == 'Author':
                valNew = valPrev + 1
    return valNew

def fCountLabel(g):
    labCount = 0
    for node in g.GetNodes():
        if node.GetPropValue('EntityType') == 'Author':
            labCount += 1
    return labCount

(b) Using NodeComputeDelta

Figure 5.10: Incremental computation using different methods. Both examples compute counts of nodes with a specific label in subgraphs over time.

new event update, we would need to look up the new state of the subgraph afresh, which would simply reduce it to performing non-indexed subgraph pattern matching over new snapshots of a subgraph at each time point, which is a fairly expensive task. In order to


maintain a constantly updated set of indices, the auxiliary information, which is a parameter and a return type for f∆(), can be utilized.

Figure 5.11: Pattern matching in temporal subgraphs: (a) pattern to be matched; (b) ongoing indexes need to be maintained to verify the pattern.
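As a concrete but hypothetical illustration of threading auxiliary information through the incremental function, the label-counting example of Figure 5.10 can be extended so that the set of node ids currently labeled 'Author' travels along with the count. The event fields used below (including event.NodeId) and the returned (value, auxiliary) pair are assumptions made for this sketch only.

def fCountLabelDelAux(gPrev, valPrev, auxAuthorIds, event):
    # auxAuthorIds is the auxiliary structure: ids of nodes currently labeled 'Author'.
    if event.Type == EType.AttribValAlter and event.AttribKey == 'EntityType':
        if event.PrevVal == 'Author' and event.NodeId in auxAuthorIds:
            auxAuthorIds.discard(event.NodeId)
            valPrev -= 1
        elif event.NextVal == 'Author':
            auxAuthorIds.add(event.NodeId)
            valPrev += 1
    # Both the updated value and the updated auxiliary structure are returned,
    # so the next invocation can reuse them without re-examining the full subgraph.
    return valPrev, auxAuthorIds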

Note that such an incremental computation operator opens up possibilities for applying a considerable amount of algorithmic work available in the literature on online and streaming graph query evaluation to historical graph analysis.

For instance, there is work on pattern matching in streaming [135, 52] and incremental computing [51, 133] contexts, respectively.

Specifying interesting time points: In the map-oriented operators on an SoN or an SoTS, the time points of evaluation, by default, are all the points of change in the given operand. However, a user may choose to provide a definition of which points to select. This can be as simple as returning a constant set of timepoints, or based on a more complex function of the operand(s). Except for the Compare operator, which accepts two operands, the operators allow an optional function that works on a single temporal operand; Compare accepts a similar function that operates on two such operands.

A couple of such examples can be seen in Figure 5.12.

Data Fetch: In a temporal graph analysis task, we first need to instantiate a TGI connec-

tgiH = TGIHandler(tgiconf, "wiki", sparkcontext)
son = SON(tgiH).Select("id < 5000").Timeslice("t >= Oct 24, 2008").fetch()
gm = GraphMetrics()
evol = son.GetGraph().Evolution(gm.density, selectTimepointsMinimal)
print('Graph density over 3 points=%s' % evol)
...

def selectTimepointsMinimal(son):
    time_arr = []
    st = son.GetStartTime()
    et = son.GetEndTime()
    time_arr.append(st)
    time_arr.append((st + et) / 2)
    time_arr.append(et)
    return time_arr

(a) Specifying the start, end and middle point of SON for an Evolution query.

tgiH = TGIHandler(tgiconf, "snet", sparkcontext) son = SON(tgiH).Timeslice('t >= Jan 1,2003 and t< Jan 1, ' \'2004').Filter("community") sonA=son.Select("community =\"A\" ").fetch() sonB=son.Select("community =\"B\" ").fetch() compAB = SON.Compare(sonA, sonB, SON.count(), \ selectTimepointsAll) print('Average membership in 2003,') print(A=%s\tB=%s'%(mean(compAB[0]), mean(compAB[1]))) ...

def selectTimepointsAll(sonA, sonB):
    time_arr = []
    ptsA = sonA.GetAllChangePoints()
    ptsB = sonB.GetAllChangePoints()
    time_arr = ptsA + ptsB
    return time_arr

(b) Specifying all the change points in two SONs for a Compare query.

Figure 5.12: Using the optional timepoint specification function with evolution and comparison queries.

tion handler instance. It contains configurations such as the address and port of the TGI query manager host, the graph-id, and a SparkContext object. Then, a SON (or SOTS) object is instantiated by passing the reference to the TGI handler, and any query-specific parameters

(such as k-value for fetching 1-hop neighborhoods with SOTS). The next few instructions specify the semantics of the graph to be fetched from the TGI. This is done through the commands explained in Section 5.4.1, such as the Select, Filter, Timeslice, etc.

However, the actual retrieval from the index does not happen until the first statement following the specification instructions. A fetch() command can be used explicitly to tell the system to perform the fetch operation. Upon the fetch() call, the analytics framework sends the combined instructions to the query planner of the TGI, which translates those instructions into an optimal retrieval plan. This prevents the system from retrieving a large superset of the required information from the index and pruning it later.

The analytics engine runs in parallel on a set of machines, as does the graph index.

The parallelism at both places speeds up and scales both tasks. However, if the retrieved graph at the TGI cluster were aggregated at the Query Manager and sent serially to the master of the analytical framework engine, after which it was distributed to the different machines on the cluster, it would create a space and time bottleneck at the Query

Manager and the master, respectively, for large graphs. In order to bypass this situation, we have designed a parallel fetch operation, in which there is a direct communication between the nodes of the analytics framework cluster and the nodes of the TGI cluster.

This happens through a protocol that can be seen in Figure 5.13. The protocol is briefly described in the following ordered steps:

1. Analytics query containing fetch instructions is received by the TAF master.

2. A handshake between the TAF master and TGI query manager is established. The

151 latter receives the fetch instructions and the former is made aware of the active TGI

query processor nodes.

3. Parallel fetch starts at the TGI cluster.

4. The TAF master instantiates a TGIDriver instance at each of its cluster machines,

wrapped in an RDD.

5. Each node at the TAF performs a handshake with one or more of the TGI nodes.

6. Upon the end of fetch at TGI, the individual TGI nodes transfer a portion of the

SoN to an RDD on the corresponding TAF nodes. A minimal sketch of steps 4–6 follows Figure 5.13.


Figure 5.13: A pictorial representation of the parallel fetch operation between the TGI cluster and the analytics framework cluster. The numbers in circles indicate the relative order and the arrowheads indicate the direction of information or data flow.
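The following is a minimal sketch of how steps 4–6 of the protocol above could be orchestrated from the TAF master using Spark. TGIDriver, its handshake() and fetchPartition() methods, and the simple partition-to-TGI-node assignment are hypothetical names used only to illustrate the flow; they are not the actual implementation.

def parallel_fetch(sparkcontext, tgi_nodes, fetch_instructions, num_partitions):
    # Pair each TAF partition with one of the TGI query processor nodes.
    assignments = [(p, tgi_nodes[p % len(tgi_nodes)]) for p in range(num_partitions)]

    def fetch_partition(pair):
        part_id, tgi_node = pair
        driver = TGIDriver(tgi_node)              # step 4: driver wrapped in the RDD
        driver.handshake(fetch_instructions)      # step 5: handshake with a TGI node
        return driver.fetchPartition(part_id)     # step 6: receives a list of NodeT objects

    # The resulting RDD of NodeT objects is the distributed SoN.
    return sparkcontext.parallelize(assignments, num_partitions).flatMap(fetch_partition)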

5.4.3 Experiments

Dataset: We use the Wikipedia citation network consisting of 266,769,613 edge addition events from Jan 2001 to Sept 2010. At its largest point, the graph consists of 21,443,529 nodes and 122,075,026 edges.

Conducting Scalable Analytics: We examined TAF's performance through an analytical task for determining the highest local clustering coefficient in a historical graph snapshot.

Figure 5.14 shows compute times for the given task on different graph sizes, as well as varying size of the Spark cluster. Speedups due to parallel execution can be observed, especially for larger datasets.


Figure 5.14: TAF computation times for Local Clustering Coefficient on varying graph sizes (N=node count) using different cluster sizes.

Temporal Computation: Earlier in the chapter, we presented two separate ways of computing a quantity over changing versions of a graph (or node). These include evaluating the quantity on different versions of the graph separately, and, alternatively, performing it in an incremental fashion, utilizing the result for the previous version and updating it with

153 40


Figure 5.15: Label counting in several 2-hop neighborhoods through version (NodeComputeTemporal) and incremental (NodeComputeDelta) computation, respectively. We report cumulative time taken (excluding fetch time) over varying version counts; 2 Spark workers were used for the Wikipedia dataset.

respect to the graph updates. This can be seen for a simple node label counting task in Figure 5.10. The benefits of the incremental computation (NodeComputeDelta operator) over the version-based computation (NodeComputeTemporal operator) can be seen in Figure 5.15.

5.5 Summary

In this chapter, we addressed two important systemic issues in enabling large scale tempo- ral graph analytics. First, we introduced a data structure called GraphPool, to compactly hold multiple graphs in memory, exploiting temporal overlap between different snapshots, at the same time providing direct access to the snapshots in order to run simultaneous analytical tasks. Additionally, it facilitates materialization or pre-fetching of graph snap- shots to speed-up snapshot retrieval from on-disk storage indexes such as DeltaGraph.

We demonstrated its efficacy using HiNGE, a system for visual analytics. Second, we discussed the issues in scaling temporal graph analytics to a cluster computing frame-

work through a background survey of various graph analysis routines, graph processing frameworks, etc. We then presented the Temporal graph Analytics Framework (TAF), which proposes appropriate abstractions of operands (such as SoN and SoTS) and operators

(such as Timeslice and NodeComputeDelta) to enable a variety of temporal anal- ysis tasks over Apache Spark, a state of the art cluster computing framework.

Chapter 6: Conclusion

Graph analytics are increasingly considered crucial in obtaining insights about how in- terconnected entities behave, how information spreads, what are the most influential enti- ties, and many other characteristics of the domain. Analyzing the history of how a graph evolved can provide significant additional insights, especially about the future. Most real- world networks however, are large and highly dynamic. This leads to creation of very large histories, making it challenging to store, query, or analyze them. In this dissertation, we presented a novel index called DeltaGraph, that enables persistent compact storage of very large historical graph traces, supporting efficient reconstruction of past states of the graph. It is a highly flexible, tunable and extensible data structure that can be configured to work desirably for various kinds of historical graphs. Building upon DeltaGraph, we presented Temporal Graph Index that supports a wide range of retrieval queries such as node or neighborhood versions, along with snapshots. It is designed to reside in a cloud or cluster, ensuring high scalability and elasticity, which is ideal for very large graphs.

We also presented a distributed Temporal Graph Analytics Framework, built on top of

Apache Spark, that allows analysts to quickly write complex temporal analysis tasks us- ing a set of novel temporal graph operators. Our experiments show that our temporal indexing technique exhibits very efficient retrieval performance across a wide range of

queries, and can effectively exploit the available parallelism in a distributed setting. Also, our analytical framework is effective in expressing a variety of temporal analysis tasks and executing them in a highly scalable, parallel manner.

The work presented in this dissertation will benefit a number of people and impact research and business in different areas. Its primary impact will be in the area of data analytics in the form of a foundation for an appropriate ecosystem of tools that enables temporal graph analytics at scale, with efficiency and ease. Data analytics in turn finds application across several fields. Researchers in biology, sociology, linguistics, and other

fields are increasingly adopting data analytics as a power tool to gather insights into large volumes of data to formulate and validate hypotheses. Temporal data with interconnec- tion structures is omnipresent and is a potential source of groundbreaking discoveries in medicine, various branches of science and humanities. We believe that our work presented in this dissertation enables and accelerates the process of data analysis for a number of research works. Several businesses, most prominently in finance, technology, retail and telecommunications are increasingly relying on analyzing massive quantities of consumer data to gain insights for competitive advantage through strategizing, consumer targeting, and improving the quality of their products or services. Temporal analysis of interactions between customers, products and other entities in such large datasets will directly benefit from the techniques presented in this dissertation.

Many of the concepts developed in the course of this research have significance that transcends historical graph data management. Snapshot retrieval proposed through

DeltaGraph is oblivious of the underlying graph structure of the data and works well for the model of an evolving set, which can capture relational data amongst other types. In

fact, our experiments proved that DeltaGraph outperforms prior approaches for snapshot retrieval in relational databases. Similarly, the concept of overlaying multiple versions of a graph in-memory to alleviate redundancy using GraphPool also generalizes well beyond graphs. It can perhaps be used in analytical engines for non-graph snapshot-based analysis as well. In the TGI, we used a careful partitioning to handle the temporal and topological skew (non-uniform density). Many non-network datasets exhibit similar skewness patterns and the strategies of partitioning in TGI can be extended to those datasets. Finally, in TAF, while the data modeling is specific to temporal graphs, many of the operators are applicable to non-graph data as well.

Along with the artifacts in form of data structures, algorithms, and tools presented in this dissertation, we also discuss our contribution in the form of certain observations and insights (Section 6.1). We also talk about associated topics of research interest in graph data management and analytics (Section 6.2).

6.1 Insights

In this section, we share some of the key insights about temporal data management, learnt during the course of this work.

• Correct handling of the “temporal consistency” is the key to tractable storage,

retrieval and processing of temporal graph data. The change in a temporal graph

can be recorded in several ways. A couple of examples are the Copy and Log

approaches. While they correctly record the entire history and provide support

for snapshots, Copy is infeasible for storage, and Log for querying large graphs

that change frequently. DeltaGraph on the other hand performs both the tasks in

a feasible and efficient manner. The core reason behind this is the abstraction of

temporally consistent portions of the graph through differential functions. This

ensures a fairly low amount of redundancy but at the same time direct availability

of all portions of a graph. The style of abstraction of the temporally redundant part,

i.e., the definition of a differential function, is also the key to control other aspects

of the DeltaGraph performance such as latency curves. In other data structures

such as GraphPool, SoN, and TGI as well, the choice of handling of the temporal

consistency or redundancy is crucial to the success of their performance.

• Time is an orthogonal dimension to that of the graph. Over the years, representations

of temporal data have treated the quantity of time as a mere attribute. However, for

temporal analysis tasks, it is imperative that time be treated as a first-class citizen in

both logical and physical modeling of data. A data management system expected

to support such analyses requires operations that hinge upon the temporal attribute,

such as performing temporal aggregation, temporal joins, temporal filtering and

so on; these are doable in a principled (and hence efficient) manner only if the

centrality of time is established.

• There is no “one size fits all” solution to problems such as temporal graph indexing.

A solution ideal for snapshot retrieval cannot be ideal for node-version retrieval.

At the core, there exists a natural trade-off when it comes to the appropriate scale of

temporal aggregation (or segmentation), and for that matter, partitioning granularity

(or clustering) as well. A good solution is to design a system that easily general-

izes to specific scenarios based on configuration of parameters. In our experience,

such a system also worked well for a given mixed query load. Also, even in the

case of configuration for the best case performance for one type, it still performs

reasonably well for the other types of queries. If the users so desire, in the case of

optimal performance requirement for multiple query times at once, they may decide

to create multiple indexes with different specifications.

6.2 Future Directions

In this section, we discuss some of the interesting and potentially impactful problems in graph analytics besides historical graph data management.

6.2.1 Unified Temporal Graph Processing Framework

Temporal graph analytics can be seen in two realms – historical and streaming. While historical analytics is based upon analyzing past data in large volumes using indexes, the streaming scenario is more about quick insights in a limited context of the current updates and certain running (ongoing) summary statistics. While set up differently, both have var- ious technical commonalities as well as shared objectives. The most prominent goal com- mon to both is perhaps making future predictions on analytical insights. To date, there is no work which harnesses the combined power of analytics from historical and streaming graphs. We believe that a system that integrates the data management aspects of the past

(historical) and present (streaming) in graphs, is perhaps the best for future (predictive analytics) insights. Data management and query estimation for streaming graphs is an ac-

tive topic of research, and some background on it can be found in Section 5.2. There has been some work on combining historical and streaming query processing for non-graph datasets. Chandrasekaran et al. [43] propose the use of summaries of historical data along with live streaming data to combine analytical reasoning from both sources. Reiss et al. [109] propose a system to enhance stream query processing by correlating current events with historical trends in TelegraphCQ. Chandramouli et al. [42] propose a system for temporal (streams) query processing on large historical datasets using batch processing frameworks such as Map-Reduce.

6.2.2 Discovering Graphs of Interest

For a user interested in performing graph analysis on the entities in a relational dataset, constructing meaningful graphs requires careful examination of the relational schema, as well as the data itself. Seemingly natural interconnection structures can turn out to be too sparse or too dense or too disconnected to lead to meaningful insights. Manu- ally trying out different possibilities is a time-consuming process, and may miss out on insightful graphs. This issue occurs in temporal analysis as well; in addition to the graph- ical structure interpretation of attributes from a relational dataset, one or more timestamp

fields define the temporality of the network changes. System logs, social media content, telephone call logs, and scientific observational datasets are a few examples of time-annotated datasets that lead to several temporal graph interpretations and make the datasets candidates for temporal graph analysis. To aid the user in this process, we are building a graph enumeration framework that performs automated discovery of graphs in a dataset

using a collection of universal rules. We briefly sketch the framework here; we base it on non-temporal graphs for simplicity of explanation but the formulation naturally extends to the temporal dimension. In a normalized database, we consider all primary keys to be candidate node sets. For example, in the DBLP dataset, the set of author-ids is a candidate node set, and so is the set of publication-ids. The process of graph discovery hinges upon finding possible connections for a node set with another (including itself).

These connections can be found through data description constraints such as foreign keys, functional dependencies, and attribute co-memberships for a table. More concretely, the problem of discovering homogeneous or bi-partite graphs (i.e., over one or two node sets) can be formalized as follows. Let $N$ denote the set of primary keys, and $N'$ denote the set of all remaining attributes in the schema. Let $G$ denote a graph (called the schema graph) over $N \cup N'$, where an edge indicates either a foreign key relationship between a primary key and another attribute in a different table, or a functional dependency between a primary key and another attribute in the same table. A potential homogeneous or bi-partite graph in this dataset can be described as a pair $(X, Y)$, $X, Y \in N$ ($X$ and

$Y$ may be identical), and a path $p$ connecting the two in the schema graph $G$ (the path may traverse the same edge twice, and may visit a node multiple times). A larger set of graphs can be enumerated by adding rules that generate filtering conditions or aggregate operations using the schema graph. Our current implementation supports enumerating graphs without those, and we are working on developing a comprehensive framework for supporting those types of rules.

Besides discovering useful graphs through enumeration, it is also helpful to be able to conveniently extract and analyze specific graphs of interest without going through a

cumbersome and costly process of writing and deploying code. In our recent work, we describe a system called GRAPHGEN [139] that enables users to declaratively specify graph extraction tasks over relational databases, to visually explore the extracted graphs, and to write and execute graph algorithms over them. It demonstrates how unifying the extraction tasks and the graph algorithms enables significant optimizations that would not be possible otherwise.

6.2.3 A Framework for Finding Temporal Patterns in Evolving Graphs

Subgraph pattern matching is a fundamental and powerful operation in graph mining.

It captures a variety of search patterns using the notions of subgraph isomorphism and bounded simulation amongst others. In recent years, there has been some work on incre- mental graph pattern matching which processes updates to a graph to find newer matches.

However, the notion of the subgraph pattern itself has largely been restricted to a static sense. We believe that patterns that themselves express a dynamic component, either through temporal constraints or the order of occurrence of events, can greatly enhance the expressivity of graph analytical queries. On the other hand, there is also a lack of a principled framework for subgraph pattern matching in historical graphs, in terms of decisions on separating what and how to index from what to process after index retrieval, and so on. A general framework to efficiently perform a variety of temporal pattern matching queries in graphs seems an interesting and useful area to explore.

6.2.4 Shared Computation Across Graph Versions

While we have discussed at length the shared storage and retrieval of overlapping graph components between different graph snapshots or versions, utilizing shared computation for shared data across different graphs is also an interesting direction we would like to explore. We would like to point the reader to some of the background work by Ren et al. [110] and Xie et al. [138].

6.2.5 Estimating Changes in Graph Properties

For a graph property $f()$, studying the correlation between the change in a graph, $\Delta G$, and the corresponding change, $f(\Delta G)$, would offer better insights into the selection of impactful timepoints. That in turn would make the computation for evolution-based or comparison-oriented tasks more efficient. The naive way of analyzing the evolution of a graph quantity is to compute it at each time point of change in the graph, or alternatively at uniformly separated timepoints. More generally, estimators for the change in a complex quantity, $f(\Delta G)$, based on the knowledge of the change in simpler quantities, $h_1(\Delta G), h_2(\Delta G), \ldots$, can be helpful if the latter are historically indexed, and potentially useful for predicting important timepoints of change in $f(\Delta G)$ at query time.

Bibliography

[1] Apache Giraph. http://giraph.apache.org.

[2] Apache Hadoop. https://hadoop.apache.org/.

[3] Apache HBase. http://hbase.apache.org/.

[4] Blueprints. http://github.com/tinkerpop/blueprints.

[5] Cassovary. https://github.com/twitter/cassovary.

[6] CloudGraph. http://www.cloudgraph.com/.

[7] Furnace. http://github.com/tinkerpop/furnace/wiki.

[8] Gremlin. http://github.com/tinkerpop/gremlin/wiki.

[9] Kyoto cabinet. http://fallabs.com/kyotocabinet.

[10] Neo4j. http://www.neo4j.org.

[11] Redis Graph. https://github.com/tblobaum/redis-graph/.

[12] SAIL. https://github.com/tinkerpop/blueprints/wiki/Sail-Implementation.

[13] SPARQL. http://www.w3.org/TR/rdf-sparql-query.

[14] Tinker graph. https://github.com/tinkerpop/blueprints/wiki/

TinkerGraph.

[15] Titan: Distributed graph database. http://thinkaurelius.github.io/

titan/.

[16] VertexDB. http://www.dekorte.com/projects/opensource/

vertexdb/.

[17] Jans Aasman. Allegro graph: RDF triple database. Technical report, Franz Incor-

porated, 2006.

[18] Charu C Aggarwal and Haixun Wang. Graph data management and mining: A

survey of algorithms and applications. In Managing and Mining Graph Data,

pages 13–68. Springer, 2010.

[19] Charu C. Aggarwal and Haixun Wang. Managing and Mining Graph Data.

Springer Publishing Company, Incorporated, 1st edition, 2010.

[20] Charu C Aggarwal, Yuchen Zhao, and Philip S Yu. Outlier detection in graph

streams. In Data Engineering (ICDE), 2011 IEEE 27th International Conference

on, pages 399–409. IEEE, 2011.

[21] Jae-wook Ahn, Catherine Plaisant, and Ben Shneiderman. A task taxonomy for

network evolution analysis. IEEE Transactions on Visualization and Computer

Graphics, 2014.

[22] Jae-wook Ahn, Meirav Taieb-Maimon, Awalin Sopan, Catherine Plaisant, and Ben

Shneiderman. Temporal visualization of social network dynamics: Prototypes for

nation of neighbors. In Social computing, behavioral-cultural modeling and pre-

diction, pages 309–316. Springer, 2011.

[23] Kook Jin Ahn, Sudipto Guha, and Andrew McGregor. Graph sketches: Sparsifi-

cation, spanners, and subgraphs. In Proceedings of the 31st ACM Symposium on

Principles of Database Systems, pages 5–14, 2012.

[24] Mohammed Al-Kateb, Alain Crolotte, Ahmad Ghazal, and Linda Rose. Adding a

temporal dimension to the TPC-H benchmark. In Selected Topics in Performance

Evaluation and Benchmarking, pages 51–59. 2013.

[25] Lars Arge and Jeffrey Scott Vitter. Optimal dynamic interval management in ex-

ternal memory. In Proceedings of the 37th Annual Symposium on Foundations of

Computer Science, pages 560–569. IEEE, 1996.

[26] Sitaram Asur, Srinivasan Parthasarathy, and Duygu Ucar. An event-based frame-

work for characterizing the evolutionary behavior of interaction graphs. ACM

Transactions on Knowledge Discovery from Data (TKDD), 3(4):16, 2009.

[27] Bahman Bahmani, Abdur Chowdhury, and Ashish Goel. Fast incremental and

personalized pagerank. Proceedings of the International Conference on Very Large

Data Bases (VLDB), 2010.

[28] Eytan Bakshy, Itamar Rosenn, Cameron Marlow, and Lada Adamic. The role of

social networks in information diffusion. In Proceedings of the 21st international

conference on World Wide Web, pages 519–528. ACM, 2012.

[29] Alain Barrat, Marc Barthelemy, and Alessandro Vespignani. Dynamical processes

on complex networks. Cambridge University Press Cambridge, 2008.

[30] Mathieu Bastian, Sebastien Heymann, Mathieu Jacomy, et al. Gephi: an open

source software for exploring and manipulating networks. Proceedings of the 3rd

International AAAI Conference on Weblogs and Social Media (ICWSM), 8:361–

362, 2009.

[31] Fabian Beck, Michael Burch, Stephan Diehl, and Daniel Weiskopf. The state of the

art in visualizing dynamic graphs. In Proceedings of the Eurographics Conference

on Visualization – State of The Art Reports, 2014.

[32] Tanya Y Berger-Wolf and Jared Saia. A framework for analysis of dynamic social

networks. In Proceedings of the 12th ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining, pages 523–528. ACM, 2006.

[33] Souvik Bhattacherjee, Amit Chavan, Silu Huang, Amol Deshpande, and Aditya

Parameswaran. Principles of dataset versioning: Exploring the recreation/storage

tradeoff. In Proceedings of the 41st International Conference on Very Large Data

Bases (VLDB), 2015.

[34] Gabriele Blankenagel and Ralf Hartmut Gueting. External segment trees. Algo-

rithmica, 12(6):498–532, 1994.

[35] Michael Böhlen, Johann Gamper, and Christian S Jensen. Multi-dimensional ag-

gregation for temporal data. In Proceedings of the 10th International Conference

on Extending Database Technology (EDBT), pages 257–275, 2006.

[36] Azad Bolour, Tera Lougenia Anderson, Luc J Dekeyser, and Harry K. T. Wong.

The role of time in information processing: a survey. Proceedings of the ACM

SIGMOD International Conference on Management of Data, 1982.

[37] John Adrian Bondy and Uppaluri Siva Ramachandra Murty. Graph theory with

applications, volume 290. Macmillan London, 1976.

[38] P. Buneman, S. Khanna, K. Tajima, and W. Tan. Archiving scientific data. ACM

Transactions on Database Systems (TODS), 29(1):2–42, 2004.

[39] Zhuhua Cai, Dionysios Logothetis, and Georgos Siganos. Facilitating real-time

graph mining. In Proceedings of the fourth international workshop on Cloud data

management (CloudDB), 2012.

[40] Alessandro Camerra, Jin Shieh, Themis Palpanas, Thanawin Rakthanmanon, and

Eamonn Keogh. Beyond one billion time series: indexing and mining very large

time series collections with iSAX2+. Knowledge and Information Systems, pages

1–29, 2013.

[41] Deepayan Chakrabarti and Christos Faloutsos. Graph mining: laws, tools, and case

studies. Synthesis Lectures on Data Mining and Knowledge Discovery, 7(1):1–207,

2012.

[42] Badrish Chandramouli, Jonathan Goldstein, and Songyun Duan. Temporal ana-

lytics on big data for web advertising. In Proceedings of IEEE 28th International

Conference on Data Engineering (ICDE), pages 90–101, 2012.

[43] Sirish Chandrasekaran and Michael Franklin. Remembrance of streams past:

overload-sensitive management of archived streams. In Proceedings of the Thirti-

eth international conference on Very large data bases-Volume 30, pages 348–359.

VLDB Endowment, 2004.

[44] Tain-Jy Chen. Network resources for internationalization: The case of Taiwan’s

electronics firms. Journal of Management Studies, 40(5):1107–1130, 2003.

[45] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu,

Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. Kineograph: taking the

pulse of a fast-changing and connected world. In In Proceedings of the 7th ACM

european conference on Computer Systems (EUROSYS), 2012.

[46] Chris Clarke. Using mongodb as a graph database. http://engineering.

talis.com/articles/using-mongodb-as-graph-db/.

[47] Chris Date, Hugh Darwen, and Nikos Lorentzos. Temporal data and the relational

model. Elsevier, 2002.

[48] Prasanna Desikan, Nishith Pathak, Jaideep Srivastava, and Vipin Kumar. Incre-

mental page rank computation on evolving graphs. In ACM Special interest tracks

and posters of at WWW, 2005.

[49] David Eisenberg, Edward M Marcotte, Ioannis Xenarios, and Todd O Yeates. Pro-

tein function in the post-genomic era. Nature, 405(6788):823–826, 2000.

[50] Cesim Erten, Philip J Harding, Stephen G Kobourov, Kevin Wampler, and Gary

Yee. Exploring the computing literature using temporal graph visualization. In

Electronic Imaging, pages 45–56. International Society for Optics and Photonics,

2004.

[51] Wenfei Fan, Xin Wang, and Yinghui Wu. Incremental graph pattern matching.

ACM Transactions on Database Systems (TODS), 38(3):18, 2013.

[52] Jun Gao, Chang Zhou, Jiashuai Zhou, and Jeffrey Xu Yu. Continuous pattern

detection over billion-edge graph using distributed framework. In Data Engineer-

ing (ICDE), 2014 IEEE 30th International Conference on, pages 556–567. IEEE,

2014.

[53] Bugra Gedik and Rajesh Bordawekar. Disk-based management of interaction

graphs. IEEE Transactions on Knowledge and Data Engineering, 26(11):2689–

2702, 2014.

[54] S. Ghandeharizadeh, R. Hull, and D. Jacobs. Heraclitus: elevating deltas to be

first-class citizens in a database programming language. ACM Transactions on

Database Systems (TODS), 21(3), 1996.

[55] Amine Ghrab, Sabri Skhiri, Salim Jouili, and Esteban Zimányi. An analytics-

aware conceptual model for evolving graphs. In Data Warehousing and Knowledge

Discovery. Springer, 2013.

[56] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J.

Franklin, and Ion Stoica. GraphX: Graph processing in a distributed dataflow

framework. In Proceedings of the USENIX Symposium on Operating Systems De-

sign and Implementation (OSDI), 2014.

[57] Fabio Grandi. T-SPARQL: A TSQL2-like temporal query language for RDF.

In Proceedings of the Fourteenth East-European Conference on Advances in

Databases and Information Systems, 2010.

[58] Derek Greene, Donal Doyle, and Padraig Cunningham. Tracking the evolution of

communities in dynamic social networks. In Proceedings of the 2010 international

conference on advances in social networks analysis and mining (ASONAM), pages

176–183. IEEE, 2010.

[59] Thilo Gross, Carlos J Dommar D’Lima, and Bernd Blasius. Epidemic dynamics

on an adaptive network. Physical Review Letters, 96(20):208701, 2006.

[60] Ranjay Gulati and Martin Gargiulo. Where do interorganizational networks come

from? American Journal of Sociology, 104:177–231, 1999.

[61] Jing-Dong J Han, Nicolas Bertin, Tong Hao, Debra S Goldberg, Gabriel F Berriz,

Lan V Zhang, Denis Dupuy, Albertha JM Walhout, Michael E Cusick, Frederick P

Roth, et al. Evidence for dynamically organized modularity in the yeast protein–

protein interaction network. Nature, 430(6995):88–93, 2004.

[62] Wentao Han, Youshan Miao, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vi-

jayan Prabhakaran, Wenguang Chen, and Enhong Chen. Computing shortest,

fastest, and foremost journeys in dynamic networks. International Journal of Foun-

dations of Computer Science, 2003.

[63] Wentao Han, Youshan Miao, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vi-

jayan Prabhakaran, Wenguang Chen, and Enhong Chen. Chronos: a graph engine

for temporal graph analysis. In ACM European Conference on Computer Systems,

2014.

[64] Derek Hansen, Ben Shneiderman, and Marc A Smith. Analyzing social media net-

works with NodeXL: Insights from a connected world. Morgan Kaufmann, 2010.

[65] Huahai He and Ambuj K Singh. Graphs-at-a-time: query language and access

methods for graph databases. In Proceedings of the 2008 ACM SIGMOD interna-

tional conference on management of data, pages 405–418. ACM, 2008.

[66] Petter Holme and Jari Saramäki. Temporal networks. Physics Reports, 519(3):97–

125, 2012.

[67] Wenyu Huo and Vassilis J Tsotras. Efficient temporal shortest path queries on

evolving social graphs. In Proceedings of the International Conference on Scien-

tific and Statistical Database Management (SSDBM), 2014.

[68] U Kang, Hanghang Tong, Jimeng Sun, Ching-Yung Lin, and Christos Faloutsos.

GBase: a scalable and general graph management system. In Proceedings of the

ACM SIGKDD International Conference on Knowledge Discovery and Data Min-

ing, 2011.

[69] U Kang, Charalampos E Tsourakakis, and Christos Faloutsos. Pegasus: A peta-

scale graph mining system implementation and observations. In Ninth IEEE Inter-

national Conference on Data Mining, pages 229–238. IEEE, 2009.

[70] Martin Kaufmann, Amin Amiri Manjili, Panagiotis Vagenas, Peter Michael Fis-

cher, Donald Kossmann, Franz Färber, and Norman May. Timeline index: A uni-

fied data structure for processing queries on temporal data in SAP HANA. In ACM

SIGMOD International Conference on Management of Data, 2013.

[71] Matt Kaufmann, Amin Amiri Manjili, Stefan Hildenbrand, Donald Kossmann, and

Andreas Tonder. Time travel in column stores. In Proceedings of the IEEE Inter-

national Conference on Data Engineering (ICDE), 2013.

[72] Natalie Kerracher, Jessie Kennedy, and Kevin Chalmers. A task taxonomy for

temporal graph visualisation. 2015.

[73] Udayan Khurana and Amol Deshpande. Efficient snapshot retrieval over histor-

ical graph data. In Proceedings of the IEEE International Conference on Data

Engineering (ICDE), 2013.

[74] Udayan Khurana and Amol Deshpande. HiNGE: Enabling temporal network ana-

lytics at scale. ACM SIGMOD International Conference on Management of Data,

2013.

[75] Udayan Khurana and Amol Deshpande. Storing and analyzing historical graph

data at scale. Under Submission, 2015.

[76] Udayan Khurana, Viet-An Nguyen, Hsueh-Chien Cheng, Jae-wook Ahn, Xi Chen,

and Ben Shneiderman. Visual analysis of temporal trends in social networks using

edge color coding and metric timelines. In Proceedings of the IEEE Third Interna-

tional Conference on Social Computing (SocialCom), pages 549–554. IEEE, 2011.

[77] Georgia Koloniari and Evaggelia Pitoura. Partial view selection for evolving so-

cial graphs. In ACM First International Workshop on Graph Data Management

Experiences and Systems, page 9, 2013.

[78] L. Kou, G. Markowsky, and L. Berman. A fast algorithm for steiner trees. Acta

Informatica, 15(2):141–145, 1981.

[79] Ravi Kumar, Jasmine Novak, and Andrew Tomkins. Structure and evolution of

online social networks. In Link mining: models, algorithms, and applications,

pages 337–357. Springer, 2010.

[80] Aapo Kyrola, Guy E Blelloch, and Carlos Guestrin. GraphChi: Large-scale graph

computation on just a PC. In Proceedings of the USENIX Symposium on Operating

Systems Design and Implementation (OSDI), 2012.

[81] Alan G Labouseur, Jeremy Birnbaum, Paul W Olsen Jr, Sean R Spillane, Jayade-

van Vijayan, Jeong-Hyon Hwang, and Wook-Shin Han. The g* graph database:

efficiently managing large distributed dynamic graphs. Distributed and Parallel

Databases, pages 1–36, 2014.

[82] Avinash Lakshman and Prashant Malik. Cassandra: a decentralized structured

storage system. ACM SIGOPS Operating Systems Review, 44(2):35–40, 2010.

[83] Nicole Lam and Raymond K Wong. A fast index for XML document version

management. In Web Technologies and Applications, pages 71–82. Springer, 2003.

[84] Kristina Lerman and Rumi Ghosh. Information contagion: An empirical study

of the spread of news on digg and twitter social networks. In Proceedings of the

Fourth International AAAI Conference on Weblogs and Social Media, volume 10,

pages 90–97, 2010.

[85] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graph evolution: Densifi-

cation and shrinking diameters. ACM Transactions on Knowledge Discovery from

Data (TKDD), 1(1):2, 2007.

[86] Jure Leskovec and Rok Sosič. SNAP: A general purpose network analysis and

graph mining library in C++. http://snap.stanford.edu/snap, June

2014.

[87] David Lomet, Roger Barga, Mohamed F Mokbel, and German Shegalov. Trans-

action time support inside a database engine. In Proceedings of the IEEE Interna-

tional Conference on Data Engineering (ICDE), 2006.

[88] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola,

and Joseph M Hellerstein. Distributed GraphLab: a framework for machine learn-

ing and data mining in the cloud. Proceedings of the International Conference on

Very Large Data Bases (VLDB), 2012.

[89] Peter Macko, Virendra J. Marathe, Daniel W. Margo, and Margo I. Seltzer.

LLAMA: Efficient graph analytics using large multiversioned arrays. In Proceed-

ings of the IEEE International Conference on Data Engineering (ICDE), 2015.

[90] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn,

Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph

processing. In Proceedings of the 2010 ACM SIGMOD International Conference

on Management of data, pages 135–146. ACM, 2010.

[91] A. Marian, S. Abiteboul, G. Cobena, and L. Mignet. Change-centric management

of versions in an XML warehouse. In Proceedings of the International Conference

on Very Large Data Bases (VLDB), 2001.

[92] Ian A McCulloh and Kathleen M Carley. Social network change detection. Tech-

nical report, Center for the Computational Analysis, 2008.

[93] Andrew McGregor. Graph stream algorithms: A survey. ACM SIGMOD Rec.,

43(1):9–20, May 2014.

[94] Youshan Miao, Wentao Han, Kaiwei Li, Ming Wu, Fan Yang, Lidong Zhou, Vijayan Prabhakaran, Enhong Chen, and Wenguang Chen. ImmortalGraph: A system for storage and analysis of temporal graphs. ACM Transactions on Storage (TOS), December 2015.

[95] Boris Motik. Representing and querying validity time in RDF and OWL: A logic-based approach. Journal of Web Semantics, 12:3–21. Elsevier, 2012.

[96] Saket Navlakha and Carl Kingsford. Network archaeology: uncovering ancient networks from present-day interactions. PLoS Computational Biology, 7(4), 2011.

[97] Vit Niennattrakul, Pongsakorn Ruengronghirunya, and Chotirat Ann Ratanamahatana. Exact indexing for massive time series databases under time warping distance. Data Mining and Knowledge Discovery, 21(3):509–541, 2010.

[98] Michael A Olson, Keith Bostic, and Margo I Seltzer. Berkeley DB. In USENIX Annual Technical Conference, FREENIX Track, pages 183–191, 1999.

[99] Joshua O’Madadhain, Danyel Fisher, Padhraic Smyth, Scott White, and Yan-Biao Boey. Analysis and visualization of network data using JUNG. Journal of Statistical Software, 10(2):1–35, 2005.

[100] Gültekin Özsoyoğlu and Richard T Snodgrass. Temporal and real-time databases: A survey. IEEE Transactions on Knowledge and Data Engineering, 7(4):513–532, 1995.

[101] Gergely Palla, Albert-László Barabási, and Tamás Vicsek. Quantifying social group evolution. Nature, 446(7136):664–667, 2007.

[102] Christopher R. Palmer, Phillip B. Gibbons, and Christos Faloutsos. ANF: A fast and scalable tool for data mining in massive graphs. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 81–90. ACM, 2002.

[103] Raj Kumar Pan and Jari Saramäki. Path lengths, correlations, and centrality in temporal networks. Physical Review E, 2011.

[104] Jorge Pérez, Marcelo Arenas, and Claudio Gutierrez. Semantics and complexity of SPARQL. ACM Transactions on Database Systems (TODS), 34(3):16, 2009.

[105] Teresa M Przytycka, Mona Singh, and Donna K Slonim. Toward the dynamic interactome: it’s about time. Briefings in Bioinformatics, page bbp057, 2010.

[106] Andrea Pugliese, Octavian Udrea, and VS Subrahmanian. Scaling RDF with time. In Proceedings of the 17th International Conference on World Wide Web, pages 605–614. ACM, 2008.

[107] Josep M Pujol, Vijay Erramilli, Georgos Siganos, Xiaoyuan Yang, Nikos Laoutaris, Parminder Chhabra, and Pablo Rodriguez. The little engine(s) that could: scaling online social networks. In ACM SIGCOMM Computer Communication Review, 2011.

[108] Ravi Rajamani. Oracle Total Recall / Flashback Data Archive. Technical report, Oracle, 2007.

[109] Frederick Reiss, Kurt Stockinger, Kesheng Wu, Arie Shoshani, and Joseph M Hellerstein. Enabling real-time querying of live and historical stream data. In Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), pages 28–28. IEEE, 2007.

[110] Chenghui Ren, Eric Lo, Ben Kao, Xinjie Zhu, and Reynold Cheng. On querying historical evolving graph sequences. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2011.

[111] Semih Salihoglu and Jennifer Widom. GPS: A graph processing system. Technical Report 1039, Stanford University, 2012.

[112] B. Salzberg and V. Tsotras. Comparison of access methods for time-evolving data. ACM Computing Surveys, 31(2), 1999.

[113] Cynthia M Saracco, Matthias Nicola, and Lenisha Gandhi. A matter of time: Temporal data management in DB2 10. Technical report, IBM, 2012.

[114] Atish Das Sarma, Sreenivas Gollapudi, and Rina Panigrahy. Estimating PageRank on graph streams. Journal of the ACM (JACM), 58(3):13, 2011.

[115] A. Seering, P. Cudre-Mauroux, S. Madden, and M. Stonebraker. Efficient versioning for scientific array databases. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2012.

[116] Jiwon Seo, Sini Guo, and Monica S Lam. SociaLite: Datalog extensions for efficient social network analysis. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013.

[117] Bin Shao, Haixun Wang, and Yatao Li. Trinity: A distributed graph engine on a memory cloud. In ACM SIGMOD International Conference on Management of Data, 2013.

[118] Dennis Shasha, Jason TL Wang, and Rosalba Giugno. Algorithmics and applications of tree and graph searching. In Proceedings of the Twenty-First ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 39–52. ACM, 2002.

[119] R. Snodgrass and I. Ahn. A taxonomy of time in databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 236–246, 1985.

[120] Richard T. Snodgrass, editor. The TSQL2 Temporal Query Language. Kluwer, 1995.

[121] John Snow. On the mode of communication of cholera. John Churchill, 1855.

[122] Emad Soroush and Magdalena Balazinska. Time travel in a scientific array database. In Proceedings of the IEEE International Conference on Data Engineering (ICDE), 2013.

[123] Lei Tang, Huan Liu, Jianping Zhang, and Zohreh Nazeri. Community evolution in dynamic multi-mode networks. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 677–685. ACM, 2008.

[124] A. Tansel, J. Clifford, S. Gadia, S. Jajodia, A. Segev, and R. Snodgrass, editors. Temporal Databases: Theory, Design, and Implementation. Benjamin Cummings Publishing Co., 1993.

[125] Chayant Tantipathananandh, Tanya Berger-Wolf, and David Kempe. A framework for community identification in dynamic social networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 717–726. ACM, 2007.

[126] Jonas Tappolet and Abraham Bernstein. Applied temporal RDF: Efficient temporal querying of RDF data with SPARQL. In The Semantic Web: Research and Applications, pages 308–322. Springer, 2009.

[127] Ian W Taylor, Rune Linding, David Warde-Farley, Yongmei Liu, Catia Pesquita, Daniel Faria, Shelley Bull, Tony Pawson, Quaid Morris, and Jeffrey L Wrana. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nature Biotechnology, 27(2):199–204, 2009.

[128] Michael Taylor, Timothy J Taylor, and Istvan Z Kiss. Epidemic threshold and control in a dynamic network. Physical Review E, 85(1):016103, 2012.

[129] K Tuncay Tekle, Michael Gorbovitski, and Yanhong A Liu. Graph queries through Datalog optimizations. In Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (PPDP), pages 25–34. ACM, 2010.

[130] Claudio Tesoriero. Getting Started with OrientDB. Packt Publishing Ltd, 2013.

[131] Masashi Toyoda and Masaru Kitsuregawa. A system for visualizing and analyzing the evolution of the web with a time series of graphs. In Proceedings of the Sixteenth ACM Conference on Hypertext and Hypermedia, pages 151–160. ACM, 2005.

[132] Vassilis J Tsotras and Nickolas Kangelaris. The snapshot index: an I/O-optimal access method for timeslice queries. Information Systems, 20(3):237–260, 1995.

[133] Gergely Varró, Dániel Varró, and Andy Schürr. Incremental graph pattern matching: Data structures and initial experiments. Electronic Communications of the EASST, 4, 2006.

[134] Erik Volz and Lauren Ancel Meyers. Epidemic thresholds in dynamic contact networks. Journal of the Royal Society Interface, 6(32):233–241, 2009.

[135] Changliang Wang and Lei Chen. Continuous subgraph pattern search over graph streams. In Proceedings of the IEEE 25th International Conference on Data Engineering (ICDE), pages 393–404. IEEE, 2009.

[136] Xiaoyue Wang, Abdullah Mueen, Hui Ding, Goce Trajcevski, Peter Scheuermann, and Eamonn Keogh. Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, 26(2):275–309, 2013.

[137] Douglas Brent West et al. Introduction to graph theory, volume 2. Prentice Hall, Upper Saddle River, 2001.

[138] Wenlei Xie, Yuanyuan Tian, Yannis Sismanis, Andrey Balmin, and Peter J Haas. Dynamic interaction graphs with probabilistic edge decay.

[139] Konstantinos Xirogiannopoulos, Udayan Khurana, and Amol Deshpande. GraphGen: Exploring interesting graphs in relational data. In Proceedings of the 41st International Conference on Very Large Data Bases (VLDB), 2015.

[140] Kevin S Xu, Mark Kliger, and Alfred O Hero III. Visualizing the temporal evolution of dynamic networks. In Proceedings of the 9th Workshop on Mining and Learning with Graphs, 2011.

[141] Ji Soo Yi, Niklas Elmqvist, and Seungyoon Lee. TimeMatrix: Analyzing temporal social networks using interactive matrix-based visualizations. International Journal of Human–Computer Interaction, 26(11-12):1031–1051, 2010.

[142] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.
