Graph Database Management Systems Oshini Goonetilleke
Total Page:16
File Type:pdf, Size:1020Kb
Graph Database Management Systems Storage, Management and Query Processing A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy Oshini Goonetilleke School of Science College of Science, Engineering, and Health RMIT University December, 2017 Declaration I certify that except where due acknowledgement has been made, the work is that of the author alone; the work has not been submitted previously, in whole or in part, to qualify for any other academic award; the content of the thesis is the result of work which has been carried out since the official commencement date of the approved research program; and, any editorial work, paid or unpaid, carried out by a third party is acknowledged; and, ethics procedures and guidelines have been followed. Oshini Goonetilleke School of Science RMIT University 20th December, 2017 Acknowledgements First, I would like to thank my supervisor, Timos Sellis, for all his advice, support and patience throughout my PhD journey. Over the past four years while working with Timos, I have learned many things about research and many important lessons in life. I feel privileged to have worked under his guidance. I would also like to thank my other supervisor, Jenny Zhang, for her advice and support during my PhD. I had an amazing group of collaborators, each contributing and supporting me in different ways. I would like to thank each of them: Saket Sathe, Danai Koutra, David Meibusch, Ben Barham and Kewen Liao. Saket for his feedback and advice during the early years of my PhD and his humorous insights; and Danai for long conversations about research problems at odd hours in the US, and suggesting interesting directions for my work. Working with all these collaborators was a great learning experience for me. I spent a great summer at Oracle Labs during my internship with them. I enjoyed working with a small team having many insightful discussions and learned different approaches to prob- lem solving. I believe this time was truly a turning point in my PhD. I’m particularly thankful to David Meibusch who has been a great mentor. My PhD experience was made so much easier because of my friends scattered throughout the world; especially Pathi for our long chats about work, life and everything else in between. I would specially like to thank Ammi (mother) and Nangi (sister) who have missed me so much back home during this period but have given me emotional and moral support all the way. Thank you for always believing in me. I would also like to thank my late father who has been the silent inspiration behind everything I do. Finally I thank my husband, Dineth, for his love, patience and encouragement; he has been very much my biggest pillar of strength. Contents Declaration ii Acknowledgement iii Contents iv List of Figures x List of Tables xiii Abstract xv 1 Introduction1 1.1 Graph Data Management Systems ......................... 1 1.2 Motivation and Contributions............................ 2 1.2.1 Modeling large scale applications in a GDBMS.............. 3 1.2.2 Storage and indexing issues in GDBMS .................. 4 1.2.3 Query processing in GDBMS........................ 4 1.3 Publications...................................... 6 1.4 Thesis Overview ................................... 7 2 Background8 2.1 Preliminaries ..................................... 8 2.2 Graph Representation ................................ 11 2.3 Real-world Graphs and Applications........................ 14 2.4 Graph Data Management Systems ......................... 16 2.4.1 Property Graph Data Model ........................ 16 iv CONTENTS v 2.4.2 Options for modeling graph data...................... 18 2.4.2.1 Relational model with SQL queries................ 18 2.4.2.2 RDF model with SPARQL queries................ 19 2.4.2.3 Native graph model with custom query languages . 20 2.4.3 Graph Database Systems .......................... 21 2.4.3.1 Neo4j................................ 21 2.4.3.2 Sparksee .............................. 23 2.4.3.3 Titan................................ 26 2.4.3.4 Graph database system summary ................ 26 2.4.4 Graph processing systems.......................... 27 2.5 Summary ....................................... 28 3 Graph Database Systems for Microblogging Analytics 29 3.1 Introduction...................................... 30 3.2 Data Collection.................................... 32 3.3 Data Management Frameworks........................... 34 3.3.1 Focused Crawlers............................... 34 3.3.2 Pre-processing and Information Extraction ................ 34 3.3.3 Generic Platforms .............................. 35 3.3.4 Application-specific Platforms........................ 36 3.3.5 Support for Visualization Interfaces .................... 37 3.3.6 Discussion: Data Model and Storage Mechanisms............. 37 3.4 Languages for Querying Tweets........................... 38 3.4.1 Generic Languages.............................. 39 3.4.2 Query Languages for Social Networks ................... 40 3.4.3 Information Retrieval - Tweet Search.................... 40 3.4.4 Discussion: Data Model and Storage for the Languages ......... 41 3.5 Requirements of an Integrated Solution ...................... 42 3.5.1 Focused crawler................................ 43 3.5.2 Pre-processor................................. 44 3.5.3 Data Model.................................. 45 3.5.4 Query Language ............................... 46 3.5.5 General Challenges in Data Management ................. 47 3.6 Graph Database Systems for Microblogging Queries................ 48 CONTENTS vi 3.6.1 Database Schema............................... 49 3.6.2 Graph Databases............................... 50 3.7 Data Ingestion and Query Processing........................ 51 3.7.1 Dataset and Pre-processing......................... 51 3.7.2 Data Ingestion ................................ 52 3.7.2.1 Neo4j................................ 53 3.7.2.2 Sparksee .............................. 54 3.7.3 Query Processing............................... 55 3.7.3.1 Basic Queries............................ 56 3.7.3.2 Advanced Queries......................... 57 3.7.3.3 Deriving Other Queries...................... 60 3.8 Discussion....................................... 60 3.8.1 Efficiency of alternate solutions....................... 61 3.8.2 Overhead for aggregate operations ..................... 62 3.8.3 Problems with the cold cache........................ 62 3.8.4 Processing keyword search on graphs.................... 62 3.9 Summary ....................................... 63 4 Evolving Dependency Graphs for Multi-versioned Codebases 64 4.1 Introduction...................................... 65 4.1.1 Chapter Organisation ............................ 67 4.2 Frappé background.................................. 67 4.2.1 Architecture.................................. 67 4.2.2 Graph Model................................. 68 4.2.3 Code Comprehension Queries........................ 70 4.3 Related Work..................................... 72 4.3.1 Evolving Graphs............................... 72 4.3.2 Industry Projects............................... 73 4.3.3 Source code analysis and other program meta-models........... 74 4.3.4 Syntactic and Semantic Differencing.................... 74 4.4 Versioning Dependency graphs ........................... 76 4.4.1 Potential solutions to versioning dependencies............... 76 4.4.1.1 Autonomous storage........................ 76 4.4.1.2 Delta storage............................ 76 CONTENTS vii 4.4.1.3 Use of an existing program meta-model............. 77 4.4.1.4 Proposed unified model...................... 77 4.4.2 Preliminaries of the Unified model ..................... 78 4.4.3 Queries in the Unified Model ........................ 79 4.5 Node and Edge Resolutions............................. 80 4.5.1 Resolution Rules............................... 80 4.5.1.1 Resolutions in a single version .................. 80 4.5.2 Versioned graph construction........................ 83 4.5.3 Model Improvements............................. 86 4.6 Evaluation....................................... 89 4.6.1 Datasets.................................... 89 4.6.2 Resolution Evaluation............................ 90 4.6.3 Discussion................................... 92 4.7 Queries on Versioned Graphs ............................ 93 4.7.1 Time-point queries.............................. 93 4.7.2 Time-interval Queries ............................ 95 4.8 Summary ....................................... 98 5 Edge Labeling Schemes for Graph Data 99 5.1 Introduction...................................... 100 5.1.1 Chapter Organisation ............................ 103 5.2 Related Work..................................... 103 5.2.1 Node arrangement .............................. 103 5.2.2 Graph compression and Space filling curves . 104 5.2.3 Graph partitioning, Community detection and Clustering . 105 5.2.4 Sparksee.................................... 106 5.3 Edge-labeling schemes ................................ 107 5.3.1 Problem formulation............................. 108 5.3.2 Labeling schemes............................... 110 5.3.2.1 Baselines for labeling....................... 111 5.3.2.2 Proposed Method: GrdRandom . 111 5.3.2.3 Proposed Method: FlipInOut . 111 5.4 Experimental Evaluation............................... 116 5.4.1 Experimental Setup ............................. 117 CONTENTS viii 5.4.2 Speedup of Queries.............................. 117 5.4.2.1 Friend-of-Friend (FoF) Queries