ASSIGNMENT OF MASTER’S THESIS

Title: Query Analysis on a Distributed Graph Database
Student: Bc. Lucie Svitáková
Supervisor: Ing. Michal Valenta, Ph.D.
Study Programme: Informatics
Study Branch: Web and Software Engineering
Department: Department of Software Engineering
Validity: Until the end of summer semester 2019/20

Instructions

1. Research current practices of data storage in distributed graph databases.
2. Choose an existing database system for your implementation.
3. Analyse how to extract the information necessary for further query analysis and implement that extraction.
4. Create a suitable method to store the data extracted in step 3.
5. Propose a method for a more efficient redistribution of the data.

References

Will be provided by the supervisor.

Ing. Michal Valenta, Ph.D.
Head of Department

doc. RNDr. Ing. Marcel Jiřina, Ph.D.
Dean

Prague, October 5, 2018

Master’s thesis

Query Analysis on a Distributed Graph Database

Bc. et Bc. Lucie Svitáková

Department of Software Engineering
Supervisor: Ing. Michal Valenta, Ph.D.

January 8, 2019

Acknowledgements

I would like to thank my supervisor Michal Valenta for his valuable suggestions and observations, my sister Šárka Svitáková for spelling corrections, my friends Martin Horejš and Tomáš Nováček for their valuable comments, and my family for their continual motivation and support.

Declaration

I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis.

I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended. In accordance with Article 46(6) of the Act, I hereby grant a nonexclusive authorization (license) to utilize this thesis, including any and all programs incorporated therein or attached thereto and all corresponding documentation (hereinafter collectively referred to as the “Work”), to any and all persons that wish to utilize the Work. Such persons are entitled to use the Work in any way (including for-profit purposes) that does not detract from its value. This authorization is not limited in terms of time, location and quantity. However, all persons that make use of the above license shall be obliged to grant a license at least in the same scope as defined above with respect to each and every work that is created (wholly or in part) based on the Work, by modifying the Work, by combining the Work with another work, by including the Work in a collection of works or by adapting the Work (including translation), and at the same time make available the source code of such work at least in a way and scope that are comparable to the way and scope in which the source code of the Work is made available.

In Prague on January 8, 2019 ......

Czech Technical University in Prague
Faculty of Information Technology
© 2019 Lucie Svitáková. All rights reserved.

This thesis is school work as defined by the Copyright Act of the Czech Republic. It has been submitted at Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act and its usage without author’s permission is prohibited (with exceptions defined by the Copyright Act).

Citation of this thesis

Svitáková, Lucie. Query Analysis on a Distributed Graph Database. Master’s thesis. Czech Technical University in Prague, Faculty of Information Technology, 2019.

Abstrakt

Although several graph database products intended for distributed environments exist today, the data in them are usually distributed across the individual physical nodes randomly, without any later examination of the load and possible reorganization of the data. This thesis analyses the current practices of data storage in distributed graph databases. Based on this analysis, it designs and implements a new module for the graph computing framework TinkerPop that logs the traffic generated by user queries. A standalone application for storing such logged data in the JanusGraph database is implemented as well. This program also runs a redistribution algorithm proposing a more efficient placement of the data in the cluster. An existing Pregel-compliant algorithm by Vaquero et al. is applied with substantial extensions. The result is a proposed reorganization of the data with a 70–80% improvement in communication among the cluster nodes. This result is comparable to another well-known method, Ja-be-Ja, which, however, requires considerably higher computational resources. On the other hand, the method in this thesis introduces a small imbalance on the cluster nodes. Finally, the thesis presents recommendations for possible future extensions and improvements.

Klíčová slova graph databases, graph partitioning, redistribution, NoSQL, JanusGraph, TinkerPop, Pregel

Abstract

Although graph database products intended for distributed environments already exist, the data are usually distributed randomly over the individual physical hosts, without any subsequent load examination and possible data reorganization. This thesis analyses the current practices of data storage in distributed graph databases. Based on this analysis, it designs and implements a new module for the general graph computing framework TinkerPop that logs the traffic generated by user queries. A separate application is provided that stores such logged data in the JanusGraph database and executes a redistribution algorithm proposing a more efficient distribution of the data. An existing Pregel-compliant algorithm by Vaquero et al. is applied with substantial enhancements. The proposed redistribution improves communication among the physical hosts of the cluster by 70–80%, which is comparable to another well-known method, Ja-be-Ja, that has much higher computational demands. On the other hand, the method in this thesis introduces a slight imbalance of the cluster nodes. Finally, this thesis presents suggestions for future extensions and enhancements.

Keywords graph databases, graph partitioning, redistribution, NoSQL, JanusGraph, TinkerPop, Pregel

Contents

Introduction 1

1 Theory of Graph Databases 3
1.1 Databases ...... 3
1.2 Graph Databases ...... 7

2 Graph Partitioning 13
2.1 Algorithms ...... 14
2.2 Pregel-Compliant Algorithms ...... 19

3 Technology 29
3.1 Distributed Graph Databases ...... 29
3.2 Apache TinkerPop ...... 37

4 Analysis and Design 43
4.1 Requirements Analysis ...... 44
4.2 Extracted Data Storage ...... 45
4.3 Architecture ...... 49
4.4 Graph Redistribution Manager ...... 50
4.5 Algorithm Design ...... 53

5 Implementation 57
5.1 Graph Database Selection ...... 57
5.2 Query Data Extraction ...... 58
5.3 Graph Redistribution Manager ...... 65

6 Evaluation 71
6.1 Datasets ...... 71
6.2 Evaluation Methods ...... 77
6.3 Settings ...... 78
6.4 Results ...... 79
6.5 Summary ...... 86

7 Enhancements 89
7.1 Query Data Extraction ...... 89
7.2 Graph Redistribution Manager ...... 89
7.3 Algorithm ...... 90
7.4 Experiments ...... 91

Conclusion 93

Bibliography 95

A README 105
A.1 Processed Result Logging ...... 105
A.2 Graph Redistribution Manager ...... 111

B Evaluation Results 117
B.1 Twitter Dataset ...... 117
B.2 Road Network Dataset ...... 126

C Contents of CD 135

List of Figures

1.1 DBMS popularity changes since 2013 ...... 7
1.2 Example of a property graph ...... 9
1.3 Example of an RDF graph ...... 9

2.1 The multilevel approach to graph partitioning ...... 17
2.2 Maximum Value Example in Pregel Model ...... 20
2.3 Label propagation without and with oscillation prevented ...... 22
2.4 Examples of two potential color exchanges ...... 26

3.1 Local Server Mode of JanusGraph with storage backend ...... 31
3.2 Remote Server Mode of JanusGraph with storage backend ...... 31
3.3 Remote Server Mode with Server of JanusGraph with storage backend ...... 31
3.4 Embedded Mode of JanusGraph with storage backend ...... 32
3.5 JanusGraph data layout ...... 33
3.6 OrientDB APIs ...... 34
3.7 Apache TinkerPop framework ...... 38
3.8 OLAP processing in Apache TinkerPop ...... 41

4.1 Architecture design ...... 51
4.2 Activity diagram of the main process in the Graph Redistribution Manager ...... 52

5.1 Query Data Extraction in different connection modes ...... 59
5.2 Structure of the Processed Result Logging module ...... 61
5.3 Structure of the Graph Redistribution Manager project ...... 66

6.1 Degree distribution of the Twitter dataset ...... 74
6.2 Degree distribution of the Road network dataset ...... 76
6.3 Development of cross-node communication improvement with various cooling factors ...... 80
6.4 Effect of cooling factor on cross-node communication and number of iterations ...... 81
6.5 Effect of imbalance factor on cross-node communication, number of iterations and capacity usage ...... 82
6.6 Effect of adoption factor on cross-node communication and number of iterations ...... 84
6.7 Effect of adoption factor on cross-node communication and number of iterations ...... 85

B.1 Development of cross-node communication improvement with various cooling factors ...... 117
B.2 Twitter dataset: effects of cooling factor on improvement, number of iterations and capacity usage ...... 118
B.3 Twitter dataset: effects of imbalance factor on improvement, number of iterations and capacity usage ...... 120
B.4 Twitter dataset: effects of adoption factor on improvement, number of iterations and capacity usage ...... 122
B.5 Twitter dataset: effects of number of nodes in a cluster on improvement, number of iterations and capacity usage ...... 124
B.6 Road network dataset: effects of cooling factor on improvement, number of iterations and capacity usage ...... 126
B.7 Road network dataset: effects of imbalance factor on improvement, number of iterations and capacity usage ...... 128
B.8 Road network dataset: effects of adoption factor on improvement, number of iterations and capacity usage ...... 130
B.9 Road network dataset: effects of number of nodes in a cluster on improvement, number of iterations and capacity usage ...... 132

List of Tables

2.1 Comparison of Pregel-compliant algorithms, Vaquero et al. and Ja-be-Ja ...... 27

3.1 Possible combination of storage backends and connection modes ...... 32

4.1 Functional and non-functional requirements for the Query Data Extraction ...... 46
4.2 Functional and non-functional requirements for the Graph Redistribution Manager ...... 47

6.1 Twitter dataset statistics ...... 73
6.2 Road network dataset statistics ...... 75
6.3 The most significant results of the evaluation ...... 87

B.1 Twitter dataset: effect of cooling factor on cross-node communication ...... 119
B.2 Twitter dataset: effect of cooling factor on number of iterations ...... 119
B.3 Twitter dataset: effect of cooling factor on capacity usage ...... 119
B.4 Twitter dataset: effect of imbalance factor on cross-node communication ...... 121
B.5 Twitter dataset: effect of imbalance factor on number of iterations ...... 121
B.6 Twitter dataset: effect of imbalance factor on capacity usage ...... 122
B.7 Twitter dataset: effect of adoption factor on cross-node communication ...... 123
B.8 Twitter dataset: effect of adoption factor on number of iterations ...... 123
B.9 Twitter dataset: effect of adoption factor on capacity usage ...... 123
B.10 Twitter dataset: cross-node communication with different number of nodes in cluster ...... 125
B.11 Twitter dataset: number of iterations with different number of nodes in cluster ...... 125
B.12 Twitter dataset: capacity usage with different number of nodes in cluster ...... 125
B.13 Road network dataset: effect of cooling factor on cross-node communication ...... 127
B.14 Road network dataset: effect of cooling factor on number of iterations ...... 127
B.15 Road network dataset: effect of cooling factor on capacity usage ...... 127
B.16 Road network dataset: effect of imbalance factor on cross-node communication ...... 129
B.17 Road network dataset: effect of imbalance factor on number of iterations ...... 129
B.18 Road network dataset: effect of imbalance factor on capacity usage ...... 130
B.19 Road network dataset: effect of adoption factor on cross-node communication ...... 131
B.20 Road network dataset: effect of adoption factor on number of iterations ...... 131
B.21 Road network dataset: effect of adoption factor on capacity usage ...... 131
B.22 Road network dataset: cross-node communication with different number of nodes in cluster ...... 133
B.23 Road network dataset: number of iterations with different number of nodes in cluster ...... 133
B.24 Road network dataset: capacity usage with different number of nodes in cluster ...... 133

Introduction

In the last decades, modern technologies have enabled the production of huge amounts of data of various types. With these changing demands on data storage, database development has advanced as well, from punched cards and magnetic tapes to today's relational and NoSQL databases. Each database type focuses on a different nature of data. One such category is graph-like data, containing entities (vertices) associated with each other via relationships (edges). In practice, these data can represent, for example, road maps, airline networks, social networks, web graphs or biological networks. Some of the mentioned areas have experienced a boom in recent years and a subsequently increased interest in efficient storage and usage of their data. This is why graph databases have gained substantial popularity recently [1].

The increasing amount of data also places demands on the size of physical storage. Single-server solutions are often no longer sufficient, with the consequence of employing distributed environments. Several open-source and commercial products offering a distributed graph database already exist. However, this rather new concept usually does not deal with the performance of the final distribution, certainly not in the case of open-source solutions. The data are stored on physical nodes randomly by default. The later traffic on the database may, nevertheless, exhibit trends for which another distribution of the data could be more efficient.

The purpose of this work is first to research current practices of data storage in distributed graph databases. On these grounds, an existing database system is used for further implementation. We analyze how to extract the information necessary for further query analysis and we implement that extraction. Subsequently, we create a suitable method to store these extracted data.
Last but not least, we propose a method for a more efficient redistribution of the data and evaluate it.

To achieve this goal, the thesis is structured as follows. Chapter 1 first presents the general theory of graph databases. It is followed by the introduction of the graph partitioning problem and its existing algorithms in chapter 2. The current graph database systems are examined together with their models of data storage in chapter 3. Chapter 4 is devoted to the analysis and design of the data extraction, their storage and the redistribution algorithm. The implementation of these concepts is described in chapter 5. The proposed redistribution is evaluated on real datasets and later compared to another algorithm in chapter 6. Finally, chapter 7 offers several propositions for future work.

Chapter 1

Theory of Graph Databases

The objective of the first chapter is to give a general theoretical overview of graph databases. It begins with a brief historical development of database management systems (DBMS) focusing on relational and NoSQL databases. Then it continues with an introduction to graph databases with emphasis on data models, physical storage possibilities and query languages. Additional technical details about the graph databases are later provided in chapter 3 dedicated to particular products.

1.1 Databases

“A database is a self-describing collection of integrated records” [2]. Records in this definition are any data collected to be stored. The adjective integrated means that not only the records themselves but also their relationships are maintained. Last but not least, a collection of such data with relationships is self-describing because it contains a description of its structure [2]. This definition refers to today's databases. Nevertheless, the history of databases – storing, retrieving, modifying and deleting data – dates back a long time. Błażewicz et al. [3] mention six main phases of database evolution, to which we add the last category of NoSQL databases1:

1. manual processing

2. mechanical and electromechanical equipment – data stored in bi- nary format on punched cards

3. magnetic tapes and stored-program computers

4. online data processing – introduced in section 1.1.1.

1The referenced source of the categories [3] dates back to 2003, after which the term NoSQL was reintroduced.


5. relational databases – introduced in section 1.1.2.

6. object-oriented databases – introduced in section 1.1.3.

7. NoSQL databases – introduced in section 1.1.4.

1.1.1 Online Data Processing

With the origin of the hierarchical database management system (or IMS – Information Management System), data started to be stored independently of the programs that use them. Another type of database also has its origin in this era, the 1960s, collectively called CODASYL2. This first generation of databases, IMS and CODASYL, is also called navigational databases. A navigational database can be imagined as a graph where records have primary keys and can carry a secondary key referencing another record. The data were stored as a linked data list. A programmer accessed the data via the primary key sequentially or by navigating through the graph [4]. However, even simple queries required complex programs to be written [3].

1.1.2 Relational Databases

Edgar F. Codd released a paper A Relational Model of Data for Large Shared Data Banks [5], where he introduced what is referred to as the second generation of DBMSs, relational databases (RDBMSs). Although the paper dates back to 1970, even after 49 years relational databases are still predominant and by far the most popular DBMSs within the community [6]. Data in relational DBMSs are arranged in tables, where rows represent individual records and each column determines a type and semantic structure of the data. The number of columns in each table is fixed for every row. Each record has a primary key, and relations among records are expressed with foreign keys. The common practice of relational DBMS design is to follow the rules of so-called normal forms, avoiding data redundancy and actualization anomalies3 and enabling more effective manipulation with the data.

In order to work with the data in relational DBMSs, the Structured Query Language (SQL) has been created. It is a language based on relational algebra enabling manipulation and definition of the data as well as managing access rights and transactions. A couple of other new concepts also emerged with relational DBMSs, such as concurrent transaction management, transactional recovery techniques, optimization techniques, and many more.

2CODASYL is an acronym for Conference on Data System Languages, where a number of rules for Database Management Systems (DBMSs) were arranged.
3An actualization anomaly is a side effect of a record update resulting in a loss of or an inconsistency in data.


The transactions in RDBMSs must fulfill ACID properties:

• Atomicity – a transaction succeeds as a whole or does not succeed at all

• Consistency – a transaction transforms the database from one valid state to another

• Isolation – concurrent execution of transactions reaches the same state as sequential launch of these transactions

• Durability – changes resulting from successfully committed transactions are persistently stored in the database
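As a minimal, self-contained illustration of atomicity, the following Python sketch uses the embedded SQLite engine, which provides ACID transactions; the two-account schema and the simulated failure are invented for this example:

```python
import sqlite3

# Toy schema: a two-account "bank" in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts: either both updates happen or neither."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            # Simulate a mid-transaction failure:
            if amount > 100:
                raise ValueError("insufficient funds")
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False  # the first UPDATE was rolled back as well

transfer(conn, "alice", "bob", 30)   # succeeds: both updates committed
transfer(conn, "alice", "bob", 999)  # fails mid-way: balances stay unchanged
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

After the failed transfer, the database is left in the same valid state as before it started, which is exactly the atomicity and consistency guarantee described above.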

1.1.3 Object-Oriented Databases

As an alternative to relational DBMSs, object-oriented databases (OODBMSs) were developed. They brought an option to avoid the modeling and data type constraints of relational DBMSs. Using the object-oriented programming paradigm, OODBMSs take advantage of inheritance, encapsulation or arbitrary data types. Object-oriented concepts can also be mixed with the relational model, resulting in so-called object-relational DBMSs (ORDBMSs). The usage of both OODBMSs and ORDBMSs was, however, partly reduced by object-relational mapping (ORM), which offers to program with objects while taking advantage of relational databases.

1.1.4 NoSQL Databases

NoSQL databases originally denoted databases that do not use SQL. However, the nowadays meaning is rather expressed as “not only SQL” databases – that is, DBMSs that can have support for SQL-like languages. There are several reasons to use NoSQL databases. Among others, Strauch [7] mentions: avoidance of unneeded complexity, high throughput, horizontal scalability, compromising reliability for better performance, and yesterday's vs. today's needs (such as commodity hardware with a high probability of failure, the character of the data, etc.). It is also important to point out the CAP theorem, stating that the following three properties cannot be achieved all together [7]:

• Consistency – the system is consistent after the execution of an operation.

• Availability – the system allows an operation to continue even if some hardware/software parts crash or are down.

• Partition Tolerance – the system can continue an operation even if the network is partitioned into parts that cannot connect to each other.


If the system should be consistent and available, the formerly mentioned ACID properties are required, whereas if we forfeit consistency, hence favoring availability and partition tolerance, the BASE properties are demanded. NoSQL databases usually adhere to the BASE properties, which stand for the following:

• Basic Availability – the system works basically all the time.

• Soft-State – the system does not have to be consistent all the time.

• Eventual Consistency – the system will converge to a consistent state at some later point.

NoSQL databases are most often divided into the following classes:

• Key-Value Stores – the data model of key-value stores is an associative array (known in programming as a map or a dictionary). Data are accessed with a key, which serves as a primary key. The value associated with the key is a BLOB4 containing any data. Examples of key-value stores are Riak, Redis or Oracle Berkeley DB.

• Document Stores – document stores resemble key-value stores in that a key references an associated value – in this case a document. However, the document has a defined format which can be further examined. Examples of document stores are MongoDB or Apache CouchDB.

• Wide Column Stores – also referred to as Column-Oriented Stores or Extensible Record Stores [7]. The data model includes column families, rows and columns. A column family can be imagined as a table of similar rows – each row representing one record but, unlike in relational databases, with a flexible number of columns. Examples of wide column stores are Google's Bigtable, Apache Cassandra or Apache HBase.

• Graph Databases – are dedicated to data that can be characterized as graphs – records are represented as entities (nodes) and relationships (edges) between them. The following sections contain more details as well as examples of products.
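To make the first three data models concrete, the following Python sketch contrasts them with plain dictionaries standing in for real stores; the keys and values are invented toy data:

```python
# Key-value store: an opaque value (BLOB) behind a key.
kv_store = {
    "user:42": b"\x89...arbitrary bytes...",  # the store cannot inspect this value
}

# Document store: the value is a structured document the store CAN examine.
doc_store = {
    "user:42": {"name": "Alice", "age": 30, "interests": ["graphs", "NoSQL"]},
}

# Wide column store: column families -> rows -> a flexible set of columns.
wide_column_store = {
    "users": {                                        # column family
        "row-1": {"name": "Alice", "age": 30},
        "row-2": {"name": "Bob", "city": "Prague"},   # different columns per row
    },
}

# Because documents have a defined format, a document store can answer
# structural queries that a key-value store cannot:
adults = [doc["name"] for doc in doc_store.values() if doc["age"] >= 18]
```

The flexible per-row columns in the wide column example mirror the "table of similar rows" description above: `row-2` simply lacks the `age` column instead of storing a NULL.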

The presented phases of database development do not form an exhaustive list of existing database types. Rather, they offer a general overview of data storage possibilities. Several other types of DBMS have not been mentioned, such as XML5 stores or multi-model databases supporting multiple data models.

4BLOB stands for Binary Large Object.

Figure 1.1: DBMS popularity changes since 2013 [1]. Graph DBMSs are represented with the upper green line with circles. Separately from graph DBMSs, RDF stores are depicted with a lighter green line with circles, experiencing lower popularity changes than solely the graph DBMSs.

1.2 Graph Databases

As already mentioned, graph databases are designed for data forming graphs with nodes (entities) and edges (relationships). There are several types of data with the characteristics of a graph – for example, road maps, internet networks or citation dependencies. Graph data have recently experienced a boom with social networks. As a result, over the past five years, graph DBMSs have grown in popularity the most among all DBMSs, as shown in Figure 1.1.

1.2.1 Data Model

The data model of graph databases is a graph. Nevertheless, the type of graph can vary. Therefore, to be more specific, graph databases usually use one of two main graph data models – a property graph and an RDF graph. RDF stores are sometimes categorized as individual databases apart from graph DBMSs, as is the case in Figure 1.1; nevertheless, their character is graphical, so it is appropriate to introduce them briefly as well.

5XML stands for eXtensible Markup Language, defining rules for documents whose format is readable for both humans and computers.


• Property Graph – is a directed labeled multi-graph. It was developed with the objective of efficient storage and fast querying with fast traversals [9]. It consists of

– nodes – represent individual entities. Each node has a unique identifier.
– edges – represent individual relationships between entities. Because the property graph is a multi-graph, there can be several edges between two nodes. However, an edge always connects exactly two nodes6 (with the possibility of node equality, that is, an edge creating a loop). Each edge has a unique identifier.
– labels – provide a node tagging – each node can have several labels assigned, denoting its classification (such as a person, a teacher, a student, etc.).
– types – provide an edge tagging. However, in contrast to labels, each edge has exactly one type (such as “is managing a”, “is the father of”, “owns”, etc.).
– properties – each node as well as each edge can have a set of properties in the form of key-value pairs (such as “age: 30”, “starting-date: 2018-01-01”, etc.).

• RDF Graph – the acronym RDF stands for Resource Description Framework, a W3C standard for data exchange on the web [8]. The RDF graph is a graph of triples: subject – predicate – object, such as “Alice knows Bob” or “Anakin is the father of Luke”. In contrast to the property graph, elements of an RDF graph (nodes and edges) do not have any internal structure, meaning they do not contain any properties.

– subject – a resource vertex identified with a URI7 (as the RDF graph is dedicated to data exchange on the web)
– predicate – an edge identified with a URI
– object – a literal value or a vertex identified with a URI

Figures 1.2 and 1.3 show a representation of the same data example as a property graph and an RDF graph, respectively. The main differences between property and RDF graphs are [9]:

• RDF graph cannot have multiple identical relationships between two nodes whereas the property graph can.

6An edge connecting more than two nodes is not allowed in a multi-graph. A hypergraph, on the other hand, allows such an eventuality.
7URI stands for Uniform Resource Identifier.



Figure 1.2: Example of a property graph


Figure 1.3: Example of an RDF graph

• RDF graph cannot assign attributes to a relationship without a workaround.

• RDF can assign an element to a subgraph, whereas the property graph cannot.

• Property graph has more efficient storage than the RDF graph as it was designed with this objective.

Overall, apart from the storage of characteristic data with a specific intent (exchange of triples on the web), the more utilized model is the property graph, as it offers more possibilities in total and is more suitable for general purposes. Graph DBMSs have also been experiencing more attention than RDF stores, as shown in Figure 1.1.
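The property-graph structure described above can be sketched with plain Python data classes; this is a hypothetical minimal model for illustration, not the storage format of any particular product:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    id: int
    labels: set = field(default_factory=set)        # a node may have several labels
    properties: dict = field(default_factory=dict)  # key-value pairs

@dataclass
class Edge:
    id: int
    type: str     # exactly one type per edge
    source: int   # an edge always connects exactly two nodes
    target: int
    properties: dict = field(default_factory=dict)

# The example from Figure 1.2:
anakin = Node(1, {"Sith"}, {"name": "Anakin", "surname": "Skywalker"})
luke = Node(2, {"Jedi"}, {"name": "Luke", "surname": "Skywalker"})
fatherhood = Edge(1, "is the father of", anakin.id, luke.id)

# Being a multi-graph, nothing prevents a second, parallel edge
# between the same pair of nodes (an invented extra relationship):
knows = Edge(2, "knows", anakin.id, luke.id)
edges_between = [e for e in (fatherhood, knows)
                 if (e.source, e.target) == (anakin.id, luke.id)]
```

Note how the model directly encodes the differences listed above: edges carry their own properties (impossible in plain RDF without a workaround), and two identical endpoints can host multiple parallel edges.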

1.2.2 Physical Storage

Methods of physical storage differ among particular graph databases. They are usually implemented as an extra layer over another NoSQL database type – a key-value or a document store. A more specific description of such backends is given in chapter 3, introducing individual products and their data storage in detail.


Each graph database can employ a different graph representation for its storage. Given a directed graph8 G = (V, E), the most common graph representations are (according to Kolář [10], adjusted):

• adjacency matrix – a matrix A = [a_{ij}] of size |V| × |V| where

a_{ij} = the number of edges between vertices v_i and v_j

• incidence matrix – a matrix M = [m_{ik}] of size |V| × |E| where

m_{ik} = 1, if edge e_k starts from vertex v_i
m_{ik} = −1, if edge e_k ends in vertex v_i
m_{ik} = 0, otherwise

• adjacency list – an array of linked lists; the array has size 1 × |V|. A field i ∈ {0, ..., |V| − 1} of the array represents a vertex v_i and contains a linked list of all its successors. The array of linked lists can be extended with another type of information, such as an edge identification or an edge/vertex weight.
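The three representations can be built directly from an edge list; the following Python sketch uses a small invented directed multi-graph with vertices indexed 0..|V|−1:

```python
# A toy directed graph: edges as (source, target) pairs.
V = 4
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (0, 1)]  # (0, 1) twice: a multi-graph

# Adjacency matrix: a[i][j] = number of edges from v_i to v_j.
adj_matrix = [[0] * V for _ in range(V)]
for u, v in edges:
    adj_matrix[u][v] += 1

# Incidence matrix: m[i][k] = 1 if edge e_k starts in v_i, -1 if it ends there.
inc_matrix = [[0] * len(edges) for _ in range(V)]
for k, (u, v) in enumerate(edges):
    inc_matrix[u][k] = 1
    inc_matrix[v][k] = -1

# Adjacency list: field i holds the successors of v_i
# (Python lists standing in for linked lists).
adj_list = [[] for _ in range(V)]
for u, v in edges:
    adj_list[u].append(v)
```

Note the trade-off the representations imply: the adjacency matrix costs O(|V|²) space regardless of the number of edges, while the adjacency list grows only with |V| + |E|, which is why sparse graphs – the typical case in graph databases – favor list-like layouts.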

1.2.3 Query Language

Unlike relational databases with the standardized query language SQL, graph databases do not possess a query language that is generally accepted by all. The reason may be that this relatively young concept came with much enthusiasm to create one's own graph database, which, without a language standard, led to the development of many proprietary languages. Among the various languages dedicated to graph databases, the following are employed by more than one product.

• SPARQL – is a query language for RDF. It offers many opportunities for data manipulation similar to SQL (conjunctions, subqueries, aggregations, and many others) [11].

• Cypher – inspired by SQL, Cypher is a declarative language for describing patterns, querying, inserting, modifying and deleting data [12]. It was developed for Neo4j, the most popular, yet single-server, graph database. Thanks to its similarity to SQL, Cypher has gained a good reputation and has been released to the public.

• Gremlin – is a traversal language of the graph computing framework TinkerPop [13]. Gremlin is a functional language and offers both OLTP9

8Graph definition can be found at the beginning of chapter 2.
9OLTP stands for Online Transactional Processing. It is processing focused on real-time data, continuous load and many inputs.


and OLAP10 traversals. As TinkerPop is a framework widely supported by various graph databases, Gremlin is also a widely applied language. More details about both the framework and the language can be found in section 3.2.

10OLAP stands for Online Analytical Processing. It is processing focused on querying, unbalanced load and analytical (aggregated) data.
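To give a flavor of Gremlin's functional, chainable traversal style, the following Python toy imitates a short out()-style traversal over an in-memory property graph. This is a hypothetical imitation written for illustration only – it is not the gremlin-python client, and the graph data are invented:

```python
# In-memory toy graph: vertex id -> properties, plus directed labeled edges.
vertices = {
    1: {"name": "Anakin"},
    2: {"name": "Luke"},
    3: {"name": "Leia"},
}
edges = [(1, "father_of", 2), (1, "father_of", 3)]

class Traversal:
    """A tiny imitation of Gremlin's chained traversal steps."""
    def __init__(self, ids):
        self.ids = list(ids)

    def out(self, label):
        # Follow outgoing edges carrying the given label.
        self.ids = [t for s in self.ids
                    for (u, lbl, t) in edges if u == s and lbl == label]
        return self

    def values(self, key):
        # Extract a property value from each reached vertex.
        return [vertices[i][key] for i in self.ids]

def g_V(*ids):  # loosely analogous to g.V(...)
    return Traversal(ids if ids else vertices.keys())

# Loosely analogous to the Gremlin traversal g.V(1).out('father_of').values('name'):
children = g_V(1).out("father_of").values("name")
```

Each step consumes the set of current traversers and emits a new one, which is the essence of Gremlin's functional composition; real Gremlin adds many more steps (filters, aggregations, OLAP execution) on the same principle.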


Chapter 2

Graph Partitioning

The graph partitioning problem has been a well-known and researched task for decades, which gives the advantage of a significant number of already existing methods. These are introduced in the following section, Algorithms 2.1. Nevertheless, let us first define the problem of graph partitioning itself, preceded by selected definitions from graph theory necessary for understanding the concepts introduced later.

Graph A graph G is a pair of a set of vertices V and a set of edges E, that is, G = (V, E).

Weight A graph can be extended with weights, assigning each vertex and edge a numerical value called a weight. A weight function w(el) then returns the weight of an element el ∈ V ∪ E.

Partition A partition of a graph G is then defined according to Bichot and Siarry [14] (modified to fit the notation in this work) as follows:

Let G = (V, E) be a graph and P = {P^1, ..., P^k} a set of k subsets of V. P is said to be a partition of G if:

– no element of P is empty: ∀i ∈ {1, ..., k}, P^i ≠ ∅;
– the elements of P are pairwise disjoint: ∀(i, j) ∈ {1, ..., k}², i ≠ j, P^i ∩ P^j = ∅;
– the union of all the elements of P is equal to V: P^1 ∪ ... ∪ P^k = V.


Edge Cut Although a part P^i and a part P^j are disjoint, meaning they do not share any vertices, they can be connected by edges. The sum of the weights of the edges connecting vertices of part P^i and part P^j is called an edge cut. This means that an edge-cut function cut(P^i, P^j) is defined as

cut(P^i, P^j) = Σ_{u ∈ P^i} Σ_{v ∈ P^j} Σ_{e ∈ E_uv} w(e),

where E_uv is the set of edges connecting the vertices u and v. In the case of an unweighted graph, each edge can be assigned a default weight, for example, 1.

Graph Partitioning The problem of graph partitioning is to divide the graph G into k parts P^1, ..., P^k of specific sizes11 while minimizing the edge cuts between the different parts of the partition,

min ( (1/2) Σ_{i ∈ {1,...,k}} Σ_{j ∈ {1,...,k}, j ≠ i} cut(P^i, P^j) ),

where the division by two accounts for each edge cut being counted twice. This balancing and minimization task is an NP-complete12 problem even for the case of k = 2, called the Minimum Bisection problem [15].
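The two definitions above translate directly into code. The following sketch (hypothetical helper functions, with edges represented as (u, v, weight) triples and a partition as a vertex-to-part map) computes cut(P^i, P^j) and the total objective:

```python
# Sketch of the edge-cut and partitioning objective defined above.
# A graph is a list of weighted edges (u, v, w); a partition maps
# each vertex to a part index. All names here are illustrative.

def cut(edges, part, i, j):
    """Sum of weights of edges joining part i and part j (assumes i != j)."""
    return sum(w for u, v, w in edges
               if {part[u], part[v]} == {i, j})

def total_edge_cut(edges, part):
    """Total weight of edges whose endpoints lie in different parts.

    Equals (1/2) * sum over i != j of cut(i, j): summing over ordered
    pairs would count every cut edge twice, hence the halving in the
    formula above."""
    return sum(w for u, v, w in edges if part[u] != part[v])
```

For a path 0–1–2–3 split into parts {0, 1} and {2, 3}, only the edge (1, 2) crosses, so both functions report a cut of 1.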

2.1 Algorithms

As already mentioned, a significant number of algorithms for the graph partitioning task already exist. Some of these methods are based on the same principles, according to which they can be categorized, e.g., by [16, 17, 18]. Note that the categories are not always necessarily disjoint. This work mainly adopts the classification of Buluç et al. [16] and partly that of Fjällström [18]. An extra category of Pregel-compliant algorithms is added as well. We provide a general overview of graph partitioning algorithms. Various groups of methods are introduced, though not always explained in detail. The more relevant algorithms are then discussed more thoroughly.

11Often the sizes of partitions are equal, but in the case of different hosts of a graph database cluster, the size corresponds to a host capacity, which can differ.
12An NP-complete problem is a decision problem in computational complexity theory which belongs to both the NP and the NP-hard classes. The NP (nondeterministic polynomial time) class contains decision problems whose positive answers have proofs verifiable in polynomial time. The NP-hard class contains problems to which all problems in NP can be reduced in polynomial time.


2.1.1 Global Algorithms

The largest number of existing algorithms compute their solution while accessing the entire graph. These methods can therefore work on smaller graphs, but they are not intended for a distributed environment. However, they present an essential part of the evolution of graph partitioning algorithms. Global algorithms can be either exact or heuristic, as presented below.

2.1.1.1 Exact Methods

Exact methods can solve only small problems, have long running times, and are highly dependent on graph density [19] and bisection width13. The majority of exact algorithms are based on the branch-and-bound method, systematically visiting all potential solutions while cutting out whole subsets of unsuitable solution candidates [21].

2.1.1.2 Heuristic Methods

Heuristic methods provide a “good enough” solution in a reasonable time frame. The problem of graph partitioning is NP-complete, as mentioned at the beginning of this chapter; therefore, heuristic methods are the only viable option for any larger problem. According to Buluç et al. [16], heuristic algorithms can be classified as follows:

• Spectral Partitioning Algorithms of this group work with the spectrum of a graph, which is the list of eigenvalues of the graph adjacency matrix [17]. A widely applied concept is the algebraic connectivity (also called the Fiedler value, after the Czech mathematician Miroslav Fiedler [22]) together with its associated eigenvector, the Fiedler vector. Both are derived from the Laplacian matrix L of a graph G = (V, E), n = |V|, which is defined as L = D − A, where D is the degree matrix of the graph G, D = diag(d1, ..., dn), and A is the adjacency matrix of the graph G [23]. The Fiedler vector is the eigenvector corresponding to the second-smallest eigenvalue of the Laplacian matrix. Partitions are defined according to similarity in the values of this vector.

• Graph Growing Graph Growing is based on the simple steps of selecting a random vertex and using breadth-first search to create a set of surrounding vertices which, in the case of a bisection, contains half of the vertices or half of the

13“The bisection width is the size of the smallest edge cut of a graph which divides it into two equal parts” [20].


total weight of the vertices at the end of the computation. Nevertheless, rather than a standalone method, graph growing can be used as part of data initialization, and its output can serve as an input to, e.g., the Kernighan-Lin algorithm [24] described below.

• Flows Flow algorithms implement the max-flow min-cut theorem, also called the Ford-Fulkerson method, iteratively increasing the value of the flow in a network [25]. However, the method does not guarantee a balanced result at all. For this reason, similarly to graph growing, flows are usually used as subroutines of other algorithms to improve the starting input.

• Geometric Partitioning If the vertices contain information about their position in space (e.g. GPS coordinates), geometric methods can be applied to partition such a graph. A simple and often used algorithm is the Recursive Coordinate Bisection, comprising four easy steps: determining the longest expansion of the domain, sorting the vertices according to the coordinates along this dimension, assigning half of the vertices to a subdomain, and repeating these steps recursively [27].
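Two of the heuristic families above can be illustrated with short, self-contained sketches (hypothetical helper functions, not taken from any cited implementation): spectral bisection splits vertices by the sign of the Fiedler vector, and graph growing collects half of the vertices by breadth-first search from a seed.

```python
from collections import deque
import numpy as np

def spectral_bisect(n, edges):
    """Split vertices by the sign of the Fiedler vector, i.e. the
    eigenvector of the Laplacian L = D - A belonging to the
    second-smallest eigenvalue."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    L = np.diag(A.sum(axis=1)) - A           # Laplacian L = D - A
    vals, vecs = np.linalg.eigh(L)           # eigh handles symmetric matrices
    fiedler = vecs[:, np.argsort(vals)[1]]   # second-smallest eigenvalue
    return {v for v in range(n) if fiedler[v] < 0}

def grow_bisection(adj, seed):
    """Breadth-first 'graph growing': expand from a seed vertex until
    half of the vertices have been collected."""
    half = len(adj) // 2
    grown, queue = {seed}, deque([seed])
    while queue and len(grown) < half:
        v = queue.popleft()
        for u in adj[v]:
            if u not in grown and len(grown) < half:
                grown.add(u)
                queue.append(u)
    return grown
```

For two triangles joined by a single bridge edge, spectral_bisect separates the triangles; for a path graph, grow_bisection from an endpoint collects the first half of the path.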

2.1.2 Local Improvement Methods

Local improvement methods require an initial partitioning, which they take as input and whose edge cut they try to reduce. Either a method can be selected for the initial partitioning, or a random initial partitioning can be generated.

• Node-Swapping Local Search One of the first methods for partitioning a graph-like system was the Kernighan-Lin algorithm, which iteratively tries to decrease the edge cut using a greedy approach until no more improving swaps are available [16]. Local improvement methods are usually variations of this algorithm [18]. One of the many improvements of this method is, for example, the popular Fiduccia-Mattheyses algorithm, reducing the time complexity of Kernighan-Lin's method from O(|V|³) to O(|E|) [18]. Although the Kernighan-Lin algorithm and its alterations are not considered suitable methods for distributed graphs, as they require cheap random access to all the vertices [28], the TAPER system [29] takes advantage of Greedy Refinement – another modification of Kernighan-Lin – over a derived matrix representing the distributed graph in a simplified form. This promising method, however, works only on graphs meeting specific requirements.


Figure 2.1: The multilevel approach to graph partitioning [16]

• Tabu Search Similarly to Kernighan-Lin, Tabu Search is a method swapping vertices from their current partition to another. After a swap, however, the vertex is put on a tabu list containing vertices forbidden from migration for a certain number of following iterations.

• Random Walk and Diffusion A random walk on a graph is a walk14 in which the next step is selected randomly among the neighbors of the current vertex. Diffusion is a similar process during which an entity is split within a graph; it naturally occurs more frequently in dense areas, just as a random walk stays within dense clusters for a longer time [16].
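The gain computation at the heart of the Kernighan-Lin pass mentioned above can be sketched as follows. `d_value` computes D_v, the external minus internal cost of a vertex, and a pair (a, b) is worth swapping when D_a + D_b − 2·w(a, b) is positive (hypothetical helper names; the full algorithm adds tentative-swap bookkeeping and an undo step):

```python
# Kernighan-Lin swap gain, illustrative sketch. A weight dict maps
# frozenset({u, v}) to the edge weight; A and B are the two parts.

def d_value(v, side, other, weight):
    """D_v = external cost minus internal cost of vertex v."""
    ext = sum(weight.get(frozenset((v, u)), 0) for u in other)
    internal = sum(weight.get(frozenset((v, u)), 0) for u in side if u != v)
    return ext - internal

def swap_gain(a, b, A, B, weight):
    """Edge-cut reduction achieved by swapping a (in A) with b (in B)."""
    return (d_value(a, A, B, weight) + d_value(b, B, A, weight)
            - 2 * weight.get(frozenset((a, b)), 0))
```

On a 4-cycle 0–1–2–3–0 with the worst bisection A = {0, 2}, B = {1, 3} (cut 4), swapping 0 and 1 has gain 2, yielding the optimal cut of 2.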

2.1.3 Multilevel Partitioning

Multilevel partitioning consists of three main phases [24], also depicted in Figure 2.1:

1. Coarsening – The objective of the coarsening phase is to reduce the partitioning problem to a size small enough to perform the otherwise expensive partitioning phase. The nodes are contracted in such a way that the edge cut reflects the cut of the original graph. This same cut value is achieved by

• merging vertices in such a way that for a coarse vertex u merging a subset of vertices U ⊂ V, its weight is accumulated:

w(u) = Σ_{v ∈ U} w(v)

14“A walk in a graph G is a finite non-null sequence W = v0 e1 v1 e2 v2 ... ek vk, whose terms are alternately vertices and edges, such that, for 1 ≤ i ≤ k, the ends of ei are vi−1 and vi” [30].


• merging any parallel edges into a single edge whose weight is the sum of the weights of the parallel edges.

Thus, it is still possible to provide a balanced partitioning.

2. Partitioning – When the coarsening phase has derived a sufficiently small graph, a partitioning method is performed. Any algorithm of the groups mentioned earlier can be used, provided it can incorporate the notion of weights.

3. Uncoarsening – The uncoarsening phase includes adopting the solution of the previous step and subsequently enhancing that solution on the finer level. This phase can consist of several iterations of uncoarsening, always working with a finer level until the original graph is reached.

There are several approaches to performing each phase. The well-known software package METIS [31] provides several implementations of multilevel partitioning; for each phase, a number of different algorithms can be chosen.
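A single coarsening step following the two merging rules above might look like this (an illustrative sketch; systems such as METIS use matching heuristics to decide which vertices to merge, which is out of scope here):

```python
from collections import defaultdict

# One coarsening step of multilevel partitioning: merged vertices
# accumulate their weights, and parallel edges collapse into one
# edge with summed weight. Names are illustrative.

def coarsen(v_weight, edges, merge_map):
    """v_weight: vertex -> weight; edges: (u, v, w) triples;
    merge_map: fine vertex -> coarse vertex."""
    cw = defaultdict(int)
    for v, w in v_weight.items():
        cw[merge_map[v]] += w                    # w(u) = sum of w(v), v in U
    ce = defaultdict(int)
    for u, v, w in edges:
        cu, cv = merge_map[u], merge_map[v]
        if cu != cv:                             # intra-cluster edges disappear
            ce[frozenset((cu, cv))] += w         # parallel edges are summed
    return dict(cw), dict(ce)
```

Merging a 4-cycle into two coarse vertices preserves the cut: the two crossing edges of weight 1 become one coarse edge of weight 2.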

2.1.4 Metaheuristics

The popularity of metaheuristics has been projected into graph partitioning as well. According to Sörensen and Glover [32]:

A metaheuristic is a high-level problem-independent algorithmic framework that provides a set of guidelines or strategies to develop heuristic optimization algorithms. The term is also used to refer to a problem-specific implementation of a heuristic optimization algorithm according to the guidelines expressed in a metaheuristic framework.

Some examples of metaheuristics are ant colonies, evolutionary algorithms, the already mentioned tabu search, and many more [32]. The methods are often metaphors of natural or man-made principles. Regarding graph partitioning, these algorithms can be used independently or as part of other methods such as multilevel algorithms [16]. To our knowledge, no metaheuristic solving the partitioning of a distributed graph has yet been introduced. DMACA – a Distributed Multilevel Ant-Colony Algorithm – might sound like a promising solution; however, the distributed part of its name refers to slave calculations on different machines, while each slave still needs to hold an image of the whole graph [33]. As metaheuristic methods are based on various principles, an introduction of each of these algorithms is beyond the scope of this work. However, the method Ja-be-Ja introduced in 2.2.2 incorporates simulated annealing in its steps; therefore, a brief description of this method is provided.


Simulated Annealing is a method based on a technique from metallurgy involving heating and gradual cooling of a material. It uses a temperature parameter T, initially set to Tinit. When searching a state space, a new state is accepted if

• it is better than the current state,

• or if for a random number x ∈ [0, 1] the inequality x < e^(−δ/T) holds, where δ = newState.cost() − currentState.cost() [34].

For each subsequent iteration of state selection, the temperature T is gradually decreased with a cooling function until it reaches zero. Such a selection means possible acceptance of worsening solutions at the beginning and a lowering of the probability of such degradation over time. In other words, it tries to avoid getting trapped in a local optimum.
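The acceptance rule just described condenses into a few lines (a sketch using the notation of the text; the random source is injectable only to make the rule testable):

```python
import math
import random

# Simulated-annealing acceptance: a better state (delta <= 0) is
# always accepted; a worse one with probability e^(-delta / T),
# where delta = newState.cost() - currentState.cost().

def accept(delta, T, rng=random.random):
    if delta <= 0:                     # new state is better (or equal)
        return True
    return rng() < math.exp(-delta / T)
```

At high T, exp(−δ/T) is close to 1 and most worsening moves pass; as T cools toward zero, the rule degenerates into pure greedy descent.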

2.2 Pregel-Compliant Algorithms

This section first introduces the Pregel platform based on its original paper [35] and then presents two selected algorithms compliant with this system. Pregel is a computational model for processing large-scale graphs. It was presented by researchers from Google to meet their computational needs on enormous datasets distributed over a large number of clusters. The main idea of the platform is a sequence of so-called supersteps. During each superstep S:

• each vertex can read messages sent to it in preceding superstep S − 1

• the same user-defined function is called on each vertex

• each vertex can send a message to another vertex (a neighboring vertex, or any other vertex with a specified ID) that will be received in the following superstep S + 1

• each vertex can change its own value and the values of its outgoing edges

• each vertex can decide about mutating the topology of the graph

This “think like a vertex” approach enables running each vertex task of a superstep in parallel; within a superstep, no synchronizing communication is needed. The last option stated above, a possible mutation of the topology, is rigorously processed with well-defined rules at the end of a superstep. The algorithm terminates when each vertex votes to halt and no further messages are to be transmitted, or when a user-defined maximum number of iterations is reached. At the beginning of the algorithm, each vertex has active status. When a vertex has no further work to process, it deactivates itself to an


inactive state by voting to halt. If a message for an inactive vertex arrives later during the computation, the vertex is set back to an active state and receives the message. The vertex can later set itself inactive again when it has no more work to do and no messages to process. An example of a Pregel computation is shown in Figure 2.2: the task is to propagate the largest vertex value within the graph. Each vertex first sends its value to its neighbors, and when a vertex receives a value larger than its current one, it adopts it and sends a message containing the new value to its neighbors.

Figure 2.2: Maximum Value Example in the Pregel Model. Dotted lines are messages; shaded vertices have voted to halt. [35]

The Pregel system further provides two additional mechanisms:

• Combiners – as sending messages to vertices creates some communication overhead, a combiner can be defined, collecting all the messages from a partition for a given vertex and combining them into one message. Such a combination can be useful when, for example, only the sum of the values sent in the messages is relevant.

• Aggregators – for sharing a piece of information among vertices, aggregators have been introduced. An aggregator receives the values sent by vertices in a superstep S, reduces this information to a single value (such as a minimum or a sum) and transmits this value to all vertices, which have it available in the next superstep S + 1.

Thanks to elaborate checkpointing, the Pregel framework is fault tolerant. Moreover, the nature of the algorithm indicates that the system scales well, which is supported by experiments revealing linear scalability.
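The superstep loop and the maximum-value example of Figure 2.2 can be imitated in a few lines of single-process Python (illustrative only, not Google's implementation; real Pregel runs each vertex function in parallel across workers):

```python
# Single-process imitation of Pregel supersteps for the maximum-value
# example: each vertex sends its value out, adopts any larger value it
# receives, and stays silent (halted) otherwise.

def pregel_max(adj, values):
    # Superstep 0: every vertex is active and sends its value out.
    inbox = {v: [] for v in adj}
    for v in adj:
        for u in adj[v]:
            inbox[u].append(values[v])
    # Run supersteps until no messages flow (all vertices halted).
    while any(inbox.values()):
        outbox = {v: [] for v in adj}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)              # adopt the larger value
                for u in adj[v]:
                    outbox[u].append(values[v])    # wake neighbors next superstep
        inbox = outbox
    return values
```

On a chain 0–1–2–3 with values 3, 6, 2, 1, the value 6 spreads hop by hop until every vertex holds it and the message flow dies out.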


A class of algorithms naturally suitable for Pregel is the diffusion systems described in 2.1.2. Perhaps the most profound method based on diffusion systems and dedicated to distributed graphs is DiDiC – Distributed Diffusive Clustering [36]. Originally developed for clustering a P2P supercomputer, DiDiC requires only communication between neighbors and an arbitrary initial partitioning. However, it can sustain only an upper bound on the number of clusters, which means it does not guarantee a specific number of partitions, and the result is also not necessarily balanced [36, 28]. Two promising Pregel-compliant algorithms, Vaquero et al. [37] and Ja-be-Ja [28], are introduced below in more detail.

2.2.1 Vaquero et al.

The paper of Vaquero et al. [37] introduces a method for graph partitioning promising an adaptive solution on dynamic large-scale graphs. That means, even when the graph is large and changing in time, the method suggests an effective repartitioning and endeavors to maintain its quality as the graph evolves. The proposed method uses the label propagation algorithm, which is introduced first. After that, the actual algorithm is described, followed by a definition of its constraints.

2.2.1.1 Label Propagation

The label propagation algorithm serves to identify communities within a graph. In general, as explained by Raghavan et al. [38], each vertex is first assigned a unique label. Then the labels propagate through the network such that a vertex v determines its label in an iteration t according to the labels of its neighbors. A vertex chooses the prevailing label of its neighbors, adopting a label uniformly at random in the case of a tie. The propagation stops when no more label adoptions are possible. At the end of this process, vertices with the same label form a partition. To introduce label propagation more formally, let L_1(t), ..., L_p(t) be the labels active in the network at a moment t. The steps of the propagation then are:

i Initialize the labels of all the vertices. For a vertex v, Lv(0) = v.

ii Arrange the vertices in a random order.

iii For each v ∈ V chosen in that specific order such that v1, ..., vm < v < vm+1, ..., vk, let

L_v(t) = f(L_v1(t), ..., L_vm(t), L_vm+1(t − 1), ..., L_vk(t − 1)),   (2.1)


Figure 2.3: Label propagation without and with oscillation prevented. (a) Label propagation for all the nodes in the same iteration: with the order v1 < v2, the two vertices endlessly swap the labels A and B. (b) Label propagation with oscillation prevented: Lv1(1) = Lv2(0) and Lv2(1) = Lv1(1), so both vertices settle on the same label.

where k ∈ {1, ..., |Dv|} with Dv being the set of neighbors of the vertex v, and f returns the most frequent label among the neighbors, with ties broken uniformly at random. However, if the tie is between the current partition and another one, the current partition is preferred, to avoid extra migration overhead.

iv If all the vertices already have the same label as the most frequent label among their neighbors (meaning no more label adoptions are possible), stop the algorithm. Otherwise, set t = t + 1 and go to ii.

Step iii is present to cope with the problem of oscillation. If oscillation were not prevented and all the vertices chose their labels at iteration t according to their neighbors' labels from the same iteration t, two vertices could endlessly swap their labels, as shown in Figure 2.3a with labels A and B. Therefore, the function from step iii is applied, averting the oscillation as depicted in Figure 2.3b.
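The steps i–iv above can be sketched as a small routine (an illustrative implementation, not taken from any cited system; it processes vertices in a fresh random order each sweep, so later vertices already see the updated labels of earlier ones, as in equation 2.1):

```python
import random
from collections import Counter

# Label propagation: unique initial labels, random processing order,
# adopt the most frequent neighbor label, keep the current label on
# ties, stop when no vertex can change.

def label_propagation(adj, rng=random.Random(0)):
    labels = {v: v for v in adj}                 # step i: unique labels
    changed = True
    while changed:
        changed = False
        order = list(adj)
        rng.shuffle(order)                       # step ii: random order
        for v in order:                          # step iii: adopt prevailing label
            freq = Counter(labels[u] for u in adj[v])
            best = max(freq.values())
            candidates = [l for l, c in freq.items() if c == best]
            if labels[v] in candidates:          # prefer the current label on ties
                continue
            labels[v] = rng.choice(candidates)
            changed = True
    return labels                                # step iv: fixed point reached
```

On two disjoint triangles, each triangle inevitably converges to a single label, yielding two communities.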

2.2.1.2 Steps of the Vaquero et al. Algorithm

1. The graph is partitioned with any initial partitioning procedure.

2. Vertices perform label propagation as explained above with two major adaptations.

• In the first step i, the vertices are not each assigned a unique label. Instead, their initial partition is adopted as the label at t = 0. This initial labeling ensures that the number of different labels in the first step equals the number of partitions.
• Instead of steps ii and iii, which require a demanding ordering of all the nodes, the method confronts the oscillation problem by adding a random factor

22 2.2. Pregel-Compliant Algorithms

to migration decisions. Each vertex adopts a label with a probability s, 0 < s < 1. This approach introduces an issue with convergence time. With s = 0, no swaps occur, and with s = 1, swapping is not prevented. At the same time, the lower the s, the lower the number of overall migrations and thus the longer the time to converge. Moreover, the higher the s, the higher the chasing effect and thus the number of wasted migrations, prolonging the convergence time again. The experience from the original paper [37] is that the best setting is s = 0.5.

To simplify, the constraining method that maintains balanced partitions will be explained after these basic steps of the algorithm.

3. Migration of the vertices that were set to change their partition.

4. Any new incoming vertices are assigned a partition according to the same strategy – the prevailing partition among their neighbors.
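The probabilistic label adoption of the adapted step 2 can be sketched as follows (a hypothetical helper; the tie handling and the parameter s follow the text above, with the random source injectable for testing):

```python
import random
from collections import Counter

# Vaquero et al. adaptation of label propagation: compute the
# prevailing neighbor label, but adopt it only with probability s
# (s = 0.5 reported as the best setting); ties favour the current
# partition to avoid needless migrations.

def propose_label(v, adj, labels, s=0.5, rng=random.random):
    freq = Counter(labels[u] for u in adj[v])
    best = max(freq.values())
    candidates = [l for l, c in freq.items() if c == best]
    if labels[v] in candidates:          # tie with the current partition: stay
        return labels[v]
    new = candidates[0] if len(candidates) == 1 else random.choice(candidates)
    return new if rng() < s else labels[v]
```

With s between 0 and 1 the vertex sometimes keeps its old label even when a better one exists, which is exactly what breaks the oscillation without a global ordering.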

2.2.1.3 Constraints

In order to create a balanced partitioning, Ugander and Backstrom [39] – on whom Vaquero et al. [37] base their work – constrain the partition size with both a lower and an upper bound, equal to (1 − f)⌊|V|/n⌋ and (1 + f)⌈|V|/n⌉ respectively, where f is some fraction f ∈ [0, 1] and n is the number of partitions. However, Vaquero et al. approach partition balancing with their own constraint. They define the maximum number of vertices allowed to change from a partition P^i to a partition P^j, i ≠ j, in an iteration t as

Q_i,j(t) = C^j(t) / (|P(t)| − 1),   (2.2)

where C^j(t) is a capacity constraint on partition P^j such that |P^j(t)| ≤ C^j; P(t) = {P^1(t), ..., P^n(t)}, so |P(t)| = n. In other words, the remaining capacity of a partition P^j is equally divided among the other partitions as available space for their vertices to migrate to the partition P^j. This is an upper-bound constraint. The authors do not explicitly mention any lower-bound constraint.
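Reading C^j(t) as the remaining capacity of the target partition, as the surrounding prose suggests, equation 2.2 reduces to a one-line helper (an illustrative name, not from the paper):

```python
# Per-iteration migration quota of Vaquero et al. (equation 2.2),
# under the reading that C^j(t) is the free capacity of partition j:
# that free capacity is split evenly among the other n - 1 partitions.

def migration_quota(remaining_capacity_j, n_partitions):
    """Max vertices allowed to move into partition j from any single
    other partition in this iteration."""
    return remaining_capacity_j / (n_partitions - 1)
```

For example, with 60 free slots on the target partition and 4 partitions in total, each of the 3 other partitions may send at most 20 vertices in one iteration.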

2.2.2 Ja-be-Ja

Ja-be-Ja (the name means “swap” in Persian [28]) promises a distributed, local algorithm for balanced partitioning [28]. It is a graph repartitioning method resembling the previously described algorithm of Vaquero et al., yet with significant differences. Ja-be-Ja uses a friendly terminology of colors and energies, which aids comprehension. It also offers solutions for both edge- and vertex-cuts, and it discusses the case of weighted graphs. Different colors of vertices represent different partitions, just as different labels do in label propagation. The energy of a graph, that is, the number of edges between vertices of different colors (the edge-cut), is defined as

E(G, P) = (1/2) Σ_{v ∈ V} (d_v − d_v(π_v)),   (2.3)

where π : V → {1, ..., k} is a function assigning a color to a vertex, so π_v is the color of vertex v; d_v is the number of neighbors of the vertex v, and d_v(π_v) is the number of neighbors of the vertex v with the same color as v. Hence, the expression (d_v − d_v(π_v)) is the number of neighbors of vertex v with a different color. Each edge in the energy function is counted twice, which is the reason for the division by two. In other words, the problem of balanced partitioning is expressed in Ja-be-Ja as a minimization of the energy function E.
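Equation 2.3 can be evaluated directly from an adjacency structure (a sketch; `adj` maps each vertex to its neighbor list with both directions present, so every cut edge is seen from both endpoints):

```python
# Ja-be-Ja energy (equation 2.3): count, over all vertices, the
# neighbors with a different color, then halve the total because
# each cut edge is counted from both of its endpoints.

def energy(adj, color):
    total = sum(1 for v in adj for u in adj[v] if color[u] != color[v])
    return total // 2
```

On a 4-cycle colored 0, 0, 1, 1 the two crossing edges give an energy of 2.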

2.2.2.1 Steps of Ja-be-Ja

1. Colors are assigned to vertices uniformly at random. The number of different colors equals the number of partitions.

2. Each vertex samples its neighbors (or random vertices – as explained in part 2.2.2.2) and computes the utility (change in energy) of a color exchange. The two colors are exchanged if the swap results in an energy decrease. If more neighbors are identified as swapping candidates, the one with the highest energy decrease is chosen for the color exchange; see part 2.2.2.3 for more detail. However, if the winning candidate is located on a different physical node than the current vertex while some other potential candidates are located on the same physical node as the vertex, a candidate from the same physical node is chosen.

Note that in Vaquero et al. the labels are adopted, so the number of occurrences of a particular label changes, which necessitates constraints on label selection. In comparison, the Ja-be-Ja color swapping keeps the number of occurrences of each color unchanged.

2.2.2.2 Sampling

There are three ways to choose the set of potential vertices for swapping.

• Local – only the neighbors of a given vertex are chosen for the sample set.


• Random – the sample set is chosen uniformly at random from the entire graph. Each host maintains a sample of random vertices from which its local vertices pick their random sample.

• Hybrid – combines the local and random approaches in a specific order. If the local method does not yield any vertices that could improve the current state, a random sample is selected and investigated for exchange.

Experiments in the original paper [28] show that the hybrid approach performs best in the majority of cases, so it is used by the authors in all subsequent evaluations.

2.2.2.3 Swapping

The decision to swap a color between two vertices has two parts:

1. a utility function of a color swap

2. a policy to avoid local optima.

The objective of the utility function is to reduce the energy of an edge-cut. To do so, the number of neighbors with the same color, d_v(π_v), should be maximized for every vertex v. Therefore, a vertex v and a vertex u swap their colors only if it reduces their energy, that is, if:

d_v(π_u)^α + d_u(π_v)^α > d_v(π_v)^α + d_u(π_u)^α,   (2.4)

where α is a parameter that can influence a distribution preference. To give better insight into the α parameter, let us consider the example adopted from [28] in Figure 2.4. If α = 1, the vertices p and q in Figure 2.4a will exchange their colors, because 1 + 3 > 1 + 0; in other words, the total number of neighbors with the same color increases. On the other hand, in the case of Figure 2.4b, the colors of vertices u and v do not change, because 1 + 3 ≯ 2 + 2. However, if α > 1, these vertices will swap colors. This means that α can relax the condition of the color exchange, which can help the subsequent exchange of the remaining two yellow and two red vertices in 2.4b. Furthermore, to overcome the problem of getting trapped in a local optimum, the Ja-be-Ja algorithm incorporates the method of simulated annealing. That means a variable T ∈ [1, T0] is introduced, representing a temperature starting at T0 and decreasing over time. The impact of the temperature is the acceptance of degrading solutions at the beginning of the algorithm (that is, accepting exchanges that increase the energy) and the restriction of these worsening adoptions over time. The utility function U is then defined as

U = [d_v(π_u)^α + d_u(π_v)^α] × T_r − [d_v(π_v)^α + d_u(π_u)^α],   (2.5)


Figure 2.4: Examples of two potential color exchanges [28]. (a) The color exchange between p and q is accepted if α ≥ 1. (b) The color exchange between u and v is accepted only if α > 1.

where T_r = max{1, T_{r−1} − δ}. The parameter δ influences the speed of the cooling process. Over time the temperature settles at 1, after which the utility function reverts to the logic of equation 2.4.
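Equations 2.4 and 2.5 combine into a single swap test (a sketch with hypothetical helper names; d(v, c) counts the neighbors of v colored c, raised to the power α, and the swap is accepted when the temperature-scaled utility is positive):

```python
# Ja-be-Ja swap decision (equations 2.4 / 2.5): two vertices swap
# their colors when the gain of the swapped coloring, scaled by the
# temperature T, exceeds the value of the current coloring.

def d(v, c, adj, color, alpha=2.0):
    """Number of neighbors of v with color c, raised to alpha."""
    return sum(1 for u in adj[v] if color[u] == c) ** alpha

def should_swap(u, v, adj, color, alpha=2.0, T=1.0):
    gain_new = d(v, color[u], adj, color, alpha) + d(u, color[v], adj, color, alpha)
    gain_old = d(v, color[v], adj, color, alpha) + d(u, color[u], adj, color, alpha)
    return gain_new * T > gain_old
```

With T = 1 this is exactly equation 2.4; during the annealing phase T > 1 lets some energy-increasing swaps through, mirroring the simulated-annealing policy above.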

2.2.2.4 Weighted Graphs

In the case of weighted graphs, the Ja-be-Ja algorithm offers a solution consisting of an adaptation of the number of neighbors d_v. The number of neighbors with a color c is defined as

d_v(c) = Σ_{u ∈ D_v(c)} w(u, v),   (2.6)

where w(u, v) represents the weight of the edge (or edges) between the vertices u and v.

2.2.3 Summary

In conclusion, the Vaquero et al. algorithm seems less computationally strenuous, as it always stays in local scope, whereas Ja-be-Ja needs to communicate within the whole graph after each iteration to swap the corresponding vertices and to gather a set of random vertices for each partition in the next round. The swapping process of Ja-be-Ja can also be expected to be much slower than the label adoption of Vaquero et al. The Ja-be-Ja algorithm always results in a well-balanced solution (if the initial partitioning is balanced), because the proportion of particular colors does not change. On the other hand, it cannot offer even slightly elastic behavior regarding balancing (it cannot be set that the partition proportions may, for example, differ by up to five percent). The Vaquero et al. method, by contrast, proposes only an upper bound on each partition, which can create balanced solutions when the majority of the partition space is taken by data. However, if the amount of data on each partition is, for instance, less than half of its available space, it could produce an unbalanced solution, even shifting data entirely out of some partitions.

Last but not least, the Ja-be-Ja algorithm confronts the problem of local minima by applying simulated annealing; Vaquero et al. do not even mention this issue. Furthermore, Ja-be-Ja offers a solution for weighted graphs, whereas Vaquero et al. do not.

Table 2.1 summarizes these conclusions, dividing them into five categories. The resulting values are labeled with plus or minus signs to denote the performance of the given algorithm in the corresponding category.

                             Vaquero et al.   Ja-be-Ja
computationally demanding    less (+)         more (−)
partition balancing          worse (−)        always (+)
elasticity of balancing      better (+)       never (−)
manages local optima         no (−)           yes (+)
manages weighted graphs      no (−)           yes (+)

Table 2.1: Comparison of the Pregel-compliant algorithms Vaquero et al. and Ja-be-Ja


Chapter 3

Technology

Although several single-server graph database solutions exist, such as the very popular Neo4j, AllegroGraph or Amazon Neptune, with graphs the size of huge social networks or stock markets it is inevitable to work with data distributed over several nodes. The market already offers several distributed graph database products; these software solutions are introduced in the main part of this chapter. The second part acquaints the reader with the graph computing framework Apache TinkerPop, its structure, and its language.

3.1 Distributed Graph Databases

As is common in the software community, distributed graph database solutions are both open-source and commercial. We are going to mention representatives of both groups, focusing in more detail on open-source implementations.

3.1.1 JanusGraph

JanusGraph is a distributed transactional16 graph database. It is a fork of the open-source project Titan, whose further development was stopped in 2015 when its mother company Aurelius was acquired by DataStax [40]. Since its origin, JanusGraph has been growing significantly in popularity [41]. Like its predecessor, JanusGraph can store hundreds of billions of vertices over several independent machines. For retrieval and modification of data, it uses the Gremlin query language, which is a component of the Apache TinkerPop framework (described in 3.2).

16As section 1.1 introduced transactions, a transactional database means that an operation or a set of operations is encased in a transaction, which can run in parallel with other transactions, always resulting (with the BASE concept, eventually) in the same state.


3.1.1.1 Data Storage

JanusGraph does not provide its own storage backend. Instead, it allows the user to choose an external solution in accordance with the user's needs. If users want to change their storage backend, the only adjustment they need to make is a change of configuration [42].
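As an illustration, switching backends amounts to editing the JanusGraph properties file; a minimal sketch, assuming a Cassandra cluster (the hostnames are placeholders, not real machines):

```properties
# Hypothetical janusgraph.properties fragment: selecting Cassandra
# as the storage backend. Swapping storage means changing only
# these backend lines.
storage.backend=cql
storage.hostname=cassandra-host1,cassandra-host2
```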

Supported Storage Backends

• Apache Cassandra – Apache Cassandra is an open-source wide column store, a NoSQL DBMS that is distributed and fault-tolerant. It is claimed that Cassandra's replication is best-in-class [43] and that it is linearly scalable without any bottlenecks [44], successfully running with 75,000 physical nodes storing over 10 PB of data.

• Apache HBase – Apache HBase is an open-source database modeled after Google's Bigtable, distributed and scalable. It is a Hadoop project running on the Hadoop Distributed File System (HDFS), and so it is tailored to store large tables of billions of rows and millions of columns [45].

• Google Cloud Bigtable – Google Cloud Bigtable is Google's database powering Google Search, Analytics, Maps and Gmail [46]. Bigtable is thus designed for massive loads, low and consistent latency, and high throughput. However, Cloud Bigtable is a commercial product with a set pricing list.

• Oracle Berkeley DB – an open-source storage (with a commercial option) running in the same Java Virtual Machine (JVM) as JanusGraph. The advantage is very high performance [47]. On the other hand, all the data have to fit on a local disk (it is a single-server solution only), limiting a graph to 10s to 100s of millions of vertices [42].

Connection Modes The storage backends can be connected to JanusGraph in different modes [48]. These are:

• Local Server Mode – The storage backend runs on the same local host as JanusGraph and the related user application. The storage and JanusGraph communicate via a localhost socket. The mode is depicted in Figure 3.1.

• Local Container Mode – If a storage backend is not supported on a demanded platform (as Cassandra, for example, does not natively run on Windows and OS X), a Docker container can be used to overcome this obstruction.

30 3.1. Distributed Graph Databases

Figure 3.1: Local Server Mode of JanusGraph with storage backend [48]

Figure 3.2: Remote Server Mode of JanusGraph with storage backend [48]

Figure 3.3: Remote Server Mode with Gremlin Server of JanusGraph with storage backend [48]

• Remote Server Mode – In a distributed environment, the storage backend runs on several physical machines, whereas a JanusGraph instance runs on one or a few. The storage cluster stores the whole graph, and JanusGraph instances connect to the cluster via a socket. As shown in Figure 3.2, a user application can run in the same JVM as a JanusGraph instance.

• Remote Server Mode with Gremlin Server – If a JanusGraph instance is wrapped with Gremlin Server, a user application can communicate with the Gremlin Server remotely as a client. This mode also removes the requirement that the user application be written in Java, as later explained in 3.2. It is depicted in Figure 3.3.

• JanusGraph Embedded Mode – A storage backend and JanusGraph can also run in the same JVM, removing the overhead of communicating over the network. In the embedded mode, JanusGraph itself runs a storage daemon through which it connects to the storage cluster. Figure 3.4 shows a situation


Figure 3.4: Embedded Mode of JanusGraph with storage backend [48]

Table 3.1 indicates, for each storage backend (Apache Cassandra, Apache HBase, Google Cloud Bigtable, and Oracle Berkeley DB), which of the connection modes (Local Server, Local Container, Remote Server, Remote Server with Gremlin Server, and Embedded) it supports.

Table 3.1: Possible combinations of storage backends and connection modes

when Gremlin Server wraps a JanusGraph instance in order to enable remote communication with an end-user application. The embedded mode requires some garbage-collector tuning, though [48].

Nevertheless, not all of the supported storage backends support all of the mentioned modes. Table 3.1 contains the possible combinations of storage backends and connection modes.

Data Model JanusGraph stores its graph in an adjacency list, which was introduced in section 1.2.2. However, JanusGraph does not store adjacent vertices in the list but incident edges, to keep the references unambiguous (as the represented graph is a multi-graph). Such adjacency lists are stored as key-value pairs, where the key is a vertex ID and the value comprises all incident edges together with the properties of the corresponding vertex. The layout is shown in Figure 3.5.
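The key-value layout can be sketched as follows. This is an illustrative in-memory model only, not JanusGraph's actual on-disk encoding; the vertex IDs, property names, and record shape are invented for the example.

```python
# Illustrative sketch (plain Python, not JanusGraph's real storage format):
# one key-value pair per vertex; the value holds the vertex properties plus
# ALL incident edges, so a vertex's neighbourhood is a single key lookup.
graph_store = {
    1: {'properties': {'name': 'Anakin'},
        'edges': [{'label': 'son', 'direction': 'OUT', 'other': 2}]},
    2: {'properties': {'name': 'Luke'},
        'edges': [{'label': 'son', 'direction': 'IN', 'other': 1},
                  {'label': 'sister', 'direction': 'OUT', 'other': 3}]},
    3: {'properties': {'name': 'Leia'},
        'edges': [{'label': 'sister', 'direction': 'IN', 'other': 2}]},
}

# All edges incident to vertex 2 are co-located with its vertex entry:
luke = graph_store[2]
out_neighbours = [e['other'] for e in luke['edges'] if e['direction'] == 'OUT']
# out_neighbours == [3]
```

Note that each edge appears twice in this sketch (once per endpoint), which mirrors why a multi-graph needs the unambiguous edge references mentioned above.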

3.1.2 OrientDB OrientDB is an open-source NoSQL multi-model database. Its objective is to remove the need to look for specific dedicated solutions when possessing


Figure 3.5: JanusGraph data layout [49]

data of various types. The solution of OrientDB is to store all the different data under one hood. To ease the adjustment for developers accustomed to relational databases, OrientDB provides an SQL interface which can be used for any operation on the database. Since version 3.0, OrientDB also implements the TinkerPop 3 interfaces.

3.1.2.1 Data Storage In comparison with JanusGraph, OrientDB does not support external storage backends. It provides its own implementation of data storage instead. Nevertheless, the store still offers three different modes of usage [50].

Connection Modes

• plocal – “paginated local storage” is a connection mode which uses a disk cache and a page model. The data are accessed from the same JVM in which OrientDB is running, so there is no network communication overhead.

• remote – data are stored in a remote storage which is accessed via network.

• memory – no persistent storage is used; all the data stay in memory. Consequently, as soon as the JVM is shut down, all the data are lost. This mode is therefore suitable for testing purposes and for cases in which a graph does not have to persist.

Data Model The introductory parts of the OrientDB tutorial proclaim that, unlike some other multi-model DBMSs, OrientDB does not merely provide an interface over one model that is translated into another one. Instead, its engine has been built to natively support the graph, document, key-value, and object models at once [50]. Nevertheless, OrientDB does translate the Graph API into the Document API and further to a key-value or object-oriented storage, as shown in Figure 3.6.

                THE USER
                   ||
                  \||/
          +-----------------+
          |    Graph API    |
          +-----------------+
          |  Document API   |
          +-----------------+--------+
          | Key-Value and            |
          | Object-Oriented          |
          +--------------------------+

Figure 3.6: OrientDB APIs [51]

The translation does not add any extra complexity, meaning it does not cause an increase of read and write operations, which is what the native support of the models means. Vertices and edges are stored as instances of the classes Vertex and Edge, respectively. Links between the vertices and edges are not pointers but direct links, which is one of the basic concepts of OrientDB [52]. In other words, a graph containing a vertex v1, a vertex v2, and an undirected edge e12 connecting vertices v1 and v2 would be stored as follows

• v1 as an instance of the class Vertex (or its subclass) with a reference to e12

• v2 as an instance of the class Vertex (or its subclass) with a reference to e12

• e12 as an instance of the class Edge (or its subclass) with references to v1 and v2
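The three records above can be sketched with direct object references. This is a minimal illustration only; the Vertex and Edge classes below are simplified stand-ins invented for the example, not OrientDB's real engine classes.

```python
# Minimal sketch of the storage above: vertices and edges as class instances
# holding direct references to each other (simplified stand-ins for
# OrientDB's Vertex and Edge classes, not the real engine).
class Vertex:
    def __init__(self, name):
        self.name = name
        self.edges = []            # direct links to incident Edge instances

class Edge:
    def __init__(self, v_out, v_in):
        self.out_vertex = v_out    # direct link back to each endpoint
        self.in_vertex = v_in
        v_out.edges.append(self)   # register the edge on both vertices
        v_in.edges.append(self)

v1, v2 = Vertex('v1'), Vertex('v2')
e12 = Edge(v1, v2)
# v1 and v2 each hold a reference to e12; e12 references both v1 and v2
```

The point of the direct links is that following an edge never requires a lookup by ID; the reference leads straight to the connected record.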

The distributed architecture of OrientDB uses the Hazelcast project to distribute data among the nodes of a cluster. The distributed configuration has some limitations, such as no auto-sharding, some absent map/reduce functions, or some issues with constraints on edges [53].

3.1.3 ArangoDB ArangoDB is another multi-model database supporting graph, document, and key-value storage. ArangoDB is therefore a product similar to OrientDB, both receiving comparable attention from users [41]. Nonetheless, ArangoDB is

written in C++ while OrientDB is a Java application. Both approaches come with their pros and cons. Alongside the community edition, which is available under the Apache 2.0 license, there is an enterprise edition with additional features for companies. ArangoDB provides its own query language, AQL (ArangoDB Query Language), which resembles SQL. Although there has been a discussion within the ArangoDB Community about supporting the TinkerPop 3 framework [54], no such release has appeared yet.

3.1.3.1 Data Storage

ArangoDB used to provide only its custom storage engine based on memory-mapped files (mmfiles) [55]. Beginning with version 3.2, released in July 2017 [56], a pluggable backend storage is provided as well: users can select RocksDB, an embeddable persistent key-value store developed by Facebook [57]. ArangoDB also offers a distributed organization of data over the nodes of a cluster. Scalability in the distributed mode depends on the model used, the key-value store scaling best and graph data performing worst.

Connection Modes

• MMFiles – Memory-Mapped Files is the native default storage engine of ArangoDB, which needs to fit the whole dataset, together with its indexes, into memory. Write operations block read operations at the collection level.

• RocksDB – the size of the data is limited by disk space, not memory. Indexes are also stored on disk, which results in a much faster restart because no index rebuilding is needed. RocksDB writes and reads concurrently [55].

Data Model Each vertex and edge is stored as a document in ArangoDB. Documents in ArangoDB form collections, which are uniquely identified by an identifier and a unique name. A collection also needs to have an assigned type; there are two types of collections so far in ArangoDB – the Document and the Edge. Each type of edge (for example “knows” or “owns”) is an edge collection. An edge collection has the special attributes from and to declaring the inbound and outbound vertices. Each type of vertex (for example “a person”, “a company”, ...) is a document collection (also called a vertex collection).
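In ArangoDB's JSON document representation these special attributes appear as `_from` and `_to`, holding `collection/key` identifiers. The sketch below uses made-up collection names and keys; it models the documents as plain Python dictionaries purely for illustration.

```python
# Illustrative sketch of ArangoDB's data model (made-up keys/collections):
# vertices live in a document (vertex) collection, here "person"; edges live
# in an edge collection, here "son", whose documents carry the special
# _from/_to attributes referencing the connected vertex documents.
anakin = {'_key': 'anakin', '_id': 'person/anakin', 'name': 'Anakin'}
luke = {'_key': 'luke', '_id': 'person/luke', 'name': 'Luke'}

son_edge = {'_id': 'son/anakin-luke',
            '_from': 'person/anakin',   # vertex the edge points from
            '_to': 'person/luke'}       # vertex the edge points to

# Resolving an edge's endpoint is a lookup by document _id:
by_id = {doc['_id']: doc for doc in (anakin, luke)}
target = by_id[son_edge['_to']]
# target['name'] == 'Luke'
```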


3.1.4 Other Open-source Distributed Databases The market offers more distributed open-source graph databases. However, they are rather smaller in scale, often lacking detailed information about data storage, or experiencing no further development.

• Dgraph – a distributed graph database using a GraphQL-based language for its traversals. Dgraph does not support external storages; it uses its own embeddable key-value database, BadgerDB, written in pure Go [58]. Unfortunately, the documentation is not as rich as in the other analyzed solutions, and it lacks detailed information about the graph representation in storage. Nonetheless, Dgraph was established in May 2016 [59], which makes it a very young product. Version 1.0.0 was released just at the turn of 2017 and 2018, and the project is currently under active development. Improvements in many respects can therefore still be expected.

• Virtuoso – on the contrary, the Virtuoso Universal Server is a very old product by software standards. Created in 1998, with its latest release dated May 2018 [60], Virtuoso is a time-tested multi-model database. Nevertheless, Virtuoso is a multi-model relational database: although it supports RDF graphs, these are stored as data in a relational database, with the need to use joins when querying a graph. For this reason, it is not a graph database by definition.

• InfoGrid, HyperGraphDB, FlockDB – the development of these open-source graph databases has already been abandoned.

3.1.5 Other Commercial Distributed Databases The market offers several commercial solutions for distributed graph databases as well. As these products try to protect their know-how to keep their competitive edge, they usually do not provide detailed information about their internal implementation, which is why it is almost impossible to find specifics about their data storage and modeling. Nevertheless, a short summary of these products is provided for a better overview of the already introduced distributed graph database domain.

• DataStax Enterprise Graph – a distributed database built on Apache Cassandra. Such an arrangement resembles the JanusGraph – formerly Titan – architecture, as DataStax bought the whole team of Titan developers [40], as already mentioned.


• Microsoft Azure Cosmos DB – a multi-model database offered by Microsoft on its cloud platform Azure. Its data can be accessed with SQL, CQL (Cassandra Query Language), MongoDB protocol, Gremlin language or a proprietary API [61].

• Stardog – a distributed graph database supporting both property graph and RDF graph data model with SPARQL query language and Gremlin traversal language [62].

There are, of course, several graph databases whose development has ended, such as BlazeGraph, Sparksee, or GraphBase. On the other hand, there are also very promising brand-new commercial graph databases that can handle a huge amount of real-time data. The Croatian product MemGraph has a strong competitor in TigerGraph, which offers native parallelism handling 256,000 transactions per second [63].

3.2 Apache TinkerPop

Apache TinkerPop is an open-source graph computing framework for both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) [64]. The TinkerPop project originated in 2009 and is still actively developed, with the latest subversion 3.3.4 of TinkerPop 3 released in October 2018 [65]. The framework is so-called vendor-agnostic, which means it is not tied to the product of any specific manufacturer. It is currently utilized by all the major graph systems, such as Neo4j, JanusGraph, OrientDB, DataStax Enterprise Graph, and many others [66]. Apache TinkerPop is distributed under the Apache 2.0 license. Consequently, it can be modified and redistributed as long as the copyright is preserved and a specific disclaimer attached [67].

3.2.1 Structure Serving as an abstraction layer over graph databases, Apache TinkerPop is a collection of technologies incorporating its own and third-party libraries and systems [68]. TinkerPop is written in Java as a Maven17 project. It is composed of several parts depicted in Figure 3.7 and introduced in this section. Note that only the current TinkerPop 3 is discussed, as version 3 has existed for three years now; the obsolete TinkerPop 2 will not be examined here. However, for users familiar with TinkerPop 2 it is important to mention that the two versions differ significantly: the formerly individual TinkerPop 2 projects have been merged into today's Gremlin18 [69].

17Maven is a build tool for software project management and build automation. 18Specifically, TinkerPop 2 Blueprints are now the Gremlin Structure API, Pipes are GraphTraversal, Frames was transformed into Traversal, Furnace was divided into GraphComputer and VertexProgram, and Rexster is now known as Gremlin Server [69].

Figure 3.7: Apache TinkerPop framework [68]

3.2.1.1 Gremlin Server Although TinkerPop provides the Gremlin Console for direct communication with an underlying database, there also exists the Gremlin Server, which enables communication with remote databases. A request with a script written in the Gremlin language (described below) is sent, and its result is received in a response. The connection can be created via HTTP or WebSocket [68]. The communication via WebSocket can be written either in Java, as the whole TinkerPop project is developed in Java, or in any other language as long as a compliant driver exists. Available drivers already cover several languages such as Python, Go, .NET, PHP, TypeScript, and Scala [64]. The Server also provides monitoring capabilities and supports Ganglia/Graphite/JMX and other metrics [66].

3.2.1.2 Gremlin Traversal Language The Gremlin Traversal Language (shortly Gremlin) serves for the retrieval and modification of data in any graph database, provided that the database supports TinkerPop. Belonging to the functional family of languages, Gremlin creates a query as a sequence of operations chained with dots and evaluated from left to right.


The example “Find Anakin, traverse to his son and then to his sister and retrieve her name” would look as follows in Gremlin:

g.V().has('name','Anakin').out('son').out('sister').values('name')

This query means in detail [70]:

1. g – current graph traversal that could have been created as

graph = GraphFactory.open(’conf/star-wars-db.properties’) g = graph.traversal()

g is a so-called GraphTraversalSource, and the Gremlin traversals are spawned from it.

2. V() – all the vertices of the graph. Another method that a GraphTraversalSource offers is E() – all the edges of the graph. The return type of both V() and E() is a GraphTraversal, which contains many methods returning a GraphTraversal as well (so that the steps can be chained).

3. has(’name’, ’Anakin’) – filtering the vertices to get only those with prop- erty ’name’ equal to ’Anakin’.

4. out('son') – traversing the outgoing edges with type 'son' from the vertices of Anakins (actually the one vertex of Anakin Skywalker).

5. out(’sister’) – traversing outgoing edges with type ’sister’ from vertices of Anakin’s sons (Luke Skywalker).

6. values(’name’) – retrieve value of the property ’name’ of all the sisters of Anakin’s sons (’Leia’).

Each operation separated with a dot is called a step. The Gremlin language provides a significant number of steps, for example map, filter, sideEffect, toList, addE, aggregate, and limit, all thoroughly explained in its documentation [69].
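The left-to-right chained evaluation of these steps can be illustrated with a toy re-implementation. This is not TinkerPop itself: the class names and the in-memory graph format below are invented for the example, and only the four steps used above are modeled.

```python
# Toy re-implementation of chained-step evaluation (illustrative Python,
# not TinkerPop; the graph format and class names are invented).
class ToyTraversal:
    def __init__(self, data, items):
        self.data = data      # {'vertices': {id: props}, 'edges': [...]}
        self.items = items    # vertex IDs the traversers currently sit on

    def has(self, key, value):
        # keep only vertices whose property `key` equals `value`
        kept = [v for v in self.items
                if self.data['vertices'][v].get(key) == value]
        return ToyTraversal(self.data, kept)

    def out(self, label):
        # follow outgoing edges carrying the given label
        nxt = [e['in'] for e in self.data['edges']
               if e['label'] == label and e['out'] in self.items]
        return ToyTraversal(self.data, nxt)

    def values(self, key):
        return [self.data['vertices'][v][key] for v in self.items]

class ToyGraph:
    def __init__(self, vertices, edges):
        self.data = {'vertices': vertices, 'edges': edges}

    def V(self):
        # spawn a traversal positioned on all vertices
        return ToyTraversal(self.data, list(self.data['vertices']))

g = ToyGraph(
    vertices={1: {'name': 'Anakin'}, 2: {'name': 'Luke'}, 3: {'name': 'Leia'}},
    edges=[{'out': 1, 'in': 2, 'label': 'son'},
           {'out': 2, 'in': 3, 'label': 'sister'}])

names = g.V().has('name', 'Anakin').out('son').out('sister').values('name')
# names == ['Leia']
```

Each step returns a new traversal object, which is exactly what makes the dot-chaining possible.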

Path An important step for this thesis is the path()-step. Added to the sequence of steps, the path operation returns the whole path a traversal walked when evaluating the query left of the path entry. Extending the previous example of Anakin's children with path brings the following result:

g.V().has('name','Anakin').out('son').out('sister').values('name').path()
==> [v[1],v[2],v[3],Leia]


The path-step is enabled thanks to the Traverser, in which all the objects propagating through a traversal are wrapped [69]. In other words, a traverser keeps all the metadata about the traversal, and these metadata can be accessed.

Note that the usage of the word path does not strictly correspond to rigorous graph theory. A path in graph theory is defined as follows (adjusted according to Bondy and Murty [30]):

Definition. A path is a finite non-null sequence P = v0 e01 v1 e12 v2 ... e(k−1)k vk, where the terms e are distinct edges and the terms v are distinct vertices, such that, for 1 ≤ i, j ≤ k, the ends of eij are vi and vj.

The important part of the definition is the distinct edges and distinct vertices. If this condition is loosened, and the sequence can contain more occurrences of the same edge or vertex, we call such a sequence a walk [30]. The TinkerPop function path allows multiple incidences of a given element; therefore, the function actually represents a walk, not a path. In this text we are going to use the term path in conformity with the Tin- kerPop framework to avoid any misunderstandings because in the majority of cases we use the term referring to logging traversal paths with the function path19.
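The distinction between a path and a walk can be stated as a small predicate. This is a sketch only, with hypothetical element labels; it simply checks the distinctness condition from the definition above.

```python
def is_rigorous_path(sequence):
    """A sequence v0, e01, v1, ..., vk (alternating vertices and edges) is a
    path in the rigorous sense iff all its vertices and edges are distinct."""
    vertices, edges = sequence[0::2], sequence[1::2]
    return (len(set(vertices)) == len(vertices)
            and len(set(edges)) == len(edges))

walk = ['v1', 'e12', 'v2', 'e12', 'v1']   # revisits e12 and v1: only a walk
path = ['v1', 'e12', 'v2', 'e23', 'v3']   # all elements distinct: a path
# is_rigorous_path(path) is True; is_rigorous_path(walk) is False
```

What TinkerPop's path() returns may fail this predicate, which is exactly the point of the terminological note above.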

3.2.1.3 Graph Computer The GraphComputer interface enables OLAP processing, during which the whole graph can be accessed several times. As such, the computation of results can take minutes or hours. Accordingly, the GraphComputer was designed to run in parallel and operate in a distributed multi-cluster environment. The two main elements of the Graph Computer logic are the VertexProgram and the MapReduce interfaces.

• Vertex Program Section 2.2 introduced the Pregel computational model developed to process large-scale graphs. It is based on “vertex-centric” thinking: during each iteration, called a superstep, each vertex executes a user-defined function and sends messages to other vertices. The VertexProgram is an implementation of such a “think like a vertex” approach. Its method execute represents the function performed

19The TinkerPop framework also contains a simplePath function for filtering results containing only simple paths, that is, paths with distinct elements – so actually paths in the rigorous graph-theoretical sense.


Figure 3.8: OLAP processing in Apache TinkerPop [69]

every iteration on each vertex. Furthermore, it also contains methods such as setup, called at the beginning of the computation; terminate, called at the end of each iteration; or getMemoryComputeKeys, to acquire properties common to all vertices over the whole computation. All the VertexProgram methods can be found in the TinkerPop Javadocs [71]. In practice, the vertices are divided into subgroups managed by individual workers. Each worker executes the user-defined function on all its vertices. That behavior is shown in Figure 3.8.
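The superstep loop can be sketched with a toy maximum-value propagation. This is not TinkerPop's VertexProgram API; the function name and the data layout are invented for illustration, but the message-passing structure (compute on inbox, send to neighbours, halt when no messages remain) follows the Pregel scheme described above.

```python
# Toy sketch of the superstep loop: each superstep, every vertex runs the
# same function on its inbox and sends messages to its neighbours
# (maximum-value propagation; invented helper, not the real VertexProgram).
def propagate_max(neighbours, values):
    values = dict(values)
    # superstep 0: every vertex announces its value to all neighbours
    inbox = {v: [] for v in neighbours}
    for v in neighbours:
        for u in neighbours[v]:
            inbox[u].append(values[v])
    # further supersteps: adopt a larger value and forward it; the
    # computation halts when no messages are in flight any more
    while any(inbox.values()):
        outbox = {v: [] for v in neighbours}
        for v, messages in inbox.items():
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for u in neighbours[v]:
                    outbox[u].append(values[v])
        inbox = outbox
    return values

result = propagate_max({'a': ['b'], 'b': ['a', 'c'], 'c': ['b']},
                       {'a': 3, 'b': 1, 'c': 2})
# result == {'a': 3, 'b': 3, 'c': 3}
```

In a real deployment, the per-vertex body of the loop is what the workers execute in parallel on their subgroups of vertices.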

• MapReduce The MapReduce is an implementation of the standard MapReduce model for big data processing. It comprises two main functions

– map – converts an input key-value pair into an output key2-value2 pair. For example, taking the first sentence of this description and completing the task of counting the number of words of a given length, the mapping function would return 3: The, 9: MapReduce, 2: is, 2: an, 14: implementation, 2: of, 3: the, 8: standard, 9: MapReduce, 5: model, 3: for, 3: big, 4: data, 10: processing.

– reduce – converts an input key2-list(value2) pair into an output key3-value3 pair. The input from the previous example would look like 3: [The, the, for, big], 9: [MapReduce, MapReduce], 2: [is, an, of], 14: [implementation], 8: [standard], 5: [model], 4: [data], 10: [processing]


and the reduce function would create an output of 3: 4, 9: 2, 2: 3, 14: 1, 8: 1, 5: 1, 4: 1, 10: 1.
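The word-length example above can be reproduced with a small map-and-reduce pair. This is a Python sketch of the general MapReduce scheme, not TinkerPop's actual MapReduce interface; the function names are invented.

```python
# Sketch of the word-length example: map emits (length, word) pairs,
# reduce groups them by length and counts each group.
from collections import defaultdict

def map_phase(words):
    # word -> (length, word) pair
    return [(len(w), w) for w in words]

def reduce_phase(pairs):
    groups = defaultdict(list)
    for length, word in pairs:
        groups[length].append(word)      # key2 -> list(value2)
    return {length: len(ws) for length, ws in groups.items()}

sentence = ("The MapReduce is an implementation of the standard "
            "MapReduce model for big data processing")
counts = reduce_phase(map_phase(sentence.split()))
# counts == {3: 4, 9: 2, 2: 3, 14: 1, 8: 1, 5: 1, 4: 1, 10: 1}
```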

In TinkerPop, the MapReduce is also applied for all the aggregating requests of the VertexProgram; for this reason, it is part of Figure 3.8.

3.2.1.4 Other Components The other parts of the TinkerPop framework from Figure 3.7 are briefly described here. In the explanations, the term “providers” refers to those providers of graph databases who want their systems to support the TinkerPop framework.

• Core API – contains all the definitions of a Graph, Vertex, Edge, etc. which must be implemented by the providers.

• Provider Strategies – providers can define specific compiler optimiza- tions to leverage unique features of their systems [66].

• User DSL – the abbreviation DSL stands for Domain-Specific Language. The Gremlin language requires rather low-level knowledge of a graph; therefore, the queries may sometimes seem rather complicated. To spare the user this unnecessary complexity, a DSL can be specified to facilitate the query definition. For example, instead of g.V().has('jedi','name','Luke'), the DSL could enable a simple query sw.jedi('Luke').

Chapter 4

Analysis and Design

The objective of the implementation part of this work is to propose a different distribution of a graph stored in a distributed graph database in order to make user queries more efficient. For this reason, we need to identify one or more subprocesses that slow down the computation of a query result. The performance of a user query is reflected by the traversal it triggers. The query is composed of consecutive steps which navigate the traversal through the graph. In fact, no global analysis of the graph is required; the traversal uses only local subsets of the data. This behavior does not bring any severe performance caveats for single-server graph databases. Nevertheless, in the case of a distributed graph, the traversal can involve elements stored on different physical hosts, which degrades this locality. The network communication slows down the data retrieval substantially20. Therefore, in order to make the traversal more effective, we want to minimize the number of physical hosts involved in a query. To complete this objective, we need to know the usual traffic on the graph. This task can be logically divided into two main parts:

• extraction of necessary information from passed queries for further analysis

• analysis of the extracted data resulting in a proposal of a more efficient distribution.

Note that we do not consider the usage of replicas in the graph database. Similarly to other related papers, we always work with the notion of each instance being stored exactly once. The introduction of replicas is proposed as one of several possible extensions and enhancements in chapter 7.

20When considering the data already loaded in memory on both hosts, fetching data locally from the main memory takes about 100 nanoseconds [72], whereas reading 32 bytes over 10 Gbps Ethernet takes approximately 40,000 nanoseconds [73].


This chapter first defines the functional and non-functional requirements of the implementation. Subsequently, the model for extracted data storage is designed. Later, the overall architecture design is introduced, followed by details of its individual parts, including the selection of the repartitioning algorithm.

4.1 Requirements Analysis

Software system requirements are usually categorized into functional and non-functional groups. Functional requirements describe the real functionalities of the system – that is, what the system should do. The purpose of non-functional requirements, on the other hand, is to express how the system should do it, i.e., complying with what constraints. Both groups of requirements are here further divided into two categories following the logical separation at the beginning of this chapter – one category setting requirements for the query data extraction and the other category covering the further analysis and redistribution, abbreviated as the “Graph Redistribution Manager”.

4.1.1 Query Data Extraction The functional and non-functional requirements for the Query Data Extraction are presented in Table 4.1. The requirement of Traversal Monitoring determines that only the OLTP traversals are monitored. The OLAP queries usually trigger traversals of the whole graph; that is visible, for example, from the TinkerPop OLAP functionality of the Pregel-like Vertex Program operating on each vertex. Logging all the elements of the graph does not create any added value; therefore, the OLAP queries are not supported.

4.1.2 Graph Redistribution Manager While the Query Data Extraction has a clearly stated objective, the Graph Redistribution Manager (GRM) could be designed in several ways, including various functionalities and extensions, as it represents a standalone application. To abide by the scope of this work, the GRM is approached as a Proof of Concept (PoC) – a fully functional piece of code demonstrating the feasibility of the proposed methods – but without any extra functionalities such as a user interface. Nevertheless, we set some requirements even for the PoC, although the demands on the system are lower. The reason for incorporating one of the non-functional requirements – interoperability – should be further described here. The idea is to allow the GRM system to work over either a production database or a simplified representation of the production graph.


• Production database – The GRM works directly with the production database. Therefore, the algorithm also runs over the production database.

• Production graph image – The GRM works over a graph database containing an image of the production database. The image does not have to incorporate all the data of the production database, but:

– its graph must have the same topology

– each element has to contain the original ID

– each element has to contain information about the partition (the physical node) on which it is allocated.

The computational burden of repartitioning can be shifted in this manner from production to another environment. The architecture in 4.3 presents both these options.

All the functional and non-functional requirements for the Graph Redistribution Manager are defined in Table 4.2.

4.2 Extracted Data Storage

In order to create an efficient redistribution method, it is necessary to perform an effective manipulation of the extracted meta-information. To achieve that, the data must be properly stored. The purpose of this section is to find a suitable representation for the extracted data.

4.2.1 Storage The extracted data produced by the Query Data Extraction part defined in 4.1 represent paths21 of graph traversals. Consequently, the nature of the data is also a graph, with elements of the original graph. Moreover, it is also necessary to know the whole topology of the original graph, as the paths do not have to include all the graph elements22. For this reason, the extracted data should be stored with the whole graph as well-identified extra information on the original elements. As already mentioned in the previous section, two options of a connected graph database are possible. Neither of these options prevents the storage of the extra data unless they place substantial physical demands. The specific graph database product will be chosen in chapter 5, Implementation.

21Note that we use the term path in compliance with the TinkerPop framework, as explained in section 3.2.1.2.

22The fact that some elements were not a part of the logs does not mean that these elements should be redistributed at random. They can be utilized in future queries; therefore, for elements not logged yet, the graph structure should be taken into account.


Functional Requirements

• Traversal Monitoring – It must be possible to monitor the traversed elements (meaning a path21) of an OLTP graph traversal triggered by a user query.

• Path Logging – The monitored paths of individual traversals are logged into a separate file.

• Format Setting – The supported format of logging can be set in a configuration file.

• Anonymization – To enable logging from production databases, the code should be prepared for an option of logging an anonymized output.

Non-functional Requirements

• TinkerPop Compliance – As the number of TinkerPop-compliant graph DBMSs is significant, the solution should be a part of the TinkerPop framework in order to allow the data from any database supporting TinkerPop to be logged.

• Extensibility – It must be easy to add new features and customize the implementation. For example, it should be possible to easily implement and set another format type than those provided.

• Documentation – The code of the implementation should be properly documented, including a README file.

• Licensing – The additional implemented features must comply with the source licensing (as TinkerPop Compliance is already stated, this means the Apache 2.0 license).

Table 4.1: Functional and non-functional requirements for the Query Data Extraction


Functional Requirements

• Redistribution – Based on the extracted data of past graph traversals and on the provided distributed graph database, the program can offer a more efficient distribution of the data.

Non-functional Requirements

• General Framework Preference – The implementation should be aimed at a general framework rather than a specific product in order to allow a wider range of databases to be employed (in graph databases the general framework is the TinkerPop). However, a particular system will be chosen; therefore, vendor-specific adjustments can be applied if necessary.

• Interoperability – The database connected to the GRM system can either be a production database or another database containing a graph of the same topology with the same element IDs and partition information. The graph database with such a graph must always be provided.

• Documentation – The code of the implementation should be properly documented, including a README file.

Table 4.2: Functional and non-functional requirements for the Graph Redis- tribution Manager

4.2.2 Model Section 3.2.1.2 presents the definition of the path-step23. A path contains both vertices and edges. As discussed at the beginning of this chapter, the factor which we want to optimize to achieve better performance is the communication between partitions. This means that the elements we are interested in are edges. Note that it is also possible to incorporate the information about vertices; this is discussed in chapter 7 on possible enhancements, yet it is not included in our implementation.

4.2.2.1 Edge Traffic Representation The number of times a traversal walked an edge corresponds to the number of “jumps” (of network communication) from one partition to another when the edge

23As explained in 3.2.1.2, we use the term path, which allows multiple occurrences of an element.

connects two vertices on different partitions. The more frequent the traffic on the edge, the more likely we want the edge's vertices (and the edge) to be allocated on the same partition. Accordingly, what we want to store is the information about the traffic on each edge, that is, the total number of times the traversals walked over the particular edge. Regarding the log file containing the paths of traversals, it is the number of occurrences of the edge in that log file. Let us denote this number as the weight of an edge. One way to store the value of the weight is to create a property weight for each edge with a default value, for instance, equal to 1. However, this approach presents a serious limitation. A graph database can contain a schema which defines not only the possible types of an edge but also the set of its permitted properties. This means that in order to add a new weight property for redistribution purposes, the schema of all the edges would have to be altered (and for each new edge type it would have to be ensured that the weight property is added to its schema as well). If we wanted to remove the information for the redistribution, all the original edges would have to be modified again. For this reason, it is not the desired solution in the case of a configured production database.
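Computing the weights from the log file amounts to counting edge occurrences. The sketch below illustrates this; the log format (lists of alternating vertex and edge identifiers) is invented for the example and does not reflect the actual format produced by the Query Data Extraction.

```python
# Sketch: computing edge weights from logged traversal paths. The log
# format here is hypothetical; the weight of an edge is simply the number
# of its occurrences across all logged paths.
from collections import Counter

logged_paths = [
    ['v1', 'e12', 'v2', 'e23', 'v3'],   # one logged traversal path
    ['v1', 'e12', 'v2'],                # another traversal over e12
]

weights = Counter(
    element
    for path in logged_paths
    for element in path
    if element.startswith('e')          # count only edges, not vertices
)
# weights['e12'] == 2, weights['e23'] == 1
```

Edges with a higher weight are the ones whose endpoints the redistribution should preferably co-locate on one partition.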

Another possibility is to add a new edge of type T between the vertices incident with an edge that has been queried. The new edge would represent the original edge, and it would contain the property of weight (with default value equal to 1) and possibly a timestamp24. For each traversal on the original edge, the weight would be incremented on the new edge which represents it. The redistribution algorithm would run

• on edges of type T

• on an original edge e if there is not an edge of type T representing the edge e (this would happen when the edge e was not included in the log files).

Nonetheless, duplicating all traversed edges could seem overly demanding in terms of physical space. This model, however, can be reduced when all the edges between two vertices v and u are represented with only one new edge of type T. The explanation of why this approach works is split by the possible types of a graph representation:

• Edge data stored as a part of vertex data – when an edge is stored as part of vertex data such as in JanusGraph 3.1.1, it does not matter if each edge is duplicated and its weight incremented or if there is one substitute

24The chapter 7 contains a discussion of incorporating fading into the redistribution algorithm. For such an enhancement, a property containing a timestamp would be necessary. This work, however, does not include fading.


for all such edges. In both cases, all the edges are stored together, so when a graph is redistributed and the vertices of the substituting edge are allocated on different partitions, the separation concerns all of the edges alike.

• Edge data stored separately – when an edge is stored separately from vertex data, such as in OrientDB 3.1.2, each edge between two vertices v and u can theoretically be stored on any partition. Using only one substituting edge of type T could thus seem insufficient. However, if we used only one edge e_t representing all these edges and the algorithm decided that its incident vertices should be allocated on separate partitions P_i and P_j, it is desirable that the edges between them are stored either on P_i or P_j as well, otherwise more network communication would have to be performed. Therefore, in this case of storage, when “cutting” the substituting edge, the redistribution algorithm could redistribute its original edges evenly between P_i and P_j, or using any other approach. Either way, the usage of a substituting edge is possible.

• Vertex data stored as a part of edge data – in the case of vertex information kept within edge data (which is not implemented in any of the analyzed databases), the redistribution algorithm would be better suited to creating a vertex cut. With the application of vertex cuts, the proposed model would be suitable, with an explanation analogous to the first case of edge data stored as a part of vertex data. With edge cuts, the model would still fit, with the same explanation as in the previous case of edge data stored separately from vertex data. Nonetheless, again, there are no known databases storing the graph in this representation, so this option is unlikely to occur.

It has been shown that the proposed model is correct in various graph representations. Furthermore, with one edge containing a minimum of data and representing all the edges between two vertices, the extra physical space necessary is not substantial. For these reasons, we choose this model to store the logged data of traversal paths.

4.3 Architecture

The purpose of this section is to provide a high-level architecture perspective. Details of individual parts will then be presented in the following sections. As emerges from the analysis above, the whole system contains two separate segments – the Query Data Extraction (Logging in short) and the Graph Redistribution Manager – which need to interact. This interaction is shown

in Figure 4.1. The parts that have to be implemented are highlighted in blue. Following the non-functional requirements of the Query Data Extraction, specifically the General Framework Preference, it is convenient to design this part as a new module of the TinkerPop framework. This approach brings the following advantages:

• Vendor-agnostic implementation as a result of the GDBMS-independence property of TinkerPop itself. No matter which graph database is used, as long as it supports TinkerPop, the logging extension can be applied.

• Simple code sustainability, as the module can be held in a separate fork of the original TinkerPop project. Most likely, the fork would have to contain a few adjustments of the original files, which could be subject to merge conflicts when accepting new changes from the original project. Nevertheless, the main logic of the module predominantly exists independently in the fork, without being influenced by modifications in the original project.

According to the non-functional requirements of the Graph Redistribution Manager, it is required that the GRM can connect either to a production database or to another graph database containing a graph of the same topology (and other characteristics). The version with a directly connected production database is shown in Figure 4.1a, while the option with an image database can be seen in Figure 4.1b. From the figures, it is visible that, thanks to the separate design of query logging, the difference between the two versions is negligible. In both cases, the logging module must be a part of the production database. The Graph Redistribution Manager only reads the logs, which means that the loading part is entirely independent of the database used. On the other hand, the GRM stores its necessary information in the provided database, which can be either the production one or another one, but it has to support the TinkerPop framework. It is also likely that a specific type of graph database will be necessary for the configuration, because some specific adjustments will be inevitable, as stated in the non-functional requirements of the GRM 4.2. Nonetheless, this vendor-specific restriction concerns only the part of the architectural design represented with the arrow directly from the GRM to a graph database.

4.4 Graph Redistribution Manager

The purpose of the Graph Redistribution Manager is to produce a proposal for a more effective distribution of data in a graph database, based on logs with extracted data from past queries. Figure 4.1 already shows that the GRM


(Figure 4.1: Architecture design. Blue elements denote parts that have to be implemented. (a) Version with a shared database. (b) Version with a separate production database.)

accepts log files and communicates with a client and a graph database. It also presents two crucial parts of the GRM, a redistribution algorithm and an evaluator. The evaluator is a piece of code that can assess a specific distribution based on provided log files. Therefore, it is utilized for the comparison of the final proposed distribution and the original partitioning. It could also be used when monitoring the progress of the redistribution algorithm. The design of the primary process of the GRM – data loading, running an algorithm and handling a result – is expounded in the activity diagram 4.2. The necessary conditions for a successful result (such as “log data with an expected format”) are omitted. The action Extract necessary data is examined in detail in the previous


(Figure 4.2: Activity diagram of the main process in the Graph Redistribution Manager – read a log file line by line, extracting the necessary data and storing it in the DB; once all lines are processed, store a partition ID of each vertex unless partition IDs are already stored for all vertices; then run the redistribution algorithm and store the result in the DB.)

section 4.2. The decision node “Are partition IDs stored for all vertices?” and its subsequent action “Store a partition ID of each vertex” are suitable for both architecture designs of Figure 4.1 – a production database will store information about the partition ID of each vertex, while the image database contains it as already defined. In the case of a production database, this action could seem unnecessary. Nonetheless, the explicit presence of the partition ID as a property of a vertex allows better performance of the ensuing algorithm – if the redistribution algorithm always had to ask for the partition ID of each vertex, it could be slowed down, as the retrieval can involve computational steps and this information is needed very frequently. At the same time, the demands for physical space of the extra property are negligible.


4.5 Algorithm Design

Section 2.2 introduced the Pregel platform, convenient for computational tasks on distributed graphs. Two compliant algorithms were later presented, Vaquero et al. 2.2.1 and Ja-be-Ja 2.2.2. Their comparison is summarized in Table 2.1. The conclusion of these methods could imply that the Ja-be-Ja algorithm offers more advantages than the Vaquero et al. algorithm. However, the computational demands are higher in the case of Ja-be-Ja. This section examines possible enhancements of the Vaquero et al. method in order to achieve higher quality results while maintaining the advantage of lower computational demands.

4.5.1 Balancing

Vaquero et al. constrain the label propagation with the upper bound defined in Equation 2.2. This equation ensures that the number of vertices assigned to a partition never exceeds its capacity; thus this constraint should be maintained. Nonetheless, no lower bound is determined by the authors of Vaquero et al. As mentioned, the original paper [39] on which Vaquero et al. is based applies both a lower and an upper bound, equal to ⌊(1 − f) · |V|/n⌋ and ⌈(1 + f) · |V|/n⌉, with some fraction f ∈ [0; 1] and number of partitions n. With the usage of |V|/n, this approach expects all the partitions to have an equal size, which of course does not have to be the reality. Therefore, we propose the lower bound for a partition P_i as

    ⌊ (1 − f) · |V| · C_i / Σ_{j∈J} C_j ⌋,    (4.1)

where f is some fraction f ∈ [0; 1], C_i is the capacity of the partition P_i and J = {1, ..., n}, where n equals the number of partitions. The lower bound ensures that a partition will not be vacant. In other words, it contributes to a better balancing of the final result. On the other hand, if the fraction f was set too high, it could prematurely restrict the algorithm from adopting necessary labels, finishing with a worse outcome. For this reason, the effect of f will have to be thoroughly examined in the evaluation part of this work, presented in chapter 6. Incorporating both lower and upper bounds should produce balanced results which will not always perform perfectly accurate balancing (because of the fraction f) but will to some extent also allow a more natural grouping of the elements. In other words, the adjusted version should evince both satisfactory balancing and elasticity.
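The proposed lower bound of Equation 4.1 can be sketched in Java as follows; the helper and its names are illustrative only, not part of the GRM implementation:

```java
// Sketch of the proposed lower bound (Equation 4.1) for partition i:
// floor((1 - f) * |V| * C_i / sum_{j in J} C_j).
public class LowerBound {
    public static long lowerBound(double f, long vertexCount,
                                  double[] capacities, int i) {
        double totalCapacity = 0;
        for (double c : capacities) totalCapacity += c;
        return (long) Math.floor(
                (1 - f) * vertexCount * capacities[i] / totalCapacity);
    }

    public static void main(String[] args) {
        // 1000 vertices, three equally sized partitions, f = 0.1:
        // each partition must keep at least floor(0.9 * 1000 / 3) vertices.
        System.out.println(lowerBound(0.1, 1000, new double[]{1, 1, 1}, 0));
    }
}
```

Note how a larger capacity C_i proportionally raises the lower bound of that partition, so heterogeneous partitions are handled without assuming the equal size |V|/n.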


4.5.2 Local Optima

Vaquero et al. do not confront the problem of the local minimum at all, although it affects their method as well. During the computational process, a vertex always accepts the most frequent label among its neighbors. To avoid the discussed oscillation, a probability s is then applied to determine whether the label is actually adopted. The experimentally best value for this probability is equal to 0.5, as mentioned in 2.2.1. It would be worth experimenting with the possibility of accepting a label different from the most frequent one, that is, a possibly worsening solution. For this purpose, we propose the incorporation of simulated annealing (also used in Ja-be-Ja), which was explained in section 2.1.4. With simulated annealing, it is possible to begin the algorithm with some probability of accepting a degrading solution and to lower this probability as the method progresses.

4.5.2.1 Label Selection

For the described purpose of label selection of a vertex v, we propose the following equation

    ⌊ r · (T / T_init) · (|l(D_v)| − 1) ⌋,    (4.2)

where r is a random number r ∈ [0; 1], T is the current temperature, T_init is the initial temperature, D_v is the set of neighbors of the vertex v and l is a function returning the set of labels of the vertices in the provided set. In our case, we lower the temperature T down to 0 (instead of 1 as in the method definition in 2.1.4). If the number of different labels among the neighbors of a vertex was, for instance, 4, this equation returns one value from ⌊[0; 1] · [0; 1] · (4 − 1)⌋ = {0, 1, 2, 3}. This value represents an index into the sequence of sorted labels, with index 0 as the most frequent label and index 3 as the least common label among the neighbors. At the beginning of the simulated annealing, the fraction T/T_init equals one, allowing any of the four values to be chosen randomly. However, as the temperature decreases, the fraction decreases as well; for example, T = 1/2 · T_init yields the set of possible labels ⌊[0; 1] · 1/2 · (4 − 1)⌋ = {0, 1}, and at T = 0, ⌊[0; 1] · 0 · (4 − 1)⌋ = {0}, representing the most frequent label and the original approach of Vaquero et al.
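The following sketch (illustrative only, not the GRM code) implements Equation 4.2 for a vertex with a given number of distinct neighbor labels:

```java
// Sketch of the label selection of Equation 4.2: pick an index into the
// sequence of neighbor labels sorted from most to least frequent.
// With T = Tinit any label may be chosen; with T = 0 only index 0
// (the most frequent label, the original Vaquero et al. behavior).
public class LabelSelection {
    public static int selectIndex(double r, double t, double tInit,
                                  int distinctLabels) {
        return (int) Math.floor(r * (t / tInit) * (distinctLabels - 1));
    }

    public static void main(String[] args) {
        // Four distinct neighbor labels, temperature already at zero:
        // the index is always 0, i.e. the most frequent label.
        System.out.println(selectIndex(0.7, 0.0, 100.0, 4));
    }
}
```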

4.5.2.2 Probability of Adoption

The probability s of adopting a selected label to avoid oscillation is originally always equal to 0.5. However, with the approach of simulated annealing, the probability of oscillation is already reduced by the random selection of neighboring labels. Therefore, it is reasonable to alter the probability s as well.


The lower the temperature, the lower the probability of choosing a different label. Accordingly, the proposed probability s of label adoption is

    s = S + (T / T_init) · (1 − S),    (4.3)

where S is a constant corresponding to the resulting probability when the temperature reaches 0. As already mentioned, S is equal to 0.5. With this approach, at the beginning of the algorithm with T = T_init, the label is always accepted, as s = 0.5 + 1 · 0.5 = 1. As the computation progresses, the lowering temperature also lowers the probability of adoption, ending with T = 0, that is, the original s = 0.5 + 0 · 0.5 = 0.5.
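Equation 4.3 translates directly into code; again, this is an illustrative sketch rather than the GRM implementation:

```java
// Sketch of Equation 4.3: the adoption probability decays linearly with the
// temperature from 1 (always adopt) down to the constant S = 0.5.
public class AdoptionProbability {
    static final double S = 0.5;

    public static double s(double t, double tInit) {
        return S + (t / tInit) * (1 - S);
    }

    public static void main(String[] args) {
        System.out.println(s(100.0, 100.0)); // 1.0 at the start (T = Tinit)
        System.out.println(s(0.0, 100.0));   // 0.5 at the end (T = 0)
    }
}
```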

4.5.3 Weighted Graphs

To adjust the Vaquero et al. algorithm for a weighted graph, the approach of Ja-be-Ja is adopted. In section 2.2.2.4, Equation 2.6 was introduced, replacing the number of neighbors of a given color with the sum of the weights on the edges connecting these neighbors with the original vertex. This means that even if there were, for example, two neighbors with an identical label and the sum of the weights on the edges connecting them with the original vertex equaled 4, one neighbor with another label but with a sum of weights equal to 6 would be ordered as more frequent – because, based on the traffic represented by the weights, it is indeed busier.
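The weighted ordering can be illustrated with a small sketch (hypothetical helper, not the thesis code): neighbor labels are ranked by the summed weights of their connecting edges, so a single neighbor with weight 6 outranks two neighbors of another label with a total weight of 4:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the weighted frequency of neighbor labels: instead of counting
// neighbors per label, sum the weights of the edges leading to them.
public class WeightedLabels {
    // Each entry: the neighbor's label and the weight of the connecting edge.
    public record Neighbor(String label, double weight) {}

    public static String busiestLabel(List<Neighbor> neighbors) {
        Map<String, Double> perLabel = new HashMap<>();
        for (Neighbor n : neighbors) {
            perLabel.merge(n.label(), n.weight(), Double::sum);
        }
        String best = null;
        double bestWeight = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : perLabel.entrySet()) {
            if (e.getValue() > bestWeight) {
                best = e.getKey();
                bestWeight = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Two neighbors labeled "A" with total weight 4, one neighbor "B"
        // with weight 6: "B" is ordered as more frequent.
        System.out.println(busiestLabel(List.of(
                new Neighbor("A", 2), new Neighbor("A", 2), new Neighbor("B", 6))));
    }
}
```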

We started with comparison Table 2.1, containing two advantages and three disadvantages of the Vaquero et al. algorithm and three advantages and two disadvantages of the Ja-be-Ja method. By adjusting the Vaquero et al. algorithm with the proposed enhancements, we try to mitigate all the limitations from the table – improving partition balancing, confronting local optima and incorporating weighted graphs. Chapter 6 will examine the effects of the proposed parameters as well as the overall success rate of this modified algorithm.


Chapter 5

Implementation

This chapter presents the implementation part of this thesis, which was designed in the previous chapter 4. Based on the technology described in chapter 3, it first selects a specific graph database for the storage of the extracted data. Subsequently, it introduces the implementation of the query data logging, which is accomplished as a module of the TinkerPop framework. Last but not least, the inner structure of the Graph Redistribution Manager is explained.

5.1 Graph Database Selection

Given the discussion in 4.2.1, a distributed graph database must be selected as the storage for the extracted data (whether it is directly a production database or an image of such a database). The chosen solution should be an open-source project so that it can be extended if necessary. Following the non-functional requirements of the Graph Redistribution Manager set in Table 4.2, the selected product should support the TinkerPop framework in order to increase the potential reusability of the implementation. Based on the preceding analysis of 3.1 and the just mentioned requirements, two databases, JanusGraph and OrientDB, can be taken into consideration, as they are both TinkerPop-compliant open-source products.

5.1.1 TinkerPop Support

Regarding the TinkerPop framework, JanusGraph has implemented TinkerPop support from its origin, as its predecessor Titan did, because it utilizes the Gremlin Traversal Language as the only query language for its manipulation. OrientDB, on the contrary, has supported the TinkerPop framework only since version 3.0.0, released in April 2018 [50], as it is instead focused on its SQL interface.


5.1.2 Other Characteristics

Whereas JanusGraph is a graph database, OrientDB is a multi-model database that can store more types of data than graphs. The main difference between the JanusGraph and OrientDB graph functionality is a different model of data storage. JanusGraph applies adjacency lists, whereas OrientDB keeps its graph elements separately in the individual classes Vertex and Edge. Nevertheless, neither approach prevents the proposed model of extracted data from being employed, as examined in 4.2.2. For this reason, the difference in the model should not play a decisive role in the GDBMS decision. Regarding the distribution, JanusGraph uses external storage backends which take care of the distributed environment. OrientDB, on the other hand, provides its own specific database solution. Accordingly, it also manages the distributed arrangement itself. However, its distributed organization still suffers from several problems, as seen in the open issues on the GitHub page of OrientDB [74]. Furthermore, not only does OrientDB have some troubles with distribution, but its relatively new TinkerPop implementation is also still imperfect, among others suffering from issues with concurrent Gremlin calls or not yet supporting all the TinkerPop features [74].

To sum up, since we need an open-source distributed graph database which supports the TinkerPop framework, we can conclude that in order to avoid unnecessary problems and potentially a dead end, the currently more suitable solution is to select the JanusGraph database.

5.2 Query Data Extraction

In the description of JanusGraph 3.1.1 five different connection modes were presented. The logging can be applied to all of them. However, to explain the differences, the five modes can be generalized into two main arrangements:

• local mode – when the client operates in the same Java Virtual Machine (JVM) as JanusGraph, the logging is performed as part of the TinkerPop implementation of JanusGraph. This is shown in Figure 5.1a. In order to enable the logging functionality, the client code must explicitly call the logging strategy.

• server mode – when the Gremlin server is executed to enable instant access to the graph database, the logging can be carried out

– by TinkerPop implementation of JanusGraph as already shown in Figure 5.1a. A client can explicitly create a command to use the query data logging feature. This command is sent to the Gremlin server which provides it to JanusGraph.


(Figure 5.1: Query Data Extraction in different connection modes. (a) Local mode. (b) Server mode.)

– by the Gremlin server, as shown in Figure 5.1b. The Gremlin server can also be configured to log the processed query data. Such a server setting allows logging all the necessary information without the client's notice. Therefore, this approach is convenient when logging the data from a production database, for example.

As designed in the Architecture section 4.3, the Query Data Extraction is implemented as a separate TinkerPop module. Some minor adjustments of the original files had to be performed, hence these are introduced first. Subsequently, the logging module itself is described. The implementation is publicly available on the GitHub web page [75]. The cited project is a fork of the original TinkerPop project, version 3.3.3, with our alterations. Version 3.3.3 has been chosen because it is the officially compatible version for the utilized JanusGraph version 0.3.0 [76] (although version 3.3.4 of Apache TinkerPop is already available). This is also the reason why all the changes are part of the processed-result-log branch.

5.2.1 TinkerPop Modifications

In sum, four original files had to be adjusted to enable the query data logging. Two of these files are just POM 25 files adding the new logging module into the TinkerPop project. The other two files are modified to enable the logging from the Gremlin server. The server logging can be set in the server configuration file gremlin-server.yaml.

• pom.xml – the only adjusted line adds the logging module as a part of the whole TinkerPop project:

<module>processedResultLogging-gremlin</module>

25POM stands for Project Object Model. pom.xml files contain configuration and other information about a Maven project.


• gremlin-server/pom.xml – in order to enable “silent” server logging as in Figure 5.1b, the Gremlin server must depend on the logging module to be able to call its functionality.

<dependency>
    <groupId>org.apache.tinkerpop</groupId>
    <artifactId>processedResultLogging-gremlin</artifactId>
    <version>${project.version}</version>
    <optional>true</optional>
</dependency>

• gremlin-server/(...)/Settings.java – the Settings file manages the server settings configured by the yaml file. Accordingly, the settings for logging have to be added and registered to allow its configuration there.

public ProcessedResultManager.Settings processedResultLog =
        new ProcessedResultManager.Settings();
(...)
final TypeDescription PRLSettings =
        new TypeDescription(ProcessedResultManager.Settings.class);
constructor.addTypeDescription(PRLSettings);

• gremlin-server/(...)/AbstractEvalOpProcessor.java – processing operations dealing with script evaluation functions, the AbstractEvalOpProcessor contains decisions on whether to trigger additional operations based on settings from the configuration file. Therefore, the following condition has been added:

if (settings.processedResultLog.enabled) {
    ProcessedResultManager.Instance(settings.processedResultLog)
            .log(script, itty);
}

5.2.2 Logging Module

The TinkerPop project contains several basic Gremlin modules with the gremlin prefix (such as the gremlin-server or the gremlin-core), and then modules with specific modifications enabling the usage of third-party products (such as the hadoop-gremlin or the neo4j-gremlin). Accordingly, the whole module representing the additional intrinsic behavior of TinkerPop is called gremlin-processedResultLogging. The name intentionally does not contain the word query, because the module was written in a general way, able to log any additional operation performed on a resulting traversal. One


tinkerpop ........................... tinkerpop project v3.3.3
    gremlin-processedResultLogging .. the module
        src/main/java ............... implementation sources
            org.apache.tinkerpop.processedResultLogging
                ProcessedResultManager ...... the managing class
                processor ................... processor package
                    ResultProcessor ......... interface for a result processor
                    (specific processors)
                result ...................... result package
                    ProcessedResult ......... abstract class for processed result
                    (specific result types)
                formatter ................... formatter package
                    ProcessedResultFormatter  interface for a formatter
                    (specific formatters)
                util ........................ util package
                    ObjectAnonymizer ........ edge/vertex output anonymization
                    SimpleLogger ............ logger for the local mode
                LogPathStrategy ............. the logging strategy
        pom.xml ..................... Project Object Model file
        README ...................... README file

Figure 5.2: Structure of the Processed Result Logging module

of these possibilities, and an already implemented extension, is the logging of traversed paths (that is, providing the data about a query processing).

A description for users and developers of how to use the logging module (for example, how to use the different settings options or how to implement a new extension) is provided in the README file, available either on the mentioned GitHub page [75] or as a part of Appendix A.1, thus narrowing the explanation in this section solely to a code description.

The TinkerPop project incorporates the Apache Rat Plugin [77] (the Release Audit Tool), which prohibits a successful compilation unless the whole code is provided with the required copyright policy. For this reason, each new file of the logging module begins with a disclaimer licensing the file under the Apache License, Version 2.0.

A simplified structure of the module is depicted in Figure 5.2. Apart from the mentioned pom and README files, the module contains the source code divided into six groups, whose description follows.


5.2.2.1 Processed Result Manager

ProcessedResultManager is a singleton26 class managing the whole process from acquiring settings and creating a processed result to logging it. It contains an inner class Settings declaring six keys that can determine the logging behavior. Their specification and default values can be found in the README file A.1. The main function of the ProcessedResultManager is logging, which first checks the correctness of the settings, throwing appropriate exceptions when necessary. If the settings do not cause any exceptions, a processor is called to process the result, which is later logged with a logger. The logger is an implementation of the SLF4J 27 Logger. In the case of server logging, the processed result logging is part of all other standard Gremlin server logs (using Log4J) to follow the current logging logic of the server. Thanks to a specific logger name, the processed result logging can still be directed into a separate file. In the case of the local mode, a SimpleLogger 5.2.2.6 is utilized.

5.2.2.2 Result Processor

The interface ResultProcessor serves for processing an original result into a ProcessedResult 5.2.2.3. It declares a method process and possible exceptions for cases when a traversal does not provide the metadata about a path (the query was, for example, an insertion of data) or when a query does not use a traversal at all (for example, when creating a traversal on a graph). These exceptions are, however, sometimes expected to occur, as not all of the queries necessarily use traversals or produce metadata. Therefore, these exceptions are logged as warnings only.

The process method of the specific PathProcessor – the only processor implemented – contains the following steps.

1. It takes an input – a traversal of a result.

2. It creates a copy of the traversal in order not to influence the original traversal, which is further used by TinkerPop (written out to a console).

3. It adds an additional operation path() to its end unless it already is a traversal called with the path operation. The extra path function

26Singleton is a design pattern of a class containing only one instance in the whole program that can be globally accessed. 27SLF4J stands for Simple Logging Facade for Java enabling determining the logging backend at runtime. The TinkerPop uses as the specific logging backend the log4j utility.


retrieves metadata containing the whole path a traversal walked, as explained in more detail in 3.2.1.2.

4. It iterates over the altered traversal, that is, over the elements of the walked paths, and creates the final result out of them.

5. It returns the ProcessedResult containing list of elements of the path.

Another interface, AnonymizedResultProcessor, extending the ResultProcessor with a method processAnonymously, serves for an explicit division of processors creating anonymized and only original processed results. The PathProcessor is an implementation of the AnonymizedResultProcessor, which in its processAnonymously method anonymizes all elements becoming a part of the final result (removing any textual context).

5.2.2.3 Processed Result

The ProcessedResult is an abstract class representing a result of a ResultProcessor, containing an output of the additional operation run on an original result. The class contains, for example, a private Object result and an abstract method toString. So far, a specific processed result can be a list of lists of objects, List<List<Object>>. A particular extension of the ProcessedResult must also provide an implementation of the JsonSerializer and its method serialize, which converts the result into a JSON28 object, later stored into a log file. The resulting format is shown in the following section 5.2.2.4.

5.2.2.4 Formatter

The ProcessedResultFormatter serves for formatting a ProcessedResult into the final log string. Therefore, the interface ProcessedResultFormatter declares a method format with parameters of a ProcessedResult and a String representing the original query of the result. The concrete implementation of the ProcessedResultFormatter can, but does not have to, use the query in its output. Presently, three formats are available:

• Basic Formatter – the basic formatter logs only the result in its orig- inal form (using the method toString of the ProcessedResult).

• Query Formatter – logs both the query and the result, unchanged, as

#QUERY: g.V(1).outE('knows').inV().values('name')
#PR:

28JSON stands for JavaScript Object Notation.


v[1],e[7][1-knows->2],v[2],vadas
v[1],e[8][1-knows->4],v[4],josh

• JSON Formatter – logs the data in JSON format. Its output can look like

{
  Q: "g.V(1).outE('knows').inV().values('name')",
  R: [
    [{"v":1},{"e":7},{"v":2},{"string":"vadas"}],
    [{"v":1},{"e":8},{"v":4},{"string":"josh"}]
  ]
}

For that purpose the JsonFormatter uses the open-source GSON library by Google [78]. This is also the reason why the ProcessedResult must implement the function serialize of JsonSerializer.
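As an illustration of the JSON shape only, the R part of the output above can be produced from a list of paths with plain Java along the following lines; the mapping of elements to {"v":..}, {"e":..} and {"string":".."} objects is inferred from the sample output, and the module itself uses GSON instead of hand-built strings:

```java
import java.util.List;
import java.util.stream.Collectors;

// Dependency-free sketch of serializing a processed result (a list of paths,
// each a list of elements) into the "R" array of the JSON format above.
// The element encoding ("v1", "e7", plain strings) is a simplification.
public class JsonSketch {
    public static String serialize(List<List<String>> paths) {
        return paths.stream()
                .map(path -> path.stream()
                        .map(JsonSketch::element)
                        .collect(Collectors.joining(",", "[", "]")))
                .collect(Collectors.joining(",", "[", "]"));
    }

    private static String element(String e) {
        if (e.matches("v\\d+")) return "{\"v\":" + e.substring(1) + "}";
        if (e.matches("e\\d+")) return "{\"e\":" + e.substring(1) + "}";
        return "{\"string\":\"" + e + "\"}";
    }

    public static void main(String[] args) {
        System.out.println(serialize(List.of(List.of("v1", "e7", "v2", "vadas"))));
        // [[{"v":1},{"e":7},{"v":2},{"string":"vadas"}]]
    }
}
```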

5.2.2.5 Helpers

Two helper classes have been created in order to make the code well arranged and their functionalities reusable.

• ObjectAnonymizer – creates an anonymized version of an object. So far it can anonymize a graph element – an edge or a vertex. It removes all the information except an element ID (which means that if an ID was explicitly assigned and contains any information, it can still be exposed – this warning is part of the code).

• SimpleLogger – is an implementation of the SLF4J interface Logger. It serves for logging in the local mode, when the only records logged are processed results (in comparison with the Gremlin server logging, which is a part of all other log records of the server to follow the conformity of the server behavior). It is simple logging into a file without any extra overhead, but as it implements the SLF4J interface, it can easily be exchanged for a different implementation as well.

5.2.2.6 Logging Strategy

In comparison with the Gremlin server, in which a result is explicitly processed in a corresponding part of the source code, the local processing does not contain any suitable single point of interference. Therefore, in order to enable logging in the local mode, a Traversal Strategy has to be written to call the logging on each result. “A TraversalStrategy analyzes a Traversal and, if the traversal meets its criteria, can mutate it accordingly” [69]. A particular strategy as such can

64 5.3. Graph Redistribution Manager provide extra decoration, optimization, provider optimization, finalization or verification functionality. A strategy we want to create can be categorized into finalization strategies as it performs final adjustments of a traversal [69]. When creating a traversal source g as g = graph.traversal() which is further used for vast majority of queries, a method withStrategies can be called, registering strategies later used on each traversal. Our ProcessedResultLoggingStrategy primarily calls the logging of the ProcessedResultManager on each traversal. Last but not least, it also pre- vents the recursive calling of the strategy on cloned traversals with adding a marker to the original traversal.

5.3 Graph Redistribution Manager

The Graph Redistribution Manager performs the main steps presented in its design in diagram 4.2 – it reads a log file, stores partition IDs to vertices if they are not yet present, runs the redistribution algorithm it implements and stores the result (a proposal of a more efficient distribution) in the database. Apart from that, it can also load a dataset and store it in the configured database. It can then create queries according to the access patterns described in the following chapter 6, Evaluation. Running these queries generates the log files used as the input to the primary part of the GRM. The GRM is written in Java as a Maven project, in conformity with the TinkerPop project, so that potential developers – most likely users of the TinkerPop framework – would know the language and the build tool well. The GRM project is publicly available on GitHub [79]. The project structure is depicted in Figure 5.3. Individual classes and packages are further described in separate sections.

5.3.1 GRM.java

For the evaluation purposes of chapter 6, two datasets are applied. The usage of two datasets, both of which need several specific adjustments, led to an explicit division of the main class GRM into two classes, GRMT and GRMP, for the Twitter and the Pennsylvania road map dataset, respectively. Consequently, GRM is an abstract class containing functionality shared by the two subclasses, such as connecting to the graph or injecting data into it. This structure is very convenient for the current extent of the program, which requires a significant number of executions per dataset, but if more datasets were to be utilized, the GRM would be subject to refactoring.


src/main ............ application directory
  java ............... implementation sources
    GRM.java ............. abstract class with shared functionalities
    GRMT.java ............ executable class for Twitter dataset
    GRMP.java ............ executable class for Pennsylvania dataset
    logHandling .......... log loading and storing
    dataset .............. dataset loading, query creation
    cluster .............. partition ID retrieval
    partitioningAlgorithms ... redistribution algorithms
    helpers .............. helper classes
  resources .......... implementation resources
    config.properties .................... configuration of a graph db choice
    graph-berkeley-embedded.properties ... JanusGraph config for the Berkeley DB
    graph-cassandra-embedded.properties .. JanusGraph config for the Cassandra
    cassandra.yaml ....................... Cassandra configuration
    datasets ............................. datasets used
pom.xml ............. Project Object Model file
README .............. README file

Figure 5.3: Structure of the Graph Redistribution Manager project

5.3.2 LogHandling

The package logHandling contains tools for manipulating the log generated by the Logging module. The classes can be categorized according to two objectives:

• loaders – serve for loading the data from a log file and for subsequent storage of these data in the graph database. The LogFileLoader loads a log file with an iterator that iterates over individual records, each containing the paths produced by one traversal. Another loader, PRLogToGraphLoader, is then used to store these results in the graph database. PRLogToGraphLoader is a general interface implemented by DefaultPRLogToGraphLoader, realizing the model of extracted data storage designed in 4.2.2.

• representing objects – the log contains information that can be subdivided into objects of several levels of granularity. Each log record represents all the paths that were generated by one traversal. This information is represented by the PRLogRecord class. Individual paths of the traversal are encapsulated in PRPath objects containing the elements of the finest granularity, the PRElements – standing for particular edges and vertices of the graph.

5.3.3 Dataset

The dataset package includes the functionality for loading the datasets, creating their queries and running them in order to generate the log files. The interfaces DatasetLoader and DatasetQueryRunner are introduced, with specific implementations for the particular datasets.

5.3.4 Cluster

Each vertex must possess the information about its original partition ID. This explicit property (as discussed in 4.4) can either be already provided in the database or has yet to be added. The interface PartitionMapper serves this purpose, offering a mapping function for partition ID retrieval. In the evaluation part we mimic a real distributed environment, as the only information we need from it is the partition ID of each vertex. Since the datasets already contain consecutive IDs for their elements, we take these IDs modulo the number of nodes in the cluster as their partition ID. This operation is provided by DefaultPartitionMapper, which implements the PartitionMapper interface. A discussion of how to obtain the real values of partition IDs in a genuine distributed graph database is provided at the end of this chapter in section 5.3.8.
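The modulo mapping described above can be sketched as follows (the class and method names here are illustrative, not the actual GRM source):

```java
// Illustrative sketch of the modulo-based partition mapping described above.
public class ModuloPartitionMapper {
    private final int clusterSize; // number of nodes in the emulated cluster

    public ModuloPartitionMapper(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    // Maps a consecutive vertex ID to an emulated partition ID.
    public long getPartitionId(long vertexId) {
        return vertexId % clusterSize;
    }
}
```

With three nodes, vertices 0, 3, 6, … land on partition 0, vertices 1, 4, 7, … on partition 1, and so on.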

5.3.5 Partitioning Algorithms

The package partitioningAlgorithms currently contains one class, VaqueroVertexProgram. It implements the VertexProgram of the TinkerPop framework introduced in section 3.2 and provides the proposed Vaquero et al. algorithm presented in 4.5. After the initial setup, the execute method is called on each vertex of the graph. It contains the main decision of which label to adopt, abiding by the constraints of the upper bound, the lower bound and the probability of adoption. At the end, it sends information about its label value to all of its neighbors using the local MessageScope29. The execute method runs on each vertex in parallel. After an iteration, a synchronization barrier represented by the terminate method is applied to decide whether the algorithm should continue or halt. It stops either when all the vertices have voted to halt or when a predefined maximum number of iterations, set to 200, has been reached.

29 MessageScope is a class containing the identification of message receiver vertices in the VertexProgram.
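As a rough illustration of the decision made in execute, the following sketch adopts the most frequent label among a vertex's neighbors, keeping the current label on ties. It deliberately omits the capacity constraints, the temperature and the message passing of the real VaqueroVertexProgram; all names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class LabelAdoptionSketch {
    // Returns the label the vertex should adopt: the most frequent label
    // among its neighbors, or the current label when no other label is
    // strictly more frequent.
    public static int adoptLabel(int currentLabel, int[] neighborLabels) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int label : neighborLabels) {
            counts.merge(label, 1, Integer::sum);
        }
        int best = currentLabel;
        int bestCount = counts.getOrDefault(currentLabel, 0);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```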


5.3.6 Helpers

The helpers package contains two helper classes. The HelperOperators class implements the BinaryOperator interface, defining binary operators for the reduction of the data in the VaqueroVertexProgram. The second class, ShuffleComparator, implements the Comparator interface. It is utilized only in the Road network dataset; specifically, it is used for traversal shuffling when implementing a random walk on the graph.

5.3.7 Resources

The file config.properties serves for setting the graph.propFile property. It should contain the name of a file with the specific backend properties. Presently, two options are available: graph-cassandra-embedded.properties and graph-berkeley-embedded.properties. These specific property files contain standard properties of the corresponding backend storages for JanusGraph. In the case of Cassandra backend usage, JanusGraph further supplies Cassandra with its configuration file cassandra.yaml. The directory datasets contains the original data [80, 81]. In both cases, the data are provided as a set of pairs, each pair representing an edge with an outgoing and an incoming vertex, respectively.
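Based on the description above, a minimal config.properties could look as follows (the property name graph.propFile is taken from the text; the chosen value simply selects one of the two bundled backend files):

```properties
# selects the backend-specific property file used by the GRM
graph.propFile=graph-cassandra-embedded.properties
```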

5.3.8 Real Distributed Environment

The purpose of this last short section is to discuss the retrieval of physical node IDs in a real distributed environment of JanusGraph with the Cassandra backend, and the current obstacle to its physical redistribution.

5.3.8.1 Identification of the Physical Node of a Vertex

In order to manage its attached backend storage, JanusGraph utilizes the StoreManager, an interface which can be accessed via

StandardJanusGraph.getBackend().getStoreManager().

Some implementations of the StoreManager, such as CQLStoreManager, keep an instance of the class Cluster from the Cassandra driver package. This class enables access to the Metadata class, which can retrieve all replicas of a given partition key30 and keyspace. However, the Cluster instance is private to the CQLStoreManager class. Therefore, in order to operate with the Cluster instance, the current implementation of CQLStoreManager would have to be extended with a getCluster method.

30 The partition key is the information according to which an element is assigned to a physical node. In Cassandra, it is the first column of the primary key [82].


The call retrieving the partition ID could then look as follows:

((CQLStoreManager)graph.getBackend().getStoreManager()) .getCluster().getMetadata().getReplicas(keyspace,partitionKey)

In general, implementing the retrieval of physical node information depends on the concrete implementations of the JanusGraph backends. Their connection modes also influence such retrieval, because different connection modes of the same backend can have more than one StoreManager implementation in JanusGraph. Moreover, the JanusGraph architecture is not designed to support this feature in the first place, since it encapsulates the StoreManager and hides these lower-level features.

5.3.8.2 Physical Partitioning of Data

Originally, JanusGraph supported explicit partitioning on the physical level by using a lexicographic ordering of IDs. This ordering had to be supported on the backend level (Cassandra configured with the ByteOrderPartitioner). It means that JanusGraph was able to directly assign the partitions using the layout of the JanusGraph record ID. This layout had a 64-bit representation with the following structure

[ 0 | PARTITION | COUNT | PADDING ].

Note that the PARTITION part of the ID is at the beginning of the ID, which means it has the most significant vote in the lexicographic ordering. As a result, it affects the final partition of the record most substantially. The layout above has, however, been changed, and now has this form

[ 0 | COUNT | PARTITION | PADDING ].

With this change, the ability to manipulate the ordering was lost, because the PARTITION part of the ID now has a lesser vote in the ordering and the COUNT part cannot be modified. The current JanusGraph documentation presents a whole chapter on how to implement explicit partitioning [83]. However, after the layout change, the content of this chapter is no longer valid.


Chapter 6

Evaluation

This chapter presents the evaluation of the proposed enhancements of the Vaquero et al. algorithm introduced in 4.5. For that purpose, the implementation described in chapter 5 has been carried out, available on GitHub [75, 79]. First, the applied datasets are introduced, together with the user queries that have been generated for them. Subsequently, the evaluation methods are presented, followed by a description of the settings and, finally, the results themselves. At the end of the chapter, a summary comparing our proposed method with another described Pregel-like algorithm, Ja-be-Ja, is provided.

6.1 Datasets

In order to examine the implemented program, two different datasets were utilized. They vary significantly in their characteristics, and both represent real-world problems. First, data of Twitter [80], a social network for posting and reading messages, were examined. For the second case, the road network of Pennsylvania [81] was selected. For each dataset, its number of nodes, number of edges, average clustering coefficient, and diameter are provided. The clustering coefficient of a vertex expresses the connectivity of its neighbors (equal to 1 when it forms a clique31 with its neighbors, equal to 0 if none of its neighbors are connected). The diameter of a graph is the maximum eccentricity32 of its vertices.

It is not uncommon for graph repartitioning problems to emulate smaller environments for evaluations. For example, the authors of the Ja-be-Ja algorithm [28] described in section 2.2.2 used datasets of sizes up to 63,731 vertices and 817,090 edges.

31 A clique is a subgraph in which each vertex is connected to all other vertices of that subgraph.
32 The eccentricity of a vertex is its maximum distance to any other vertex in the graph, where the distance between two vertices is the minimum length of a path connecting them. Here we refer to the term path from rigorous graph theory.

Their Twitter dataset contains only 2,731 entities and 164,629 relationships. Our implementation enables the mimicking of partition allocation (as described in 5.3.4), so it also allows datasets of nearly any size to meet the capabilities of our physical resources without any bearing on the quality of the results. We decided on datasets containing 76,268 vertices with 1,768,149 edges (the Twitter dataset) and 1,088,092 vertices with 1,541,898 edges (the Pennsylvania road map dataset).

For both datasets (as for the majority of generally available datasets), there are no attached queries. Authors of papers usually create artificial logs, use only the structure of the graph (either without weights or with artificially generated weights), or do not state in any way how they obtain the initial state on which they base their computations. We define the behavior of a typical user for each dataset, according to which we generate corresponding queries. These emulated queries are then executed in JanusGraph with the new TinkerPop module, which triggers the logging extension producing query logs. For the social network dataset we create |V|/8 ≈ 9,500 queries, and for the road map dataset we generate a comparable number of |V|/100 ≈ 10,000 queries. Both datasets are described in more detail below, together with the access patterns applied for the generated queries.

6.1.1 Twitter

The Twitter dataset is obtained from [80], gathered by crawling public sources. Only the files with the graph structure were utilized, ignoring the files with actual user data (that is, properties of nodes). The dataset represents data of the Twitter social network, where users "tweet" – post short messages called "tweets" – and read these posts from other users. A user can "follow" other users so that their messages appear to the user when signed in to the application. The relationship is not reciprocal, meaning a user does not have to follow the user who follows him. In other words, a node represents a user, an out-going edge is a "follows" relation and an in-coming edge means an "is being followed" relationship. The dataset statistics are stated in Table 6.1. Graph 6.1 contains the information about the distribution of vertex degrees – overall degrees, in-degrees and out-degrees. Note that the y-axis has a log scale. The majority of vertices have a lower degree, but there also exist individual nodes with a very high degree. The number of vertices grows exponentially with decreasing degree. This power-law behavior33 of the degree distribution is called

33 A power law is a relationship in which one variable varies as a power of another variable.


Property                          Value
Nodes                             76,268
Edges                             1,768,149
Average clustering coefficient    0.5653
Diameter (longest shortest path)  7

Table 6.1: Twitter dataset statistics [80]

the scale-free network and is common for social networks, the World Wide Web network or airline systems [84].

6.1.1.1 Queries

For the first dataset of the Twitter social network, two access patterns have been devised.

• A typical Twitter user signs in to the application and reads the posts of the people he or she follows. This basic and probably most frequent behavior can be expressed as the selection of a vertex and the traversal of all of its out-going neighbors – in other words, a breadth-first search34 uncovering the first layer of nodes on out-going edges.

• This basic behavior is extended with another type of user action – searching for new people to follow. To improve the probability of finding relevant users, the search is defined as looking for people who are followed by the people being followed by the user. For instance, if Luke Skywalker followed Obi-Wan Kenobi, who followed the user Yoda, it would be likely that Luke also wants to follow Yoda.

As already mentioned at the beginning of this chapter, the number of queries equals |V|/8, which amounts to about 9,500 queries. The starting vertices of the queries are chosen at random (using the class Random of the Java package java.util). Every time a vertex is selected at random, it must fulfill condition 6.1:

log(max{2; out_degree(v_i)}) / log(max_out_degree) > r,    (6.1)

where r is a random number from [0; 1], max is a function returning the maximum value of a set, out_degree is a function returning the out-degree of the provided vertex and max_out_degree is the maximum out-degree present in the graph. The idea is that users who do not follow a larger number of people

34 The breadth-first search is a graph traversal method in which a starting vertex is selected and the graph is then explored "layerwise" – uncovering all the neighbor vertices in layer 1, then all the neighbors of those neighbors in layer 2, et cetera.


[Figure: three panels showing the degree, in-degree and out-degree distributions of the Twitter dataset; number of vertices (log scale) versus degree.]

Figure 6.1: Degree distribution of the Twitter dataset

do not use the Twitter application as frequently as users following a significant number of people. Simultaneously, if the condition were simply based on the fraction out_degree(v_i) / max_out_degree, the probability of accepting a vertex with a low out-degree would be extremely small, for example 1/1251 ≈ 0.08%. For this reason, a logarithm function is used to slightly reduce this discriminating effect, with log(max{2; 1}) / log(1251) ≈ 9.7%. This procedure of starting vertex selection is repeated until the required number of starting nodes (that is, the number of queries) is reached. The probability P_ik of unfolding the out-going neighbors of the k-th neighbor of vertex v_i is defined as

P_ik = average_degree / (average_degree * out_degree(v_i)),    (6.2)

with the underlying idea that users who do not follow a larger number of people have a higher probability of wanting to find more users they could follow.
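Condition 6.1 and the acceptance probability it induces can be written as a small helper (illustrative only; not part of the GRM source):

```java
public class StartVertexCondition {
    // Left-hand side of condition (6.1):
    // log(max{2, out_degree}) / log(max_out_degree).
    public static double acceptanceProbability(int outDegree, int maxOutDegree) {
        return Math.log(Math.max(2, outDegree)) / Math.log(maxOutDegree);
    }

    // A vertex is accepted as a starting vertex when the probability beats
    // a random number r drawn from [0; 1].
    public static boolean accept(int outDegree, int maxOutDegree, double r) {
        return acceptanceProbability(outDegree, maxOutDegree) > r;
    }
}
```

For a vertex with out-degree 1 and a maximum out-degree of 1251, this yields log 2 / log 1251 ≈ 9.7%, matching the value above.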


Property                          Value
Nodes                             1,088,092
Edges                             1,541,898
Average clustering coefficient    0.0465
Diameter (longest shortest path)  786

Table 6.2: Road network dataset statistics [81]

To restrict this searching behavior, a maximum number of unfolded neighbors is set, equal to |V|/8 vertices.

6.1.2 Road Network

The Road network dataset is taken from [81]. It represents the road map of the state of Pennsylvania, USA. Endpoints and intersections are represented as vertices, while the roads are interpreted as undirected edges (which are provided as two directed edges of opposite direction). The statistics of this dataset are stated in Table 6.2. In comparison with the Twitter dataset, the Road network contains approximately 13 times as many nodes, but a comparable number of edges. This means that the average vertex degrees of the datasets differ significantly. Graph 6.2 provides more detailed information about the overall degree of vertices in the graph. As the edges are undirected, the out- and in-degrees are not provided separately. It is also the reason why the degrees are only even numbers. The graph differs from the previous Twitter dataset immensely. There are only nine different degrees in this graph, while in the Twitter graph there are 3,769 different degrees. In the previous dataset, the degrees evince a specific distribution, whereas in the Road network data, with so few x-axis values, we can rather discuss the particular degrees than try to deduce a distribution. Note that the y-axis has a log scale. This means that the road map contains a few big cities (or intersections), but the vast majority of these crossroads are of a smaller size, with a lower number of roads connecting them with other roads.

6.1.2.1 Queries

One of the first ideas of a typical user behavior on a road map could be finding the shortest path between two points. The general task of the shortest path, however, is an OLAP query, for which the logging module is not intended, as explained in section 4.1. Therefore, a modified behavior has been created: a user searches for the shortest path between two points of a maximal distance equal to 10.



Figure 6.2: Degree distribution of the Road network dataset

First, a vertex is selected randomly. Similarly to the Twitter dataset, it must fulfill condition 6.1 to be included in the set of starting vertices. Subsequently, an ending vertex is found with the application of a random walk from the starting vertex. The length of the random walk is chosen randomly from the range [5; 10], and it is not allowed to visit a vertex that has already been traversed (in fact, following rigorous graph theory, we create a random path). With such a setting, it can happen that a random walk of a given length cannot be found because of a dead end. Such an eventuality is omitted; therefore, the actual number of queries is lowered to about 8,000. The maximum length of such a random walk is set to 10, taking into consideration both the measurements discussed in the Google Group of Gremlin users [85], showing the computational demands of such a search, and the number of the following tasks of finding the shortest path. With the knowledge of the existing distance between the two nodes, the search for the shortest path can be restricted to the length of the corresponding random walk, which means that this task is no longer an OLAP query35. To sum up, the user query is a search for the shortest path of a maximal given length between two vertices. Such a task, however, produces a substantial number of traversed paths, which means a large number of log records. With the maximal depth of 10 and vertices of degree 3, the breadth-first search can produce 2^10 = 1024 different paths. With 8,000 queries, this would create a number of records difficult to process with our resources. Therefore, the number of actually logged paths of one traversal is restricted to 15, always including the shortest path.

35 Note that it is also possible to create an OLTP type of query that finds the shortest path within the whole graph. However, the task is so demanding that the execution crashes due to a lack of resources.
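The random-walk endpoint selection can be sketched on a plain adjacency list as follows (a simplified in-memory version; the actual implementation shuffles Gremlin traversals via the ShuffleComparator mentioned in 5.3.6):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class RandomPathSketch {
    // Walks `length` steps from `start`, never revisiting a vertex (a random
    // path); returns the end vertex, or -1 when a dead end is hit, in which
    // case the query is discarded (as described above).
    public static int randomPathEnd(Map<Integer, List<Integer>> adjacency,
                                    int start, int length, Random random) {
        Set<Integer> visited = new HashSet<>();
        visited.add(start);
        int current = start;
        for (int step = 0; step < length; step++) {
            List<Integer> candidates = new ArrayList<>();
            for (int neighbor : adjacency.getOrDefault(current, Collections.emptyList())) {
                if (!visited.contains(neighbor)) {
                    candidates.add(neighbor);
                }
            }
            if (candidates.isEmpty()) {
                return -1; // dead end
            }
            current = candidates.get(random.nextInt(candidates.size()));
            visited.add(current);
        }
        return current;
    }
}
```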


6.2 Evaluation Methods

In section 4.5, we proposed several adjustments to the Vaquero et al. algorithm. The effects of the new factors are measured independently. First, the list of these factors is introduced; subsequently, the particular measured effects are described.

6.2.1 Factors

• Cooling Factor – the temperature of the simulated annealing is used as presented in equations 4.2 and 4.3. The temperature is always applied in a fraction; therefore, its initial value does not matter. The cooling factor, however, used for cooling the temperature as T_new = T_init × coolingFactor^iteration, influences the speed of the cooling process, which has a crucial effect on getting trapped in a local optimum.

• Imbalance Factor – equation 2.2 of the lower constraint introduces an imbalance factor f. It determines by how many percent a node can fall below its proportionally assigned number of vertices. This also means that, for example, with an imbalance factor of 10% and three identical physical nodes, it is possible that two nodes are undersized by 10%, the third node thus being oversized by 20%. The capacity of the nodes is always set in such a way that the upper bound of the capacity of a node does not restrict the migration – in the previous example, it is always allowed to oversize the node by 20%.

• Adoption Factor – equation 4.3 proposed a new probability of adoption incorporating the temperature into its computation. The equation can be written as

s = S + a · (T / T_init) · (1 − S),

where a denotes the adoption factor. When set to 0, the equation falls back to the original s = 0.5. When equal to 1, the proposed equation holds. The middle value of a = 1/2 is examined as well.
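The two formulas above can be condensed into a small helper (illustrative; T_init, the cooling factor and a are the symbols from the text, with S = 0.5 being the original adoption probability):

```java
public class AnnealingFactors {
    // Cooling schedule: T_new = T_init * coolingFactor^iteration.
    public static double temperature(double tInit, double coolingFactor, int iteration) {
        return tInit * Math.pow(coolingFactor, iteration);
    }

    // Proposed adoption probability: s = S + a * (T / T_init) * (1 - S).
    public static double adoptionProbability(double s0, double a, double t, double tInit) {
        return s0 + a * (t / tInit) * (1 - s0);
    }
}
```

With a = 0 the probability stays at the original S; with a = 1 and T = T_init it starts at 1 and decays toward S as the temperature cools.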

6.2.2 Effects

• Improvement – as discussed in 4.2, we measure the performance according to network communication, that is, the number of crossings between physical nodes a traversal has to make in order to process a query. For this reason, we count the improvement as the percentage value

1 − (number of crossings after the redistribution) / (number of crossings before the redistribution).

The measurement is done on the queries that were used as the input log for the redistribution.

77 6. Evaluation

• Number of Iterations – how many iterations of the Vaquero et al. algorithm it takes to produce a result. The ending condition of the method is the situation when no more vertices change their label, or when the maximum number of iterations, set to 200, is reached.

• Capacity Usage – the nodes in a cluster are initially evenly loaded (see 6.3 for details), whereas the redistribution algorithm proposes a relocation of the data. It is tracked how the capacity usage of individual nodes within the cluster changes, with the percentage imbalance of a node N_i measured as

(N_i − |V|/|N|) / (|V|/|N|),

where N denotes the set of nodes of the cluster, N_i represents the current number of vertices on the node i, and V is the set of vertices of the graph36. Note that the average |V|/|N| should be the proportional number of vertices assigned to a given physical node based on its capacity. As our setting contains nodes of equal sizes, the number can be simplified as an arithmetic average.
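The improvement and capacity-usage effects reduce to two one-line computations, sketched here as illustrative helpers:

```java
public class EvaluationMetrics {
    // improvement = 1 - crossingsAfter / crossingsBefore
    public static double improvement(long crossingsAfter, long crossingsBefore) {
        return 1.0 - (double) crossingsAfter / crossingsBefore;
    }

    // percentage imbalance of a node: (N_i - |V|/|N|) / (|V|/|N|)
    public static double imbalance(long verticesOnNode, long totalVertices, int clusterSize) {
        double proportionalShare = (double) totalVertices / clusterSize;
        return (verticesOnNode - proportionalShare) / proportionalShare;
    }
}
```

For example, reducing the crossings from 100 to 25 is a 75% improvement, and a node holding 400 of 900 vertices in a three-node cluster is oversized by one third.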

Additionally, the redistribution was run on different numbers of nodes in the cluster, to give a better idea of what values to expect with horizontal scaling.

6.3 Settings

Partition IDs For measurement purposes, our cluster contains three nodes of equal sizes by default. Emulating our environment, we do not actually store the data on three physical nodes; instead, we assign the partition IDs artificially. We take advantage of the provided data with already assigned consecutive vertex IDs and use a modulo function to distribute them initially.

Capacity As already mentioned in 6.2.2, the capacity of each node in the cluster is set high enough not to restrict the label adoptions with the upper constraint, so that the effects of the lower constraint can be tracked independently of other effects. The capacities of the nodes are always equal.

Environment The measurements were performed on a machine with an AMD Ryzen 5 1600 Six-Core Processor, 3.8 GHz, 16 GB RAM and 64-bit Windows 10. However, all the computations are measured in the number of inner steps, which means that the computational power of the machine does not influence our results. It only restricts the feasible sizes of our datasets.

36 Not incorporating edges, because all the elements in JanusGraph are stored within vertices – edge data are part of the vertex data, as described in section 3.1.1.


We used JanusGraph version 0.3.0 with TinkerPop version 3.3.3. For the backend storage, we operated both with Cassandra, version 2.1.20, and Oracle Berkeley DB, version 7.4.5 (the backend storage does not influence our results). All these product versions are provided natively with the JanusGraph distribution.

6.4 Results

For each of the examined values, we performed ten measurements in the case of the Twitter dataset and four measurements in the case of the Road network dataset. There are two reasons for this difference: first, the Road map contains about fourteen times more nodes than the Twitter dataset, therefore the computation is much slower; second, thanks to the nature of the Road network data (discussed below), the standard deviation between its results is much lower, meaning that more repetitions do not bring much added value. The complete results with all the graphs and tables are available in Appendix B. More detailed data are also provided on the attached medium of this work. In this section, we mainly discuss the individual effects and depict graphs containing meaningful behavior. The text is structured according to the particular factors introduced in 6.2.1.

6.4.1 Cooling Factor

The effects of the cooling factor have been examined with values from the set {0.95; 0.96; 0.97; 0.98; 0.99; 0.995; 0.999}. First, the development of the cross-node communication improvement with various values of the factor is shown in Figure 6.3. The lower the value of the factor, the faster the cooling process of T_new = T_init × coolingFactor^iteration (because coolingFactor ∈ [0; 1]), and also the higher the probability of getting trapped in a local optimum. On the other hand, when the value of the factor is too high, the cooling process is too slow, resulting in a too frequent acceptance of degrading solutions with little power to produce a satisfying solution. From graph 6.3 it seems that the values 0.98 and 0.99 can generate the highest improvements. However, graph 6.3 shows only one run for each factor value. The cumulative information about the best, the worst and the average result of the different factor settings is depicted in Figure 6.4. The effect of the cooling factor value on the improvement differs with the dataset used. The Twitter dataset is a scale-free graph (as mentioned in 6.1.1), which is more difficult to partition. The higher the value of the factor, the better the solution, but an overly high value already degrades the final result, as described earlier. The road map, on the contrary, is a graph of a simpler topology, which can be partitioned in significantly more suitable ways. Thanks to that characteristic, even a very high cooling factor can find one of many convenient



Figure 6.3: Development of cross-node communication improvement with various cooling factors. The dots represent the maximal value of the process so far. Measured on the Twitter dataset.

distributions, resulting in the improvement evincing a linear dependency on the examined cooling factors. At the same time, the improvement always lies within 71–77%, and the results of one factor value have an average standard deviation of only 0.25%. The average standard deviation for the Twitter dataset is equal to 5.51%, due to its more difficult structure and interconnectivity, which does not allow so many relevant solutions, meaning it is more difficult to find one.

On the other hand, from Figure 6.4 it seems that the dependency of the number of iterations on the cooling factor is independent of the graph structure. The number of iterations always grows exponentially with the value of the cooling factor. Taking into consideration both the dependency of the communication improvement and the number of iterations, in both cases a cooling factor of 0.99 produces very good results regarding cross-node communication, and its number of iterations is still of a reasonable value. That is why this value has been chosen for further experiments.

6.4.2 Imbalance Factor

Figure 6.5 presents the effects of the imbalance factor. The imbalance factor was examined with values from the set {0; 0.01; 0.02; 0.03; 0.04; 0.05; 0.1; 0.15; 0.2; 0.3; 0.4; 0.5}. As expected, the value of 0 results in 1 or 2 iterations


(a) Twitter dataset

(b) Road map dataset

Figure 6.4: Effect of cooling factor on cross-node communication and number of iterations.

during which the very slight imbalance of the initial distribution is evened up and no further migrations are allowed.

In the Road network dataset, the improvement grows with a higher allowed imbalance. This behavior can be expected: the higher the imbalance permitted, the more related data can be placed together on one cluster node, which means the lower the necessary communication. The high improvement even with the very small imbalance factor of 1% could be rather unanticipated; with this value, the improvement of cross-node communication still lies tightly around 75.5%. The reason is most probably the fact that with 1,088,092 vertices, the 1% of allowed underloading on a physical node is equal to 3,627 vertices. That is still quite a generous space, evidently large enough for the necessary migration.

In the Twitter dataset, the behavior in the interval [0.05; 0.5] has the same characteristic as in the Road map graph. However, with small values of the imbalance factor the behavior is unanticipated: slightly better results are obtained with even lower values. We explain this behavior as a potential


(a) Twitter dataset

(b) Road network dataset

(c) Capacity usage in the Road network dataset

Figure 6.5: Effect of imbalance factor on cross-node communication, number of iterations and capacity usage.

combination of three possible factors:

• The higher the imbalance allowed, the higher the migration of vertices to one physical node. Accordingly, with a lower factor, the vertices tend to move more within the nodes, back and forth, creating better local solutions per node. That can be convenient for the given queries, which are defined as very local, with a maximum diameter of length 3.

• From Figure 6.5a we can see that within the interval [0.01; 0.05], the lower the factor, the roughly higher the number of iterations. Commonly, with a higher number of iterations, the results show better improvements because more potential redistributions have been performed.

• The number of executed runs may have been insufficient; the lower values of the factor happened to have several better-performing runs in a row, above their actual average.

Nevertheless, for both datasets it is important to mention that even with a low allowed imbalance of 1–10% we gain an improvement of approximately 70%, with the maximum around 78%. That is a very appealing result. It means that even with a low imbalance, we can obtain results of comparable quality to the Ja-be-Ja algorithm, which claims to reduce the edge-cut by 70–80%. Both results are discussed in more detail in the summary of this chapter (Section 6.5).

The effect of the imbalance factor on the number of iterations does not seem to be significant. In the Twitter dataset, the number of iterations slightly grows with a higher imbalance factor, as it takes a longer time to migrate more vertices to one node. In the Road network dataset, all the iterations are within the range 70–73 for all the examined values of the imbalance factor.

While the other analyzed factors do not tend to influence the capacity usage of the nodes in the cluster, the imbalance factor, of course, has an effect on it. The data always tend to use the allowed space maximally and reorganize themselves mainly onto one cluster node, underloading the other nodes as much as permitted. It seems reasonable to take advantage of the capacity available. On the other hand, the final result can be imbalanced in total, the more so with a higher imbalance factor or a higher number of nodes in the cluster.
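The arithmetic behind the allowed underloading mentioned earlier (1% of a node's balanced share of 1,088,092 vertices split across three nodes ≈ 3,627 vertices) can be reproduced with a small sketch. This is an illustration of our reading of the bound; the function name and the exact formula are assumptions, not the thesis implementation.

```python
def node_bounds(total_vertices, num_nodes, imbalance_factor):
    """Balanced share per node and the allowed underloading slack.

    Illustrative reconstruction: each node may hold up to
    imbalance_factor fewer vertices than a perfectly balanced share.
    """
    balanced = total_vertices / num_nodes
    slack = balanced * imbalance_factor   # vertices a node may be short of
    lower_bound = balanced - slack
    return balanced, slack, lower_bound

# Road network dataset: 1,088,092 vertices on 3 nodes, 1% imbalance
balanced, slack, lower = node_bounds(1_088_092, 3, 0.01)
# balanced share ~362,697 vertices; 1% already permits ~3,627 vertices
# of slack per node, which matches the figure quoted in the text.
```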

6.4.3 Adoption Factor

The adoption factor seems to have dataset-sensitive effects, as visible in Figure 6.6. In the case of the Twitter dataset, the cross-node communication improvement is comparable for both the adoption factor values 0 and 1. The value 1, however, performs better in the number of required iterations


(a) Twitter dataset

(b) Road network dataset

Figure 6.6: Effect of adoption factor on cross-node communication and number of iterations.

because it did not reject so many labels at the beginning of the process as the value 0. On the contrary, in the Road network dataset, the number of iterations for these values is equal, while the improvement is better in the case of the adoption factor set to 1.

The middle value of 0.5 then also behaves differently. While in the Twitter dataset it has the worst improvement with a middle average number of iterations, in the Road network dataset it has the best improvement, but at the cost of a much higher number of iterations. Nevertheless, in both datasets it seems that the best value to apply is 1 (regarding the improvement and the number of iterations combined), but the behavior should always be examined for each dataset separately.
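One label-propagation step of the kind underlying the examined algorithm can be sketched as follows. This is an illustrative reconstruction under an explicit assumption: we treat the adoption factor as the probability that a vertex actually adopts the dominant partition label among its weighted neighbours (0 = never adopt, 1 = always adopt). All names are hypothetical and the sketch omits the capacity constraints of the real algorithm.

```python
import random
from collections import defaultdict

def propagate_labels(labels, weighted_edges, adoption_factor=1.0, seed=7):
    """One synchronous step of weighted label propagation (illustration).

    labels         : dict vertex -> partition label
    weighted_edges : iterable of (u, v, weight) undirected edges
    adoption_factor: assumed probability of adopting the dominant label
    """
    rng = random.Random(seed)
    # Accumulate, per vertex, the total edge weight towards each label.
    score = defaultdict(lambda: defaultdict(float))
    for u, v, w in weighted_edges:
        score[u][labels[v]] += w
        score[v][labels[u]] += w
    new_labels = dict(labels)
    for vertex, per_label in score.items():
        dominant = max(per_label, key=per_label.get)
        if dominant != labels[vertex] and rng.random() < adoption_factor:
            new_labels[vertex] = dominant   # tentative migration target
    return new_labels

edges = [("a", "b", 5.0), ("b", "c", 1.0), ("c", "d", 4.0)]
labels = {"a": 0, "b": 1, "c": 0, "d": 0}
step = propagate_labels(labels, edges, adoption_factor=1.0)
# -> {'a': 1, 'b': 0, 'c': 0, 'd': 0}
```

Note that with a factor of 1 the vertices "a" and "b" swap labels in one step, illustrating the oscillation that a factor below 1 (or a tie-breaking rule) is meant to dampen.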

6.4.4 Number of Nodes in Cluster The effects of the number of nodes in the cluster are shown in Figure 6.7. In both datasets, we can see the growing tendency of the required number of


(a) Twitter dataset

(b) Road network dataset

Figure 6.7: Effect of the number of nodes in the cluster on cross-node communication and number of iterations.

iterations with a higher number of nodes in the cluster, as it takes a longer time to migrate the vertices to suitable nodes.

The achieved improvement with a different number of nodes depends on the topology of the dataset. The Road network is a simpler graph with more balanced degrees than the Twitter dataset, and it has more possible partitionings with a satisfactory improvement. It can generally be expected that the higher the number of nodes, the higher the cross-node communication, so the possible improvement decreases.

The average improvement in the Twitter dataset stays within the range of [70; 77] for all the examined numbers of nodes. The topology of a small number of vertices with very high degree and a substantial number of vertices with very low degree allows such reorganization within the nodes of the cluster.


6.5 Summary

With the enhanced Vaquero et al. algorithm proposed in Section 4.5, it is possible to produce results with a high communication improvement and a reasonable balance in a very satisfactory time. Table 6.3 sums up the most important results of the evaluation. Even with small imbalance values of 1–10%, the cross-node communication can be improved by up to 70–80%. The necessary number of iterations of the algorithm in such a case moves around 80–85 or 72, depending on the dataset.

The Ja-be-Ja algorithm claims to produce the same improvement of 70–80% with four hosts in the cluster. We did our measurements mainly on three hosts, but on our Twitter dataset, four hosts produced even better results than three hosts – the improvement increased by 2.5% (at the cost of 50 extra iterations). On the Road network dataset, the improvement decreased by 4% (still > 70%) and the number of iterations increased by 11 on average.

The Ja-be-Ja result is always balanced, but the necessary number of iterations for the Twitter set containing only 2,731 vertices and 164,629 edges is 350. Moreover, one iteration in Ja-be-Ja is more computationally demanding and carries a much higher communication overhead.

Furthermore, the variance in the improvement of the proposed results with the same setting can be significant; therefore, in order to obtain the best solution possible, more executions of the algorithm should be performed. It is highly probable that this is the case for the Ja-be-Ja algorithm as well.

We can conclude that we have an algorithm creating improvements of cross-node communication comparable to another well-known method, yet with extensively better computational and communication demands. On the other hand, it is necessary to allow an imbalance of nodes in the cluster, preventing the redistribution from providing a hundred percent balanced solution. However, even small and acceptable imbalances produce great commensurate results.


                       Twitter dataset                       Road network dataset
factor value    avg. impr.  max. impr.  avg. iter.    avg. impr.  max. impr.  avg. iter.

cooling factor
  0.98          68.72%      78.18%       57.2         74.7%       74.97%      39.5
  0.99          70.35%      78.77%       78.8         75.82%      75.99%      72.5

imbalance factor
  0.01          74.36%      78.77%       85.4         75.59%      75.77%      71.75
  0.05          69.13%      77.49%       79.4         75.59%      75.66%      72
  0.1           70.35%      78.77%       77.8         75.82%      75.99%      72.5

adoption factor
  0             70.51%      79.79%       86.3         67.06%      67.22%      74.25
  1             70.35%      78.77%       77.8         75.82%      75.99%      72.5

number of nodes in cluster
  3             70.35%      78.77%       78.83        75.82%      75.99%      72.5
  4             72.9%       78.39%      129.3         71.81%      72.01%      83.33

Table 6.3: The most significant results of the evaluation ("impr." = improvement, "iter." = number of iterations). Unless stated differently, the factors were set as: coolingFactor = 0.99, imbalanceFactor = 0.1, adoptionFactor = 1, numberOfNodes = 3.


Chapter 7

Enhancements

The implementation output of this work contains several parts which could be further enhanced or experimented with. In this short chapter, we propose some of these improvements, but the real possibilities are certainly far more abundant.

7.1 Query Data Extraction

The TinkerPop module is a separate piece of code written with the intent to be easily extensible. Consequently, post-processing of a result other than path logging could be implemented, and new formats can be intuitively supplied. Nevertheless, the module could also be further enhanced. One of the ideas is to implement the logging functionality differently, creating a log step that would be added in the code only where necessary. That could be very convenient for cases when users execute actions that are further processed by another layer containing the actual queries.

The current implementation does not handle the situation when the information logged to a file exceeds the maximum size allowed. An implementation resolving this issue should surely be provided.
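One standard way to handle the size limit is log rotation: cap each log file at a maximum size and keep a bounded number of backups. The module itself is Java-based, so the following Python snippet only sketches the idea using the standard library's rotating handler; the file path, logger name and limits are made-up illustrations.

```python
import logging
import os
import tempfile
from logging.handlers import RotatingFileHandler

# Hypothetical query-path log capped at ~1 MB with at most 5 backups,
# so the logged information can never exceed the allowed total size.
log_path = os.path.join(tempfile.gettempdir(), "query-paths.log")
handler = RotatingFileHandler(log_path, maxBytes=1_000_000, backupCount=5)
logger = logging.getLogger("query-path-logger")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# One traversed path per line, as the module's path logging produces.
logger.info("path: v[1] -> e[knows] -> v[2]")
```

An analogous rolling-file appender exists in common Java logging frameworks, so the fix would translate directly to the module's environment.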

7.2 Graph Redistribution Manager

The Graph Redistribution Manager is a proof of concept, as discussed in Chapter 4, which means that a larger number of improvements could be performed. We propose some examples:

• data preprocessing – the GRM reads log files containing individual paths of traversals. The resulting weight on a newly added edge between two vertices then represents the number of occurrences of the real edges between these vertices in the log file (more details can be found in 4.2.2). Therefore, rather than always incrementing the weight by one while


reading the log file, a MapReduce approach could be applied to preprocess the file and subsequently save the cumulated information in the database at once.

• physical redistribution – implement the proposed redistribution of data on physical nodes when that is possible (the current obstruction is dis- cussed in 5.3.8).

• best result selection – the GRM could be extended with the possibility of running several executions while storing each result separately. Later, the best solution could be selected and applied.
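The preprocessing idea from the first bullet can be sketched in a few lines: count all edge occurrences in the log first (the map and reduce phases), and only then write the cumulated weights in one pass. This is an illustrative Python sketch with a made-up log format of one "source target" pair per line, not the GRM's actual format.

```python
from collections import Counter

def aggregate_edge_weights(log_lines):
    """MapReduce-style preprocessing of a path log (illustration).

    Instead of incrementing an edge weight in the database once per
    occurrence, count all occurrences first, then the cumulated
    weights can be written to the database in a single pass.
    """
    # "map": emit one (source, target) pair per traversed edge
    pairs = (tuple(line.split()) for line in log_lines if line.strip())
    # "reduce": sum the occurrences per pair
    return Counter(pairs)

log = ["v1 v2", "v2 v3", "v1 v2", "v1 v2"]
weights = aggregate_edge_weights(log)
# weights[("v1", "v2")] == 3, weights[("v2", "v3")] == 1
```

In a real deployment the counting would be distributed over the log files (e.g. with Hadoop or Spark), but the aggregation logic stays the same.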

7.3 Algorithm

The adjusted Vaquero et al. algorithm could be further modified in several areas:

• upper bound – the current upper bound is based on the capacity of physical nodes. Together with the lower bound of a controlled possi- ble underloading, this can create an imbalance on one node containing the majority of data. Using the entire capacity of one node seems rea- sonable but another upper bound could be experimented with (perhaps combining the node capacity and maximum allowed oversizing).

• physical size of a vertex – the current implementation balances the num- ber of vertices, assuming each vertex has an equal size. The actual physical demands of each vertex could be introduced.

• replicas – the notion of replicas has been omitted in this work. Their introduction would bring new possibilities to the repartitioning task, which would have to be incorporated in the algorithm as well.

• fading – a trend of the database usage can change over time. It means that during the computation of a redistribution algorithm, the most recently utilized connections should be preferred over the older ones. The information about the last update on the edges could be used for such algorithm modification.

• vertex weight – until now, only edges carry the weight property because the task was defined as a minimization of edge-cuts, that is, the optimization of cross-node communication. The problem could also be extended with a simultaneous balancing of physical node access. That would require the introduction of vertex weights, because a user can also access a node asking about only one vertex.
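The fading idea from the list above could, for instance, decay each edge weight exponentially with the age of its last update, so that recently utilized connections dominate the redistribution. The following is an illustrative sketch under assumed parameters (a 30-day half-life); it is not part of the implemented algorithm.

```python
import time

def faded_weight(raw_weight, last_update_ts, now=None, half_life_days=30.0):
    """Exponentially decay an edge weight by the age of its last update.

    Recently used connections keep (almost) their full weight; older
    ones fade out. The half-life is an assumed tuning parameter.
    """
    now = time.time() if now is None else now
    age_days = max(0.0, (now - last_update_ts) / 86_400)
    return raw_weight * 0.5 ** (age_days / half_life_days)

now = 1_700_000_000
fresh = faded_weight(10.0, now, now=now)               # age 0 days
month = faded_weight(10.0, now - 30 * 86_400, now=now) # one half-life
# fresh keeps the full weight 10.0; month-old usage counts as 5.0
```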


7.4 Experiments

Apart from the possible inner modifications of the current implementation, more experiments should be performed. It would be beneficial to run both the adjusted Vaquero et al. algorithm and the Ja-be-Ja method on identical data and with the same environment settings. A comparison of the number of iterations, the time requirements of an iteration and other characteristics would provide more detailed information about the real benefits and limitations of one method over the other.

Further analysis and implementation of more redistribution algorithms could be undertaken, followed by a subsequent comparison with the other already implemented solutions. Last but not least, more datasets of various sizes and topologies should be used to examine and compare the methods.


Conclusion

The main goal of this master's thesis was an analysis of queries executed in distributed graph databases. First, the current practices of data storage in distributed graph databases were researched with a focus on open-source products. Afterwards, we analyzed how to extract the necessary information for further query analysis and implemented this solution as a new module of the general graph computing framework TinkerPop [64]. As a target graph database for the storage of the extracted data, JanusGraph [86] was chosen. A new implementation providing the storage of the data in the database was created. It also offers a redistribution algorithm producing a more efficient solution for the data distribution in the database.

The implementation part of this work has two separate outputs. The new TinkerPop module for logging purposes has been designed so that it can be applied either as a silent server logger set in the server configuration or as an explicitly called logger during query creation. The module is written with the usage of several interfaces, which makes it easily extensible either with new formats of the logged data or with new functions applied to the result of a query. The currently supported operation run on the query result is a path step returning all the paths that have been traversed in order to resolve a query. Anonymization of the data is available as well.

The other implementation part is a proof of concept, provisionally called the Graph Redistribution Manager. It can load and store datasets for evaluation purposes, together with the generation and execution of their defined queries. With the application of the new module, the queries produce log files that serve as an input for the primary functionality of the Graph Redistribution Manager. First, the log files are loaded into the database in a simplified form of newly added weighted edges.
Second, a partition ID is stored for each vertex, simulating the physical node of a cluster on which the vertex is located. Finally, a redistribution algorithm is executed, resulting in a proposal of a more efficient data distribution regarding the communication between physical hosts.


According to the analysis of partitioning algorithms, the algorithm of Vaquero et al. [37] has been selected, although it suffers from several drawbacks. For this reason, significant adjustments were proposed and applied. We incorporated the method of simulated annealing to avoid local optima more successfully. Furthermore, we introduced a lower bound of a maximum allowed imbalance based on the host capacity and the number of vertices. We also adjusted the calculations for weighted graphs.

The performance improvements of our proposed redistribution surpassed our expectations. We gained 70–80% lower communication demands among the physical nodes of a cluster. These results are comparable to another well-known repartitioning algorithm, Ja-be-Ja, which, however, requires a higher number of iterations with substantial communication overhead. On the other hand, it can provide a perfectly balanced solution, while our method requires the introduction of an imbalance. Nevertheless, even small imbalances of 1–10% can produce the stated high-quality results.

The accomplished implementation provides all the mentioned functionalities. Nonetheless, it can be subject to several other enhancements and extensions. Our proposals for such future work were introduced in the last chapter of this work. The main advantages of the current solution are the general applicability of the separate logging TinkerPop module, a good basis for further experiments provided by the Graph Redistribution Manager, and our original enhancements of the Vaquero et al. algorithm that bring excellent results.

Bibliography

[1] DB-Engines.com. DBMS popularity broken down by database model, Popularity changes per category, September 2018 [online]. September 2018, [Accessed 18 September 2018]. Available from: https://db-engines.com/en/ranking_categories

[2] Berg, K. L.; Seymour, T.; et al. History of Databases. International Jour- nal of Management & Information Systems, volume 17, no. 1, 2013: pp. 29–35, doi:10.19030/ijmis.v17i1.7587.

[3] Błażewicz, J.; Kubiak, W.; et al. Handbook on Data Management in Information Systems. International Handbooks on Information Systems, Berlin: Springer, 2003, ISBN 9783540438939, [Accessed 23 November 2018]. Available from: https://books.google.cz/books?id=AvLziHKyuLcC

[4] Bachman, C. W. The programmer as navigator. Communications of the ACM, volume 16, no. 11, November 1973: pp. 653–658, doi:10.1145/ 355611.362534.

[5] Codd, E. F. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, volume 13, no. 6, June 1970: pp. 377–387, doi:10.1145/362384.362685.

[6] DB-Engines.com. DB-Engines Ranking [online]. November 2018, [Ac- cessed 26 November 2018]. Available from: https://db-engines.com/ en/ranking

[7] Strauch, C. NoSQL Databases. [Lecture Selected Topics on Software-Technology Ultra-Large Scale Sites, Manuscript], Stuttgart Media University, 2011, [Accessed 26 November 2018]. Available from: http://www.christof-strauch.de/nosqldbs.pdf


[8] World Wide Web Consortium. Resource Description Framework (RDF) [online]. [Accessed 07 October 2018]. Available from: https://www.w3.org/RDF/

[9] Barrasa, J. RDF Triple Stores vs. Labeled Property Graphs: What's the Difference? [online]. August 2017, [Accessed 27 November 2018]. Available from: https://neo4j.com/blog/rdf-triple-store-vs-labeled-property-graph-difference/

[10] Kolář, J. Orientované grafy, reprezentace grafů. Lekce 3. [University Lecture], Faculty of Information Technology, Czech Technical University in Prague, Prague, Summer Semester 2016. Available from: https://edux.fit.cvut.cz/courses/BI-GRA/_media/lectures/gra-predn-03cb.pdf

[11] World Wide Web Consortium. SPARQL 1.1 Query Language [online]. March 2013, [Accessed 28 November 2018]. Available from: https://www.w3.org/TR/sparql11-query/

[12] Neo4j, Inc. Intro to Cypher [online]. [Accessed 28 November 2018]. Available from: https://neo4j.com/developer/cypher-query-language/

[13] The Apache Software Foundation. The Gremlin Graph Traversal Machine and Language [online]. [Accessed 28 November 2018]. Available from: https://tinkerpop.apache.org/gremlin.html

[14] Bichot, C.-E.; Siarry, P. Graph Partitioning. Hoboken: John Wiley & Sons, Incorporated, first edition, 2013, ISBN 9781118601259, [Accessed 21 November 2018]. Available from: https://ebookcentral.proquest.com

[15] Andreev, K.; Räcke, H. Balanced graph partitioning. In Proceedings of the sixteenth annual ACM symposium on Parallelism in algorithms and architectures, Barcelona, Spain, 2004, pp. 120–124, doi:10.1145/1007912.1007931.

[16] Buluç, A.; Meyerhenke, H.; et al. Recent Advances in Graph Partitioning. In Algorithm Engineering. Lecture Notes in Computer Science, volume 9220, Cham: Springer International Publishing, 2016, pp. 117–158, doi:10.1007/978-3-319-49487-6_4.

[17] Schaeffer, S. E. Graph Clustering. Computer Science Review, volume 1, no. 1, August 2007: pp. 27–64, doi:10.1016/j.cosrev.2007.05.001.
[18] Fjällström, P.-O. Algorithms for Graph Partitioning: A Survey. Linköping Electronic Articles in Computer and Information Science, volume 3, no. 10, 1998: pp. 1–34, ISSN 1401-9841, [Accessed 18 November 2018]. Available from: http://www.ep.liu.se/ea/cis/1998/010/cis98010.pdf


[19] Ferreira, C. E.; Martin, A.; et al. The node capacitated graph partitioning problem: A computational study. Mathematical Programming, volume 81, no. 2, January 1998: pp. 229–256, doi:10.1007/BF01581107.

[20] Stacho, L.; Vrt’o, I. Bisection Width of Transposition Graphs. Discrete Applied Mathematics, volume 84, no. 1–3, May 1998: pp. 221–235, doi: 10.1016/S0166-218X(98)00009-2.

[21] Land, A. H.; Doig, A. G. An Automatic Method of Solving Discrete Programming Problems. Econometrica, volume 28, no. 3, July 1960: pp. 497–520, doi:10.2307/1910129.

[22] Fiedler, M. Algebraic Connectivity of Graphs. Czechoslovak Mathematical Journal, volume 23, no. 2, January 1973: pp. 298–305, [Accessed 18 November 2018]. Available from: https://dml.cz/bitstream/handle/10338.dmlcz/101168/CzechMathJ_23-1973-2_11.pdf?sequence=2

[23] Weisstein, E. W. Laplacian Matrix. From MathWorld – A Wolfram Web Resource [online]. [Accessed 18 November 2018]. Available from: http://mathworld.wolfram.com/LaplacianMatrix.html

[24] Karypis, G.; Kumar, V. A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs. SIAM Journal on Scientific Computing, volume 20, no. 1, 1998: pp. 359–392, doi:10.1.1.39.3415.

[25] Cormen, T. H.; Leiserson, C. E.; et al. Introduction to Algorithms. Cambridge, Massachusetts: MIT Press and McGraw-Hill, third edition, 2009, ISBN 978-0-262-53305-8, [Accessed 24 September 2018]. Available from: https://labs.xjtudlc.com/labs/wldmt/reading%20list/books/Algorithms%20and%20optimization/Introduction%20to%20Algorithms.pdf

[26] Sanders, P.; Schulz, C. Engineering Multilevel Graph Partitioning Algo- rithms. In 19th European Symposium on Algorithms, Berlin: Springer, September 2011, pp. 469–480, doi:10.1007/978-3-642-23719-5 40.

[27] Simon, H. D. Partitioning of unstructured problems for parallel process- ing. Computing Systems in Engineering, volume 2, no. 2–3, 1991: pp. 135–148, doi:10.1016/0956-0521(91)90014-V.

[28] Rahimian, F.; Payberah, A. H.; et al. A Distributed Algorithm for Large- Scale Graph Partitioning. ACM Transactions on Autonomous and Adap- tive Systems, volume 10, no. 2, June 2015: pp. 1–24, doi:10.1145/2714568.

[29] Firth, H.; Missier, P. TAPER: query-aware, partition-enhancement for large, heterogenous, graphs. Distributed and Parallel Databases, vol- ume 35, no. 12, June 2017: pp. 85–115, doi:10.1007/s10619-017-7196-y.


[30] Bondy, J. A.; Murty, U. Graph Theory With Applications. North-Holland, fifth edition, 1982, ISBN 0-444-19451-7, 12 pp., [Accessed 16 September 2018]. Available from: https://web.archive.org/web/20081004204910/http://www.ecp6.jussieu.fr/pageperso/bondy/books/gtwa/pdf/GTWA.pdf

[31] Karypis, G. Family of Graph and Hypergraph Partitioning Software [online]. [Accessed 18 November 2018]. Available from: http://glaros.dtc.umn.edu/gkhome/views/metis

[32] Sorensen, K.; Glover, F. Metaheuristics. In Encyclopedia of Operations Research and Management Science, edited by S. I. Gass; M. C. Fu, New York: Springer, third edition, 2013, pp. 960–970, [Accessed 20 November 2018]. Available from: https://www.researchgate.net/publication/237009172_Metaheuristics

[33] Tashkova, K.; Korošec, P.; et al. A distributed multilevel ant-colony algorithm for the multi-way graph partitioning. International Journal of Bio-Inspired Computation, volume 3, no. 5, January 2011: pp. 286–296, doi:10.1504/IJBIC.2011.042257.

[34] Schmidt, J. 8. Simulované ochlazování; Simulated Annealing, SA. [University Lecture], Faculty of Information Technology, Czech Technical University in Prague, Prague, Winter Semester 2017. Available from: https://edux.fit.cvut.cz/courses/MI-PAA/_media/lectures/08/paang8ochl.pdf

[35] Malewicz, G.; Austern, M. H.; et al. Pregel: A System for Large-Scale Graph Processing. In Proceedings of the 2010 ACM SIGMOD Interna- tional Conference on Management of data, Indianapolis, Indiana, USA, 2010, pp. 135–146, doi:10.1145/1807167.1807184.

[36] Gehweiler, J.; Meyerhenke, H. A distributed diffusive heuristic for clus- tering a virtual P2P supercomputer. In 2010 IEEE International Sympo- sium on Parallel and Distributed Processing, Workshops and Phd Fo- rum (IPDPSW), Atlanta, Georgia, USA, 2010, pp. 1–8, doi:10.1109/ IPDPSW.2010.5470922.

[37] Vaquero, L. M.; Cuadrado, F.; et al. Adaptive Partitioning for Large- Scale Dynamic Graphs. In 2014 IEEE 34th International Conference on Distributed Computing Systems, Madrid, Spain, 2014, pp. 144–153, doi: 10.1109/ICDCS.2014.23.

[38] Raghavan, U. N.; Albert, R.; et al. Near linear time algorithm to de- tect community structures in large-scale networks. Physical Review E, volume 76, no. 3, October 2007, doi:10.1103/PhysRevE.76.036106.


[39] Ugander, J.; Backstrom, L. Balanced label propagation for partitioning massive graphs. In Proceedings of the sixth ACM international conference on Web search and data mining, Rome, Italy, 2013, pp. 507–516, doi: 10.1145/2433396.2433461.

[40] DataStax, Inc. DataStax Acquires Aurelius, The Experts Behind TitanDB [online]. [Accessed 18 September 2018]. Available from: https://www.datastax.com/2015/02/datastax-acquires-aurelius-the-experts-behind-titandb

[41] DB-Engines.com. DB-Engines Ranking of Graph DBMS [online]. September 2018, [Accessed 18 September 2018]. Available from: https://db-engines.com/en/ranking/graph+dbms

[42] JanusGraph.org. Part III. Storage Backends [online]. [Accessed 22 September 2018]. Available from: https://docs.janusgraph.org/latest/storage-backends.html

[43] The Apache Software Foundation. Apache Cassandra [online]. [Accessed 22 September 2018]. Available from: http://cassandra.apache.org/

[44] Cockcroft, A.; Sheahan, D. Benchmarking Cassandra Scalability on AWS – Over a million writes per second [online]. November 2011, [Accessed 22 September 2018]. Available from: https://medium.com/netflix-techblog/benchmarking-cassandra-scalability-on-aws-over-a-million-writes-per-second-39f45f066c9e

[45] The Apache Software Foundation. Apache HBase [online]. [Accessed 22 September 2018]. Available from: http://hbase.apache.org/

[46] Google, Inc. Cloud Bigtable [online]. [Accessed 23 September 2018]. Avail- able from: https://cloud.google.com/bigtable/

[47] Oracle Corporation. Oracle Berkeley DB Java Edition [online]. [Accessed 23 September 2018]. Available from: https://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html

[48] JanusGraph.org. Chapter 18. Apache Cassandra [online]. [Accessed 23 September 2018]. Available from: https://docs.janusgraph.org/latest/cassandra.html

[49] JanusGraph.org. Chapter 40. JanusGraph Data Model [online]. [Accessed 26 September 2018]. Available from: https://docs.janusgraph.org/latest/data-model.html

[50] OrientDB Ltd. OrientDB Manual [online]. [Accessed 28 September 2018]. Available from: https://orientdb.com/docs/last/


[51] OrientDB Ltd. Graph or Document API? [online]. [Accessed 05 October 2018]. Available from: https://orientdb.com/docs/2.2.x/Choosing-between-Graph-or-Document-API.html

[52] OrientDB Ltd. Basic Concepts [online]. [Accessed 05 October 2018]. Available from: http://orientdb.com/docs/2.0/orientdb.wiki/Concepts.html

[53] OrientDB Ltd. Distributed Architecture [online]. [Accessed 05 October 2018]. Available from: https://orientdb.com/docs/3.0.x/distributed/Distributed-Architecture.html

[54] ArangoDB.com. TinkerPop 3 [online]. GitHub repository, [Accessed 06 October 2018]. Available from: https://github.com/ArangoDB-Community/arangodb-tinkerpop-provider/issues/15

[55] ArangoDB.com. ArangoDB v3.3.16 Documentation [online]. [Accessed 06 October 2018]. Available from: https://docs.arangodb.com/3.3/Manual/

[56] ArangoDB.com. ArangoDB [online]. GitHub repository, [Accessed 06 Oc- tober 2018]. Available from: https://github.com/arangodb/arangodb

[57] Facebook, Inc. RocksDB [online]. [Accessed 06 October 2018]. Available from: https://rocksdb.org/

[58] Dgraph Labs, Inc. BadgerDB [online]. GitHub repository, [Accessed 07 October 2018]. Available from: https://github.com/dgraph-io/badger

[59] Dgraph Labs, Inc. Dgraph [online]. GitHub repository, [Accessed 07 October 2018]. Available from: https://github.com/dgraph-io/dgraph

[60] OpenLink Software. OpenLink Virtuoso Universal Server Documentation [online]. [Accessed 07 October 2018]. Available from: http://docs.openlinksw.com/virtuoso/

[61] Microsoft Corporation. Azure Cosmos DB [online]. [Accessed 19 October 2018]. Available from: https://azure.microsoft.com/cs-cz/services/cosmos-db/

[62] Stardog Union. Stardog Release Notes [online]. [Accessed 20 October 2018]. Available from: https://www.stardog.com/docs/release-notes/

[63] TigerGraph.com. August 23 Meetup in London [online]. [Accessed 20 October 2018]. Available from: https://www.tigergraph.com/2018/08/14/august-23-meetup-in-london/


[64] The Apache Software Foundation. Apache TinkerPop [online]. May 2018, [Accessed 25 August 2018]. Available from: http://tinkerpop.apache.org/

[65] The Apache Software Foundation. Apache TinkerPop [online]. GitHub repository, [Accessed 26 August 2018]. Available from: https://github.com/apache/tinkerpop

[66] The Apache Software Foundation. Apache License [online]. [Accessed 25 August 2018]. Available from: http://tinkerpop.apache.org/providers.html

[67] The Apache Software Foundation. Apache License [online]. January 2014, [Accessed 25 August 2018]. Available from: http://www.apache.org/licenses/LICENSE-2.0

[68] The Apache Software Foundation. Getting Started [online]. [Accessed 28 November 2018]. Available from: http://tinkerpop.apache.org/docs/3.3.4/tutorials/getting-started/

[69] The Apache Software Foundation. Apache TinkerPop [online]. [Accessed 26 August 2018]. Available from: http://tinkerpop.apache.org/docs/3.3.3/reference/

[70] JanusGraph.org. Chapter 6. Gremlin Query Language [online]. [Accessed 28 November 2018]. Available from: https://docs.janusgraph.org/latest/gremlin.html

[71] The Apache Software Foundation. Apache TinkerPop 3.3.3 API [online]. [Accessed 29 November 2018]. Available from: http://tinkerpop.apache.org/javadocs/3.3.3/full/

[72] Norvig, P. Teach Yourself Programming in Ten Years. Answers [online]. [Accessed 03 December 2018]. Available from: http://norvig.com/21-days.html#answers

[73] Majkowski, M. How to achieve low latency with 10Gbps Ethernet [online]. June 2015, [Accessed 03 December 2018]. Available from: https://blog.cloudflare.com/how-to-achieve-low-latency/?fbclid=IwAR0jI6fqarXwpwOe0DYxOG65eEoVQCQXsNhnw_nzzjVjeV1AfEJEuEWSGa4

[74] OrientDB Ltd. Issues – orientechnologies/orientdb [online]. GitHub repository, [Accessed 07 December 2018]. Available from: https://github.com/orientechnologies/orientdb/issues/


[75] svitaluc. svitaluc/tinkerpop: Mirror of Apache TinkerPop [online]. GitHub repository, [Accessed 08 December 2018]. Available from: https://github.com/svitaluc/tinkerpop

[76] JanusGraph. Releases – JanusGraph/janusgraph [online]. GitHub repository, October 2018, [Accessed 11 December 2018]. Available from: https://github.com/JanusGraph/janusgraph/releases

[77] The Apache Software Foundation. The Rat Maven Plugin [online]. [Accessed 24 December 2018]. Available from: http://creadur.apache.org/rat/apache-rat-plugin/

[78] Google, Inc. Google/GSON: A Java serialization/deserialization library to convert Java Objects into JSON and back [online]. GitHub repository, [Accessed 12 December 2018]. Available from: https://github.com/google/gson

[79] svitaluc. svitaluc/grm: Graph Redistribution Manager [online]. GitHub repository, [Accessed 20 December 2018]. Available from: https://github.com/svitaluc/grm

[80] McAuley, J.; Leskovec, J. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, 2012, [Accessed 1 October 2018]. Available from: https://snap.stanford.edu/data/ego-Twitter.html

[81] Leskovec, J.; Lang, K.; et al. Community Structure in Large Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters. Internet Mathematics, volume 6, no. 1, 2009: pp. 29–123, [Accessed 1 October 2018]. Available from: https://snap.stanford.edu/data/roadNet-PA.html

[82] DataStax, Inc. Partition key [online]. [Accessed 21 December 2018]. Available from: https://docs.datastax.com/en/glossary/doc/glossary/gloss_partition_key.html?fbclid=IwAR2rmU7QPwX3F2NzA6hqQqulSwLeUErd2fTrfYpRUfCSshx4Hpc4XvEw2RU

[83] JanusGraph.org. Chapter 35. Graph Partitioning [online]. [Accessed 28 November 2018]. Available from: https://docs.janusgraph.org/latest/graph-partitioning.html

[84] Barabasi, A.-L.; Bonabeau, E. Scale-Free Networks. Scientific American, volume 288, no. 5, May 2003: pp. 50–59, doi:10.1038/scientificamerican0503-60.

[85] Patrikalakis, A. Finding ways to make a shortest path query run faster [online]. April 2016, [Accessed 21 December 2018]. Available from: https://groups.google.com/forum/?fbclid=IwAR15q5tw9y_BvsSZigLueT5KHY5Y07aiTmUTnX-ons0hsXzkvoa7b7ZyBIs#!topic/gremlin-users/rhIhEY9R4E0

[86] JanusGraph.org. JanusGraph [online]. [Accessed 24 December 2018]. Available from: http://janusgraph.org/


Appendix A

README

A.1 Processed Result Logging

The following README text can be found on the GitHub web page [75], in the branch processed-result-log, package gremlin-processedResultLogging.

Processed Result Logging

Processed Result Logging makes it possible to silently perform an additional operation on the result of a query and to log the outcome of this operation in a predefined format.

In order to successfully log an outcome of an additional operation, the op- eration must be supported for processed result logging. A List of supported processors is part of this readme, but it is not difficult to extend the specific interface with another processor (for further information see section For De- velopers).

It is possible to use a provided formatter or to create a custom implementation of a formatter in order to log various types of information. A list of provided formatters is part of this readme as well.

For business purposes, a processor can provide an anonymized log of the processed result to prevent sensitive data from being exposed. The List of supported processors indicates whether or not each processor supports anonymized logging. Some formats may not yet be compatible with anonymized logging.


List of supported processors

Currently supported result processors, their full class names and whether they support anonymized logging:

• PathProcessor

– org.apache.tinkerpop.processedResultLogging.processor.PathProcessor – supports anonymized logging

For detailed information see section Processors.

List of provided formatters

Currently supported formatters for processed result logging, their full names and whether they are compatible with anonymized logging:

• basic

– org.apache.tinkerpop.processedResultLogging.formatter.BasicProcessedResultFormatter – is compatible with anonymized logging

• query

– org.apache.tinkerpop.processedResultLogging.formatter.QueryProcessedResultFormatter – is not yet compatible with anonymized logging

For a description of the formatters see section Formatters.

Settings

In order to set up processed result logging, the corresponding options must be set in the configuration.

When used in the Gremlin Server environment, the configuration is loaded from the file conf/gremlin-server.yaml. The key to the set of this module's properties is processedResultLog.


• enabled (default: false) – Enables Processed Result Logging in the Gremlin Server environment. In order to start Processed Result Logging, set this property to 'true'.

• asyncMode (default: false) – Determines whether the logging should be done in a separate thread or in the current thread.

• processor (default: org.apache.tinkerpop.processedResultLogging.processor.PathProcessor) – An optional property determining a processor to be run on a result. The argument of the processor property is the fully qualified classname of a class providing the processed result logging, i.e. a class implementing the ResultProcessor interface.

• formatter (default: org.apache.tinkerpop.processedResultLogging.formatter.BasicProcessedResultFormatter) – An optional property setting a log format. The argument of the formatter property is the fully qualified classname of a class providing the formatter, i.e. a class implementing the ProcessedResultFormatter interface. See the [List of provided formats] and [Formats] sections for further information.

• anonymized (default: false) – An optional property setting an anonymized log. An anonymized log should not contain any sensitive information about data stored in a graph.

• localMode (default: true) – If set to false, the environment is assumed to be Gremlin Server and the logging will be handled through its slf4j logging backend. If true, the logging is done through the SimpleLogger class implemented in this package, and it is necessary to use ProcessedResultLoggingStrategy with the Traversal from which the results are supposed to be logged.

The configuration is stored in the class ProcessedResultManager.Settings which can be injected to the ProcessedResultManager singleton instance.

Example

Settings for Processed Result Logging with the PathProcessor and the QueryProcessedResultFormatter can, for example, look as follows:


processedResultLog: {
  enabled: true
  processor: org.apache.tinkerpop.processedResultLogging.processor.PathProcessor
  formatter: org.apache.tinkerpop.processedResultLogging.formatter.QueryProcessedResultFormatter
}

The processor property does not have to be specified in this case, as the PathProcessor is the default value. The anonymized property keeps its default value, false.

Processors

PathProcessor

If a query result is of the GraphTraversal type, the path method is silently performed on the result and its outcome is logged. For a description of the path-step see the Documentation.

Example of a basic path output: v[1], e[7][1-knows->2], v[2], vadas

anonymized

An anonymized version of the PathProcessor produces output without the result value and edge labels.

Example of a basic path output in the anonymized mode: v:1,e:7,v:2
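The anonymization step can be sketched as a small stand-alone method (a hypothetical illustration derived from the examples above, not the actual PathProcessor implementation; the element syntax v[id]/e[id] is assumed from the sample output):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathAnonymizer {
    // Matches path elements such as v[1] or the leading e[7] of
    // e[7][1-knows->2]; labels and result values (e.g. "vadas") are dropped.
    private static final Pattern ELEMENT = Pattern.compile("([ve])\\[(\\d+)\\]");

    /** Reduces a full path line to type:id pairs, e.g. "v:1,e:7,v:2". */
    public static String anonymize(String path) {
        Matcher m = ELEMENT.matcher(path);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            if (sb.length() > 0) sb.append(',');
            sb.append(m.group(1)).append(':').append(m.group(2));
        }
        return sb.toString();
    }
}
```

Applied to the basic output above, this sketch yields exactly the anonymized form v:1,e:7,v:2.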

Formatters

BasicProcessedResultFormatter

The basic formatter logs only the processed result, without any other piece of information.

Example of basic format output:

v[1], e[7][1-knows->2], v[2], vadas
v[1], e[8][1-knows->4], v[4], josh

QueryProcessedResultFormatter

The query formatter first logs the query and then its processed results. In order to distinguish the two types of information, the query is preceded by a #QUERY: line and the set of processed results by a #PR: line. This format is not yet compatible with anonymized logging.

Example of basic format output:

#QUERY:
g.V(1).outE('knows').inV().values('name')
#PR:
v[1], e[7][1-knows->2], v[2], vadas
v[1], e[8][1-knows->4], v[4], josh

LLOJsonFormatter

The LLO JSON formatter logs the query and its processed results in a JSON format. This formatter is implemented to work with the PathProcessor, which returns an LLOProcessedResult. The serialization to JSON is implemented in LLOProcessedResult.Serializer and works with the Gson library. This format is not compatible with anonymized logging.

Example of LLO JSON format output:

{
  Q: "g.V(1).outE('knows').inV().values('name')",
  R: [
    ["v":1, "e":7, "v":2, "person":"vadas"],
    ["v":1, "e":8, "v":4, "person":"josh"]
  ]
}

For Developers

New extensions of Processed Result Logging are welcome, but please follow the rules below.

processor

If you want to support a new method, create a new processor class which implements the interface ResultProcessor and place it within the processor package.

If you want to create an anonymized version of an output as well, your class should implement the interface AnonymizedResultProcessor, which already extends the ResultProcessor interface. Please make sure that your anonymized version of the output truly cannot contain any sensitive information.

formatter

If you want to create a new formatter, create a new formatter class which implements the interface ProcessedResultFormatter and place it within the formatter package.

A formatter has two parameters: a String and a ProcessedResult. The String represents the query in its original form. If you want to implement an anonymized log, make sure that your result does not contain the query, or that any sensitive information has been removed from it.


A.2 Graph Redistribution Manager

The following README text can be found on the GitHub web page [79].

Graph Redistribution Manager

Graph Redistribution Manager (GRM) is a proof-of-concept solution for data redistribution in distributed graph database systems. The redistribution process starts with loading a dataset into the JanusGraph database. Then, a set of generated queries is run against the database, generating a processed result log via the custom Processed Result Logging module of the TinkerPop project. The paths from the log are added to the database. Finally, a partitioning algorithm is executed and a new partition is proposed for each vertex.

VaqueroVertexProgram

VaqueroVertexProgram is the Vaquero et al. [37] algorithm implemented in the TinkerPop VertexProgram framework, which is based on the Pregel computation system. The VaqueroVertexProgram therefore works in iterations. After the initial setup, the execute method is run for each vertex of the graph. The iterations are run in parallel, which also means that after each iteration there is a synchronization barrier, at which the terminate method is called. The terminate method determines whether the algorithm should continue into the next iteration or whether the program should halt. After the end of the execution, each vertex is assigned a new partition ID, which can be accessed via the vertex.value(PARTITION) method. PARTITION is a constant holding the key of the property with the new partition ID.

Iteration process

In each iteration, vertices communicate with each other via messages. Every vertex sends its partition ID (PID) to all of its neighbors. A new partition ID of a vertex is determined by the frequency of the PIDs of its neighbors: the more frequently a PID is represented among the neighbors, the higher the chance that it will be acquired. However, there are further conditions which determine whether the new PID will be acquired. One of them is the current balance of the partitions. The imbalanceFactor sets the lower bound of how much the partitions can be imbalanced with respect to a proportional vertex distribution. The upper bound of each partition is determined by its maximum capacity. Another condition addresses the oscillation problem of two or more vertices changing their PIDs back and forth.
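The neighbor-voting step described above can be sketched as follows (a simplified, hypothetical helper; the real VaqueroVertexProgram additionally applies the acquisition probability, the capacity bounds and the oscillation guard):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PidVote {
    /**
     * Returns the most frequent partition ID among the neighbor messages,
     * falling back to the current PID when there are no messages.
     * Ties are resolved in favor of the currently held PID, which helps
     * to dampen back-and-forth oscillation between equally popular PIDs.
     */
    public static int mostFrequentPid(int currentPid, List<Integer> neighborPids) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (int pid : neighborPids) {
            counts.merge(pid, 1, Integer::sum);
        }
        int best = currentPid;
        int bestCount = counts.getOrDefault(currentPid, 0);
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```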

Datasets

The current implementation supports two datasets for testing the Vaquero et al. algorithm.

• Twitter dataset - [80]

• Pennsylvania road network dataset - [81]

Loading and query testing is implemented via two classes: PennsylvaniaDatasetLoaderQueryRunner and TwitterDatasetLoaderQueryRunner.

Configuration and usage

The VaqueroVertexProgram can be built with the VaqueroVertexProgram.Builder, which enables configuration of the algorithm.

vertexProgram = VaqueroVertexProgram.build()
        .clusterMapper(cm)
        .acquirePartitionProbability(0.5)
        .imbalanceFactor(0.80)
        .coolingFactor(0.99)
        .adoptionFactor(1)
        .evaluatingMap(runner.evaluatingMap())
        .evaluateCrossCommunication(true)
        .scopeIncidentTraversal(__.outE())
        .evaluatingStatsOriginal(runner.evaluatingStats())
        .maxPartitionChangeRatio(1)
        .maxIterations(200)
        .create(graph);
algorithmResult = graph.compute().program(vertexProgram).workers(12)
        .submit().get();

In the code snippet above, an instance of the VaqueroVertexProgram is created with the VaqueroVertexProgram.Builder, which is then submitted to the GraphComputer.


Configuration

• clusterMapper - sets the provided ClusterMapper to determine the original partition ID from a vertex ID.

• acquirePartitionProbability - sets the probability for vertices to acquire a new partition.

• imbalanceFactor - sets the imbalance factor which determines how much the partitions can be imbalanced regarding the lower bound.

• coolingFactor - simulated annealing is used when executing the program. The cooling factor determines how quickly the temperature drops, affecting the probability with which vertices acquire a new partition.

• adoptionFactor - determines how to calculate the probability of ac- quiring a new partition.

• evaluatingMap - provides a map of vertices to the number of their appearances in the query set. It is used for evaluating the improvement of the partitioning between iterations.

• evaluateCrossCommunication - sets whether or not to enable cross-node communication evaluation between iterations.

• scopeIncidentTraversal - sets the incident vertices to which to send a message (.outE() means that the messages will be sent to incident vertices using only out-going edges).

• evaluatingStatsOriginal - sets the initial statistics of cross-node communication.

• maxPartitionChangeRatio - sets the limit of how many vertices can change partition during one iteration (input is a fraction of the vertex count).

• maxIterations - sets the maximum number of iterations before the program halts.
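The role of the cooling factor can be illustrated with a minimal sketch (assumed semantics for illustration only; the thesis implementation may combine the factors differently):

```java
public class Cooling {
    /**
     * Probability of acquiring a new partition in a given iteration:
     * the base probability decays geometrically with the cooling factor,
     * so later iterations change fewer and fewer vertices.
     */
    public static double acquireProbability(double baseProbability,
                                            double coolingFactor,
                                            int iteration) {
        return baseProbability * Math.pow(coolingFactor, iteration);
    }
}
```

A cooling factor closer to 1 makes the probability decay more slowly, which is consistent with the measurements in Appendix B, where higher cooling factors require more iterations before the program halts.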

Another important configuration is selecting the proper graph configuration file. In config.properties you can currently state either graph-berkeley-embeded.properties or graph-cassandra-embedded.properties for the graph.propFile property. This selects the Berkeley DB or the embedded Apache Cassandra storage backend, respectively.


Project structure

logHandling

• PRLogFileLoader - loads a log file into memory or provides an Iterator over the log file.

• PRLog - represents a log file loaded in memory.

• PRLogRecord - represents all paths generated by one Traverser , that is, by one query.

• PRPath - a list of PRElements representing one path of the Traverser.

• PRElement - represents an edge or a vertex of the graph.

• PRLogToGraphLoader - this interface describes the operations necessary for loading a log into the graph: addSchema, removeSchema and loadLogToGraph.

• DefaultPRLogToGraphLoader - the default implementation of PRLogToGraphLoader, loading the PRLog into a given graph.

dataset

• DatasetQueryRunner - defines an interface to run a set of queries against the database.

• DatasetLoader - defines an interface to load a dataset into the graph database.

• PennsylvaniaDatasetLoaderQueryRunner - implements both the interfaces above. Handles the Pennsylvania road network dataset.

• TwitterDatasetLoaderQueryRunner - implements both the interfaces above. Handles the Twitter network dataset.

cluster

• PartitionMapper - this interface defines the method map(vertexID) to get the partition ID from vertexID .

• DefaultPartitionMapper - a default implementation of the PartitionMapper, which uses modulo division of the hash of vertexID (first converted to originalID via IDManager.fromVertexId) to get the partition ID.
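A minimal sketch of such a modulo-based mapper (a hypothetical simplification; the real class additionally converts JanusGraph vertex IDs via IDManager.fromVertexId):

```java
public class ModuloPartitionMapper {
    private final int partitionCount;

    public ModuloPartitionMapper(int partitionCount) {
        this.partitionCount = partitionCount;
    }

    /** Maps an already-converted original vertex ID to a partition ID. */
    public int map(long originalId) {
        // floorMod keeps the result non-negative even for negative hash values
        return Math.floorMod(Long.hashCode(originalId), partitionCount);
    }
}
```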


helpers

• HelperOperator - a set of binary operators to use when reducing data in the VaqueroVertexProgram .

• ShuffleComparator - a random comparator which is used for traversal ordering when implementing a random walk on the graph.

GRMP

An executable class that runs the complete "benchmark" of the Pennsylvania road network dataset.

GRMT

An executable class that runs the complete "benchmark" of the Twitter network dataset.

GRM

The base class of GRMP and GRMT, containing shared components and resources.


Appendix B

Evaluation Results

B.1 Twitter Dataset

B.1.1 Cooling Process

[Plot: development of cross-node communication improvement (%) over up to 200 iterations, for cooling factors 0.95 to 0.999.]

Figure B.1: Development of cross-node communication improvement with various cooling factors. The dots represent the maximal value of the process so far. Measured on the Twitter dataset.


B.1.2 Effects of Cooling Factor

[Plots: effect of cooling factor on cross-node communication improvement (%), on the number of iterations (min/avg/max), and on the capacity usage (%) of cluster nodes 1 to 3.]

Figure B.2: Effects of cooling factor on improvement, number of iterations and capacity usage, respectively. Measured on the Twitter dataset.


cooling factor   min. improvement   max. improvement   average   std. deviation
0.95             56.84%             77.13%             63.03%    5.94%
0.96             56.19%             66.83%             59.22%    3.01%
0.97             56.97%             73.50%             63.76%    5.51%
0.98             56.12%             78.18%             68.72%    7.53%
0.99             62.82%             78.77%             70.35%    5.17%
0.995            64.93%             79.47%             70.12%    3.81%
0.999            30.34%             42.88%             35.80%    6.33%

Table B.1: Effect of cooling factor on cross-node communication. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with imbalance factor = 0.1, adoption factor = 1, number of clusters = 3. The results are based on ten individual measurements for each factor value, on the Twitter dataset.
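As a purely illustrative instance of the improvement formula (the crossing counts below are invented, not measured values):

```latex
\text{improvement} = 1 - \frac{\#\,\text{after-crossings}}{\#\,\text{before-crossings}},
\qquad
1 - \frac{2\,300}{10\,000} = 0.77 = 77\,\%.
```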

cooling factor   min. # of iterations   max. # of iterations   average   std. deviation
0.95             25                     68                     35.2      12.6
0.96             28                     39                     33.7      3.5
0.97             30                     59                     37.9      8.6
0.98             40                     80                     57.2      14.2
0.99             75                     84                     78.8      3.2
0.995            142                    153                    146.6     3.5
0.999            200                    200                    200.0     0.0

Table B.2: Effect of cooling factor on number of iterations. Measured with imbalance factor = 0.1, adoption factor = 1, number of clusters = 3. The results are based on ten individual measurements for each factor value, on the Twitter dataset.

cooling factor   N1       N2       N3       (N1-avg)/avg   (N2-avg)/avg   (N3-avg)/avg
0.95             30,506   22,881   22,881   20.00%         -10.00%        -10.00%
0.96             30,506   22,881   22,881   20.00%         -10.00%        -10.00%
0.97             30,506   22,881   22,881   20.00%         -10.00%        -10.00%
0.98             30,506   22,881   22,881   20.00%         -10.00%        -10.00%
0.99             30,506   22,881   22,881   20.00%         -10.00%        -10.00%
0.995            30,506   22,881   22,881   20.00%         -10.00%        -10.00%
0.999            30,253   23,054   22,961   19.00%         -9.68%         -9.32%

Table B.3: Effect of cooling factor on capacity usage. The average number of vertices per node, |V|/|N|, is equal to approximately 25,423. The results are based on ten individual measurements for each factor value, on the Twitter dataset.
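The percentage columns follow directly from the vertex counts; for the first row of the table:

```latex
\frac{N_1 - \overline{N}}{\overline{N}}
= \frac{30\,506 - 25\,423}{25\,423}
\approx 0.20 = 20.00\,\%.
```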


B.1.3 Imbalance Factor

[Plots: effect of imbalance factor on cross-node communication improvement (%), on the number of iterations (min/avg/max), and on the capacity usage (%) of cluster nodes 1 to 3.]

Figure B.3: Effects of imbalance factor on improvement, number of iterations and capacity usage, respectively. Measured on the Twitter dataset.


imbalance factor   min. improvement   max. improvement   average   std. deviation
0                  0.01%              0.01%              0.01%     0.00%
0.01               70.73%             78.77%             74.36%    2.50%
0.02               69.58%             76.58%             72.88%    2.69%
0.03               62.59%             80.34%             71.09%    5.62%
0.04               64.18%             79.80%             70.29%    4.81%
0.05               64.89%             77.49%             69.13%    4.02%
0.1                62.82%             78.77%             70.35%    5.17%
0.15               64.10%             84.16%             72.01%    6.24%
0.2                68.61%             82.91%             73.05%    4.53%
0.3                66.24%             86.24%             74.39%    6.03%
0.4                66.14%             88.42%             81.32%    7.04%
0.5                70.03%             86.77%             81.29%    4.54%

Table B.4: Effect of imbalance factor on cross-node communication. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with cooling factor = 0.99, adoption factor = 1, number of clusters = 3. The results are based on ten individual measurements for each factor value, on the Twitter dataset.

imbalance factor   min. # of iterations   max. # of iterations   average   std. deviation
0                  1                      2                      1.2       0.42
0.01               76                     104                    85.4      9.63
0.02               74                     88                     80.1      4.68
0.03               74                     85                     78.1      3.87
0.04               75                     84                     79.4      3.17
0.05               74                     86                     79.4      4.33
0.1                74                     84                     77.8      2.90
0.15               75                     92                     82.9      5.80
0.2                74                     95                     82.7      7.82
0.3                76                     88                     82.1      4.98
0.4                75                     109                    90.6      11.08
0.5                75                     97                     87.2      7.63

Table B.5: Effect of imbalance factor on number of iterations. Measured with cooling factor = 0.99, adoption factor = 1, number of clusters = 3. The results are based on ten individual measurements for each factor value, on the Twitter dataset.


imbalance factor   N1       N2       N3       (N1-avg)/avg   (N2-avg)/avg   (N3-avg)/avg
0                  25,423   25,423   25,422   0.00%          -0.00%         -0.00%
0.01               25,896   25,202   25,170   1.86%          -0.87%         -0.99%
0.02               26,312   25,041   24,915   3.50%          -1.50%         -2.00%
0.03               26,764   24,844   24,660   5.27%          -2.27%         -3.00%
0.04               27,291   24,571   24,406   7.35%          -3.35%         -4.00%
0.05               27,964   24,152   24,152   10.00%         -5.00%         -5.00%
0.1                30,443   22,944   22,881   19.75%         -9.75%         -10.00%
0.15               33,048   21,610   21,610   29.99%         -15.00%        -15.00%
0.2                35,548   20,381   20,339   39.83%         -19.83%        -20.00%
0.3                40,385   18,087   17,796   58.85%         -28.85%        -30.00%
0.4                42,763   17,902   15,603   68.21%         -29.58%        -38.63%
0.5                43,526   19,162   13,580   71.21%         -24.63%        -46.58%

Table B.6: Effect of imbalance factor on capacity usage. The average number of vertices per node, |V|/|N|, is equal to approximately 25,423. The results are based on ten individual measurements for each factor value, on the Twitter dataset.

B.1.4 Adoption Factor

[Plots: effect of adoption factor on cross-node communication improvement (%), on the number of iterations (min/avg/max), and on the capacity usage (%) of cluster nodes 1 to 3.]

Figure B.4: Effects of adoption factor on improvement, number of iterations and capacity usage, respectively. Measured on the Twitter dataset.


adoption factor   min. improvement   max. improvement   average   std. deviation
0                 62.35%             79.79%             70.51%    6.25%
0.5               59.66%             72.98%             68.44%    4.13%
1                 62.82%             78.77%             70.35%    5.17%

Table B.7: Effect of adoption factor on cross-node communication. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with cooling factor = 0.99, imbalance factor = 0.1, number of clusters = 3. The results are based on ten individual measurements for each factor value, on the Twitter dataset.

adoption factor   min. # of iterations   max. # of iterations   average   std. deviation
0                 78                     100                    86.3      8.14
0.5               72                     105                    78.8      9.76
1                 74                     84                     77.8      2.90

Table B.8: Effect of adoption factor on number of iterations. Measured with cooling factor = 0.99, imbalance factor = 0.1, number of clusters = 3. The results are based on ten individual measurements for each factor value, on the Twitter dataset.

adoption factor   N1       N2       N3       (N1-avg)/avg   (N2-avg)/avg   (N3-avg)/avg
0                 30,220   22,973   23,075   18.98%         -9.67%         -9.31%
0.5               30,506   22,881   22,881   20.00%         -10.00%        -10.00%
1                 30,506   22,881   22,881   20.00%         -10.00%        -10.00%

Table B.9: Effect of adoption factor on capacity usage. The average number of vertices per node, |V|/|N|, is equal to approximately 25,423. The results are based on ten individual measurements for each factor value, on the Twitter dataset.


B.1.5 Number of Nodes in a Cluster

[Plots: cross-node communication improvement (%), number of iterations (min/avg/max), and capacity usage (%) with different numbers of nodes (2 to 7) in the cluster.]

Figure B.5: Effects of number of nodes in a cluster on improvement, number of iterations and capacity usage, respectively. Measured on the Twitter dataset.


number of nodes   min. improvement   max. improvement   average   std. deviation
2                 77.34%             77.51%             77.44%    0.06%
3                 62.82%             78.77%             70.35%    5.17%
4                 64.71%             78.39%             72.90%    4.94%
5                 67.88%             79.93%             74.62%    3.21%
6                 69.24%             80.14%             73.30%    3.02%
7                 66.55%             75.55%             71.30%    2.91%

Table B.10: Cross-node communication with different numbers of nodes in the cluster. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with cooling factor = 0.99, imbalance factor = 0.1, adoption factor = 1. The results are based on ten individual measurements for each number of nodes, on the Twitter dataset.

number of nodes   min. # of iterations   max. # of iterations   average   std. deviation
2                 20                     24                     22.20     1.14
3                 75                     84                     78.83     3.19
4                 118                    188                    129.30    21.11
5                 144                    178                    150.70    10.15
6                 166                    200                    174.30    12.04
7                 185                    200                    192.00    4.83

Table B.11: Number of iterations with different number of nodes in cluster. Measured with cooling factor = 0.99, imbalance factor = 0.1, adoption factor = 1. The results are based on ten individual measurements for each number of nodes, on the Twitter dataset.

number of nodes   (N1-avg)/avg   (N2-avg)/avg   (N3-avg)/avg   (N4-avg)/avg   (N5-avg)/avg   (N6-avg)/avg   (N7-avg)/avg
2                 10.00%         -10.00%
3                 20.00%         -10.00%        -10.00%
4                 21.94%         -3.61%         -8.33%         -10.00%
5                 32.10%         -3.59%         -8.53%         -10.00%        -10.00%
6                 28.46%         3.81%          -3.69%         -8.82%         -9.77%         -9.99%
7                 40.40%         1.70%          -4.39%         -8.87%         -9.20%         -9.64%         -10.00%

Table B.12: Capacity usage with different number of nodes in cluster. The results are based on ten individual measurements for each number of nodes, on the Twitter dataset.


B.2 Road Network Dataset

B.2.1 Effects of Cooling Factor

[Plots: effect of cooling factor on cross-node communication improvement (%), on the number of iterations (min/avg/max), and on the capacity usage (%) of cluster nodes 1 to 3.]

Figure B.6: Effects of cooling factor on improvement, number of iterations and capacity usage, respectively. Measured on the Road network dataset.


cooling factor   min. improvement   max. improvement   average   std. deviation
0.95             71.25%             72.22%             71.66%    0.41%
0.96             72.20%             72.80%             72.56%    0.26%
0.97             73.53%             73.87%             73.69%    0.14%
0.98             74.23%             74.97%             74.70%    0.33%
0.99             75.55%             75.99%             75.82%    0.21%
0.995            75.96%             76.37%             76.17%    0.21%
0.999            76.73%             76.84%             76.77%    0.05%

Table B.13: Effect of cooling factor on cross-node communication. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with imbalance factor = 0.1, adoption factor = 1, number of clusters = 3. The results are based on four individual measurements for each factor value, on the Road network dataset.

cooling   minimal # of   maximal # of   average   standard
factor    iterations     iterations               deviation
0.95      21             26             23.75     2.06
0.96      25             25             25.00     0.00
0.97      27             29             28.50     1.00
0.98      39             41             39.50     1.00
0.99      72             73             72.50     0.58
0.995     140            140            140.00    0.00
0.999     200            200            200.00    0.00

Table B.14: Effect of cooling factor on number of iterations. Measured with imbalance factor = 0.1, adoption factor = 1, number of clusters = 3. The results are based on four individual measurements for each factor value, on the Road network dataset.

cooling   N1        N2        N3        (N1−avg)/avg  (N2−avg)/avg  (N3−avg)/avg
factor
0.95      435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.96      435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.97      435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.98      435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.99      435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.995     435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.999     435,236   326,428   326,428   20.00%        -10.00%       -10.00%

Table B.15: Effect of cooling factor on capacity usage. The average number of vertices per node, |V|/|N|, is equal to approximately 362,697. The results are based on four individual measurements for each factor value, on the Road network dataset.
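The (Ni − avg)/avg columns of the capacity tables express each node's deviation from the average vertex count |V|/|N|. A minimal sketch of that computation, using the vertex counts from the rows of Table B.15 (the function name is illustrative):

```python
def capacity_imbalance(node_counts):
    """Relative deviation of each node's vertex count from the cluster average."""
    avg = sum(node_counts) / len(node_counts)
    return [(n - avg) / avg for n in node_counts]

# Vertex counts per node as reported in Table B.15:
deviations = capacity_imbalance([435_236, 326_428, 326_428])
print([f"{d:+.2%}" for d in deviations])  # → ['+20.00%', '-10.00%', '-10.00%']
```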


B.2.2 Imbalance Factor

[Figure: three plots showing the effect of imbalance factor (x-axis) on cross-node communication improvement (%), on the number of iterations, and on the capacity imbalance of cluster nodes (%); min/avg/max series in the first two plots, nodes 1–3 in the third.]

Figure B.7: Effects of imbalance factor on improvement, number of iterations and capacity usage, respectively. Measured on the Road network dataset.


imbalance   minimal       maximal       average   standard
factor      improvement   improvement             deviation
0           0.01%         0.01%         0.01%     0.00%
0.01        75.39%        75.77%        75.59%    0.16%
0.02        75.18%        75.58%        75.38%    0.20%
0.03        75.35%        75.62%        75.53%    0.13%
0.04        75.45%        75.72%        75.52%    0.13%
0.05        75.51%        75.66%        75.59%    0.07%
0.1         75.55%        75.99%        75.82%    0.21%
0.15        76.21%        76.54%        76.39%    0.16%
0.2         77.77%        77.93%        77.87%    0.07%
0.3         81.46%        81.96%        81.79%    0.23%
0.4         83.54%        83.88%        83.73%    0.14%
0.5         88.22%        88.45%        88.34%    0.11%

Table B.16: Effect of imbalance factor on cross-node communication. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with cooling factor = 0.99, adoption factor = 1, number of clusters = 3. The results are based on four individual measurements for each factor value, on the Road network dataset.

imbalance   minimal # of   maximal # of   average   standard
factor      iterations     iterations               deviation
0           1              2              1.25      0.50
0.01        71             73             71.75     0.96
0.02        70             72             71.33     1.15
0.03        71             73             71.50     1.00
0.04        71             72             71.75     0.50
0.05        71             73             72.00     1.15
0.1         72             73             72.50     0.58
0.15        71             73             72.25     0.96
0.2         70             71             70.75     0.50
0.3         70             72             71.00     0.82
0.4         70             71             70.50     0.58
0.5         70             70             70.00     0.00

Table B.17: Effect of imbalance factor on number of iterations. Measured with cooling factor = 0.99, adoption factor = 1, number of clusters = 3. The results are based on four individual measurements for each factor value, on the Road network dataset.


imbalance   N1        N2        N3        (N1−avg)/avg  (N2−avg)/avg  (N3−avg)/avg
factor
0           362,698   362,697   362,697   0.00%         0.00%         0.00%
0.01        369,950   359,071   359,071   2.00%         -1.00%        -1.00%
0.02        377,204   355,444   355,444   4.00%         -2.00%        -2.00%
0.03        384,458   351,817   351,817   6.00%         -3.00%        -3.00%
0.04        391,712   348,190   348,190   8.00%         -4.00%        -4.00%
0.05        398,966   344,563   344,563   10.00%        -5.00%        -5.00%
0.1         435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.15        471,506   308,293   308,293   30.00%        -15.00%       -15.00%
0.2         507,776   290,158   290,158   40.00%        -20.00%       -20.00%
0.3         580,314   253,889   253,889   60.00%        -30.00%       -30.00%
0.4         652,854   217,619   217,619   80.00%        -40.00%       -40.00%
0.5         725,394   181,349   181,349   100.00%       -50.00%       -50.00%

Table B.18: Effect of imbalance factor on capacity usage. The average number of vertices per node, |V|/|N|, is equal to approximately 362,697. The results are based on four individual measurements for each factor value, on the Road network dataset.

B.2.3 Adoption Factor

[Figure: three plots showing the effect of adoption factor (x-axis: 0, 0.5, 1) on cross-node communication improvement (%), on the number of iterations, and on the capacity imbalance of cluster nodes (%); min/avg/max series in the first two plots, nodes 1–3 in the third.]

Figure B.8: Effects of adoption factor on improvement, number of iterations and capacity usage, respectively. Measured on the Road network dataset.


adoption   minimal       maximal       average   standard
factor     improvement   improvement             deviation
0          66.82%        67.22%        67.06%    0.17%
0.5        77.26%        77.62%        77.41%    0.19%
1          75.55%        75.99%        75.82%    0.21%

Table B.19: Effect of adoption factor on cross-node communication. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with cooling factor = 0.99, imbalance factor = 0.1, number of clusters = 3. The results are based on four individual measurements for each factor value, on the Road network dataset.

adoption   minimal # of   maximal # of   average   standard
factor     iterations     iterations               deviation
0          74             75             74.25     0.50
0.5        102            104            103.33    1.15
1          72             73             72.50     0.58

Table B.20: Effect of adoption factor on number of iterations. Measured with cooling factor = 0.99, imbalance factor = 0.1, number of clusters = 3. The results are based on four individual measurements for each factor value, on the Road network dataset.

adoption   N1        N2        N3        (N1−avg)/avg  (N2−avg)/avg  (N3−avg)/avg
factor
0          435,236   326,428   326,428   20.00%        -10.00%       -10.00%
0.5        435,236   326,428   326,428   20.00%        -10.00%       -10.00%
1          435,236   326,428   326,428   20.00%        -10.00%       -10.00%

Table B.21: Effect of adoption factor on capacity usage. The average number of vertices per node, |V|/|N|, is equal to approximately 362,697. The results are based on four individual measurements for each factor value, on the Road network dataset.


B.2.4 Number of Nodes in a Cluster

[Figure: three plots showing the effect of the number of nodes in the cluster (x-axis: 2–7) on cross-node communication improvement (%), on the number of iterations, and on the capacity imbalance of cluster nodes (%); min/avg/max series in the first two plots, nodes 1–7 in the third.]

Figure B.9: Effects of number of nodes in a cluster on improvement, number of iterations and capacity usage, respectively. Measured on the Road network dataset.


number of   minimal       maximal       average   standard
nodes       improvement   improvement             deviation
2           82.12%        82.28%        82.22%    0.09%
3           75.55%        75.99%        75.82%    0.21%
4           71.67%        72.01%        71.81%    0.18%
5           69.27%        69.46%        69.40%    0.11%
6           67.54%        67.90%        67.67%    0.20%
7           66.24%        66.48%        66.35%    0.12%

Table B.22: Cross-node communication with different number of nodes in cluster. The improvement is computed as 1 − (# of after-crossings / # of before-crossings). Measured with cooling factor = 0.99, imbalance factor = 0.1, adoption factor = 1. The results are based on four individual measurements for each number of nodes, on the Road network dataset.

number of   minimal # of   maximal # of   average   standard
nodes       iterations     iterations               deviation
2           38             40             39.00     1.00
3           72             73             72.50     0.58
4           82             86             83.33     2.31
5           89             102            92.25     6.50
6           87             93             90.00     3.00
7           89             94             92.33     2.89

Table B.23: Number of iterations with different number of nodes in cluster. Measured with cooling factor = 0.99, imbalance factor = 0.1, adoption factor = 1. The results are based on four individual measurements for each number of nodes, on the Road network dataset.

number of   (N1−avg)/avg  (N2−avg)/avg  (N3−avg)/avg  (N4−avg)/avg  (N5−avg)/avg  (N6−avg)/avg  (N7−avg)/avg
nodes
2            10.0%        -10.0%
3            20.0%        -10.0%        -10.0%
4            30.0%        -10.0%        -10.0%        -10.0%
5            40.0%        -10.0%        -10.0%        -10.0%       -10.0%
6            50.0%        -10.0%        -10.0%        -10.0%       -10.0%        -10.0%
7            60.0%        -10.0%        -10.0%        -10.0%       -10.0%        -10.0%        -10.0%

Table B.24: Capacity usage with different number of nodes in cluster. Measured with cooling factor = 0.99, imbalance factor = 0.1, adoption factor = 1. The results are based on four individual measurements for each number of nodes, on the Road network dataset.


Appendix C

Contents of CD

readme.txt ............ the file with CD contents description
src ................... the directory of source codes
    grm ............... the directory of the GRM project
    tinkerpop ......... the directory of the TinkerPop logging module
    thesis ............ the directory of LaTeX source codes of the thesis
        figures ....... the thesis figures directory
        thesis.tex .... the LaTeX source code files of the thesis
        data.xlsx ..... generated data of the evaluation
text .................. the thesis text directory
    thesis.pdf ........ the Diploma thesis in PDF format
    thesis-task.pdf ... the Thesis task in PDF format
