Query Analysis on a Distributed Graph Database
ASSIGNMENT OF MASTER’S THESIS

Title: Query Analysis on a Distributed Graph Database
Student: Bc. Lucie Svitáková
Supervisor: Ing. Michal Valenta, Ph.D.
Study Programme: Informatics
Study Branch: Web and Software Engineering
Department: Department of Software Engineering
Validity: Until the end of summer semester 2019/20

Instructions
1. Research current practices of data storage in distributed graph databases.
2. Choose an existing database system for your implementation.
3. Analyse how to extract necessary information for further query analysis and implement that extraction.
4. Create a suitable method to store the data extracted in step 3.
5. Propose a method for a more efficient redistribution of the data.

References
Will be provided by the supervisor.

Ing. Michal Valenta, Ph.D., Head of Department
doc. RNDr. Ing. Marcel Jiřina, Ph.D., Dean
Prague, October 5, 2018

Master’s thesis
Query Analysis on a Distributed Graph Database
Bc. et Bc. Lucie Svitáková
Department of Software Engineering
Supervisor: Ing. Michal Valenta, Ph.D.
January 8, 2019

Acknowledgements
I would like to thank my supervisor Michal Valenta for his valuable suggestions and observations, my sister Šárka Svitáková for spelling corrections, my friends Martin Horejš and Tomáš Nováček for their valuable comments, and my family for their continual motivation and support.

Declaration
I hereby declare that the presented thesis is my own work and that I have cited all sources of information in accordance with the Guideline for adhering to ethical principles when elaborating an academic final thesis.
I acknowledge that my thesis is subject to the rights and obligations stipulated by the Act No. 121/2000 Coll., the Copyright Act, as amended.
In accordance with Article 46(6) of the Act, I hereby grant a nonexclusive authorization (license) to utilize this thesis, including any and all computer programs incorporated therein or attached thereto and all corresponding documentation (hereinafter collectively referred to as the “Work”), to any and all persons that wish to utilize the Work. Such persons are entitled to use the Work in any way (including for-profit purposes) that does not detract from its value. This authorization is not limited in terms of time, location and quantity. However, all persons that make use of the above license shall be obliged to grant a license at least in the same scope as defined above with respect to each and every work that is created (wholly or in part) based on the Work, by modifying the Work, by combining the Work with another work, by including the Work in a collection of works or by adapting the Work (including translation), and at the same time make available the source code of such work at least in a way and scope that are comparable to the way and scope in which the source code of the Work is made available.

In Prague on January 8, 2019

Czech Technical University in Prague
Faculty of Information Technology
© 2019 Lucie Svitáková. All rights reserved.
This thesis is school work as defined by Copyright Act of the Czech Republic. It has been submitted at Czech Technical University in Prague, Faculty of Information Technology. The thesis is protected by the Copyright Act and its usage without author’s permission is prohibited (with exceptions defined by the Copyright Act).

Citation of this thesis
Svitáková, Lucie. Query Analysis on a Distributed Graph Database. Master’s thesis. Czech Technical University in Prague, Faculty of Information Technology, 2019.

Abstrakt (Czech abstract, translated into English)
Although several graph database products designed for distributed environments exist today, they usually distribute data across the individual physical nodes randomly, without later reviewing the load or possibly reorganizing the data. This thesis analyses current practices of data storage in distributed graph databases. Based on this analysis, it designs and implements a new module for the graph computing framework TinkerPop that logs the traffic generated by user queries. A standalone application for storing such logged data in the JanusGraph database is also implemented. This program also runs a redistribution algorithm that proposes a more efficient placement of data in the cluster. An existing Pregel-compatible algorithm by Vaquero et al. is applied with substantial extensions. The result is a proposed data reorganization with a 70–80% improvement in communication among the cluster nodes. This result is comparable to another well-known method, Ja-be-Ja, which, however, requires significantly more computational resources. On the other hand, the method in this thesis introduces a small imbalance on the cluster nodes. Finally, the thesis gives recommendations for possible future extensions and improvements.

Keywords: graph databases, graph partitioning, redistribution, NoSQL, JanusGraph, TinkerPop, Pregel

Abstract
Although graph database products intended for distributed environments already exist, they usually distribute data randomly across physical hosts, without any subsequent load examination or data reorganization. This thesis analyses the current practices of data storage in distributed graph databases. Based on this analysis, it designs and implements a new module of the general graph computing framework TinkerPop for logging the traffic generated by user queries. A separate implementation for storage of such logged data in the JanusGraph database is provided.
It also executes a redistribution algorithm proposing a more efficient distribution of data. An existing Pregel-compliant algorithm of Vaquero et al. is applied with substantial enhancements. The results show a 70–80% improvement of communication among physical hosts of the cluster, which is comparable to another well-known method, Ja-be-Ja, which has much higher computational demands. On the other hand, the method in this thesis imposes a necessary slight imbalance of the cluster nodes. Finally, this thesis introduces suggestions for future extensions and enhancements.

Keywords: graph databases, graph partitioning, redistribution, NoSQL, JanusGraph, TinkerPop, Pregel

Contents
Introduction
1 Theory of Graph Databases
  1.1 Databases
  1.2 Graph Databases
2 Graph Partitioning
  2.1 Algorithms
  2.2 Pregel-Compliant Algorithms
3 Technology
  3.1 Distributed Graph Databases
  3.2 Apache TinkerPop
4 Analysis and Design
  4.1 Requirements Analysis
  4.2 Extracted Data Storage
  4.3 Architecture
  4.4 Graph Redistribution Manager
  4.5 Algorithm Design
5 Implementation
  5.1 Graph Database Selection
  5.2 Query Data Extraction
  5.3 Graph Redistribution Manager
6 Evaluation
  6.1 Datasets
  6.2 Evaluation Methods
  6.3 Settings
  6.4 Results
  6.5 Summary
7 Enhancements
  7.1 Query Data Extraction
  7.2 Graph Redistribution Manager
  7.3 Algorithm
  7.4 Experiments
Conclusion
Bibliography
A README
  A.1 Processed Result Logging
  A.2 Graph Redistribution Manager
B Evaluation Results
  B.1 Twitter Dataset
  B.2 Road Network Dataset
C Contents of CD

List of Figures
1.1 DBMS popularity changes since 2013
1.2 Example of a property graph
1.3 Example of an RDF graph
2.1 The multilevel approach to graph partitioning
2.2 Maximum Value Example in Pregel Model
2.3 Label propagation without and with oscillation prevented
2.4 Examples of two potential color exchanges
3.1 Local Server Mode of JanusGraph with storage backend
3.2 Remote Server Mode of JanusGraph with storage backend
3.3 Remote Server Mode with Gremlin Server of JanusGraph with storage backend
3.4 Embedded Mode of JanusGraph with storage backend
3.5 JanusGraph data layout
3.6 OrientDB APIs
3.7 Apache TinkerPop framework
3.8 OLAP processing in Apache TinkerPop
4.1 Architecture design
4.2 Activity diagram of the main process in the Graph Redistribution Manager
5.1 Query Data Extraction in different connection modes
5.2 Structure of the Processed Result Logging module
5.3 Structure of the Graph Redistribution Manager project
6.1 Degree distribution of the Twitter dataset
6.2 Degree distribution of the Road network dataset
6.3 Development of cross-node communication improvement with various cooling factors
6.4 Effect of cooling factor on cross-node communication and number of iterations
6.5 Effect of imbalance factor on cross-node communication, number of iterations and capacity usage
6.6 Effect of adoption factor on cross-node communication and number of iterations
6.7 Effect of adoption factor on cross-node communication and number of iterations
B.1 Development of cross-node communication improvement with various cooling factors
B.2 Twitter dataset: effects of cooling factor on improvement, number of iterations and capacity usage
B.3 Twitter dataset: effects of imbalance factor on improvement, number of iterations and capacity usage
B.4 Twitter dataset: effects of adoption factor on improvement, number of iterations and capacity usage
B.5 Twitter dataset: effects of number of nodes in a cluster on improvement, number of iterations and capacity usage
B.6 Road network dataset: effects of cooling factor on improvement, number of iterations and capacity usage
B.7 Road network dataset: effects of imbalance factor on improvement, number of iterations and capacity usage
B.8 Road network dataset: effects of adoption factor on improvement, number of iterations and capacity usage
B.9 Road network dataset: effects of number of nodes in a cluster on improvement, number of iterations and capacity usage