Analysis and Comparison of Common Graph Query Languages

Analyse und Vergleichvon gängigen Graph-Anfragesprachen DIPLOMARBEIT zur Erlangung des akademischen Grades Diplom-Ingenieur im Rahmen des Studiums Software Engineering &Internet Computing eingereicht von Martin Klampfer,BSc Matrikelnummer 01526110 an der Fakultät für Informatik der Technischen Universität Wien Betreuung: Univ.Prof.Mag.rer.nat. Dr.techn. Reinhard Pichler Mitwirkung: Univ.Ass.Dipl.-Ing. Matthias Lanzinger,BSc Wien, 9. Dezember 2020 Martin Klampfer Reinhard Pichler Technische UniversitätWien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Analysis and Comparisonof CommonGraph Query Languages DIPLOMA THESIS submitted in partial fulfillmentofthe requirements forthe degree of Diplom-Ingenieur in Software Engineering &Internet Computing by Martin Klampfer,BSc Registration Number 01526110 to the Faculty of Informatics at the TU Wien Advisor: Univ.Prof.Mag.rer.nat. Dr.techn. Reinhard Pichler Assistance: Univ.Ass.Dipl.-Ing. Matthias Lanzinger,BSc Vienna, 9th December,2020 Martin Klampfer Reinhard Pichler Technische UniversitätWien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at Erklärung zur Verfassungder Arbeit Martin Klampfer,BSc Hiermit erkläre ich, dass ichdieseArbeit selbständig verfasst habe,dass ichdie verwen- detenQuellenund Hilfsmittel vollständig angegeben habeund dass ichdie Stellen der Arbeit–einschließlichTabellen,Karten undAbbildungen –, dieanderen Werken oder dem Internet im Wortlaut oder demSinn nach entnommensind, aufjeden Fall unter Angabeder Quelleals Entlehnung kenntlich gemacht habe. Wien, 9. Dezember 2020 Martin Klampfer v Danksagung Ichmöchtemichganz herzlichbei meinemBetreuer, Prof. Dr.Reinhard Pichler, für seine Unterstützung durchden ganzen Diplomarbeits-Prozess bedanken. Er unterstützte mich beider Auswahl einesspannendenThemas, ließ mir nützliche Unterlagen zukommen und warimmer füreinen Ratschlagerreichbar. Außerdemgilt mein herzlicher Dank meinen Elternund meinem Bruder,die michwährend meiner Studienzeit immer unterstützten, ermutigten und anspornten. vii Acknowledgements First and foremostIwanttothank my supervisorProf. Dr. Reinhard Pichlerfor his support and guidancethroughoutthe whole process. He not onlyhelpedmewhen first choosing atopicbut providedvaluable materialsand wasalwaysreachable whenIneeded some advice.Iwould also like to thank my parentsand my brother, they notonly supported me withmywish to studybut also encouraged me throughout those years. ix Kurzfassung Graphdatenbanken werdenimmeröfter verwendet um vernetzte Daten abzuspeichern. Viele dieser Systeme bietenein flexiblesDatenmodellbasierend auf property graphs, dassind GraphenwoKnoten und Kanten mit Tags gekennzeichnet und weitere Eigen- schafteninSchlüssel-WertPaarengespeichertwerdenkönnen. Um auf Daten in solch einerDatenbankzuzugreifenverwendet manAnfragesprachen. Im Unterschied zu re- lationalen Datenbanken, wo SQL die standardisierte Anfragesprache ist,gibt es noch keine standardisierte Anfragesprache für Graphdatenbanken. Nutzer könnendahervon einerbreiten Palette an Sprachenwählendie sichindem verwendeten Datenmodell, der Ausdrucksstärke,der Einfachheitder Verwendung und so weiter unterscheiden. Das Ziel dieser Arbeitist einsystematischerVergleich vonsolchen Sprachen. Daraus wollen wir Hilfestellungenzur Sprachauswahl beibestimmten Anwendungsfällen ableiten. In einemerstenSchritt identifizieren und analysieren wirKernfunktionen dieinallen Anfragesprachenfür Graphdatenbanken vorhandensind. Dazu zählt dasFindenvon Teilgraphen anhand vonGraphmustern (patterns), vonPfaden mittels Pfadanfragen (path queries) und die Kombinationdieser beidenzuNavigationsanfragen(navigatio- nalqueries). Wir erweitern diese Funktionen um strukturunabhängige Anfragen sowie Datenmanipulations- undDatendefinitions-Operationen. Anhanddieser Funktionen analysieren wir fünf moderne Anfragesprachenfür Graphda- tenbanken: Cypher, Gremlin, PGQL,GSQL und G-CORE. Wir analysieren dieSprachen nicht nuranhand vongenerellen Charakteristika, sondern implementieren auch Teile des “Social Network Benchmarks” und analysieren diese Anfragen.Esstelltsichheraus, dass Cypher, PGQL und G-COREeine ähnlicheSyntax verwendensowie ähnlich aus- drucksstarkund einfachzuverwenden sind. Aufder anderenSeite ähneln sichGSQL und Gremlin: beide sindTuring-vollständig undunterstützen nichtnur deklarative, sondern auchimperativeKonstruktionen. xi Abstract Graph databasesare increasingly adoptedinindustry as they offer nativesupport for graph-like data. Manysuchsystems use aflexible datamodel based on property graphs, theseare graphswherevertices and edges can be labeledand augmented with key-value pairs,the properties. Data stored in suchadatabaseisaccessed viagraph query languages. UnlikeSQL,thatisthe standardized query language for relational databases,there is no standard foragraphquery languageyet. Userscan thereforechoose from arange of languages that vary in theirspecific datamodel, expressiveness, ease of use and so on. The goal of this thesisisasystematic comparisonofgraphquery languages and to give guidelines for choosingalanguage for specific application scenarios.Towardsachieving this goal, we identify and analyze corefeatures inherent to suchlanguages. The most common onesare subgraph discovery by matching graph patterns, pathdiscoveryusing path queries and thecombination of both in navigational queries. We extend these featurestoalso include structure independent queries as well as datamanipulation and data definition operations. Basedonthese features,wethen analyze fivecontemporarygraphquery languages: Cypher, Gremlin, PGQL, GSQL and G-CORE. Theanalysis is notonly basedon general characteristicsofthe languages as we alsoimplementpartsofthe SocialNetwork Benchmarkand analyze thesequeries. It turns out that Cypher, PGQL and G-CORE are quitesimilarintheir syntax, expressiveness and ease of use. On theotherhand, GSQL and Gremlin sharesome similarities as they are both Turing-completelanguages thatare notsolely declarativebut alsoofferimperativeconstructs. xiii Contents Kurzfassung xi Abstract xiii Contents xv 1Introduction1 2Graph Databases 3 2.1 BriefHistory and Scope.......................... 3 2.2 BasicConcepts ............................... 4 2.3 Graph Data Models ............................ 8 2.4 Query Languages .............................. 10 2.5 LDBCSocialNetwork Benchmark .................... 17 3QueryLanguage Features 21 3.1 Structure Independent Queries .......................21 3.2 Pattern Matching Queries.........................22 3.3 Path and Navigational Queries ...................... 28 3.4 DataManipulation ............................. 38 3.5 DataDefinition ............................... 40 3.6 Further Features ............................... 41 4QueryLanguage Analysis45 4.1 Cypher. ................................... 46 4.2 Gremlin................................... 59 4.3 PGQL ....................................74 4.4 GSQL ....................................83 4.5 G-CORE ..................................95 4.6 Further Languages .............................103 4.7 LessonsLearned ..............................106 5Conclusion 113 xv ListofFigures 115 List of Tables 117 Listings117 Acronyms 121 Bibliography 123 CHAPTER 1 Introduction Relational databasesare well studied and optimized for datathatcan be modeledvia tabular relations. However, using arelational database comeswith serious restrictions, especiallywhen the informationishighly interconnected. In this case, querying data often meanstomovealong some relationships between theentities. This is typicallydone by joiningmultiple relations,whichadds computationaloverhead compared to systems tailored forsuchqueries. Furthermore, relational databases are not well suitedtohandle adynamic model where the structure changesovertime. Graph databasesare atypeofaNoSQLdatabase and specifically developedtohandle graph-like data, where information coming from relationships is at least as importantas informationstoredinentities [Ang12]. Theyare increasinglyadopted in industry as for examplechemistry-, biology-, or social network- related data canoften be modeledas agraph withinformation storedinnodes andrelationships connecting them. In these cases graph databases offer amorenatural andflexible model as well as nativesupport forqueriesonagraph structure, likenavigating alongthe relationships [AG08, RWE15]. Graph databases have been aroundsince the 1980s and enjoyedariseinpopularity over the last decade. Thiscomes from technicaladvancements that allow large graphslike social networks with millionsorevenbillions of nodestobestoredand processed [AG18]. Data stored in agraph database is accessedvia aquerylanguage. Unlike for relational database managementsystems1 ,whereSQL is the standardized query language,there is no standardfor agraph query language yet, as work on suchastandard started in 2019. Therefore, manyvendorsofgraphdatabasesystems define and use their ownquery language that differs from others in their expressiveness, their supported datatypes, theirease of use andsoon. The rise of interest in graph databasesisaccompanied by 1 We use the terms database management system, database system or simply database interchangeably for the system that “enables users to define, create, maintainand control access” [CB05]toacollection of data. 1 1. Introduction academiaputting more focus on this area as well. This resulted in abetter understanding of somesystems and theirunderlying structures, andalso in the invention of morequery languages. Choosing asystemtherefore involves notonly the propertiesofthe

Load more