Analyse und Vergleichvon gängigen Graph-Anfragesprachen

DIPLOMARBEIT

zur Erlangung des akademischen Grades

Diplom-Ingenieur

im Rahmen des Studiums

Software Engineering &Internet Computing

eingereicht von

Martin Klampfer,BSc Matrikelnummer 01526110

an der Fakultät für Informatik der Technischen Universität Wien Betreuung: Univ.Prof.Mag.rer.nat. Dr.techn. Reinhard Pichler Mitwirkung: Univ.Ass.Dipl.-Ing. Matthias Lanzinger,BSc

Wien, 9. Dezember 2020 Martin Klampfer Reinhard Pichler

Technische UniversitätWien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at

Analysis and Comparisonof CommonGraph Query Languages

DIPLOMA THESIS

submitted in partial fulfillmentofthe requirements forthe degree of

Diplom-Ingenieur

in

Software Engineering &Internet Computing

by

Martin Klampfer,BSc Registration Number 01526110

to the Faculty of Informatics at the TU Wien Advisor: Univ.Prof.Mag.rer.nat. Dr.techn. Reinhard Pichler Assistance: Univ.Ass.Dipl.-Ing. Matthias Lanzinger,BSc

Vienna, 9th December,2020 Martin Klampfer Reinhard Pichler

Technische UniversitätWien A-1040 Wien Karlsplatz 13 Tel. +43-1-58801-0 www.tuwien.at

Erklärung zur Verfassungder Arbeit

Martin Klampfer,BSc

Hiermit erkläre ich, dass ichdieseArbeit selbständig verfasst habe,dass ichdie verwen- detenQuellenund Hilfsmittel vollständig angegeben habeund dass ichdie Stellen der Arbeit–einschließlichTabellen,Karten undAbbildungen –, dieanderen Werken oder dem Internet im Wortlaut oder demSinn nach entnommensind, aufjeden Fall unter Angabeder Quelleals Entlehnung kenntlich gemacht habe.

Wien, 9. Dezember 2020 Martin Klampfer

v

Danksagung

Ichmöchtemichganz herzlichbei meinemBetreuer, Prof. Dr.Reinhard Pichler, für seine Unterstützung durchden ganzen Diplomarbeits-Prozess bedanken. Er unterstützte mich beider Auswahl einesspannendenThemas, ließ mir nützliche Unterlagen zukommen und warimmer füreinen Ratschlagerreichbar. Außerdemgilt mein herzlicher Dank meinen Elternund meinem Bruder,die michwährend meiner Studienzeit immer unterstützten, ermutigten und anspornten.

vii

Acknowledgements

First and foremostIwanttothank my supervisorProf. Dr. Reinhard Pichlerfor his support and guidancethroughoutthe whole process. He not onlyhelpedmewhen first choosing atopicbut providedvaluable materialsand wasalwaysreachable whenIneeded some advice.Iwould also like to thank my parentsand my brother, they notonly supported me withmywish to studybut also encouraged me throughout those years.

ix

Kurzfassung

Graphdatenbanken werdenimmeröfter verwendet um vernetzte Daten abzuspeichern. Viele dieser Systeme bietenein flexiblesDatenmodellbasierend auf graphs, dassind GraphenwoKnoten und Kanten mit Tags gekennzeichnet und weitere Eigen- schafteninSchlüssel-WertPaarengespeichertwerdenkönnen. Um auf Daten in solch einerDatenbankzuzugreifenverwendet manAnfragesprachen. Im Unterschied zu re- lationalen Datenbanken, wo SQL die standardisierte Anfragesprache ist,gibt es noch keine standardisierte Anfragesprache für Graphdatenbanken. Nutzer könnendahervon einerbreiten Palette an Sprachenwählendie sichindem verwendeten Datenmodell, der Ausdrucksstärke,der Einfachheitder Verwendung und so weiter unterscheiden. Das Ziel dieser Arbeitist einsystematischerVergleich vonsolchen Sprachen. Daraus wollen wir Hilfestellungenzur Sprachauswahl beibestimmten Anwendungsfällen ableiten. In einemerstenSchritt identifizieren und analysieren wirKernfunktionen dieinallen Anfragesprachenfür Graphdatenbanken vorhandensind. Dazu zählt dasFindenvon Teilgraphen anhand vonGraphmustern (patterns), vonPfaden mittels Pfadanfragen (path queries) und die Kombinationdieser beidenzuNavigationsanfragen(navigatio- nalqueries). Wir erweitern diese Funktionen um strukturunabhängige Anfragen sowie Datenmanipulations- undDatendefinitions-Operationen. Anhanddieser Funktionen analysieren wir fünf moderne Anfragesprachenfür Graphda- tenbanken: Cypher, , PGQL,GSQL und G-CORE. Wir analysieren dieSprachen nicht nuranhand vongenerellen Charakteristika, sondern implementieren auch Teile des “Social Network Benchmarks” und analysieren diese Anfragen.Esstelltsichheraus, dass Cypher, PGQL und G-COREeine ähnlicheSyntax verwendensowie ähnlich aus- drucksstarkund einfachzuverwenden sind. Aufder anderenSeite ähneln sichGSQL und Gremlin: beide sindTuring-vollständig undunterstützen nichtnur deklarative, sondern auchimperativeKonstruktionen.

xi

Abstract

Graph databasesare increasingly adoptedinindustry as they offer nativesupport for graph-like data. Manysuchsystems use aflexible datamodel based on property graphs, theseare graphswherevertices and edges can be labeledand augmented with key-value pairs,the properties. Data stored in suchadatabaseisaccessed viagraph query languages. UnlikeSQL,thatisthe standardized query language for relational databases,there is no standard foragraphquery languageyet. Userscan thereforechoose from arange of languages that vary in theirspecific datamodel, expressiveness, ease of use and so on. The goal of this thesisisasystematic comparisonofgraphquery languages and to give guidelines for choosingalanguage for specific application scenarios.Towardsachieving this goal, we identify and analyze corefeatures inherent to suchlanguages. The most common onesare subgraph discovery by matching graph patterns, pathdiscoveryusing path queries and thecombination of both in navigational queries. We extend these featurestoalso include structure independent queries as well as datamanipulation and data definition operations. Basedonthese features,wethen analyze fivecontemporarygraphquery languages: Cypher, Gremlin, PGQL, GSQL and G-CORE. Theanalysis is notonly basedon general characteristicsofthe languages as we alsoimplementpartsofthe SocialNetwork Benchmarkand analyze thesequeries. It turns out that Cypher, PGQL and G-CORE are quitesimilarintheir syntax, expressiveness and ease of use. On theotherhand, GSQL and Gremlin sharesome similarities as they are both Turing-completelanguages thatare notsolely declarativebut alsoofferimperativeconstructs.

xiii

Contents

Kurzfassung xi

Abstract xiii

Contents xv

1Introduction1

2Graph Databases 3 2.1 BriefHistory and Scope...... 3 2.2 BasicConcepts ...... 4 2.3 Graph Data Models ...... 8 2.4 Query Languages ...... 10 2.5 LDBCSocialNetwork Benchmark ...... 17

3QueryLanguage Features 21 3.1 Structure Independent Queries ...... 21 3.2 Pattern Matching Queries...... 22 3.3 Path and Navigational Queries ...... 28 3.4 DataManipulation ...... 38 3.5 DataDefinition ...... 40 3.6 Further Features ...... 41

4QueryLanguage Analysis45 4.1 Cypher...... 46 4.2 Gremlin...... 59 4.3 PGQL ...... 74 4.4 GSQL ...... 83 4.5 G-CORE ...... 95 4.6 Further Languages ...... 103 4.7 LessonsLearned ...... 106

5Conclusion 113

xv ListofFigures 115

List of Tables 117

Listings117

Acronyms 121

Bibliography 123 CHAPTER 1

Introduction

Relational databasesare well studied and optimized for datathatcan be modeledvia tabular relations. However, using arelational database comeswith serious restrictions, especiallywhen the informationishighly interconnected. In this case, querying data often meanstomovealong some relationships between theentities. This is typicallydone by joiningmultiple relations,whichadds computationaloverhead compared to systems tailored forsuchqueries. Furthermore, relational databases are not well suitedtohandle adynamic model where the structure changesovertime. Graph databasesare atypeofaNoSQLdatabase and specifically developedtohandle graph-like data, where information coming from relationships is at least as importantas informationstoredinentities [Ang12]. Theyare increasinglyadopted in industry as for examplechemistry-, biology-, or social network- related data canoften be modeledas agraph withinformation storedinnodes andrelationships connecting them. In these cases graph databases offer amorenatural andflexible model as well as nativesupport forqueriesonagraph structure, likenavigating alongthe relationships [AG08, RWE15]. Graph databases have been aroundsince the 1980s and enjoyedariseinpopularity over the last decade. Thiscomes from technicaladvancements that allow large graphslike social networks with millionsorevenbillions of nodestobestoredand processed [AG18]. Data stored in agraph database is accessedvia aquerylanguage. Unlike for relational database managementsystems1 ,whereSQL is the standardized query language,there is no standardfor agraph query language yet, as work on suchastandard started in 2019. Therefore, manyvendorsofgraphdatabasesystems define and use their ownquery language that differs from others in their expressiveness, their supported datatypes, theirease of use andsoon. The rise of interest in graph databasesisaccompanied by

1 We use the terms database management system, database system or simply database interchangeably for the system that “enables users to define, create, maintainand control access” [CB05]toacollection of data.

1 1. Introduction

academiaputting more focus on this area as well. This resulted in abetter understanding of somesystems and theirunderlying structures, andalso in the invention of morequery languages. Choosing asystemtherefore involves notonly the propertiesofthe system but alsothe supportedquery language. With many languages to choose from, it is important to get an overviewabout theirexpressiveness, their characteristicsand theirease whenformulating queries.But to our knowledgethere is no comprehensive comparisonofcurrentmajor and most promising languages regardingthese factors. Thisthesis providesacomparison of common query languages, especially regardingtheir expressiveness and ease of use. To get acomprehensible comparison, core features of query languages are identifiedand analyzed, andthe comparisonisdone based on these features. We analyze four languages from industry (Cypher, Gremlin, PGQL,GSQL) and one promising language from aresearchproject(G-CORE). Theanalysis is not onlybasedongeneral information as we alsoimplementparts of theSocialNetwork Benchmark. Whereas afull implementation of all chosen queriescan be found on our GitHub repository2,weanalyzeonly some of them in detail in this thesis. It turnsout that Cypherisagood choice, both when first starting with graph databases and for theuse in applications, as longasmore expressive path queriesand aschema are not needed.PGQL is very similartoCypher, and what it lacks in functionality it makes up for with sophisticated path querying capabilities. AlthoughGremlin allows for expressive queries, its low-level approachwith afocus on imperative traversals makes it harder to get into for someone familiar with high-level declarative querylanguages. GSQL is designed forhuge graphs with support fornativeparallel query execution. That results in aslightly different syntax comparedtoCypher and PGQL, but themajor difference is the support of astrong type systemand schema.Weconcludewith G-CORE, an expressive researchlanguage that combinesusefulfeatures from existinglanguages, achieves fullcomposabilityand elevates pathstofirst-class citizens in the graph. The remainder of this thesisisstructured as follows: Chapter2introducesgraph databases in general,including theirunderlying data modelsand the query languages we focus on in our comparison. We then identify commonfeatures of query languages and analyzethese languages regardingthe expressiveness they enable in Chapter 3. The fivequerylanguages are introducedand analyzed in Chapter 4. We endthis chapter with ageneral comparison and provide guidelines for selecting alanguage for specific application scenarios.Finally, Chapter 5concludes thethesisand includes notes on future work.

2https://github.com/martin-kl/Diploma_Thesis_01526110_code

2 CHAPTER 2

GraphDatabases

Since the 1980s, amultitude of database systemssupporting differentgraphmodels and query languages targetingdifferent applications have been introduced. We start with adescription of graph databasesingeneral and givesome historical background in Section 2.1.The basicconceptsofgraph databasesand underlying data modelsare introduced in Section2.2 and Section 2.3, respectively.Section 2.4 gives an introduction to fiveofthe most commongraph query languages. These are thelanguages we analyze and compare in Chapter 4.

2.1BriefHistory and Scope

Graph databasesand query languages have been aroundfor multiple decades.Initially introduced in the 1980s, with theinventionofgraphdatabaseslikeG-BASEin1987 [Kun87]and corresponding data modelsand query languageslikeG[CMW87], interest inthe areadeclinedinthe later1990s as databasestargetingfor examplespatial and geographical dataemerged [AG08]. In the late 2000s and especiallyduring the last decade we sawabreakthroughofgraphdatabases, as theneed for frequentschema changesand real-time query responsesondatabases containing vast amounts of interconnected data became morerelevant[FB18]. They are nowused to handledatafrom awide range of fields: from transport networks overbiological information like genes and protein interactions to socialnetworks [RWE15]. Using agraph databasebecamearequirementtohandle vast amounts of interconnected data in fields likethese, as they are designed to allow amore natural and faster access to graph-like data [Ang12]. As an examplefor this, assume that we modelsuchdata in arelational database.This could be done by storingthe edges in atablewhereeach rowrepresents an edgethatcontains twoforeign keyreferences, one for eachadjacent node. We endupwith lots of many-to-manyrelationshipsfor all theconnections carrying data.Althoughweare abletomodel graphsinthis way, queryingthe structure can

3 2. Graph Databases

become quitecumbersome. In contrasttoconventionaldatamanagementweare not only interested in the valuesstoredinthe tablesbut alsointhe connection patterns [EAL+15]. If we search for suchpatterns or query not onlyinformation stored in asingle relation,we use join operations tomovealongapath 1 in the graph. These joinscan easilybecome complex,therefore computationally demandingand impact theperformance,evenifonly asmallpart of thegraphisqueried. Graph databases on the otherhand are designedto store and process such interconnected data.They canhandle large graphswhile providing ahighperformance, especially for queries traversingthem [HP16]. One cansplit the numerous available systemshandling graph dataintotwo maincategories basedontheir focus on either transactional or analytical processing: [RWE15, RH19]

Transactional Systems thatare focused on transactional processingare usuallyonline, meaningthat they are built for real-time access to thecurrentdata. Similar to the world of relational databasestheyare designed for OnlineTransactional Processing (OLTP) and called graph database managementsystems or simply graph databases. We focus on these tasks and systems in ourthesis.

Analytical Systems focused on analyzinggraph-likedata can either be onlineoroffline, i.e. they canwork on thecurrentdata or on an offline copyofit. Regardless of this design choice, thesetasksare often called Online AnalyticalProcessing(OLAP)2 andare typically done in batch-processing fashion.Graph databases support these tasks to some degree but are not optimized for them and sometimes even lackthe ability to perform them at all.Therefore, specialtechnologiesexist thatare often called graph compute engines or graphanalytics systems.

2.2 Basic Concepts

Unlikerelational databasesthat are builtontabular relationsasunderlying , graph databasesare builtonthe mathematicalstructures of graphs. Although thereare multiple differentdata modelsusedinvarious graph databases (see Section 2.3), they all representtheir domain using itsstructure via nodes (vertices), edges (relationships,links) andinmost casesproperties. Unlessmentionedotherwise,weassume directed graphsas most systems do not support undirected edges. Let’s look at an example for suchadata structure. Figure2.1 depicts asmallsocialnetwork containing three nodes representing its users thatare connected via multiple edges. Edges and nodesare either labeledas User or FOLLOWS and eachnodehas apropertycalled name.The structure tells us thatAlice follows Bob, Carol follows Alice and Carol and Bob follow eachother. This is asmallmotivating exampleofasocial network thatcouldeasily be extended to also containfurtherinformationabout the users or relationships by addingmore properties.

1Moving along apath,that is aconcatenationofadjacentedges,means to traverse the pathfrom its start to its end node. 2They are sometimes also denoted as Offline Analytical Processing [BPG+19].

4 2.2. BasicConcepts

Figure2.1: Asmallsocialgraph.

Furthermore, the graph couldalso be extended to includeother entities likepostsor messages by adding appropriate nodes and edges. Depicting information in such aconnected data model allowsagraph database to handle highly interconnecteddatawell, but one has to keep in mind that building on graphs comes with its owndrawbacks.While queryingfor related dataand traversingthe graph can be done efficiently, queryingthe global structure is often challenging and shows notably worse performance compared to relationaldatabasesinsome cases. An example forsuchaglobal querywould be to searchfor all usersataspecific age or for persons thathaveatleastone friendfollowing them (i.e. one incoming edge labeled FOLLOWS in our example). As graph databasesdonot provideparticularly efficientsolutions for all problems, they are not seen as areplacementfor theirrelational counterparts but more as acomplementary system thatisused especiallytoanalyze stronglyinterrelated data. [FB18]

2.2.1 NativevsNon-NativeGraph Representation Databases with built-in support forgraphscan be split into two categories that differ in theirinternal datarepresentation:native(alsopure)graphdatabasesand multi-model (also non-native) ones [EAL+15, AG18, BPG+19].

• Native graph databases like Neo4j3 and TigerGraph4 focussolely on datastructured in graphs. As theirdatamodel and querylanguages are specificallydesignedwith graphsinmind, graphsare first class citizens in thesesystems. Most of these databases do not only presentthe data as agraph to the userbut alsouse agraph data structure internally.Thisallows them to be optimized for certaingraph queries. An examplewouldbethe provisionofconstanttime adjacency retrievals regardlessofthe sizeofthe graph thatispossibleinsuchsystems. Furthermore,

3https://neo4j.com/ 4https://www.tigergraph.com/

5 2. Graph Databases

manysuchsystems provide graph specific functions like the shortestpathalgorithm or breadth-first search out of thebox.

• Multi-modeldatabases support other models apart from graphs, like the relational or documentoriented one, as well.Manyofthese systemstransformthe data into adifferentmodel to store it.Consequently, aglobal indexiscommonlyused to store and retrieve adjacencyinformation. Although some of thesesystemsalso provide graph specificfunctions out of thebox,they are in manycases not that performant caused by the use of differentstorage and processingmechanisms. On the other hand, multi-model databasescan integratedatacoming in various different structuresand possibly frommultiple sources into asingle system. Having all data in asingle system allowsfor auniform access via asinglequery language and the data can be analyzed overdifferent structures. Examplesfor such systems are ArrangoDB5 and AzureCosmos DB6.

Regardlessofagraph databasebeing categorized as nativeormulti-model we can identify two major components in suchasystem: thestorage and theprocessing component [RWE15]. Although thesecomponents are commoninbothdatabase categories, they differsubstantially and their designand mechanismsinfluence thecategory of adatabase.

Storage The storage componentisresponsible for reliablestorage and access to thedata. One possibility to achievethis is to serialize the graph andstore it in arelational table. This allowsfor theuse of asingle copy of the data thatcan be accessed by either SQLqueriesorquerieswritten in agraph query language,depending for exampleonthe knowledge of the engineers or theease-of-use for thegiven problem. However, serializingthe graph andkeeping it in adatamodel that is not specifically designed and optimized for suchinterconnected data comes with all performance drawbacks mentionedearlier. Thisisone of thereasons whydatabaseswith sucha non-native storage component are not categorizedasnativegraph databases, but as multi-model ones. Anotherpossibility is to storedatadirectly in aconnected structure thatallows for faster, in manycases constanttime, retrievals of adjacententities.Byusing such a storage component, thesedatabasesare usuallyabletoscale notablybettertolarge graphscontaining millionsofnodes while stillprovidingahighperformance for most graph related queries. Astorage component built on suchadesign is referred to as native graph storage.Asmentioned earlier,most nativegraph databases representthe data internallyasagraph and use such anativegraph storage.

Processing The processing componentconnects to thestorage componentand is re- sponsible for all operations on the data.Asthe choice of the storage mechanism heavily influences the innerworkings of theprocessing component, likethe use of a

5https://www.arangodb.com/ 6https://azure.microsoft.com/services/cosmos-db/

6 2.2. BasicConcepts

nativegraph storage that providesdifferentperformancecharacteristics and query possibilities to theprocessing component, the borderbetween thesecomponents is quite blurry.Therefore, the most interesting part of theprocessing component concernsalsothe storage technique andisabout the questionwhether the system reliesonindex-free adjacency or not. Whenusing index-free adjacency,adjacentnodes and edges are physically stored next to eachother with direct referencestoeachother. Thesecomponents are also referred to as native graph processing and are used in most nativegraph databases. As an example, assume that we have the nodes x, y and therelationship e =(x, y) in asystem. With index-free adjacency,the processing componentcan “move”from x to e and further to y by directly retrieving theirinformation without the needof aglobalindexcontaining therelationships.Having thesedirectpointersallows for constanttime retrieval of an adjacentnodeorrelationship,regardless of thesizeof the whole graph.Onthe downside,graphdatabasesingeneral and especiallynative graph processingsystems thatrelyontheir localadjacency information instead of global indexes have the problemthatsharding agraph is very hard.Sharding is the process of partitioning agraph into multiple subgraphs that can be stored on differentmachinestoachievehorizontal scalability. [RN11] Anotheroption for thedesign of aprocessing component is to use aglobalindex thatcontains references to relationships and thenodes in theserelationships.To navigatethe graph,data is joined with the global index. This approachisused in multi-model graph databases.

2.2.2 Indexes Indexes are used to speed up access to relevantdata. Relational databases relyheavily on theseindexes, as they enable tablestobeefficientlyjoined and data entries to be efficiently found. Graphdatabasesalso rely on indexesinvarious ways. Systems not built on theprinciple of index-freeadjacency use an index containing adjacency information. Thisindexcan then be usedtomovebetween connected entities in thegraph by joining the data with the index. Depending on thesystem,this canwork similar to relational databases. Graph databases relying on index-freeadjacencyhave localindexes integrated into the graph structure. As briefly explained in the processing componentabove,every node and edge in suchasystem keepsdirectpointerstoall adjacentnodes and edges. These are stored together with theentityand form alocal indexfor every node andedge. Once the start node of atraversalisfound, they can traverse the neighborhood without accessingadedicated or global indexbyusing thepointers in their local index [BPG+19]. Thisisasignificantadvantage compared to systems without index-free adjacency where performance usually deteriorates as thegraphbecomes bigger, caused by an increasein the sizeofthe index.However, thesedatabasesalso use additional indexestofind specific nodes in the graph,likethe start node in aquery,and to accelerateother operations, especiallyglobal ones. As an example, assume that our social network fromFigure 2.1 alsocontains theage of every user. If we want to getthe names of all usersolder than20

7 2. Graph Databases

years for example, an indexonthe age allowsthe system to checkfor thecondition on the index, and limitsthe access to nodes that satisfyit. [RN11]

2.3GraphData Models

Depending on thedefinition,the term data model (also database model,db-model or model) comprises not onlythe underlying datastructure usedbyadatabase system but alsothe data description, integrityconstraints,maintenanceand even access and query mechanisms [AG08, Ang18]. We focus on theunderlying data structure used by a database systeminthis section. As brieflymentionedearlier, it is defined aroundthe mathematical structures of graphs, nodesand edges [Ang12]. Unlikeinthe world of relational databasesthere is no standard data model usedingraph databases. Figure 2.1 is an example of a labeledproperty graph that is widely used, especially in native graph databases. Other notablemodelsinclude the Resource Description Framework (RDF) modelused inthe domainofSemanticWeb and hypergraphs thatgeneralize thenotion of agraph to allow edges between morethantwo nodes [AG18]. We nowtakeacloser lookatthe RDF and various propertygraph modelsasthese are the data modelsused in most graphdatabases of our time.

2.3.1 RDF The RDF model encodesdata in subject-predicate-object expressions and databasesusing this model aretherefore alsodenoted as Triplestore.Incontrast to other models,ituses aschema that is integrated together with thedataintoasingle graph.The schemaisa smallpart of thegraphand contains information about possibleconnections as well as node and edge types. SinceitisaWorld Wide WebConsortium(W3C)recommendation, it enjoyswidespreadrecognition,gooddocumentationand the available systemsare unified by acommon and standardizedquery language:SPARQL. It wasoriginally developedtorepresentweb metadata, butcan alsobeused in amoregeneral wayto modelinterconnected resources using theaforementioned triples. On the other hand, thesesystemsare usually notnativegraph databases as manytransform the triplesinto other formatsand store them forexample in relational tables. Nonetheless, they fall under the category of graph databasesaslarger amounts of thesetriples canform abody of interconnected nodesand therefore agraphstructure. [RWE15, AG08, AG18] Figure 2.2 gives an exampleofanRDF graph containing thedatafrom our social network example. An exampleofatripleinthis case is “Alice followsBob”, where Alice is the subject(theresource thatisdescribed), follows the predicate (property) and Bob the object (or property value) [AG08]. The schemaisgiven in aseparatepart of the graph thatisconnected to theinstancesvia type-edges. These connections representthe information given by the labels in Figure 2.1. Apartfrom the missing labels there arealso no inherentpropertiesonnodes.Suchapropertyisstoredinatriplethatintroduces a node containing thevalue and an edge labeled by the nameofthe property. An example for thisisthe node "Alice" connectedbythe name-edge to thenode user1.

8 2.3. Graph Data Models

Figure 2.2: Asmallsocialgraphinthe RDF data model.

Despite being endorsed by W3C and having,incontrast to other graph databases,a standardizedquery language, theseRDF stores are mainlyused in the domain of Semantic Web. Areason for thelackofutilization in other domains couldbethe comparably bloated datastructure that uses an additional node and edge for every label and property. Anotherreason couldbethe use of aschema that limitsthe flexibilityofthe data model [SS19].

2.3.2Labeled (Property) Graphs Edge-labeledgraphs are one of thesimplest form of graphswherenodes and edges can be eachlabeled with one label.Labelsassignedtoedges indicate the type of therelationship between thenodes. Thisallows us to model different relationships as well as multiple edges between anytwo nodes in asingle graph.RDF-graphsare builtonthe basis of edge-labeled graphsbut differintheir use of aschema andtherefore the use of node and edge types. Edge-labeled graphsdonot provide aschema and typescan onlybegiven implicitly by setting an appropriate label. Figure 2.3 shows thedata fromour small social network example encoded in an edge-labeled graph.Weadded the information thatCarol is thesister of Alice to showthe powerofedgelabels. Without theselabels, we would simply have twoedges fromCarol to Alice withoutbeing abletodistinguish them.[AAB+17] (Labeled)Property graphs,also LPGsorsimply propertygraphs,extendthis model by addinglabelstonodes and including propertiesdirectly in the nodes and edges. Properties, alsocalled attributes, are key-valuepairs containingnon-graphical information like name: Carol.The inclusionofpropertiesallows foramore natural datamodelingand faster retrievalofthe data related to anodeoredge[Ang12]. Therefore, propertygraphs can be described as a"directed, labeled, attributed multigraph"[AG18], where multigraph denotes that multiple edges betweenany two nodes are possible.Although this model

9 2. Graph Databases

Figure 2.3: Asmallsocialgraphinthe edge-labeleddata model.

is more popularthanothers [FGG+18b, BPG+19]and usedinmanygraphdatabases, it is notstandardized and manyvariants of it exist.Angles [Ang18]presents aformal definitionofthe propertygraph data model thatallows nodes and edges to be labeled with multiple labels and everypropertytobeassigned aset of values. Other definitions for examplerestrictedges to be labeledbyonlyasinglelabel [FGG+18b]. Somepapers also requirethe useofaunique identifier for every node and edge,orrestrict properties to onlystore single values insteadofsets[AAB+17]. Let’s look at an enriched examplebasedonour small socialnetwork. Thepropertygraph in Figure 2.4 containsauniqueidentifierfor everynodeand edge. This allowsfor fast and clear identification of entities and is alsoused in manygraphdatabases. As property graphsdonot have an explicit schema, thetypes of nodes and edges are given through labels attachedtothem.Our example allows anodetohave multiple labels,like User and Admin on node n1.Incontrast to theRDF model fromFigure 2.2, theproperties of anodeare stored inside the node itself. This results in amorecompact modelaswe do not need an additionalnodefor everyproperty.Asmentionedbefore, edges can also have propertieslikethe since attributes on the follows relationships. Propertygraphs are very flexible as they do not use an explicit schema. Onecan simply omitpropertiesthatare notavailable forsome parts in thegraph, likethe missingage attribute in the node n3.Likewise,other propertiescan be added to nodes, like the mail propertyinn3.Furthermore,ifwewanttoadd new informationlikeapicture thatisposted by one of our users, we can add anodelabeled Picture and connect it via an edge labeled posted to the user that posted it. Sincemostnative graph databasesuse the propertygraph data model,they do not have adefinedschema and are thereforemore flexiblecompared to relational or other schema-dependent databases. We will focusonthe propertygraph model and some variations of it in therestofthis thesis as manypopular graph query languagesare designed to operate on some variationofthatmodel.

2.4Query Languages

Aquery language is ahigh-level language used to accessdatastoredinadatabase [ARV19]. In the strictestdefinition, aquerylanguage onlysupportsoperations to retrieve

10 2.4. Query Languages

Figure 2.4: Asmallsocialgraph in the propertygraph data model.

data.Codddescribes suchalanguage moreformally as “a collectionofoperators or inferencingrules, whichcan be applied to any validinstances of the data types (ofthe model), to retrieve or derive data from any partsofthosestructuresinany combinations desired” [Cod80]. This descriptionlimitsqueriestoonly retrievedata, andsuchoperators form theclass of Data Query Language (DQL).Most definitionsextend this strictform to alsoinclude manipulationoperations thatallowtoadd,modify and remove data.These operationsform theclass of the Data Manipulation Language (DML).Databases that rely on, or support aschema furthermoreinclude operationsofthe Data Definition Language (DDL).Analogously to the relationalworld wherethe StructuredQuery Language (SQL) alsosupportsmore than just dataretrieval operations but is denoted as querylanguage, agraph query language supports DML and sometimes DDLoperationsaswell. Taken together,agraphquery languagedefines operations to access and manipulatedatathat is structured as agraph and provides support for graph specificoperations[ARV19]. [Cha12, Ang12] Figure 2.5 gives an overviewofsome graph query languages thatare specificallydesigned forgraph data models. As brieflymentionedearlier, thegraphical query language G [CMW87] wasintroduced in 1987 and is one of thefirst suchlanguages.Wewill briefly analyzethis languageasitintroduced importantconcepts that are stillinuse in modern query languages.Gis built on thedatamodel of edge-labeled graphs and introduced the concept of agraphicalquery.Aquery Q consistsofaset of labeled and directed graphs, called querygraphs. Nodesinaquerygraph are labeled by either constants or variables, and edges can be labeledbyregularexpressions over constants and variables. These querygraphs canbeseenasgraphpatterns that are matched to subgraphsofthe graph database G to evaluatethe query.

Figure 2.6 shows an example of aquery Q = {Q1, Q2} thatcontains twoquerygraphs.

11 2. Graph Databases

Figure 2.5: Evolution of graph query languages. Source:AnIntroductiontoGraph Data Management, Figure 1.7[AG18].

Q1 : Q2 : y

’AC’ ’ + ’A w ’Tor’ x ’Tor’ z ’AC’ ’AC’

Figure 2.6: An exampleofagraphicalquery in the query languageG.Source: AGraphical Query Language Supporting Recursion, page 2, example 2[CMW87].

Each node in thedatabase G represents acityand is labeledbyits namewhile the edges represent airline connections betweenthe cities and are labeled by the nameofthe airline. The query Q findsthe first and last citiesvisited by roundtrips from’Tor’ (Toronto), with therestriction that thefirst and lastconnections are via ’AC’ (Air Canada).Another restrictionisgiven by the regularexpression w+ thatlimits all connections between the cities y and z,the first and last citiesinaroundtrip, to be with thesame airline.Such aregular expression is depicted as adashed line to highlightthat this is apath and does not have to be asingle edge. Thesupport for regular expressions in query graphs allows foramore general and simplerexpressionofrecursivequeries compared to other query languageslikeStructured Query Language (SQL) for example [AG18].

Allmajor propertygraph query languages introduced in this millennium are built around some core languagefeatures. Thestructureofaquery graph as agraphical pattern, like the simplepattern depicted by Q1,isarguable the most popularfeature supported by most,ifnot all,graph query languages. Depending on thelanguage, thesepatterns can be either given directly as agraph patternlikeinG,orasatextual representation of the patternasinmanymodern query languages. We define this patternand other featuresof query languages and provide examplesfor them in Chapter3.Anothernotabledifference

12 2.4. Query Languages

between manygraph query languages and theirrelational counterpart SQLis, that their inputformat differsfromtheir output format. Unlikeinthe relationalworld, wherea query takes atable(relation) as input andoutputsatable, agraph query takes agraph as inputbut in many languages outputs atablecontaining values[Ang18]. Thisisthe mainreason whymanygraph query languages are not composable. We givemore details on thecomposability of queriesinSection 3.6.

The language Gevolved into G+ [CMW88]that added support for graph specificfunctions like the computationofthe sizeofthe shortestpath, and extended thecapabilities of querygraphs. Since then, manydifferentgraph query languages builtonvariousdata models were introduced. Examples are thelanguages THQL[WS90]and HQL [The02] forhypergraphs, Graphlog [CM90], GRE [Woo90], Gram [AS92]and PDL [ABR13] (as PDQL in Figure 2.5) for edge-labeled graphsand SPARQL version1.0 [PS08]and version1.1 [HS13]for RDFgraphs. Many of the query languagesinFigure 2.5 are moretheoreticaland notused inreal world database system. Therise in popularityof graph databases thatmainlyuse thepropertygraph data model wasaccompanied by the introduction of query languages builtonthis model.Examples of suchlanguages are PRPQ [LS06], Gremlin[Rod15], Cypher [FGG+18b]and PGQL [vRHK+16], thelatter three being actively used in currentgraph databases.

Apartfrom SPARQL, that is standardizedbythe W3C, there areavarietyoflanguages builtonotherdatamodelsthan RDF as no standardexists forthese models. Furthermore, at thetime of writing,noquery language that is currentlyused by anativegraph database has acomplete and formal specification. One reasonfor this is that these query languages have notbeen aroundfor that longand researchers are onlystarting to formalizeparts of them, like with the popular query language Cypher and itsopensource variant openCypher[MSV17, FGG+18a]. As manyofthesequery languagesare still being activelydeveloped,they evolveand change frequentlywhichmakes an effort for aformal specification somewhat impractical at thecurrenttime.Anotherreason for thelackofsuch aspecification could be thatqueryinggraph patterns is computationally hard [BLR11]. We will seethat querying suchpatternsisone of themainfeatures of everygraphquery languageinChapter3.Therefore, queryinggraphdatabasesisalso computationally hard [Bar13]. This is also one of the reasons forthe existenceofthe multitudeofgraphquery languages that provide differentpossibilities,functionalities and expressiveness.

The ISO/IEC JTC1/SC 327 is thecommittee responsiblefor datamanagementand interchangeinthe International Organization for Standardization (ISO). Members of theirworkinggroup 3, that is responsible for database languages,raised theidea of astandalonegraphquery language that complementsSQL in 2017 [MHM17]. This idea wasfurther publicizedinthe GQL manifesto [Gre18]in2018 that motivated the move towards the creation of astandardized query languagefor thepropertygraph data model. National standardization bodies aroundthe world voted in favorofaproposal for suchanewlanguage standard, calledGraph Query Language(GQL), in 2019 [Gre19b].

7https://www.iso.org/committee/45342.html

13 2. Graph Databases

Thiswas the first time in the last 35 years that theISO consideredthe creation and standardization of anew database language [Gre19a]. The project is planned to takefour years and even then it will most likely take some time until the vendors of graph databasesimplementand supportthe new standard, if they choose to do so in thefirst place.Therefore, choosing agraph databasesystem before the widespread support of GQLinvolves not onlythe propertiesofthe system but alsothe supported querylanguage. We will nowbriefly introduce the fivegraph query languages thatwefocus on in ourcomparison.

2.4.1 Cypher Cypher[FGG+18b]isahigh-level,declarative querylanguage that wasoriginally invented for the use in the graph database Neo4j.Asthatdatabase is one of themost used and well known nativegraph databases8,Cypherisregarded as perhaps the most well-known propertygraph query language [AAB+17]. Its syntax is inspiredbySQL to ease the transitionfor userscoming from relational databases and thelanguage is identifiedas “expressive”compared to other languages[RABM17]. Apartfrom being inspired by SQL, it alsoincorporates conceptsfrom other languages like Pythonand SPARQL as well as from in general. The developmentofCypher came handinhand with thedevelopment of Neo4j and waslargely an invention of Andrés Taylorin2011. Sincethen,the language has been continually improvedand extended. [FGG+18b] The companybehindNeo4j,Neo4j Inc., announced theopenCypherproject9 in 2015 that opened up thedevelopment of the language andthe currentversion of thelanguage is nowreferencedasversion9[ope18]. Theprojectprovidesanopenplatform that should enable Cyphertobecome afully-specifiedstandard [GJK+18]. Thiswouldenable other vendors of graph databasestoprovide implementationsofCypherintheir products. Even though the language is not yetfully-specified or standardized, it has been implementedin other products like SAPHANA Graph,Redis Graph or Memgraph.The move towards standardization came from thegoal of thecompanytoestablish Cypher“as the property graph querylanguage”[FGG+18b]. That companywas alsobehind, or at least involved in, some of themore influential pushes towards the creation of GQLasanew query language standardbyISO/IEC.Aninfluentialand driving figure in this areaisfor exampletheir productmanager for Cypher, AlastairGreen, whopublished the aforementioned GQL manifesto[Gre18] among otherthings. That also explains whysome information about the progress of GQLispublished on the website of Neo4j, like the notificationthatthe GQL project wasapprovedbythe nationalstandardization bodies [Gre19a].

2.4.2 Gremlin Gremlin [Rod15]isalow-level graph query languagethat supportsboth, thedeclarative patternmatching style but alsothe imperativegraph traversalstyle [HP16, RWE15].

8https://db-engines.com/en/ranking/graph+dbms 9https://www.opencypher.org/

14 2.4. Query Languages

Thisallows the language to be used in anygraph computingsystem,beitanOLTP graph databaseoranOLAP graph processing system[TPAV19]. Although Gremlin supports both styles,itismore imperative in nature with themainfocus on traversing the graph [AAB+17]. Together with thecompactsyntaxused in itsqueries, it is not that easytoread,especially compared to ahigher-level language like Cypher [HP16].

Gremlin wasintroduced together with theApacheTinkerPop open-source project10 in 2009 and is stilldesigned and developedbythis project[Rod15]. TinkerPop provides a graph computing framework that is not bound to aspecific vendorbut canbeintegrated into other database and processing systems. Gremlin is an integral partofthis framework and denotes not onlythe query language,but alsothe graph traversalmachine called Gremlin TraversalMachine (GTM).The GTM processes queriesand allowsthe user to define and use adomainspecific query language that is “compiled” to theGTM. Systems thatintegrate theTinkerPop framework are called TinkerPop enabled. Notable examples of suchsystems are theAmazon Neptune11 graph database and themulti-model databases OrientDB12 and AzureCosmos DB13.

As an interestingside node, parts of theTinkerPop framework including theGremlin APIs originated in thedevelopment of theNeo4j database from 2007 onwards [FGG+18b, Lin18]. While early versions of theNeo4j database supported thedirectuse of Gremlin, this has been removedbut is still possible viaaplugin likeNeo4j-Gremlin14.

2.4.3 PGQL

PGQL [vRHK+16]isahigh-level, declarativegraph querylanguage designedand devel- oped by Oracle as partoftheir Parallel Graph AnalytiX (PGX)15 framework.The first versionofthe language wasintroduced in 2016 and it is stillunder activedevelopment, with version1.3 [Ora20]being released in March 2020. In contrast to other languages like Gremlin, PGQL is closely aligned to SQLand supportsmanyconceptsand keywords fromSQL.These areextendedbygraphspecific conceptsand algorithms like powerful path expressions and graph pattern matching [AAB+18, AG18]. As in manyother graph query languages, the output of aqueryisstructured in tabular form. Thisallows PGQL queries to be nestedinside,and usedseamlesslytogether with SQL queries.

The language is published as an open sourcespecification together with aparsing software16.Implementationsofthe language exist in research projects as adistributed [RTH+ 17]and non-distributed version[SHvR+16]and it is usedinthe Oracle (Big Data)

10http://tinkerpop.apache.org/ 11https://aws.amazon.com/neptune/ 12http://orientdb.com/ 13https://azure.microsoft.com/services/cosmos-db/ 14https://github.com/neo4j-contrib/neo4j-tinkerpop-api-impl 15https://www.oracle.com/middleware/technologies/parallel-graph-analytix. 16https://github.com/oracle/pgql-lang

15 2. Graph Databases

Spatialand Graph databases 17.Thesedatabases support the propertygraph as well as theRDF data model but one has to keep in mind that while in some configurations an in-memorynative graph storage via the PGX frameworkisused, the data is kept in arelational database in other configurations [BPG+19]. Overall, the SQL-like syntax of PGQL tries to lure theSQL communitythatisalready using Oracle products but lacks standardizationand support by the broader community and other database vendors [TPAV19].

2.4.4 GSQL GSQL [Tig]isahigh-level querylanguage developedbyTigerGraph Inc, the company behind the TigerGraph database. Awhite paper for thelanguage [WD18]togetherwith one for thedatabase [DXWL19]werepublished in 2018 and 2019, respectively. Similar to PGQL,GSQL hasanSQL-likesyntax and goes even further,asaGSQL query that does not containgraph-specific primitives is astandard SQL query. In that sense,GSQL extends therelational SQLtoincludeprimitives to query and analyze graphs. The goal whendeveloping theTigerGraph database wastodesign “aproperty graph databasefor tomorrow’s big data and analytics”[HLP+19]. To achievereasonableperfor- manceingraph analytics workloads on graphs spanning billions- or trillions of nodes and edges, thedatabase provides computation in massively parallel processing fashion.This objectivealsoinfluenced the design of GSQL thatprovidesnot only primitives forOLTP workloads but also allows forthe specification of iterativealgorithms and supports the MapReduceinterpretation [Wu18]. Anotherdesign decisiontowards better performance, thatalso came from the tightalignmentwith SQL,isthe use of aschema and . In contrast to all other systems and languages we looked at untilnow,one has to specify the schemaofagraph in GSQL’s data model.The schemadefinesthe typesof verticesand edges in thegraph and theirpossible relationshipstoeachother [Tig]. This allowsfor abetter queryoptimization, space savingswhen storing huge graphsaswell as moresophisticated securityand privacy features like limited access to parts of the graphs for some users [Wu18]. Although thelanguage and thesystem built aroundthe language isdesigned to handle and analyzehuge graphs, thesupport for OLTP workloads and unique usage of aschema made us include thelanguage in our comparison.

2.4.5 G-CORE G-CORE[AAB+18]isahigh-level query language designed in acollaboration between industry and academiaunderthe patronage of theLinked Data Benchmark Council (LDBC)18.Introduced in 2018, thequery languageintegratessome of themainfeatures frommultipleexisting languages like Cypher, Gremlinand PGQL in order to provide

17https://www.oracle.com/database/technologies/spatialandgraph.html, https:// www.oracle.com/database/technologies/bigdata-spatialandgraph.html 18http://ldbcouncil.org/

16 2.5. LDBCSocialNetwork Benchmark acore for future graph query languages [AG18, ARV19]. As it is acore andreference for futuredevelopments, thequery language does notsupport DML or DDLoperations. One of themajor differencescompared to other querylanguages is that G-CORE is a composablelanguage, meaningthataquerytakes agraph as input, processes it and outputsagraph as well. This allowsqueriestobechained or nested, as an output of a query canbedirectlyused as theinputofanother query.Asecond major noveltyisthe treatment of pathsasfirst-classcitizens. Theycan be stored next to nodesand edges in an extended property graph datamodel and can alsohavepropertiessimilartonodes and edges. We willanalyzeand explain this in more detailinSection 3.6. As G-CORE is aresearchlanguage, thereare onlypartial implementationsinopen-source researchprojects like aparserthatisavailable on GitHub19.

2.5 LDBC SocialNetwork Benchmark

We analyze theselanguages based on theircharacteristicsand commonlanguage features and compare them by implementing multiple queriesinChapter4.Topickqueries thatrepresentapotential use case in areal world scenario,weuse theschema and some queriesfrom the interactiveworkload of LDBCs Social Network Benchmark (SNB) [EAL+15]. The goal of this benchmark is “todefine aframework where different graph basedtechnologiescan be fairly testedand compared ”[AAA+20]and is thereforeaperfect fit to test and compare graph query languages.Itisdevelopedbyaninitiativefrom major actors in industry and academiaand currently contains two groups of workloads: the businessintelligence workload that focusesonOLAP queries, and theinteractive workload with afocus on transactional queries. We limitourselves mostlytothe interactive workload as this one fitsour focus on OLTP queries. The latest stable version at thetime of writing is v0.3.2[AAA+20]and we use mainly this versioninour experiments. On one occasion, namely for queriesthatremove data items, we use the currentsnapshotofversion 0.4.0[LDB20]asdata deletions are not presentinearlierversions. Theunderlying schemaaswell as all other queriesinuse did not change betweenthese versions, so we can safelyuse thenew additions together with the model fromthe stableversion. The interactive workload itself contains queriesthat are grouped into the followingquery classes:

• Short Reads (IS, interactive short): This class contains 7short read-only queries thattouchonly asmallpart of thegraph.

• ComplexReads(IC, interactive complex): Thisclass contains 14 complex read-only queriesthattoucharathersignificantamountofnodes and edges, most often in the neighborhood of astarting node.

19https://github.com/ldbc/ldbc_gcore_parser

17 2. Graph Databases

• Insertions (IU, interactive update;inthe snapshotofversion0.4.0:INS, insertbut with thesame queries): This classcontains insertions of either singleedges or nodes together with connections to other, existing nodes.Contrary to thename of these queriesinthe stable version, interactive update,the benchmark does not contain anyqueries thatfocus solely on the updateofexisting entities.

We furthermore use the class thatcontains deletion queries(DEL), that is nowcontained in the business intelligence workload.This class is only presentinthe currentsnapshot of version0.4.0 and contains queriesthat remove either asingle edgeoranodetogether withall incident edgesand, if needed,nodes that depend on theinitially deleted one. The queriesare enumerated inside eachclass and use the abbreviation of theclass as aprefix, i.e. the first query in the class of shortreads is IS1.Wewill state which querieswere taken from the benchmark in the next chapter whenweintroducequeries representing the language features.

2.5.1 Schema The queriesofthe SNBare defined on themodel of asocial network, hence the name. The schema in UML notation can be seeninFigure 2.7. It revolves mainly around persons interactingwith one or multipleforums. Tags associatedwith forumsormessages allow groupsfor specific topicsorinterests. Aperson can create or likeamessage, which in turn canbeeitheratextoranimage, but not both. Persons in aforumcan further reply to postsorprevious commentsbyreplying with anew comment. The schema also supports friendship relationships between multiple persons via the knows edge. Although this edgeisgiven as adirected edgeinFigure 2.7, thenewer versionofthe benchmark states that it shouldbetreated as undirected, thereforerepresentingmutualrelationships between two persons [LDB20]. The schemaalso allowsfor furtherinformationtobe stored,including the citythatapersonisliving in and the universityorcompanythey are studying respectively workingat. Apartfrom the queries, the benchmark comes with adata generator20 thatgenerates parts of thedatastatically and parts of it dynamically. Thiscan alsobeseeninthe schema butisofnofurtherconcerntousaswefocus on thequeriesworkingonthat model. The generated data using thedefaultparameters representsasnapshotofa potentialsocialnetwork duringaperiodofroughly 3yearswithmanycharacteristics taken from real world social networks. As we do notcomparethe database systems but onlythe query languages,itisnot necessarytogeneratedata and store it in a database. Nonetheless, we did generate and import thedataintoaNeo4j, aTigerGraph and aTitan21 database.Titanisascalable graph database with nativesupport of the TinkerPopstack, thereforesupporting Gremlin asagraph query language. Thisallows us to run the queriesfrom the benchmark as well as implementand test ourown queries on these databases and the supported query languages. 20https://github.com/ldbc/ldbc_snb_datagen 21http://titan.thinkaurelius.com/

18 2.5. LDBCSocialNetwork Benchmark

Figure 2.7: Thedata modelinUML notation of theSNB. Source:The LDBCSocial Network Benchmark, Figure 2.1[AAA+20].

19

CHAPTER 3

QueryLanguage Features

Although thereare many graph querylanguages based on different ideas thatvary in theirstyle, expressiveness or purpose, they all share aconceptual core[AAB+17]. This coreincludesprimitives thatare basedongraphfeatures like graph patterns, paths and neighborhoods(adjacency)[AG18]. As briefly mentioned whenweintroduced the languageGin Section 2.4, PatternMatching Queries formthe basisofevery graph query language. Apartfrom them,primitives thatallowustonavigate agraph on itspaths form asecondmajor feature that we denoteasPath Queries.Weadd the feature of Structure Independent Queries for queries and primitivesthat do not consider thegraph structure. Thisisanextensionofthe summarization queriesfromAngles [Ang12], that contains simple aggregate functions to summarize theresultsaswell as further operations thatdonot rely on thegraph structure. As thesefeatures formthe basisofevery graph query language, theirunderlying primitives are well studied [CM90, BLR11, LV12, Woo12, Bar13, LMV16, TKL19]. We willintro- duce these groups of queriesinmore detailand giveexamples forthem. The examples are then implementedinChapter4to compare thelanguages.Westart with an introduc- tionofstructureindependent andother summarization queriesinSection 3.1. Pattern matching and pathqueriesare explained in Section 3.2 and Section 3.3, respectively. We continuewith two further features that do not concern thequeryingpart of thelanguages but theirabilitytomanipulate and potentially define the structure of the data: Data Manipulation in Section 3.4 and Data Definition in Section 3.5.

3.1 Structure IndependentQueries

Thisfeature of agraph query language providesprimitives that allow for queriesthat mainlydonot consider thegraph structure but focus on theinformation stored in entities. In itssimplestform, astructureindependent query operates on theentities (either nodes or edges).Itprovidesprimitives thatallowthe usertoselectsuchentities and

21 3. QueryLanguageFeatures

theirpropertiesaswell as aggregate functions that summarize structure independent information. Asimple example of suchaqueryinthe setting of theLDBC SocialNetwork Benchmark(SNB)would be to select the userwith agiven first and last name. We do notset thename in thequery but theuser to pass it viathe $firstName and $lastName parameters:

Selectthe personwith firstName=$firstName and lastName=$lastName. (1)

Anotherexample,this time fromthe benchmark itself, would be the query IS4(interactive short4)that selects amessage with agiven id. Thequery expects the $messageId as input and returns thecontent andthe creation dateofthe message. It is moresophisticated compared to theprevious one as amessage can be either apost that contains an imageFile,apost that contains atext-only content or acomment with atextual content.Ifthereisanimage, theresultshould containthe image file, otherwise the content.

Given a$messageId,retrievethe contentofthe message and thecreationdate. (2)

These structure independent queries are not limitedtoselectasingle entity, they can forexample alsoreturn all entities of agiven type,likeall edges labeledwith knows,or containmore complicated selection criteriathansimple equality. As brieflymentionedbefore, the feature alsoincludes summarization queriesand therefore primitivesthatallow us to summarizeagraph,orparts of it.Theseprimitives are often applied on theintermediate result of aquerythat selects partsofagraph, and aggregate the intermediate valuestoreturn asingle one.Manygraph query languages support the same aggregate operationsasSQL, namelycount, sum, max, min and average, and possiblyfurtherdomain- or graph-specific ones [Ang12]. An exampleofsuchaquery would be to calculate the average number of spoken languagesoverall persons:

Retrievethe average number of spoken languagesper person. (3)

Most query languages alsocontainprimitives thatallowustocompute properties of a graph itself, not the data contained in thegraph. Examples forsuchpropertiesare the number of vertices (order),the diameter,the average degree of anodeand so on. In some definitions, likein[Ang12], queries basedonsuchprimitives alsobelong to the group of Summarizationqueries. However, we do not include theseprimitives in our definition of this feature as queriesusing them focusonthe structureofthe graph and thereforeare not structure independentanymore.

3.2 Pattern Matching Queries

Before we start with an explanation of pattern matching querieswewill introduce an extension of our smallsocialgraphfrom Figure 2.4 that allowsustogointomore detail

22 3.2. Pattern Matching Queries

Figure 3.1: Avariation of our smallsocialnetwork exampleincludingpostsinthe propertygraph data model. in this section. The variation of thenetwork is given in Figure3.1 where we removed Alice from thenetwork and instead added twoposts, i.e. nodeslabeled as Post,each containing itstitle. Furthermore, the graph nowcontainsrelationships between theusers and theposts, given by edges labeled likes and created. As mentionedbefore, matching graph query patterns on agraph form a, if not the, major operation on graphsand thereforeare at thecore of manygraph query languages.Most languages are designed startingwith graph patterns and other features are added to these patterns [ARV19]. Theidea behind suchagraph pattern is to specify theshape of the result via apropertygraph 1 itself. In difference to thedata graph,aquery graph is not limitedtoconstants but canalso contain variablesinstead of anyconstant[AAB+17]. Thisallows us to query forinformationand return the valuesmatched by the system. Suchaquerypattern defines aclass of subgraphs[BPG+19]. The system in turn matches the querypattern to the data graph by findingoccurrences of thesesubgraphs thatmaps possible variablestoconstantswhile preservingthe structure. In itssimplestform, as explained untilnow,suchapatternisdenoted as basic graph pattern(bgp).Asanexample of such abgp we can queryfor relationshipsbetween the post with thetitle “GQL” and usersinour small socialnetwork. Figure 3.2 depictssuch aqueryasapossible pattern in apropertygraph data model. Manymodernquery languages do not support queriesinagraphical representation but onlyinatextualone. ADatalog related syntax is often used whenstudying the theory behind query languages and consistsof(node, edge, node) triples.Assuming a simpler datamodel without node and edge IDs, thequery could be represented in sucha syntax as: (?W, ?X, GQL), (?Y, ?Z, GQL) where W and Y representnodevariables

1Assuming thatthe system uses apropertygraph datamodel. In other words, suchagraphpattern is given in the same shapeand style as the data model used by thesystem and query language.

23 3. QueryLanguageFeatures

Figure 3.2: An exampleofabasic graph query searching for relationships between a specific post and users in ahypothetical graphical querylanguage.

and X and Z edge variablesthat are matched against constants in thedatagraph. Most modern query languages used in industry relyonanASCIIart representation of thequery thatmimicsconnections between entities via “−”, “>”and “<”signs. The greater and smallersignsare usedtospecify thedirection of arelationship,whereas the lack of these signs describes an undirected relationshipinmanylanguages.Apossiblerepresentation of thequery in aCypher-related syntaxlooks likethis: (w:User) -[x]->(:Post {title:GQL}) <-[y]- (z:User)

In termsofrelational operations, basic graph patterns allowfor selectionbased on equality via the use of constants,like title:GQL,aswellasnatural joins viathe patternand the graph structure itself. These basicgraph patterns canbeextended by furtherrelational operations: union, difference,projection, left outerjoin(optional)and more sophisticated filtering (selection) rules. Theoptional operator is especially interesting in most graph databases as it allowsustoquery incomplete information that is caused by the lackofa schema in manysuchsystems [ARV19]. Aquerypattern using suchextendedprimitives is denoted as complexgraph pattern(cgp).The risinginterest in graph databases over the last decade broughtthe invention of further additions likeinexact and approximate matching to graph patterns [CL14, AG18, FPSW19]. However, we limitourselves to bgps and cgps as mentionedabove as theseform thecoreofmost modern querylanguages. Furthermore, additionslikeinexact matching are stillinaresearchstate and not yet supported by anyofthe majorgraph query languages. [AAB+17] We have already seenanexampleofsuchpatterns in the previous chapter:the query graph Q1 in Figure 2.6 is abasicgraphpattern with asingle variable.Furthermore, if we disregard theregular expression in Q2 for now, the query Q = {Q1, Q2} forms acgp using the unionoperator.

3.2.1 Evaluation Semantics We will nowlookatdifferentevaluationsemantics in use by the query languagesand thesystems using theselanguages as thesemantics can influence theoutcome of aquery. Sincecgps are an extension of bgps, theirevaluation semantics are alsobasedonthe ones frombgps.For this reason, we limitourselves to bgps in this part. As brieflymentioned before, abgp is evaluated by matching the patternontothe data graph such that the variables are replacedbyconstants while thestructure is preserved, at least in some way.This means, that the system searches for allsubgraphs in the data graphthat are isomorphic to the bgp [AG18]and thereforeclosely resemblesthe subgraph isomorphism

24 3.2. Pattern Matching Queries problemthatisNP-complete [Yan90]. One canidentify twomajor approachesfor the evaluation semantics of abgp: thehomomorphism- and theisomorphism-based approach [AAB+17].

• In the homomorphism-basedsemantics,the matches of apattern on the data graph are not restricted.This means that theevaluation of abgp Q on agraph G consists of all homomorphisms from Q to G and thereforeresembles thesemantics of queries in relational databases. As it is used in therelational world and heavily studied in theory,this is the most commonsemantics [ARV19]. However, when using this strategy thedatabase designer has to deal with thepotential problemofinfinite resultsetswhen enumerating the paths in apatternaswedonot restrict the matches [AAB+18].

• Under thetypeofisomorphism-basedsemantics,the matches of apatternare restricted in some sense. This restrictionmostlyconcerns thedistinctness of matchingsfrom variables to constants. We can further refinethis type of semantics based on therestriction it enforces:

– If no twovariables can be bound to the samedata item in amatch, we are limited to injectivemappings and this is denoted as no-repeated-anything semantics.Asaneffect,the evaluation of apattern underthis semantics preserves the structure of the query patternasneitheredges nornodes can “collapse” by binding to the same data item.

– If therestriction onlyappliestovariables on nodes, we call it no-repeated-node semantics.

– Consequently, if the restrictiononly applies to variables on edges we speak of no-repeated-edge semantics.

Note that thelimitations on theno-repeated-node and -edge semantics apply only to variablesthat map to IDs, i.e. variablesfor labelsorpropertiescan stillmap to the same dataitem.

If we evaluate thebasicgraphpattern from Figure3.2 on thenew variation of our social network given in Figure3.1,weget thefollowing non-restricted matches:

25 3. QueryLanguageFeatures

x1 x2 x3 x4 x5 x6 x7 x8 x9 n2 Carol e5 created n1 e4 likes n3 Bob n3 Bob e4 likes n1 e5 created n2 Carol n2 Carol e1 likes n1 e4 likes n3 Bob n3 Bob e4 likes n1 e1 likes n2 Carol n2 Carol e5 created n1 e1 likes n2 Carol n2 Carol e1 likes n1 e5 created n2 Carol n2 Carol e1 likes n1 e1 likes n2 Carol n2 Carol e5 created n1 e5 created n2 Carol n3 Bob e4 likes n1 e4 likes n3 Bob

As thehomomorphism-based semantics does not restrict the matches,all of them are valid. Only the first two matches are valid under the no-repeated-anything semantics as the latterones bindatleasttwo variables to thesame dataitem.Notethatwedonot onlyrestrictthe variablesmappingtoIDs but all variables, including the ones mapping to labels or properties, in this semantics.Anexample of this is the third resultthat is notvalidasthe variablesx4and x7 are both mapped to the label likes.The first four matches are valid under the no-repeated-node semantics sincethey donot bind variables that map to node IDs to the sameID. As we do notrestrictthe mapping of edge variablesthere canbemultiple suchvariables mapping to the same edgelabel,as is the case in the matches 3and 4with the label likes on x4 and x7. Therefore, an evaluation under this semantics still preserves the structure of thequery pattern as the nodes cannot“collapse”, while it does not further restrict therelationsbetween them. Under theno-repeated-edge semantics thefirst 6matchesare valid as thelast threebind the twovariablesmapping to edge IDs, x3 and x6,tothe same ID. The semantics explained above solelyfocus on asingle matchand possiblyrestrictsuch amatch. Thisisenoughinthe setting of bgpswherenotwo suchpatterns can be evaluated to theexactsame match.However, whendealing withthe evaluation of cgps and thereforewith further operators like union, we can have duplicate matches regardless of thechosensemantics.Inthat case we can further distinguish between a set semantics, meaningthat the evaluation of apattern is accumulated in aset of matches and therefore cannot containduplicates, and a bagsemantics thatallows duplicates of matches in the result. According to the choice of either set or bag semantics we can for example speak of ahomomorphism-based bag semantics or ano-repeated-node-based setsemantics. [AAB+17]

Each of thesemantics can be desired in some application scenarios and thereforethere is no single one used in all database systems and query languagesbut each systemis basedonone semantics. While some systems onlysupport asingle semantics, others like Neo4j with theCypher query language allowfor queryconstructs thatresult in a differentevaluationsemantics.Still otherlanguageslikePGQL even provide keywords thatallowthe user to switchtoadifferent semantics on aquery by querybasis.

26 3.2. Pattern Matching Queries

3.2.2 Queries Nowthat we understand the concept behind thesequerieswewill look at some examples from theLDBCSNB. The query IS1(interactiveshort 1) representsasimple example of aqueryusingabasic graph pattern.

Given a$personId, retrievethe first name, last name, birthday, IP address, browser, gender,the creationdate of the entryaswell as thecityofresidenceofthe person. (4) Thisquery canalsobegiven via the graph pattern in Figure3.3 thatshows the single directed relationship betweenthe twoentitiesinquestion.

Figure 3.3: Query 4(IS1) as agraphicalquery pattern. Source: The LDBCSocial Network Benchmark, page56[LDB20].2

Note that if thequery did not askfor thecityofresidence, it is astructureindependent queryasthe other information is stored directly with the person. Atypical query in the setting of asocial network is to searchfor friends of aperson.IS3 represents this in asimple manner where all friends of agiven personand the creation date of their relationship are returned.

Given a$personId, retrievethe id, first name and lastname of all theirfriends as well as thedateofthe friendship-creation. (5)

Suchpatternmatchingqueriescan alsogenerate boolean valuesthatindicatewhether there exists apossiblerelationshipbetween some entities.Asanexample,the query IS7 fetches the author of agiven message as well as all persons thatcreated adirectcomment to it. The optional relationship,denotedasknows,indicates whether thesetwo persons knoweachother.

Given a$messageId, get alldirectcomments replying to this message and abooleanvalue that indicates if the authors knoweachother. (6)

2The graphicalrepresentations of thisquery and allfollowing ones are taken from the currentsnapshot of version0.4.0 of the benchmark [LDB20]. The queries themselvesare identical between the stable version 0.3.2 [AAA+20]and the currentsnapshot,but some graphicalpatternsinversion0.3.2 do not depict the specified query.

27 3. QueryLanguageFeatures

Agraphicalvisualization of this pattern can be seeninFigure 3.4,wherethe dotted line indicatesthe flag, or optional relationship.

Figure 3.4: Query 6(IS7) as agraphical query pattern. Source: The LDBCSocial Network Benchmark, page 60 [LDB20].

3.3Pathand Navigational Queries

Graph patterns as explained untilnow allow us to queryfor boundedand fixed rela- tionships between entities in agraph.However, they are not powerful enoughfor many queriesthatnavigatemore complexrelationshipsinagraph.This in turn is one of the coremotivations of using agraph database,asrelationships either containinformation themselves or tell us something about theentities they connect. Path queriestacklethis problembyprovidingqueriesthatespeciallyfocus on one thing: paths. Navigational queriesthen integrate path queriesintopatterns to allow formore sophisticated data retrievals.

3.3.1 Path Queries In itssimplestform, apath querycan test for theexistence of an unrestricted path between twonodes. Thisisaninherently recursiveoperationasthe database engine traverses from one node to another, often without anybounds on thelengthofthe path. Compared to theworld of relational databases, where recursiveoperations are supported onlyinalimited way, theseoperations are heavily usedingraphdatabasesand form one of themajor advantages whenqueryingrelated information. Apartfrom this simplistic form thattests for theexistence of an unrestricted path, we can add some criteriathat the path has to satisfy.Asimple way to achievethis is to allow restrictions on the path by requiring the edges to have aspecific label.Ifwethink about asocial network, this allows us to searchfor apath between twousersthatislimitedtofriend-relationships(edges labeledwith follows or friend forexample). In general,suchqueries canbegrouped as Reachability Queries and we will explain further characteristics and ways to specify restrictions on paths in thefollowing partonRegular Path Queries. [AG08, vRHK+16] Apartfrom testingfor theexistence of apath,pathqueriescan alsobeused to retrieve nodes connected by apath.Asanexample, this allowsustoretrieveall nodes on paths

28 3.3. Path andNavigational Queries

of theform friends-of-friends in asocial network, either with arestricted lengthoreven with arbitrarylength. Friends-of-friends,orfriends-of-a-frienddenotes acommonlyused query in social networks. Startingfromagiven node v,the query navigates to alladjacent nodes with the restriction thatatraversed edgehas to denote afriend-relationship. After the friendsofvare explored, it continues to theirfriends. Note here that,causedbythe lackofadefinitionfor this query,itiseitherused as explained where the result contains all friends andthe friendsofthe friends, stoppingatpaths of length2.Onthe other hand, friends-of-friends can alsodenote thequery thatdoesnot stop at pathsoflength 2but either at auser-specifiedlengthorwhen the full graph isexplored, meaning an unrestricted length. Assuming that we stop after twohops,the query does notonly containthe existing friendsofaperson,but alsothe friendsoftheir friends, whichcan be usedtosuggest new friendsortoanalyze thenetwork around aperson [AAB+17]. More generally,this operation traverses theneighborhood,oradjacency, of agiven node. This is an important operation and manygraph databasesprovide dedicated operators for suchqueries,that are alsodenoted as Adjacency Queries [AG18]. An exampleofsuchan operators would be ak-neighborhood one that returns theneighbors up to distance k fromagiven node [Ang12]. Takentogether, Angles et al. summarize theconcept behind path queriesasfollows: “Theideaofapath queryistoselectpairs of nodesinagraph whoserelationship is given by apath (ofarbitrary length) satisfyingcertainproperties.” [ARV19]. This canbe rewritten to deriveacompactform of apathquery P given in [AAB+17]: P = x −→α y , where α specifies therestrictionsapath between x and y hastosatisfy suchthat the node pair (x, y)isselected.What we lack until nowisanotion that allows us to express the conditions on thepath,i.e. α.

Regular Path Queries When looking forpossible notations for this, we probably end up with some variant of regular expressions over theset of edge labels,asthis form is usedinmost languages until now[LMV16, AAB+17]. Thisnotation is formalized as a RegularPath Query (RPQ) and wasintroduced in [CMW87]togetherwith thegraphical query language G. At its core, an RPQ “selects nodes connectedbyapaththat belongs to aregular language over the labeling alphabet” [LV12]. That tellsustwo things. First, that suchaquery onlyever selects pairsofnodes that are connected by apath.And second,that anodepair is only selected if theconcatenation of thelabelsonapath connectingthem form aword in the languageofthe regularexpression[AAB+17]. As asimple example of this, we can write the friends-of-friendsquery with unrestricted length forCarol in thecompact formofa path query as: + P = Carol −−fo−−llows−−→ y.

Thisretrieves all node-pairs (Carol, y) thatare connected by apath of lengthat least 1, with therestriction that all edges on thepathare labeledasfollows.Note thatwecan alsouse theKleene Star“∗”inour query,and, depending on thechosen

29 3. QueryLanguageFeatures

evaluation semantics,the query:

∗ P = Carol −−fo−−−llows→ y can either givethe same resultasthe one from above, or can additionallycontainthe node pair (Carol, Carol).Furthermore,wecan alsoquery for the unionofpaths via the “|” symbol.Asanexample,wecan search for all poststhat are either created or likedbyauser thatCarol follows as:

follows·(likes|created) P = Carol −− −−−−−−−−−−−−→ y where the“·”denotes aconcatenation. We can summarize thenotation of an RPQ as aqueryofthe form P = x −−re→g y ,where reg is aregular expression overthe alphabet Σthatcontains all possibleedgelabels [LMV16]. The pathqueryingfunctionalityofmost modern query languages is basedonthis notation as it provedtobeagood fit to specifyrestrictions or conditionsonthe inherently recursive natureofapath [vRHK+16]. As RPQshavebeen aroundsince the late 1980s and are used in languages until now, the class is well studied,especiallyalso regardingits complexity [Woo12, Bar13, LMV16]. Over time,RPQs have been extended in manyways. An RPQ as explained untilnow can onlytraverse edges in aforward direction, but onecan also be interested in traversingthem backwards.This givesrise to anotable extension, namely the class of theTwo-way Regular Path Query (2RPQ) [CGLV00b, CGLV03]. 2RPQs add support for an inverse operator a− for a ∈ Σthat denotesthat theedgelabeled a is to be traversed backwards. Let’s look at an example of thisinoursmall social network fromFigure 3.1. We can queryfor thefriendsofall users thatliked apost p with the following two-way regularpathquery:

− P = p −−like−−−−−−s ·follo−→ws y. Starting fromagiven node p,the post in our case,the query follows all incoming likes- edges in abackward manner and from thereall outgoing follows-edges in aforward manner. There are further variantsofRPQs thatfor exampleadd variables[Woo90]ornode tests [BPR12]topath queries. We will introduce some of these as we progress with our explanation of navigational queries. As RPQsallowustoexpress path queriesand are usedinmost modern query languages,atleastinsome variant,the terms path query and RPQ are both usedtodenotethis class of queries. The difference is thatapath query denotessuchaquery in general, regardlessofthe underlying variantused by the query language, be it forexample asimple RPQ or a 2RPQ.Incontrast,the term RPQ is more relevant in theoretical literaturewhereitisimportanttostate and definethe variant thatisused. We will alsouse this approachfrom nowonand denote the generalclass of thesequeriesaspath queries and use thecorresponding theoreticaltermswhen we speak about theunderlying variants.

30 3.3. Path andNavigational Queries

3.3.2 NavigationalQueries Early graph query languages, likethe language Gthatwelookedatinthe previous chapter,allowustorestrictthe labels on apath,whichisenoughtoexpress the friends- of-friends queryfor example. However, one often wantstoexpress more sophisticated constraintsthat are not limitedtothe edgelabelsonthe path alone. With path queries as explained above, we can query for apath between two users in our social network over friends-relationships. If we want to query forsuchapath with thefurther restriction that aspecific user is visited on thepath, pathqueriesare not expressiveenoughaswehave no waytoexpress suchanintermediary node. Thisleadsustonavigational queries,also called navigational graph patterns (ngps),thatallowfor acombination of pathqueries and basic graph patterns. When looking at it from theperspectiveofapattern, we are nowable to replaceanedgeinapatternbyapath that is given by aregular expression [AAB+17]. Therefore, we can write thequery thatrequires aspecific user n to be on the path as follows:

friend+ friend+ x n y

We have already seenanother basic example of suchanavigational query whenwe + introduced the languageG.The expression w on thequery graph Q2 in Figure 2.6 is a regularexpression thatisembeddedinabasic graph pattern. As paths in ngpsare given as regular expressions over edge labels, and theseare nothing else then RPQs, we can alsolookatngps from theperspective of regularpathqueries. From there, ngpsare an extension of RPQs andform the class of Conjunctive Regular Path Query (CRPQ) [CM90, FLS98]. Queriesinthatclass candeal with intermediate nodes on paths [BLR11], which is exactly what we needed in ourmotivating example. The term “conjunctive” in CRPQ comes from the fact thatbyallowing intermediate nodes on apath,weessentially query forthe conjunctionofmultiple paths, eachgiven by an RPQ,thatshare the intermediary node(s)[LMV16]. Takentogether, anavigational graph pattern is agraph patternwhere nodes are either constants or variables, and edges can be labeledbyconstants,variables or RPQs[AAB+17]. Similar to basicgraphpatterns and the queryinFigure 3.2, thepatterns in asystem builtonthe propertygraph model canbefurtherenriched by elements from the property graph.Therefore, we can for example also restrictthe nodes to have aspecific label,as we did in the example with bgps in said figure.

FurtherVariants CRPQs, and navigational graph patterns in general, received lotsofattention throughout the last decades as they formthe base of manymoderngraph query languages [TKL19]. Thismanifested in thetheoreticalstudy of this class [CGLV03, BLR11, Bar13]aswell as the invention of newvariations of it.Asexplained until now, the addition of pathqueries is limited to basicgraph patterns. Therefore, anatural variation also allowspathqueries

31 3. QueryLanguageFeatures

on complexgraph patterns, forming complexngps (cngps).However,wewanttonote that this differentiation, and theclass of cngps, is notalwaysenforced, most likely since the term ngp often denotes thewhole class and all itsvariations. In themore theoretical terminologyofRPQs, we can alsoadd the inverse operator of 2RPQstoCRPQsand derivethe class of conjunctive 2PRQs (C2RPQs,also 2CRPQs)[CGLV00a]. U2CRPQs in turn denote afurtherextensionthatalso includesunions. However, all of thesevariations are stillnot expressive enough for several queriesin modern use cases and applications. Under thedefinitionsofRPQsand theirvariants until now, the evaluation of thequery P = x −→α y on agraph G, P (G), contains thenode pairs (x, y)thatare connected by apath that satisfies α.However, we cannot output the path or process it,for example by comparing it to othersinthis evaluation.The class ExtendedConjunctive RegularPath Query (ECRPQ) [BFL12, BLLW12]tackles this problembyallowingpathstobenamed, includedinthe outputand compared to others [LMV16]. Thisessentially raises paths to first-class citizens and as of nowseems to be one of themost promising variantsfor currentand future graphquerylanguages. [Bar13] There is amyriadoffurther variantsusing multiple concepts, but, as this is not thefocus of thethesis, we will onlyintroducetwo moreapproaches. In all variantsintroduced until now, we are limited to asingle recursiononapath viaaregular expression. Nested regularexpressions (NREs) [PAG10, BPR12], originallydeveloped to querydatainthe SemanticWeb,alsoallowfor anotherform of recursion.They are an extension of 2RPQs3 with the addition of abranching operator · that allowsRPQs to be nested. Theclass RegularExpressionswithMemory(REM) [LV12, LMV16]onthe other handcovers a differentuse case, namely that of the integrationofsocalleddata values on apath. A data value in this case is another word foraproperty of anodeinthe propertygraph data model.Anexample of this would be the age of aperson in asocial network. The basicidea is thatone canspecify whenadata value is stored and can than use these stored values. This allowsfor queriesthat arenot only limited to restrictionsofthe path via aregular expression, but can also deal withthe data valuesofthe nodes on thepath. Assume that we are interested in avariation of thefriends-of-friends query where we onlytraverse to friendsofthe same age. With REMs, we areabletostore the data value (age) of thefirst node on thepath, then traverse an outgoing friends-relationship, read the stored data value and compare it to thenew one.

3.3.3 Query Evaluation:Semantics, Outputand Complexity Nowthat we have introduced path and navigational querieswewill takeacloser look at theirevaluation in general as well as some semantics,possibleoutputs of suchqueries and finishwith anoteonthe evaluation complexity. We limit ourselves to pathqueries in this sectionasthe evaluation of navigational ones can be derived fromthis, together withthe evaluation of bgps (and possiblycgps). 3Note that they are aclass of pathqueries in theiroriginal form, therefore there are no patterns in these queries.

32 3.3. Path andNavigational Queries

Queryevaluation and semantics

As mentionedabove,the evaluation of an RPQ P = x −→α y on agraph G contains the node pairs (x, y)thatare connectedbyapath that satisfies α.ECRPQsonthe other hand alsoallow us to retrievepathsasaresult of aquery. From thepractical perspective of adatabase system however, the system has to find all paths that match agiven pattern before it canprojecttothe desired elements. As an example, let’slookatthe evaluation of theRPQ P fromabove.Inafirststep, the systemhas to find all paths in thegraph thatsatisfy α,and canthen projecttothe start and endnodes of thesepaths to get (x, y) as output. Therefore, when looking at theevaluation semantics of pathqueries from a practical perspective, it does notmakeadifferencewhether thequery language allows the output of paths or not.Toexplain the evaluation semantics in this section, we will look at the paths that are found during theevaluation of aquery in adatabasesystem.

Before we can explain theevaluation of pathqueries, we needaformal definitionofa path and itslabel in agraph G.Apath π is asequence n0e1n1e2 ...nk−1eknk | k ≥ 0 where each ei is an edge between thenodes ni−1 and ni.Asbrieflymentionedbefore, the evaluation of apath querycontains onlypaths whoseconcatenation of theedge-labels form aword in thelanguage of theregular expression. Lab(π)denotesthe label of π, ∗ 4 given as Lab(π)=a1a2...ak−1 ∈Σ ,whereeach ai denotes thelabel of ei and ai ∈ Σ . The evaluation of apath query P = x −→α y on agraph G,denotedasP(G), contains all paths in the databasewhose label Lab(π)satisfies α.Note that our definition of apath alsoallows forpaths of length0that containonly the node n0.Ifweuse theregular ∗ expression that acceptsall words overΣ,Σ∗,inanRPQ P = x −−→Σ y ,wedonot impose anyconstraintonthe path. Therefore, the evaluation of this query alsocontains the + emptypath with k =0and thesinglenode n0.[AAB 17, ARV19]

Thisdefinition allowsustoevaluate path queries on graphsbut alsoleadstoapotential problem: the evaluation can result in an infinitenumberofpaths. As an example, we can look at an extension of thefriends-of-friends query (with unlimited length) where we are not interested in the transitive closure of thefriends but in the poststhat they liked. Starting form afixednode n2(the userCarol), this querycan be written as:

+ P = n2 −−fo−−−llows−−−−·like→s y.

If we evaluate this query on thegraph G fromFigure 3.1, we get an unlimited number of paths caused by the cycle between Carol and Bob. The evaluation P (G)contains, among others, thefollowing paths that are given by theirnodes and edges as well as theirlabel:

4As areminder, Σcontainsall possible edge labels.

33 3. QueryLanguageFeatures

π Lab(π) n2 e3 n3 e4 n1 follows·likes =follows1·likes n2 e3 n3 e2 n2 e1 n1 follows2·likes n2 e3 n3 e2 n2 e3 n3 e4 n1 follows3·likes n2 e3 n3 e2 n2 e3 n3 e2 n2 e1 n1 follows4·likes ......

Similar to the differentsemantics when evaluating graphpatterns,there are multiple onesfor theevaluationofpathqueriesthat differmainly in their handling of duplicates in the evaluation P (G). We will nowdescribe fourcommonevaluationsemantics in use by graph querylanguages [AAB+17].

• Under arbitrary path semantics,the paths in P (G)are notrestricted in anyway. Therefore, all paths that satisfy therestrictionsofthe path query are presentin the evaluation. If we look at our example queryabove,the evaluation contains an infinite number of paths causedbythe loop in theoriginal graph thatsatisfiesthe condition in P .Asitisnot feasibletoenumerateaninfinite number of paths, one solution is to restrict them in some way. In some scenariosweare onlyinterested in whetherthere is suchapathornot,and the system canstop as soon as thefirst suchpathisfound. Anotherpossibilityistoreturnnot thepaths, but thepairs of nodes thatare connected by them,whichresultsinafinite number of node pairs. Thisresults in an outputasdefinedfor RPQs. There are further possibilities to dealwith this problembut the main takeawayisthat P (G)may containaninfinite number of paths under such an unrestricted semantics.

•The shortest path semantics restricts the paths in P (G)tobeshortestpaths only. That means, onlypaths of minimal lengthbetween anytwo nodes thatsatisfy the conditionsinP are contained in P (G). In the evaluation above, onlythe first path is valid under this semantics as all further ones are notpaths of minimal length between thenodes n2and n1.

• In case of the no-repeated-node semantics,apath thatsatisfiesthe conditions in P is onlyincluded in P (G)ifnonodeappears more thanonce on the path. These paths are alsoknown as simple paths and theevaluationofRPQsunderthis semantics hasbeenstudied extensively [MW89, ACP12, LM13], which cannot be saidfor manyofthe other semantics.Using this semantics,only the first path giveninthe table aboveisvalidasatleastone node appears morethanonce on theothers.

• The evaluation underthe no-repeated-edge semantics contains onlymatching paths where no edgeoccurs more thanonce. In our evaluation of P (G)above,only the first twopathssatisfy this as thelatter ones repeatedly takethe same route.

34 3.3. Path andNavigational Queries

Output Nowthatwehaveseensome evaluation semantics for pathqueries, we will brieflylook at differentforms of possible outputsfor such queries.Again, we look at thisfrom apracticalperspective, where auser mayfor exampleonly be interested in asimple true/false-answer and not in nodes, properties or paths. In suchacase, it is enough to output aboolean value.Asanexample, the query P = a −→α b with given nodes a and b asks whether the nodes are connected by apath satisfying α and can be answeredbya booleanvaluedepicting whether P (G)isempty or not.Inothercases,the start and end nodes (a and b in the previous example)are notgiven but we are searching forthem, i.e. we want to find the nodes connected by apath satisfying α.The outputofsucha query contains thesenodepairsand thereforecoincides with the definitionofthe output of RPQs. Apartfrom theseoutputs thathave afixed arity,wecouldalso be interested in the paths themselves whichare inherently of variable length. As we have seenbefore, including all pathsthatsatisfy aqueryunder thearbitrary path semanticscan result in alarge, and potentially infinite, number of paths. Furthermore, adatabase system offering such outputs also has to dealwith paths of potentially unlimited length. Oneapproachto get acompactrepresentation of theoutput is to return agraph consisting of thepaths thatare contained in theevaluation of thequery.This in turn would allow for query composition as aquerytakes agraph as input and would output agraph as well. We will look at this in more detail in Section 3.6. However, this is not supported by anymodern databasesystem or graph query languageexcept in the researchlanguage G-CORE as of now. Generally,the representation of paths in theoutput of aquery differsfromsystem to system as thereisnoconsensus for this until now. [AAB+17].

Complexity Evaluating apath queryonagraph is acomplex undertaking in general.Caused by theirrecursivenature,the system often needstoload, or at least go through sizable data space. Forexample, asearch for all node pairs connected by apath that satisfies a regularexpressionunder the simple path semantics (no-repeated-node) is NP-complete in the graph size[MW89, Bar13]. Thishas led to thedevelopment of manysemantics basedonthe arbitrary-pathone, thatfor instance stop as soon as one pathisfound. Barcelò showedthatsuchasemantics leads to tractablecomplexity when evaluating RPQs[Bar13]. However, this does notmean that all modern querylanguages and databasesystemsuse avariantofarbitrarypathsemantics as we will see in our analysis. [AG18,AAB+18]

3.3.4Queries: Nowthatweunderstand theconceptsand evaluation semantics of navigational queries, we will look at some examples from theLDBC SNB. In our explanation of pathqueries, we started with reachability queries that test for theexistence of apath between two

35 3. QueryLanguageFeatures

nodes.Wealso mentioned the possibilitytoconstrain suchapath,for examplebyonly allowing edges with aspecific label.The query IC13 (interactivecomplex 13) is an exampleofthat, with the addition that we do notonly test for theexistence of sucha path, butare also interestedinthe path itselfand the length of it.

Given two persons by theirid, findthe shortestpathbetween them in the subgraph induced by the knows relationships. Furthermore return the length of the path, 0ifboth ids are equaland -1 if there is no suchpath.(7)

Figure 3.5 depictsthe query as anavigational graph pattern that uses aslightly different notation forthe regularexpression knows∗,namely“knows∗0..”. The “∗”signalizesa path of some length where 0.. gives thelowerbound. Likewise, knows∗1..3woulddenote apath of lengthatleastone and maximumthree,whereall edges have to be labeled with knows.Note that we are not abletospecify that we are onlyinterestedinthe

Figure 3.5: Query 7(IC13) as agraphicalquery pattern. Source: The LDBCSocial Network Benchmark, page 54 [LDB20].

shortest pathinthis representation.The same limitationapplies to the representation as an RPQ: ∗ P = person1 −−kn−−ows→ person2.

In our introductionofpathqueries, we thenmovedontoadjacencyqueriesthat explore the surroundings of agiven node. Query IC1isanexample of suchaquerythat examines the adjacency of aperson.

Given aperson, find persons with agiven first name in theiradjacencyover knows relationshipsofatmost 3steps. Return thesepersons, their distance as well as summariesoftheir workplaces and place of studies. (8)

Figure 3.6 depictsthis query as anavigational pattern. As an advantage of sucha representation, we can get afeelingofthe difficultyofaquery in termsofaccesses to different elements, likethe ones on City, Company and Country here. Nowthatwehaveseenanexampleofboth, areachabilityand an adjacency query, we will look at general navigational query. Query IS2 gives us thelast messagescreated by auser,the originalpostsinthe conversationsaswell as thepersonsthatcreated these posts. An application could then use this informationtoanalyze relationships between thesepersonsfor example.

36 3.3. Path andNavigational Queries

Figure 3.6: Query 8(IC1) as agraphicalquery pattern. Source: The LDBCSocial Network Benchmark, page42[LDB20].

Retrievethe last 10 messages created by agiven person. Foreachofthose messages, return thatmessage, theoriginal post in its conversation as well as theauthor of that post.(9)

Thisisanavigational queryaswehaveapath of variable lengthbetween themessage and theoriginalpost in itsconversation. Whatmakes this query abit challenging is the factthat amessage can alsobeapost. In this case, theoriginalpost of theconversation will be the same message and should occurtwice in the output. Figure 3.7depicts this queryasanavigational pattern.

Figure 3.7: Query 9(IS2) as agraphicalquery pattern. Source: The LDBCSocial Network Benchmark, page57[LDB20].

Summary: These features,namelygraph patterns, paths and theircombination in navigational queriesform thecoreofmost,ifnot all,moderngraph query languages. In all example queriesthatwelookedatuntil now, we usedqueriesthatretrievedata fromagiven graph.Suchqueries form asub-language of ageneral querylanguage that is denoted as DQL. As briefly mentioned in the introduction of querylanguages in Section 2.4, a general querylanguage alsocontains operationstomanipulatethe data and sometimes even theschema. These sub-languages are in turn denoted as DMLand DDL and we will nowlookatthem,their underlying mechanicsand potentialproblems.

37 3. QueryLanguageFeatures

3.4 Data Manipulation

Queriesthatmanipulatethe data allow us to change thegraphbyadding nodes or edges, but also by removing or updatingexisting ones. These graph transformations are specified in aqueryusing primitivesthat are basedonthe previously introduced patterns and paths [AG18]. Therefore, queriesthatmanipulatethe data are expressedsimilarly to queriesthatretrievedataasbothrelyongraph-orientedoperations[AG08]. In order to updateorremove an entity, we first have to identify it. This canbeproblematic in some scenarios. Think about asocial network thatcan containmultiplepeople with the same first andlastname. To avoid an identification problem, manygraph databases use globally unique IDs that are associatedtoeachentity, similarlytothe world of relational databases. Therefore, if one wantstoremove aperson with thename of “John Doe”,wecouldfirst query forall suchpersons, select theIDofthe one that we want to delete and then send theDML query to remove the entity. Anotherthing thatwehave to keep in mindwhen removing nodes are theirrelationships to other nodes. When we remove suchanentity, allincoming and outgoingedges become invalid as theyare left with onlyone adjacentnode. It is up to thesystem to either removethese stale relationships together with the node or to manipulate them in order to achieveavalidgraphagain. There is no correctbehaviorfor all scenarios.While some systems remove suchedgeswiththe node, othersprovide explicit operators to either delete onlythe node or alsothe adjacentrelationshipsand there areevenproposals for algorithms that can deal with stale edges [SW14]. Insome scenarios, the removal of a node could alsoimplythat other, often adjacent, nodesshould be removed. Let’s look at the relationships between aforumand the postsinthis forum in the data model of the SNBinFigure 2.7. If we are to remove apost,all comments that reply to this post hang in the air.Similarly,removing aforumrendersall postsinthatforum to be without a container. Apartfrom the removal of entitieswecan alsoinsert and updateentities using such queries. This allowsfor adynamic change of thedataitself and,assuming that the database does notuse aschema as is thecase for some graph databases, alsothe data model. Let’s look at our smallsocialnetwork where anodeisidentifiedasapersononly by its User label. If we update suchanentityand change thelabel to Person,we are left with adatamodel containing both,nodes labeledasUser and Person.This gives us aglanceatthe extreme flexibilityofsuchsystems as oneisnot boundtoany schema.Onthe otherhand, querying acompletely schema-less graphbecomes nearly impossibleasfor example persons can be labeled as User, Person, Man, Woman ....This problemdoesnot only concern node labels but also edge labelsand properties in propertygraphs as we couldstore the name of aperson in asingle propertyorsplit it up into multiple ones for example. That means, that even if thedatabase system does notenforce aschema it is in theinterestofthe users to agreeonaset of labelsand propertiesand to use them throughout the database. Theflexibilitythat remains then is the adaptabilitytofuture changeswhereanew label can be seamlesslyadded to either

38 3.4. DataManipulation existingentities or newones and can then be usedinfuture queries. We do nothave theseproblems in systems relying on aschema as theschema defines theavailable labels and potentially alsothe properties. However, we have to adapt the schemabeforewecan introduceanew label, property or entity.

3.4.1Queries: Nowthatwehavehighlighted some of thepotential problems and typesofqueries that are possible with this sub-language,wewill look at afew examples. In thesetting of our social network, addinganew persontothe graph isanimportant operation.However, we do not only want to add anodecontainingthe information about the personbut alsorelateittoother entities in asingle query.The query INS15does exactlythatbyadding anodeaswellasedges depicting their location, interests and so on.

Addanode denoting aperson and connect it to existing nodes depicting their location, interests and place(s) of work and/or study.(10)

Figure 3.8 depictsthis query as agraphicalpattern and thevarious parameters (pre- ceded by adollar-sign)thatwill either be set as properties or are needed to buildthe relationships.

Figure 3.8: Query 10 (INS1, IU1 in [AAA+20]) as agraphicalquery pattern. Source: The LDBCSocial Network Benchmark, page61[LDB20].

As an exampleofaquerythatdeletes one or multiple entities we have to look at a snapshot of version0.4.0 as deletions are not presentinearlierversion. Thequery DEL7 removes acomment togetherwith the corresponding relationships as well as all further comments thatreply to theinitial one.

Delete acomment, itsincidentedges and recursivelyall comments replying to the intitialone. (11)

5Remember: the insertionqueries INS1-INS8 in the snapshot of the most recentversion0.4.0 [LDB20] equal the queries IU1-IU8 (interactiveupdate) in the stable version 0.3.2 [AAA+20].

39 3. QueryLanguageFeatures

Figure 3.9 depictsthis query as agraphicalpattern. We can see that theremovaltriggers the invocation of arecursivedeleteonthe replying comment(s). Thisinturndeletes also theiredges until there arenodangling commentsleft.Therefore, this query potentially touches many entitiesand, depending on the level of support forsuchtransitiveremovals fromthe systemand thequerylanguage,can be quite sophisticated.

Figure 3.9: Query 11 (DEL7) as agraphicalquery pattern. Source: The LDBC Social Network Benchmark, page 87 [LDB20].

The SNB does not containqueriesthat update an already existingentity, be it anodeor an edge.Tothis end, we include Query12that updates thepropertyofanedge, namely the year aperson graduated at auniversity.

Updatethe year (classYear)agiven persongraduated at agiven university. (12)

3.5 Data Definition

When we introduced thedifferentdata models in Section 2.3,wehighlighted that some of them support aschema while others do not. An exampleofsuchaschema can be seen in the RDF data model in Figure 2.2. We alsomentionedthat the widely usedproperty graph datamodel does not support aschema out of thebox.However, even if adatabase system and theunderlying data model do notsupport aschema, it is in theinterestof the users to agree on some sortofaschema as we have motivatedinthe previoussection. Therefore, manygraph databases allow, or even require, the specificationofsome sortof schema or restrictionsonthe graph,evenifthey are builtonthe propertygraph data model. Using astrong type system where labels, properties and theirdatatypes as well as possiblerelationships are specified in theschema limits the flexibility. This lossis traded forperformance as aschema can be usedtooptimizequeryexecution, improve access to entities or even reduce thespace needed to store them [DXWL19]. Apart from performanceimprovements, aschema also allowsfor securityand privacy features

40 3.6. Further Features

[WD18]. One canlimit for example the access to parts of thegraphfor certainusersand hide other parts from them. In applications dealing with datainaknown and structured form, using apre-definedschema representing this structure does notlimit the flexibility while providing the aforementioned performance and securitybenefits.Therefore, we cannot derive asingle best setting regarding theuse of aschema for all applications. The varying trade-offs between theflexibilityand havingatleast some sort of schema are one of thereasons for theexistence of thevarious graph databasesand often amajor differentiation between them. Even if we limit ourselves to databases and languages relying on aschema we endupwith multiple variants of suchschemata. Whereone language for exampleallows an edge to be related to its reverse edgeasinGSQL,we cannot express this relationshipinother languages likeCypher. While DML queriesallow us to manipulate the data graph,DataDefinitionLanguage (DDL)queriesare usedtomanipulatethe schema. Thisincludesthe initialcreationof the schemaaswellasadaptions lateron.

3.5.1Queries: The SNB does not containany queries thatmanipulatethe schemaasitassumes the pre-definedmodel from Figure2.7.Inour comparisonofthe query languageswewill describewhether alanguage supports or relies on aschema as well as thetypeofthe schema. As an exampleofsuchaDDL query we try to add astreettoour data model in Query 13, meaning thatweadapt it suchthatapersonisnot directly located in acity but in astreetwhichinturn is located in acity.

Addastreet entitytothe data model andchange therelationship Person[0..*]-isLocatedIn->[1]City to Person[0..*]-isLocatedIn->[1]Street[0..*]-isLocatedIn->[1]City. (13)

3.6 Further Features

The first three featuresthat we introduced, namely structure independent,pattern matching and navigational queries, are themajor building blocks used in many graph query languages. These featuresallowustoquerydataand are alsoused in DML queries to modify it. In adatabase that supports aschema, DDLqueriesare usedtomodify it. Depending on therepresentation of theschema, they can alsobebuilt on top of these three features as one couldfor examplespecify aschema using apattern. Apartfrom these featuresthatform thebasis and group possible queries, more general features of graph query languages have been identified and we will nowlookattwo of them. These two featuresare more acharacteristic of alanguage and tell us something about its

41 3. QueryLanguageFeatures

expressiveness in certain use cases.The LDBCTechnical User Communityidentified theseastwo of themajor issues in existinggraph query languages [AAB+18].

• Composability The first one, composability,denotes that theoutput or result of aquerycan be usedasinputtoanother query in the same language.This allowsfor nested subqueries as in SQL but alsofor alinear composition of querieswherethe first part of aqueryselects some datawhichisthen passed on as theinputtoasecondquery. Acomposablequery languageencourages amodularapproachand interoperability between queriesaswecan plug one query into another one.This in turn allows for an abstraction of problems as we can decomposebigger queries into smaller pieces where theinputs and outputs are compatible [AAB+18]. The inputisonly compatible to theoutput if they have the same,oratleasta compatible, datamodel [ARV19]. As brieflymentionedinthe introduction of query languages,agraphquery takes agraphasinput but in many languages outputs notagraph but atablecontainingvalues[Ang18]. Therefore, queriesare not composableinthese languages.The paperthatintroduced the researchlanguage G-COREgoesevenone stopfurtherand statesthat “current querylanguages do not provide full composabilitybecause they output tables of values, nodes or edges.” [AAB+18]. Following this argument,agraph query languageisfully composable if it outputs theresult of aqueryasagraph.Unfortunately,asofnow this is not supported by anyofthe currentgraph query languagesexcept by G-COREitself.

• Pathsasfirst-class citizens The second feature deals with the representationofpaths in thedata model. We already mentionedthis feature when we introduced the class of ECRPQin Section3.3.2. Paths are raised to first-classcitizens forECRPQs as they can be named, included in the output and compared to others [LMV16]. An entity in any is saidtobeafirst-class citizenifitsupportsthe operations thatare generallyavailable alsotootherentities. If it cannotbereturned or stored however, it is reducedtoasecond-class citizen[Sco06]. Consequently,raisingpaths to first class citizens allowsustoadd labels andproperties not only to nodes and edges butalsotopaths [AAB+18]. These changesrequire amore sophisticated data model,asthe modelsexplained untilnow cannotdeal with stored paths. As of now, thefeatureisagain primarilysupported by the researchlanguage G- CORE.Toachievethis,G-CORE uses an extension of thepropertygraph data model, namely the Path Property Graph (PPG) data model,that allows paths to be stored and enriched with properties andlabels. Figure3.10depicts asmallsocial network in this model where node labels are depicted as shapes and edge labels as lineswith either an arrow or rhombus.The noveltyinthis data model lies in the lowerleftpart where we findtwo stored paths,one fromJohn to Alice and one from JohntoCeline. As theusersonthe start and endnodeonbothofthese

42 3.6. Further Features

Figure 3.10: Asmallsocialnetwork in the path propertygraph data model thatincludes stored paths. Source: G-CORE ACore for Future Graph Query Languages, Figure 5 [AAB+18].

paths liveinthe same city, namely Houston,wecouldlabel them for exampleby sameCityRelation.Furthermore,wecan store thelengthofsuchapath as a propertyofitsuchthat aquery can directly access this attribute without needing to compute it.

43

CHAPTER 4

Query LanguageAnalysis

Nowthatwehaveintroduced the featuresthatform theconceptualcore of all graph query languages, we analyze 5contemporarylanguages based on those features.The 5 languagesare:Cypher as it is perhaps the most well-known propertygraph language, Gremlin forits imperativeapproach, PGQL for itspowerful path patterns, GSQL forits use of aschema and nativelyparallelapproachand G-CORE, thatbuilds upon theother languages andoffers full composability.Wedid not only choose those languages fortheir differentfeatures and approaches,asnearly everymajor contemporary graph database supportsatleastone of those query languages. To achieveacomprehensiblecomparison, we analyzethe languages oneafter another. For eachlanguage, we start with ageneral introductionthat includes the core principlesofthe languageand giveexamples forthe structure of aquery. We then analyze thelanguages in depth basedonthe previously identifiedfeatures.Wegothrough the implementations of thequeriesfrom Chapter 3, highlightdifferencestoother languagesand look at the expressiveness of alanguage by the supportfor specificfunctions as well as potential shortcomings.Althoughweimplementedall queriesfrom theprevious chapter in the 5languages, we limit ourselvestothose thatare differenttoother implementations in this thesis.Furthermore, thequerydefinitionsinthe SNBexpect an implementation to rename eachvariableinthe result. We omitthis forbrevityonmost queriesinthis chapter as this does notfurther our cause of comparingthe languages.Nonetheless, a full implementationofall queries can be foundonour GitHub repository1.

1https://github.com/martin-kl/Diploma_Thesis_01526110_code

45 4. QueryLanguageAnalysis

4.1Cypher

Cypherisperhaps the best-knownquery languagefor propertygraphs and focuses heavily on patterns. The language is used in multiple databasesand projects, with the version in the Neo4j database [Neo20]and the one from theopenCypher project[ope18]being the most influentialones. As thelanguage is not yetstandardized, it is stillsubject to change and theversions and implementationsofitdifferinsome areas.Unless specifically mentioned, theconceptsand functions apply to all currentimplementations of thelanguage, especiallyalsoCypher 9, thecurrent versionofopenCypher, andthe version in theNeo4j database. [AAB+17] Whereas programminglanguages like and other graph query languages likeGremlin are imperativeinnature,meaningthatthey focus on thedescription of theprogram’s operation and in termsofquery languagesonhow to compute theresult, Cypherisa declarativelanguage. Therefore, it focuses on what aqueryshould retrieveinstead of howitachieves the result.This is one of theareaswhere CypherissimilartoSQL, that is declarativeaswell. Apartfromthat, many keywordsinCypherare inspired by SQL to ease the transitionfor userscoming from theworld of relational databases. Being adeclarativequerylanguage and using manyofthe same keywordsasSQL, it comes onlynatural that thegeneral structureofaCypher query resemblesthe one of an SQL query.Aquery in both languages consists of clauses that are chained together [ope18]. In contrast to SQLhowever, Cypher queriesare conceptually structured linearly. Therefore, we can thinkofthe executionofaqueryasstarting with the evaluation of thefirst clause by generating aresult for it, whichinturn is passed on as inputtothe second clause and so on. Thislinear combination leads to another difference as it does notallowthe projection(the RETURN statement,Cypher’s variant of SQL’s SELECT)tobeatthe beginningofthe queryasfirst clause,ithas to be at the end of it. [FGG+18b] Algorithm 4.1 depictsasimple Cypher query consisting of threeclauses, one in eachline. The MATCH clauseisthe entry pointofaquery and contains thebasicgraphpattern (bgp) thatthe query is searching for.Together with the WHERE clausethatfurtherrestricts the pattern, this query searchesfor relationships between theperson JohnDoe and other nodes.Parentheses “()”inthe patterndepictanode whereas edges are depicted by “-[ ]-”. As we can see in our example, thesebrackets cancontaineitherasingle word thatdenotes avariable(rel, dst)oravariable together with afilter on alabel that is given after thecolon. (pe:Person) forexample matches onlythese nodestothe variable pe thatare labeledasPerson.[AAB+17]

1 MATCH (pe:Person)-[rel]- (dst) 2 WHERE pe.firstName=’John’ AND pe.lastName=’Doe’ 3 RETURN p, rel, dst Listing 4.1: ACypherquery thatselects theneighborhood of thepersonJohn Doe.

We can alsosee theconceptuallylinear structure of aqueryand observe thatitcan be evaluated oneclause after theother, starting fromthe topmost that selects all such

46 4.1. Cypher

patterns, followedbythe restrictiononthe variable p to the personJohn Doe, to the projection in the RETURN clause. However, asystemimplementing thequery languagecan re-order clauses as longasthe semantics of thequeryisretained[FGG+18b]. Therefore, the linear structure holdsonly conceptually and an implementation is not bound to the order of clauses whenevaluating aquery. That allowsanimplementation to usethe restrictions given in the WHERE clausealready whensearching forpatterns in the MATCH clause, thereforeimproving the evaluation efficiency.The strong dependencebetween the WHERE and MATCH clauses lead to ashort form thatallows arestriction on properties alsointhe pattern. Thequery in Algorithm 4.2 is equivalenttothe previous one as it contains therestrictionsonpropertiesinthe “{}”brackets. [ope18] 1 MATCH (p:Person {firstName: ’John’,lastName: ’Doe’}) -[rel]- (dst) 2 RETURN p, rel, dst Listing 4.2: Query 4.1 in amore compact notation.

Nowthatwehaveseenasimple example of aquery, we will look at theevaluationof queriesand clauses and theinformation passing between clauses in moredetail. Aclause resemblesafunctionthat, analogously to SQL, operates on an inputtable and produces atableasoutput. In other words, eachclause gets an input table,applies afunctionon the valuesinthattable, and produces an output table thatinturn is the inputtable forthe nextclause.These tablesare alsocalled binding tables in Cypher as they bind variablesfrom the query to elementsinthe graph or other values. However, the definition of aqueryconsisting of clauses that operateonthese tablesleadstoaproblem,asa Cypherquery in an abstractedsense outputsatablebut takes apropertygraph as input. We can generatethe outputasatableunderthe definition of aclause, but until now eachclauseexpects atableasinputaswell. That is especiallyalso true forthe first clauseinaquery, meaning thatsuchaclause getsanemptytableasinput. The MATCH clauseisthe main representativefor aspecial group of clauses regardingthese binding tables 2,asthey usually takeanemptytableasinputand canpopulateitwith entries fromthe propertygraph. Therefore, we needsuchaclause in every query thatinteracts with thegraph, even if we simply selectall nodes in the database as does thequery in Algorithm 4.3. Thisalso highlightsthe status of patterns as an integral partofCypher. [FGG+18b] 1 MATCH (n) 2 RETURN n Listing 4.3: Asimple Cypher query thatselects all nodes.

Linear Composition and theWITH clause Apartfrom the conceptually linear flowbetween clauses in asingle query as seenuntil now, Cypher also supportsthe WITH clausethat allows foralinear composition of queries. As an example, thequery in Algorithm 4.4 startsatJohn Doeand in afirst step traverses

2If we restrict ourselves to the DQL, MATCH is the only suchclause, but there are others like MERGE thatare part of the DML.

47 4. QueryLanguageAnalysis

to all persons m connected to him. Sinceweare onlyinterestedinthe personwith the alphabeticallylastfirst name, we have to order them by theirfirst name in descending order before we can select the top result. We then want to pass thatpersonasinput to the second part of thequery,namelythe second MATCH.However, ORDER BY is a sub-clauseinCypher thathas to followaprojecting clause, meaningeither RETURN or WITH.Tothis end, we have to use the WITH clausetospecify which variables are passed on from thebinding tableofthe first part of thequery to the second MATCH. Furthermore, onlythese variablescan be used in the ORDER BY clause, that in turn is evaluated before the second MATCH as it is asub-clause of WITH.Wehighlight this dependenceofORDER BY on WITH in thealgorithm by the indentation of line 3, which is howevernot required or motivatedbythe Cypher guidelinesand should onlyhelpthe reader in thiscase. Thesecond MATCH clauseinthe query nowuses both, the input variable m it got from theprevious part and thepropertygraph,topopulate its binding table with entriesfor thevariable o and returnproperties of them. [ope18]

1 MATCH (p:Person {firstName: ’John’,lastName: ’Doe’}) -[rel]-(m:Person) 2 WITH m 3 ORDER BY m.firstName DESC LIMIT 1 4 MATCH (m) -[]- (o:Person) 5 RETURN o.firstName, o.lastName Listing 4.4: ACypherquery thatcontains alinear composition of queriesvia the MATCH clause.

Grouping One well-known keyword from SQLhoweverismissing in Cypher, namely GROUP BY.The functionalityisinstead provided directly by the RETURN clause, as eachnon-aggregate functionisimplicitly takenaskey to group by.This canbeseeninAlgorithm 4.5 where n becomes thegroupingkey andwecountthe number of neighbors o.

1 MATCH (n) -[rel]- (o) 2 RETURN n, count(*) LIMIT 10 Listing 4.5: Asimple Cypher query showinganalternativetoSQL’s GROUP BY.

Apartfrom grouping in thelast clause, we can alsouse WITH to specify the group(s) and the aggregate operation.This allowsustowork with theresultofthe aggregation, for example by filteringout parts of theresult.Asanexample, the query in Algorithm 4.6 selects allfriends of John Doethat have at least tenoutgoing relationships 3.Itincludes some newconcepts, startingwith adirected edgewherethe directionisgiven by an arrow (“<-[]-”or“-[]->”). Note howeverthatthe data model of Cypherdoesnot support undirected edges at all,meaningthatwecannotcreate or store undirectededges. The pattern -[]- thatweusedinthe queries until nowallows us to test for theexistence of some directed relationship, regardlessofthe direction. Furthermore, thereare two anonymous edges and one anonymous node in this query.Theseare entities thathave no

3Note again that theedge label KNOWS depicts afriendship relation in thebenchmark.

48 4.1. Cypher

variable in theircorrespondingbrackets and can thereforenot be accessed later on in thequery or included in theoutput. In this scenario however, we are not interested in the edges that match the directed edge-pattern, but only in the amountofsuchedges. Cypheralsoprovidesashort form forsuchanonymous edges: “--”for edges where we do notspecifythe directionaswell as “<- -”and “-->”for directedones.The WITH clauseisneeded in this query as we want to furtherlimitour selection to friends thathave at least 10 suchedges. To this end, we groupbythe friends f and call the aggregate operation COUNT in the WITH clausethat in turn forwardsthe resulting table to the WHERE clause wherewecan use the previously computedvalues.

1 MATCH (p:Person {firstName: ’John’,lastName: ’Doe’}) -[:KNOWS]- (f:Person) -[]-> () 2 WITH f, COUNT(*) AS out 3 WHERE out >= 10 4 RETURN f.firstName, f.lastName Listing 4.6: ACypherquery thatselects all friends of JohnDoe that have at least ten outgoingrelationships.

Nowthatwehave seensome of thegeneral principlesofCypher, we will go over the queriesfrom Chapter 3and highlight advantages and disadvantages compared to other graph query languages. We took mostimplementationsofthe benchmark queriesfrom theirreference implementationsonGitHub4.Toensure the correctness of thequeries, we executedall of them on aNeo4j CommunityEdition 4.1.0database. However, unless stated otherwise,the queriesuse onlythese functions and proceduresthat are also availableinCypher 9.

4.1.1Structure Independent We have already seenhow we can formulate Query 1: either by restricting the matched nodes in the WHERE clauseasinAlgorithm 4.1 or by using the short form as in Algo- rithm 4.2. Cypheralso supports the parameterization of queriessuchthatthe usercan pass argumentstothe query.Anargumentisthen accessible by preceding its namewith adollar sign as we can seeinAlgorithm 4.7.

1 MATCH (p:Person {firstName:$firstName, lastName:$lastName}) 2 RETURN p Listing 4.7: Query 1asaCypherquery.

The implementation of Query 2inCypherisgiven in Algorithm 4.8. As mentionedinthe introductionofthe query in Chapter 3, it returns thecreationdateofagiven message and either thetextual content in case there is one, or the image file(thatisalsosaved as string).This differentiation can be seen in lines 4-6 of thealgorithm whereaCASE expression distinguishesbetween thetwo cases. To determinewhether amessage has the

4https://github.com/ldbc/ldbc_snb_implementations

49 4. QueryLanguageAnalysis

property content,Cypherprovidesthe EXISTS function. Thisfunctioncan be used on nodes and edges to testfor theexistence of properties. 1 MATCH (m:Message {id:$messageId}) 2 RETURN 3 m.creationDate AS messageCreationDate, 4 CASE EXISTS(m.content) 5 WHEN true THEN m.content 6 ELSE m.imageFile 7 END AS messageContent Listing 4.8: Query 2(IS4) as aCypherquery.

Before we takealook at theimplementation of Query 3, we have to look at Cypher’s capabilities regardingcollections suchaslists.The benchmark states that thelanguages spoken by aperson are stored in alist of strings and Cypherindeed supports lists as well as maps.Lists canbecreated simply by surrounding elementsofthe same type by squarebrackets as in [0,2,4] or by using the collect aggregation. Algorithm 4.9 depicts aCypherimplementation of Query 3. After selecting all persons in line 1, we use SIZE to get thenumberofentries in the speaks propertyofanindividual Person (whichisalist of strings). Thereafter,Cypher implicitly groupsall of thesevalues and calculates theaverage over thesizes. Apartfromthe aggregate operation AVG,Cypher alsosupports MAX, MIN, COUNT, SUM as well as functions to calculate the standard deviationand percentiles. 1 MATCH (p:Person) 2 RETURN AVG( SIZE(p.speaks )) Listing 4.9: Query 3asaCypherquery.

We have already mentionedthe collect functionand we will nowgobrieflyoverthe query in Algorithm 4.10 to giveanexample of it.The query selects all entities (edges and nodes) that are directly related to acontinent whosename starts with ’A’.Atthe endofthe first query part we collect thenames of thesecontinentsintothe list conts that is passed to the UNWIND clause. There, thelist is dissectedagainand each element in the list is passed on as cont to the remaining querypart,meaningthatthis results in one execution of thelatter MATCH for eachentry in the list. Thematches from these executions are then implicitlybrought togetherand we can access all of them in the RETURN clause. Note that thereare simplerwaystospecify this query and we onlyuse this as it allowsustoshowboth, collect and UNWIND in one query. 1 MATCH (c:Continent) 2 WHERE c.name STARTSWITH ’A’ 3 WITH collect(c.name) as conts 4 UNWIND conts AS cont 5 MATCH (c:Continent {name: cont}) -[r]- (n) 6 RETURN c, r, n Listing 4.10: ACypherquery thatselects all entities thatare directly related to a continent thatstarts with ’A’.

50 4.1. Cypher

4.1.2 Pattern Matching We have already seenthatpatterns areused as main building blocksinCypherqueries. Caused by the fact thatwehavealready given multiple examplesofqueriesthat contain simple bgps, we skip the code of Query 4and continue with the similarlooking Query 5. Algorithm 4.11 shows an implementation of this query. It includes an ORDER BY clause thatworks as expected fordeveloperscoming from SQL, meaningthatthe results are first ordered by the friendshipCreationDate in descendingorder and, on identical valuesonthatvariable,onthe personId in ascendingorder. 1 MATCH (n:Person {id:$personId}) -[r:KNOWS]- (friend) 2 RETURN 3 friend.id AS personId, friend.firstName, friend.lastName, 4 r.creationDate AS friendshipCreationDate 5 ORDER BY friendshipCreationDate DESC,toInteger(personId) ASC Listing 4.11: Query 5(IS3) as aCypherquery.

To understand the implementation of Query 6inAlgorithm 4.12, we first have to introduce the OPTIONAL MATCH clause. Similarly to MATCH,wecan specify apatterninthis clauseand the databasesystem searches for matches in thedata. However, if there is no suchmatch or onlyapartial one,meaningthatfor exampleonly some variablesofthe OPTIONAL MATCH clausecan be matched,all unmatched ones are settonull.This resemblesthe semantics of outer joininSQL. In ourexample,wefirst searchfor apattern between agiven message m,areplying comment c and itscreator p.Totest whether p knowsthe personthat created the original message m,weoptionally matchthe path from m to its creator and furthermoreoveraKNOWS edge to p.However, we do not knowifthe KNOWS edge existsasthis is exactly what we want to find out.Therefore, we can then use a CASE expression in the RETURN clause to test whethersuchamatchwas found. 1 MATCH (m:Message {id:$messageId}) <-[:REPLY_OF]- (c:Comment) 2 -[:HAS_CREATOR]-> (p:Person) 3 OPTIONALMATCH (m) -[:HAS_CREATOR]-> (a:Person)-[r:KNOWS]-(p) 4 RETURN 5 c.id, c.content, c.creationDate AS commentCreationDate, 6 p.id AS replyAuthorId, p.firstName, p.lastName, 7 CASE r 8 WHEN null THEN false 9 ELSE true 10 END AS replyAuthorKnowsOriginalMessageAuthor 11 ORDER BY commentCreationDate DESC,replyAuthorId Listing 4.12: Query 6(IS7) as aCypherquery.

Evaluation Semantics Regarding the evaluation semantics of abgp, Cypheruses relationship isomorphism as adefault.That resembles ourno-repeated-edge semantics where no twoedgevariables can be matched to the same edgeinasingle match. This avoidsapotential infinite result in patterns with variable length[AAB+18]. It is envisioned thatfutureversions of

51 4. QueryLanguageAnalysis

Cypher, i.e. Cypher10underthe openCypherproject, allow the user to switchbetween ahomomorphism-based, ano-repeated-node and thecurrentno-repeated-edge semantics on aquery-by-querybase.[GJK+18] However, we can influence thesemantics of aquerybyour choice of thequery structure even today. As an example, assume that we are searching forall friends-of-friends (of lengthexactly 2) of agiven person n.Wecan then collectall thesefriendsinalist and test whether n is in that list or not.The first example in Algorithm 4.13 contains a single MATCH clauseand,caused by the no-repeated-edgesemantics,returns false as the same edgewill not be matched to both edges in the pattern. However, the versionin Algorithm 4.14 returns true as thereare no two edges in asingle MATCH but they are split up intomultiple ones. 1 MATCH (n:Person {id:$id})-[:KNOWS]- (friend) -[:KNOWS]- (foaf) 2 WITH n, collect(foaf) as foafs 3 RETURN n IN foafs //returnsfalse Listing 4.13: Acypherquery thatreturns false as thesame edge will not be matched to both edges in the pattern.

1 MATCH (n:Person {id:$id})-[:KNOWS]- (friend) 2 MATCH (friend) -[:KNOWS]- (foaf) 3 WITH n, collect(foaf) as foafs 4 RETURN n IN foafs //returnstrue Listing 4.14: Acypherquery thatreturns true as thesame edge will be matched to the edges in both MATCH clauses.

Apartfrom that, Cypher also provides UNION and UNION ALL clauses that allow for acombination of results frommultiple queries.Regardingthe evaluationsemantics of suchqueries, Cypher uses abag semantics which is not problematic as we knowthat the evaluation of asinglequery results in afinite number of matches. [AAB+17, ARV19]

4.1.3 Path Cyphersupportsasubset of RPQsasitallows forvariablelengthpaths. These paths can be further restrictedbyasingle relationship type or adisjunctionofsuch[FGG+18b]. In the most general form, we can abbreviateanotherwise unrestricted path with three edges and fournodes by (a) -[*3]- (b).Furthermore,wecan specify an upperbound y on thelengthvia [*..y],alowerbound x via [*x..] and acombination of both via [*x..y].This canbeextended to alsoinclude edgelabels. As an example, we can search for the friends-of-friends of length3via (a)-[:KNOWS*3]-(p).Besidespaths of variable length, adisjunctionoflabelsisgiven as (a)-[:KNOWS|WORK_AT]-(b). However, Cypher doesnot supportfull composition of regular expressionsoveredge labels and its path querying functionalityistherefore notasexpressiveasRPQs. Apartfrom restrictions on paths, Cypheralsoallows paths to be assignedtovariables. As an example, we can assign apath to avariableasinAlgorithm 4.15. Thisallows

52 4.1. Cypher

us to furtherrestrictapath, for example in a WHERE clause, and to return it.Cypher providesfunctions like nodes and relationships to access theseentities on apath in order to restrict or returnthem as well as length to get thelengthofapath.

1 MATCH p=(n:Person {id:$id}) -[:KNOWS*..5]-> () 2 WHERE length(p) >2 3RETURN relationships(p), length(p) LIMIT 10 Listing 4.15:ACypherquery that demonstrates named paths (paths that are assigned to variables).

Algorithm 4.16 depictsanimplementation of thereachabilityquery7.This is an interesting query as it searches for theshortestpathbetween two nodes with arestriction on theedges thatcan be traversed. Cypher providesthe shortestPath function for exactlythat, meaning that thefunctionfinds“asingle shortest path between two nodes” [ope18, Neo20]. However, it is notspecified which path is returned when there are multiple shortestones. As we do not knowifsuchapathexists at all in our case,weuse the function in an OPTIONAL MATCH so that we can later test whether suchapath was found or not.Apart fromthe shortestPath function that returns asingle shortest path, Cypher also providesthe allShortestPaths functionthat returns all shortest paths between two nodes.

1 MATCH (person1:Person {id:$person1Id}), (person2:Person{id:$person2Id}) 2 OPTIONALMATCH path =shortestPath((person1)-[:KNOWS*]-(person2)) 3 RETURN 4 CASE path IS NULL 5 WHEN true THEN -1 6 ELSE length(path) 7 END AS shortestPathLength; Listing 4.16: Query 7asaCypherquery.

Algorithm 4.17contains an implementation of theadjacency query 8. Thequery incorpo- rates multiple concepts and clauses that we explained before and at afirstglance looks quite complicated. However, this comes mostlyfrom the factthat it starts with asimple patternthatgathersfriends up to distance 3and then proceeds by collecting further information about them in multiple steps.Asthere canbemore than one pathoflength 1to3between two persons thatknoweachother, we use the shortestPath function to get asingle one. After excluding the start personinline 2, we pass the required informationtothe next query partvia the WITH clause. Thispart is then evaluated oncefor everyfriendfrom the first part and gathersinformationabout the place the friend is livingatand optionallythe place(s) he or she studied at.Information about asingle universityisgathered in amanually created list in line 12. As aperson can have relations to none,asingleormultiple universities, we iterate over them and collect theinformationinalist using collect.The informationiscombinedinanother WITH clause, forwarded to the next query part where information aboutthe work-relationships of thefriend are added and finally returned. Thisquery shows howwecan iteratively

53 4. QueryLanguageAnalysis

gather information and that theinformationwewanttofinally return hastobepassed to theend via WITH clauses.

1 MATCH p=shortestPath((person:Person {id:$personId}) -[path:KNOWS*1..3]- ( friend:Person {firstName:$firstName})) 2 WHERE person<>friend 3 WITH friend, length(p) AS distance 4 ORDER BY distance ASC,friend.lastName ASC,toInteger(friend.id) ASC 5 MATCH (friend) -[:IS_LOCATED_IN]-> (friendCity:Place) 6 OPTIONAL MATCH (friend) -[studyAt:STUDY_AT]-> (uni:Organisation) -[: IS_LOCATED_IN]-> (uniCity:Place) 7 WITH 8 friend, 9 collect( 10 CASE uni.name 11 WHEN null THEN null 12 ELSE [uni.name, studyAt.classYear, uniCity.name] 13 END 14 ) AS unis, 15 friendCity, 16 distance 17 OPTIONAL MATCH (friend) -[workAt:WORK_AT]-> (company:Organisation) -[: IS_LOCATED_IN]-> (companyCountry:Place) 18 WITH 19 friend, 20 collect( 21 CASE company.name 22 WHEN null THEN null 23 ELSE [company.name,workAt.workFrom, companyCountry.name] 24 END 25 ) AS companies, 26 unis, 27 friendCity, 28 distance 29 RETURN friend, distance,friendCity.name, unis, companies

Listing 4.17: Query 8asaCypherquery.

We skip theimplementation of Query 9asitdoesnot provideany new insights and can be written as arelativelysimple navigational query.

Evaluation Semantics Similar to the evaluation semantics of pattern matching queries, Cypher uses ano-repeated- edge semantics whenevaluating path queries[DXWL19]. Therefore, the evaluation of suchaquerywill never takethe sameedge inapath, whichinturn avoids paths of unlimited length. Regardingpossibleoutputs of aquery, Cyphersupportseverything fromfixed arity outputs likeBooleans, single properties and entities to variable arity outputslikepathsand lists.However, we are not abletoreturn agraph in Cypher as of now, although somethinginthat regardshould be supported in Cypher10[FGG+18b].

54 4.1. Cypher

4.1.4 Data Manipulation Nowthatwehaveseenavarietyofqueries thatselectdatafrom the graph,wewill look at DMLqueries. As mentionedinChapter3,queriesthatmanipulatedataare expressedsimilarly to ones that onlyretrievedata. Thiscan be seeninthe simplified implementation of Query 10 in Algorithm 4.18 that uses thesame patterns also to create nodes and relationships.The query inserts anew personnodevia the CREATE clauseas well as arelationship labeled IS_LOCATED_IN to connect thenew node to the citythe personisliving in. All of that is doneinthe first twolines and thecodeinthe following linesthen creates avariablenumberofrelationships to already existingnodes. We use UNWIND to unroll thelist of tags, study and work information andeachsuchentry is then passed to one MATCH-CREATE combination.Asaninteresting side note, we have to include count(*) in the WITH clauses to avoid an unexpected behavior: if we pass forexample two tagsintagIds,this results in two evaluationsofthe MATCH-CREATE combination in lines 5and 6whichinturn would result in two calls of the WITH clause in line 7. By adding count(*) in thatclause, we implicitly group on the singlekey p which in turn is theonly value thatisthen passed to the next query part.

1 MATCH (c:City {id:$cityId}) 2 CREATE (p:Person {id: $personId, firstName:$fn, birthday: date($bd), creationDate: timestamp($cd), emails: $emails}) -[:IS_LOCATED_IN]-> (c) 3 WITH p 4 UNWIND $tagIds AS tagId 5 MATCH (t:Tag {id: tagId}) 6 CREATE (p) -[:HAS_INTEREST]-> (t) 7 WITH p, count(*) AS dummy1 8 UNWIND $studyAt AS s 9 MATCH (u:Organisation {id: s[0]}) 10 CREATE (p) -[:STUDY_AT {classYear: s[1]}]-> (u) 11 WITH p, count(*) AS dummy2 12 UNWIND $workAt AS w 13 MATCH (comp:Organisation{id: w[0]}) 14 CREATE (p) -[:WORKS_AT {workFrom: w[1]}]-> (comp) Listing 4.18: Simplified versionofQuery 10 (INS1) in Cypherthatinserts onlysome properties.

Apartfrom CREATE,Cypheralsoprovidesthe MERGE clausethatensures thatagiven patternexists in agraph. If the patternalready exists, it works similartoMATCH as the variablesare bound to the existing and matched entities.Otherwise,the entities are first createdand thenboundtothe variables. Cypherprovidestwo clauses to remove entities from thegraph: DELETE removesasingle node or edge whereas DETACH DELETE removesanodetogetherwith all incident edges. Regarding the removalofconnected nodesfrom the graph,stale edgesare not allowedat anytime.Therefore, aquerythat deletes anodewithout alsoremoving its relationships will fail. That alsomeansthat the removal of aconnected node will failvia the DELETE command. Using DETACH DELETE however, we are abletoremove connected nodesand

55 4. QueryLanguageAnalysis

trigger recursiveremovals via ashort query.This canbeseeninthe implementation of Query 11 in Algorithm4.19.

1 MATCH (comment:Comment {id: $commentId}) <-[:REPLY_OF*]- (replies:Comment) 2 DETACH DELETE comment, replies Listing 4.19: Query 11 (DEL7) as aCypherquery.

Apartfrom inserting and deleting nodesand edges, we can alsoupdatetheir properties, add newones or remove existingones. An exampleofsuchanupdate can be seeninAlgo- rithm 4.20 where we change an existingproperty. The SET command is alsoused to insert anew property,meaningthatCypher does nottestwhether the property classYear already existsbut simply savesitwith the new value,overwriting apotentially existing one or creatingitanew.Toremove aproperty from anodeoranedge, Cypher provides the REMOVE clause. NodesinCypher’s data model can have multiple labels where the same commands are used to update,add or remove them.Edges howevercan onlyhave one label that is setatedgecreation and cannot be updated or removedlater on. 1 MATCH (:Person {id:$personId}) -[r:STUDY_AT]-> (u:University {id:$univId}) 2 SET r.classYear=2020 Listing 4.20: Query 12 that sets (updates or adds) the property classYear.

4.1.5 Data Definition Cypherinits original versiondoesnot support aschema and “was originally conceived in adynamically typed,schema-lesscontext”[FGG+18b]thatworks on asingle graph only. That is stilltrue in the currentversion governedbyopenCypher[ope18], but other implementations like Morpheus: Cypherfor Apache Spark5, SAP HANAGraph6 and CypherinNeo4j support at least some kindsof(schema) constraintsand more than one graph.Wewill nowlookatthe constraints available inNeo4j as thesewill likely alsobeavailable in future versions of openCypherasthere exists aCypherImprovement Proposal7 forthem in openCypher. Unique node property constraints ensurethatthe valuesofapropertyonanodewith aspecific label are unique. This constrainthowever does notenforce thatthe propertyhas to existonall nodes with this label butonly that if thepropertyexists,its value is unique. Node property existenceconstraints allow us to specify thatall nodes with aspecific label have to have aspecific propertyand relationship property existence constraints are the counterpart for edges. Finally, node key constraints ensurethat aset of properties has to exist on each node with agiven label and thecombination of thepropertyvalues hastobeunique.Wecouldenforce at least parts of aschema using thelatterthree constraints. However, they are onlyavailable in the EnterpriseEditionofthe Neo4j database whereas theopensource Community Editionislimited to uniquenodepropertyconstraints.

5https://github.com/opencypher/morpheus 6https://www.sap.com/products/hana.html 7https://github.com/opencypher/openCypher/pull/166

56 4.1. Cypher

Even if we were to use Cypher as in Neo4j’sEnterprise Edition,wecouldnot implement Query13asthereisnoway to specify allowed patterns. The onlything we can do is to change theexisting relationships in away thatthey comply with the the query.Tocope with thelackofaschemaingeneral, Cypherprovidesfunctions like EXISTS,thatcan alsotestfor theexistence of patterns, or the OPTIONAL MATCH clause. However, having aschema would allow asystem to fully type checkqueries up-front andadd additional optimization potentialfor thequeryplanner and theexecution [FGG+18b].

4.1.6Further Features

• Composability: We have already seenthatCypher supports linear composition of queriesvia the WITH clause. If we look into (nested) subqueries, we are again limited to versions of Cypherdifferentfrom openCypher.There is howeveraCypher ImprovementProposal8 thattargetsversion10ofthe languageand introduces this feature in the openCypherversion. We will nowlookatanexample of acorrelated subquery as it is currentlysupported by the versionofCypher in Neo4j.Asan example, assume that we want to calculate the number of persons thatare older than John Doe. Algorithm 4.21 depictsanimplementation of this querywhere we use the “CALL {}”clause9 to evaluate the subquery.Again,wehave to use WITH to specify whichelements of the binding table from the outer query are passed to the subquery.

1 MATCH (p:Person {firstName: ’John’,lastName: ’Doe’}) 2 CALL { 3 WITH p 4 MATCH (o:Person) 5 WHERE o.birthday

However, Cypher allows us to formulate many suchqueries in away thatavoids subqueries altogether. The queryinAlgorithm 4.22 is equivalenttothe previous one but achieves this via two (uncorrelated)matches that are then relatedtoeach other in the WHERE clause. We use WITH in this implementation to group by p and countthe matches of younger persons before we return them.

1 MATCH (p:Person {firstName: ’John’,lastName: ’Doe’}) 2 MATCH (o:Person) 3 WHERE o.birthday

8https://github.com/opencypher/openCypher/pull/100 9Note thatthere is also a“CALL”clause available in both,Neo4j’s and openCypher’sversion, that invokesaprocedure. The “CALL {}”clause thatinvokesasubquery howeverisonly available in Neo4j’s version.

57 4. QueryLanguageAnalysis

4 WITH p, COUNT(o) as youngerCount 5 RETURN p.firstName, p.lastName,youngerCount Listing 4.22:Analternativeversion of thequery fromAlgorithm 4.21 that avoidsthe subquery.

Neo4j’s versionofCypher furthermore allows for existential subqueries using a combination of WHERE and EXISTS.Unlikeexplained above,queries nested through WHERE EXISTS can access variablesfrom the outer scopeevenwithout WITH.On the other hand, suchqueries canoftenbereformulatedinawaythatthey avoid existential subqueries altogetherbyusing amore sophisticated pattern, OPTIONAL MATCH or specifying therequirements in the WHERE clause.

Apartfrom that, Cypher in its currentstate is notacomposablelanguage as a query outputs atablebut takes agraph and an (empty) tableasinput. There is howeveragain aCipherImprovementProposal10 to include this in Cypher 10 via achangeofthe binding tables. Theplan is to adaptthese tablessothat they notonly include tabular information but alsoone or multiplegraphs, therefore becoming table-graphs [FGG+18b]. Thiswould enable query composability butalso requires achangeofother structures as Cypher9cannot deal with multiple graphs. There is already one implementation of Cypherthat supportsthese features,namely Morpheus,the version for . • Pathsasfirst-class citizens: We mentioned before thatCypher allowspaths to be matched and returned. However, this is not enough for it to treat them as first-class citizens as we cannot save paths or add propertiestothem and to the best of our knowledge,there is no attempt to change this as of now.

As we have seeninthis section, Cypher focusesheavily on patterns andprovidesasyntax thatallows forcompactqueriesinmanycases.Itprovidesamyriadofmathematical and string functions that we didnot introduceaswell as aggregating and graph-specific functions like shortestPath, nodes, labels.The versionofCypher in Neo4j also includestemporal functions for dates, times including time zones and durations as well as spatial functions. Apartfrom thesefunctions that are includedinthe language,wecan extend Cyphervia user-defined functions. These functions aredeployedintothe database and can then be called similar to Cypher functions in queries. TheNeo4j database also supports user-defined procedures thatcan takearguments and interactwith the database before returning aresult.InNeo4j,these functions and proceduresare written in Java and thecompanyprovidessome of them in bundles. AwesomeProcedures On Cypher (APOC)11 and abundle especially fordata scientists12 are examples for them. 10https://github.com/opencypher/openCypher/pull/241 11https://neo4j.com/labs/apoc/, https://github.com/neo4j-contrib/ neo4j-apoc-procedures 12https://github.com/neo4j/graph-data-science/

58 4.2. Gremlin

4.2Gremlin

UnlikeCypherthatisahigh-leveldeclarativelanguage, theGremlin query language [Tin20]isalow-level language that offersimperativeaswellasdeclarative constructs [Rod15]. Although it is afunctionallanguage at itscore,itismore imperativeinnature and focuses on graph traversals instead of pattern matching [AAB+17]. However, apart fromits focusontraversals and thereforenavigational queriesinthe imperativestyle, Gremlin providespattern matching featuresinadeclarativeconstruct [TPAV19]. From the earliestprototypes onwards, Gremlin extensively usesXPath to provide complex graph traversals [Lin18, AAB+18]. As brieflymentionedinthe introductionofGremlininChapter 2, Gremlin denotes not onlythe querylanguage butalsothe traversalmachine GTM usedinthe TinkerPop framework. Aquerywritten in the Gremlin querylanguage is compiled to generatea traversal thatisthen executed on theGTM. In the imperative traversalmode, that is alsousedinsome sense to match patterns as we will seelateron, theuser specifies so called motifs,that are traversal instructions containing multiple steps.These steps are then executed by a traverser,aninstance of a traversal,inthe GTM.Inotherwords,a single traversalhas aset of traversers attachedtoitthatwalk on agraph according to the steps fromthe query.The result of suchatraversalthen contains all locations where atraverser halted.From nowon, we mean thequery language and not thetraversal machine whenwespeak about Gremlin. [Rod15, AAB+17] Before we can examineour first query on thethird line in Algorithm 4.23, we have to createatraversalsource on agraph. Suchasource is needed to createatraversalthat in turn resemblesaquery and interacts with agraph. After connectingtosome graph and assigning it to the graph variableinline 1, we createthe traversalsource g in line 2. Thistraversalsourcecan be usednot onlyfor asingle query but forall queriesonthe graph.The queryinline3selectsall properties(values)ofJohn Doe. Similar to the linearstructureofCypher queries, aqueryinGremlin is alwaysevaluated fromleft to right.Starting fromthe traversalsource g,wefirst generateatraversalthatstarts at the verticesinthe graph via V().Intuitively,there is nowone traverser on eachvertex in the graph andthe followingsteps specify howthey proceed.Ifthe query wastoend here, meaning that it onlycontains g.V(),weget all nodesasaresult sincethere is one traverser on eachnodeafter the executionofV().Inour query in Algorithm 4.23 however, we continuewith a has() step to remove all traverserswherethe corresponding element (a node in our case) does not have the given key/value property. Therefore, we are left with traversersononly those nodes thathavea’firstName’/’John’ property. Note thatthis alsoremovesall traversers on nodes thatdonot resembleaperson as thesedo not have a firstName property. After repeating this step to also remove traversers on nodes thatdonot have the ’lastName’/’Doe’ property, we selectall propertyvalues of the node with values. Note that thecommands can be run as given in thealgorithminthe Gremlin Console, an interactive terminal that connectstothe GTM and allowsfor ad hocqueries to be

59 4. QueryLanguageAnalysis

executed. Forqueries in applicationshowever, one normally usesanimplementation of Gremlin in aprogramminglanguage and interacts with aGTM from there.Wewill go into more details regardingsuchimplementations and theirdifferencesdetaillateron. 1 gremlin> graph =TitanFactory.open() 2 gremlin> g=graph.traversal() 3 gremlin> g.V().has(’firstName’,’John’).has(’lastName’,’Doe’).values() Listing 4.23: An exampleonhow to connecttoagraph,create agraph traversalsource on thegraph and spawn atraversalinthe Gremlin Console. Thequery then selects the propertiesofnode(s)with the name John Doe.

The query in Algorithm 4.24 selects thenames of all friendsofJohn Doe. In the previous query,weremovedall traversers on nodesthatdonot resembleaperson implicitlyby requiring the property firstName.Wecan alsostate suchalabel requirement explicitly via the hasLabel() step. After thesteps in line 1are executed,weare leftwith a traverser on the node representing John Doe(assuming that thereisonly one suchnode). The next step, out(),isavertex step that allowsustotraverse overanoutgoingedge to an adjacentvertex, denoted as outgoingadjacentinGremlin.Inour query,welimit the traversaltoknows edges (edgeslabeled with “knows”).Incase thereare multiple such adjacentvertices, this stepessentially spawnsnew traversers andassigns eachofthose verticesone traverser. Apartfrom moving to outgoingadjacentvertices,Gremlin also supports in() to move to incoming adjacentvertices (adjacent vertices connectedover incoming edges) and both() to move to both, incoming and outgoingadjacentones. In the last line of this query, we select thefirstand last name of the friends of JohnDoe. 1 g.V().hasLabel(’person’).has(’firstName’,’John’).has(’lastName’,’Doe’) 2 .out(’knows’) 3 .values(’firstName’, ’lastName’) Listing 4.24: AGremlin query that selects the names of John Doe’s friends.

As we have seeninthis query,weare forcedtothink in agraph-perspectivewhen writing aqueryinGremlin [Rod15]. Untilnow,this perspectiveismostly limited to verticesas we can selectand filterthem and move fromone to another. However, we can alsomove to edges whichcan be seeninAlgorithm 4.25. Thisquery selectsamap containing all key/value pairs of outgoing knows edges of JohnDoe via valueMap().The outE() step is again avertex step butunlikethe ones mentionedbefore, movestooutgoing incident edges.Similartothe vertex steps fromabove,Gremlin alsoprovidessteps like inE() to move to incoming incident edges and bothE() to move to both, incoming and outgoingones. All steps mentioneduntil nowcan be invokedontraversers on vertices. Furthermore, steps like outV(), inV(), bothV() and otherV() can be usedon traversers on edges that allow us to move from an edge to avertex. 1 g.V().has(’firstName’,’John’).has(’lastName’,’Doe’).outE(’knows’).valueMap() Listing 4.25:AGremlin query thatselects properties of outgoing knows edges of John Doe.

60 4.2. Gremlin

All of our queriesuntil noware composed as aconcatenation of steps which lookssimilar to ageneral programming language. And indeed,Gremlin is sometimesdenoted as a “programminglanguage for property graphs”[AAB+18]. Thissets itapart from most other querylanguages,especiallyalso Cypher, and allowsusfor exampletostore the result fromone query in avariableand use it later on in adifferentquery. Algorithm 4.26 contains some examples for this whereeachline depicts asingle query.The query in the first line assigns thenodedepicting JohnDoe to the variable john.Inordertoassign it, we have to use the next() step that triggersthe evaluation.This is needed as Gremlin usesalazy evaluation on most queries. We were abletoomit this stepinthe previous queriesasqueriesentered in theGremlin Console are evaluated automatically after pressingEnter.However, thatdoesnot work when assigning aqueryresult to variables, where we have to manuallytrigger theevaluation of thequery to assign theresult.Now thatwehave areference to this node, we can use this variable in the followingqueries. As an example, we directly start fromthe node john in the query in line 3and traverse to verticesconnected over knows edges and select their firstName. Anotherarea where Gremlin differs from most query languagesisthe support for lambdas. Lambdasteps like map(), flatMap() and filter() allow for custom functions that can be usedtorestructureorfilter data.Although these steps “represent the foundational constructs of the Gremlin language”[Tin15, Tin20], they should be avoidedifpossible as they cannot be inspected and optimized by the compiler. Nonetheless, having theability to use lambdas in case there is no lambda-lessstep directly providedbyGremlin allows auser to essentially run anyfunction. Therefore, it is alsonot surprising that Gremlin is aTuring-completelanguage [WD18]. The queriesinlines3,4and 5are equivalent and while thefirst versionuses Gremlinfunctions, thelatterones use the lambdastep map(). 1 gremlin>john =g.V().has(’firstName’,’John’).has(’lastName’,’Doe’).next() 2 //the following three queries are equivalent: 3 gremlin>g.V(john).out(’knows’).values(’firstName’) 4 gremlin>g.V(john).out(’knows’).map(values(’firstName’)) 5 gremlin>g.V(john).out(’knows’).map {it.get().value(’firstName’)} Listing 4.26:Multiple Gremlin queriesthatassign and use avariableand demonstrate the use of lambdafunctions.

Each queryuntil nowstartedwith the V() step, that spawnsatraversalonthe vertices fromthe traversalsource g.Gremlin alsoallows atraversalthat startsonthe edgesvia the E() step as can be seeninAlgorithm 4.27. groupCount() is astep thatcounts howoften an object has been partofatraversaland cantherefore notonlysummarize labels but also anyotherinformationlikethe namesorages of persons. 1 g.E().label().groupCount() Listing 4.27: AGremlin traversalstarting fromthe edges that generates asummary of the edgelabels.

61 4. QueryLanguageAnalysis

We can furthermoreuse union() to combinethe resultsofmultiple traversals or query parts. As we can see in Algorithm 4.28, we can save areference to an object using as() and can later on in thequery accessitvia select().This query selects apersonwith agiven id13 and combinesthe first name of thepersonwith the number of outgoing knows edges in the union() step. The laststep, fold(), aggregates these two valuesintoalistand emits it.

1 g.V().has(’iid’,’person:933’).as(’person’). 2 union( 3 select(’person’).values(’firstName’), 4 outE(’knows’).count() 5 ).fold() Listing 4.28:Agremlinquery thatoutputs informationabout aspecific personand counts the number of outgoing edges.

All queriesinthis chapter were testedonaTitan 1.0.0 database14 thatnativelysupports the TinkerPop graph stack and thereforealsothe Gremlin query language.However, as Titan is no longer maintained (but afork of it,JanusGraph15,is),itdoesnot use the most recent version of TinkerPopand thereforeGremlin [Tin20]but the older version 3.0.1-incubating [Tin15]. As we ranall queriesonTitan, they are compatible with version 3.0.1and do not use additions that were introducedinlaterversions. An exampleofsuch an addition is the propertyMap() step that extends valuesMap() to alsoinclude the id and thelabel of an entity.However,all of thecore functionalities and steps are alsoavailable in version3.0.1.

Apartfrom the database we will brieflyexplain the workingsofthe Gremlin Console, whythereare implementationsofGremlin in multiple programming languages and which one we use in our queries. The TinkerPopprojectprovidesareference implementation of Gremlin in Java 8: Gremlin-Java.However,usersare freetoimplementavariant of Gremlin in anyother programming language as longasthe language supportsfunction composition and function nesting. Suchanimplementationprovidesthe step functions and is responsible for thetranslation of queriestoGremlin thatinturn canbe executed on aGTM.Apart fromthe implementation in Java,TinkerPop also provides an implementation in : Gremlin-Groovy.This language variant is used in the Gremlin Console.Therefore,the Gremlin queries we have seen untilnow can all be run on everyGroovy-systemthat includes the required libraries forGremlin.However, the only notabledifferencesbetween thesetwo language variantsregardsthe use of single versus double quotes forstrings(Groovyallows single quotes, Java requires double quotes), and asmallchange in syntax on anonymous entities.The followingqueriesare implemented in Gremlin-Java.Wepooled the benchmark queriesfrom twoGitHub repositories: the

13We use the property iid in our datamodel for our custom ids as id() is anativeGremlin function that accesses theidgenerated by Gremlin and not the one generated by theLDBC datagenerator. 14http://titan.thinkaurelius.com/ 15https://janusgraph.org/

62 4.2. Gremlin

original implementation from PlatformLab16 and afork and extension of this from Anil Pacaci17.

4.2.1 Structure Independent We have already seenhow aperson node with agiven first andlastname can be selected, which is exactlywhat the query in Algorithm 4.29 does. Nonetheless, we givethis as an exampleofaquerywritten in Gremlin-Java,wherethe onlydifference to queries from before is the usage of doublequotes forstringsThatmeans, thequerycan be executed directly in anyJavaapplication thatincludesthe Gremlin-Java libraries.Therefore, the usage of Gremlindiffers from the use of query languageslikeSQL in application code as SQL code is usually providedasastring thatispassed over an interface to thedatabase. In contrast,aGremlin query is implementedinthe programming language itself and compiledtoBytecode to be executed. The query also showsanextensionofthe has() step that can require anodetohave alabel (the first argument)and akey/valueproperty in asingle statement.

1 String firstname = "John"; 2 String lastname = "Doe"; 3 System.out.println(g.V().has("person", "firstName",firstname).has("lastName" ,lastname).valueMap()); Listing 4.29: Query 1asaGremlinquery in the Gremlin-Java language variant.

Anotherexample where we can see theinteraction between code written in Java and the Gremlin query is given in Algorithm 4.30. Whereas thequery itself is given in lines1-3, the result of thequery is accessedinthe followinglines where we checkwhether the post contains an imageFile or atextual content.Nonetheless, we run into alimitation of Gremlin when implementing this query as thedatamodel onlysupportsasingle label on vertices. In our case however, postsand commentsare both messagesand we labeled them either with “post, message”or“comment, message”when working with Cypher. Sincethe corresponding nodes in Gremlin canonly have asingle label, they are either labeled by “post”or“comment”, and we cannot directly search for a message. We achievethis functionalityinGremlin using the within predicate in a has() step to filternodes labeled with post or comment.Apart fromthat, the algorithm shows that accessingthe resultsfrom Java is not as straightforward as one mightthink since we need to repeatedly cast variablesoftype Object to List in order to get thefirst entryofthese lists.Onaside note,the T and P prefixes are onlyneeded in Gremlin-Java as arestriction from Java and can for example be omitted in Gremlin-Groovy.

1 Mapvalues =g.V().has(T.label,P.within("post", "comment")). 2 has("iid",messageID). 3 valueMap("iid", "creationDate", "content", "imageFile").next(); 4 List result = new LinkedList<>();

16https://github.com/PlatformLab/ldbc-snb-impls 17https://github.com/anilpacaci/graph-benchmarking

63 4. QueryLanguageAnalysis

5 result.add(((List) values.get("iid")).get(0)); 6 result.add(((List) values.get("creationDate")).get(0)); 7 if (((List)values.get("content")).get(0).equals("")) 8 result.add(((List) values.get("imageFile")).get(0)); 9 else 10 result.add(((List) values.get("content")).get(0)); Listing 4.30: Query 2(IS4) as aGremlin query in Gremlin-Java.Wecan see howresults of aquery canbeaccessed in Java.

Algorithm 4.31 contains twodifferentimplementationsofQuery 3. Forthis query,we have to access themulti-value property speaks of aperson.Gremlin’s data model allows forsuchmulti-value properties, meaning that asingle propertykey on avertex can have multiple valuesattached to it.The first version savesareference to atraversalinJava and iterates manually over thetraversaltotrigger thelazy evaluation and access the results.Inthe second version, we use twoqueries that calculate thetwo sums and then calculatethe average in Java.However,both versions compute the average in Java and not in the query directly.While Gremlin providessteps to group,countand compute the average, it is not clear howtoformulatethis with the access to amulti-value propertyin asingle query. 1 //Version 1: 2 Map map; 3 GraphTraversal> tr =g.V().hasLabel("person"). values("speaks"); 4 while(tr.hasNext()) { 5 map =tr.next(); 6 overall+=((List)map.get("speaks")).size(); 7 entries++; 8 } 9 double result=overall * 1.0 /entries; 10 log.info("Result of Query 3(avg. spoken languages): " +result); 11 12 //Version 2: 13 long sumLangs =g.V().hasLabel("person").properties("speaks").count().next(); 14 long sumPersons =g.V().hasLabel("person").count().next(); 15 double res =sumLangs * 1.0 /sumPersons; Listing 4.31: Twoversions of Query 3inGremlin.

The two queriesinthe second variant implicitly group in the count() step. Apartfrom this, Gremlin providesagroup() step where theuser can specify thekey to group by.Al- gorithm4.32 contains some examples of group-by in the setting of asimple socialnetwork where usersare stored as verticeswith a name and year property. The queryinline 1 uses groupCount() together with a by() step that specifies thekey to group by. by() is not an actual step as it has no meaningonits ownand canonly be usedtogether with steps like groupCount(), group() or select() as we will see later on. The query in line 3isequivalenttothe one in thefirst line,but shows howwecan manually trigger the groupingand specify thekey in the following by() step. Sincewewanttocount

64 4.2. Gremlin

the persons foreachyear (groupingkey), we endthis query with another by(count()). Takentogether, we can conclude that agroup-byoperationthataggregates theval- ues in the groupsisgiven as group().by(grouping_axis).by(reducing_step). Apartfrom this, the same construct canalso be usedtoprojectinstead of reduce via group().by(grouping_axis).by(projection_axis).The query in line 6pro- videsanexample of this as it groups by year and returnsalist of personnames for each year.

1 g.V().hasLabel("person").groupCount().by("year") 2 //is equivalent to: 3 g.V().hasLabel("person").group().by("year").by(count()) 4 5 //this createsamap with the year being the key and the names the values: 6 g.V().hasLabel("person").group().by("year").by("name") 7 8 //by can also containatraversal, groups by the number of incidentedges: 9 g.V().group().by(bothE().count()) Listing 4.32: Grouping operationsinGremlin queries.

Gremlin allowsustonot onlygroup by apropertyoralabel butalsoonthe resultof atraversal. To this end,wecan have atraversalinside anyofthe twooccurrences of by() in agrouping operation,wherethe resulteitherspecifies thekey to group or the projection. An exampleofthis is given in line 9wherewegroupall verticesbytheir number of incident edges,bothincoming and outgoing. Furthermore, by() alsoaccepts functions in the reducing_step.Similar to other query languages, Gremlin provides common aggregate functions like min, max, mean, count and sum.Together with lambdafunctions, we can essentially provideany functionasafunctiontoreduce-by (aggregate function),evenifitisnot providedbyGremlinout of the box.

4.2.2Pattern Matching All queriesuntil nowused imperativetraversals thatare at thefocus of Gremlinas query language, especiallyfor navigational queries. However, Gremlin alsoprovidesthe declarativepattern matching step match() thatcan be usedinany traversal. That allowsaquery to switchfromimperativetraversals to declarativepatternmatching and backagain in asingle query and with thesame traversalsource [Rod15]. In contrast to other graph query languages like Cypher, Gremlin alsouses traversals and thereforea navigational approachtoencode the structure of apattern[AAB+17]. The navigational approachhoweverresultsinpattern matching queries thatare often notascompactor easy to understand as in higherlevel,declarativelanguages like Cypher.Caused by the focus on navigational queries, some pattern matching queriescan be writtenevenwithout match() as we willsee in the followingtwo algorithms. The first queryinAlgorithm 4.33 is suchanexamplewherewedonot have to use the match() step to depict thepattern. This is especiallycaused by the fact thatweare looking forasmall patterninthis case as we are only interested in theconnection between

65 4. QueryLanguageAnalysis

a person and the city he/she is living in. As mentionedbefore, the as() step allows us to store areference to an entity thatwecan later access.This is notlimited to the patternmatching part. In this case, we select the personnodewith the given iid and store areference underthe name person.Wethen proceed by navigatingoveran outgoingedgelabeled by isLocatedIn and store that node as city.Now thatwehave those references, we can select the corresponding entities and project to some values. To this end,weappend one by() for eachentry in select to specifythe projection for the correspondingentity. An implementation of thesame queryusing thepattern matching step is giveninlines 6-9. The querystarts, similartomanynavigational queries, from thetraversal source by spawning atraversalonthe vertices. We then switchtothe patternmatching step via match().After saving areference to thestart node of thetraversal(person), we proceed as we would in anavigational query.Therefore, the implementation using pattern matching in this scenario looks very similartothe one using navigational constructs only. However, we will see amore sophisticated example lateroninAlgorithm 4.35. On a sidenote, thequery selects all propertiesofthe person and the iid of thecity. That does notexactly matchthe definitionofthe query in theSNB. However, the version of Gremlin that we use (3.0.1-incubating)doesnot provide asteptoprojecttocertain propertiesand we will showaway to dealwith thatinthe next query. 1 g.V().has("person", "iid",personId).as("person"). 2 out("isLocatedIn").as("city"). 3 select("person", "city").by(valueMap()).by("iid"); 4 5 //equivalentversionusing match: 6 g.V().match( 7 __.as("person").has("person", "iid",personId). 8 out("isLocatedIn").as("city") 9 ).select("person", "city").by(valueMap()).by("iid") Listing 4.33: Query 4(IS1) as aGremlin query.

As an implementation of Query 5usingthe patternmatching constructslooks very similar to theone fromthe previous query,wewill onlylookatthe implementation using traversalsinAlgorithm4.34.The traversalstarts at the person with agiven iid and then switches from that node to outgoing knows edges. Line2therefore operates entirely on thoseoutgoingedges and includes the order().by() steps that order them according to thedescending creationDate.Wethen move to the other vertex on the edge, theone thatwedid not camefrom, viathe otherV() step. As theoutput should containthree propertiesofthese vertices,wesave three referencestothem.Finally,we selectall thesereferencesand specify the projection-axisfor eachofthem.This achieves the same functionalityasthe project() step that wasintroduced in newer versions of Gremlin. 1 g.V().has("person", "iid",personId). 2 outE("knows").order().by("creationDate", decr).as("knows"). 3 otherV().as("a", "b", "c").

66 4.2. Gremlin

4 select("a", "b", "c", "knows"). 5 by("iid").by("firstName").by("lastName").by("creationDate") Listing 4.34: Query 5(IS3) as aGremlin query.

The implementation of Query 6inAlgorithm 4.35 is an exampleofapattern matching query where we use the match() step in amore sophisticated way. We have to start the patternmatching constructbysavingareference to the start node m.With that reference,weproceed with anavigationalquery in the rest of line 2. The commaatthe endofline 2intuitively endsthis traversaland anew one is started at thebeginning of line 3. We can either createanew reference,or, as we do here, re-use an existing one. Therefore,westart again at thenodethat we savedinmand proceed with another navigational query.Assoonaswehaveall referencesthat we need, we leave the match() step, thereforeswitching again to theimperativetraversalwhereweselectthe values thatshould be includedinthe output.Inthis case, we omitthe by() steps suchthat a reference to the entities themselves willbereturned. Thisisagainnot what thedefinition of thequery in the benchmark states. However, the query alsolacks thebooleanvalue thatdepicts whether the creator of theoriginal message (cr)and the creator of the replyingcomment (replyAuthor)knoweachother. This value could be generated by asimple checkinthe programming language whether cr is includedinfriends.When working with Gremlin, manyapplications aggregate and sortthe required information after theextraction, meaning that thequery often selects more information than needed and adata structureinthe programming language is then populated with the desired informationafterwards.

1 g.V().match( 2 __.as("m").hasLabel("post").has("iid",messageId).out("hasCreator").as("cr" ), 3 __.as("m").in("replyOf").hasLabel("comment").as("co"), 4 __.as("co").out("hasCreator").hasLabel("person").as("replyAuthor"), 5 __.as("replyAuthor").out("knows").hasLabel("person").fold().as("friends") 6 ).select("m", "co", "replyAuthor", "friends") Listing 4.35: Query 6(IS7) as aGremlin query. Note howeverthatweare notable to generatethe boolean value depicting whether cr and replyAuthor knoweachother insidethe query.

Nowthatwehave seenexamplesofpatterns in Gremlin, we will go oversome general remarksand furthercapabilities regarding patterns. Generally,Gremlin supports bgps, filters (forexample the where() and has() steps), projection to some extent but alsocgps using steps like union() [AAB+17, AAB+18]. Apartfrom the unionoftwo traversals we can alsospecify branches viathe choose() step and even cyclic occurrences via repeat().Sincetraversals are used inside thedeclarative matching step to encode the pattern, we can use allofthoseconstructs alsowhen describing patterns. That enables expressive patterns and thedocumentationevenstatesthat it allows for“patterns of arbitrary complexity”[Tin15]. Furthermore,asGremlin uses function composition

67 4. QueryLanguageAnalysis

and nesting, and match is nothing but afunction, we can recursivelynest match() steps.Taken together,wecan saythatpatterns in Gremlin add adeclarativeflavor to the language whichhowever relies heavily on imperativeconstructslikethe traversals. Compared to other graph query languages, this approachtogetherwith the language being low-leveland having acompactsyntax resultsinqueriesthat areoften not easy to read[HP13]. Evaluation Semantics In contrast to Cypher, Gremlinuses thehomomorphism-based bag semantics when evaluatingapattern[AAB+17]. Therefore, the matches are not restricted which can resultininfinite resultsets. That is alsoone of thereasons whyGremlin usesalazy evaluation on most queries. Usershave to be aware of this when writing queriesasthe evaluation of an unrestricted querycan be non-terminating,depending on theconcrete implementation of the language-variant.

4.2.3 Path As brieflymentioned before, Gremlin focuses on navigational queriesand providesmany constructstospecify and traverse paths.Traversals are at thecore of everyquery and we specify path-patterns viasteps on thesetraversals. Apartfrom simple,non-repeating path-patterns,Gremlin alsosupportsarbitrary or fixed iterations of traversals via the repeat() step. Therefore, Gremlinsupportsfull RPQs and is even moreexpressive as it allowsiterations of wholetraversals [AAB+18]. We can alsoreturn paths in their owndata structure,which is essentially alist of objectsand therefore similartoCypher. However, as thedatamodel does not support paths,wecannot augment them with propertiesorstore them.[AAB+17] Reachabilityquery7asks for thelengthofthe shortestpathbetween two person nodes. Gremlin does not provide afunctionthat selects such apath but we can find notesonthis on theRecipes page on theTinkerPop website18.Before we limitourselves to the shortest path in Algorithm 4.36, we start with adescription of the path in general. Startingat the node representing one person, we use repeat() together with until() to specify apath of arbitrarylengthover knows edges. The simplePath() step ensures that the traversers do not repeat theirpaths, thereforeavoidingcycles.Gremlin alsoprovides a path() step that does not restrictthe traversers and a cyclicPath() step that ensures that thetraversers do repeattheir paths.Inour case, we loop untilwefind the vertex with the given iid.Weuse path() to get that pathand count thenumberof verticesonitvia count().Sinceour understandingofapathlengthusuallyregards thenumberofedges, we subtract 1. Thisquery selects asingle shortestpathaswepick the first onevia next() and thefirstpath will alwaysbeashortestone in Gremlin. Note howeverthat this implementation cannot deal with thetwo special casesgiven in the definitionofthe query: if person1Id=person2Id,this implementation will not return 0but searchfor apath with lengthatleast1.However, this case canbeeasily handled

18https://tinkerpop.apache.org/docs/current/recipes/#shortest-path

68 4.2. Gremlin

in the programming language by acomparison of thetwo parameters.Furthermore, if the two personsare notconnected over knows edges, thequery willeithertestall simple paths or run intoatimeout depending on theconcreteimplementation of the language-variant. Whereas some versions of Gremlinprovide access to thenumberof iterations via loops() and thereforeallowfor abounded repetition, we didnot manage to use this functioninour version. 1 length =g.V().has("person", "iid",person1Id). 2 repeat(out("knows").simplePath()).until(has("person", "iid",person2Id)). 3 path().count(Scope.local).next(); 4 length =length -1; //as lengthcontains the number of vertices Listing 4.36: Parts of Query 7(IC13) as aGremlin query.

The queryinAlgorithm 4.37 integratesthe patternmatching functionalityofGremlin with apath-query. We start with aselection of thepersonand move up to three times (the argument of times() specifies an upper bound) over an outgoing knows edge. In line 3, we removeduplicates via dedup() to avoid cycles and limitourselves to nodes with thegiven firstName.Now thatwehave areference p to those persons, we switch to the declarativepattern matching style andgathermore informationrelated to p. If we compare this query to its implementation in Cypher in Algorithm4.17,this version looks quitecompact. However, this query does notfully matchthe definitionofthe query given in thebenchmarkand is onlygiven to get afeelingfor suchscenarios19. The implementation given here for exampledoesnot includethe distance between the startingpersonand the personwith the given firstName.Furthermore, thequery belowmatchesonly on thosenodes p where thecompletepattern matches, i.e. when anode p hasoutgoing isLocatedIn, workAt and studyAt relationships.Asnot every personhas to have work and study related information in ourdata model, we used OPTIONAL MATCH for thisinthe Cypher query.However,there is no step in Gremlin thatmatches apatternonly if it exists. We can achievethis functionalitybysplitting the query into multiple ones, where thefirst one collects thefriendswith the given first nameand theirlocation. Passing this information on to subsequentqueries, we can then checkwhether work and study related information existsand,ifapplicable, collect this. 1 g.V().has("person", "iid",personId). 2 repeat(out("knows")).times(3). 3 dedup().has("firstName",firstName).as("p"). 4 match( 5 __.as("p").out("isLocatedIn").as("locationCity"), 6 __.as("p").out("workAt").as("company").out("isLocatedIn").as("cc"), 7 __.as("p").out("studyAt").as("uni").out("isLocatedIn").as("uc") 8 ).select("p", "locationCity", "company", "cc", "uni", "uc").by(valueMap()); Listing 4.37: Parts of Query 8(IC1) as aGremlin query.

19As forthe otherqueries, an implementationofthis query as it matches the benchmark’sdefinition canbefound in our GitHub repository.This implementationhoweverisnotably longer with around 150 lines of code, uses multiple, more sophisticated Gremlin queries and aggregates all information in Java.

69 4. QueryLanguageAnalysis

The query in Algorithm 4.38 does not containany new concepts apart from the limit() step that,asthe name suggests, limits the number of traversers fromthatpoint on. We use this step to output informationregardingthe 10 mostrecent messagesfrom the user, by first sortingthem by their creationDate in descending order and then limitingthe result to 10 entries.Again, thequery as given here cannot deal with aspecial case,namelythatone of themost recent messagesisapost and not a comment.In thatscenario, thequery would not match the message at all as a post has no outgoing replyOf edge. We couldachievethis by splitting it into multiple querieswherewe first select the 10 most recent commentsorposts, and gather therestinanother query dependingontheir type.

1 g.V().has("iid", personId). 2 in("hasCreator").as("message").order().by("creationDate", decr). 3 repeat(out("replyOf").simplePath()).until(hasLabel("post")).as("post"). 4 out("hasCreator").as("op"). 5 select("message", "post", "op").by(valueMap()).by("iid").by(valueMap("iid", "firstName", "lastName")). 6 limit(10); Listing 4.38: PartialimplementationofQuery 9(IS2) in Gremlin. Note that this implementation cannot dealwith the case thatarecent message is a post.

Evaluation Semantics Under default settings, Gremlinuses arbitrary path semantics where paths in the evalu- ation are not restricted.Therefore, the evaluation of aGremlin query cancontainan infinite number of paths but also paths of infinite lengths. As brieflymentionedbefore, we can alsoswitchtono-repeated-node semantics using the simplePath() step that avoidsnon-terminatingtraversals. We have alsomentionedthe cyclicPath() step thatuses aspecialsemantics where onlytraversers thatrepeat anyoftheirentities (nodes or edges) on theirpathare kept. That does not avoid non-terminatingtraversals as suchaquery canforever loop overacycle in the graph ifthe length is not restricted. Regarding possible outputs of queries, GremlinbehavesverysimilartoCypher as it supports fixed arity outputs likesingle values, propertiesorentities as well as variable arity outputslikepaths and lists.[DXWL19]

4.2.4 Data Manipulation Anodecan be inserted in atraversalusing the addV() step that takes alist (or arrayin Gremlin-Java)asinput and createsanode. The array can be used to setthe label,idand propertiesofthe entityatcreation timeaswecan see in Algorithm 4.39. Propertiescan be added to an existingentityeithervia areference as in line 4orinatraversalasinline 5. Similarly,wecan remove apropertyeitherfrom areference (line6)orinatraversal (line 7). Note howeverthat there are two differentsteps for this: whereasproperties and entities are deleted via remove() whenusing areference, drop() providesthe same functionalityfor traversals.Regarding the insertion of edges (lines9-10), we first select the target node and connect the existing node to the target via the addEdge(

70 4.2. Gremlin

) step. Note that this can alsobedone in atraversalbut since the syntax for this differs heavilybetween theversion we are using andnewer ones,weomitsuchan exampleand refer thereader to thedocumentation. Thequery given belowdepicts only asmallpart of Query10and the full query is again giveninour GitHub repository. 1 Object[]keyValues = new Object[]{T.label, "person", "iid",personId, " firstName",firstName}; 2 Vertex person =g.addV(keyValues).next(); //store reference to vertex 3 4 person.property("lastName",lastName); //add propertyvia reference 5 g.V(person.id()).property("cd", "test"); //add propertyvia traversal source 6 person.property("lastName").remove(); //remove property via reference 7 g.V(person.id()).properties("cd").drop(); //remove property via trav. source 8 9 Vertex place =g.V().has("iid",cityId).next(); //get target vertex 10 person.addEdge("isLocatedIn",place); //add edge 11 [...] Listing 4.39:Simplified versionofQuery 10 (INS1) in Gremlin that inserts onlysome propertiesand asingle edge.

UnlikeCypher, there is no wayinGremlin to remove avertex without alsoremoving its incident edges.Tothatend, the drop() step, whencalled on avertex,automatically removesall incidentedges together with the vertex. This step is alsousedtoremove propertiesaswehave seen above, and can also be used to remove edges. As we can see in Algorithm 4.40, thesyntax is notably more complicated compared to Cypher’swhen triggeringthe removal of adjacententities. In thisexample,weselectthe initialcomment in the first line and passthis through union() via the identityfunction __() in line 3. We repeatedly move to adjacentvertices that are connected over incoming replyOf edges and emitthem suchthat all those verticesare contained in theresult of union(). 1 g.V().has("iid",commentId). 2 union( 3 __(), //identity function to pass the node withiid=commentId 4 repeat(in("replyOf")).emit() //emit all replies, not only the leaves 5 ).drop(); Listing 4.40: Query 11 (DEL7) as aGremlin query.

Propertiescan be updated the same wayasthey are inserted: using the property() step. When evaluating property("key","value"),the system willfirst removeall propertiesunderthe key on thegiven entitybefore thenew key-valuepropertyisadded. Therefore, thequery in Algorithm 4.41 can simply set thenew property without having to delete the previous value. 1 g.V().has("iid",personId).outE("studyAt").as("e"). 2 otherV().has("iid",univId). 3 select("e").property("classYear", "2011") Listing 4.41: Query 12 as aGremlin query.

71 4. QueryLanguageAnalysis

On aside node, Gremlin supportsmulti-valuepropertiesonvertices, meaningthata vertex canhave multiple propertyvaluesunderthe same key.When addingorupdating suchapropertyusing the versionofGremlin in Titan, we can specify that thevalues under agiven keyare stored as alist (allowingduplicates),asaset (noduplicates)oras asingle value (ruling out multi-values on that property key). Apartfrom multi-value ones, Gremlinalsosupports meta-properties, thatare propertiesonproperties. This can forexample be usedtostore access permissionsoncertainpropertiesthat canbechecked by an application as we can see in Algorithm 4.42.

1 v=g.V().has("name", "John Doe").next(); 2 g.V(v).properties("name").property("permission", "all"); //set permission 3 g.V(v).properties("name").hasValue("permission", "all"); //checkpermission Listing 4.42: Example of thecreationand access to meta-properties in Gremlin.

4.2.5 DataDefinition The datamodel used by Gremlin is adirected, attributedmulti-graphwhereeachentity can onlyhaveasingle, immutable label.Therefore, we cannot change thelabel of an existingedgeorvertexafter its creation.Gremlin alsoallows us to connecttomultiple graphs. Atraversalhowever can onlyaccess asingle graph as it is spawned froma traversalsource (g in our examples),thatinturn is associatedwith agraph as we have seeninAlgorithm 4.23. Regarding aschema, Gremlinasspecifiedbythe TinkerPop projectdoesnot provideany steps or functions to view, create,update or delete aschema and is thereforecompletelyschema-less. Eachvendor that integrates theTinkerPop framework to become TinkerPopenabled canchoosewhether aschema is required, supported, or if it is aschema-lesssystem.Tothis end,avendor can providefunctions to access and modify aschema or other constraints.

4.2.6 FurtherFeatures

• Composability: The versionofGremlin usedinour examplesdoesnot support multiple occurrences of either the V() or E() steps and thereforeno(linear) composition of queries. However, there is support forthis in newer versions as in lines1-2 in Algorithm 4.43. We can alsonest multiple traversals as we can see in lines4-6, where next() triggersthe evaluation of theinnertraversal(line 5) that returns alist of strings which in turn is usedasparameter in theouterone thatreturnsall persons with the same firstName as theperson with id 102 or 103.

1 g.V().has("name", within("John", "Mary")).as("person"). 2 V().hasLabel("software").as("software").select("person","software") 3 4 g.V().has("firstName", within( 5 g.V().has("iid", within("102","103")).values("firstName").fold(). next())

72 4.2. Gremlin

6 ) Listing 4.43:Example of aqueryusing multiple V() steps (as it is supportedinnewer versionofGremlin) and of aquery with anested traversal.

As brieflymentionedbefore, we can alsonest manysteps,asastep is essentially afunctionand Gremlin supports, andevenrelies on, function composition and nesting.Apart fromsuchapplications where we convert the output for exampletoa string or alist of strings and compare properties, Gremlinqueriesare generally not composable.That restrictionmightcome from thedesignofthe language,where a queryisrepresented by atraversalthat is lazily evaluated by default. Furthermore, the setting of Gremlinasalanguage that is implementedaslanguage variant and embeddedintoaprogramming language places thequery language in adifferent position compared to most other querylanguages.Thatfor exampleallows us to split abigger queryintosmallerones, save the outputs in datastructures of the programming language and even outsourcesome computation or querylogic to the programming language. Thisallows foracompletely differentway of writing queries,composing them and relaying theirlogic to oneanother.

• Paths as first-classcitizens: Similar to Cypher, Gremlin providesthe possibilitytooutput paths in theirown data structure. However, paths are not raised to first-class citizens as thereisno support forthem in the data model whichmeans that we cannot save them in the model.

We have seensome of theOLTPfunctionalityofGremlin and itsratherunusual approach as alow-level querylanguage that is moreimperativeand merges with the implementing programminglanguage. Thisapproachhoweverallows vendors to augment thelanguage with further functions that arenot part of theversion in theTinkerPop framework.If we look at Gremlinasadomainspecific language builttotraverse graphs, and using the fact thatwecan implement ourown language-variant, anatural extensions leads to Domain SpecificLanguages (DSL) thatabstract fromthe underlying graph structure to the domain itself.Asanexample,wecan write asocial DSL thatabstracts from the general structure of agraph to the domain of asocial network. This allowsusto introducedomainspecific steps that allow for more compact queriesaswe cansee in Algorithm 4.44. Here, we assume that theDSL providestwo new steps: persons() thatstarts traversers on thenodes, filtersontheir label andonthegiven name,and knows() thattraverses to adjacent personsand filtersonthe given name.

1 g.V().has("person", "name", "John Doe").out("knows").has("person", "name", "Mary Ann") 2 g.persons("John Doe").knows("Mary Ann") Listing 4.44: Example of two equivalentqueries: first in Gremlin and then in ahypothetical social DSL.

73 4. QueryLanguageAnalysis

Apartfrom this, Gremlin providesotherinteresting functions that setitapartfrommany graph query languages. The subgraph() step, that produces an edge-induced subgraph fromatraversal, is suchanexample.One could think thatthis step makes Gremlin at least somewhat composable, butthis is notthe case as atraversalsource can only query asingle graph.Inother words, we have to create anew traversalsourceonthe subgraph before it canbequeried. Most functions of thelanguage that we mentionedare especiallyrelevantinOLTPworkloads. However, the imperativenature of Gremlinmakes it “more suitablefor expressinggraph analysisalgorithms such as PageRank, Betweenness Centrality,etc”[vRHK+16]. To thisend,there is adedicatedAPI in the TinkerPop frameworkthat provides manysuchfunctions. As we focusonOLTP workloads, we do notlookintothese functions in moredetailand refer thereader to thedocumentation formore information on this.

4.3PGQL

We have already seenadeclarativequery language that aligns closelywith SQL:Cypher. The Property Graph Query Language (PGQL) wasinspiredbyboth, SQLand Cypher, and takes thealignmenttoSQL astep furtherasit“aimstofollowSQL syntaxwhere possible”[Pla18]. Thisdoesnot come as asurpriseasthe language is developed by Oracle and offered in some of their products. Unlike Cypher queries that are conceptually structured linearly, PGQL queriesuse theselect-from-where structure inherenttoSQL. As PGQL is not standardized, Oracle aims to keep thelanguage in sync with the standardizedfeatures of SQLbyprovidingthe same query structure, the same clauses and most functions [HLP+19]. As aside note, there is also ongoing work insideISO/IEC JTC 1/SC 32’s working group 3, the same group thatisworkingonthe new GQL standard, to add graph queryingfunctionalitytoSQL itself under SQL/PGQ.Although there is no final specification of this functionalityasofnow,PGQL aimstoprovide patternmatching capabilities that roughlyresemblethe ones on this extension [HLP+19]. PGQL,among other languages like Cypher or G-CORE influencethe creation of this addition and it is thereforepossible that afuture versionofSQL has capabilities similar to PGQL regarding graphqueries. Similar to SQL,aqueryinPGQL is at least composedofaSELECT and FROM clausethat can be followedbyoptionalclauses like WHERE, ORDER BY or GROUP BY [SHvR+16]. Graph patternsform the basis of PGQL andcan be specifiedinthe FROM clauseusing one or more MATCH clauses [vRHK+16]. As patterns are specifiedusing an ASCIIart representation similar to Cypher, these MATCH clauses closelyresembletheir counterparts in Cypher[FGG+18b]. The SELECT clauseinturn resemblesits counterpart in SQL as it allowsuserstospecify theprojections or aggregations that are contained in the output of aquery.Algorithm 4.45 shows asimple PGQL query and itsselect-from-where structure that matchesasingle node p:Person. 1 SELECT p.firstName, p.lastName 2 FROM MATCH (p:Person) 3 WHERE p.id =123

74 4.3. PGQL

Listing 4.45: Asimple PGQL query thatretrieves the first and last name of the person with agiven id.

Anothersimilaritybetween PGQL and SQLaswell as Cypherregards theevaluation of aquery. We have already mentionedthataCypherquery outputs atablecontaining bindingsform query variablestoelements. Oracle’s PGQL works analogously as it returns tablescontaining variablesand theirbindings. These tabular result setsclosely resemble the onesfrom SQLqueries, whichallows PGQL queriestobenested inside SQLqueries [vRHK+16]. As the graph query language extends the relationalquerylanguage also with graph specificdatatypes like vertex, edge and path,variables of thesetypes can be includedinthe output.Nonetheless, theoutput is alwaysstructured in atabular formand suchgraph specific typescan onlybeincludedasentries in that table.Using thesegraphrelated types, we can for exampleoutput verticesand theirids and labels via the query in Algorithm 4.46. On aside note, verticesaswell as edges can be augmented with none, one or multiple labels inthe data model usedbyPGQL.Assuming that an entity hasonly asingle label, we can use the LABEL functiontoretrieveit. However, this function produces an erroronentities with none or multiple labels.The LABELS function on theother hand produces aset of labelsand candeal with all casesbyreturning an empty set on entities without labelsand aset containing thelabel(s)onentities with at least one.

1 SELECT id(p), labels(p) AS lbl, p FROM MATCH (p:Person) LIMIT 5 Listing 4.46: APGQL query thatselects theids, labels and the verticesthemselves of 5 persons.

When looking at the MATCH clauses in thequeriesinAlgorithm 4.47, we can see the close resemblance to the syntax of patterns in Cypher. Similar to Cypher, anonymous nodes are depicted viaempty () brackets and anonymous edges viaempty [] brackets or by omitting them altogether. Furthermore, edges in the data model usedbyPGQL have to be directed but we can match edges in both directions using an any-directed edge pattern (“-[]-”or“--”) as in thefourth query.Incontrast to Cypherhowever, there is no compactsyntax to specify restrictionsonentities likeinAlgorithm4.2, as this would notbecompatible with the goal of following the SQL syntaxwhere possible. Apartfrom the patternmatching part in the MATCH clause(s),most of thefunctions and clauses usedoutside of MATCH are coming directly from SQL. That includes for examplethe dedicated GROUP BY clauseaswecan see in thefirst two queries 20.Another resemblance to SQL regards the composition of queriesusing either existentialsubqueries via (NOT) EXISTS or scalar subqueries as in the third and fourthquery.Ifwecompare the fourth query to the implementation of thesame queryinCypher in Algorithm 4.4,

20Remember: Cypher for example takes adifferentapproach to grouping as the result is either grouped implicitly or on thekey specified in the WITH clause.

75 4. QueryLanguageAnalysis

we can observethe difference between theconceptually linear structure in Cypherwhere queriescan be evaluated linearlyfrom top to bottom,and thestructureofPGQL and SQL where innerqueries are often evaluated before outer ones. Furthermore, we can parameterize queriesusing abind variable “?”aswedowith the lastName in line 16.

1 SELECT p.age, COUNT(*) FROM MATCH (p:Person) GROUP BY p.age 2 3 SELECT label(e) AS lbl, COUNT(*) 4 FROM MATCH () -[e]->() /*anonymousnodes*/ 5 GROUP BY lbl ORDER BY COUNT(*) DESC 6 7 SELECT p.name AS name, 8 (SELECT COUNT(o) FROM MATCH (p) -> (o)) AS outgoingConnections, 9 (SELECT COUNT(DISTINCT(o)) FROM MATCH (p) -[:KNOWS]- (o)) AS knows 10 FROM MATCH (p:Person) 11 12 SELECT o.firstName, o.lastName 13 FROM MATCH (m) -- (o:Person) /*short form of anonymous, any-directed edge*/ 14 WHERE m.firstName IN ( 15 SELECT m1.firstName FROM MATCH (p:Person) -- (m1:Person) 16 WHERE p.firstName = ’John’ AND p.lastName =? /*bind variable*/ 17 ORDER BY p.firstName DESC LIMIT 1) Listing 4.47: Four PGQL queriesthatshowthe resemblancebetween PGQL and SQL.

As PGQL is onlyimplementedinsomeresearchprojects and commercial productsof Oracle, we did not manage to test thequeriesinthis chapter on adatabase.However, we usedaparsingsoftware fromOracle21 to ensure thatthe queriesare syntactically correct.

4.3.1Structure Independent We omitthe code of Queries 1and 2hereasboth of them use only avery simple pattern matching part thatmatches asingle node and therestofthe queryisSQL code. To implement Query 2, we use the CASE construct thatwewillalso see in Algorithms 4.49 and 4.54. RegardingQuery 3, we runintoaproblemofthe data model usedbyPGQL as, similar to Gremlin’sdata model, it does not support lists.Therefore, we cannot store the languagesspokenbyapersoninalist in the first place.Wecouldconcatenatethe languages in astring,store this stringand split it up again whenretrieving it to count the number of entriesfor this query.Asthis would be done in aprogramming language and not in PGQL, we cannot implement this querydirectly in the querylanguage. If we look at thefunctions provided by PGQL to aggregate dataand work with numeric or other datatypes, we find many functionsthatare alsoavailable in SQL.Wecan for exampleuse SUM, COUNT, AVG, ABS, CEIL(ING), FLOOR, ROUND on numeric data typesbut alsoaregex checkeronstrings. PGQL furthermore addsvertexand edge functions like LABEL(S), ID, IN_DEGREE and OUT_DEGREE where thelatterones are not structure independent but nonethelessinterestinginagraph querylanguage.

21https://github.com/oracle/pgql-lang

76 4.3. PGQL

4.3.2 Pattern Matching We have already seen parts of thepattern matchingfunctionalityand the resemblance to the Cypher syntax in the previousexamples. As Query 4contains onlyavery basic graph pattern, we omitits implementation and continuewith the query in Algorithm 4.48. The patterninthis query connects a Person to some node overasingle knows edge. We can see in line 1that edges can have propertiesand we can access them in thesame way as propertiesonvertices. Furthermore, we can order theresult using thesame syntax as in SQL:inanORDER BY clause.

1 SELECT f.id AS personId, f.firstName, f.lastName, e.creationDate AS friendshipCreationDate 2 FROM MATCH (p:Person) -[e:knows]-> (f) WHERE p.id =? 3 ORDER BY friendshipCreationDate DESC,personId ASC Listing 4.48: Query 5(IS3) as aPGQLquery.

To implement Query 6, we have to checkfor theexistence of an edge between two nodes. While we use the OPTIONAL clausefor this in our Cypherimplementation,there is no equivalentclause in Gremlin and we could notimplementthis in asingle query. PGQL alsolacks adedicatedclause forthis but we can simulatesuchanoptional pattern using an optional pathexpression[AAB+18]. As we cansee in Algorithm 4.49, we specify the fixed pattern in thefirst MATCH clauseinline 3and use asecond MATCH forthe optional pattern. In ourcase, the connection is optional which is depictedbythe question markafter the edge. We test for theexistence of this edge using a CASE predicate in line 2and, depending on that result,set avaluefor the knowsAuthor variable. This implementationshows thatthere canbemultiplecorrelated MATCH clauses in asingle query.PGQL alsosupportsuncorrelated ones that are evaluated as aCartesian product of theindividualevaluations.

1 SELECT c.id, c.content, c.creationDate, p2.id, p2.firstName, p2.lastName, 2 CASE k IS NOT NULL WHEN true THEN true ELSE false END AS knowsAuthor 3 FROM MATCH (p1:Person) <-[:hasCreator]- (m:Message) <-[r:reply_of]- (c: Comment) -[:hasCreator]-> (p2:Person), 4 MATCH (p1) -[k:knows]-?(p2) 5 WHERE m.id =? Listing 4.49: Query 6(IS7) as aPGQLquery.

Evaluation Semantics Earlierversions of PGQL used ano-repeated-node semantics as default and included special keywordstoswitchtoahomomorphism-based semantics [vRHK+16]. From version1.0 onward, ahomomorphism-based semantics is used as default as it translates better to and from SQL, among other things [vR17]. Nonetheless, we can switchtoan isomorphism-based semantics using the ALL_DIFFERENT predicate. As this predicate can deal with verticesaswell as edges, we can achieveno-repeated-node,no-repeated-edge and even no-repeated-anything semantics.

77 4. QueryLanguageAnalysis

4.3.3 Path The graph query language extends SQL not only with apatternmatching functionality but wasalsothe first propertygraph query language with full support of RPQs[Pla18]. To this end,userscan specify namedpathpattern macros at thebeginning of aquery. These patterns can be referencedlater on in thequery,eitherwhen specifying other path patternmacros or in a MATCH clause. Suchnamed path patterns were first proposedin PGQL [FGG+18b]and cancontainarbitrary regularpathpatternsand alsoanoptional WHERE clause. The WHERE clauseallows us to not onlyaccesslabelsonthe path but also specify requirementsonpropertiesofboth, verticesand edges [vRHK+16]. Thisisneeded formore sophisticated patterns in the propertygraph data model as we otherwise cannot compare properties alongapath.Onthe other hand, this exceeds the functionalityof RPQsand we land in the areaofREMs thatallowfor exactlythat, namely to compare data valuesonapath. Nonetheless, PGQL supports onlyasubset of REMsasitis“hard to come up withasyntaxfor REMsthat is declarative”[vR17]and PGQL is adeclarative language. SincePGQL furthermore providesapath data type thatallows us to outputs paths,the query language alsosupportsECRPQs[Bar13]. Algorithm 4.50 contains aquerythat uses apath pattern macro. The pattern is specified and named eq_voltage_hop in thefirstline and used in thethird one viaareference on that name.Forward slashes “-//-”instead of squarebrackets “-[]-”onedges denote thereachabilitysemantics. Thissemanticsisuseful when we are only interested in whetherthere exists at least one pathbetween two vertices and not in all those paths. We will seelateronthatwecan alsouse this semantics on other,non-macro, paths. However, path patternmacros can onlybeused with this semantics. In this query,weuse afunctionalityofREMs as we compare the voltage propertiesofnodes on thepaths.

1 PATH eq_voltage_hop AS (n:Device) -> (m:Device) WHERE n.voltage =m.voltage 2 SELECT y.name 3 FROM MATCH (x) -/:eq_voltage_hop+/-> (y) 4 WHERE x.name=’generator_x29’ Listing 4.50:APGQLquery using apath pattern macrowith data comparison. The query selects all devices reachable from generator_x29 via apath where eachdevice hasthe same voltage.Source: GQL-StatusUpdate AndComparison to LDBC’s Graph QL proposals,slide 21 [vR17].

Although this is afunctionality thatsetsPGQL apart from many other graph query languages, we will notuse it in thefollowing queriesaswedonot need them for theSNB queries. As we can seeinAlgorithm 4.51, PGQL providesaSHORTEST function that matches asingle shortestpaththatsatisfiesthe pattern. We can computethe length of thepathbycounting the number of edges on it as we do in line 1. In this query, we are not sureifsuchapath existsatall, whichiswhy we use a CASE expression to test whether thevariable e is null22.

22Note that thespecification does not state if e would indeed be null if there is no suchpath.This would have to be tested andwecan only assure that the query is syntactically correct.

78 4.3. PGQL

1 SELECT CASE e IS NULL WHEN true THEN -1 ELSE COUNT(e) END AS shortestPathLength 2 FROM MATCH SHORTEST ((p1:Person) -[e:knows]-* (p2:Person) ) 3 WHERE p1.id =?AND p2.id =? Listing 4.51: Query 7(IC13) as aPGQL query.

Apartfrom queryingfor asingle shortestpath, we can alsoquery forthe “TOP K SHORTEST”ones as in Algorithm 4.52. Thisquery furthermore specifies arestriction on the weight value(s) of theedges on thepathdirectly insidethe pattern. As aside note, PGQL alsoprovidescheapest pathfindingbasedonacost function via“(TOP K) CHEAPEST”.

1 SELECT src, ARRAY_AGG(e.weight), dst 2 FROM MATCH TOP 3 SHORTEST ((src) (-[e]-> WHERE e.weight >10)* (dst) ) Listing 4.52:APGQL query thatqueriesfor the3shortest paths between src and dst where eachedgealong the path hasaweight greater 10. Source: PGQL 1.3 Specification [Ora20].

As mentionedwhen we analyzedQuery 3, PGQL does not support lists.Thereis howeverone exceptionfor query outputs:the horizontal aggregation function ARRAY_AGG aggregatesthe given valuesintoanarray/list.Whereasverticalaggregation functions work similartoonesinSQL, whereagroup of valuesfrom differentrowsare aggregated, horizontalfunctions aggregateoverapath. However, we cannot aggregate overarbitrary variablesorpropertiesonapath whichwouldbeneeded to collect study-and work- related informationinQuery 8. Algorithm 4.53 contains some parts of thequery that we will brieflygothrough. We use ARRAY_AGG to show howwecan aggregate multiple occurrences of thesame property,for exampleuniversitynames,inalist.Furthermore, {1,3} specifies alower and upperbound for thenumberofknows edges that can be traversed. Apartfrom suchspecific bounds, we canuse “*”for paths of anylength (including 0),“+”for paths of lengthatleast 1and “?”for paths of length0or 1.

1 SELECT /* [...] */ 2 ARRAY_AGG(u.name) AS friendUNames, ARRAY_AGG(es.classYear) AS friendUYears, ARRAY_AGG(uc.name) AS friendUCities 3 FROM MATCH SHORTEST ((p:Person) -[e:knows]-{1,3} (f:Person)), 4 MATCH (f) -/es:studyAt?/-> (u:University) -/:isLocatedIn/-> (uc:City), 5 /* [...] */ Listing 4.53: Parts of Query 8(IC1) as aPGQL query.

Algorithm 4.54 contains asimplifiedversion of Query9where we integrate reachability semantics into alargerpattern. Furthermore, this algorithm and alsothe previous one showthat we can use reachabilitynot only on apath pattern macro but alsoonsimpler paths directlygiven in MATCH.

1 SELECT m.creationDate AS creationDate, po.id, op.id, op.firstName,

79 4. QueryLanguageAnalysis

2 CASE m.content IS NOT NULL WHEN true THEN m.content ELSE m.imageFile END, 3 FROM MATCH (p:Person) <-[e:hasCreator]- (m:Message) -/:replyOf*/-> (po:Post )-[:hasCreator]-> (op:Person) 4 WHERE p.id =? 5 ORDER BY creationDate DESC LIMIT 10 Listing 4.54: Simplified versionofQuery 9(IS2) as aPGQL query thatselects only some properties.

Evaluation Semantics PGQL avoidspaths of unlimitedlengthand an unlimitednumberofpaths in the evaluation by allowingpattern quantifiers only when using reachability semantics or inside SHORTEST patterns. Therefore, quantifiers like “*”or“+”but alsoquantifiers with custom lowerand upper boundscan only be used inside reachability demarcated paths, where they are given inside thedemarcation: “-/var:label/-”, or using standard path syntax inside“(TOP K) SHORTEST”, where they are given outside thedemarcation:“-[var:label]-”. We have already mentionedthat PGQL also provides cheapest pathfinding butthis also limits the found paths eithertoa singleone or to thetop k. Apartfrom thesebuilt-in functions thatuse either ashortest or cheapest pathsemantics,wecan implementour ownsemantics and use itinPGQL via auser-defined function.

4.3.4Data Manipulation With the introductionofversion1.3 in March2020, PGQL introducedthe graph modi- fication clauses INSERT, UPDATE and DELETE.Inearlierversions, thequery language waslimitedtoDQL queries. Algorithm 4.55 depictsasimplified versionofQuery 10 where we omitmost properties for brevity. Compared to theCypher implementation of thequery in Algorithm 4.18 where we couldpass alist of Ids and unrollthem inside the query using UNWIND,thereisnosuchfunctionalityinPGQL. Therefore, we cannot pass avariablenumberoftag IDs for example.Toachievethis functionality, we have to write multiple queriesorinclude asingle EDGE clausefor eachedge thatistobeinserted. Eitherway,the query hastobeadapted depending on thenumberofedges that are inserted,whichisnot necessaryinCypherwherewecan use the same query to insert a variable number of edges. Apartfrom that, the query usesthe DATE and TIMESTAMP typesthat are available next to TIME and further date/time related typesfor time zones. 1 INSERT VERTEX p LABELS (Person) 2 PROPERTIES (p.id=1234, p.birthday=DATE ’1990-01-01’,p.creationDate= TIMESTAMP ’2020-01-01 16:15:00’), 3 EDGE e BETWEEN p AND c LABELS (isLocatedIn), 4 EDGE ew BETWEEN p AND co LABELS (workAt) PROPERTIES (ew.workFrom=1990), 5 EDGE et BETWEEN p AND t LABELS (hasInterest) 6 FROM MATCH (c:City), MATCH (co:Company), MATCH (t:Tag) 7 WHERE c.id=? AND co.id=? AND t.id=? Listing 4.55: Simplified versionofQuery 10 (INS1) in PGQL that inserts onlysome properties.

80 4.3. PGQL

Similar to Gremlin, the removal of avertex alsotriggersthe removal of allincidentedges of that vertex. Therefore, we can remove acomment,all incident edges and replying commentsvia the query in Algorithm 4.56. 1 DELETE c, r 2 FROM MATCH (c:Comment) <-/:replyOf*/- (r:Comment) 3 WHERE c.id =? Listing 4.56: Query 11 (DEL7) as aPGQL query.

Propertiescan be updated similartoSQL using the UPDATE clauseincombination with SET.Asthe implementation of Query 12 combinesfunctions of SQLand Cypher, we omitithere andrefer the interested reader eithertoour GitHub repositoryorthe PGQL specification.

4.3.5 DataDefinition PGQL usesapropertygraph data model with support for multiple graphs. Each graph hasaname and we can specify which graph we want to query in the ON clause. This clausecan be omitted if adefaultgraph is set, whichhowevercannot yetbedone through PGQL itself but onlythroughAPIs on the database system.Each MATCH clausecan be augmented with an optional ON that specifies the graph that thepattern is matched on as we can see in Algorithm 4.57. In all our queries, we assume that we are working on asingle graph that is setasdefault graph.Eachpropertygraph consists of vertices and edges that can have zeroormore labelsand zero or moreproperties. As mentioned before, edges in thatdata modelhavetobedirected. 1 SELECT p.name,q.name, a.number 2 FROM MATCH (p) -[:knows]-> (q) ON social_graph 3 MATCH (a:Forum) ON other_graph WHERE a.name=’test’ Listing 4.57: APGQL query thatqueriesfor two persons fromthe social_graph and aforumfrom the other_graph.

Similar to Cypher and Gremlin, PGQLdoesnot support aschema. However, we can createagraph outofdatafrom existing tablesvia the “CREATE PROPERTY GRAPH” statement.Usingthis statement, we specify thesource tablesfor verticesand edges, the labels for the entitiesand theirpropertiesaswell as theway theywill be connected via edges. In other words, if we assumethat there are tablesrepresentingverticesand tables representing edges, we can use this statementtobuild apropertygraphfrom thatdata. Therefore, we specify thegraph schemathatiscreated fromthe tabular data, butthis schema is only usedonce to create the graph,and not enforcedlater on wheninteracting with it using DMLqueries. This setting,wheredatafrom (relational) tablesisused to createaproperty graph,makes sense if we look at theposition of PGQL in Oracle’s products. There,itisused to query propertygraphs where the underlying dataisoften23

23UsingPGX,the system canswitchtoanativegraph storage when thedata is storedin-memory.In this setting,iteffectively becomes anative graphstorage.

81 4. QueryLanguageAnalysis

stored in arelational database [BPG+19]. In other words, PGQLprovidesagraph-view of datathatisstoredinrelational tables. Algorithm 4.58 contains an exampleofsuch astatementthatcreates apropertygraph named social_network with datafrom 5tables. The graph will be structured as specified in Query 13, where a person is livingat a street which in turn islocatedin a city.

1 CREATE PROPERTY GRAPH social_network 2 VERTEX TABLES ( 3 persons 4 LABEL person PROPERTIES (person_id, firstname, lastname), 5 streets LABEL street PROPERTIES (street_id, name, length), 6 cities LABEL city PROPERTIES ARE ALL COLUMNS 7 ) 8 EDGE TABLES ( 9 livingat /* table name */ 10 SOURCEKEY (person_id) REFERENCES persons 11 DESTINATIONstreets LABEL isLocatedIn PROPERTIES (housenumber), 12 islocatedin /* table name */ 13 SOURCEKEY (streets_id) REFERENCES streets 14 DESTINATIONcities NO PROPERTIES 15 ) Listing 4.58: Parts of Query 13 as aPGQL query.Note that this createsanew property graph as there is no waytoadapt an existinggraph.

4.3.6 FurtherFeatures

• Composability: As we have seeninsome of our examples, PGQL supports acomposition of queries throughsubqueriessimilartoSQL. We have also mentionedthataquery returns atablecontaining values. Although thesetables can containgraphspecific types likevertices, edges and paths, thesevaluesare alwayscontained in thetable. As a PGQL query takes agraph as input and produces atable, it followsthat the query language is not composable[AAB+18]. In thesetting of PGQL as aquerylanguage that is used almost solelyinthe environmentofOracle’sdatabases,whichare mostly relational ones,itisunlikely thatafuture versionofthe language discards thetabular result format as this would entail thatPGQL queriescannot be nested insideSQL queriesany more. On theother hand, the paper thatintroduced PGQL mentionsagraph data type that“allows for graph construction and querycomposition”[vRHK+16]. Graph construction denotes constructs that allow us to createagraph from one or multiple existingones. According to thepaper,wecan construct agraph in the SELECT clause. Thisinturn would mean that suchaqueryreturnsagraph, and indeed, the paper mentions that a(sub)querycan return a graph.Unfortunately,thereis no trace of suchconstructsorthe graph type in the language specifications and the parser did notaccept such constructs.

82 4.4. GSQL

• Pathsasfirst-class citizens: By providingpathpattern macros, where users can specify and name complex path patterns, and allowing fordatacomparisons alongpaths PGQL is moreexpressive thanCypherregardingthe path queryingcapabilities.Although we can name paths, compare them to some extent andoutput them,they are not raised to first-class citizens as we cannot store paths on their own.

As we have mentionedinthe introductionand seen in the examples, we candescribe PGQL as aquerylanguage that follows theSQL syntax on non-graph related parts and Cyphersyntax on most graph related parts of aquery. With the addition of “Oracle Spatialand Graph” (thatincludesPGQL as graph query language) in version12ofthe Oracle Database, many SQLuserscan use thesefeaturesintheir Oracle databases. On the other hand, thelanguage lacks standardization and support by other vendors [TPAV19]. That can alsobeseenwhen searching for information about thequery language online, where we are mostly limitedtoOracle’s ownwebsiteand we have found no community aroundthis language on anyforumormailinglist.

4.4 GSQL

GSQL is the query language used in the TigerGraph database.Similar to PGQL, it is heavily inspired by SQL and extendsthe relationalquery language to graph databases by allowing graph-specific primitivesinthe FROM clause[RH19]. Queriesthatdonot use suchgraph-specific primitives become standardSQL queries [DXWL19]. To achieve this, the syntax of SQL is extendedmainlyinthe FROM clausewhereuserscan specify graph patterns andnavigational pathpatterns [DWX18]. Furthermore, GSQL usesthe type system from standardSQL and extends it with vertex, edge and graph types. These typesare needed as we have to declarethe schemausing alabeled property graph data model before we can either import data or start querying. Such aschema defines the vertex and edge types, theirpropertiesand the typesofthe propertiesand how entities are related to eachother [Tig]. Theschema is stored apart from the data graph and is enforced, meaningthatwecannot updateorinsert data thatwould violatethe schema. Therefore, theschema of agraph in GSQL hasthe same closed world property as aschema in SQL. GSQL relies on such astrict schemadefinition as it is especiallydesignedtosupport large graphscontaining billions or trillions of nodesand edges. This comes fromthe design principle of the TigerGraph database as a“native parallel graph database”[DXWL19] for “tomorrow’s big data and analytics”[HLP+19], where performanceonlarge graphs is important. To support thesepotentially huge graphs, thedatabaseallows agraph to be distributed overmultiplemachines. Queries on distributedgraphsare processed in parallel on themachinesbeforethe localvalues are aggregated. Thisapproachresembles the MapReduceprogrammingmodel,and GSQL is indeed designed in away to not only ease thetransitionfor SQLdevelopers butalso for NoSQLdevelopers with aMapReduce

83 4. QueryLanguageAnalysis

mentality[DXWL19]. Coming back to the schema: having this specified beforehand allowsthe databasetouse an optimal storage format which in turn results in lessspace needed to store the data comparedtoother schema-lessdatabaseslikeNeo4j [RH19]. Furthermore, theschema allowsfor faster access and is used to plan and optimizequery executions in order to achievehighperformance parallelism [HLP+19]. As mentioned, thedatabase is especiallydesigned for bigdataand analytics, and thereforefor OLAP workloads. To this end, the querylanguage provides not onlyOLTP functionalityasdoesSQL,but alsoiterativeand multi-pass algorithms for analytical workloads like PageRank or recommender systems. These algorithms canbeimplemented using imperativecontrol flowprimitives like IF/ELSE, FOREACH and WHILE loops, and queriesbehave similar to functions as they can invoke other queriesorrecursively themselves.Together with thesupport of user-defined functions, GSQL is aTuring- complete language and thereforemore expressive thanlanguages like Cypher that are notTuring-complete[WD18, DXWL19]. Despite the focus of TigerGraph on OLAP workloads, we include GSQL in our comparison as it alsoprovidesOLTPfunctionality, transactionalgraphupdates and shows howaquerylanguage that uses agraph schema differs from schema-lesslanguages. General syntax andhandlingofqueries The query in Algorithm 4.59 shows the select-from-where structure and the graph pattern in the FROM clause. Furthermore, we can see that we have to createaquery using CREATE QUERY before we can execute it.Once aquery is created, we can either INTERPRET it which results in aline-by-line translation, or INSTALL it,whereitwillbeoptimized and rolled out to the distributed machines. Furthermore, each installedquery hasits own RESTendpointthatcan be usedbyapplications to access thedatabase.Inour case, we createaquerywith the name ex1 thattakes two STRING parameters,isspecified forthe graphwith the name ldbc_snb andusesthe V2 syntax.GSQL’s path pattern matchingfunctionalitywas extended with theintroductionofTigerGraph v2.4:where earlier versions underthe V1 syntax,thatisstill the defaulttoday,allowfor patterns onlyoverasingle edgeinone SELECT,version V2 addssupport for multi-hop paths and repetitions. Furthermore, the syntax used to depict apath or an edge differsbetween the twoversionsand we will comparethemlater on. Apartfrom the syntax difference, we can seethataGSQL query functions likeacontainer forstatements. In this example, the query contains only asingle select-from-where (sub)query but as we will see in thenextalgorithm, aGSQL query cancontainmultiple SQL SELECT statements. Therefore, aGSQL query resemblesastoredprocedure that can containmultiple SELECT statementsaswell as control-flow primitives likebranches and loops [RH19]. When speaking about aSQL (sub)query thatiscontained in aGSQL query,wewill reference thequerytogether with thelinesofthatquery,asin: “the (sub)query in lines3-5”. Coming backtoour example andthatsubquery in lines3-5, we note thatthe pattern syntax in GSQL differs from theones usedbyCypher and PGQL in twomajor areas. First, parentheses are not used to depict anode, as nodesare in most cases not delimited

84 4.4. GSQL

at all.Instead, parentheses are used as delimiters on edges. And second,the placesofthe type and variable nameonboth, verticesand edges, are swapped.WhereasCypher and PGQL use a : syntax,GSQL reverses it to : as in Person:p.

1 CREATE QUERY ex1(STRING fN, STRING lN) FOR GRAPH ldbc_snb syntax v2 { 2 3 friends = SELECT friend 4 FROM Person:p -(KNOWS:k)-Person:friend 5 WHERE p.firstName == fN AND p.lastName == lN; 6 7 PRINT friends [friends.firstName, friends.lastName]; 8 } Listing 4.59:Asimple GSQL query thatselects thefriendsofaPerson with agiven name.

Accumulators In order to achieveefficientparallel computation, GSQL introduces the concept of accumulators and thecorresponding ACCUM clause. Using this clause, we canspecify computefunctions on nodesand edges that can be executedinparallelinorderto aggregate valuesintoaccumulator variables. These variablescan either be global or attachedtoavertex (local).They are typed, hold data and can be written to in parallel. In other words, an accumulatorisamutable mutex variable thatisshared among all threads that participateinthe executionofaquery. The data written to an accumulator is then combined using the specifiedcompute function. GSQL providesmanysuch functionsout of thebox and users canalso provide their ownfunction. Algorithm 4.60 shows the usageand difference of local and globalaccumulators as well as simple variableslike localVariable. SumAccum is aglobalaccumulator (depicted by @@) and thereforevisible to all threads.Asthe namesuggests, valueswritten to this variable willbeaggregated by summingthem.Apart from SumAccum,GSQL providesother accumulatortypes on numbers like MaxAccum, AvgAccum,logical accu- mulators like OrAccum, AndAccum and onestocreate collections like SetAccum and MapAccum.The second accumulatorinour example is alocal one (a single@), meaning thateachvertex selected by the query willhaveits own localRelationsCount.As we can see in line 7, we have to specify the vertex when writing to alocal accumulator. In contrast,globalaccumulators can be directly accessed as in line 8. After theevaluation of lines6-9, theglobal accumulator holds theglobal number of WORK_AT relationships and each Company vertex holds the localnumberofincoming WORK_AT edges. Note that, intuitively,wedonot increasethe counts directly by 1inlines7and 8but we add thenumber“1” to theaccumulator variables and thesenumbers are then aggregatedby summingthem.Therefore, we write to allaccumulator variables, regardlessofthem using sum, set, map or acustomfunction, using the“+=”syntax. Coming back to our query,eachprocess has itsown copyoflocalVariable that is notshared. Therefore, eachofthem will increment the localcopybyone and thequery willprint1in line 11.

85 4. QueryLanguageAnalysis

Apartfrom the accumulators, the query shows thatwecan save areference to entities as we do with the comps variable in line 6. As we select all companies c in this subquery, comps contains areference to all those vertices. We can use thesereferencestodirectly access theseentities forexample to checkproperties, printvalues, but also in other FROM clauses as in line 15. Unlikeinthe query in lines6-9 where we use apatternwith the Company and Person vertex typesand the WORK_AT edge type24 in the FROM clause, we have no reference to suchpre-defined typesinthe query in lines14-16. The latterquery startswith all entities in the comps variable,orders them by theirlocal accumulatorvalue and savesareference to thetop 10 suchvertices again to comps.Therefore, the variable comps contains the10companies with the most incoming WORK_AT relationshipsafter thatsecond query and we printthose in line 17.

1 CREATE QUERY ex2() FOR GRAPH ldbc_snb syntax v2 { 2 INT localVariable; 3 SumAccum@@globalRelationshipCount; 4 SumAccum@localRelationshipCount; 5 6 comps = SELECT c FROM Company:c-(

We ranall queriesinthis chapter on aTigerGraph DeveloperEdition 3.0toensure their correctness. The implementationsofthe SNBqueriesaswell as theschema were taken fromthe TigerGraph GitHub repository25.Weadaptedsome of them suchthatthey work on our schema or use syntax V2, and again:afull implementation of all queries given herecan be found on our GitHub repository.

4.4.1 Structure Independent To showthe differencebetween the two syntax versions in GSQL,weprovide an imple- mentation of Query 1inboth versions in Algorithm 4.61. When usingthe defaultsyntax (V1), we cannot start directly with all verticesofagiventype. To achievethis,wehave to explicitly assign those verticestoavariable(line 3, vPersons)and use this variable

24Thatare allpre-defined typesinour datamodel. 25https://github.com/tigergraph/ecosys/tree/ldbc/ldbc_benchmark/tigergraph

86 4.4. GSQL

in the FROM clause. Thisvariablecan be omitted in aqueryusing syntax V2 and we can directly start at all Persons(line 8).

1 CREATE QUERY query1(STRING firstName, STRING lastName) FOR GRAPH ldbc_snb [syntax v2] { 2 //using syntax v1: 3 vPersons={Person.*}; //initialized with all verticesoftype ’Person’ 4 res = SELECT p FROM vPersons:p 5 WHERE p.firstName == firstName AND p.lastName == lastName; 6 7 //using syntax v2, we do not need vPersons: 8 res = SELECT p FROM Person:p //can directly use vertex typePerson in V2 9 WHERE p.firstName == firstName AND p.lastName == lastName; 10 11 PRINT res; //regardless of syntax, print found vertex 12 } Listing 4.61: Query 1asaGSQL query.Note that we givethis queryonce using syntax V1 (lines 3-5) and once using syntax V2 (lines 8-9).

Algorithm 4.62 shows the use of control flowprimitives and howthey canbeintegratedinto aquery. In this query,wedonot know whether thegiven messageId represents theIDof aComment or a Post.Tomakethis distinction, we use the to_vertex_set(ID(s), type) function that fetchesall verticeswith the given ID(s) of thegiven type.The global accumulator seed holds the single ID and we test in line8whether this ID represents a Comment.Inthis case, we output thevaluesofthatvertex. Otherwise,wefetch the Post with thegiven ID, use anotherselect-from-accumquery to seteitherthe content or the imageFile as messageContent and output this.

1 CREATE QUERY is4(STRING messageId) FOR GRAPH ldbc_snb{ 2 SetAccum@@seed; 3 SumAccum@messageContent; 4 5 @@seed += messageId; 6 vComments =to_vertex_set(@@seed, "Comment"); //get set from id(s) 7 8 IF vComments.size() >0THEN //ID belongs to aComment 9 PRINT vComments[vComments.creationDate AS messageCreationDate, 10 vComments.content AS messageContent]; 11 ELSE //ID belongs to aPost 12 vPost =to_vertex_set(@@seed, "Post"); 13 vPost = SELECT v FROM vPost:v 14 ACCUM CASE WHEN v.content != "" THEN v.@messageContent += v.content 15 ELSE v.@messageContent += v.imageFile 16 END; 17 PRINT vPost[vPost.creationDate, vPost.@messageContent]; 18 END; 19 } Listing 4.62: Query 2(IS4) as aGSQL query.

87 4. QueryLanguageAnalysis

UnlikeinCypher and PGQL, GSQL’sdatamodel supports collections. We can therefore store thelanguages spoken by aperson in acollection and calculate theaverage size of those collections via the query in Algorithm 4.63. Apartfrom the count function usedhere, GSQL providesavarietyofmathematical, boolean, bit, string, date(time) and collection-related operatorsand functions. 1 AvgAccum @@avgLangs; 2 persons=SELECT p FROM Person:p 3 ACCUM @@avgLangs += count(p.speaks); Listing 4.63: Parts of Query 3asaGSQL query.

4.4.2 Pattern Matching As brieflymentionedbefore, GSQL queriesusing syntax V1 can onlytraverse asingle edge (1-hop queries) whereas in syntax V2 we can definemoresophisticated, multi-hop paths andrepetitions. Furthermore, we have to be carefulwhichversion we use as the syntax used to depict edges differsbetween them. Algorithm 4.64 depictssome of the differences. Edges can only be eitherundirectedorright directed using “>”outside the () brackets in V1. In V2 however, edgedirections are givennexttothe typesinsidethe brackets and they can be either undirected, right(“>”) or left (“<”) directed 26.Having the directions insidethe brackets allowsfor patterns in multiple directions as in line 3. Furthermore, syntax V2 allowsustomatch edgesofany type as in line 5. 1 Person:p -(LIKES:e)-> Message:m //syntax V1 2 //syntax V2: 3 Person:p -(KNOWS:e)-Person:m //undirected edge 4 Person:p -((LIKES>|:e)- Message:m //any-type, right directed edge Listing 4.64: Examplesofpathpatterns in syntax V1 andV2.

We omitthe implementationsofQueries4and 5asthesecontainnonew functionality. The implementation of Query 6inAlgorithm4.65 howevershows multiple accumulators and acustomtypeand we will go overthis in more detail. GSQL supportscustomtuple types, that are containersholdingafixed sequenceofbase typesasinline 2. These typesare often used to aggregate related values. We use our custom reply type in a HeapAccum where we order theentries on thelast name of thereply author(replyLN). The queryisstructured as follows: thevaluesofthe local knows accumulators are set to True forall persons thatknowthe original author in lines7-9. In lines11-17, we first match all1-hop comments and theirauthors and aggregate theirinformationin an ACCUM into localaccumulators (lines13-16).The CASE statement in line 16 tests whether thereply author knowsthe original author and setsthe knows value of the

26Note howeverthat an edge type in GSQL is either declareddirected or undirected and we can only matchwiththe corresponding pattern: if the edge type E1 is declaredundirected, we have to matchit via -(E1:e)-,ondirected edges we have to use adirected pattern -(E1>:e)- or -(.

88 4.4. GSQL

Comment correspondingly.Afterthesevaluesare aggregated, we use POST-ACCUM in line 18where we combine thevaluesinareply and add this to the global replyTop collection.

1 CREATE QUERY is7(STRING messageId) FOR GRAPH ldbc_snb syntax v2 { 2 TYPEDEF TUPLEreply; //custom type "reply" 3 OrAccum @knows; 4 SumAccum@rAFN, @rALN; 5 HeapAccum(100, replyLN ASC)@@replyTop; //aggregate results, ordered 6 //[...] convertmessageId parameter to vertex vMessage 7 accFriend = SELECT s 8 FROM vMessage:s -(HAS_CREATOR>:e1)- Person:t1-(KNOWS:e2)- Person:t2 9 ACCUM t2.@knows += True; //set knows=True on all Persons that know t1 10 11 accReply=SELECT s 12 FROM vMessage:s -(:e2)- Person:t2 13 ACCUM 14 c.@rAFN=t2.firstName, 15 c.@rALN=t2.lastName, 16 CASE WHEN t2.@knows THEN c.@knows += True END //set knows-value 17 POST-ACCUM @@replyTop+=reply(c.content, c.@rAFN, c.@rALN,c.@knows); 18 19 PRINT @@replyTop; 20 } Listing 4.65: Parts of Query 6(IS7) as aGSQL query.

On aside note, TigerGraph added beta support for conjunctivepatterns in version3.0 of the database.Theseconjunctivepatternshavetobecorrelated, meaningthatatleast one variable hastobeshared among anytwo suchpatterns. Nonetheless, this allows complicated patterns to be decomposed into smallerones and sophisticated patterns that cannot be given in asinglepattern. Evaluation Semantics When matching graph patterns, GSQL uses ahomomorphism-based bag semantics as default[DXWL19]. Therefore, multiple edges as well as vertices can be matched to the same variables(aliases).However, we have to keep in mind that aqueryinGSQL works differenttoqueriesinmanyother querylanguages as the SELECT clausecan onlytakea single alias/variableasargument.Asbrieflymentionedabove,wecan store references to avertex set in avariable, which is what we didinevery query until now. In the query in Algorithm 4.66 for example, we store areference to all selected Personsfrom the alias u in the variable per.However,wedonot even need this variable lateronas informationisaggregated in theaccumulators. Whenevaluating thepattern in line 4, GSQL willproduceamatchtablewhereeachrow representsone distinct path, including oneswhere e=f and p=u [Tig]. We cannot restrict thesematches but we can specify a listofaliases in the PER clauseinorder to group the entries in the matchtableonthe valuesofthose aliases.Inour example,welist all aliases in that clause, which results

89 4. QueryLanguageAnalysis

in one invocation of the ACCUM clausefor everydistinct (p,e,m,f,u) tuple,i.e. no grouping. Nonetheless, we can use PER to reduce invocations of the ACCUM clausewhich can change theperceived semantics as we often aggregate valuesinthe latterclause. However, we cannot switchfor example to an isomorphism-based semantics other than by specifying this requirementmanuallyinthe WHERE clause. 1 TYPEDEFTUPLE info; 2 vPerson={person}; //start person 3 per = SELECT u 4 FROM vPerson:p -(LIKES>:e)- (Post|Comment):m -( Listing 4.66: Parts of aGSQL query in syntax V2 that aggregates pathinformation.

4.4.3 Path As mentionedbefore, the data model usedbyGSQL supports both, undirected and directed edges. Furthermore, each directed edge in thedatamodel can have an inverse edge attachedtothem thatisautomatically synced with theoriginalone, i.e. it will be created, updated and deletedautomatically.Sinceall those edgetypes can appear in asingle graph,the authorsofGSQL extended2RPQs and denotethis extension as Direction-Aware Regular Path Expression(DARPE) 27.Algorithm4.67 contains some examples of thesyntax of these DARPEs.Aswecan see in thepattern in line 1, repetitions of patterns are specified via “*”and we can providelowerand upper bounds similar to other graph query languages. On the other hand, thereisno“+”for repetitions of at least one. However, this functionalitycan be achieved using alower bound of 1 as in line 2. Line3contains amore sophisticated pattern and shows the dotoperator thatconcatenates two edge patterns. Thissyntax simply removesthe nodes between the edges and replacesthem with a“.”. Note howeverthatall of this is onlysupported in syntax V2 as aquery in syntax V1 can only ever traverse asingle edge. [DXWL19]

1 Person:p -((LIKES>|.(IS_PART_OF>)*)- _:c //dot operator Listing 4.67: ExamplesofDARPEs in GSQL.

Algorithm 4.68 contains asmallpart of an implementation of Query 7. The fullimple- mentation is ratherlong as GSQL does notprovideafunctiontoget theshortestpath between twovertices. However, there is an example of such an implementationonthe TigerGraph website28 and our implementation uses asimilar approach. Generally,our 27Note thatRPQsare also denoted as Regular Path Expressions(RPEs).Tothis end, we can also thinkofDARPEs as DARPQs. 28https://docs.tigergraph.com/v/2.6/dev/gsql-examples/ classic-graph-algorithms

90 4.4. GSQL

query usestwo Breadth-First Searches(BFS) startingatthe two vertices (only one shown here), and stops as soon as anodethatwas alreadyvisited from theother BFSorthe target node is found Theexpansionstep in line 4accesses thedata graph and therestof the code in this excerpt works solely on local and global accumulators. Whenwelook at thecodebelow,itlooks more likeitwas written in ageneral programming language (WHILE loop, IF/ELSE statements, variable accesses) and not in agraph query language. However, this is exactly where we can see theexpressivenessofthe Turing-complete GSQL as we can essentiallywrite anyprogram in that language.

1 //[...] query setup 2 WHILE NOT @@found DO 3 @@next =0; 4 S1 = SELECT t FROM S1:s -(KNOWS)- Person:t WHERE t.@dist1 <0 5 ACCUM 6 IF t.@dist2 >-1THEN 7 @@found += True,@@dist12 += s.@dist1 +t.@dist2+1 8 ELSE 9 @@next += 1, t.@dist1 =s.@dist1 +1 10 END; 11 IF @@found OR @@next==0THEN BREAK; END; 12 //[...] anotherBFS starting from the other person Listing 4.68: Parts of Query 7(IC13) as aGSQL query.

We omitthe code for Query8as this query is again quiteextensiveand does not contain especiallyinteresting concepts. Algorithm 4.69 contains an excerpt of our implementation of Query 9. We will not go into much detail and include it mainly for thepathpattern in line 7. Apart from that, this query shows howwecan linearlycombine multiple queries. The result of theupper query is stored in thevertex set vMessage and used as starting pointinthe lowerquery (line 7).

1 //[...] 2 vPerson ={personId}; 3 vMessage=SELECT t 4 FROM vPerson:s -(*)- Post:t1 -(HAS_CREATOR>)-Person:t2 8 ACCUM //[...] 9 //[...] Listing 4.69: Parts of Query 9(IS2) as aGSQL query in syntax V2.

Evaluation Semantics To avoid intractable performance when evaluatingpathqueries, GSQL usesthe all-shortest pathsemantics. Thismakes checking for the“existenceofashortest path that satisfies a DARPE,and even counting allsuchshortestpaths [...]tractable”[DXWL19]. Regarding the output of aquery, we have already mentioned thatGSQL behavesdifferentthan many other querylanguages as we do nothave to specify the outputinthe SELECT clause

91 4. QueryLanguageAnalysis

but more often use accumulators. Nonetheless, we can PRINT and even RETURN single values, properties, but alsoentities and collections in accumulators. Thefunctionality thatislacking compared to theother query languages thatweanalyzed is the possibility to output or to use paths in thequery apart from thepathpatterns in the FROM clause.

4.4.4Data Manipulation GSQL is notonly aDQL but alsoaDML,and both sub-languages are inspired by SQL [DXWL19]. Thisallowsustoinsert, update and delete elementsinagraph,be it avertex,anedgeoraproperty. However, we can alsosee that thelanguage is still under developmentassome approacheslikethe patternmatching syntax are not yetfully supportedinDML queries. We refer the interested reader to the TigerGraph start guide about datamodification29 thatlists theserestrictions and howspecific DMLquerieshave to be structured. Nonetheless, we will go overthe insertion and deletionconstructs to showthe general structure. Algorithm 4.70 contains two INSERT INTO statementsand we can see theresemblance to the counterpart in SQL. The first statement creates avertex of type Person and sets all propertiesofthat type.Note that typeslike Person are declaredtypes and we have to set values forall declared properties. Edges canbeinsertedinACCUM or POST-ACCUM clauses as in lines2-4. In this example, we insert an edge of type STUDY_AT between the newly created personand the vertex u.Note that we ranintoaproblemwhen implementing the query as we did notmanage to insertavariable number of edges and theirpropertiesinasinglequery.Ifwetakethe STUDY_AT edges for example, we would liketoinsert the classYear property(the emptyattribute "_"inline 4) for eachedge. However, if we structure this insertion as givenbelowand the propertyvalues are given as BAG classYear,wedonot know which of the entries in the bagbelongsto the edgethatiscurrently inserted. Though one couldthink thatwecan passthem in abag of tuplesand iterate overthesetuples, or even better in a map, GSQL queriescan onlytakebase typesand sets/bags of base typesasarguments. Base typesinGSQL are integer as well as floatingpoint numbers, booleans andstrings, DATETIME (weuse now() to get thecurrentdatetime) and also VERTEX and EDGE. We have morepossibilities on local variablesand outputtypes as GSQL alsosupports MAP, JSONOBJECT and JSONARRAY in these cases.Nonetheless, we did not manage to implementthe insertion of avariablenumberofedges where eachedgehas theirown propertyvalue.

1 INSERT INTO Person VALUES (personId, firstName, lastName, gender,birthday, now(), locationIp, browserUsed, speaks, emails); 2 res = SELECT u FROM University:u WHERE u IN unis 3 ACCUM 4 INSERTINTO STUDY_AT VALUES(personId, u, _); //classYear empty: "_" Listing 4.70: Parts of Query 10 (INS1) as aGSQLquery.

29https://docs.tigergraph.com/v/3.0/start/gsql-102/adv/dml

92 4.4. GSQL

Similar to Gremlin and PGQL, theremovalofavertex triggersthe removal of all incident edges. Therefore, we can remove acomment,the incidentedges and replying comments via the construct in Algorithm 4.71.

1 R=SELECT c FROM vComment:c -(

Existing propertyvalues can be updatedbysetting anew value as in SET e.classYear="2012".AsGSQL usesafixedschema, we cannot howeveradd new or remove existingpropertiesorchange the labels of either verticesoredges in aDML query.

4.4.5Data Definition As mentionedinthe introductionofthis language, we have to declare theschema of a graph before we caninteract with it.Using the CREATE statement of GSQL’s DDL,we can createvertexand edgecontainers, thelatterfor directed as well as undirectededges. These containersdefine thetypeaswell as theattributes/properties and theirtypes. A graph is anamed collection of thesecontainersand containerscan be shared by multiple graphs. In other words, we can specifywhichcontainertypes areavailableand contained in agraph.This allowsfor multiple viewsofthe same dataaswecan for example create graphs gA and gB thateachcontainonly some type containersand thereforeinstances of onlythose containers. Apartfrom multiple graphs thatuse thesame containersand thereforeoffer differentviewsofthe same data, we can also create multiple graphson completelyunrelated containers and thereforeondifferentdata. Accesspermissions forusersonspecific graphstogether with the“FOR GRAPH xyz” statement that canbegiven on eachqueryasinAlgorithm 4.65 allow us to restrict or permit accesstoqueriesand thereforethe underlying datatospecific users[WD18]. In contrast to all theotherquery languagesthatwelookedat, the data model does not include adedicatedlabel.Althoughthe paperonthe propertygraph type system in TigerGraph [WD18]mentions alabel type,wecouldnot find anyreference to this in the currentversion of GSQL. However, we can use the names of thetypecontainers similarly to labels in other querylanguages, Comment:r in Algorithm 4.71 for examplerestricts the matchtoinstancesofthe Comment type container and resembles the functionality of labels. Algorithm 4.72 shows an implementation of Query 13. In this example, we change the existingschema using a SCHEMA_CHANGE jobthat creates the vertex type Street and extendsthe definitionofthe IS_LOCATED_IN edge type.Asthe original definition of thatedgetypealready includes FROM(Person) and TO(City),all we have to add is the Street type as possiblesource and target.Notethatrunning this jobdoesnot change anyexisting data and we have to adaptthe existing relationships to matchthe

93 4. QueryLanguageAnalysis

desired schemainaDML query.Apart fromthis version, we couldalso createanew graph thatcontains all existingvertexand edgecontainer typesand thereforethe existing data.Wecan then createthe Street type and anew edge type forthe relationships to those verticesinthe new graph.Therefore, we can have twoversions using mostly the same data: thegraphthat already existed beforehand without streets and thenew graph thatalso contains the streets.

1 USE GRAPH ldbc_snb //specifythe graph to work on 2 CREATE SCHEMA_CHANGE JOB add_street FOR GRAPH ldbc_snb{ 3 CREATE VERTEX Street(PRIMARY_ID id UINT,name STRING,length INT); 4 ALTER EDGE IS_LOCATED_IN ADD FROM (Street) TO (Street) 5 } 6 RUN SCHEMA_CHANGE JOB add_street Listing 4.72: Query 13 as aGSQL query.

On an interestingside note, thedata modelsupportsnot onlyinverses of directed edges but alsomultiple parallel edges of thesame type between thesame vertices. In order to distinguish between theseedges, we can specify a discriminator attribute,i.e. an attribute of theedgetypethat makes up theprimary keyofthe edgetogether with thekeysofthe endpoints. [DXWL19]

4.4.6 FurtherFeatures

• Composability: Composability in GSQL queriesworks differenttomanyother query languages. On the one hand, aGSQL query cancontainmultiple subqueries(select-from-where structures) as we useditinmanyofour examples. We can compose thesesubqueries linearlybysavingthe output of one selection (as avertex set)and using this setin anothersubquery as startingpoint,aswedid forexample in line 7ofAlgorithm4.69. Apart from this, we can also use theimperative/procedural natureofGSQL where aquerycan either call another query or recursivelyitself.Inthatsense,aquery is nothingelse thanafunctionand we canevenreturnvaluesusing the RETURN statement.While we can use the output of suchaquery as input on another one, GSQL is notfully composable as thisdoesnot hold for all queries.

• Pathsasfirst-class citizens: Unlikeall other query languages that we looked at, we cannot even output paths let alonesavethem in avariable. Therefore, paths are not raised to first-class citizens.

Although GSQL tries to followthe syntax of SQLinmanyways, it doesn’t go without notice thatthe language focuses on achievinggood performance on large graphs. To achievethis, GSQL usesastrong type system and requires the declarationofaschema thatallows forbetterquery optimization and potentially space savings. Thelanguage incorporates procedural concepts, is Turing-completeand therefore allowsfor complex

94 4.5. G-CORE

graph analyticsworkloads. As theTigerGraph databaseisdeveloped as aparallel graph database, GSQL queriesare designed in away thatallows foracomputationinmassively parallel processing (MPP)fashion [DXWL19]. All of this resultsinasyntax thatisnot as conciseasingraph query languages like Cypherand PGQL [RH19]. Compared to other graph query languages, GSQL providessome interestingfunctions like outdegree() and neighbors() out of thebox.Althoughthese functions are often needed in OLAP workloads, they can alsobeusefulinOLTPscenarios.Apart fromthe DQL, DML and DDLsub-languages, GSQL supportsaData Loading Language (DLL)thatallows forthe creation and execution of loading jobs. Again,these jobs are morelikely of interest whenusing thedatabase in an OLAP scenario to load for example the current, onlinedataintoanoffline graph for analysis. However, this can just as well be usedinOLTPscenarios to load datainthe first place.

4.5 G-CORE

The lackofastandard query languageand auniform data model forinterconnected data hasled to thecreationofatask force by the LDBC. Thistask force analyzed existing languagesand identifiedthree main challengesofgraph querying.In2016, members of theLDBC decidedtodesign anew graph query languagebased on those challenges, namely: achievingfullcomposability, treating paths as first-class citizens and capturing the core of available languages where possible[AAB+18]. The lattershould ease the transitionfor usersfamiliar with existinglanguages.Since Cypher wasidentifiedas the languagethat influenced the syntax of many graph querylanguages during the last decade, thenew language G-CORE usesanASCII-art syntax for patterns that is very similar to Cypher’s. AnothersimilaritytoCypheristhat it is ahigh-level query language with aclose alignmenttoSQL. Nonetheless, G-CORE does notonly include featuresfrom SQL and Cypherbut alsofrom other languages like path patterns fromPGQL. Unlike most graphquery languages, G-COREisnot implemented or usedinany database but is designed as acore for future languages that synthesizes desirable features. [Lin18] To addresstoother twochallenges, queriesreturn agraph and thelanguage uses an extension of thepropertygraph data model called path property graph (PPG) thatallows paths to be stored on theirown. We introduced this model in Section3.6 and revisit it in Section4.5.4. Whenwelookatthe query in Algorithm 4.73, we can see theresemblancebetween G-COREand Cypher as both usemainlythe same syntax on patterns, andSQL by the use of similar clauses, operators and functions. However, queriesare notconceptually structured linearlyasinCypherbut start similarly to SQL with the conceptually last clausethat generates the result.Asfor theinnerworkings of G-CORE,the MATCH clause populates abinding table,whereeachbinding is represented in arow thatbinds variables of thequery to entities fromthe graph [AAB+18]. Note that we can optionallyspecify agraph name for eachpatterninthe MATCH clause, allowing formultiple patterns on differentgraphs in asingle query.The CONSTRUCT clausethen takes thesebindings

95 4. QueryLanguageAnalysis

and createsagraph fromthe values. In our example, thequery returns agraphthat is structured as specified in line 1, where all labelsand propertiesfrom p and f are preservedand included in the resulting graph.

1 CONSTRUCT (p) -[:friend]-> (f) //return agraph 2 MATCH (p:Person) -[:KNOWS]- (f:Person) ON graph_name //optional source 3 WHERE p.firstName=?AND p.lastName =? Listing 4.73: Asimple G-COREquery thatreturnsagraphcontaining the Person with the given name and all persons they KNOW,and connects them via outgoing friend edges.

As thelanguage is not implementedinany database,weuse aparser that is available on GitHub30 to ensure thatthe queriesgiven hereand providedinour GitHub repository are syntactically correct.

4.5.1 Structure Independent

1 CONSTRUCT (p) 2 MATCH (p:Person) 3 WHERE p.firstName=?AND p.lastName =? Listing 4.74: Query 1asaG-CORE query.

To achievefull composability,queriesinG-CORE alwaysreturnagraph [AAB+18]. The simple queryinAlgorithm4.74for examplereturnsagraphthatcontains onlyasingle vertex, but we can alsoreturn forexampleall Personsberemoving line 3. The G-CORE paperhowever states that apotential implementation or an extension of thelanguage can alsouse the SELECT clausethat allows us to projectfrom graphsorgraphelements to tables. As theparser that we usetocheckour queriessupportsthe select clause and we needittoimplementmanyofour queries, we nowassume that thelanguage supports this clause. Note howeverthatthe SELECT clauseviolates thecomposability principle as suchaquerydoesnot return agraph and theoutput cantherefore notbeusedas inputonanother query.Tothis end,the papermentions thepossibility to includea FROM clauseasanalternativetoMATCH to alsoallowfor tabular inputdata. However, the clause is onlymentionedasapossibleextensionand we couldnot findany other reference. We do not need this clauseinour examples but we have to keep in mindthat outputsofSELECT queries potentially violate the composability principle. The queryinAlgorithm 4.75 uses the SELECT clauseand thereforeproduces atabular output.Aswecan see in line 5, G-CORE supports the same compact notation as Cypher for simple restrictionsonproperties.

1 SELECT m.creationDate AS messageCreationDate, 2 CASE WHEN message.content IS NOT NULL 3 THEN message.content ELSE message.imageFile

30https://github.com/ldbc/ldbc_gcore_parser

96 4.5. G-CORE

4 END AS messageContent 5 MATCH (m:Message {id =?}) //compact notation Listing 4.75: Query 2(IS4) as aG-CORE query.

Regarding the implementation of Query 3wewereconstrainedbythe missing support for multi-valued properties or functions on them in some languages. While Cypher and Gremlin support suchproperties, we were onlyable to implement the query in acompact wayinCypherand notinaGremlin traversal. PGQL’s data model on theother hand does notsupport multi-valued properties at all,and in GSQL we were again ableto implement this in arelatively compact way in using an AvgAccum.The data model of G-COREsupportssuchpropertiesaswellbut it is not cleartoushow to directly count the property values.Toimplementthe query,weuse aslightly differentapproachgiven in Algorithm 4.76. Caused by theassignmentofthe property speaks to thevariable s in line2,G-CORE automatically unrolls thevalues of themulti-valueproperty speaks. As an example,assume that thepropertycontains thetwo values {“en”,“de”} on the personwith id=123.Ifthe query did notcontainthe assignment, asingle rowinthe binding table contains theinformationabout that person, and all this informationis contained in thesingle column c as this is the onlyvariableapartfrom s in thatquery. Using theassignmenthowever,the table willhavethe two columns p and s and we will have one row [p=, s=“de”] and anotherone [p=, s=“en”].Wethen GROUP the entries on theperson IDs whichallows us to countthe languages perperson,and finally computethe average overthose counts using AVG.

1 SELECT AVG( COUNT(s) )asavgLanguages 2 MATCH (p:Person {speaks=s}) //unroll speaks into variables 3GROUP BY p.id Listing 4.76: Query 3asaG-CORE query.

According to theparser,G-CORE providessome functions fromSQL like SUM and MIN/MAX on numbers, COUNT on elementsaswell as conditionals using CASE or inclusion checks via IN.Wedonot find anyspecifically graphrelevant or otherwisenew functions, most likely caused by the fact thatit“is intentionally designedasasmalllanguage that provides akernel of graph matching and construction functionality”[AAB+18].

4.5.2 PatternMatching G-COREsupportsbasicgraph patterns in asyntax thatisvery similartoCypher’s as well as complex onesusing the UNION, INTERSECT and MINUS set operations. As each query returnsagraph, at least that’s howitisconceived in the paper, i.e. without the SELECT operator, theseset operationsare applied to two graph queries that eachreturn agraph and therefore work on graphs.Algorithm 4.77 showsanexample of an implicit use of theUNIONoperatorand of another property thatfollows from theclosedness of thelanguage, namely that views are possible. In this example, we assumethatthe

97 4. QueryLanguageAnalysis

initial graphisnamed social_graph.The query creates agraph view thattakes this graph andunions it with the construct in line 3. As this construct contains apattern thatalready exists (the same patternismatched in line 4), thequery simplyadds the property nr_messages to some knows edges in theinitialgraph. Theunionoperation is implicit as all patterns andreferencesgiven in the CONSTRUCT clauseare merged together to createasingle graph thatisthen returned. Apossibleresultofthis query can be seen in Figure 3.10 where theview addsthe nr_messages propertieswhereas the restalready existed in theinitialgraph. The queryfurthermore shows the OPTIONAL clausethatcan be useddirectly after a MATCH patternor, as in this example, after the WHERE clause. We use this clause to optionallymatch apatternovermessages and theirreplies between thetwo Person nodes n and m.Having thesepatterns in our binding table allowsustocountthem and we set this value on the nr_messages propertyofthe corresponding KNOWS edge (line 3). We alsoobserve that graph patternscan not onlyoccur in the MATCH and OPTIONAL but alsointhe WHERE clause. Whereas patterns in the other clauses are used to populate the binding table,patterns in the WHERE clauseconstrainthe matches and thereforethe entries in that table.Notethat this is alsopossible in Cypher, where we did howevernot use it in anyofour queries.

1 GRAPH VIEW social_graph1 AS ( 2 CONSTRUCT social_graph, //shorthand formfor UNION operator 3 (n) -[e]->(m) SET e.nr_messages := COUNT(*) 4 MATCH (n:Person) -[e:KNOWS]-> (m:Person) 5 WHERE (n) -[:IS_LOCATED_IN]-> () <-[:IS_LOCATED_IN]- (m) 6 OPTIONAL (n) <-[c1]- (msg1:Post|Comment), 7 (msg1)-[:REPLY_OF]-(msg2:Post|Comment)-[c2]-> (m) 8 WHERE (c1:HAS_CREATOR) AND (c2:HAS_CREATOR) ) Listing 4.77: AG-CORE querythatcreates agraph view.This is aslightvariation of the query given in:G-CORE ACore for Future Graph Query Languages, page7,lines 40-48 [AAB+18].

We skip Queries 4and 5astheseclosely resemble theirimplementationsinCypherand will brieflygooverthe query in Algorithm 4.78. Thisqueryuses asubquery to test whether thetwo authorsknoweachother (line 2) and converts the result of thesubquery to abooleanvalue using EXISTS.Although the definitionofthe PPG data model allows graph entities to be labeledwith aset of labels, including the emptyset,wefollowthe approachofLDBC’s query examplesand assume that all entities are onlylabeled with a single label.Nonetheless, G-CORE allowsustomatchonmultiple labels as in line 4, similar to howthis is possible in GSQL.

1 SELECT c.id, c.content, c.creationDate AS commentCreationDate, 2 p.id AS replyAuthorId,p.firstName, p.lastName, 3 EXISTS ( CONSTRUCT (m) MATCH (m) -[:HAS_CREATOR]-> (op:Person) -[r: KNOWS]-(p) ) AS replyAuthorKnowsOriginalMessageAuthor 4 MATCH (m:Post|Comment {id =?}) <-[:REPLY_OF]-(c:Comment) -[:HAS_CREATOR]-> (p:Person)

98 4.5. G-CORE

5 ORDER BY commentCreationDate DESC,replyAuthorId ASC Listing 4.78: Simplifiedversion of Query 6(IS7) in G-CORE.

Evaluation Semantics When evaluating abasicgraphpattern, G-COREuses thehomomorphism-based seman- tics. Regarding the evaluation of complex graph patterns, we did not find astatementto whether abag or setsemantics is envisaged.

4.5.3 Path Paths in G-COREare demarcatedwith slashes -//- instead of -[]- and thelanguage supports RPQs and by the supportofnavigational graph patterns also CRPQs[AAB+18, TKL19]. Furthermore, we can specify namedpathpatterns that are very similartopath patternmacros in PGQL and we giveanexample later on in Algorithm 4.82. Using these path patterns, we can compare propertiesalong apath which results in thesupport of a subsetofREMs similar to PGQL. The queryinAlgorithm 4.79 uses paths of variable lengthaswellasthe SHORTEST function that returns asingle shortestpathbetween thegiven nodes thatsatisfies the regularexpression. We switchtoreachability semantics by surrounding theregular-or named-path patternwith .The existentialsubquery inside EXISTS in line 2testswhether the nodes are connected over knows edges. If suchapath was found,wederivethe lengthofitvia length().Whereas the query in lines1-6 returns asingle number for thepathlength, thesecond versioninlines 9-11 returnsagraph that contains exactly that path. 1 SELECT 2 CASE WHEN EXISTS (CONSTRUCT () MATCH (p1)-/SHORTEST path <:knows*>/->(p2)) 3 THEN (SELECT length(path) MATCH (p1)-/SHORTEST path<:knows*>/->(p2)) 4 ELSE -1 5 END AS length 6 MATCH (p1:Person {id =?}), (p2:Person {id =?}) 7 8 //another versionthat outputs agraph: 9 CONSTRUCT (p1) -/path/-> (p2) 10 MATCH (p1:Person) -/SHORTEST path <:knows*>/-> (p2:Person) 11 WHERE p1.id =?AND p2.id =? Listing 4.79: Twoversions of Query 7(IC13) in G-CORE.

Compared to theimplementationsofQuery 8inGremlin, PGQL and GSQL, where we includedonly apart of thequeriesbecause they were ratherlong, and even to Cypher’s where we hadtochain multiple query parts together, theimplementation in G-CORE looks compact. As we can see in line 4ofAlgorithm 4.80, reachabilitysemantics and the defaultsemantics can be mixed in asingle patternand we can givelower (x)and upper (y)bounds for thelengthofpaths via *{x..y}.Toconstruct theresult, we start by selecting properties in the first line.Inorder to aggregate thestudy- andwork-related

99 4. QueryLanguageAnalysis

information, we use twosubqueriesinlines2and3.Eachofthose queriescontains a smallpattern and multiple paths canbematched to those patterns when evaluating the query.Wefirst collect the desired information from asingle path in astring before these strings are aggregated using GROUP_CONCAT.

1 SELECT friend.id, friend.lastName,length(p) AS distance, friend.birthday, friend.creationDate, friend.gender, friend.browserUsed,friend.locationIP ,friend.email, friend.speaks,friendCity.name, 2 ( SELECTGROUP_CONCAT(uni.name + ’’+studyAt.year + ’’+uc.name) MATCH (friend) -[studyAt:studyAt]->(uni:University) -[:isLocatedIn]-> ( uc:City) ) AS unis, 3 ( SELECTGROUP_CONCAT(company.name + ’’+e.workFrom + ’’+cc.name) MATCH (friend) -[e:worksAt]->(company:Company) -[:isLocatedIn]->(cc: Country) ) AS work 4 MATCH (person:Person) -/p <:knows*{1..3}> /-> (friend:Person) -[:isLocatedIn ]-> (friendCity:City) 5 WHERE person.id =?AND friend.firstName =?AND person <> friend 6 ORDER BY distance, friend.lastName,friend.id Listing 4.80: Query 8(IC1) as aG-CORE query.

We omitthe implementation of Query 9asitcontains only asingle navigational pattern. Evaluation Semantics When evaluating path queries,G-CORE uses ashortest-path semantics. Apartfrom queryingfor asingle shortestpathusing SHORTEST,G-CORE supports “KSHORTEST” and even ALL to get all paths between two nodes.Toavoid an infinite number of paths under ALL,this is limited to queriesthat returnasubgraph and not thepathsthemselves. Regarding the output of aquery, G-CORE provides more functionalitythanany other query language thatwelookedataswecan output everythingfrom fixed arity outputs using SELECT to paths and even graphsusing CONSTRUCT.

4.5.4 Data Manipulation&DataDefinition G-COREdoesnot support anystatements to manipulate data.Thatincludesnot only the update operatorbut alsothe insertion andremoval. Therefore, we cannot createa graph inthe first place or manipulateanexisting one using G-CORE as presented in [AAB+18]. However, it makes sense that thelanguage includesonly DQL functionality as it is meantasaproof of concept for future query languages and not for thedirectuse in adatabase.Therefore, the resultsofall our examples that usethe CONSTRUCT clause are simply returnedand notsaved in adatabase.Furthermore, the SET operation that we us forexample in Algorithm 4.77 does not change theexisting data but the property in the result of that queryalone. As for thedefinition of aschema, G-CORE doesnot supportitfor the same reason. Regarding the data model,wehavealready mentioned thatthe language usesthe PPG model. Thisisformally defined in [AAB+18]and allowsfor multiple named graphs. Edges in agraph have to be directed but we can query them in anydirection. Themajor

100 4.5. G-CORE addition in this datamodel is that paths are raised to first-class citizens.Tothatend, eachgraph has“a(possibly empty) collection of paths;whereapath is aconcatenation of existing, adjacent, edges”[AAB+18]. Moreover, eachsuchpath has itsown identifier, can be labeled andevenaugmented withproperties as we will see in Algorithm 4.82.

4.5.5Further Features

• Composability: Developing alanguage that supports fullcomposability wasone of themaingoals whendesigning G-CORE and is achieved with the CONSTRUCT clausethat returns agraph. In order to alsoallowthe output of paths while retaining closednessand therefore composability, paths have to be includedinthe data model whichinturn leads to thePPG model.Asmentionedbefore, the SELECT clauseviolates that principle as suchaquerynolongerreturnsagraph but tabular data that cannot be usedasinputonanother query.Assuming that we do notuse the SELECT clause, we can naturally compose queriesasthe output of one query canbequeried by another one.Wecan even use SELECT as longasagraph is specifiedineach ON().Note that we omitted the ON clauseinmanyexamples as we assume a defaultgraph in thesecases.InAlgorithm 4.81, we create agraph thatconsists onlyofvertices insidethe ON clause(lines 3-7).This queryuses anothersubquery to calculate the number of likesonapost and savesthis value in the likedBy property. The outermost query then usesthe graph constructedinlines3-7 as input forthe MATCH clause, matches thesenodes, selects thetop 10 on their likedBy value and returns that value together with the title.This example showsthat it is possibletocompose queriesand we can includethe SELECT clauseinsome scenarios.Nonetheless, we have to keep in mind that this query ultimatelydoes notreturn agraph and thereforecannot be used as input on another one.

1 SELECT p.title AS title, p.likedBy as numLikes 2 MATCH (p) ON ( 3 CONSTRUCT (p) 4 SET p.likedBy := ( 5 SELECT COUNT(per) MATCH (per:Person) -[:likes]->(p) 6 ) 7 MATCH (p:Post) ON social_graph 8 ) 9 ORDER BY numLikes DESC 10 LIMIT 10 Listing 4.81: AG-CORE query that demonstrates composability.

• Pathsasfirst-class citizens: As mentionedearlier, paths are raised to first-class citizens in thePPG data model. To showthisfunctionality, we provide an implementation of thequery from Algorithm 4.50 in G-CORE and extend it suchthat the paths are stored and not only returned. This implementation is given in Algorithm 4.82. The querystarts with

101 4. QueryLanguageAnalysis

the definitionofthe path pattern eq_voltage_hop.Incontrast to PGQL, slashes demarcate apath in G-CORE.This canbeseeninline 3wherewematch the 3 shortest paths between two nodes that satisfythe path pattern eq_voltage_hop and have alength of at least 1. As an extension to theimplementation of thequery in PGQL,webind the COST to variable c suchthatwecan use itlater on inthe CONSTRUCT clause. The default costofapath is given by the hop-countwhichis whywecan use this to get thelengthofapath. Apartfrom the defaultcost,users canalso define their owncostfunction forpaths. By specifying suchacustomcost function,wecan also query for the (top k) cheapestpaths. The CONSTRUCT clauseinthis query nowcontains one of themost important functions of G-CORE as we construct agraph with paths.Weprefixthe path variable p with @ to denotethatthe path p is astoredpath, i.e. it is not only returned in the graph consisting of verticesand edges but alsoonits own. To get a feeling on howsucharesultcould look like,Figure3.10contains agraph as well as twostoredpaths31.Wecan see that thepaths are stored apart from the vertices and edges. In our example, we label them by eqVoltage and add the distance property. Assuming that we already have agraph thatcontains suchstoredpaths, we can queryand analyze them, for examplebyaccessingthe nodes on thepath via nodes(path).

1 PATH eq_voltage_hop =(n:Device) -> (m:Device) WHERE n.voltage =m. voltage 2 CONSTRUCT (x) -/@p:eqVoltage {distance := c}/-> (y) 3 MATCH (x) -/3 SHORTEST p<~eq_voltage_hop*{1..}> COST c/-> (y) 4 WHERE x.name = ’generator_x29’ Listing 4.82: Asimilar querytothe one in Algorithm 4.50. The queryreturnsagraph withpaths, i.e. agraph thatcontains not onlyvertices and edges but also paths.

It canbeseeninmanyaspectsthatG-CORE wasinfluencedbycontemporaryquery languages and builds upon them. This results in the general approachusingclauses similar to SQL, apattern syntax thatclosely resemblesCypher’s and namedpathpatterns as in PGQL. Apartfrom the integration of existingconcepts, the language shows that it is possibletoachievefull composability and howpaths, “the most popular feature of graphs”[AAB+18], can be raised to thesame levelasverticesand edges. All of this is achieved while remainingcomputationallyfeasible, meaning that eachquerycan be evaluated in polynomial time withrespect to the sizeofthe data. Althoughwedid not highlight andanalyze the evaluation complexityinthisthesis,itisinterestingthat the simple pathsemantics in Cyphercan be NP-complete [FGG+18b]while the shortestpath semantics in G-CORE allowsthe languagetoremaintractable,bothwith respect to the size of thedata.

31Note thatthe graph in thisfigurehas nothing to do with this query or even the domain of thisquery andthe reference should solely help with the illustrationofsucharesult.

102 4.6. Further Languages

4.6Further Languages

We did analyze 5contemporarygraphquery languages:Cypher, Gremlin, PGQLand GSQLaslanguages used in industry andimplemented indatabase systems, and G- CORE as acore for future languages designed by industry and academia. Apartfrom theselanguages,there exists awide rangeoffurtherones, both from academiaand industry.Aninterestingexample is BiQL [DNR09], acomposablequery language that does not distinguish between edges and nodes. The output of paths and aslightvariation of thedatamodel allowsoutputs to be stored and queriedagain, thereforeachieving composability.Whatall of those languages lackhowever is standardization like SQL forrelational databasesorSPARQLfor theRDF model.Thereare currently two major projects in standardization bodies thatstrivetoachievestandardization forgraph querying functionalityand we will nowintroduce both of them.

SQL/PGQ As briefly mentioned in Section 4.3, there is ongoing work insidethe ISO/IEC working group responsible forSQL (SC32 WG3) to add graph querying functionality to SQL.This extension willbeadded in the next versionofSQL,possiblySQL:2020 or SQL:2021 [TvR19]. As it is an extension to therelational querylanguage, a tabular propertygraph datamodel where verticesand edges are both stored in SQL tablesisenvisioned. Tablescan be mapped to namedpropertygraphs wheretable names become labels, columns properties and primary-foreign-key relationships edges (although all of that canbecustomized) using an extended DDL. If we look back on the DDLfunctionalityinSection 4.3.5, this closely resemblesPGQL’s “CREATE PROPERTY GRAPH”statement. Indeed, examplesofSQL/PGQDDL statementsin[TvR19]showaverysimilarsyntax forthe creationofproperty graphs. In contrast to PGQLhowever, the graphschema willbeenforced as data is alwaysstoredinarelational database and thepropertygraph providessolelya graph-view of that tabular data. We are thereforerestricted by the data-types and structure of theunderlyingrelational database. Graphs are queriedinaMATCH clauseusing an ASCII-art syntax similartoCypher’s thatallows forfixed andvariablelengthpatterns as well asshortestpathqueries. Thisapproach, where data is stored in relational tablesand canbequeriedusing either standardSQL or in agraph view,entails that datamanipulation through standardSQL DML statements are automatically visibleongraphs overthat data. Although thereisnofinalspecification of this extensionavailable yet, adding propertygraph functionalitytoSQL bringsthe worldofgraphs, patterns and paths to the largegroup of SQLdevelopers. Furthermore, as this functionality will be standardizedwith SQL,the graph related parts can be influentialfor the developmentofother graph query languages.

GQL The same working group thatisresponsiblefor SQLand working on SQL/PGQ is also developing anew, independent property graph querylanguage. As briefly mentionedinSection 2.4, this language is called GQL and is set to become the

103 4. QueryLanguageAnalysis

standardizedgraphquery language. Similar to SQL/PGQ,there is no specification forthis available as of now, but apaper on thescopeand features[GFL+18] proposes an initialdesign of thelanguage and itsfeatures.Generally,GQL “is acomposable declarative databaselanguage for querying and managing property graphs that is intendedtobeusable both as an independent language as wellasin conjunction with SQL or other languages”[GFL+18]. During query evaluation,a driving table similartothe binding table in Cypher and G-COREholdsdata. A GQL query either returnsagraph created from thesevalues, similar to CONSTRUCT in G-CORE, or returnsdatadirectly in tabular formasSQL’s SELECT.Incontrast to G-CORE, GQL is closed under graphsand tables, meaning that aquerycan take agraphoratableasinput and output either of them32.This allowsGQL to be usedinconjunction with SQLand potentially other languages,and entails that it is composable.

Anotherinteresting feature regardsthe supported evaluation semantics.Asfor patterns, the language strives to allow userstoswitchbetween homomorphism- based, no-repeated node and no-repeated edge semantics.Regarding the evaluation of paths, reachability, shortest,cheapest andtheirtop Kvariants,all paths and even simple (no-repeated-nodes),trail (no-repeated edge) and acyclic pathsemantics should be supported. If the languageindeed supports all of these semantics,itwill essentially includeall versions thatweidentifiedinChapter3.

Coming backtoFigure 4.1, we can see that thelanguage supports complex path patterns similartothe onesinPGQL (and thereforeG-CORE) as well as an optional schema that is influencedbyGSQL.The figurecontains not onlyGQL and existing graph query languages but alsoSQL/PGQ.This is especiallyinteresting as GQL and SQL/PGQare developedbythe same working group and indeed, commoncore features will be specified in one standardand referencedbythe other [HLP+19]. The dependencies between the two projects canbeseeninFigure 4.2 where Read GQL for example specifiesgraphrelated functionslikepattern matching thatare needed in both projects and GQL Proper includesfunctions that are onlysupported by GQL.

Although currentlyavailable documents on GQLpromiseafeature-richgraph query languagewith compatibilitytoSQL,itremains to be seenhow much of this functionalityisincludedinthe finalspecification. And,asbriefly mentioned before, the specification aloneisnot enough forittobeacceptedand implementedingraph databasesand onlytime will tell if,and howfast,this happens.Nonetheless, we are todayclosertoaworldwhereastandardizedlanguage is broadly supportedin manygraph databasesthanwewere some 5yearsago, before the first initiatives towards standardization.

32More specifically,possiblein- and outputs include not only graphs and tablesbut also singular values or nothing.

104 4.6. Further Languages

Figure 4.1: GQL Lineage.Source:GQL Standards-Existing Languages, Petra Selmer [GQL].

Figure 4.2: Dependencies between SQLand GQL Projects. Source: SQL and GQL, slide 17 [HLP+19].

105 4. QueryLanguageAnalysis

4.7LessonsLearned

Nowthatweanalyzed fivegraph query languages, we wrap this chapterupbycomparing them on theirsupport for certainfeatures,their evaluation semantics and other general characteristics.Wealso includeguidelines on which languages fit which application scenario and for whom it makes sense to choose acertainlanguage. Table 4.1 provides an overviewofthe support forcertain graph related features in theanalyzedlanguages. The upperhalf of thetablescompares functionalityrelatedtothe data model,likethe support formultiplegraphs, undirected edges or multi-valueproperties. Thelowerhalf then focuses on querying related features like the support of branches (e.g. via the CASE statement), alinear or nested composition of queriesand whetheruserscan provide and execute theirown function. An overview of the supported evaluation semantics for the evaluation of patternsand paths canbeseen in Table 4.2. We nowgooverthe analyzedlanguages one moretime,highlighttheir characteristics, advantagesand disadvantagesand giveexamples for when touse them. Eachsection furthermore includesanexplanationabout what is missingtowards afull implementation of thequeriesfromChapter3thatare onlypartially or not at all implemented. Table 4.3 contains an overview about thequery implementation status. Cypher Cypherisahigh-level declarativelanguage that focuses heavily on patterns, provides many graph relatedfunctions out of the boxand allowsfor extensions viauser-defined functions. Being “the original declarative querylanguage for the property graph data model”[Pla18], the ASCII-art syntax forpatterns pioneeredbyCypher influenced other contemporary graph query languages likePGQL andG-CORE. Cypherwas first introduced in Neo4j 1.4 in 2011 [Lin18]and is therefore one of theolder contemporary graph query languages,also compared to theother languages thatweanalyzed. Thedevelopmentand governance of the languagebythe openCypherProjectallows other companies and bodies to implement the language intheir systems.Together with thefact that Neo4j is one of themost popularpropertygraph databasesand Cypher therefore relatively well known in the community,this led to the inclusionofCypher in products likeSAP HANA Graph, Redis Graph andMemgraph[FGG+18b]. Furthermore, Cypher uses aconcise syntax and aligns relativelyclosely with SQL,which easesthe transition for usersfamiliar with the relational query language.All of this leads to alarge communityaround the language, whichinturnresults in amultitude of forums, tutorials and guides as well as user-defined functions thatare availableonthe internet. Regarding the queriesintroduced in Chapter 3wewere able to implementall except the last one that is DDLrelated as Cypherdoesnot support aschema. The support of amultitude of graph-specific functions like shortest path finding or matching optional patterns allowsfor (subjectively) understandable queriesthat arerelatively easy to write, even fornon-expertsinthe language.Causedbythe focus on patterns and the rich set of features (not onlyDQL- but alsoDML-related), Cypher allows forexpressive graph queries andcan be used in many graphrelatedscenarios.Inotherwords, Cypher

106 4.7. Lessons Learned . - . t i- g d e- n- n- 1) be de ar ible tha hol tu nr ctio ov 3.0. he ye ca oss ot not un es), ht an. et op ic, sp grammin sn Lf can us )c er pat nt rtic oe erP pro query el l(s en DM ve trai nk .D can ks in hed be ab ag of no ta . la (Ti atc put el we tly cons ve et s. tha ons. de in not but emar dg (s rn. bo on ge am irec rsi s, no but as te rsi nse lR ee ua some but . ka ye ve gn tex rd ut se ve pat ar pe na er t: els, er ngl y. used l assi Lo ias, db ty lang io si he sul age. av rem lab hed our new . be DS Neo4j: re ed eral dit e/al gu nt ty Can ange rn See uery Ad tu *No clar *The ch *In ali abl matc *In ✓ in *I can gen *In lan *In hq ap gr CORE REM )* n* * ✗ yzed 0-n ✗ ✗ ✓ ✓ ⊂ ✓ ✓✲ ✓ ✗ ✓ ( ✗ ✗ G- 0- nal ea l RPE SQL th de DA ✗✗ 1* ✓✓ 1* ✓✓ ✗ ✓✓ in mo rying LG ta ures que REM da -n eat ✗✓ ✗✓ 0-n ✓✗ ✗✓ ✗✗ ⊂ ✓✓ ✓✗ ✓✓ ✗✓ ✓✓ ✓✓ PGQ nf tai lin cer em RPQ * * ✓✓ 10 ✗✗ ⊃ ✓✓ ✗✗ ✗✗ Gr for ) ort ✓ ✓ ) ✓ ) r9 j: 0: )* Supp eo4 .10: ✓ phe RPQ | (v.1 (v ✗ Cy ✗ (N ✗✗ 0-n 11 ( ✓✓ ✓✓ ✗✗ ⊂ ✓✗ ✲✲ ✓✓ ✗✓ ✓✗ ✗ ✗ ✓✓ ✓✓ ✗ * 4.1: l s y ble ls ss es rn ns hs re on i els rt be hes ops ries be ti ata) te zens Ta cla la DDL lo DML pe ert nc rap la lab guage ttern atu at (d osi que citi ve pa eg pro ge Fe de bra lan tp functio ub mp ss prop rying ed le no th ve co ed edges ltipl ds I-ar cla pa /remo que te t- ar alue mu osab d) th defin cted add firs /remo nes ASCI line r- lti-v pa me comp add ths use mu (na undire pa

107 4. QueryLanguageAnalysis

homo- isomorphism based morphism n-r-edgen-r-noden-r-anything shortest reachability Cypher ♦✜ Gremlin ♦✜ (✜) PGQL ♦ (♦)(♦)(♦)(✜)(✜) GSQL ♦✜ G-CORE ♦✜✜

Table 4.2: Evaluation Semantics used by the analyzedquery languages.n-r-x stands for no-repeated-x.A♦denotes that this semantics is used by defaulttomatch patterns whereas ✜ denotes thedefaultsemantics whenmatching paths.For paths in PGQL: variable lengthpathpatterns can onlyappearinreachabilitysemantics (using -//-)or inside SHORTEST for the shortest pathsemantics.InG-CORE,shortestpathsemantics is used by default butpath expressions demarcatedwith -/<>/- are evaluated using reachability semantics.(♦)and (✜)denotesthatwecan change to this semantics,in manycases by addingadedicatedstatementlike simplePath() in Gremlin to the query.

fully implemented partiallyimplemented notdoable Cypher 1-12 13: DDL Gremlin 1-6, 8-12 7: persons notconnected ★ 13: DDL PGQL 1, 2, 4-9,11, 12 3: multi-value-property /list 10: variable number of edges 13: enforcedschema GSQL 1-9, 11 -13 10: variable number of edges G-CORE 1-9 10 -13: DML&DDL

Table 4.3: Implementation statusofthe queriesfrom Chapter 3. The textaftersome query numbers stateswhat is missing fora(full) implementation. ★ Our implementation works onlyifthe two person verticesare connected over knows edges. In other words: Gremlin lacksafunctionality to testwhether two vertices are connected.

is asolidchoicefor anyapplicationscenario as longasaschema or moreexpressive path patterns are not needed.Furthermore, it is agoodideatouse Cypherfor its close alignmenttoSQL and thelarge community when first startingwith agraph query language. Nonetheless, one has to keep in mind that Cypherfor exampledoesnot support full RPQs let alonemore expressive path patterns, currentlylacks support foraschema or other constraints and for more sophisticated features like composability.

Gremlin In contrast to Cypher, Gremlinismore imperativeinnature. It usesacompletely different approachtoall the other languages that we analyzedasitmerges with aprogramming language that implements thequeryingfunctionality. Thisallows queriestonot only

108 4.7. Lessons Learned use graph traversalfunctionalitybut alsoconstructs of theimplementing programming language likegeneral loops.Furthermore, userscan achieveashortersyntax tailored to theirdomainusing aDomainSpecific Language (DSL) and Gremlincan be easily extended inside theimplementing programming language. Forall of that,Gremlin is Turing-complete[WD18]and allowsfor asuperset of RPQsaswecan repeat whole traversals [AAB+18]. Nonetheless, being alow-level querylanguage with afocus on the imperativetraversal-style [Rod15]togetherwith the compact syntax leads to worse readabilitycompared to higher-level languages [HP13]. We would thereforenot advise developers thatare familiar with declarativeprogramming- or querylanguages like SQL to switchdirectly to Gremlin. Changing to, or choosingGremlin foraprojectmakes moresense for someone familiar with functional languages as ageneral traversalstep is nothingbut afunction andthe languageheavily reliesonfunction composition and nesting. Apartfrom the query language, the TinkerPop framework and GTM provide adistributed mode where queries canbeexecuted on distributed machines. Together with adedicated OLAP API, this resultsinaviablealternativefor bigdatascenarios.Therefore, Gremlin is areasonablechoicefor someone who is not onlyinterestedinOLTPworkloadsbut also in the analysis of largeramounts of interconnecteddata. We managedtoimplementmost of theanalyzed queriesexcept fortwo:welacka functionalitytotest whether apair of verticesisconnected for afull implementation of Query 7and the general support of aschema for Query 13. Other than that,Gremlin providesmanyinterestinggraphrelated functions out of thebox and usesarichdata modelthatallows notonly for multiple graphsbut also formeta-properties. PGQL PGQL is again ahigh-level declarativequerylanguage that wasinfluenced by Cypherand SQL [Pla18]. Thisresults in asyntax similartoCypher’s and aclose alignmenttoSQL forthe use of thesame functions and mostlythe same data types. Up untilthe release of the currentversion of PGQL in March 2020, thelanguage lackedimportantfunctionality likeaDML sub-language[Ora20]. With this versionhowever,PGQL providesasolid coreofgraphrelated functions. Compared to Cypherand Gremlin, the inclusionofpath patternmacros allowsnot onlyfor fullsupport of RPQs but even for thecomparison of properties on apath andtherefore asubsetofREMs[vR17]. Apartfrom path patterns, Cypherand Gremlin still provide more graph related functions outofthe box. Unlike Cypherthat is implementedinmultiple(commercial) products, PGQLisonlyavailable in some Oracle products where data is often storedinrelational databases. This comes with an advantage as well as adisadvantage.Onthe one hand, including PGQL in the Oracle Database allowsthe largegroupofexisting users to utilizegraph queryingon theiralready available data in arelational database.Onthe other hand, thelanguage is stillcomparativelynew and as PGQL is not utilized by anyother vendor, there is onlya rathersmallcommunity and we couldnot findmuchinformationapartfrom the official specification. It is arational choice to look intoPGQL forsomeone who is already using an Oracle

109 4. QueryLanguageAnalysis

product. Nonetheless, one has to be carefulwhen choosingaproductfor theiruse case, as PGQL in theOracle Database willresultinagraph view of relational datawhereas PGQL in “Oracle Big Data Spatialand Graph”can utilizePGX where data canbe stored in anative graph database.Wedid notgointodetailregardingthe performance and thus didnot test this butitmight become problematic whendataisstoredintables, especiallywhen matching path patterns thatlikely result in amultitude of relational joins. If we restrict ourselvestothe query languagealone, many arguments forCypher, like the alignmenttoSQL and the ASCII-artsyntaxthat resultsinconcise and (subjectively) understandable queries, apply to PGQLaswell. As for theanalyzed queries, we were abletofully implement 10 out of the 13. Lacking support formulti-valuepropertiesorlists hinderedustowards afull implementation of Query 3. Note that we are not abletostore acollection of values(the spoken languages) in asingle propertyinPGQL’s data model whereas this is possible in the other data models that we analyzed. We lackthe ability to pass tuplesoriterateovercollections insideaqueryfor aconcise implementation of Query 10. Note howeverthat the desired functionalitycan be achieved by splitting thequery into multiple ones.Regarding Query 13, we can use the “CREATE PROPERTY GRAPH”statementtocreate agraph with thedesiredstructure outoftabular data, but this schemaisonly used to create the graph and notenforcedlater on. GSQL Similar to PGQL,GSQL is ahigh-level query languageinspiredbySQL. It is usedinthe TigerGraph database and developedbythe homonymous company. The most interesting aspect of GSQL is the use of astrong type system and schema. Thistypesystem extends the one from SQLbygraphspecific typeslike vertex, edge and graph [DXWL19]. Graphs include certain vertex and edge typesthatspecify exactlywhatentities are included/allowedinthe graph andhow theycan be related.The schemaallows forbetter query optimization and storage savings, which is especially needed in TigerGraph as it is designed for“tomorrow’s big data and analytics”[HLP+19]and thereforefor huge graphs. Thisexplains the distributed approachwith nativesupport for parallel queryexecution and Accumulators. Sincethis resemblesthe MapReducemodel,GSQL is areasonable choice for someone with abackgroundonthis or asimilar NoSQLparadigm [DXWL19]. Having aschema is not the onlyargumentfor GSQL as thelanguageprovidesnot only declarativeconstructs butalso imperativeones.Asingle GSQL query canbecomposed of multiple select-from-where structures together with imperative constructs likeloops or branches.This resultsinGSQL being Turing-complete. As for pathqueryingcapabilities, the language supportsnot onlyRPQs but the more expressiveclass of DARPEsthat allowsfor directedaswell as undirected edges in asinglegraph. Although thelanguage provides OLTP functionality, it is targetedatOLAPworkloads. Nonetheless, GSQL is an expressivequerylanguage that can be usedfor OLTP workloads and is especiallyinteresting whendealingwith largeamountsofinterconnected data. On the other hand, theschema, theinclusionofimperativeconstructsand the parallel approachusing accumulators resultsinasyntax thatisnot as conciseasCypher’s [RH19].

110 4.7. Lessons Learned

The language furthermorelacks some built-in functionalityprovidedbyother languages, likeafunctiontoget theshortestpathbetween twovertices.Taken together, this results in queriesthat areoften morecomplex, not onlytodevelop but also to read and understand as they generally containmore code. Thisaugmented code is either needed to make up for nonexistentbuilt-in functions or for thedistributed nature. Although the languageisonly implemented in the TigerGraph database, there aremanyresourcesand tutorials on theirwebsite and an activecommunity in some forums. Regarding the analyzed queries, we managedtofully implement allofthem exceptfor Query 10. In contrast to PGQL, we are abletoinsert avariablenumberofedges by passing acollection of edge endpoints.However, we did notmanage to correctlyset a differentproperty for eachedgewhen adding avariablenumberofedges.Note again thatwecan achievethis functionalitybysplitting the queryintomultiple ones. G-CORE One goal whendesigningthe researchlanguage wastocapture the core of available languages. SinceG-CORE is ahigh-level declarativelanguage, it is not surprising that it usesanASCII-art syntax very similartoCypher’s and incorporates the more expressive path patterns fromPGQL. Thelanguage is designedasacore for future developmentsand shows howasophisticated graphquery language that integratesindustrialexperiences with theoretical researchcan look like [AAB+18]. As for theinclusionofnew functionality, the language isfully composableand raisespaths to first-class citizens by extending the propertygraph data model to alsoinclude paths.This results in apowerful query languagewhile retainingasyntax that closelyresembles Cypher’s. Although thismakes the language relatively easytouse andunderstand while extendingits expressiveness, it does notmakesense to recommendusing G-CORE as it is not implementedinany database system. We were abletoimplementall analyzed DQL queries in G-CORE but no DML or DDL query as thelanguage is solelyaquerylanguage without support for datamanipulation or aschema.

Forconvenience,wenoteagainthe versions of thelanguages that we analyzed. Regarding Cypher, we analyzed Version9ofthe language from the openCypherprojectaswell as some differencestothe versionofCypherasimplementedinthe Neo4j 4.1.0 database. With version 10 of thelanguage being currently under developmentand said to add interesting functionalitylikecomposability,wealsomentioned potential functionality andconceptsofthis upcoming version. RegardingGremlin, we analyzedaslightly older version of theTinkerPop framework andtherefore Gremlin (3.0.1-incubating)but mentionednotabledifferences to more recentones. ForPGQL, we analyzed the newest version1.3 and for GSQL we usedthe version that comes with TigerGraph 3.0, which is currently the latest stableversion of that database.

111

CHAPTER 5

Conclusion

Graph databasesare increasingly adoptedinindustry to handle, store and analyze graph-like data. Users can choose between amultitude of graph databases,but there is no standardized query language to access them and most systems use their owndata model. Although work on thestandardized graphquerylanguage GQLstartedin2019, it is set to takemultiple years untilitisfinished and standardized. As current graph query languages differintheir characteristics, expressivenessand ease of use, it is important to choose alanguage that fitsyour needs. We thereforeanalyzed fivecommon graph query languages regarding these factors. In this thesis, we analyzed common graph query languages and producedthe following results:

• We identified and analyzed common features that formthe conceptualcore of most graph query languages by inspectingqueriesfrom the Social Network Benchmark. Matching graph patterns is a, if notthe, majoroperation on graphs. These patterns can be extended with path expressions to formnavigational queries. We added three more features:structure independent,datamanipulation and datadefinition. The features are described andanalyzed regarding theirtheoreticalfoundations and theexpressiveness theyenable.

• In addition,weanalyzed four of themost prominent graph querylanguages from industry (Cypher,Gremlin,PGQL, GSQL)and one promising candidate for future languages fromaresearchproject(G-CORE) regardingthese features.Cypher offersamultitudeofgraph-relatedfunctions and is relativelyeasy to use.Although Gremlin is more expressive(Turing-complete), itsconcise syntax and imperative natureleadstoworse readabilitycompared to higher-level languages.PGQL offers expressive path patterns but does notprovide as manygraph-related functions as Cypher. GSQL is the onlylanguage in our comparison with atypesystemand a

113 5. Conclusion

schema, providesnot only declarativequerying functionalitybut alsoimperative constructs and is Turing-complete. We conclude with G-CORE,anexpressive language that combinesfeatures mostly from Cypher andPGQL,achieves full composabilityand elevates paths to first-class citizensinthe graph.

We limited our comparison mostly to theexpressiveness and ease of use of the analyzed languages.Itwouldtherefore be naturally of interest to alsoanalyze the complexityof the languages and compare theirperformance on certain workloads. Although thereare some papers that focusonthese matters,like[HP16]for theperformance of Cypherand Gremlin or [AAB+17]for an analysisofthe same languagesregardingtheir complexity, there is little on PGQL and GSQL. As an example, it would be interesting to compare Gremlin with GSQL as both offer anativeparallelmodeand imperativeconstructs. Similarly,itwouldbeinteresting to compare PGQL with Cypher (and G-CORE)asthey provide asimilar syntax but differ in the evaluation semantics and thedatarepresentation if we pickNeo4j and theOracle database for example.

114 ListofFigures

2.1 Asmallsocialgraph...... 5 2.2 Asmallsocialgraphinthe RDF data model...... 9 2.3 Asmallsocialgraphinthe edge-labeled data model...... 10 2.4 Asmallsocialgraph in the propertygraph data model...... 11 2.5 Evolution of graph query languages...... 12 2.6 An exampleofagraphicalquery in thequerylanguage G...... 12 2.7 The datamodel in UML notation of theSNB...... 19

3.1 Avariation of our smallsocialnetwork example including postsinthe property graph datamodel...... 23 3.2 An exampleofabasicgraphquery searching for relationships between a specific post and users in ahypothetical graphical query language. .... 24 3.3 Query 4(IS1) as agraphical querypattern...... 27 3.4 Query 6(IS7) as agraphical querypattern...... 28 3.5 Query 7(IC13) as agraphical querypattern...... 36 3.6 Query 8(IC1) as agraphical querypattern...... 37 3.7 Query 9(IS2) as agraphical querypattern...... 37 3.8 Query 10 (INS1, IU1 in v0.3.2) as agraphicalquery pattern...... 39 3.9 Query 11 (DEL7) as agraphical querypattern...... 40 3.10 Asmallsocialnetwork in the path propertygraph data model thatincludes storedpaths...... 43

4.1 GQLLineage...... 105 4.2 Dependencies between SQL and GQLProjects...... 105

115

List of Tables

4.1 Support forcertain featuresinthe analyzedgraphquery languages.... 107 4.2 Evaluation Semantics used by the analyzedquery languages...... 108 4.3 Implementation statusofthe queriesfrom Chapter 3...... 108

Listings

4.1 ACypherquery thatselects theneighborhood of theperson JohnDoe.46 4.2 Query 4.1 in amore compact notation...... 47 4.3 Asimple Cypher query thatselects allnodes...... 47 4.4 ACypherquery thatcontains alinear composition of queriesvia the MATCH clause...... 48 4.5 Asimple Cypher query showing an alternative to SQL’s GROUP BY.48 4.6 ACypherquery thatselects allfriends of JohnDoe that have at leastten outgoing relationships...... 49 4.7 Query1as aCypherquery...... 49 4.8 Query2(IS4) as aCypher query...... 50 4.9 Query3as aCypherquery...... 50 4.10 ACypherquery thatselects all entities thatare directly related to a continentthatstarts with ’A’...... 50 4.11 Query5(IS3) as aCypher query...... 51 4.12 Query6(IS7) as aCypher query...... 51 4.13 Acypherquery thatreturns false as thesame edge will not be matched to both edges in the pattern...... 52 4.14 Acypherquery thatreturns true as thesame edge will be matched to the edges in both MATCH clauses...... 52 4.15 ACypherquery thatdemonstrates named paths (paths that are assigned to variables)...... 53

117 4.16 Query 7asaCypherquery...... 53 4.17 Query 8asaCypher query...... 54 4.18 Simplified version of Query 10 (INS1) in Cypherthatinserts onlysome properties...... 55 4.19 Query 11 (DEL7) as aCypher query...... 56 4.20 Query 12 thatsets(updatesoradds) theproperty classYear.. ...56 4.21 ACypherqueryinNeo4j’s versionthatselects JohnDoe andcountsthe number of olderpersons...... 57 4.22 An alternativeversion of thequery fromAlgorithm 4.21 that avoids the subquery...... 57 4.23 An exampleonhow to connecttoagraph, createagraph traversalsource on thegraph and spawn atraversalinthe GremlinConsole. The query then selects the properties of node(s) with thename John Doe.. ...60 4.24 AGremlin query thatselects the names of JohnDoe’s friends. .... 60 4.25 AGremlin query thatselects propertiesofoutgoing knows edges of John Doe...... 60 4.26 Multiple Gremlin queriesthatassign and use avariable and demonstrate the use of lambdafunctions...... 61 4.27 AGremlin traversalstarting fromthe edges thatgenerates asummary of the edgelabels...... 61 4.28 Agremlin query thatoutputs informationabout aspecific personand countsthe number of outgoing edges...... 62 4.29 Query 1asaGremlin query in theGremlin-Java language variant...63 4.30 Query 2(IS4) as aGremlin query in Gremlin-Java.Wecan see howresults of aquerycan be accessed in Java...... 63 4.31 Twoversions of Query3in Gremlin...... 64 4.32 Grouping operations in Gremlin queries...... 65 4.33 Query 4(IS1) as aGremlin query...... 66 4.34 Query 5(IS3) as aGremlin query...... 66 4.35 Query 6(IS7) as aGremlinquery.Note howeverthatweare notable to generatethe boolean value depicting whether cr and replyAuthor knoweachotherinside the query...... 67 4.36 PartsofQuery 7(IC13) as aGremlin query...... 69 4.37 PartsofQuery 8(IC1) as aGremlin query...... 69 4.38 Partialimplementation of Query 9(IS2) in Gremlin.Notethatthis implementation cannot deal with thecase that arecent message is a post...... 70 4.39 Simplified versionofQuery 10 (INS1) in Gremlin that inserts onlysome properties and asingle edge...... 71 4.40 Query 11 (DEL7) as aGremlin query...... 71 4.41 Query 12 as aGremlin query...... 71 4.42 Example of the creation and accesstometa-propertiesinGremlin. .. 72

118 4.43 Example of aqueryusing multiple V() steps (as it is supportedinnewer versionofGremlin)and of aquerywith anested traversal...... 72 4.44 Example of twoequivalentqueries: first in Gremlinand then in ahypo- theticalsocial DSL...... 73 4.45 Asimple PGQL query that retrieves the first andlastname of the person withagiven id...... 74 4.46 APGQL query thatselects theids, labelsand the vertices themselves of 5 persons...... 75 4.47 Four PGQL queriesthat show the resemblance between PGQL and SQL. 76 4.48 Query 5(IS3) as aPGQL query...... 77 4.49 Query 6(IS7) as aPGQL query...... 77 4.50 APGQL query using apath pattern macrowith data comparison. The query selects all devices reachable from generator_x29 via apath where eachdevicehas thesame voltage.Source: GQL-StatusUpdate And Comparison to LDBC’s Graph QL proposals,slide 21 [vR17]...... 78 4.51 Query 7(IC13) as aPGQL query...... 79 4.52 APGQL query thatqueriesfor the3shortest paths between src and dst where eachedgealong the path hasaweight greater 10. Source: PGQL1.3 Specification [Ora20]...... 79 4.53 PartsofQuery8(IC1)asaPGQL query...... 79 4.54 Simplified version of Query9(IS2) as aPGQL query thatselects only some properties...... 79 4.55 Simplified versionofQuery 10 (INS1) in PGQL that inserts onlysome properties...... 80 4.56 Query11(DEL7) as aPGQL query...... 81 4.57 APGQL query thatqueriesfor two persons fromthe social_graph and aforumfromthe other_graph...... 81 4.58 Parts of Query 13 as aPGQL query.Note that this createsanew property graph as thereisnoway to adapt an existinggraph...... 82 4.59 Asimple GSQL querythat selectsthe friendsofaPerson with agiven name...... 85 4.60 AGSQL query thatoutputs the global number of WORK_AT relationships as well as thelocal number for thetop 10 companiesinthatregard.. 86 4.61 Query 1asaGSQL query.Note that we givethis queryonce using syntax V1 (lines 3-5) and once using syntax V2 (lines 8-9)...... 87 4.62 Query 2(IS4) as aGSQL query...... 87 4.63 Parts of Query 3asaGSQL query...... 88 4.64 Examples of pathpatterns in syntax V1 and V2...... 88 4.65 Parts of Query 6(IS7) as aGSQL query...... 89 4.66 Parts of aGSQL query in syntax V2 that aggregates pathinformation. 90 4.67 Examples of DARPEs in GSQL...... 90 4.68 Parts of Query 7(IC13) as aGSQL query...... 91 4.69 Parts of Query 9(IS2) as aGSQL query in syntaxV2...... 91

119 4.70 Parts of Query 10 (INS1) as aGSQL query...... 92 4.71 Parts of Query 11 (DEL7) as aGSQL query (syntax V2)...... 93 4.72 Query 13 as aGSQL query...... 94 4.73 Asimple G-COREquery thatreturnsagraph containing the Person with thegiven name andall persons they KNOW,and connects them via outgoing friend edges...... 96 4.74 Query 1asaG-CORE query...... 96 4.75 Query 2(IS4) as aG-COREquery...... 96 4.76 Query 3asaG-CORE query...... 97 4.77 AG-CORE querythatcreates agraph view.This is aslightvariation of the query given in:G-CORE ACore for Future Graph Query Languages, page7,lines 40-48 [AAB+18]...... 98 4.78 Simplifiedversion of Query 6(IS7) in G-CORE...... 98 4.79 Twoversions of Query7(IC13) in G-CORE...... 99 4.80 Query 8(IC1) as aG-CORE query...... 100 4.81 AG-CORE query that demonstrates composability...... 101 4.82 Asimilar querytothe one in Algorithm 4.50. The queryreturnsagraph with paths, i.e. agraph thatcontains not onlyvertices and edges but also paths...... 102

120 Acronyms

2RPQ Two-way Regular Path Query.30, 32, 90

CRPQ ConjunctiveRegular Path Query.31, 32, 99

DARPE Direction-Aware Regular Path Expression. 90, 107, 110

DDL Data DefinitionLanguage. 11, 17, 37, 41, 93, 95, 103, 106–108, 111

DML Data ManipulationLanguage. 11, 17, 37, 38, 41, 47, 55, 81, 92–95, 103, 106–109, 111

DQL Data Query Language. 11, 37, 47, 80, 92, 95, 100, 106, 111

ECRPQ ExtendedConjunctiveRegular Path Query. 32, 33, 42, 78

GQL Graph Query Language. 13, 74, 103, 113

GTM Gremlin TraversalMachine.15, 59, 62, 109

ISO International Organization for Standardization. 13, 74, 103

LDBC LinkedData BenchmarkCouncil.16, 17, 22, 27, 35, 42, 62, 95, 98

OLAP Online Analytical Processing. 4, 15, 17, 84, 95, 109, 110

OLTP Online Transactional Processing. 4, 15–17, 73, 74, 84, 95, 109, 110

PGX Parallel Graph AnalytiX. 15, 16, 81, 110

PPG Path PropertyGraph. 42, 95, 98, 100, 101

RDF Resource Description Framework. 8, 40, 103

REM Regular Expressions with Memory. 32, 78, 99, 107, 109

RPQ Regular Path Query. 29–36, 52, 68, 78, 90, 99, 107–110

121 SNB SocialNetwork Benchmark. 17–19, 22, 27, 35, 38, 40, 41, 45, 66, 78, 86, 113, 115

SQL Structured Query Language. 1, 6, 11–16, 22, 42, 46–48, 51, 63, 74, 76–78, 81–84, 92, 94, 95, 97, 102, 103, 106, 109, 110

W3C World Wide WebConsortium.8,9,13

122 Bibliography

[AAA+20] Renzo Angles, János BenjaminAntal,Alex Averbuch, Peter A. Boncz, Orri Erling, Andrey Gubichev,Vlad Haprian,MoritzKaufmann,Josep-Lluís Larriba-Pey,NorbertMartínez-Bazan,József Marton, MarcusParadies, Minh-Duc Pham, Arnau Prat-Pérez,MirkoSpasic, BenjaminA.Steer, Gábor Szárnyas, andJackWaudby. The LDBC social network benchmark. CoRR,abs/2001.02299, 2020. [currentstable version: 0.3.2].

[AAB+17] Renzo Angles, MarceloArenas, Pablo Barceló, Aidan Hogan, Juan L. Reutter, and DomagojVrgoc.Foundationsofmodernquery languages for graph databases. ACMComputing Surveys,50(5):68:1–68:40, 2017.

[AAB+18] Renzo Angles, MarceloArenas, Pablo Barceló, Peter A. Boncz, George H. L. Fletcher,Claudio Gutierrez, Tobias Lindaaker,Marcus Paradies, Stefan Plantikow,Juan F. Sequeda, Oskar vanRest,and Hannes Voigt. G-CORE: Acore for future graphquery languages.InGautamDas, Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedings of the 2018 Inter- national ConferenceonManagement of Data, SIGMODConference2018, Houston, TX,USA, June 10-15, 2018,pages 1421–1432. ACM, 2018.

[ABR13] Renzo Angles, Pablo Barceló, and Gonzalo Rios.Apractical query language for graph dbs. In Loreto Bravo and MaurizioLenzerini,editors, Proceedings of the 7th AlbertoMendelzon InternationalWorkshoponFoundations of Data Management, Puebla/Cholula, Mexico, May21-23, 2013,volume 1087 of CEUR WorkshopProceedings.CEUR-WS.org, 2013.

[ACP12] Marcelo Arenas,Sebastián Conca, and Jorge Pérez. Counting beyond a yottabyte,orhow SPARQL 1.1 property paths willpreventadoptionof the standard. In AlainMille, Fabien L. Gandon, Jacques Misselis, Michael Rabinovich,and Steffen Staab, editors, Proceedingsofthe 21st WorldWide WebConference2012, WWW2012, Lyon,France, April 16-20,2012,pages 629–638. ACM, 2012.

[AG08] Renzo Angles and Claudio Gutiérrez. Survey of graph database models. ACMComputing Surveys,40(1):1:1–1:39, 2008.

123 [AG18] Renzo Angles and Claudio Gutierrez.Anintroduction to graphdatamanage- ment. In George H. L. Fletcher,Jan Hidders, and Josep-Lluís Larriba-Pey, editors, GraphDataManagement, Fundamental Issues and Recent Devel- opments,Data-Centric Systems and Applications, pages 1–32. Springer, 2018. [Ang12] Renzo Angles. Acomparison of currentgraph database models.InAnas- tasiosKementsietsidisand Marcos Antonio VazSalles, editors, Workshops Proceedingsofthe IEEE28th International ConferenceonData Engineering, ICDE2012, Arlington,VA, USA, April 1-5, 2012,pages 171–177. IEEE Computer Society, 2012. [Ang18] Renzo Angles. The property graph databasemodel.InDan Olteanuand Barbara Poblete, editors, Proceedingsofthe 12th AlbertoMendelzonInterna- tional WorkshoponFoundations of Data Management, Cali, Colombia, May 21-25, 2018,volume 2100 of CEURWorkshop Proceedings.CEUR-WS.org, 2018. [ARV19] Renzo Angles, Juan L. Reutter, and Hannes Voigt. Graph query languages. In Sherif Sakr and Albert Y. Zomaya,editors, Encyclopedia of Big Data Technologies.Springer, 2019. [AS92] Bernd Amann and Michel Scholl.Gram: Agraph data model and query language. InDario Lucarella, JocelyneNanard, MarcNanard, and Paolo Paolini,editors, ECHT’92: EuropeanConferenceonHypertextTechnology, November30-December 4, 1992, Milan, Italy,pages 201–211. ACM, 1992. [Bar13] Pablo Barceló. Querying graph databases. In RichardHull and Wenfei Fan, editors, Proceedings of the 32nd ACMSIGMOD-SIGACT-SIGART Symposium on PrinciplesofDatabaseSystems, PODS2013, New York,NY, USA -June 22 -27, 2013,pages 175–188. ACM, 2013. [BFL12] Pablo Barceló, DiegoFigueira, and LeonidLibkin. Graph logicswithrational relations and thegeneralizedintersection problem. In Proceedings of the 27th AnnualIEEE Symposium on Logic in Computer Science,LICS 2012, Dubrovnik, Croatia,June25-28, 2012,pages 115–124. IEEE Computer Society, 2012. [BLLW12] Pablo Barceló, LeonidLibkin,Anthony Widjaja Lin,and Peter T. Wood. Expressivelanguages for pathqueriesovergraph-structured data. ACM Transactions on Database Systems,37(4):31:1–31:46, 2012. [BLR11] Pablo Barceló, LeonidLibkin,and JuanL.Reutter. Queryinggraph patterns. In Maurizio Lenzerini and Thomas Schwentick, editors, Proceedings of the 30th ACMSIGMOD-SIGACT-SIGART Symposium on Principlesof DatabaseSystems, PODS2011, June 12-16, 2011, Athens,Greece,pages 199–210. ACM, 2011.

124 [BPG+19] Maciej Besta, EmanuelPeter,RobertGerstenberger, Marc Fischer,Michal Podstawski, Claude Barthels, Gustavo Alonso, and TorstenHoefler. De- mystifying graph databases: Analysis and taxonomyofdata organization, systemdesigns,and graphqueries. CoRR,abs/1910.09017, 2019. [BPR12] Pablo Barceló, JorgePérez, and Juan L. Reutter. Relative expressiveness of nested regularexpressions. In Juliana Freire and Dan Suciu, editors, Proceed- ingsofthe 6th AlbertoMendelzon International WorkshoponFoundations of Data Management, Ouro Preto, Brazil, June 27-30, 2012,volume 866 of CEUR Workshop Proceedings,pages 180–195. CEUR-WS.org, 2012. [CB05] Thomas M. Connollyand CarolynE.Begg. Databasesystems: apractical approach to design, implementation,and management.Pearson Education, 2005. [CGLV00a] DiegoCalvanese, GiuseppeDeGiacomo,Maurizio Lenzerini, and Moshe Y. Vardi. Containment of conjunctiveregular pathquerieswith inverse. In AnthonyG.Cohn,FaustoGiunchiglia,and BartSelman,editors, KR 2000, Principles of Knowledge Representation and Reasoning Proceedings of the Seventh International Conference, Breckenridge, Colorado, USA,April11-15, 2000,pages 176–185. Morgan Kaufmann, 2000. [CGLV00b] DiegoCalvanese, GiuseppeDeGiacomo,Maurizio Lenzerini, and Moshe Y. Vardi. Query processing using views forregular pathquerieswith inverse. In Proceedings of the 19th ACMSIGMOD-SIGACT-SIGART Symposium on Principles of DatabaseSystems, PODS2000,pages 58–66, 2000. [CGLV03] DiegoCalvanese, GiuseppeDeGiacomo,Maurizio Lenzerini, and Moshe Y. Vardi. Reasoning on regular pathqueries. SIGMODRecord,32(4):83–92, 2003. [Cha12] MarkChatham. Structured Query Language By Example-Volume I: Data Query Language.Lulu.com, 2012. [CL14] Arnaud Castelltort andAnne Laurent. Fuzzy queriesovernosqlgraph databases: Perspectives forextending the cypherlanguage. In Anne Laurent, OlivierStrauss,BernadetteBouchon-Meunier,and Ronald R. Yager,editors, InformationProcessingand Management of Uncertainty in Knowledge-Based Systems-15th International Conference, IPMU2014, Montpellier,France, July15-19, 2014, Proceedings, PartIII,volume 444 of Communications in Computerand Information Science,pages 384–395. Springer, 2014. [CM90] Mariano P. Consens and AlbertoO.Mendelzon. Graphlog: avisualformalism for real life recursion.InDaniel J. Rosenkrantz and Yehoshua Sagiv, editors, Proceedingsofthe Ninth ACMSIGACT-SIGMOD-SIGARTSymposium on Principles of DatabaseSystems, April 2-4, 1990,Nashville, Tennessee, USA, pages 404–416. ACMPress,1990.

125 [CMW87] Isabel F. Cruz, Alberto O. Mendelzon,and Peter T. Wood.Agraphical query languagesupportingrecursion.InUmeshwarDayal and IrvingL.Traiger, editors, Proceedings of the 1987 ACMSIGMOD InternationalConference on Management of Data, San Francisco,CA, USA, May 27-29, 1987,pages 323–330. ACMPress,1987.

[CMW88] Isabel F. Cruz, Alberto O. Mendelzon, andPeter T. Wood.G+: recursive querieswithout recursion. In LarryKerschberg,editor, Expert Database Systems, Proceedings from the Second InternationalConference, Vienna, Virginia,USA, April 25-27,1988,pages 645–666. Benjamin/Cummings, 1988.

[Cod80] E. F. Codd. Data modelsindatabase management. In Michael L. Brodie and Stephen N. Zilles, editors, Proceedingsofthe WorkshoponData Abstraction, Databases and Conceptual Modelling, Pingree Park, Colorado, USA, June 23-26,1980,volume11, pages 112–114. ACMPress, 1980.

[DNR09] Anton Dries, Siegfried Nijssen, and Luc De Raedt. Aquery language for analyzingnetworks. In DavidWai-Lok Cheung, Il-Yeol Song, Wesley W. Chu, Xiaohua Hu, and JimmyJ.Lin, editors, Proceedingsofthe 18th ACM ConferenceonInformation and Knowledge Management, CIKM2009, Hong Kong, China, November 2-6, 2009,pages 485–494. ACM, 2009.

[DWX18] AlinDeutsch,Mingxi Wu,and Yu Xu. Seamlesssyntactic and semantic integration of queryprimitives overrelational and graph datainGSQL. https://cdn2.hubspot.net/hubfs/4114546/Collateral/ IntegrationQuery%20PrimitivesGSQL.pdf,2018. [Online; last accessedNovember 2020].

[DXWL19] AlinDeutsch,YuXu, MingxiWu, andVictorLee. Tigergraph: Anative MPP graph database. CoRR,abs/1901.08248, 2019.

[EAL+15] Orri Erling, AlexAverbuch,Josep-Lluís Larriba-Pey,Hassan Chafi,Andrey Gubichev, Arnau Prat-Pérez, Minh-Duc Pham, and Peter A. Boncz. The LDBCsocialnetwork benchmark:Interactiveworkload.InTimosK.Sellis, SusanB.Davidson,and Zachary G. Ives, editors, Proceedings of the 2015 ACMSIGMOD InternationalConference on Management of Data, Mel- bourne, Victoria, Australia, May31-June 4, 2015,pages 619–630. ACM, 2015.

[FB18] Diogo Fernandesand Jorge Bernardino. Graph databasescomparison: Alle- grograph,arangodb,infinitegraph,neo4j, andorientdb. In Jorge Bernardino and Christoph Quix, editors, Proceedingsofthe 7th International Conference on Data Science,Technology and Applications,DATA 2018, Porto, Portugal, July 26-28, 2018,pages 373–380. SciTePress, 2018.

126 [FGG+18a] Nadime Francis, AlastairGreen, Paolo Guagliardo, LeonidLibkin,Tobias Lindaaker, VictorMarsault,StefanPlantikow,Mats Rydberg,Martin Schus- ter, Petra Selmer, and Andrés Taylor. Formal semantics of thelanguage cypher. CoRR,abs/1802.09984, 2018.

[FGG+18b] Nadime Francis, AlastairGreen, Paolo Guagliardo, LeonidLibkin,Tobias Lindaaker, VictorMarsault, StefanPlantikow,Mats Rydberg,Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for propertygraphs. In Gautam Das,Christopher M. Jermaine, and Philip A. Bernstein, editors, Proceedingsofthe 2018 International ConferenceonManagement of Data, SIGMODConference2018, Houston, TX, USA, June 10-15, 2018,pages 1433–1445. ACM, 2018.

[FLS98] Daniela Florescu, Alon Y. Levy, andDan Suciu. Query containment for conjunctivequeries with regularexpressions. In Alberto O. Mendelzon and Jan Paredaens, editors, Proceedings of the Seventeenth ACMSIGACT- SIGMOD-SIGARTSymposium on PrinciplesofDatabaseSystems, June 1-3, 1998, Seattle, Washington, USA,pages 139–148. ACMPress,1998.

[FPSW19] GeorgeH.L.Fletcher,Alexandra Poulovassilis, Petra Selmer, and PeterT. Wood.Approximate queryingfor thepropertygraph language cypher. In 2019 IEEEInternational ConferenceonBig Data, LosAngeles,CA, USA, December 9-12,2019,pages 617–622. IEEE, 2019.

[GFL+18] AlastairGreen, Peter Furniss, TobiasLindaaker,Petra Selmer, Hannes Voigt, and Stefan Plantikow.GQL Scope and features. https: //s3.amazonaws.com/artifacts.opencypher.org/website/ materials/sql-pg-2018-0046r3-GQL-Scope-and-Features. pdf,alternative: https://drive.google.com/file/d/ 1nQXzl7u5eU0iu8s1Q7ZdBSLuPGBxjZWF/view,December 2018. [Online; lastaccessedNovember 2020; linked from http://www.opencypher.org/references].

[GJK+18] AlastairGreen, Martin Junghanns,Max Kießling, TobiasLindaaker, Stefan Plantikow,and Petra Selmer. opencypher: New directions in propertygraph querying. In Michael H. Böhlen,Reinhard Pichler,Norman May, Erhard Rahm,Shan-Hung Wu,and Katja Hose, editors, Proceedings of the 21th International ConferenceonExtending DatabaseTechnology, EDBT2018, Vienna, Austria,March 26-29, 2018,pages 520–523. OpenProceedings.org, 2018.

[GQL] Existing languages. https://www.gqlstandards.org/ existing-languages.[Online;last accessedNovember2020].

[Gre18] AlastairGreen. TheGQL manifesto -one propertygraph query language. https://gql.today,May 2018. [Online; lastaccessed May2020].

127 [Gre19a] AlastairGreen. GQL is nowaglobal standards projectalongside SQL. https://neo4j.com/blog/ gql-standard-query-language-property-graphs,September 2019. [Online; last accessed May2020].

[Gre19b] AlastairGreen. Graph query languageGQL -criticalmilestone forISO graph query standard GQL. https://www.gqlstandards.org/gql-blogs/ critical-milestone-for-iso-graph-query-standard-gql, June 2019. [Online;last accessed May2020].

[HLP+19] Keith Hare,VictorLee, Stefan Plantikow, Oskar vanRest,and Jan Michels. SQL and GQL. https://drive.google.com/file/d/ 1hwmJt9CqD8C8zoFskGnZ44eX2MZooyHM/view,March2019. [Online; last accessed May2020; linked from https://www.gqlstandards.org/ resources/papers-and-slides].

[HP13] FlorianHolzschuher and RenéPeinl.Performance of graph query languages: comparison of cypher, gremlinand nativeaccess in neo4j. In Giovanna Guerrini,editor, Joint 2013 EDBT/ICDT Conferences, EDBT/ICDT ’13, Genoa,Italy, March22, 2013, WorkshopProceedings,pages 195–204. ACM, 2013.

[HP16] FlorianHolzschuher and RenéPeinl. Querying agraph database-language selectionand performanceconsiderations. Journal of Computer and System Sciences,82(1):45–68, 2016.

[HS13] SteveHarris and Andy Seaborne. SPARQL 1.1 Overview. https://www. w3.org/TR/sparql11-overview/,March2013. [Online; lastaccessed May2020].

[Kun87] H. S. Kunii. DBMS with graph datamodel for knowledge handling. In Stephen A. Szygenda, editor, Proceedings of the 1987 Fall Joint Computer ConferenceonExploringtechnology: todayand tomorrow,pages 138–142. ACM, 1987.

[LDB20] LDBCSocialNetwork Benchmark task force.The ldbcsocial network benchmark (version 0.4.0-snapshot).used snapshot version: https: //github.com/martin-kl/Diploma_Thesis_01526110_code/ blob/master/ldbc-snb-specification_0.4.0-SNAPSHOT_ from_2020-11-03.pdf,current LDBC snapshot: https://ldbc. github.io/ldbc_snb_docs/ldbc-snb-specification.pdf,2020. [Online; lastaccessed November 2020; usedsnapshotfromNov.032020].

[Lin18] TobiasLindaaker. An overviewofthe recenthistory of graph query languages. https://s3.amazonaws.com/artifacts.opencypher.org/ website/materials/DM32.2/DM32.2-2018-00085R1-recent_

128 history_of_property_graph_query_languages.pdf,May 2018. [Online; lastaccessed October 2020;linked from http: //www.opencypher.org/references].

[LM13] Katja Losemann and WimMartens. The complexityofregularexpressions and property pathsinSPARQL. ACMTransactions on DatabaseSystems, 38(4):24:1–24:39, 2013.

[LMV16] LeonidLibkin,Wim Martens, and DomagojVrgoc.Querying graphswith data. Journal of the ACM,63(2):14:1–14:53, 2016.

[LS06] Yanhong A. Liu and Scott D. Stoller. Queryingcomplex graphs. In Pas- cal VanHentenryck, editor, PracticalAspects of Declarative Languages,8th InternationalSymposium, PADL 2006, Charleston, SC, USA, January9-10, 2006, Proceedings,volume 3819 of LectureNotes in Computer Science,pages 199–214. Springer, 2006.

[LV12] LeonidLibkin and DomagojVrgoc.Regular pathqueriesongraphs with data.InAlin Deutsch,editor, 15th InternationalConferenceonDatabase Theory, ICDT ’12,Berlin, Germany, March 26-29, 2012,pages 74–85. ACM, 2012.

[MHM17] JanMichels, Keith Hare,and Jim Melton. Standardizinggraph database functionality. https://drive.google.com/file/d/ 18vlmMAZBFAvBkKipSRiu94JTLCt4vxC8/view,February2017. [Online; last accessedMay 2020].

[MSV17] JózsefMarton, Gábor Szárnyas, andDániel Varró. Formalising opencypher graph queriesinrelational algebra. In MariteKirikova, Kjetil Nørvåg, and GeorgeA.Papadopoulos, editors, Advances in Databases and Information Systems-21stEuropeanConference, ADBIS 2017, Nicosia,Cyprus, Septem- ber24-27, 2017, Proceedings,volume 10509 of LectureNotes in Computer Science,pages 182–196. Springer,2017.

[MW89] Alberto O. Mendelzon and Peter T. Wood.Finding regularsimple paths in graph databases. In Peter M. G. Apersand Gio Wiederhold, editors, Proceedings of the Fifteenth InternationalConferenceonVery LargeData Bases, August22-25, 1989, Amsterdam, The Netherlands,pages 185–193. Morgan Kaufmann, 1989.

[Neo20] Neo4j,Inc. TheNeo4j Cyphermanual. https://neo4j.com/docs/ cypher-manual/current/,2020. [Online; lastaccessed October2020].

[ope18] openCypherImplementers Group. Cypher querylanguage reference, version9. https://github.com/opencypher/openCypher/blob/ master/docs/openCypher9.pdf,2018. [Online; lastaccessed October 2020].

129 [Ora20] Oracle. PGQL 1.3 specification | propertygraph query language. https:// pgql-lang.org/spec/1.3/,March2020. [Online; lastaccessed October 2020].

[PAG10] JorgePérez, MarceloArenas, and Claudio Gutiérrez.nsparql:Anavigational language for RDF. Journal of WebSemantics,8(4):255–270, 2010.

[Pla18] StefanPlantikow.Summary chart of Cypher, PGQL, and G- Core. https://gql.today/wp-content/uploads/2018/05/ ytz-030r1-Summary-Chart-of-Cypher-PGQL-GCore-1.pdf, May2018. [Online; lastaccessedNovember2020; linked from https://gql.today/comparing-cypher-pgql-and-g-core/].

[PS08] Eric Prud’hommeaux and Andy Seaborne. SPARQL querylan- guage for RDF, W3C recommendation. https://www.w3.org/TR/ rdf-sparql-query/,January 2008. [Online; lastaccessed May2020].

[RABM17] Michael Rath, David Akehurst,Christoph Borowski, and PatrickMäder. Are graph query languages applicablefor requirementstraceabilityanalysis? In Eric Knauss,Angelo Susi, David Ameller,Daniel M. Berry,Fabiano Dalpiaz, Maya Daneva, Marian Daun, Oscar Dieste,Peter Forbrig,Eduard C. Groen, Andrea Herrmann,Jennifer Horkoff, Fitsum Meshesha Kifetew,Marite Kirikova, Alessia Knauss,PatrickMaeder,Fabio Massacci,Cristina Palo- mares, JolitaRalyté,AhmedSeffah, AlbertoSiena,and Bastian Tenbergen, editors, JointProceedingsofREFSQ-2017 Workshops, Doctoral Symposium, Research MethodTrack, and Poster Track co-locatedwiththe 22nd Interna- tional ConferenceonRequirements Engineering: Foundationfor Software Quality (REFSQ 2017), Essen, Germany, February27, 2017,volume 1796 of CEUR Workshop Proceedings.CEUR-WS.org,2017.

[RH19] Florin Rusu and Zhiyi Huang. In-depth benchmarking of graph database systems with thelinked data benchmark council(LDBC) social network benchmark (SNB). CoRR,abs/1907.07405, 2019.

[RN11] Marko A. Rodriguez andPeter Neubauer.The graph traversalpattern. In Sherif Sakr and Eric Pardede, editors, GraphData Management: Techniques and Applications,pages 29–46. IGI Global, 2011.

[Rod15] Marko A. Rodriguez. Thegremlin graph traversalmachine and language (invited talk). In James Cheneyand Thomas Neumann, editors, Proceedings of the 15th Symposium on DatabaseProgrammingLanguages,Pittsburgh, PA,USA, October 25-30, 2015,pages 1–10. ACM, 2015.

[RTH+17] Nicholas P. Roth, Vasileios Trigonakis, SungpackHong, Hassan Chafi, AnthonyPotter,BorisMotik,and IanHorrocks. Pgx.d/async: Ascal- abledistributed graphpattern matching engine. In Peter A. Boncz

130 and Josep-LluísLarriba-Pey,editors, Proceedings of the Fifth Interna- tional WorkshoponGraph Data-management Experiences &Systems, GRADES@SIGMOD/PODS 2017, Chicago, IL, USA, May 14 -19, 2017, pages 7:1–7:6. ACM, 2017.

[RWE15] Ian Robinson,Jim Webber, and EmilEifrem. GraphDatabases: New Opportunities forConnectedData.O’Reilly Media,Inc., 2nd edition,2015.

[Sco06] Michael L. Scott. Programminglanguage pragmatics (2.ed.).Morgan Kaufmann,2006.

[SHvR+16] Martin Sevenich,Sungpack Hong, Oskar vanRest,Zhe Wu,Jay Banerjee, and Hassan Chafi.Using domain-specific languages for analytic graph databases. Proceedingsofthe VLDB Endowment,9(13):1257–1268, 2016.

[SS19] Chandan Sharmaand Roopak Sinha. Aschema-first formalism for labeled property graph databases: Enabling structured data loading and analytics. In Kenneth Johnson, Josef Spillner, Xinghui Zhao, Olga Datskova, and Blesson Varghese,editors, Proceedings of the 6th IEEE/ACM International ConferenceonBig Data Computing, Applications and Technologies, BDCAT 2019, Auckland, New Zealand, December 2-5,2019,pages 71–80. ACM, 2019.

[SW14] Semih Salihoglu and Jennifer Widom.Optimizing graph algorithms on pregel-likesystems. Proc.VLDB Endow.,7(7):577–588, 2014.

[The02] DimitriTheodoratos. Semanticintegration and querying of heterogeneous data sources using ahypergraph datamodel. In BarryEaglestone,Siobhán North, and Alexandra Poulovassilis, editors, Advances in Databases, 19th British National ConferenceonDatabases, BNCOD19, Sheffield, UK,July 17-19, 2002, Proceedings,volume2405 of LectureNotes in Computer Science, pages 166–182. Springer,2002.

[Tig] TigerGraph.GSQL language reference. https://docs.tigergraph. com/v/3.0/dev/gsql-ref.[Online; last accessed November 2020, GSQL versiononTigerGraph 3.0].

[Tin15] TinkerPop3 documentationv3.0.1. https://tinkerpop.apache.org/ docs/3.0.1-incubating/,September 2015. [Online; lastaccessed Oc- tober2020].

[Tin20] Tinkerpop3 documentation v3.4.6. http://tinkerpop.apache.org/ docs/current/reference/,February2020. [Online; lastaccessed Oc- tober2020].

[TKL19] Frank Tetzel, Romans Kasperovics,and Wolfgang Lehner.Graph traversals for regular pathqueries. In Akhil Arora, Arnab Bhattacharya, and George H. L. Fletcher, editors, Proceedingsofthe 2nd Joint International Workshop

131 on Graph Data Management Experiences &Systems (GRADES) and Network Data Analytics (NDA), Amsterdam, The Netherlands,30June 2019,pages 5:1–5:8. ACM, 2019.

[TPAV19] Harsh Thakkar, DharmenPunjani, SörenAuer,and Maria-Esther Vidal. Towards an integrated graph algebra forgraph pattern matching with gremlin (extended version). CoRR,abs/1908.06265, 2019.

[TvR19] Vasileios Trigonakisand Oskarvan Rest.Propertygraph extensions forthe SQL standard. http://wiki.ldbcouncil.org/download/ attachments/106233859/ldbc_tuc_2019_sql-pgq.pdf,July 2019. [Online; last accessed November 2020].

[vR17] Oskarvan Rest.PGQL -status update and comparison to LDBC’sgraphQLproposals. https://github.com/ldbc/tuc_ presentations/blob/master/20170209-pgql.pdf,alternative link: http://wiki.ldbcouncil.org/download/attachments/ 59277315/ninth_ldbc_tuc_pgql.pdf,February2017. [Online; last accessedNovember 2020].

[vRHK+16] Oskarvan Rest,SungpackHong, Jinha Kim, Xuming Meng, and Hassan Chafi.PGQL: apropertygraph query language.InPeter A. Bonczand Josep-Lluís Larriba-Pey,editors, Proceedingsofthe FourthInternational WorkshoponGraph Data Management Experiences and Systems(GRADES), Redwood Shores, CA, USA, June 24 -24, 2016,page 7. ACM, 2016.

[WD18] Mingxi Wu andAlinDeutsch.GSQL:Ansql-inspiredgraphquery language. https://info.tigergraph.com/gsql,2018. [Online; lastaccessed May2020].

[Woo90] Peter T. Wood.Factoring augmented regular chain programs. In Dennis McLeod, RonSacks-Davis, and Hans-Jörg Schek, editors, 16th Interna- tional ConferenceonVery LargeData Bases,August13-16, 1990, Brisbane, Queensland, Australia, Proceedings,pages 255–263. Morgan Kaufmann, 1990.

[Woo12] Peter T. Wood.Querylanguages for graph databases. SIGMODRecord, 41(1):50–60, 2012.

[WS90] Carolyn R. Wattersand Michael A. Shepherd. Atransienthypergraph-based modelfor dataaccess. ACMTrans.Inf. Syst.,8(2):77–102, 1990.

[Wu18] Mingxi Wu.Propertygraph type system and datadefinition language. CoRR,abs/1810.08755, 2018.

[Yan90] Mihalis Yannakakis. Graph-theoretic methods in database theory. In Daniel J. Rosenkrantz and Yehoshua Sagiv, editors, Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGARTSymposium on PrinciplesofDatabaseSystems,

132 April 2-4, 1990,Nashville, Tennessee, USA,pages 230–242. ACMPress, 1990.

133