Semantic Web 0 (0) 1 1 IOS Press

A Stitch in Time Saves Nine – SPARQL Querying of Property Graphs using Traversals

Harsh Thakkar a,*, Dharmen Punjani b, Yashwant Keswani c, Jens Lehmann a,d and Sören Auer e

a Smart Data Analytics, University of Bonn, Germany, E-mail: {thakkar, jens.lehmann}@cs.uni-bonn.de
b Department, National and Kapodistrian University of Athens, Greece, E-mail: [email protected]
c DA-IICT, India, E-mail: [email protected]
d Fraunhofer IAIS, Germany, E-mail: [email protected]
e TIB Technische Informationsbibliothek & L3S Research Center, Leibniz University of Hannover, Germany, E-mail: [email protected]

Abstract. Knowledge graphs have become popular over the past years and frequently rely on the Resource Description Framework (RDF) or Property Graphs (PG) as underlying data models. However, the query languages for these two data models – SPARQL for RDF and Gremlin for property graph traversal – lack interoperability. We present Gremlinator, a novel SPARQL to Gremlin translator. Gremlinator translates SPARQL queries to Gremlin traversals for executing graph pattern matching queries over graph databases. This allows users to access and query a wide variety of Graph Data Management Systems (DMS) using the W3C standardized SPARQL and to avoid the learning curve of a new query language. Gremlin is a system-agnostic traversal language covering both OLTP and OLAP graph processors, which makes it a desirable choice for supporting interoperability with respect to querying Graph DMSs. We present a comprehensive empirical evaluation of Gremlinator and demonstrate its validity and applicability by executing SPARQL queries on top of leading graph stores, including Sparksee and Apache TinkerGraph, and comparing the performance with the RDF stores Virtuoso, 4Store and JenaTDB. Our evaluation demonstrates the substantial performance gain obtained by the Gremlin counterparts of the SPARQL queries, especially for star-shaped and complex queries.

Keywords: SPARQL, Gremlin, Pattern Matching, Graph Traversal, Query Translator, RDF Graph, Property Graph, Gremlinator

arXiv:1801.02911v2 [cs.DB] 12 Feb 2018
1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved
*Corresponding author. E-mail: {thakkar, jens.lehmann}@cs.uni-bonn.de.
H. Thakkar et al. / Gremlinator

1. Introduction

Knowledge graphs have become increasingly popular over the past years. The two most popular data models for representing and storing knowledge graphs are property graphs (PG) and the Resource Description Framework (RDF). For RDF, the SPARQL query language was standardized by W3C, whereas for PGs several languages are frequently used, including Gremlin [1]. Both data models and the corresponding data management techniques have distinct and complementary characteristics: RDF is suited for distributed data integration with built-in world-wide unique identifiers and the expressive SPARQL query language; PGs, on the other hand, support extremely scalable storage and querying and are by now widely used for modern Web applications.

In this article, we present Gremlinator, an approach for executing SPARQL queries over graph databases via Gremlin traversals. It builds a bridge between the currently still largely disjoint semantic and graph data technology ecosystems and thereby addresses the query interoperability problem.

A SPARQL-PG query translation renders several benefits: (1) Applications based on W3C Semantic Web standards, like SPARQL and RDF, can use property graph databases in a non-intrusive fashion. (2) The query translation lays the foundation for a hybrid use of RDF triple stores and property graph DMS, where a particular query can be dispatched to the DMS capable of answering it more efficiently [2]. In particular, property graph databases have been shown to work very well for a wide range of queries which benefit from locality in a graph. Rather than performing expensive joins, property graph databases use micro indices to perform traversals. (3) Users familiar with the W3C SPARQL query language can avoid learning another query language.

To the best of our knowledge, this is the first work addressing the query interoperability (translation) problem. Related work (cf. Section 2) mostly covers the area of SPARQL to SQL conversion and vice versa. In contrast to those previous efforts, we have to overcome the challenge of mediating between two very different execution paradigms: while SPARQL uses pattern matching techniques, Gremlin is based on performing graph traversals. More specifically, previous efforts applied query rewriting techniques between formalisms which are ultimately rooted in relational algebra operations, whereas we had to bridge more disparate query paradigms. While this is a significant challenge, it is also the reason why substantial performance improvements can be made depending on the query characteristics: whereas direct SPARQL query execution can be expected to be suitable for large analytical joins over the entire dataset, the Gremlin conversion can significantly speed up all queries that exploit graph locality.

We selected TinkerPop Gremlin as the target language since it is more general than other property graph query languages: it is supported by a wide range of property graph databases, including OLTP and OLAP processors (see Figure 1 (a)). Moreover, Gremlin supports both the imperative (graph traversal) and the declarative (graph pattern matching) style [1], which helps in addressing the query interoperability issue. Lastly, together with the Apache TinkerPop framework, Gremlin is a language and a virtual machine, which makes it possible to design another traversal language that compiles to the Gremlin traversal machine (analogous to how Scala compiles to the JVM), cf. Figure 1 (b).

Figure 1. The Gremlin Traversal Language and Machine.

We map SPARQL queries to pattern matching Gremlin traversals (i.e. we map declarative SPARQL queries to declarative Gremlin constructs and not to the imperative ones). This ensures a level of fairness when comparing the performance of both Graph Query Languages (GQLs). Furthermore, instead of translating SPARQL queries to a specific dialect of Gremlin (e.g. Gremlin-Java8, Gremlin-Python, etc.), we map each operation within a SPARQL basic graph pattern (BGP) to its corresponding traversal step in the Gremlin instruction library (i.e. a single step traversal operation). As a result, we build complex pattern matching traversals in a fashion analogous to SPARQL-style querying, wherein multiple BGPs form complex graph patterns (CGP). Thus, it is possible to construct a corresponding Gremlin traversal for each SPARQL query.

Overall, we make the following contributions:
– We propose a novel approach for mapping SPARQL queries to Gremlin pattern matching traversals, which is, to the best of our knowledge, the first work converting an RDF query language to a property graph query language.
– Our Gremlinator implementation for executing SPARQL queries over a plethora of third party graph DMS such as Neo4J, Sparksee, OrientDB, etc. using the Apache TinkerPop framework is openly available.
– We report the results of a comprehensive empirical evaluation of the proposed translation approach, comprising a variety of state-of-the-art property graph databases and triple stores on the Northwind and BSBM datasets.

The remainder of the article is organized as follows: Section 2 covers related query conversion efforts. Section 3 introduces preliminary notions. Section 4 describes the relationship between SPARQL graph pattern matching and Gremlin traversal steps. Section 5 explains our mapping approach. Section 6 presents the experimental evaluation on two well-known datasets and discusses the results and observations. Finally, Section 7 concludes the article and describes future work.

2. Related Work

In this section we briefly survey the related work with regard to techniques and tools that support the translation and execution of GQLs.

SPARQL → SQL: A substantial amount of work has been done on the conversion of SPARQL queries to SQL queries [3–8]. Ontop [3]¹ exposes relational databases as virtual RDF graphs by linking the terms (classes and properties) in the ontology to the data sources through mappings. This virtual RDF graph can then be queried using SPARQL by dynamically and transparently translating the SPARQL queries into SQL queries over the relational databases. The work presented in [4] generates SQL that is optimized and also provides a well-defined specification of the SPARQL semantics used in the translation. In addition, Ontop also supports R2RML mappings over general relational schemas. The authors show that their implementation can outperform other well-known SPARQL-to-SQL systems, as well as commercial triple stores, by a large margin. In [5] a SPARQL-to-SQL translation technique is introduced that focuses on the generation of efficient SQL queries. It relies on a mapping language that lacks support for URI templates and is less expressive than R2RML. [6] proposes a translation function that takes a query and two many-to-one mappings: (i) a mapping between the triples and the tables, and (ii) a mapping between pairs of the form (triple, pattern, position) and relational attributes. In addition, the approach in [6] assumes that the underlying relational DB is denormalized and stores RDF terms. The two semantics deviate in the definition of the OPTIONAL algebra operator. The work in [8] extends [6] to include R2RML mappings. [7] makes use of non-standard SQL constructs for SPARQL–SQL translation and lacks both a formal proof that the translation is correct and an empirical evaluation with realistic data.

SQL → SPARQL: The work in [9] presents a formal semantics-preserving translation from SQL to SPARQL. RETRO [9] deals only with schema mapping and query mapping rather than physically transforming the data. Schema mapping derives a domain-specific relational schema from RDF data. Query mapping transforms an SQL query over the schema into an equivalent SPARQL query, which in turn is executed against the RDF store.

SQL → CYPHER: CYPHER² is the graph query language used to query the Neo4j³ graph database. There has been no work yet aiming to convert SQL (RDBMS) queries to CYPHER. However, there are some examples⁴ that show the equivalent CYPHER queries for certain SQL queries.

¹ Ontop system (http://ontop.inf.unibz.it/)
² CYPHER Query Language (https://neo4j.com/developer/cypher-query-language/)
³ Neo4j (https://neo4j.com/)
⁴ SQL to CYPHER (https://neo4j.com/developer/guide--to-cypher/)

3. Preliminaries

In this section, we recall and summarize the mathematical concepts which will be used in this article. Our notation closely follows [10] and extends [11] by introducing the notion of vertex labels, a detailed discussion of which can be found in [12].

3.1. Graph Data Models

3.1.1. Edge-labeled Graphs.

The Resource Description Framework (RDF) is a well-known W3C standard which is used for data modeling and for encoding machine-readable content on the Web [13] and within intranets. An RDF graph can be seen as a set of triples, roughly analogous to nodes and edges in a graph database. However, RDF is more specific in defining disjoint vertex sets of blank nodes, literals and IRIs. In its rudimentary form, an RDF graph is a directed, edge-labeled multi-graph, or simply an edge-labeled graph. In our current context, we do not consider blank nodes.⁵ Edge-labeled graphs can be used to encode complex information despite their elementary structure [10]. Edge-labeled graphs have been formally defined in a wide variety of texts, such as [10, 14–17]. We adapt the definition provided by [10], which is the closest to our current context:

⁵ The treatment of blank nodes is orthogonal to our current goal, as they relate to the translation of RDF graphs to property graphs. We focus on the pattern matching features and semantics of SPARQL and Gremlin.

Definition 3.1 (Edge-labeled Graph). An edge-labeled graph is defined as G = {V, E}; where:
– V is the set of vertices,
– E is the set of directed edges such that E ⊆ (V × Lab × V) where Lab is the set of Labels.
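To make Definition 3.1 concrete, the following minimal Python sketch encodes an edge-labeled graph as a set of (source, label, target) triples. The sample triples are assumed toy data loosely modelled on the TinkerPop "modern" crew graph and are not read off Figure 2; the names `Edge`, `is_edge_labeled_graph` and `out_neighbours` are hypothetical helpers introduced only for illustration.

```python
from typing import NamedTuple, Set


class Edge(NamedTuple):
    """One element of E, i.e. a member of V x Lab x V."""
    source: str  # vertex in V
    label: str   # element of Lab
    target: str  # vertex in V


# Assumed sample data (not taken from the paper's figures).
E: Set[Edge] = {
    Edge("marko", "knows", "vadas"),
    Edge("marko", "knows", "josh"),
    Edge("marko", "created", "lop"),
    Edge("josh", "created", "ripple"),
    Edge("josh", "created", "lop"),
    Edge("peter", "created", "lop"),
}

# In this sketch V and Lab are simply derived from the edges.
V = {e.source for e in E} | {e.target for e in E}
Lab = {e.label for e in E}


def is_edge_labeled_graph(vertices, labels, edges) -> bool:
    """Check the condition E ⊆ (V × Lab × V) from Definition 3.1."""
    return all(s in vertices and l in labels and t in vertices
               for s, l, t in edges)


def out_neighbours(edges, vertex, label):
    """Vertices reachable from `vertex` over one outgoing edge with `label`."""
    return {e.target for e in edges if e.source == vertex and e.label == label}
```

Because E is a plain set of triples, the same structure can be read either as RDF-style statements or as labeled edges, which is the observation the later sections build on.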

Figure 2. RDF graph example. This figure portrays a collaboration network of employees in a software company.

3.1.2. Property Graphs

Property graphs, also referred to as directed, edge-labeled, attributed multi-graphs, have been formally defined in a wide variety of texts, such as [10, 18–21]. We adapt the definition of property graphs presented by [20]:

Definition 3.2 (Property Graph). A property graph is defined as G = {V, E, λ, µ}; where:
– V is the set of vertices,
– E is the set of directed edges such that E ⊆ (V × Lab × V) where Lab is the set of Labels,
– λ is a function that assigns labels to the edges and vertices (i.e. λ : V ∪ E → Σ∗)⁶, and
– µ is a partial function that maps elements and keys to values (i.e. µ : (V ∪ E) × R → S), i.e. properties (key r ∈ R, value s ∈ S).

Figures 2 and 3 present different data model visualizations of the Apache TinkerPop modern crew graph⁷. We use these as running examples throughout this article.

Figure 3. Property Graph example. This figure presents the property graph version of the RDF graph in Figure 2.

⁶ A finite set of strings (Σ∗)
⁷ TinkerPop Modern Crew property graph (http://tinkerpop.apache.org/docs/3.2.3/reference/#intro)

3.2. Graph Pattern Matching

The Graph Pattern Matching (GPM) problem is generally perceived as a subgraph matching problem (also known as the subgraph isomorphism problem) [16]. GPM can be done over both undirected and directed graphs⁸. Traditionally, GPM is a computational task that can be defined as the evaluation of graph patterns over a graph database [10]. The most trivial form of a graph pattern is the basic graph pattern (BGP). A BGP coupled with features such as projection, union, difference and optional forms a complex graph pattern (CGP). We illustrate these concepts briefly in the context of the SPARQL and Gremlin GQLs in Section 4. Detailed information on GPM is available in [10, 16, 18].

GPM is carried out by matching (also referred to as evaluating) a sub-graph pattern over a graph G. Matching has been formally defined in various texts; we summarize a formal definition in our context which closely follows the definition provided by [10, 18].

⁸ In this work we will only focus on directed graphs.

Definition 3.3 (Match of a Graph Pattern $[\![P]\!]_{G}$). A graph pattern P = (Vp, Ep, λp, µp) matches the graph G = (V, E, λ, µ) iff the following conditions are satisfied:
1. there exist mappings µp and λp such that all variables are mapped to constants, and all constants are mapped to themselves (i.e. λp ∈ λ, µp ∈ µ),
2. each edge e′ ∈ Ep in P is mapped to an edge e ∈ E in G, and each vertex v′ ∈ Vp in P is mapped to a vertex v ∈ V in G, and
3. the structure of P is preserved in G (i.e. P is a sub-graph of G).

The definition of matching for edge-labeled graphs is analogous to that for property graphs (ref. Def. 3.3): (i) a mapping m maps the constants to themselves and variables to constants; and (ii) the structure of P is preserved in G (for an illustration, ref. Figure 3).

3.3. SPARQL Query Language

SPARQL is a declarative GQL which is a W3C recommendation and the query language of RDF triple stores. The building blocks of SPARQL are RDF triple patterns, consisting of subject, predicate and object, each of which can be a variable, a literal value or an IRI. In this work, we do not consider the blank node semantics.

Definition 3.4 (SPARQL BGP). A SPARQL query defines a graph pattern to be matched against a given RDF graph. A basic graph pattern (BGP) is a set of triple patterns tp = (s′, p′, o′), where s′ ∈ {s, ?s}, p′ ∈ {p, ?p} and o′ ∈ {o, ?o}.

3.3.1. Evaluation of a graph pattern in SPARQL

SPARQL operates over the homomorphism-based bag semantics defined in [22, 23]. In the context of SPARQL, the evaluation of a graph pattern P against an RDF graph G has been well defined in the literature. We refer to [10, 15, 21, 23, 24] for the formal definitions. In the later sections we illustrate the evaluation of a SPARQL graph pattern with examples.

3.4. The Gremlin Graph Traversal Language and Machine

Gremlin is the query language of the Apache TinkerPop⁹ graph computing framework. Gremlin is system agnostic and enables both the pattern matching (declarative) and the graph traversal (imperative) style of querying over graphs.

3.4.1. The Machine.

Theoretically, a set of traversers in T move (traverse) over a graph G (property graph, cf. Section 3.1.2) according to the instructions in Ψ, and this computation is said to have completed when there are either:
1. no more existing traversers (t), or
2. no more existing instructions (ψ) that are referenced by the traversers (i.e. the program has halted).

The result of the computation is either an empty set (i.e. the former case) or the multiset union of the graph locations (vertices, edges, labels, properties, etc.) referenced by the halted traversers. Rodriguez [1] formally defines the operation of a traverser t as follows:

\[
G \xleftarrow{\ \mu\ } \underset{\{\beta,\ \varsigma\}}{t \in T} \xrightarrow{\ \psi\ } \Psi \qquad [1] \tag{1}
\]

where µ: T → U is a mapping from the traverser to its location in G; ψ: T → Ψ maps a traverser to a step in Ψ; β: T → N maps a traverser to its bulk¹⁰; and ς: T → U maps a traverser to its sack value (i.e. a local variable of the traverser).

⁹ Gremlin: Apache TinkerPop's graph traversal language and machine (https://tinkerpop.apache.org/)
¹⁰ The bulk of a traverser is the number of equivalent traversers a particular traverser represents.

3.4.2. The Traversal.

A Gremlin graph traversal can be represented in any host language that supports function composition and function nesting. The steps of a traversal are arranged in one of two ways:
1. Linear motif - f ◦ g ◦ h, where the traversal is a linear chain of steps; or

2. Nested motif - f ◦ (g ◦ h), where the nested traversal g ◦ h is passed as an argument to step f [1].

A step f ∈ Ψ can be defined as f : A∗ → B∗,¹¹ where f maps a set of traversers of type A (located at objects of A) to a set of traversers of type B (located at objects of B). Given that Gremlin is a language and a virtual machine, it is possible to design another traversal language that compiles to the Gremlin traversal machine (analogous to how Scala compiles to the JVM). As a result, there exist various Gremlin dialects such as Gremlin-Groovy, Gremlin-Python, etc.

3.4.3. Evaluation of a graph pattern in Gremlin

In Gremlin, GPM is performed by traversing¹² over a graph G. A traversal t over G derives paths of arbitrary length. Therefore, a GPM query in Gremlin can be perceived as a path traversal. Rodriguez et al. [11] define a path as:

Definition 3.5 (Path). A path p is a sequence or a string, where p ∈ E∗ and E ⊂ (V × Le × V)¹³. A path allows for repeated edges, and the length of a path is denoted by ||p||, which is equal to the number of edges in p.

Moreover, from [20] we also know that these path queries are comprised of several atomic operations called single-step traversals. We discuss these briefly in Section 4.2. The evaluation of an input graph pattern in Gremlin is handled by two functions:
1. the recursively defined match() function, which evaluates each constituent graph pattern and keeps track of the traverser's location in the graph (i.e. path history), and
2. the bind() function, which maps the declared variables (elements and keys) to their respective values.

The evaluation (also known as matching, ref. Def. 3.3) of a graph pattern in Gremlin is carried out by the match()-step. We borrow the notation for the evaluation of a graph pattern ($[\![Q]\!]_{G}$) from [15] to represent the evaluation of a Gremlin traversal Ψ over a graph G, i.e. $[\![\Psi]\!]^{gml}_{G}$. Details of the execution of the match()-step in Gremlin are described in [1].

¹¹ The Kleene star notation (A∗, B∗) denotes that multiple traversers can be in the same element (A, B).
¹² The act of visiting vertices (v ∈ V) and edges (e ∈ E) in a graph in an alternating manner (in some algorithmic fashion) [20].
¹³ The Kleene star operation ∗ constructs the free monoid $E^{*} = \bigcup_{n=0}^{\infty} E^{n}$, where $E^{0} = \{\epsilon\}$ and $\epsilon$ is the identity/empty element.

4. SPARQL ↔ Gremlin homology

In this section we present the correspondence between SPARQL BGPs/CGPs and Gremlin pattern matching path traversals. In doing so we devise a formal analogy, borrowing the evaluation semantics of a SPARQL query [10, 15, 22] (referring to the well-established bag semantics) and putting them in the context of Gremlin traversals [1, 11, 20]. A detailed discussion of the one-to-one operator-level mapping between SPARQL and Gremlin can be found in the study [12]. Furthermore, we illustrate the applicability of these concepts with respect to the running examples (as shown in Figures 2 and 3).

4.1. Graph Pattern Matching via Traversing

A SPARQL query consists of several BGPs which, when used with features such as projection or union, form a CGP (as we discussed in Section 3.3). BGPs (ref. Definition 3.4) are comprised of triple patterns, which match the RDF triples that constitute the RDF dataset. Moreover, the RDF data model essentially resembles a directed, edge-labeled multi-graph, or RDF graph. It is therefore possible to traverse an RDF graph with Gremlin (i.e. to construct traversals), regardless of whether it is an edge-labeled graph or a property graph, as the core concept of traversing remains unaltered.

Analogous to SPARQL, Gremlin also provides a GPM construct, the match()-step. This enables the user to represent the query using multiple individually connected or disconnected graph patterns. Each of these graph patterns can be perceived as a simple path traversal, to and from a specific source and destination, over the graph.

In Gremlin, each traversal can be perceived as a path query, starting from a particular source (A) and terminating at a destination (B) by visiting vertices v ∈ V and edges e ∈ E(V × V). Each path query is composed of one or more single-step traversals (SST), as shown by [1]. Through function composition and currying, it is possible to define a query of arbitrary length [1]. Furthermore, just as multiple BGPs form a CGP in SPARQL, the corresponding SSTs can be coupled with features such as projection, union, optional, etc. to form a complex path traversal query in Gremlin. These path queries can be a combination of a source, a destination, a labeled traversal, or all of them in a varying fashion, depending on the information need of the user.
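The idea that a path query is a composition of single-step traversals, each mapping a bag of traversers to a bag of traversers (f : A∗ → B∗), can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny property graph is assumed toy data, and `has`, `out`, `values`, `where` and `compose` are hypothetical stand-ins written here, not the TinkerPop API.

```python
from functools import reduce
from typing import Callable, Dict, List, Tuple

# Assumed toy property graph: labels (λ), properties (µ) and labeled edges.
vertex_label: Dict[int, str] = {1: "person", 3: "software", 4: "person", 5: "software"}
vertex_props: Dict[int, Dict[str, str]] = {
    1: {"name": "marko"},
    3: {"name": "lop", "lang": "java"},
    4: {"name": "josh"},
    5: {"name": "ripple", "lang": "java"},
}
edges: List[Tuple[int, str, int]] = [
    (1, "created", 3), (1, "knows", 4), (4, "created", 5), (4, "created", 3),
]

# A step maps a list (bag) of traverser locations to another list (bag).
Step = Callable[[list], list]


def V() -> List[int]:
    """Start one traverser at every vertex."""
    return list(vertex_label)


def has(key: str, value: str) -> Step:
    """Keep traversers whose current vertex has property key == value."""
    return lambda ts: [v for v in ts if vertex_props[v].get(key) == value]


def out(label: str) -> Step:
    """Move each traverser along every outgoing edge with the given label."""
    return lambda ts: [t for v in ts for (s, l, t) in edges if s == v and l == label]


def values(key: str) -> Step:
    """Replace each traverser's vertex by the value of one of its properties."""
    return lambda ts: [vertex_props[v][key] for v in ts]


def where(inner: Step) -> Step:
    """Nested motif: keep a traverser if the inner traversal yields a result."""
    return lambda ts: [v for v in ts if inner([v])]


def compose(*steps: Step) -> Step:
    """Linear motif: apply the steps left to right."""
    return lambda ts: reduce(lambda acc, step: step(acc), steps, ts)


# Linear motif: what has marko created?
created_by_marko = compose(has("name", "marko"), out("created"), values("name"))(V())

# Bag semantics: "lop" is reached over two different edges and appears twice.
all_created = compose(out("created"), values("name"))(V())

# Nested motif: who created at least one piece of java software?
java_authors = compose(where(compose(out("created"), has("lang", "java"))),
                       values("name"))(V())
```

Representing traversers as a list rather than a set is a deliberate choice in this sketch: it keeps duplicates, mirroring the multiset results described in Section 3.4.1.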

4.2. SPARQL BGPs as Gremlin Single Step Traversals

In this section we establish the exact analogy between SPARQL BGPs and Gremlin SSTs. In SPARQL, GPM is conducted by matching BGPs, which consist of triple patterns (TP) that form a sub-graph, against an RDF graph G (i.e. checking whether a sub-graph is contained in G). We can represent BGPs notationally as:¹⁴

\[
BGP = \{TP\}^{+}; \qquad TP = \{s\ p\ o\ .\}^{*} \tag{2}
\]

¹⁴ The ∗ symbolizes that a TP can also be an empty graph pattern, whereas + symbolizes that a BGP can consist of more than one TP (i.e. a set of triple patterns).

In this representation, each subject (s) and object (o) (i.e. nodes) in a triple is connected through only one predicate (p) relation (i.e. edge). Figure 2 presents an example of the graph representation of a sample RDF dataset.

In Gremlin, GPM is conducted by the match()-step, wherein each of the above graph patterns, represented as pattern matching traversals, is evaluated against a graph G. We already know that Gremlin allows a user to form traversals of arbitrary length using function currying and composition. Due to this functionality, and given the nature of the information represented in a triple, it is possible to represent the underlying traversal operation using an SST, which is represented by its predicate/edge.

For instance, consider the BGP in Listing 1, where the information need is to find what marko has created. This pattern, i.e. the subgraph formed by the BGP, will be matched against a graph (ref. Figure 2) to bind the variable "x" to "marko", and "y" to the name property of the node connected to "x" by the edge/predicate labeled "created". Listing 1 represents the SPARQL BGP shown in Figure 4(a).

    { ?x a:name "marko" . ?x a:Created ?y . }

Listing 1: What has marko created?

The corresponding Gremlin traversal for the above SPARQL query, over the graph in Figure 3, is shown in Listing 2. Here the underlying SSTs required are .has('name','marko') and .out('Created'), which map to the HasStep() and VertexStep() instructions in the Gremlin instruction-set library [1, 11], respectively.

    g.V().as('x').has('name','marko').out('Created').as('y')

Listing 2: An outgoing traversal from the vertex "marko" via an edge labeled "created".

Here, g.V(), i.e. $V_g$, is the traverser definition bijective to V, where $\biguplus_{i} \mu((V_g)_i) = V$. Thus, each predicate in a triple pattern of a SPARQL BGP manifests the SST required for matching the graph pattern. We describe the different types of Gremlin SSTs and their correspondence with the SPARQL BGPs, and summarize them in Table 1.

In [1], Rodriguez presents an itemization of the Gremlin SSTs which can be combined to form a complex path traversal (analogous to a CGP in SPARQL). We classify these SSTs into four categories, depending on whether the predicate-object combination (s p o) of the corresponding SPARQL BGP is a literal/value of a vertex/edge label (L), a vertex/edge property (P1), a variable representing a property value (P2), or a traversal to and from a vertex given an edge label (E). These four categories are:
– Case L – Traversal to access the label values of a vertex or an edge (Lv/Le)
– Case P1/P2 – Traversal to access the property values of a vertex or an edge (Pv/Pe)
– Case E – Incoming/outgoing traversal between two adjacent vertices given an edge label (Ein/Eout)

We consider the above-mentioned four cases as our base cases for constructing complex/composite traversals from SSTs with respect to their corresponding SPARQL CGPs. Now, let us recall the SPARQL BGP from Listing 1. Here, the corresponding Gremlin SSTs for the SPARQL BGPs are the cases Pv (as the traversal is to access the property value of a vertex) and Eout (as the traversal is from a vertex named "marko" via the edge labelled "Created"). Table 1 connects the dots by mapping the Gremlin SSTs [notationally ψs] (defined in [1, 11]) to the SPARQL BGPs.

4.3. SPARQL Queries as Gremlin Pattern Matching Traversals

In the SPARQL query language, as we have mentioned in earlier sections, a query comprises one or more CGPs, which in turn are formed by a combination of BGPs.

Similarly, in the Gremlin traversal language, a pattern matching traversal comprises one or more path

| S.S.T. [1] | Basic Graph Pattern (BGP) | Corresponding Gremlin Traversal Step σ(BGP) = ψs | Case |
|---|---|---|---|
| Lv | { ?x v:label "person" .} | [MatchStartStep(x), HasStep([~label.eq(person)]), MatchEndStep] | L |
| Le | { ?x e:label "knows" .} | [MatchStartStep(x), HasStep([~label.eq(knows)]), MatchEndStep] | L |
| Pv | { ?x v:name "marko" .} | [MatchStartStep(x), HasStep([name.eq(marko)]), MatchEndStep] | P1 |
| Pe | { ?x e:weight "0.8" .} | [MatchStartStep(x), HasStep([weight.eq(0.8)]), MatchEndStep] | P1 |
| Pe | { ?x e:weight ?y .} | [MatchStartStep(x), PropertiesStep([weight],value), MatchEndStep(y)] | P2 |
| Pv | { ?x v:name ?y .} | [MatchStartStep(x), PropertiesStep([name],value), MatchEndStep(y)] | P2 |
| E_OUT | { ?x e:knows ?y .} | [MatchStartStep(x), VertexStep(OUT,[knows],vertex), MatchEndStep(y)] | E |
| E_IN | { ?y e:knows ?x .}* | [MatchStartStep(y), VertexStep(IN,[knows],vertex), MatchEndStep(x)] | E |

Table 1. Mapping between the SPARQL BGPs, Gremlin SSTs and their corresponding traversal steps. Each SPARQL BGP can be mapped to a corresponding Gremlin SST as described in Sect. 4.2.
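The case analysis behind Table 1 can be sketched as a small Python function that maps a single triple pattern to a list of step names. This is an illustrative reading of the table, not the Gremlinator implementation: the function `sigma`, the `edge_labels` argument (used here to tell an edge traversal such as e:knows apart from an edge property such as e:weight) and the restriction to variable subjects and outgoing edges are all assumptions of this sketch.

```python
from typing import AbstractSet, List, Tuple

TriplePattern = Tuple[str, str, str]  # (subject, predicate, object) as written in the BGP


def is_var(term: str) -> bool:
    """SPARQL variables start with '?'."""
    return term.startswith("?")


def sigma(tp: TriplePattern, edge_labels: AbstractSet[str] = frozenset()) -> List[str]:
    """Map one triple pattern to a step list in the style of Table 1.

    `edge_labels` is an assumed piece of schema knowledge: predicates whose
    local name is in this set are treated as edge traversals (Case E).
    """
    s, p, o = tp
    if not is_var(s):
        raise ValueError("this sketch only handles a variable in subject position")
    _prefix, _, key = p.partition(":")
    start = f"MatchStartStep({s[1:]})"

    if key in edge_labels:  # Case E (outgoing direction only)
        if not is_var(o):
            raise ValueError("Case E expects a variable in object position")
        return [start, f"VertexStep(OUT,[{key}],vertex)", f"MatchEndStep({o[1:]})"]

    if is_var(o):  # Case P2: variable bound to a property value
        return [start, f"PropertiesStep([{key}],value)", f"MatchEndStep({o[1:]})"]

    value = o.strip('"')
    if key == "label":  # Case L: label of a vertex or an edge
        return [start, f"HasStep([~label.eq({value})])", "MatchEndStep"]

    # Case P1: property compared against a literal
    return [start, f"HasStep([{key}.eq({value})])", "MatchEndStep"]
```

For example, `sigma(("?x", "e:knows", "?y"), edge_labels={"knows"})` yields the E_OUT row of Table 1, while the same predicate without schema knowledge would fall through to Case P2; a real translator would need a more principled way of making that distinction.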

traversals, which in turn are comprised of a combination of several SSTs.

From the already well-established semantics of the SPARQL query language [10, 14, 15, 23, 25], a query (Q) can be notationally represented as:

\[
Q = \{[\mathrm{PROJ.}]\ BGP\ [\mathrm{UNION/DIFF./OPT.}]\ BGP\ [\mathrm{Filter}(c)]\}^{+} \tag{3}
\]

where Proj., Union, Diff. and Filter are the relational operators defined on the BGPs. We have already established from Table 1 that each BGP can be mapped to a corresponding Gremlin single step traversal (σ(BGP) = ψs). Thus, from equations (2, 3), we can create a mapping function σ such that:

\[
\sigma(BGP) = \psi_s \tag{4}
\]

Therefore, building on equations (4, 3), a SPARQL query Q can be mapped as:

\[
\begin{aligned}
\sigma(Q) &= \sigma\big(\{[\mathrm{PROJ.}]\ BGP\ [\mathrm{UNION/DIFF./OPT.}]\ BGP\ [\mathrm{Filter}(c)]\}^{+}\big) \\
&= \{\sigma([\mathrm{PROJ.}])\ \psi_s\ \sigma([\mathrm{UNION/DIFF./OPT.}])\ \psi_s\ \sigma([\mathrm{Filter}(c)])\}^{+} \\
&= \Psi
\end{aligned} \tag{5}
\]

where σ([PROJ.]), σ([UNION/DIFF./OPT.]) and σ([FILTER (C)]) represent the respective Gremlin instruction steps for operators such as Projection, Union, etc. We present a consolidated summary of the correspondence between SPARQL features/keywords and their corresponding Gremlin instruction steps in Table 2. Furthermore, we also present the SPARQL query language constructs and their corresponding Gremlin traversal language constructs in Table 2.

Figure 4. Example GPM evaluation notion of (a) a BGP and (b) a CGP of a SPARQL query over the RDF graph in Figure 2.

Figure 5. Corresponding GPM evaluation notion of (a) a BGP and (b) a CGP in a Gremlin traversal over the property graph in Figure 3.

The evaluation of a SPARQL query Q is carried out by matching or evaluating the graph patterns within Q against a graph G (an RDF graph in this case), denoted as $[\![P]\!]_{G}$. Similarly, in the Gremlin traversal language and machine, the evaluation of a pattern matching traversal Ψ is carried out by the match()-step, by matching or evaluating the SSTs within Ψ against a graph G (a property graph in this case). We borrow

the same notation from [14, 15] to fit our purpose and denote it as $[\![\Psi]\!]^{gml}_{G}$.

| Operation | SPARQL keyword | Gremlin Step | SPARQL construct (Q) | Gremlin construct σ(Q) = Ψ |
|---|---|---|---|---|
| Graph Pattern(s) | { s p o . } | ψ (i.e. σ(s p o .)) | BGP | ψ (single step traversal [list of ψ]) |
| Matching | WHERE { ... } | MatchStep(AND,[]) | WHERE { BGP1 . BGP2 . } | [MatchStep(AND,[[ψ1],[ψ2]])] |
| Restriction | FILTER(C) | WhereTraversalStep(p(C)) | FILTER (?v1 < 30) | WhereTraversalStep([value(v1), IsStep(lt(30))]) |
| Join | JOIN | AndStep() | BGP1 * BGP2 | AndStep([[ψ1], [ψ2]]) |
| Projection | SELECT | SelectStep() | SELECT ?v1 ?v2 | SelectStep([a, b]) |
| Combination | UNION | UnionStep() | BGP1 UNION BGP2 | UnionStep(p(BGP1),p(BGP2)) |
| Deduplication | DISTINCT | DedupStep() | DISTINCT ?v1 | DedupStep([a,b]) |
| Restriction | LIMIT(M) | RangeStep(0,M) | LIMIT 2 | RangeStep(0,2) |
| Restriction | OFFSET(N) | RangeStep(N,M+N) | OFFSET 10 | RangeStep(10,12) |
| Sorting | ORDER BY() | OrderStep() | ORDER BY DESC(?a) | OrderStep([[value(a), desc]]) |
| Grouping | GROUP BY() | GroupStep() | GROUP BY(?a) | GroupStep(value(a)) |

Table 2. A consolidated list of SPARQL features/keywords and their corresponding instruction steps in Gremlin.

For brevity, we construct our arguments through short examples rather than re-defining formal concepts and proofs, which have already been addressed in [10, 23, 26, 27]. Moreover, we illustrate, using examples, the semantic analogy between the evaluation of Gremlin traversal features and the multiset semantics of SPARQL queries defined by [23], who extend the work of [14, 15]. We show, by a structural analogy with the evaluation semantics of SPARQL¹⁵, that:

\[
[\![Q]\!]^{sparql}_{G} \equiv [\![\Psi]\!]^{gml}_{G} \qquad (\because\ \sigma(Q) = \Psi,\ \text{Eq. 5}) \tag{6}
\]

¹⁵ This is because both Gremlin and SPARQL operate over bag semantics, and works such as [22, 23, 27] have already debated and formally established the equivalence between the underlying semantics of relational and graph-specific operators for RDF and property graphs.

Projection. The projection operator projects/selects the values of a specific set of variables (x, y, .., n) from the solution of a matched graph pattern P against the graph G. Furthermore, it is possible to declare variables in Gremlin using .as() steps, which serve as syntactic sugar. For instance, in the CGP shown in Figures 4(b) and 5(b), we project only the variable ?c despite using ?a, ?b and ?c in the query, since we are only interested in the value bound to it. Projection is carried out using the SELECT keyword in SPARQL and the corresponding .select() step in Gremlin. The corresponding evaluation of a select step in Gremlin can be illustrated as:

\[
\begin{aligned}
[\![\text{SELECT ?x1 ?x2 \{BGP\}}]\!]^{sparql}_{G} &= [\![\sigma(\text{SELECT ?x1 ?x2 \{BGP\}})]\!]^{gml}_{G} \\
&= [\![\sigma(BGP)\ \sigma(\text{SELECT ?x1 ?x2})]\!]^{gml}_{G} \\
&= [\![\psi_s\ \text{SelectStep([x1,x2])}]\!]^{gml}_{G} = [\![\Psi]\!]^{gml}_{G}
\end{aligned} \tag{7}
\]

Here, ψs (SST) and SelectStep([x1,x2]) collectively form the final pattern matching traversal (analogous to BGPs combining to form a CGP). Moreover, ψs is mapped from Table 1, depending on the case it corresponds to.

Optional. The optional operator corresponds to a left-join operation (in the relational sense). The optional graph patterns in a query are declared using this operator. For instance, given the CGP BGP1 OPT. BGP2: if the optional BGP2 does not match the graph G, then the results of BGP1 are returned unchanged; otherwise the additional bindings of BGP2 are added to the solution. The operator is present in both SPARQL (as OPTIONAL) and Gremlin (as the .optional() keyword, which corresponds to ChooseStep() in the Gremlin instruction library).

\[
\begin{aligned}
[\![BGP_1\ \text{OPT.}\ BGP_2]\!]^{sparql}_{G} &= [\![\sigma(BGP_1\ \text{OPT.}\ BGP_2)]\!]^{gml}_{G} \\
&= [\![\sigma(BGP_1)\ \text{ChooseStep}(\sigma(BGP_2))]\!]^{gml}_{G} \\
&= [\![\psi_{s1},\ \text{ChooseStep}(\psi_{s2})]\!]^{gml}_{G} = [\![\Psi]\!]^{gml}_{G}
\end{aligned} \tag{8}
\]

Union. The union operator combines the solution sets of the two input graph patterns. In SPARQL, union occurs between two BGPs or CGPs; analogously, in Gremlin it occurs between two SSTs or traversals (i.e. the result sets of two traversers). The solution set returned after the union operation is not de-duplicated by default, because of the governing bag semantics. Thus, all possible solutions are returned. Formally, the evaluation of a union can be illustrated as:

\[
\begin{aligned}
[\![BGP_1\ \text{UNION}\ BGP_2]\!]^{sparql}_{G} &= [\![\sigma(BGP_1\ \text{UNION}\ BGP_2)]\!]^{gml}_{G} \\
&= [\![\text{UnionStep}(\sigma(BGP_1),\ \sigma(BGP_2))]\!]^{gml}_{G} \\
&= [\![\text{UnionStep}([\psi_{s1},\ \psi_{s2}])]\!]^{gml}_{G} = [\![\Psi]\!]^{gml}_{G}
\end{aligned} \tag{9}
\]

For instance, consider the sample SPARQL CGP with UNION over the graph G (ref. Figure 3) as illustrated in the example below. The idea is to find all the software created by "marko" which is written in the "java" language.

.gte(), etc.). The Gremlin traversal language supports all the logical operators defined in the SPARQL query language (as described here¹⁶), which can be found in the online documentation¹⁷. However, the current version of the Gremlin traversal language does not support matching REGEX operators, although specific graph databases that leverage the TinkerPop framework may provide a partial match extension.

Illustration of a CGP with FILTER

SPARQL CGP:
    { ?a v:name ?b .
      ?a v:age ?d .
      FILTER(?d<30) }

Gremlin Traversal (σ(BGP)):
    [MatchStartStep(a), PropertiesStep([name],value), MatchEndStep(b)],
    [MatchStartStep(a), PropertiesStep([age],value)@[d], MatchEndStep(d)],
    WhereTraversalStep([WhereStartStep(d), IsStep(lt(30))])

Illustration of a CGP with Union Like in SPARQL, it is possible to declare multiple con- SPARQL CGP Gremlin Traversal (σ(BGP)) straints inside a single FILTER clause: { ?soft v:lang UnionStep ([[StartStep(soft), FILTER (C1&& C2) → "java" .} UNION PropertiesStep([lang],value), WhereTraversalStep(AndStep[(C1, C2)] { ?person v:name IsStep(eq(java)), EndStep], "marko" .} [StartStep@[person], FILTER (C1 || C2) → PropertiesStep([name],value), WhereTraversalStep(OrStep[(C1, C2)] IsStep(eq(marko)), EndStep]]) For brevity we skip the illustration of this step, as it FILTERs. The filter keyword (or a group of op- being perceptible. erators) is used to restrict the results based on user- Query Modifiers. The solution set returned by the defined criteria. Filters declare one or more constraints evaluated graph patterns is NOT de-duplicated or or- on the variables in the query, depending on the need of dered by default, as both the languages operate on bag the user, and limit the solution of the overall group of semantics. Therefore, query modifiers or solution se- BGPs with respect to specified equality/inequality/reg- quence modifiers are used for presenting the results ular expressions (i.e. constraints). It is present in both in the desired order. We list out query modifiers, their SPARQL (as FILTER C, where C is the declared con- corresponding keywords and language constructs in straint) and Gremlin (as .where(C), where C is the Table2. Examples of query modifiers include DIS- constraint). In Gremlin the .where(C) keyword cor- TINCT (for result de-duplication), LIMIT & OFF- responds to the WhereTraversalStep() from the SET (for restricting no. of results), GROUP BY (for instruction set library grouping manipulation of result stream), ORDER BY (for ordering manipulation of result stream). For brevity we skip the formal definitions of each sparql BGP FILTER C G modifier, rather illustrate their correspondence and ap- J K  gml plicability in Table2. 
= σ(BGP FILTER C) G J K Subgraphs. Like in SPARQL query language, it gml is also possible to load/create and query NAMED = σ(BGP) σ(FILTER C) G J K graphs. This can be achieved using the Gremlin gml gml = ψs, WhereTraversalStep(ψc) G = Ψ G J K J (10)K 16SPARQL operator definitions – (https://www.w3.org/TR/ rdf-sparql-query/#SparqlOps) Here, ψc denotes the corresponding Gremlin logical 17Gremlin logical operators (predicates) – (http://tinkerpop. operator steps (i.e. .eq() for = , .neq() for 6=, apache.org/docs/current/reference/#a-note-on-predicates) H. Thakkar et al. / Gremlinator 11

Subgraph()-step. It allows a user to create cus- tom graphs based on specific graph patterns (vertices, edges and properties) and later query them using the same approach as described earlier in this section.
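The modifier rows of Table 2 can be made concrete with a small sketch. The function below is an illustration only, not part of Gremlinator: the name `modifier_steps` and its parameters are hypothetical, the steps are emitted as plain strings in the notation of Table 2, and two assumptions are made that the text does not state, namely that steps are emitted in SPARQL's solution-sequence order (ORDER BY, then DISTINCT, then OFFSET/LIMIT) and that an OFFSET without a LIMIT uses -1 as an open upper bound.

```python
def modifier_steps(order_by=None, distinct=None, limit=None, offset=0):
    """Return Table 2 style step strings for SPARQL solution modifiers.

    order_by: (variable, direction) tuple, e.g. ("a", "desc")
    distinct: list of variable names to de-duplicate on
    limit:    LIMIT value M, or None when absent
    offset:   OFFSET value N (0 when absent)
    """
    steps = []
    if order_by is not None:
        var, direction = order_by
        steps.append(f"OrderStep([[value({var}), {direction}]])")
    if distinct:
        steps.append("DedupStep([" + ",".join(distinct) + "])")
    if limit is not None or offset:
        # Table 2: LIMIT(M) -> RangeStep(0,M); OFFSET(N) -> RangeStep(N,M+N).
        # Assumption: -1 marks "no upper bound" when only OFFSET is given.
        high = offset + limit if limit is not None else -1
        steps.append(f"RangeStep({offset},{high})")
    return steps


if __name__ == "__main__":
    # LIMIT 2 OFFSET 10, as in the Table 2 example row.
    print(modifier_steps(limit=2, offset=10))
```

With LIMIT 2 and OFFSET 10 this reproduces the RangeStep(10,12) entry of Table 2; with LIMIT 2 alone it yields RangeStep(0,2).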

5. Approach

In this section we briefly discuss our proposed approach, Gremlinator, its execution pipeline and its limitations.

5.1. Encoding SPARQL prefixes

We encode the prefixes of SPARQL queries within the Gremlinator implementation, in order to aid the SPARQL to Gremlin translation process. We define custom prefixes keeping in mind the four categories of SSTs (as stated in sec. 4.2). For instance, the standard rdfs:label prefix (which is generally a predicate) is represented as e:label or v:label (where e = edge and v = vertex). A similar procedure is followed for the other three cases.

5.2. Gremlinator Architecture & Algorithm

We now present the architectural overview of Gremlinator in Fig. 6 and discuss the role of each step of the four-step execution pipeline.

Figure 6. The architectural overview of Gremlinator.

Step 1. The input SPARQL query is first parsed using the Jena ARQ module, thereby: (i) validating the query and (ii) generating its abstract syntax tree (AST) representation.
Step 2. From the obtained AST of the parsed SPARQL query, Gremlinator then visits each BGP, mapping it to the corresponding Gremlin SST (ψs, ref. Table 1).
Step 3. Thereafter, depending on the operator precedence obtained from the AST of the parsed SPARQL query, each of the SPARQL keywords is mapped to its corresponding instruction step from the Gremlin instruction library (ref. Table 2). Thus, a final conjunctive Traversal (Ψ) is generated by appending the SSTs and instruction steps. This can be perceived as analogous to the SPARQL query language, wherein a set of BGPs forms a single complex graph pattern (CGP).
Step 4. This final conjunctive traversal (Ψ) is used to generate bytecode^18, which can be used on multiple language and platform variants of the Apache TinkerPop Gremlin family.

^18 Bytecode is simply a serialized representation of a traversal, i.e. a list of ordered instructions where an instruction is a string operator and a (flattened) array of arguments.

Algorithm. The SPARQL to Gremlin translation algorithm is presented in Algorithm 1.

Algorithm 1: SPARQL2Gremlin
    input : SQ : SPARQL Query
    output: GT : Gremlin Traversal

    GT ← ∅; T ← ∅                      // T: list of single step traversals
    AST ← getAST(SQ); BGPs ← getAllBGPs(AST)
    foreach bgp_i ∈ BGPs do
        T ← T ∪ ψs                     // mapping BGP to Gremlin S.S.T. (ψs = σ(bgp_i)) ∵ Table 1
    end
    // mapping the corresponding Gremlin operators in the A.S.T. cf. Table 2
    if c ← AST.FILTER, ∃c ≠ ∅ then
        foreach c ∈ AST do
            T ← T ∪ WhereTraversalStep(ψc)
        end
    end
    if c ← AST.UNION then
        GT ← UnionStep(Match(T))
    end
    if |T| > 1 then
        GT ← Match(T)
    else
        GT ← GT ∪ T
    end
    if c ← AST.ORDERBY then
        GT ← T ∪ OrderStep(ψc)
    end
    if c ← AST.GROUPBY then
        GT ← T ∪ GroupByStep(ψc)
    end
    if c ← AST.LIMIT then
        if k ← AST.OFFSET then
            GT ← T ∪ RangeStep(k, c + k)
        else
            GT ← T ∪ RangeStep(c)
        end
    end
    return GT

5.3. Limitations

The current version of Gremlinator supports SPARQL 1.0 SELECT queries with the following exceptions: 1.) REGEX (regular expressions) in FILTER (restrictions) of a graph pattern are currently not supported^19. 2.) Gremlinator does not support variables for the property predicate, i.e. the predicate {p} in a graph pattern {s p o .} has to be defined or known for the traversal to be generated. This is because traversing a graph is not possible without knowing the precise traversal operation to the destination (vertex or edge) from the source (vertex or edge).

^19 This is because the REGEX feature is not supported in TinkerPop Gremlin as of now. Thus, it is a limitation of Gremlin and not of our approach.

6. Empirical Evaluation

We now describe the empirical evaluation settings of our experiments. These include the dataset and query descriptions, a carefully curated experimental setup (keeping in mind the various settings native to both RDF and Graph DMSs), a brief note on the correctness of our approach, and the reported results with their discussion. We conclude the section with a brief note on the curated public demonstration of Gremlinator, which lets users get first-hand experience of the proposed system.

6.1. Datasets

Northwind is a synthetic dataset with an e-commerce scenario between a fictional company "Northwind Traders", its Customers, and Suppliers. Originating as a sample dataset shipped with Microsoft Access^20, it rose to fame with an enormous demand for e-commerce use cases in benchmarking DMSs. In Figure 7(a) we present the dataset schema. We obtained the graph version of the dataset from the SQL2Gremlin website^21.

^20 Northwind Database (https://northwinddatabase.codeplex.com/)
^21 SQL2Gremlin website (http://www.sql2gremlin.com)

Berlin SPARQL Benchmark [28] (BSBM) is a synthetic dataset built around an e-commerce use case between a set of products, their vendors, and consumers who review the products. It is widely used for benchmarking RDF DMSs as it offers the flexibility of generating graphs of custom size and density. We generated a standard 1M triples dataset using the BSBM data generation script, which makes it available in various formats (e.g. .nt, .csv, .sql, .ttl, etc). Figure 7(b) describes the schema of the BSBM dataset.

Table 3 summarizes the statistics of both the Northwind and BSBM-1M datasets.

Table 3. Dataset statistics
Criterion | Northwind RDF | Northwind PG | BSBM RDF | BSBM PG
Classes | 11 | - | 159 | -
Entities & Nodes | 4413 | 3209 | 71015 | 92757
Distinct subjects | 4413 | - | 71017 | -
Distinct objects | 8187 | - | 166384 | -
Properties | 55 | 55 | 40 | 40
Number of Triples & Edges | 33003 | 6177 | 1000313 | 238309

Figure 7. The dataset schema of (a) Northwind and (b) BSBM.

6.2. Pre-defined Queries

We created a pre-defined set of 30 SPARQL queries for each dataset, which cover 10 different query features (i.e. three queries per feature with a combination of various modifiers). These features were selected after a systematic study of SPARQL query semantics [10, 15, 22] and from the BSBM [28] explore use cases^22 and Watdiv Query templates^23. A gold standard set of corresponding Gremlin traversals of the SPARQL queries was created by three Gremlin expert users, for a twofold validation of the traversals generated by our approach. We elaborate on the approach evaluation and correctness in the following subsection. Table 4 summarizes the query design and the feature distribution within the queries used for our experiment.

^22 BSBM Explore Use Cases (https://goo.gl/y1ObNN)
^23 Watdiv Query Features (http://dsg.uwaterloo.ca/watdiv/basic-testing.shtml)

Table 4. Query feature design and description
Query | Feature | Marks over FILTER, COUNT, LIMIT, DISTINCT | # Patterns | # Proj. Vars.
C1 | CGP | XX | 2 | 2
C2 | CGP | X | 1 | 1
C3 | CGP | X | 1 | 1
F1 | CONDITION | X(1) | 3 | 3
F2 | CONDITION | X(2) | 3 | 3
F3 | CONDITION | X(1) X | 2 | 1
L1 | RESTRICTION | X(1) XX | 4 | 2
L2 | RESTRICTION | X | 2 | 2
L3 | RESTRICTION | X | 2 | 2
G1 | GROUP BY | XX | 2 | 2
G2 | GROUP BY | X(1) | 6 | 2
G3 | GROUP BY | X | 1 | 2
Gc1 | GROUP COUNT | XX | 3 | 2
Gc2 | GROUP COUNT | X | 2 | 2
Gc3 | GROUP COUNT | XX | 1 | 2
O1 | ORDER BY | X | 1 | 1
O2 | ORDER BY | X(1) | 4 | 3
O3 | ORDER BY | XX | 1 | 1
U1 | UNION | X(2) X | 8 | 1
U2 | UNION | X(2) | 6 | 2
U3 | UNION | X(2) X | 4 | 1
Op1 | OPTIONAL | X(1) | 3 | 3
Op2 | OPTIONAL | XX | 6 | 2
Op3 | OPTIONAL | X(2) | 8 | 3
M1 | MIXED | XX | 3 | 2
M2 | MIXED | XXX | 2 | 2
M3 | MIXED | XX | 4 | 2
S1 | STAR | X(1) X | 12 | 11
S2 | STAR | X(1) X | 5 | 4
S3 | STAR | X(1) | 10 | 9
TOTAL | 30 Q. | - | - | -

6.3. Experimental Correctness

In order to validate the correctness of our approach empirically, we: (a) loaded the RDF datasets in the three top-of-the-line RDF DMSs and the corresponding Property graph datasets in the three top-of-the-line Graph DMSs (cf. Section 6.4). Thereafter, we executed the SPARQL queries against the RDF DMSs and the corresponding Gremlinator-translated Gremlin traversals against the Graph DMSs. We then compared the results returned by each of these queries for correctness; (b) compared the results returned by the Gremlinator-translated traversals with those returned by the hand-crafted Gremlin traversals (gold standard queries curated by three Gremlin experts) for the corresponding SPARQL queries, on all three Graph DMSs.

Having conducted the above validation, we observed that the results returned by: (a) the RDF and Graph DMSs were equal. However, the representations of the returned results were distinct. The results returned by the SPARQL queries were in a tabular format, whereas those returned by the Graph DMSs were in a list-of-sets format. We report this using a subset of results for the BSBM dataset over both RDF and Graph DMSs in Table 5 of Appendix A for reference. Here, we can clearly observe that the results in both cases are equal though they have two different representations. A complete set of all the results for both datasets can be found by visiting the online resource described in the caption of Table 5. (b) The Gremlinator-translated traversals and the hand-crafted Gremlin traversals were also equal. This indicates that the proposed SPARQL → Gremlin translation approach is correct, as it preserves the meaning of the original query (i.e. the information need of the input SPARQL query is not manipulated in the translation process).

6.4. Experimental Setup

We selected the following DMSs for the experiments: RDF DMS: Openlink Virtuoso [29] [v7.2.4], JenaTDB^24 [v3.2.0], 4Store [30] [v1.1.5]; Graph DMS: TinkerGraph [31] [v3.2.3], Neo4J^25 [v1.9.6], Sparksee^26 [v5.1]. All the experiments were performed on a machine with the following configuration: CPU: Intel® Xeon® CPU E5-2660 v3 (20 cores @2.60GHz), RAM: 128 GB DDR3, HDD: 512 GB SSD, OS: Linux 4.2-generic.

^24 TDB (https://jena.apache.org/documentation/tdb/index.html)
^25 Neo4J (https://neo4j.com/)
^26 Sparksee, formerly DEX (http://sparsity-technologies.com/#sparksee)
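Returning briefly to Algorithm 1 (Section 5.2), its core control flow can be sketched in a few lines. This is a loose illustration under stated assumptions, not the Gremlinator implementation: the function name `translate`, the dictionary-shaped `ast` and the placeholder `sst(...)` (standing in for the mapping σ of Table 1) are all hypothetical, solution modifiers are left out, and the overlapping UNION and |T| > 1 branches of the algorithm are resolved by giving UNION priority, which is an interpretation rather than something the algorithm states.

```python
def translate(ast):
    """Assemble a textual traversal from a toy AST (hypothetical shape).

    ast is a dict with the optional keys:
      "bgps"    - list of basic graph patterns (strings)
      "filters" - list of FILTER constraints (strings)
      "union"   - True when the patterns are combined with UNION
    """
    # Map every BGP to a single step traversal; sst(...) is a placeholder
    # for the Table 1 mapping, which is not reproduced here.
    steps = [f"sst({bgp})" for bgp in ast.get("bgps", [])]
    # Each FILTER constraint contributes one WhereTraversalStep.
    steps += [f"WhereTraversalStep({c})" for c in ast.get("filters", [])]

    if ast.get("union"):
        # Assumption: UNION takes priority over the plain conjunctive match.
        return f"UnionStep([{', '.join(steps)}])"
    if len(steps) > 1:
        return f"MatchStep(AND,[{', '.join(steps)}])"
    return steps[0] if steps else ""


if __name__ == "__main__":
    print(translate({"bgps": ["?a v:name ?b", "?a v:age ?d"],
                     "filters": ["?d < 30"]}))
```

Keeping the steps as strings sidesteps any dependency on a Gremlin driver; a real translator would build traversal objects or bytecode instead.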

Evaluation Metrics. The following conditions and parameters were considered for reporting all results.

– Query execution time (in milliseconds or ms) is the average of 10 runs for each query (of both SPARQL and translated Gremlin traversals).
– Queries are executed in both cold and warm cache settings for the respective DMSs. A warm cache implies that the cache is not cleared after each query run; a cold cache implies that the cache is cleared using the 'echo 3 > /proc/sys/vm/drop_caches' UNIX command after each query execution.
– For Graph DMSs, query execution time is recorded both with and without creating explicit indices. We elaborate on the reason for this next.

Indexing in RDF Triple Stores vs Graph DMS. RDF triple stores typically index data employing pre-defined indices. It is theoretically possible to have an RDF DMS that is totally index-free, but this would imply performing a linear search through the entire dataset (set of triples) for each query that is executed. For this reason, having some pre-defined index setting within an RDF DMS by default is essential. The same, however, cannot be said for Graph DMSs, wherein these indices have to be created manually, depending upon the use case. For instance, Openlink Virtuoso maintains two all-purpose full indices (bitmap indices over PSOG, POGS) and three partial indices (over SP, OP, GS) in the default configuration^27. Furthermore, 4Store in its default setting maintains a set of three full indices (R, P, M) [30], where the R-index is a hash-map index over RDF resources (URIs, Literals, and Blank Nodes); the P-index consists of a set of two radix trees per predicate, using a 4-bit radix; and the M-index is a hash-map based indexing scheme over RDF Graphs (G). Lastly, Apache Jena TDB maintains three indices using a custom persistent implementation of B+ Trees^28.

^27 RDF indexing scheme in Virtuoso (http://docs.openlinksw.com/virtuoso/rdfperfrdfscheme/)
^28 RDF indexing scheme in Apache Jena TDB (https://jena.apache.org/documentation/tdb/architecture.html#triple-and-quad-indexes)

On the other hand, Graph DMSs rarely maintain any default indexing scheme. They rather offer the possibility of creating explicit indexes over custom graph elements, using a variety of data structures, depending on the implementation. For instance, TinkerGraph supports the creation of regular and composite hash-map indices (multiple key-value pairs) on graph elements (node and edge attributes). Neo4J allows declaring regular indices (composite indices are supported from v3.5 onwards) on graph elements (including labels). It offers a variety of indices, ranging from a Lucene index (for textual attributes) to an SBTREE-based index (for numeric ones, such as IDs), which is based on a custom implementation of B-Trees with several optimizations related to data insertion and range queries^29. Lastly, like other Graph DMSs, Sparksee also offers user-defined indices on attributes. It uses a bitmap index implemented using sorted B-trees [32].

^29 Indexing in Neo4J (http://neo4j.com/docs/developer-manual/current/cypher/schema/index/)

As we pointed out earlier, it is impractical to have a completely index-free RDF DMS. Thus, in order to gain a better understanding of query execution performance with respect to various factors (such as indexing schemes, query typology and cache configuration) and also for the sake of fairness (towards Graph DMSs), we run all the experiments with two settings of Graph DMSs, i.e. with (manually created) and without indices.

6.5. Results

The detailed results, including the queries, dataset statistics, plots and full configuration settings, are available online^30. The complete source code of Gremlinator is made publicly available, along with a recorded demonstration of Gremlinator in action^31. The complete setup including all the datasets, scripts, and DMSs is also available^32. The average time for translating a SPARQL query to a Gremlin traversal is 14 ms for BSBM and 12.5 ms for Northwind queries.

^30 Detailed results can be found at (https://goo.gl/CSSVzZ)
^31 Gremlinator source code (https://github.com/LITMUS-Benchmark-Suite/sparql-to-gremlin)
^32 Experimental setup (https://github.com/harsh9t/SWJ-2018-Experiments)

Figures 8 and 9 present the plots of our experimental results, in all four settings, for the BSBM and Northwind datasets respectively. The plots follow a log scale for execution time (in ms). Furthermore, we also report the detailed query-wise results in tabular format in Appendix B, for a more comprehensive understanding. We observe a similar trend in the performance of SPARQL vs. Gremlin queries over both datasets, which is evident from Figures 8 and 9 and also Tables 6 and 7 of Appendix B. Therefore, we present the detailed performance analysis of SPARQL vs. Gremlin only for the BSBM dataset. We organize our observations on the performance of the participating DMSs as follows, and present our discussion.

Figure 8. Performance comparison of SPARQL queries vs Gremlin (pattern matching) traversals for BSBM dataset with respect to RDF and Graph DMSs in different configuration settings.

Figure 9. Performance comparison of SPARQL queries vs Gremlin (pattern matching) traversals for Northwind dataset with respect to RDF and Graph DMSs in different configuration settings.

Graph DMSs without index: We categorize our findings in two groups, cold cache and warm cache. We observe the following:

1. Cold cache: SPARQL queries report a comparative advantage with respect to Gremlin traversals, leveraging the indexing schemes of RDF DMSs. SPARQL performs moderately faster (1x-2x) for simple queries (C1, C2) and order by (O1, O3), and substantially faster (3x-5x) for union and mixed queries (U1-3, M1-3). Gremlin traversals, in contrast, benefit only from the graph locality inherent to Graph DMSs. Gremlin traversals perform moderately faster (1x-2x) for restriction (L1, L3), group by (G1-3) and conditional (F1-3) queries, and substantially faster (3x-5x) for group count (Gc1-3) and star (S1-3) queries. We also note that aggregation queries (counts, group counts) in Graph DMSs are an order of magnitude faster as compared to RDF DMSs, since they do not have to execute multiple inner joins in addition to the aggregation operations. Moreover, for star-shaped queries (queries with bushy plans having >=5 triple patterns, >=1 filter and >=4 projection variables) Gremlin pattern matching traversals outperform their SPARQL counterparts by at least one order of magnitude for S1 and S2, and by at least two orders of magnitude for S3 (with 10 triple patterns, 1 filter and 9 projection variables).
2. Warm cache: SPARQL queries reap the most benefit of warm caching from RDF DMSs as compared to the Gremlin traversals from Graph DMSs. We observe that on average, in this setting, the improvement is up to 1x-1.8x for star and mixed queries, 2x-3x for aggregation (counts), condition (filter) and re-ordering (order by, group by) queries, and 3x-5x for CGPs and union queries. We also note that SPARQL queries are almost an order of magnitude faster than the corresponding Gremlin traversals for queries having a union operator, and are comparable for mixed, CGP, and order by queries. Furthermore, we also note that SPARQL star-shaped queries do not register substantial improvement in warm cache execution. On the other hand, Gremlin traversals receive little benefit from Graph DMSs in warm cache. We report that on average, in this setting, the improvement is up to 1.3x for aggregation (count, group count) and star-shaped queries; up to 1.5x for re-ordering (order-by, group-by) and condition (filter) queries; and up to 2x for mixed, union and restriction (limit) queries.

Graph DMSs with indexing: We manually created composite indices for each Graph DMS on node attributes (numeric)^33 such as "name", "customerId", "unitPrice", "unitsInStock", "unitsOnOrder" for the Northwind dataset and, similarly, on "type", "productID", "reviewerID", "productTypeID" for the BSBM dataset. The indices use the hash-map data structure. We did not re-execute SPARQL queries on RDF DMSs, as there was no change in their indexing setting.

^33 We have provided all the groovy scripts used for creating composite indices in the Github repository pointed to earlier.

1. Cold cache: Gremlin traversals perform significantly faster when executed on Graph DMSs with composite indices. We observe that, as compared to the previous (cold cache + without index) setting, the improvement reported on average is up to 1x-2x for union, mixed and group by traversals; up to 2x-3x for re-ordering (group-by, order-by) traversals; up to 3x-5x for regular and restriction traversals; and >5x for aggregation and star-shaped traversals.
2. Warm cache: In this setting the Graph DMSs (i.e. Gremlin traversals) register performance gains similar to those in the non-indexed configuration.

6.6. Discussion

We now discuss the findings of our experiments with respect to the factors which influence the query execution performance of a particular DMS and summarize our observations. We categorize our findings based on the following factors:

– Query typology: We report that: (i) for simple/linear queries (such as C1-3, F1-3, L1-3) SPARQL and Gremlin traversal performances are comparable; (ii) SPARQL outperforms the corresponding Gremlin traversals for union queries.

This is so because in SPARQL a union occurs between two or more sets of triple patterns, whereas in the declarative construct (pattern matching) of Gremlin, a union occurs between two .match()-steps (i.e. Gremlin treats each .match()-step as a distinct traversal and then executes a union on top of them); (iii) for complex queries (such as star-shaped and aggregation based queries), Gremlin traversals outperform their SPARQL counterparts. As mentioned before (ref. Section 6.5, cold cache), this is because Graph DMSs do not have to perform expensive joins (like RDF DMSs) on top of executing aggregation operations; (iv) lastly, we also observe that for queries with a greater number of projection variables (Proj. vars >= 3) and query modifiers (count, distinct, limit + offset, filter), Gremlin traversals show a distinctive advantage (more than an order of magnitude) in terms of performance with respect to the corresponding SPARQL queries (e.g. for F1, F2, O2, S1, S2, S3). This advantage, while it still exists, is not as pronounced for queries with fewer projection variables and query modifiers.
– Query caching (cold vs warm): Although both kinds of DMS benefit from warm cache query execution (as compared to cold cache), SPARQL queries gain the most as compared to the corresponding Gremlin traversals. One reason for this is that Gremlin traversals already perform considerably better (except in the case of union queries) by leveraging the underlying property graph data model (locality) and cannot be optimized further without explicitly creating regular or composite indices. Out of the three RDF DMSs, Jena shows the largest gain in warm execution time, receiving up to a 5x boost in cases such as union and CGP queries.
– Indexing scheme: The one-sided dominance of Openlink Virtuoso among all the evaluated RDF DMSs does not go unnoticed. As mentioned earlier, Virtuoso maintains a variety of full and partial indices. Moreover, we also know that Virtuoso employs custom partition clustering and caching schemes on top of these indices to provide an adaptable solution to all kinds of workloads. One distinctive advantage of Virtuoso is that the indices are column-wise by default^34, which takes one-third of the space required by row-wise indices. Similar claims cannot be made about other RDF DMSs such as 4Store and JenaTDB. Graph DMSs have a limited number of options in terms of the underlying index data structures for creating manual indexes in the chosen versions. One plausible reason is that there has not been an explicit need for complex index schemes (as in Virtuoso), since composite indices based on B+ trees and hash-maps provide a sufficient performance boost for graph traversal operations.

^34 Indexing scheme in Openlink Virtuoso (http://docs.openlinksw.com/virtuoso/rdfperfrdfscheme/)

Thus, based on our findings, we summarize that for complex queries (such as aggregation, star-shaped, and queries with a higher number of projection variables + query modifiers) the corresponding Gremlin pattern matching traversals outperform SPARQL queries, whereas for union-based queries SPARQL registers a significant performance advantage.

6.7. Hands-On Gremlinator

For the demonstration of our approach, Gremlinator, we provide the entire setup including both the datasets and the entire set of pre-defined SPARQL queries for interested users to get first-hand experience [33]. Furthermore, we encourage the end user to write and execute custom SPARQL queries for both datasets, for further exploration. As a part of the demonstration of our system [33], we provide: (i) an online screencast^35 serving as an introductory video tutorial on how to use the demonstration; (ii) a web application^36; (iii) a desktop application of Gremlinator (standalone .jar bundle), which requires Java 1.8 JRE installed on the host machine and is downloadable from the web demo website.

^35 Gremlinator Demo Tutorial (https://youtu.be/Z0ETx2IBamw)
^36 Gremlinator Web Demo (http://gremlinator.iai.uni-bonn.de:8080/Demo)

7. Conclusion and Future Work

In this paper, we presented Gremlinator, a novel approach for supporting the execution of SPARQL queries on property graphs using Gremlin pattern matching traversals. Furthermore, we presented an empirical evaluation of our approach using state-of-the-art RDF and Graph DMSs, demonstrating the validity and applicability of our approach. The evaluation demonstrates the substantial performance gain obtained by translating SPARQL queries to Gremlin traversals, especially for star-shaped and complex queries. Gremlinator has obtained clearance from the Apache TinkerPop development team and is currently in the production phase, to be released as a plugin during TinkerPop's next framework cycle. Gremlinator has also been integrated into the SANSA Stack [34] (v0.3) framework as an experimental plugin. Furthermore, Gremlinator is freely available under the Apache 2.0 license for public use from the Maven Central repository. As future work, we aim to: (i) extend our current work by enabling support for the SPARQL 1.1 feature set, such as Property Paths, regex in restrictions (i.e. FILTERs) and variables for property predicates; (ii) integrate Gremlinator within frameworks such as LITMUS [35–37], to enable automatic execution of SPARQL queries over property graphs for robust benchmarking of diverse RDF and Graph DMSs.

Acknowledgements

This work is supported by the funding received from EU-H2020 WDAqua ITN (GA. 642795). We would like to thank Dr. Marko Rodriguez, Mr. Stephen Mallette, and Mr. Daniel Kuppitz, of the Apache TinkerPop project, for their support and quality insights for integrating Gremlinator.

Appendix A. SPARQL - Gremlin Results

In this section we demonstrate the correctness of Gremlinator empirically, as already discussed in Section 6.3. We present a subset of the results in Table 5, which validate our claim that the proposed SPARQL → Gremlin translation is correct.
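Because the SPARQL results are tabular while the Gremlin results are lists of maps, comparing them requires a common representation. The sketch below is one hedged way to do this, not the procedure used in the paper: it converts both forms to multisets of variable bindings (matching the bag semantics of both languages) and compares them. The function names are hypothetical, and values are assumed to be already normalized (for example, IRIs such as bsbm:inst/Product1636 reduced to the identifier 1636).

```python
from collections import Counter


def sparql_rows_to_bag(header, rows):
    """Turn a tabular result (header + rows) into a multiset of bindings."""
    names = [h.lstrip("?") for h in header]
    return Counter(frozenset(zip(names, row)) for row in rows)


def gremlin_maps_to_bag(maps):
    """Turn a list of {variable: value} maps into a multiset of bindings."""
    return Counter(frozenset(m.items()) for m in maps)


def same_results(header, rows, maps):
    """True when both result sets hold the same bindings with the same multiplicity."""
    return sparql_rows_to_bag(header, rows) == gremlin_maps_to_bag(maps)


if __name__ == "__main__":
    # Query L2 of Table 5: ?rating1 -> 9, 7 versus {rating1=9} {rating1=7}.
    print(same_results(["?rating1"], [("9",), ("7",)],
                       [{"rating1": "9"}, {"rating1": "7"}]))
```

Using a multiset rather than a set keeps duplicate solutions significant, so a traversal that returns a binding twice is not considered equal to a query that returns it once.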

Appendix B. SPARQL - Gremlin Performance Comparison

In this section, we present the detailed query-wise results, in tabular format, underlying the plots reported previously in Figures 8 and 9.
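Each reported execution time is an average over 10 runs (Section 6.4). A minimal sketch of such a measurement loop is given below; it is an assumption-laden illustration rather than the benchmarking scripts used for the paper. The names `average_runtime_ms`, `run_query` and `before_each` are hypothetical: `run_query` stands for any callable that executes one query, and `before_each` is an optional hook that could, for instance, clear caches to emulate the cold cache setting.

```python
import time
from statistics import mean


def average_runtime_ms(run_query, runs=10, before_each=None):
    """Average wall-clock time of run_query() over `runs` executions, in ms."""
    timings = []
    for _ in range(runs):
        if before_each is not None:
            before_each()  # e.g. drop caches for a cold cache measurement
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in workload; a real harness would submit a query to a DMS here.
    print(average_runtime_ms(lambda: sum(range(10000))))
```

Leaving `before_each` unset corresponds to the warm cache setting, since nothing is reset between runs.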

Table 5
Comparison of results of a subset of SPARQL queries and their corresponding Gremlin traversals for BSBM dataset. The complete list of all the queries and their corresponding results can be accessed from the spreadsheet available at (https://goo.gl/CSSVzZ).

Q.#: C1
Feature: BGP
SPARQL Query: SELECT (COUNT (DISTINCT (?product)) as ?total) WHERE { ?a v:type "review" . ?a e:edge ?product . }
SPARQL Query Result: 2787
Gremlin Traversal Result: 2787

Q.#: F3
Feature: FILTER
SPARQL Query: SELECT DISTINCT ?pid WHERE { ?a v:productID ?pid . ?a v:ProductPropertyNumeric_1 ?property1 . FILTER ( ?property1 = 1 ) }
SPARQL Query Result: ?pid: bsbm:inst/Product1636, bsbm:inst/Product2295
Gremlin Traversal Result: { pid=1636 } { pid=2295 }

Q.#: L2
Feature: LIMIT
SPARQL Query: SELECT ?rating1 WHERE { ?a v:type "review" . ?a v:Rating_1 ?rating. ?a e:edge ?product. ?product v:productID ?pid . FILTER ( ?pid = 343 ) .} LIMIT 2
SPARQL Query Result: ?rating1: 9, 7
Gremlin Traversal Result: { rating1=9 } { rating1=7 }

Q.#: G2
Feature: GROUP BY
SPARQL Query: SELECT ?product WHERE { ?a v:type "reviewer" . ?a v:reviewerID ?rid. ?a e:edge ?review . ?review v:Rating_1 ?rating1. ?review e:edge ?product. ?product v:productID ?pid. FILTER ( ?rid = 86). } GROUP BY (?rating1)
SPARQL Query Result: ?product: bsbm:inst/Product1107, bsbm:inst/Product1301, bsbm:inst//Product1852, bsbm:inst/Product2291, bsbm:inst/Product1098, bsbm:inst/Product1954, bsbm:inst/Product1994, bsbm:inst/Product1355, bsbm:inst/Product734, bsbm:inst/Product1448, bsbm:inst/Product1426, bsbm:inst/Product1817, bsbm:inst/Product1141, bsbm:inst/Product1194, bsbm:inst/Product451, bsbm:inst/Product1294, bsbm:inst/Product1532
Gremlin Traversal Result: { product=1107 } { product=1301 } { product=1852 } { product=2291 } { product=1098 } { product=1954 } { product=1994 } { product=1355 } { product=734 } { product=1448 } { product=1426 } { product=1817 } { product=1141 } { product=1194 } { product=451 } { product=1294 } { product=1532 }

Q.#: Gc2
Feature: GROUP COUNT
SPARQL Query: SELECT ?product (COUNT (?review) as ?total) WHERE { ?review v:type "review" . ?review e:edge ?product . ?product v:productID ?pid. } GROUP BY (?product) LIMIT 10
SPARQL Query Result: ?product ?total: bsbm:inst/Product2588 1, bsbm:inst/Product3 1, bsbm:inst/Product2331 2, bsbm:inst/Product2553 3, bsbm:inst/Product1803 5, bsbm:inst/Product2440 7, bsbm:inst/Product2201 5, bsbm:inst/Product316 3, bsbm:inst/Product2210 7
Gremlin Traversal Result: { Product=2588, Total=1 } { Product=3, Total=1 } { Product=2331, Total=2 } { Product=2553, Total=3 } { Product=1803, Total=5 } { Product=2440, Total=7 } { Product=2201, Total=5 } { Product=316, Total=3 } { Product=2210, Total=7 }

Q.#: O2
Feature: ORDER BY
SPARQL Query: SELECT DISTINCT ?product ?label WHERE { ?a v:productTypeID ?tid. FILTER(?tid = 58). ?a e:edge ?product. ?product v:productID ?pid. ?product v:label_n ?label. } ORDER BY (?product) LIMIT 5
SPARQL Query Result: product label: bsbm:inst/Product11 "pipers pests", bsbm:inst/Product18 "boondogglers", bsbm:inst/Product489 "airsickness simplices skiing", bsbm:inst/Product694 "nahuatls terrifiers direr", bsbm:inst/Product709 "jacinth medusoids"
Gremlin Traversal Result: { pid=11, lab=pipers pests } { pid=18, lab=boondogglers } { pid=489, lab=airsickness simplices skiing } { pid=694, lab=nahuatls terrifiers direr } { pid=709, lab=jacinth medusoids }

Q.#: U1
Feature: UNION
SPARQL Query: SELECT ?label WHERE { { ?a v:productTypeID ?tid. FILTER(?tid = 58). ?a e:edge ?product. ?product v:productID ?pid. ?product v:label_n ?label. } UNION { ?a v:productTypeID ?tid. FILTER(?tid = 102). ?a e:edge ?product. ?product v:productID ?pid. ?product v:label_n ?label. }} LIMIT 10
SPARQL Query Result: ?label: "airsickness simplices skiing", "nahuatls terrifiers direr", "jacinth medusoids", "slowed cloche", "meshwork", "nonradical warehousing", "furnacing", "accommodator", "collectivized mathematics", "brachiate writeoff"
Gremlin Traversal Result: { label=airsickness simplices skiing } { label=nahuatls terrifiers direr } { label=jacinth medusoids } { label=slowed cloche } { label=meshwork } { label=nonradical } { label=warehousing } { label=furnacing } { label=accommodator } { label=collectivized mathematics } { label=brachiate writeoff }

Q.#: Op1
Feature: OPT.
SPARQL Query: SELECT ?pTex2 ?pText3 ?pNum2 WHERE { ?product v:productID ?pid . FILTER ( ?pid = 343 ) . ?product rdfs:label ?label. ?product v:ProductPropertyTextual2 ?propertyTextual_2 . ?product v:ProductPropertyTextual3 ?propertyTextual_3 . OPTIONAL { ?product v:productID ?pid . FILTER ( ?pid = 350 ) . ?product rdfs:label ?label. ?product v:ProductPropertyNumeric_2 ?propertyNumeric2 . ?product v:ProductPropertyTextual3 ?propertyTextual_3 .}}
SPARQL Query Result: pText2 pText3 pNum2: "cyanided uncharged gametes" "fluorosis appeasing railheads criticizers satirizer controllers" 758
Gremlin Traversal Result: { pText_2=cyanided uncharged gametes, pText_3=fluorosis appeasing railheads criticizers satirizer controllers, pNum2_2=758 }

Q.#: M1
Feature: MIX
SPARQL Query: SELECT ?reviewer (COUNT (?product) as ?total) WHERE { ?reviewer v:type "reviewer". ?reviewer e:edge ?review. ?review e:edge ?product . } GROUP BY (?reviewer) ORDER BY DESC (?total) LIMIT 10
SPARQL Query Result: bsbm:inst/Reviewer1294 42, bsbm:inst/Reviewer501 41, bsbm:inst/Reviewer424 39, bsbm:inst/Reviewer281 38, bsbm:inst/Reviewer1263 38
Gremlin Traversal Result: [1294:42, 501:41, 424:39, 281:38, 1263:38]

Q.#: S1
Feature: STAR
SPARQL Query: SELECT ?plabel ?label ?flabel ?proptext1 ?proptext2 ?proptext3 ?propnum1 ?propnum2 ?comment WHERE { ?producer v:type "producer". ?producer v:label_n ?plabel. ?producer e:edge ?product. ?product v:type "product". ?product v:productID ?pid. FILTER(?pid = 343). ?product v:label_n ?label. ?product v:comment ?comment. ?product v:ProductPropertyTextual_1 ?proptext1. ?product v:ProductPropertyTextual_2 ?proptext2. ?product v:ProductPropertyTextual_3 ?proptext3. ?product v:ProductPropertyNumeric_1 ?propnum1. ?product v:ProductPropertyNumeric_2 ?propnum2. ?product e:edge ?pfeature. ?pfeature v:type "product_feature". ?pfeature v:label_n ?flabel. } LIMIT 1
SPARQL Query Result: ?label ?comment ?p ?f ?productFeature ?producer ?propertyTextual1 ?propertyTextual2 ?propertyTextual3 ?propertyNumeric_1 ?propertyNumeric1_2: "ors" "sobbers kynurenic undergoing remained horsed sidings hutzpa continence flighty japingly semiretired crispest chukkers bamboozler shivah lagged miggs snickering arbitrators propped osmic mismeeting dissimulate fraudulently cabled yeller truncheons sigil expatriating viceless merrymakers fetas recompenses disreputability taperer multiplexed toddler disaffiliating radiating worshipper flamboyance waggly bothering swindlers eucharistical enserfing lightfaced tench tramping margraves bewilderment deuteronomy contravened fourpenny coveralls traitorousness millpond redetermine jeremiad resealable abreaction marblers whisks" bsbm:inst/Producer8 bsbm:inst/ProductFeature11 "entoiling" "assignat disrobe" "housewifeliness neoliths proselytizers infirmable meditations bedchair maschera hagfish saplings prearranges debacles bedews straying grouter stereophonically" "cyanided uncharged gametes" "fluorosis appeasing railheads criticizers satirizer controllers" 1165 1526
Gremlin Traversal Result: [ProductPropertyNumeric_1:[1165], productID:[343], ProductPropertyTextual_1:[cyanided uncharged gametes], ProductPropertyNumeric_2:[1526], ProductPropertyTextual_2:[fluorosis appeasing railheads criticizers satirizer controllers], label_n:[ors], comment:[sobbers kynurenic undergoing remained horsed sidings hutzpa continence flighty japingly semiretired crispest chukkers bamboozler shivah lagged miggs snickering arbitrators propped osmic mismeeting dissimulate fraudulently cabled yeller truncheons sigil expatriating viceless merrymakers fetas recompenses disreputability taperer multiplexed toddler disaffiliating radiating worshipper flamboyance waggly bothering swindlers eucharistical enserfing lightfaced tench tramping margraves bewilderment deuteronomy contravened fourpenny coveralls traitorousness millpond redetermine jeremiad resealable abreaction marblers whisks],type:[product]],label:hedgehogs barstools,label_prod:assignat disrobe
Table 6
Performance comparison of SPARQL queries vs translated Gremlin traversals over BSBM dataset.
Rows (Query): S1, S2, S3, F1, F2, F3, L1, L2, L3, C1, C2, C3, U1, U2, U3, G1, G2, G3, O1, O2, O3, M1, M2, M3, Gc1, Gc2, Gc3.
Column groups:
Gremlin Traversal Execution Time (ms, with indexes): Sparksee (w), Sparksee (c), Neo4J (w), Neo4J (c), Tinker (w), Tinker (c).
SPARQL Query Execution Time (ms, with indexes): 4store (w), 4store (c), Jena (w), Jena (c), Virtuoso (w), Virtuoso (c).
Gremlin Traversal Execution Time (ms, without indexes): Sparksee (w), Sparksee (c), Neo4J (w), Neo4J (c), Tinker (w), Tinker (c).

Table 7
Performance comparison of SPARQL queries vs translated Gremlin traversals over Northwind dataset.
Rows (Query): S1, S2, S3, F1, F2, F3, L1, L2, L3, C1, C2, C3, U1, U2, U3, G1, G2, G3, O1, O2, O3, M1, M2, M3, Gc1, Gc2, Gc3.
Column groups:
Gremlin Traversal Execution Time (ms, with indexes): Sparksee (w), Sparksee (c), Neo4J (w), Neo4J (c), Tinker (w), Tinker (c).
SPARQL Query Execution Time (ms, with indexes): 4store (w), 4store (c), Jena (w), Jena (c), Virtuoso (w), Virtuoso (c).
Gremlin Traversal Execution Time (ms, without indexes): Sparksee (w), Sparksee (c), Neo4J (w), Neo4J (c), Tinker (w), Tinker (c).
