A Stitch in Time Saves Nine – SPARQL Querying of Property Graphs Using Gremlin Traversals
Semantic Web 0 (0) 1, IOS Press

Harsh Thakkar a,*, Dharmen Punjani b, Yashwant Keswani c, Jens Lehmann a,d and Sören Auer e

a Smart Data Analytics, University of Bonn, Germany, E-mail: {thakkar, jens.lehmann}@cs.uni-bonn.de
b Department, National and Kapodistrian University of Athens, Greece, E-mail: [email protected]
c DA-IICT, India, E-mail: [email protected]
d Fraunhofer IAIS, Germany, E-mail: [email protected]
e TIB Technische Informationsbibliothek & L3S Research Center, Leibniz University of Hannover, Germany, E-mail: [email protected]

Abstract. Knowledge graphs have become popular over the past years and frequently rely on the Resource Description Framework (RDF) or Property Graphs (PG) as underlying data models. However, the query languages for these two data models (SPARQL for RDF and Gremlin for property graph traversal) lack interoperability. We present Gremlinator, a novel SPARQL to Gremlin translator. Gremlinator translates SPARQL queries to Gremlin traversals for executing graph pattern matching queries over graph databases. This allows users to access and query a wide variety of Graph Data Management Systems (DMSs) using the W3C-standardized SPARQL query language and to avoid the learning curve of a new Graph Query Language. Gremlin is a system-agnostic traversal language covering both OLTP graph databases and OLAP graph processors, which makes it a desirable choice for supporting interoperability with respect to querying Graph DMSs. We present a comprehensive empirical evaluation of Gremlinator and demonstrate its validity and applicability by executing SPARQL queries on top of the leading graph stores Neo4J, Sparksee and Apache TinkerGraph, and we compare the performance with the RDF stores Virtuoso, 4Store and JenaTDB.
Our evaluation demonstrates the substantial performance gain obtained by the Gremlin counterparts of the SPARQL queries, especially for star-shaped and complex queries.

Keywords: SPARQL, Gremlin, Pattern Matching, Graph Traversal, Query Translator, RDF Graph, Property Graph, Gremlinator

*Corresponding author. E-mail: {thakkar, jens.lehmann}@cs.uni-bonn.de.
arXiv:1801.02911v2 [cs.DB] 12 Feb 2018
1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved

1. Introduction

Knowledge graphs have become increasingly popular over the past years. The two most popular data models for representing and storing knowledge graphs are property graphs (PG) and the Resource Description Framework (RDF). For RDF, the SPARQL query language was standardized by W3C, whereas for PGs several languages are frequently used, including Gremlin [1]. Both data models and the corresponding data management techniques have distinct and complementary characteristics: RDF is suited for distributed data integration with built-in world-wide unique identifiers and the expressive SPARQL query language; PGs, on the other hand, support extremely scalable storage and querying and are by now widely used for modern Web applications.

In this article, we present Gremlinator, an approach for executing SPARQL queries over graph databases via Gremlin traversals. It builds a bridge between the currently still largely disjoint semantic and graph data technology ecosystems and thereby addresses the query interoperability problem.

A SPARQL-PG query translation renders several benefits: (1) Applications based on W3C Semantic Web standards, like SPARQL and RDF, can use property graph databases in a non-intrusive fashion. (2) The query translation lays the foundation for a hybrid use of RDF triple stores and property graph DMSs, where a particular query can be dispatched to the DMS capable of answering it more efficiently [2]. In particular, property graph databases have been shown to work very well for a wide range of queries which benefit from locality in a graph. Rather than performing expensive joins, property graph databases use micro indices to perform traversals. (3) Users familiar with the W3C SPARQL query language can avoid learning another query language.

To the best of our knowledge, this is the first work addressing the query interoperability (translation) problem. Related work (cf. Section 2) mostly covers the area of SPARQL to SQL conversion and vice versa. In contrast to those previous efforts, we have to overcome the challenge of mediating between two very different execution paradigms: while SPARQL uses pattern matching techniques, Gremlin is based on performing graph traversals. More specifically, previous efforts applied query rewriting techniques between formalisms which are ultimately rooted in relational algebra operations, whereas we had to bridge more disparate query paradigms. While this is a significant challenge, it is also the reason why substantial performance improvements can be made depending on the query characteristics: whereas direct SPARQL query execution can be expected to be suitable for large analytical joins over the entire dataset, the Gremlin conversion can significantly speed up all queries that require exploiting the graph locality.

We selected TinkerPop Gremlin as target language, since it is more general than, e.g., CYPHER, as it is supported by a wide range of property graph databases (including OLTP and OLAP processors, see Figure 1 (a)). Moreover, Gremlin supports both the imperative (graph traversal) and the declarative (graph pattern matching) style [1] for addressing the query interoperability issue. Lastly, together with the Apache TinkerPop framework, Gremlin is both a language and a virtual machine, which makes it possible to design another traversal language that compiles to the Gremlin traversal machine (analogous to how Scala compiles to the JVM), cf. Figure 1 (b).

Figure 1. The Gremlin Traversal Language and Machine.

We map SPARQL queries to the pattern matching Gremlin traversals (i.e. we map declarative SPARQL queries to declarative Gremlin constructs and not the imperative ones). This ensures a level of fairness when comparing the performance of both Graph Query Languages (GQLs). Furthermore, instead of translating SPARQL queries to a specific dialect of Gremlin (e.g. Gremlin-Java8, Gremlin-Python etc.), we map each operation within a SPARQL basic graph pattern (BGP) to its corresponding traversal step in the Gremlin instruction library (i.e. a single step traversal operation). As a result, we build complex pattern matching traversals in a fashion analogous to SPARQL-style querying, wherein multiple BGPs form complex graph patterns (CGPs). Thus, it is possible to construct a corresponding Gremlin traversal for each SPARQL query.

Overall, we make the following contributions:
– We propose a novel approach for mapping SPARQL queries to Gremlin pattern matching traversals, which is, to the best of our knowledge, the first work converting an RDF query language to a property graph query language.
– Our Gremlinator implementation for executing SPARQL queries over a plethora of third-party graph DMSs such as Neo4J, Sparksee, OrientDB, etc. using the Apache TinkerPop framework is openly available.
– We report the results of a comprehensive empirical evaluation of the proposed translation approach, comprising a variety of state-of-the-art property graph databases and triple stores on the Northwind and BSBM datasets.

The remainder of the article is organized as follows: Section 2 covers related query conversion efforts. Section 3 introduces preliminary notions. Section 4 describes the relationship between SPARQL graph pattern matching and Gremlin traversal steps. Section 5 explains our mapping approach. Section 6 presents the experimental evaluation on two well-known datasets and discusses the results and observations. Finally, Section 7 concludes the article and describes future work.

2. Related Work

In this section we briefly survey the related work with regard to techniques and tools that support the translation and execution of GQLs.

SPARQL → SQL: A substantial amount of work has been done on the conversion of SPARQL queries to SQL queries [3–8]. Ontop [3] exposes relational databases as virtual RDF graphs by linking the terms (classes and properties) in the ontology to the data sources through mappings. This virtual RDF graph can then be queried using SPARQL by dynamically and transparently translating the SPARQL queries into SQL queries over the relational databases. The work presented in [4] generates SQL that is optimized and also provides a well-defined specification of the SPARQL semantics used in the translation. In addition, Ontop also supports R2RML mappings over general relational schemas. The authors show that their implementation can outperform other well-known SPARQL-to-SQL systems, as well as commercial triple stores, by a large margin. In [5] a SPARQL-to-SQL translation technique is introduced that focuses on the generation of efficient SQL queries. It relies on a mapping language that lacks support for URI templates and is less expressive than R2RML. [6] proposes a translation function that takes a query and two many-to-one [...]

SQL → CYPHER: CYPHER is the graph query language used to query the Neo4j graph database. There has been no work yet aiming to convert RDBMS (SQL) queries to CYPHER. However, there are some examples that show the equivalent CYPHER queries for certain SQL queries.

3. Preliminaries

In this section, we recall and summarize the mathematical concepts which will be used in this article. Our notation closely follows [10] and extends [11] by introducing the notion of vertex labels, a detailed discussion of which can be found in [12].

3.1. Graph Data Models

3.1.1. Edge-labeled Graphs.

The Resource Description Framework (RDF) is a well-known W3C standard which is used for data modeling and encoding machine-readable content on the Web [13] and within intranets. An RDF graph can be seen as a set of triples, roughly analogous to nodes and edges in a graph database. However, RDF is more specific in defining disjoint vertex-sets of blank nodes, literals and IRIs.
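To make the view of an RDF graph as a set of triples over three disjoint term sets more concrete, the following minimal Python sketch may help. It is an illustration only and is not part of Gremlinator: the class names `IRI`, `BlankNode` and `Literal`, the helper functions, and the `http://example.org/` data are all assumptions introduced here. The position constraints it checks (subject is an IRI or blank node, predicate is an IRI, object is any term) follow the RDF 1.1 abstract syntax.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union


# Three disjoint kinds of RDF terms. The class names are illustrative
# assumptions, not taken from the paper or from an RDF library.
@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str


Term = Union[IRI, BlankNode, Literal]
Triple = Tuple[Term, Term, Term]


def is_valid_triple(triple: Triple) -> bool:
    """Check the RDF position constraints for (subject, predicate, object)."""
    s, p, o = triple
    return (
        isinstance(s, (IRI, BlankNode))
        and isinstance(p, IRI)
        and isinstance(o, (IRI, BlankNode, Literal))
    )


def vertex_sets(graph: Set[Triple]):
    """Split all subjects and objects into IRIs, blank nodes and literals."""
    iris, blanks, literals = set(), set(), set()
    for s, _p, o in graph:
        for term in (s, o):
            if isinstance(term, IRI):
                iris.add(term)
            elif isinstance(term, BlankNode):
                blanks.add(term)
            else:
                literals.add(term)
    return iris, blanks, literals


# Hypothetical example data: every triple is one labeled edge s --p--> o.
EX = "http://example.org/"
graph: Set[Triple] = {
    (IRI(EX + "alice"), IRI(EX + "knows"), IRI(EX + "bob")),
    (IRI(EX + "alice"), IRI(EX + "name"), Literal("Alice")),
    (BlankNode("b0"), IRI(EX + "knows"), IRI(EX + "alice")),
}

iris, blanks, literals = vertex_sets(graph)
```

Because the three term kinds are separate types, a vertex belongs to exactly one of the returned sets, which mirrors the disjointness stated above. A triple with a literal in subject position, such as `(Literal("x"), IRI(EX + "p"), IRI(EX + "y"))`, is rejected by `is_valid_triple`.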