Querying Property Graphs Using SPARQL Via Gremlinator
Total Page:16
File Type:pdf, Size:1020Kb
Killing Two Birds with One Stone – erying Property Graphs using SPARQL via Gremlinator Harsh Thakkar Dharmen Punjani University of Bonn National and Kapodistrian University of Athens Bonn, Germany 53117 Athens, Greece 10679 [email protected] [email protected] Jens Lehmann Sören Auer University of Bonn & Fraunhofer IAIS TIB & Leibniz University of Hannover Bonn, Germany 53117 Hannover, Germany 30167 [email protected] [email protected] ABSTRACT of queries eciently and can serve as a foundation for a range Knowledge graphs have become popular over the past decade and of Articial Intelligence applications. The Resource Description frequently rely on the Resource Description Framework (RDF) or Framework (RDF) and Property Graphs (PGs) are popular languages Property Graph (PG) databases as data models. However, the query for knowledge graphs. For RDF, the SPARQL query language was languages for these two data models – SPARQL for RDF and the standardized by W3C, whereas for PGs several languages are fre- PG traversal language Gremlin – are lacking interoperability. We quently used, including Gremlin [11]. present Gremlinator, the rst translator from SPARQL – the W3C PGs and RDF have evolved from dierent origins and still have standardised language for RDF – and Gremlin – a popular property largely disjoint user communities. RDF is part of the Semantic graph traversal language. Gremlinator translates SPARQL queries Web initiative with a focus on expressive data modelling as well to Gremlin path traversals for executing graph pattern matching as data publication and linking. PGs originate from the database queries over graph databases. This allows a user, who is well versed community with a focus on ecient execution of graph traversals. in SPARQL, to access and query a wide variety of Graph Data With Gremlinator, we build a bridge between both communi- Management Systems (DMSs) avoiding the steep learning curve ties and research the interoperability of RDF and PG query lan- for adapting to a new Graph Query Language (GQL). Gremlin is guages [16]. Moreover, we allow combining the best of both worlds: a graph computing system-agnostic traversal language (covering Powerful modelling capabilities as well as data publication and both OLTP graph database or OLAP graph processors), making it a interlinking methods combined with ecient graph traversal ex- desirable choice for supporting interoperability for querying Graph ecution. In particular, Gremlinator has the following advantages: DMSs. Gremlinator currently supports the translation of a subset (1) Existing SPARQL-based applications can switch to property of SPARQL 1.0, specically the SPARQL SELECT queries. graphs in a non-intrusive way. (2) It provides the foundation for a hybrid use of RDF triple stores and property graph DMS – a system KEYWORDS could detect which DMS is more ecient for answering a particular query [5] and redirect the query accordingly. In particular, property Property Graph, SPARQL, Gremlin, Graph Traversal, Gremlinator graph databases have been shown to work very well for a wide ACM Reference format: range of queries which benet from locality in a graph. Rather Harsh Thakkar, Dharmen Punjani, Jens Lehmann, and Sören Auer. 2016. than performing expensive joins, property graph databases use Killing Two Birds with One Stone – Querying Property Graphs using micro indices to perform traversals. (3) Users familiar with the SPARQL via Gremlinator. In Proceedings of ACM Conference, Washington, W3C standardized SPARQL query language do not need to learn DC, USA, July 2017 (Conference’17), 4 pages. DOI: 10.1145/nnnnnnn.nnnnnnn another query language. arXiv:1801.09556v1 [cs.DB] 25 Jan 2018 Overall, we make the following contributions: • A novel approach for mapping SPARQL queries to Grem- 1 INTRODUCTION lin pattern matching traversals, Gremlinator, which is the Knowledge graphs model the real world in terms of entities and rst work converting an RDF to a property graph query relations between them. They became popular as they are an intu- language to the best of our knowledge. itive and simple data model, which allows to execute many types • An openly available implementation for executing SPARQL queries over a plethora of third party graph DMS such as Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed Neo4J, Sparksee, OrientDB, etc. using the Apache TinkerPop for prot or commercial advantage and that copies bear this notice and the full citation framework. on the rst page. Copyrights for components of this work owned by others than ACM The remainder of the article is organized as follows: Section2 must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specic permission and/or a summarizes the related work. Section3 sheds light on the impor- fee. Request permissions from [email protected]. tance of Gremlin, briey discusses the Gremlinator approach and Conference’17, Washington, DC, USA its limitations. Section4 presents the demonstration details and © 2016 ACM. 978-x-xxxx-xxxx-x/YY/MM...$15.00 DOI: 10.1145/nnnnnnn.nnnnnnn Conference’17, July 2017, Washington, DC, USA H. Thakkar et al. the value Gremlinator will cater to its users. Finally, Section5 Gremlin is more general than, e.g. ,CYPHER, as it provides in concludes the article and describes the future work. addition to a query language a common execution platform for supporting any graph computing system (including both OLTP and 2 RELATED WORK OLAP graph processors), for addressing the querying interoperabil- We present a brief summary of related work with regard to tech- ity issue (see Figure1 (a)). Together with Apache TinkerPop frame- niques and tools that support the translation and execution of work, Gremlin is a language and a virtual machine, it is possible formal query languages, addressing the interoperability issue. to design another traversal language that compiles to the Gremlin SPARQL ! SQL: There is a substantial amount of work been traversal machine (analogous to how Scala compiles to the JVM), done for conversion of SPARQL queries to SQL queries, such as – ref. Figure1 (b). Gremlin provides the declarative (SPARQL style) Ontop [3], R2RML [12], Elliot et al. [6], Chebotko et al. [4], Zemke pattern matching querying construct using the .match()-step. et al. [18], Priyanka et al. [9]. Ontop [3], one of the most popular For brevity, we abstain from dwelling into the denitions and system, exposes relational databases as virtual RDF graphs by link- formal semantics of Gremlin, rather point the interested reader ing the terms (classes and properties) in the ontology to the data to the literature [11, 17]. Furthermore, one may also refer to [16], sources through mappings. This virtual RDF graph can then be where the complete SPARQL to Gremlin translation approach is queried using SPARQL. discussed in detail. SQL ! SPARQL: RETRO [10] presents a formal semantics pre- serving the translation from SQL to SPARQL. It follows a schema 3.2 Gremlinator Pipeline and query mapping approach rather than to transform the data We now present the architectural overview of Gremlinator in Fig- physically. The schema mapping derives a domain-specic rela- ure2 and discuss each of the four steps of its execution pipeline. tional schema from RDF data. Query mapping transforms an SQL Step 1. The input SPARQL query is rst parsed using the Jena query over the schema into an equivalent SPARQL query, which in ARQ module, thereby: (i) validating the query and (ii) generating turn is executed against the RDF store. its abstract syntax tree (AST) representation. SQL ! CYPHER: CYPHER1 is the graph query language used Step 2. From the obtained AST of the parsed SPARQL query, to query the Neo4j2 graph database. There has been no work yet to Gremlinator then visits each basic graph pattern (BGP), mapping use SQL on top of CYPHER. However, there are some examples3 them to the corresponding Gremlin single step traversals (SSTs). A that show the equivalent CYPHER queries for certain SQL queries. SST in Gremlin is an atomic traversal step (ψs ) we describe in [16] in detail. 3 GREMLINATOR APPROACH Step 3. Thereafter, depending on the operator precedence ob- In this section we discuss the why we choose Gremlin as a prop- tained from the AST of the parsed SPARQL query, each of the cor- erty graph query language and briey describe the Gremlinator responding SPARQL keywords are mapped to their corresponding approach. instruction steps from the Gremlin instruction library. Thereafter a nal conjunctive traversal (Ψ) is generated appending the SSTs and 3.1 Why Apache TinkerPop Gremlin? instruction steps. This can be perceived analogous to the SPARQL query language, wherein a set of BGPs form a single complex graph Gremlin is a system-agnostic query language developed by Apache pattern (CGP). 4 TinkerPop . It supports both – pattern matching (declarative) and Step 4. This nal conjunctive traversal (Ψ) is used to generate graph traversal (imperative) style of querying over property graphs. bytecode5 which can be used on multiple language and platform variants of the Apache TinkerPop Gremlin family. Figure 2: The architectural overview of Gremlinator. Figure 1: The Gremlin Traversal Language and Machine. 3.3 Pre-dened Queries Considerations. We encode the prexes of SPARQL queries within 1 CYPHER Query Language (https://neo4j.com/developer/cypher-query-language/) the Gremlinator implementation. In order to aid the SPARQL to 2Neo4j (https://neo4j.com/) 3SQL to CYPHER (https://neo4j.com/developer/guide-sql-to-cypher/) 4Gremlin: Apache TinkerPop’s graph traversal language and machine (https:// 5Bytecode is simply serialized representation of a traversal, i.e. a list of ordered instruc- tinkerpop.apache.org/) tions where an instruction is a string operator and a (attened) array of arguments. Killing Two Birds with One Stone via Gremlinator Conference’17, July 2017, Washington, DC, USA Table 1: Query feature and description Query Id.