A Survey of RDF Stores & SPARQL Engines for Querying Knowledge Graphs

Waqas Ali · Muhammad Saleem · Bin Yao · Aidan Hogan · Axel-Cyrille Ngonga Ngomo

arXiv:2102.13027v3 [cs.DB] 2 Aug 2021

Abstract The RDF graph-based data model has seen ever-broadening adoption in recent years, prompting the standardization of the SPARQL query language for RDF, and the development of local and distributed engines for processing SPARQL queries. This survey paper provides a comprehensive review of techniques, engines and benchmarks for querying RDF knowledge graphs. While other reviews on this topic tend to focus on the distributed setting, the main focus of the work is on providing a comprehensive survey of state-of-the-art storage, indexing and query processing techniques for efficiently evaluating SPARQL queries in a local setting (on one machine). To keep the survey self-contained, we also provide a short discussion on graph partitioning techniques used in the distributed setting. We conclude by discussing contemporary research challenges for further improving SPARQL query engines. An online extended version also provides a survey of over one hundred SPARQL query engines and the techniques they use, along with twelve benchmarks and their features.

Keywords Knowledge Graph · Storage · Indexing · Query Processing · SPARQL · Benchmarks

W. Ali
SEIEE, Shanghai Jiao Tong University, Shanghai, China
E-mail: [email protected]

M. Saleem
AKSW, University of Leipzig, Leipzig, Germany
E-mail: [email protected]

B. Yao
SEIEE, Shanghai Jiao Tong University, Shanghai, China
E-mail: [email protected]

A. Hogan
DCC, University of Chile & IMFD, Santiago, Chile
E-mail: [email protected]

A.-C. Ngonga Ngomo
DICE, Paderborn University, Paderborn, Germany
E-mail: [email protected]

1 Introduction

The Resource Description Framework (RDF) is a graph-based data model where triples of the form $(s, p, o)$ denote directed labeled edges $s \xrightarrow{p} o$ in a graph. RDF has gained significant adoption in the past years, particularly on the Web. As of 2019, over 5 million websites publish RDF data embedded in their webpages [30]. RDF has also become a popular format for publishing knowledge graphs on the Web, the largest of which – including Bio2RDF, DBpedia, PubChemRDF, UniProt, and Wikidata – contain billions of triples. These developments have brought about the need for optimized techniques and engines for querying large RDF graphs. We refer to engines that allow for storing, indexing and processing joins over RDF as RDF stores.

Various query languages have been proposed for RDF, but the SPARQL Protocol and RDF Query Language (SPARQL) has become the standard [84]. The first version of SPARQL was standardized in 2008, while SPARQL 1.1 was released in 2013 [84]. SPARQL is an expressive language that supports not only joins, but also variants of the broader relational algebra (projection, selection, union, difference, etc.). Various new features were added in SPARQL 1.1, including property paths for matching arbitrary-length paths in the RDF graph. Hundreds of SPARQL query services, called endpoints, have emerged on the Web [39], with the most popular endpoints receiving millions of queries per day [182,137]. We refer to engines that support storing, indexing and processing SPARQL (1.1) queries over RDF as SPARQL engines. Since SPARQL supports joins, we consider any SPARQL engine to also be an RDF store.

Efficient data storage, indexing and join processing are key to RDF stores (and thus, to SPARQL engines):

– Storage. Different engines store RDF data using different structures (tables, graphs, etc.), encodings (integer IDs, string compression, etc.) and media (main memory, disk, etc.). Which storage to use may depend on the scale of the data, the types of query features supported, etc.
– Indexing. Indexes are used in RDF stores for fast lookups and query execution. Different index types can support different operations with varying time–space trade-offs.
– Join Processing. At the core of evaluating queries lie efficient methods for processing joins. Aside from traditional pairwise joins, recent years have seen the emergence of novel techniques, such as multiway and worst-case optimal joins, as well as GPU-based join processing. Optimizing the order of evaluation of joins can also be important to ensure efficient processing.

Beyond processing joins, SPARQL engines must offer efficient support for more expressive query features:

– Query Processing. SPARQL is an expressive language containing a variety of query features beyond joins that need to be supported efficiently, such as filter expressions, optionals, path queries, etc.

RDF stores can further be divided into two categories: (1) local stores (also called single-node stores) that manage RDF data on one machine and (2) distributed stores that partition RDF data over multiple machines. While local stores are more lightweight, the resources of one machine limit scalability [230,160,96]. Various kinds of distributed RDF stores have thus been proposed [80,96,188,189] that typically run on clusters of shared-nothing machines.

In this survey, we describe storage, indexing, join processing and query processing techniques employed by local RDF stores, as well as high-level strategies for partitioning RDF graphs as needed for distributed storage. An extended online version of this survey [9] further compares 121 local and distributed RDF engines in terms of the techniques they use, as well as 12 benchmarks in terms of the types of data and queries they contain. The goal of this survey is to give a succinct introduction to the different techniques used by RDF query engines, and also to help users choose the appropriate engine or benchmark for a given use case.

The rest of the paper is structured as follows. Section 2 discusses and contrasts this survey with related literature. Section 3 provides preliminaries for RDF and SPARQL. Sections 4, 5, 6 and 7 review techniques for storage, indexing, join processing and query processing, respectively. Section 8 explains different graph partitioning techniques for distributing storage over multiple machines. Section 9 introduces additional content available in the online extended version [9], which surveys 121 local and distributed RDF engines, along with 12 SPARQL benchmarks. Section 10 concludes the paper with subsections for current trends and research challenges regarding efficient RDF-based data management and query processing.

2 Literature Review

We first discuss related studies. More specifically, we summarize peer-reviewed tertiary literature (surveys in journals, short surveys in proceedings, book chapters, surveys with empirical comparisons, etc.) from the last 10 years collating techniques, engines and/or benchmarks for querying RDF. We summarize the topics covered by these works in Table 1. We use ✓, ∼ and blank cells to denote detailed, partial or little/no discussion, respectively, when compared with the current survey (the bottom row). We also present the number of engines and benchmarks included in the work. If the respective publication does not formally list all systems/benchmarks (e.g., as a table), we may write n+ as an estimate for the number discussed in the text.

Sakr et al. [181] present three schemes for storing RDF data in relational databases, surveying works that use the different schemes. Svoboda et al. [205] provide a brief survey on indexing schemes for RDF divided into three categories: local, distributed and global. Faye et al. [63] focus on both storage and indexing schemes for local RDF engines, divided into native and non-native storage schemes. Luo et al. [131] also focus on RDF storage and indexing schemes under the relational-, entity-, and graph-based perspectives in local RDF engines. Compared to these works, we present join processing, query processing and partitioning techniques; furthermore, these works predate the standardization of SPARQL 1.1, and thus our discussion includes more recent storage and indexing techniques, as well as support for new features such as property paths.

Later surveys began to focus on distributed RDF stores. Kaoudi et al. [107] present a survey of RDF stores explicitly designed for a cloud-based environment. Ma et al. [134] provide an overview of RDF storage in relational and NoSQL databases. Özsu [159] presents a survey that focuses on storage techniques for RDF within local and distributed stores, with a brief overview of query processing techniques in distributed and decentralized (Linked Data) settings. Abdelaziz et al. [3] survey 22 distributed RDF stores, and compare 12 experimentally in terms of pre-processing cost, query performance, scalability, and workload adaptability. Elzein et al. [60] present a survey on the storage and query processing techniques used by RDF stores on the cloud. Janke & Staab [102] present lecture notes discussing RDF graph partitioning, indexing, and query processing techniques, with a focus on distributed and cloud-based RDF engines. Pan et al. [160] provide an overview of local and distributed storage schemes for RDF. Yasin et al. [235] discuss SPARQL (1.1) query processing in the context of distributed RDF stores. Wylot et al. [230] present a comprehensive survey of storage and indexing techniques for local (centralized) and distributed RDF stores, along with a discussion of benchmarks; most of their survey is dedicated to distributed and

Table 1: Prior tertiary literature on RDF query engines; the abbreviations are: Sto./Storage, Ind./Indexing, J.Pr./Join Processing, Q.Pr./Query Processing, Dis./Distribution, Eng./Engines, Ben./Benchmarks

Techniques Study Year Eng. Bench. Sto. Ind. J.Pr.