Storage, Indexing, Query Processing, and Benchmarking in Centralized and Distributed RDF Engines: a Survey

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 28 August 2020 | doi:10.20944/preprints202005.0360.v3
© 2020 by the author(s). Distributed under a Creative Commons CC BY license.

Waqas Ali
Department of Computer Science and Engineering, School of Electronic, Information and Electrical Engineering (SEIEE), Shanghai Jiao Tong University, Shanghai, China
[email protected]

Muhammad Saleem
Agile Knowledge and Semantic Web (AKWS), University of Leipzig, Leipzig, Germany
[email protected] leipzig.de

Bin Yao
Department of Computer Science and Engineering, School of Electronic, Information and Electrical Engineering (SEIEE), Shanghai Jiao Tong University, Shanghai, China
[email protected]

Aidan Hogan
IMFD; Department of Computer Science (DCC), Universidad de Chile, Santiago, Chile
[email protected]

Axel-Cyrille Ngonga Ngomo
University of Paderborn, Paderborn, Germany
[email protected]

ABSTRACT
Recent advances in the Semantic Web and Linked Data have changed how the traditional web works. The Resource Description Framework (RDF) format has been widely adopted for storing web-based data. This adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ various mechanisms to implement the critical components of a query processing engine, such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the way RDF data is stored significantly affects both the storage space required and the query runtime performance. The type of indexing approach used in RDF engines is critical for fast data lookup. The type of underlying query language (e.g., SPARQL or SQL) used for query execution is a crucial optimization component of RDF storage solutions. Finally, the join order chosen during query execution significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.

Keywords: Storage, Indexing, Language, Query Planning, SPARQL Translation, Centralized RDF Engines, Distributed RDF Engines, SPARQL Benchmarks, Survey.

PVLDB Reference Format:
.. PVLDB, (xxx): xxxx-yyyy, .
DOI:

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. , No. xxx
ISSN 2150-8097.
DOI:

1. INTRODUCTION
Over recent years, the simple, decentralized, and linked architecture of Resource Description Framework (RDF) data has attracted many data providers, who now store their data in the RDF format. This increase is evident in nearly every domain. For example, there are currently approximately 150 billion triples available from 9960 datasets¹. Some huge RDF datasets, such as UniProt², PubChemRDF³, Bio2RDF⁴ and DBpedia⁵, have billions of triples. The massive adoption of the RDF format requires effective solutions for storing and querying this amount of data. This need has paved the way for the development of centralized and distributed RDF engines for storage and query processing.

RDF engines can be divided into two major categories: (1) centralized RDF engines, which store the given RDF data on a single node, and (2) distributed RDF engines, which distribute the given RDF data among multiple cluster nodes. The complex and varying nature of Big RDF datasets has rendered centralized engines unable to meet the growing demand of complex SPARQL queries w.r.t. storage, computing capacity, and processing [126, 81, 50, 3]. To tackle this issue, various kinds of distributed RDF engines have been proposed [40, 33, 49, 85, 50, 103, 104]. These distributed systems run on cluster hardware consisting of several machines, each with dedicated memory and storage.

Efficient data storage, indexing, language support, and optimized query plan generation are key components of RDF engines:

• Data Storage. Data storage is an integral component of every RDF engine. It depends on factors such as the storage format, the size of the storage, and the inference supported by the storage format [81]. A recent evaluation [4] shows that the RDF graph partitioning technique used to store the data has a substantial effect on the query runtime.

• Indexing. Various indexes are used in RDF engines for fast data lookup and query execution. More indexes generally lead to better query runtime performance. However, maintaining these indexes can be costly, both in space consumption and in keeping them updated to reflect changes in the underlying RDF datasets. An outdated index can lead to incomplete results.

• Query Language. RDF engines store data in different formats and thus support various query languages, such as SQL [37] and PigLatin [79]. Since SPARQL is the standard query language for RDF datasets, many RDF engines require SPARQL translation (e.g., SPARQL to SQL) for query execution. Such language support can have a significant impact on query runtimes, because the optimization techniques used in these query languages can differ from each other.

• Query Execution. For a given input SPARQL query, RDF engines generate an optimized query plan that subsequently guides the query execution. Choosing the best join execution order and selecting among different join types (e.g., hash join, bind join, nested loop join) is vital for fast query execution.

Various studies categorize, compare, and evaluate different RDF engines. For example, query runtime evaluations of different RDF engines are shown in [81, 3, 96, 21, 6]. Studies such as [31, 70] focus on the data storage mechanisms in RDF engines. Svoboda et al. [114] classify various indexing approaches used for linked data. The usage of relational data models for RDF data is presented in [91]. A survey of RDF on the cloud is presented in [58]. A high-level illustration of the different centralized and distributed RDF engines and linked data query techniques is presented in [80]. Finally, an empirical performance evaluation and a broader overview of distributed RDF engines are presented in [3]. According to our analysis, there is no detailed study that provides a comprehensive overview of the techniques used to implement the different components of centralized and distributed RDF engines.

Motivated by the lack of a comprehensive overview of the component-wise techniques used in existing RDF engines, we present a detailed overview of the techniques used in a total of 77 centralized and distributed RDF engines (the largest number to the best of our knowledge). Specifically, we classify these triple stores into different categories w.r.t. storage, indexing, language, and query planning. We provide simple running examples to help understand the different types. We hope this survey will give readers a crisp idea of the different techniques used in the development of RDF engines. Furthermore, we hope that this study will help users choose the appropriate triple store for a given use case.

The remainder of the paper is organized as follows. Section 2 provides vital information on RDF, RDF engines, and SPARQL. Section 3 covers related work. Sections 4, 5, 6, and 7 review storage, indexing, query language, and the query execution process, respectively. Section 8 explains different graph partitioning techniques. Section 9 explains centralized and distributed RDF engines w.r.t. storage, indexing, query language, and query execution mechanism. Section 10 discusses different SPARQL benchmarks. Section 11 illustrates research problems and future directions, and Section 12 gives the conclusion.

2. BASIC CONCEPTS AND DEFINITIONS
This section contains a brief explanation of RDF and SPARQL. Its main purpose is to establish a basic understanding of the terminology used in the paper. For complete details, readers are encouraged to consult the original W3C sources for RDF⁶ and SPARQL⁷. This discussion is adapted from [80, 98, 50, 94].

2.1 RDF
Before explaining the RDF model, we first define the elements that constitute an RDF dataset:

• IRI: The Internationalized Resource Identifier (IRI) is a generalization of the URI (Uniform Resource Identifier) that allows non-ASCII characters. An IRI globally identifies a resource on the web. The IRIs used in one dataset can be reused in other datasets to represent the same resource.

• Literal: A string value that is not an IRI.

• Blank node: Refers to an anonymous resource that has no name; thus, such resources are not assigned a global IRI. Blank nodes are used as locally unique identifiers within a specific RDF dataset.

RDF is a data model proposed by the W3C for representing information about Web resources. RDF models each "fact" as a set of triples, where a triple consists of three parts:

• Subject. The resource or entity about which an assertion is made. IRIs and blank nodes are allowed as subjects.

• Predicate. A relation used to link one resource to another. Only IRIs can be used as predicates.

• Object. The object can be an attribute value or another resource. Objects can be IRIs, blank nodes, or literals.

Thus an RDF triple represents some kind of relationship (given by the predicate) between the subject and the object. An RDF dataset is a set of triples and is formally defined as follows.

¹ http://lodstats.aksw.org/.
² http://www.uniprot.org/.
³ http://pubchem.ncbi.nlm.nih.gov/rdf/.
⁴ http://bio2rdf.org/.
⁵ http://dbpedia.org/.
⁶ RDF Primer: http://www.w3.org/TR/rdf-primer/.
⁷ SPARQL Specification: https://www.w3.org/TR/
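As a concrete illustration of the term types and triple positions described in Section 2.1, the following is a minimal Python sketch, not taken from any of the surveyed engines. The class names (IRI, Literal, BlankNode, Triple), the helper objects_of, and the http://example.org/ namespace are all assumptions made for this example. The sketch enforces the positional rules stated above (subject: IRI or blank node; predicate: IRI; object: IRI, blank node, or literal) and treats a dataset as a plain set of triples.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass(frozen=True)
class IRI:
    """Globally identifies a resource; may be reused across datasets."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A value (here simply a string) that is not an IRI."""
    value: str


@dataclass(frozen=True)
class BlankNode:
    """Anonymous resource; the label is unique only within one dataset."""
    label: str


Term = Union[IRI, Literal, BlankNode]


@dataclass(frozen=True)
class Triple:
    subject: Term
    predicate: Term
    obj: Term

    def __post_init__(self) -> None:
        # Subject position: IRI or blank node.
        if not isinstance(self.subject, (IRI, BlankNode)):
            raise TypeError("subject must be an IRI or a blank node")
        # Predicate position: IRI only.
        if not isinstance(self.predicate, IRI):
            raise TypeError("predicate must be an IRI")
        # Object position: IRI, blank node, or literal.
        if not isinstance(self.obj, (IRI, BlankNode, Literal)):
            raise TypeError("object must be an IRI, blank node, or literal")


EX = "http://example.org/"  # hypothetical namespace, for illustration only

# An RDF dataset modelled as a set of triples.
dataset: Set[Triple] = {
    Triple(IRI(EX + "alice"), IRI(EX + "name"), Literal("Alice")),
    Triple(IRI(EX + "alice"), IRI(EX + "knows"), BlankNode("b0")),
    Triple(BlankNode("b0"), IRI(EX + "name"), Literal("Bob")),
}


def objects_of(data: Set[Triple], subject: Term, predicate: Term) -> Set[Term]:
    """Return all objects linked to `subject` via `predicate` (linear scan)."""
    return {t.obj for t in data
            if t.subject == subject and t.predicate == predicate}
```

The lookup here is a deliberate linear scan over the set; avoiding such scans is the purpose of the indexing schemes that the survey reviews for real engines.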
