Semantic Web 1 (0) 1–5 1 IOS Press

1 1 2 2 3 3 4 A Performance Study of RDF Stores for 4 5 5 6 Linked Sensor Data 6 7 7 8 Hoan Nguyen Mau Quoc a,*, Martin Serrano b, Han Nguyen Mau c, John G. Breslin d , Danh Le Phuoc e 8 9 a Insight Centre for Data Analytics, National University of Ireland Galway, Ireland 9 10 E-mail: [email protected] 10 11 b Insight Centre for Data Analytics, National University of Ireland Galway, Ireland 11 12 E-mail: [email protected] 12 13 c Information Technology Department, Hue University, Viet Nam 13 14 E-mail: [email protected] 14 15 d Confirm Centre for Smart Manufacturing and Insight Centre for Data Analytics, National University of Ireland 15 16 Galway, Ireland 16 17 E-mail: [email protected] 17 18 e Open Distributed Systems, Technical University of Berlin, Germany 18 19 E-mail: [email protected] 19 20 20 21 21 Editors: First Editor, University or Company name, Country; Second Editor, University or Company name, Country 22 Solicited reviews: First Solicited Reviewer, University or Company name, Country; Second Solicited Reviewer, University or Company name, 22 23 Country 23 24 Open reviews: First Open Reviewer, University or Company name, Country; Second Open Reviewer, University or Company name, Country 24 25 25 26 26 27 27 28 28 29 Abstract. The ever-increasing amount of Internet of Things (IoT) data emanating from sensor and mobile devices is creating 29 30 new capabilities and unprecedented economic opportunity for individuals, organisations and states. In comparison with tradi- 30 31 tional data sources, and in combination with other useful information sources, the data generated by sensors is also providing a 31 32 meaningful spatio-temporal context. This spatio-temporal correlation feature turns the sensor data become even more valuables, 32 33 especially for applications and services in Smart City, Smart Health-Care, Industry 4.0, etc. However, due to the heterogeneity 33 and diversity of these data sources, their potential benefits will not be fully achieved if there are no suitable means to support 34 34 interlinking and exchanging this kind of information. This challenge can be addressed by adopting the suite of technologies 35 developed in the Semantic Web, such as Linked Data model and SPARQL. When using these technologies, and with respect to 35 36 an application scenario which requires managing and querying a vast amount of sensor data, the task of selecting a suitable RDF 36 37 engine that supports spatio-temporal RDF data is crucial. In this paper, we present our empirical studies of applying an RDF 37 38 store for Linked Sensor Data. We propose an evaluation methodology and metrics that allow us to assess the readiness of an RDF 38 39 store. An extensive performance comparison of the system-level aspects for a number of well-known RDF engines is also given. 39 40 The results obtained can help to identify the gaps and shortcomings of current RDF stores and related technologies for managing 40 41 sensor data which may be useful to others in their future implementation efforts. 41 42 42 Keywords: IoT, Semantic Web, Linked Data, Sensor, Triple store 43 43 44 44 45 45 46 46 1. Introduction 47 47 48 48 49 The Internet of Things(IoT) is a network of physical 49 50 objects embedded with sensors that are providing real- 50 51 *Corresponding author. time observations about the world as it happens. With 51

1570-0844/0-1900/$35.00 c 0 – IOS Press and the authors. All rights reserved 2 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 an estimation of there being 50 billion connected IoT the readiness of current RDF database technologies for 1 2 objects by 2020[16], there will be an enormous amount managing Linked Sensor Data. In summary, the main 2 3 of sensor observation data being continuously gener- contributions of this paper are: 3 4 ated per second. Connecting these sensor observation 4 1. A detailed analysis on the fundamental require- 5 data sources to the rest of the digital world, and turn- 5 ments for an RDF processing engine that can be 6 ing this data into meaningful actionable information, 6 applied for Linked Sensor Data. 7 will create new capabilities, richer experiences and un- 7 2. An extensive performance assessment of five se- 8 precedented economic opportunity for individuals, or- 8 lected and well-known RDF stores for managing 9 ganisations and states. However, deriving trends, pat- 9 Linked Sensor Data. In comparison with exist- 10 terns, outliers and unanticipated relationships in such 10 ing works, our performance study also focuses 11 an enormous amount of dynamic data with unprece- 11 on evaluating the spatial, temporal and text data 12 dented speed and adaptability is extremely challeng- 12 indexing performance of these systems. 13 ing. 13 3. A set of important findings about the gaps and 14 When trying to fully exploit the huge potential of 14 shortcomings of current RDF stores in support- 15 these sensor data sources, deriving insights from dy- 15 ing spatio-temporal computation and full-text 16 namic raw observation data poses various challenges 16 searches. 17 in terms of data integration. Fortunately, a suite of 17 4. Identification of a query optimization challenge 18 technologies developed through the Semantic Web 18 for executing a complex spatio-temporal query 19 [6] effort, such as the RDF model, Linked Data [8], 19 over Linked Sensor Data. 20 SPARQL [36], and RDF stores, can be used as some 20 21 of the principal solutions to help relieve sensor data The remainder of this paper is organized as follows. 21 22 sources from the challenge of poor integration [12, Section 2 reviews the related work as regards the study 22 23 30]. Among these technologies, the RDF store is one and evaluation of existing RDF stores. In Section 3, we 23 24 that was developed that allows management of RDF analyze the fundamental requirements of RDF stores 24 25 data. Additionally, an RDF store typically also pro- for Linked Sensor Data. Section 4 describes the gen- 25 26 vides a public SPARQL endpoint such that data can eral architectural design of current RDF databases that 26 27 be queried from RDF-enabled applications via the support spatio-temporal querying. A selected set of 27 28 SPARQL query language. five popular RDF stores for our analysis is introduced 28 29 In the Linked Sensor Data context, the usage of in Section 5. Section 6 presents the general evalua- 29 30 an RDF store is relatively recent. This is due to sev- tion methodology for assessing the RDF stores (RDF 30 31 eral specific requirements which are required in or- engines) for Linked Sensor Data. Section 7 describes 31 32 der to enable querying and storage of sensor data. In the experimental setting. Section 8 reports on the eval- 32 33 fact, sensor data is always associated with some spatio- uation results when the evaluation methodology de- 33 34 temporal contexts, i.e, they are produced in specific lo- scribed in Section 6 is carried out. A discussion and 34 35 cations at specific points in time. Therefore, all sensor our main findings are also given in this section. Finally, 35 36 data items can be represented in three dimensions: se- we conclude our work in the last section. 36 37 mantic, spatial and temporal. As a result, to enable effi- 37 38 cient (and accurate) querying and storage of this multi- 38 39 dimensional sensor data, spatio-temporal computation 2. Related work 39 40 support becomes a mandatory requirement for an RDF 40 41 store or database. Moreover, the "big data" nature of During the last decade, we have witnessed a rapid 41 42 sensor data also requires that these systems need to increase in the number of RDF store implementations 42 43 scale to millions of sensor sources and years of data. that have been proposed [2, 7, 15, 20, 33, 37, 41]. 43 44 In this paper, the requirements and constraints to Along with this growth is a corresponding increase 44 45 adopt an RDF store as a fundamental back-end solu- in studies interested in looking into their performance 45 46 tion for semantic sensor-based applications and ser- and data processing behaviours. For that reason, a 46 47 vices are analyzed. Moreover, an extensive perfor- number of performance assessment techniques specific 47 48 mance comparison of the most used and well-known to these RDF stores have been introduced. For exam- 48 49 RDF store engines that can be utilised for Linked Sen- ple, in [31], the authors provide a performance com- 49 50 sor Data are presented. A requirements analysis, along parison of seven selected RDF stores over the synthetic 50 51 with detailed evaluation results, can be used to assess Lehigh University Benchmark (LUBM) dataset [19]. 51 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 3

1 This evaluation aims to test the data loading and query 3. Fundamental requirements of a processing 1 2 performance of these stores with respect to a large data engine for linked sensor data 2 3 application. Similarly, Rohloff et al. [38] present a per- 3 4 formance evaluation over the LUBM dataset of Alle- Sensor data has distinctive characteristics that make 4 5 groGraph, Jena and Sesame with various storage back- traditional RDF engines an obsolete solution. This is 5 6 ends (such as MySQL, DAML DB[34], BigIOWLIM, due to the limited capability of such engines to process 6 7 etc). In this work, a set of evaluation metrics is pre- the massive amount of linked sensor data available, as 7 8 sented such as the data loading time, query execution well as the lack of spatio-temporal index support. In 8 9 performance, query completeness and soundness, and general, when providing an RDF processing engine for 9 10 storage size requirements. Unlike [31, 38] that com- linked sensor data, there are some important features 10 11 pares the performance of different RDF stores, the that should be provided. These features are as follows: 11 12 evaluation described in [39] focuses on the experimen- 12 13 tal comparison of a single native triple store (Sesame) – Geospatial search: This mandatory feature plays 13 14 and a vertically partitioned scheme for storing RDF an important role for processing sensor data. Hav- 14 15 data in a relational database on top of the SP2Bench ing spatial search support will help to not only 15 16 SPARQL benchmark [40]. In [9], the authors introduce provide the spatial information of a spatial object, 16 17 the Berlin SPARQL Benchmark (BSBM) for compar- but will also help to compute the spatial relation- 17 18 ing the performance of native RDF stores, non-RDF ships between the two geometries of objects (i.e, 18 19 relational databases and SPARQL-to-SQL rewriters. within, intersecting, etc). For example, this can be 19 20 The benchmark provides valuable insights into sys- used for finding all the weather stations that are 20 21 tem behaviour by comparing the data loading time, the operating within a given area. 21 22 number of mixed queries executed per hour, etc. – Temporal filter: Most of the queries from end- 22 23 Regarding an performance study of RDF stores user or sensor-based applications require filtering 23 24 over spatial RDF data, Kolas [26] presents a state- on the temporal aspect of sensor observation data. 24 25 of-the-art assessment for querying geospatial data en- This task is very expensive due to the high fre- 25 26 coded in RDF. In this work, the author extends the quency updates of sensor observation data. Fur- 26 27 LUBM dataset by adding spatial entities so that they thermore, this is also the main reason behind a 27 28 were able to evaluate a spatial search on the geo- dramatic increase in the required storage size. 28 29 enabled RDF stores. Along the same lines, Garbis et Therefore, several enhancements should be made 29 30 al. [18] developed a benchmark, called Geographica, to the temporal filter by the query processing en- 30 31 which uses both real-world and synthetic datasets to gine so that observation data can be easily re- 31 32 test the offered functionality and the performance of trieved. 32 33 some prominent geospatial RDF stores. Despite the – Full-text search: This is an advanced feature that 33 34 wide range of performance assessments between RDF aims to improve performance of full-text search 34 35 stores, to the best of our knowledge, there is no ex- queries. Essentially, these kind of searches are 35 36 isting approach that focuses on studying the readiness mostly based on finding given keywords embed- 36 37 of RDF stores as regards Linked Sensor Data. In com- ded in literal values such as descriptions, street 37 38 parison with traditional RDF data, sensor data is usu- names, location names, etc. 38 39 ally associated with spatial and temporal contexts. This – Scalability: This is also another advanced fea- 39 40 distinctive characteristic therefore poses challenges for ture that allows the engine to deal with the "big 40 41 the current RDF store implementations due to the re- data" processing challenge of sensor data. Also, 41 42 quirements of geospatial supports, temporal filtering this can help the engine to address a perfor- 42 43 and full-text search. In this regard, rather than provid- mance issue when executing a high volume of 43 44 ing another benchmarking effort, our paper should be user queries. Thus, either a scalable solution or 44 45 read as an empirical study to find the gaps and short- an efficient index mechanism are needed [5]. One 45 46 comings with current RDF store technologies so that possible solution that can be considered is one 46 47 they can be properly applied for the management of that will duplicate the same service so as to al- 47 48 Linked Sensor Data. Following this direction, our eval- low many concurrent queries while also provid- 48 49 uation and analysis aims to help guide the development ing suitable fault tolerance capabilities. Alterna- 49 50 of RDF stores, rather than serving as a comparison of tively, another solution is to split the data across 50 51 existing ones. different servers and to provide a middleware 51 4 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 component that can coordinate data transactions example, spatial data are indexed by an R/R+ trees al- 1 2 amongst these underlying servers. gorithm while the text literals are analyzed by a string 2 3 – Spatio-temporal query language: It is impor- index algorithm. 3 4 tant to note that the standard SPARQL query lan- 4 4.1.2. Query engine 5 guage does not support spatio-temporal queries 5 The query engine is used for data retrieval which is 6 (nor full-text queries). For this reason, to enable 6 generally comprised of three sub-modules, namely a 7 this feature, the RDF database creators must pro- 7 query parser, query optimizer and query executor. 8 vide spatio-temporal querying by either defining 8 9 a new language in their own specific syntax or by Query parser The query parser is responsible for 9 10 extending the SPARQL query language. parsing the input query from a user or an applica- 10 11 tion program to an algebraic expression. Unlike the 11 12 standard SPARQL query parser, the one for spatio- 12 13 4. Architectural Design temporal queries over Linked Sensor Data has to 13 14 adapt to its associated spatio-temporal query language 14 15 The general architectural design of RDF stores that syntax. As SPARQL does not cover spatio-temporal 15 16 support spatio-temporal query processing over Linked queries, so the query language that supports spatio- 16 17 Sensor Data can be classified into two categories, temporal or full-text search can be defined either by 17 18 namely native architectures and hybrid architectures. the RDF engines in their own syntax or by extend- 18 19 In this section, we will discuss the details for each ar- ing the SPARQL language. For example, Virtuoso and 19 20 chitecture and analyze the features that might affect the Jena define their own spatial query language by adding 20 21 query performance of each. spatial built-in functions to SPARQL. These addi- 21 22 tional functions are assigned along with dedicated pre- 22 23 4.1. Native architectural design fixes, such as and , for Virtuoso and 23 24 Jena, respectively. In the meantime, instead of defin- 24 25 The native architectural design of RDF stores for ing a new query syntax, Strabon and Stardog adopt the 25 26 Linked Sensor Data is illustrated in Figure 1. This ar- WGS84 and OGC’s GeoSPARQL vocabularies, which 26 27 chitecture needs to implement the data indexing sys- have been widely used and have become the standard 27 28 tem, the query engine, and physical storage. W3C recommendation. 28 29 29 30 4.1.1. Data index Query optimizer This module is used to determine 30 31 The data index has a vital role and can significantly the most efficient way to execute a given query by 31 32 affect the performance of data insertion. This is one considering all the possible query execution plans. 32 33 of the primary design elements for storage systems. The proper execution plan will be represented as an 33 34 Additionally, an efficient data indexing system also execution-order tree that consists of algebraic opera- 34 35 helps to improve query performance. In the Linked tors. A good execution plan will help to prune all the 35 36 Sensor Data context, along with the triple index, defin- redundant intermediate results, and thus will improve 36 37 ing spatio-temporal data indices is an essential require- the query performance by reducing the memory con- 37 38 ment for engines that follow native architectural de- sumption as well as result materialization. 38 39 sign principles. This multidimensional index feature Obviously, each query plan leads to different query 39 40 not only helps with triple-based retrieval requirements performance. Finding the proper execution plan that 40 41 but also with supporting spatio-temporal queries. For can balance the resource-consuming cost and the query 41 42 example, Virtuoso uses RTrees for spatial indexing and response time is always an ultimate goal of the query 42 43 bitmap indices for RDF triples. Meanwhile, Jena uses optimizer. In the Linked Sensor Data context, this pro- 43 44 a Lucene spatial index along with B/B+ and three ad- cess becomes more complex and expensive due to the 44 45 vanced triple indexes on spo, pos and osp to accommo- difficulties in building a comprehensive cost model 45 46 date different triple patterns. that can reflect all the spatial and temporal aspects 46 47 Another important feature of the data index compo- of Linked Sensor Data. In addition, the lack of sta- 47 48 nent is to define the triple patterns that are used to ex- tistical information on spatio-temporal data and ac- 48 49 tract the spatial, temporal or text values from the in- cess patterns also makes the join order optimization 49 50 put data. The extracted values will be indexed properly challenging and resource consuming. Currently, most 50 51 based on their characteristics and implied context. For RDF query optimizers can only collect limited statis- 51 Predicates Definition

Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 5

1 1 Algebra 2 Triple 2 expression 3 index 3 4 Spatial Text 4 5 index 5 Query index 6 6 execution Query plan 7 Data loading/ 7 8 8 indexing Files Data 9 9 access 10 10 Query Query Query Files 11 Optimizer 11 Parser Executor Buffer/Cache 12 12 13 Query Engine Data Indices Physical Storage 13 14 14 15 Fig. 1. The native architectural design 15 16 16 17 17 18 tics, such as Jena and Strabon, which collects the num- niques so that there will be enough space to not only 18 19 ber of times a predicate appears. Furthermore, other store the data but also for the other operations associ- 19 20 systems (i.e., Virtuoso and Stardog) cache query plans ated with query processing and data management [17]. 20 21 for later use. Some implementations falling into this category are 21 Jena, Hexastore[43], Bitmap[3], etc. 22 Query executor The query executor is responsible 22 23 for taking the query execution plan handed back by the Non-native storage The non-native storage relies on 23 24 query optimizer, recursively processing it by accessing the relational database system to store RDF data per- 24 25 the physical storage, and returning the query results. manently [11, 15, 20, 21, 44]. In this type of storage, 25 26 data is either stored in a single table (schema-free ap- 26 4.1.3. Physical storage 27 proach) or in a set of relational tables (schema-based 27 The final component is the physical storage which 28 approach) such as a subject table, predicate table, ob- 28 includes the database files, a buffer or its own file cache 29 ject table, etc. The advantage of the non-native storage 29 manager that manages the buffering of data to reduce 30 is the less demanding efforts in terms of design and im- 30 the number of disk accesses. The physical storage can 31 plementation. Nevertheless, in order to achieve good 31 be classified into two types, namely native and non- 32 system performance, an efficient mapping solution of 32 native storage [14]. 33 SPARQL-to-SQL and graph-to-relational schema have 33 34 Native storage The native storage treats RDF triples to be taken into consideration [14]. 34 35 as first-class citizens and stores them in a way that 35 36 is close to the RDF data model. There are two ap- 4.2. Hybrid architectural design 36 37 proaches to implement a native storage: persistent 37 38 disk-based and main memory-based. The persistent The hybrid architecture uses existing systems as 38 39 disk-based systems store the RDF data permanently in sub-components for the processing needed. The com- 39 40 files [22, 32, 33]. The advantage of these implemen- mon architecture of a hybrid solution is illustrated 40 41 tations is that the data is safe during a system restart. in Figure 2. In this architecture, the chosen sub- 41 42 However, it should be considered that the data writing components are accessed with different query lan- 42 43 and reading operations can be slow due to a perfor- guages and different input data formats. Hence, the hy- 43 44 mance bottleneck. brid approach needs a Query Rewriter, a Query Del- 44 45 In contrast to the disk-based approaches, the main egator and a Data Transformer. The Query Rewriter 45 46 memory-based ones keep data in the system mem- rewrites a SPARQL-like query to sub-queries that the 46 47 ory. All the data reading and writing operations oc- underlying systems can understand. The Query Del- 47 48 cur there, thereby improving the overall system per- egator is used to delegate the execution process by 48 49 formance. However, due to the limited size of the externalizing the processing to sub-systems with the 49 50 memory, this solution requires a memory-efficient data rewritten sub-queries. In some cases, the Query Del- 50 51 representation as well as composite index-based tech- egator also includes some components for correlating 51 6 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 and aggregating partial results returned from the hy- Virtuoso does not have a dedicated temporal index. In- 1 2 brid databases if they support this. The Data Trans- stead, the value with a timestamp is indexed as an RDF 2 3 former is responsible for converting input data to the literal value. Therefore, a temporary query is processed 3 4 compatible formats of the access methods used in the via the provided built-in SPARQL date-time functions. 4 5 hybrid databases. It also has to transform the query re- For RDF data, Virtuoso’s index scheme consists of five 5 6 sults from these systems to the format that the Query indices (psog, pogs, sp, op, gs) that have two full in- 6 7 Delegator requires. dices over RDF quads plus three partial indices. 7 8 By delegating the processing to available systems, Virtuoso serves SPARQL queries via a pre-assigned 8 9 building a system following the hybrid architecture public SPARQL endpoint. The query modules will 9 10 takes less effort than using the native approach. How- then translate an input, SPARQL query to the cor- 10 11 ever, the disadvantage of this approach include the lim- responding SQL query referring to the five triple 11 12 ited control over the sub-components as well as the store tables mentioned above. Virtuoso is backed by a 12 13 challenges involved to efficiently coordinate them. RDBMS and can adopt to all SQL optimization tech- 13 14 niques. For example, it provides a cost-based SQL op- 14 15 timizer which performs several types of query transfor- 15 16 5. Analyzing existing RDF stores for Linked mation, such as join order, index selection, selection of 16 17 Sensor Data join algorithms, etc. The Virtuoso cost model is based 17 18 on table row counts, defined indices and uniqueness 18 19 In this section, we will briefly analyze the most pop- constraints, and column cardinalities, i.e. counts of dis- 19 20 ular and mature existing RDF stores that qualify for the tinct values in columns. Additionally, histograms can 20 21 required features which have been identified and dis- be made for value distribution of individual columns. 21 22 cussed in Section 3. Table 1 summarizes the features 22 23 supported by these selected stores. Please be aware that 5.2. Stardog 23 24 the selection process has been carried out with the tar- 24 25 get of identifying suitable RDF stores that can be used Stardog1 is a knowledge graph platform that sup- 25 26 for efficiently processing Linked Sensor Data with re- ports RDF storage. Stardog follows the native architec- 26 27 spect to the provided features. A detailed systematic- ture design solution. Originally, it was implemented by 27 28 level evaluation of these stores will be presented later the developers of a well-known OWL reasoner (Pel- 28 29 in Section 7. let) [42]. At the time of writing, the latest version of 29 30 Stardog is v6.0.1 which supports full-text search, spa- 30 31 5.1. Virtuoso Open Source 31 tial and basic temporal filters. As a graph database, 32 32 Stardog supports ACID transactions and SPARQL 1.1 33 Virtuoso Open Source (v7.2.4) [15] is an example of 33 [14]. Unlike Virtuoso, this engine does not define its 34 the native architecture type and is also a widely-used 34 own spatio-temporal query language. Instead, it adopts 35 RDF store in the Semantic Web community. It is the 35 the WGS84 and OGC’s GeoSPARQL vocabularies 36 storage system that is hosting the DBPedia database 36 [35]. 37 [4]. Virtuoso is a general purpose RDBMS with ex- 37 There are two indexing modes supported in Stardog, 38 tensive RDF adaptations that are comprised of RDF- 38 one based only on triples and another one for quads. 39 oriented data types and a SPARQL-to-SQL front-end 39 Indexing data are stored on-disk for both cases. How- 40 compiler. In Virtuoso, RDF data can be stored as 40 ever, "in-memory" mode is also available. No other de- 41 RDF quads which are indexed properly in SQL tables. 41 tails in terms of index mechanisms are provided in any 42 The engine supports the SPARQL 1.1 language. A 42 publications. 43 SPARQL query will be translated properly to SQL via 43 Stardog follows the bottom-up approach to evalu- 44 the SPARQL-to-SQL compiler. Virtuoso’s user com- 44 ate the query execution plans generated from a given 45 munity is quite active and the is updated reg- 45 SPARQL query [25]. In the query plan tree, leaf nodes 46 ularly. 46 without input are evaluated first, and their results are 47 The current version of Virtuoso Open Source does 47 then sent to their parent nodes up the plan. Typical 48 not support clustering features. However, it supports 48 examples of leaf nodes include scans, i.e. evaluations 49 advanced spatial indexing (using RTrees) and full-text 49 50 search. The engine provides its own spatial and text 50 51 search query language via SPARQL built-in functions. 1http://stardog.com 51 Files

Files Spatial

Buffer/Cache value Triple Router Data Indices Physical Storage RDF triples Text value

Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 7

1 1 2 2 Query Data loading/ Data Triple Store 3 3 indexing Transformer 4 4 5 5 Spatial DB 6 6 Query Query Query Query 7 7 Parser Rewriter Optimizer Delegator 8 8 9 Text DB 9 Query Engine Data 10 10 Transformer 11 11 12 12 Fig. 2. The hybrid architectural design 13 13 14 14 15 of triple patterns, evaluations of full-text search pred- namely string-to-id and id-to-string. The default stor- 15 16 icates, and VALUES operators. Parent nodes, such as age of the former is implemented using B+ trees, and 16 17 joins, unions, or filters, take the results produced by the latter is based on a sequential access file. Triples 17 18 the leaf nodes as inputs, process them and send their and quads are stored in specialized structures. Triples 18 19 results further towards the root of the tree. The root are held as thre identifiers in the node table while quads 19 20 node of the plan tree, which is typically one of the solu- are assigned as four. Again, B+ trees are used to persist 20 21 tion modifiers2, produces the final results of the query these indices. 21 22 which are then encoded and sent to the client. Query execution in Jena involves both static and dy- 22 23 Another important feature of Stardog is clustering namic optimizations [1]. Static optimizations aim to 23 24 support which allows horizontal scaling. However, this transform the algebras of the execution tree into new, 24 25 feature is only available in the commercial version. equivalent and optimized algebra forms. This transfor- 25 26 26 According to a recent platform report, Stardog can pro- mation process is performed in advance of the query 27 27 cess 10 billion triples stored on a single server. execution. Meanwhile, dynamic optimizations involve 28 deciding on the best execution approach during the ex- 28 29 29 5.3. ecution phase and can take into account the actual data 30 30 so far retrieved. 31 31 Apache Jena3 is an open source SPARQL 1.1 frame- A number of optimization strategies are provided: 32 32 work for Java. It provides an API to extract data from a statistics-based strategy, a fixed strategy and a strat- 33 33 egy of no reordering. For the statistics-based strategy, 34 and write to RDF graphs. Jena implements the native 34 the Jena optimizer uses information captured in a per- 35 architecture design which includes the optional quads 35 database statistics file to specify the join order of the 36 RDF storage layers, namely TDB (file system), SDB 36 query triple patterns. For that, the statistics file takes 37 (SQL DBMS), and in memory. Jena also provides in- 37 the form of a number of rules for approximate match- 38 ference support (supporting RDFS, OWL-Lite or us- 38 ing counts for triple patterns. The file can be automat- 39 ing custom rules), but it works only on triple stores and 39 ically generated when the engine starts. Users can up- 40 not on quadruples stores. From version 3, Jena sup- 40 date it manually by adding and modifying rules to tune 41 ports full-text search and basic spatial index through 41 the database based on higher-level knowledge, such as 42 the use of Lucene or Solr. No clustering solution has 42 inverse function properties. 43 been mentioned. 43 44 Regarding the indexing architecture, the one in Jena 44 5.4. RDF4J 45 is built around three concepts, namely a node table, 45 46 triple/quad indexes, and a prefixes table. Among them, 46 4 47 the node table is responsible for storing the dictio- RDF4J (formerly known as Sesame [10]) is an 47 48 nary which provides two mappings for the RDF terms, open source Java framework for processing RDF data. 48 49 It is a native RDF processing engine which includes 49 50 2https://www.w3.org/TR/sparql11-query/#solutionModifiers 50 51 3https://jena.apache.org/ 4http://rdf4j.org 51 8 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 parsing, storing, inferencing and querying over RDF. A spatial index is then created by applying an R-tree- 1 2 Similar to Apache Jena, RDF4J also supports two over-GiST algorithm [23]. 2 3 out-of-the-box RDF databases (in-memory and native Query processing in Strabon is implemented by 3 4 store) along with third party storage solutions. It also modifying the RDF4J components. The query engine 4 5 fully supports SPARQL 1.1, full-text search and spatial consists of a parser, an optimizer, an evaluator and a 5 6 index in conjunction with the GeoSPARQL language. transaction manager. Moreover, besides the standard 6 7 RDF4J has no clustering feature. formats offered by Sesame, Strabon offers the KML 7 8 The key component of RDF4J is a “Storage And and GeoJSON encodings, which are widely used for 8 9 Inference Layer” (SAIL). In short, SAIL is an appli- representing the spatial data. Strabon applies exactly 9 10 cation programming interface that offers RDF-specific the same as Sesame’s query optimization techniques 10 11 methods to its client and translates these methods to for the standard SPARQL part of a stSPARQL query. 11 12 calls to its specific underlying DBMS. The benefit of For the spatial extension functions, an additional opti- 12 13 the introduction of SAIL is the flexible implementa- mization step is introduced. In this step, the cost of the 13 14 tion of RDF4J on top of a wide variety of repositories spatial functions presented in the query will be eval- 14 15 without changing other RDF4J’s components. Conse- uated by PostGIS. Based on the cost estimated, the 15 16 quently, in addition to its own in-memory and disk- query tree is then optimized and executed. 16 17 based repositories, RDF4J can be easily attached to 17 18 other DBMS such as MySQL, PostgreSQL, etc. The 18 19 repository used in this paper is the default native disk- 6. Evaluation methodology and metrics 19 20 based repository, which originally provided by RDF4J. 20 21 Regarding the query optimization used in RDF4J, no The previous sections give insight to the architec- 21 22 detailed information are publicly provided. ture and available features of selected RDF stores that 22 23 can be applied for Linked Sensor Data. This section 23 24 will describe in detail the methodology and the met- 24 5.5. Strabon 25 rics used to evaluate the performance of these systems. 25 26 We have triggered the performance of Jena, RDF4J 26 Strabon[28] is an open-source semantic DBMS that 27 and Strabon by modifying their source code. Virtuoso 27 can be used to store linked geospatial data expressed 28 and Stardog already provide some metrics themselves, 28 in stRDF format and query it using the stSPARQL 29 which we retrieve by using the Virtuoso JDBC/CLI 29 language [27]. It follows the hybrid architecture de- 30 and Stardog , respectively. 30 sign and is backed by a PostGIS extension of Post- 31 Our evaluation methodology is carried out within 31 32 gres RDBMS plus RDF4J, allowing it to manage two phases. In the first phase, we will evaluate the data 32 33 both semantic and spatial RDF data. Strabon supports loading performance of the RDF stores over the linked 33 34 SPARQL 1.1. It provides spatial, temporal search and sensor dataset. In this experiment, we evaluated sepa- 34 35 basic text search. In contrast to other engines, Strabon rately the loading performance of semantic, spatial and 35 36 provides an advanced temporal search which allows text data. 36 37 users to query Allen’s temporal relationship between The second phase is to evaluate the query execution 37 38 two temporal objects. In terms of indexing capability, performance. We firstly define a benchmark queries 38 39 the developers of Strabon reported that it can scale up set which is performed over our linked meteorological 39 40 to 500 million triples. No clustering solution is avail- dataset. These queries and dataset are described later 40 41 able for Strabon. in Section 7. After having these benchmark queries de- 41 42 When data is imported into to Strabon, the data is fined, we evaluate the query performance by executing 42 43 firstly decomposed into stRDF triples and each triple these queries. In addition to the overall query execu- 43 44 is processed separately. Data is stored using dictio- tion time, we also measure the execution performance 44 45 nary encoding that follows a “per-predicate” scheme. of query parsing, query optimization and query exe- 45 46 The “per-predicate” scheme uses vertical partitioning cution. The query parsing time is calculated from the 46 47 to store triples in different tables based on their pred- time the system retrieves the query string to the time 47 48 icate, and one table per predicate is maintained. The the query algebra tree is generated. Similarly, the query 48 49 B-Tree algorithm is then applied on these predicate ta- optimization process is considered starting from the 49 50 bles. For spatial literals found during the data loading, time the query tree and it is finished when an execution 50 51 Strabon stores them in a dedicated geo_values table. plan is delivered. For simplicity, any other run-time de- 51 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 9

1 Table 1 1 2 RDF stores comparison 2 3 Technical characteristics 3 4 Latest release 4 Query language Spatial filter Full-text search Temporal filter License Clustering 5 Date 5 6 Virtuoso 7 SPARQL, SQL Y Y Y (basic) Cm,OS N Oct 2018 6 7 Stardog 6 SPARQL, SQL Y Y Y (basic) Cm,OS Y Jan 2019 7 8 Apache Jena SPARQL Y Y Y (basic) OS N March 2019 8 9 RDF4J SPARQL Y Y Y (basic) OS N April 2019 9 10 Strabon SPARQL Y N Y (advance) OS N Jan 2018 10 11 11 12 cisions are considered as part of the query execution Table 2 12 13 rather than part of the query optimization. The query Benchmark Linked Meteorological Dataset 13 14 execution is finished when the last result has been re- Year Num. Observation Num. Triples (billion) Size (Gb) 14 15 ceived. 2016 79,305,738 0.555 112 15 16 2017 194,748,696 1.363 274 16 17 2018 357,237,456 2.5 503 17 18 7. Experimental Settings Sum 631,291,890 4.419 889 18 19 19 7.1. Benchmark dataset 20 Table 3 20 21 Our experiments are conducted over the linked me- Static datasets description 21 22 22 teorological dataset, a sub-set of our GoT dataset [29]. Dataset Static dataset 23 23 The meteorological dataset consists of over 30 years Num. spatial/text object (million) 20 24 24 of meteorological data for over the world. The dataset Num. triples (million) 120 25 25 is categorized into two parts, namely static and dy- Raw Data size 20G 26 namic dataset. The static one describes the station and 26 27 sensor descriptions, such as spatial information, text 27 7.2. Benchmark queries 28 data, etc, and is not frequently updated. The dynamic 28 29 dataset contains the observation data generated by the 29 30 sensor station described in the static dataset. The dy- We have selected a set of 11 queries that performs 30 31 namic dataset is frequently updated. over our benchmark dataset. In general, our bench- 31 32 Due to the large size of the meteorological dataset mark queries aim to test the engine processing ca- 32 33 and also because of our limited infrastructure re- pability with respect to their provided features for 33 34 sources, only a subset of this dataset is used in our ex- querying Linked Sensor Data. As previously men- 34 35 periments. As described in Table 2, this subset con- tioned, because the standard SPARQL 1.1 language 35 36 tains the static dataset and the observation data from does not support spatio-temporal queries nor the full- 36 37 year 2016 to 2018, results in a set of 2.5B triples. text queries, some RDF stores have to extend the 37 38 The spatial and text data loading performance are SPARQL language with their own specific syntax. 38 39 evaluated over the static dataset. To provide a more Therefore, some of these queries need to be rewritten 39 40 comprehensive evaluation, we enlarge this dataset by so that they are compatible with the engine under test. 40 41 generating more spatial and text objects on the ba- We summarized some highlight features of the 41 42 sis of the SOSA/SSN data model [13, 24], same data benchmark queries as follows: (i) if the query has in- 42 43 model used for our linked meteorological data. The put parameter; (ii) if it requires geo-spatial search; (iii) 43 44 static dataset is described in Table 3. if it uses temporal filter; (iv) if it uses full-text search 44 45 The dynamic dataset presenting the observation data on string literal; (v) if it has Group By feature; (vi) if 45 46 is used to evaluate the query performance in the sec- the results needs to be ordered via ORDER By oper- 46 47 ond phase of our experiment. The observation data are ator; (vii) if the results are using the LIMIT operator; 47 48 extracted within the time period from Jan 2016 to May (viii) the number of variables in the query; and (ix) the 48 49 2016 (250M triples), as described in Table 4, which we number of triples patterns in the query. The group-by, 49 50 believe could indicate a general performance measure order-by and limit operators imply the effectiveness of 50 51 for this test. the query optimization techniques used by the engine 51 10 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 Table 4 1 2 Datasets description for query performance evaluation (from 01/2016 to 05/2016) 2 3 Dataset Num. observation (million) Num. Triples (million) Size 3 4 2 months 17.5 122.4 2.8G 4 5 3 months 22.2 155.6 4.3G 5 6 4 months 25.8 180.9 6.5G 6 7 5months 30.5 213.3 7.1G 7 8 Static A 0.75 135Mb 8 9 9 10 (e.g., parallel unions, ordering or grouping using in- 8.1. Data loading throughput 10 11 dexes, etc.), and the number variables and triples pat- 11 12 terns give a measure of query complexity. These sum- 8.1.1. Triple loading performance 12 13 mary of highlight features are described in Table 5 and The triple loading performance of the test stores is 13 14 their SPARQL representations are presented in the Ap- depicted in Figure 3. We can see how the performance 14 15 15 pendix 9. is affected when the size of the RDF dataset increases. 16 According to the results, Virtuoso is the fastest, fol- 16 17 17 7.3. Platform lowed by Stardog and Jena with about 2 times and 6 18 times slower than Virutoso, respectively . RDF4J is the 18 19 slowest being about 10 times slower than Virtuoso. Ac- 19 20 We conducted our experiments on a physical server 20 cording to our observation, there are two possible ex- 21 which has following configuration: 2x E5-2609 V2 In- planations for the poor loading performance of RDF4J: 21 22 tel Quad-Core Xeon 2.5GHz 10MB Cache, Hard Drive (1) The high frequent index updates due to the rela- 22 23 3x 2TB Enterprise Class SAS2 6Gb/s 7200RPM - 3.5" tively small page size used by RDF4J; (2) The HTTP 23 24 on RAID 0, Memory 128GB 1600MHz DDR3 ECC communication latency between the RDF4J client and 24 25 Reg w/Parity DIMM Dual Rank. server components. Note that, RDF4J does not support 25 26 bulk data import. Instead, it provides a set of REST 26 27 7.4. Setup API functions for client to access the RDF database 27 28 either for reading, writing or querying data. This con- 28 29 29 We have experimented on Jena v3.9.0, RDF4J sequently leads to the dramatical increase in the net- 30 work communication latency time when the number of 30 31 v2.6.5, Stardog v6.0.1 and Virtuoso Open Source 31 v7.2.5. For Virtuoso, the system parameters Num- request access to the database goes up. This reason is 32 also explained for the poor data loading performance 32 33 berOfBuffers and MaxDirtyBuffers were set to 40GB. 33 Java heap size for Jena and RDF4J is also set to 40GB. of Strabon. 34 Regarding the required repository size, it can be ob- 34 35 Besides, for Jena, we have enabled its optimization 35 strategy option to statistic strategy. Regarding RDF4J served in Table 6 that this metric has a linear increase 36 with respect to the dataset size. Unfortunately, due to 36 configuration, we have configured the index mecha- 37 the complex hybrid architecture of Strabon, its repos- 37 38 nism to spoc, posc and opsc. For Stardog, as we will 38 itory size metric can not be measured. For Virtuoso, 39 not evaluate the inference capability, we disable this 39 its index compression methods pay off, resulting in 40 option when importing data. The spatial and textual 40 the smallest index size. Stardog uses 80% more space 41 41 index options were enabled. Similar to other engines, than Virtuoso. Jena and RDF4J generate much larger 42 42 Stardog memory size was also set to 40 GB. The rest indexes about 4 times larger than those of Virtuoso, 43 43 of the parameters were left to the default values. though they have less indexes. 44 44 45 8.1.2. Spatial and text data loading performance 45 46 Unlike the existing performance studies that only fo- 46 47 8. Results and Discussion cus on general data loading performance, our evalua- 47 48 tion also measures the loading performance on spatial 48 49 In this section we present and discuss the experi- and text data. In this regard, we measure the loading 49 50 mental results following the evaluation methodology speed via the number of object can be indexed per sec- 50 51 and metrics described in Section 6. ond, instead of the number of triples. An object con- 51 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 11

1 Table 5 1 2 Benchmark queries characteristics 2 3 Query Parametric Spatial filter Temporal filter Text search Group By Order By LIMIT Num. variables Num. triple patterns 3 4 1 XX 3 3 4 5 2 X 3 4 5 3 7 8 6 XX 6 4 7 8 7 XXXXXX 7 5 XXXX 4 5 8 8 6 XX 5 8 9 9 7 XXXXX 7 8 10 10 8 XX 3 4 11 11 9 XX 6 7 12 10 XXXX 7 9 12 13 11 X 2 1 13 14 14 15 Table 6 15 16 Required repository capacity (in GB) with respect to dataset size 16 17 RDF Engine/Data size 500 (mil) 750 (mil) 1000 (mil) 1250 (mil) 1500 (mil) 1750 (mil) 2000 (mil) 17 18 18 RDF4j 60 79 99 121 142 NA NA 19 19 20 Virtuoso 7 14 19 23 32 39 47 56 20 21 Stardog 24 43 64 82 102 125 146 21 22 Apache Jena 62 89 117 135 164 NA NA 22 23 Strabon NA NA NA NA NA NA NA 23 24 Chart Title 24 25 200 loading speed can reach to 1306 object/sec. Its loads 25 26 180 20 (mil) dataset in about 5.6 hours. Stardog performs 26 27 160 better than Jena, results in 4 hours for loading the same 27 28 28 140 dataset and the average speed is almost 14k object/sec. 29 29 120 In comparison, Virtuoso achieves much faster loading 30 30 100 speed than others, which can reach to 3287 object/sec. 31 31 80 However, a drawback of the impressive loading speed

32 Loading (hour)time 32 60 is the increase of number triples stored in Virtuoso 33 33 40 with respect to the original data provided. This is be- 34 20 34 cause the additional geo:geometry spatial triples which 35 0 35 are generated during the transformation of the geo:lat 36 500 750 1000 1250 1500 1750 2000 36 Number of loaded triples (million) geo:long 37 and triples to enable the geo-spatial indexing. 37 RDF4j Virtuoso 7 Stardog Apache Jena Strabon 38 38 39 Fig. 3. RDF stores performance of RDF data loading 8.2. Query performance 39 40 40 41 The set of figures 6 to 16 presents the average query 41 42 sists of spatial and text description. This evaluation response time concerning Jena, Virtuoso, RDF4J, Star- 42 43 helps us to have a better understanding about the in- dog and Strabon with respect to the different time hori- 43 44 dexing behaviour of the test engines for specific types zons of two, three, four and five months observation 44 45 of data such as geo spatial and text. data of year 2016, respectively. The datasets used to 45 46 The spatial and text data loading time with respect evaluate the query performance are described in Table 46 47 to dataset size is shown in Figure 4. Figure 5 depicts 4. The benchmark queries have been executed by per- 47 48 the average loading speed. As described, RDF4J and forming a pseudo-random sequence of these queries 48 49 Strabon perform poorly - their average loading speed repeated ten times in order to minimize the caching ef- 49 50 is less then 1000 object/sec when loading 20 millions fect. Beside, we also empty the system cache before 50 51 objects. Apache Jena has a better performance and its running the query of each test execution. 51 12 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data Spatial/text object average loading/indexing performance 1 25000 6000 1 2 2 20000 5000 3 3 4000 4 15000 4 5 3000 5 6 10000 6 2000 Num. object/sec

7 (second) time Indexing 7 5000 1000 8 8 9 0 0 9 1 5 10 15 20 1 5 10 15 20 10 10 Number of loaded objects (million) Number of loaded objects (million) 11 RDF4j Virtuoso 7 Stardog Apache Jena Strabon RDF4j Virtuoso 7 Stardog Apache Jena Strabon 11 12 12 13 Fig. 4. Spatial/text data indexing time Fig. 5. Average spatial/text data indexing speed 13 14 14 15 Looking at the evaluation results, unsurprisingly, the and resource-consuming, results in the considerable 15 16 query performance of all systems decreases with the increase of the query execution time. (2) The ineffi- 16 17 growth of dataset. However, in the cases of no spatio- cient query execution strategies that have been applied. 17 18 temporal nor full-text searches are involved (Q2, Q11), Inappropriate execution plan also consequently effects 18 19 the query performances remain stable and are weakly to the query performance. 19 20 effected by the increase of dataset size. This is un- 20 21 derstandable because these queries are only performed 8.3. Cost breakdown 21 22 over the static dataset, which is unchanged and not in- 22 23 fluenced by the dynamic data. Virtuoso and Stardog Table 7 shows the cost breakdown between query 23 24 perform closely and require less execution time, in the parsing, query optimization and execution, across all 24 25 order of ms. For example, in the case of 5 months stores and queries for 5 months dataset. It is easy to 25 26 dataset, Virtuoso takes 83 (ms) to execute the Q2 while realize that, except Virtuoso, for most stores, the query 26 27 it is 137 (ms) in Stardog. Regarding the others, the per- run-time cost is mostly dominated by the execution 27 28 formances of RDF4j and Strabon are comparable, fol- time. For example, looking at the Q3 cost breakdown, 28 29 lowed by Jena. the execution costs of Jena and Stardog are 48,833 29 30 For the query that has only spatial filter and ba- (ms) and 11410 (ms), taking over 97% and 78% of 30 31 sic triple matching such as Q1, Virtuoso executes this query response time, respectively. Meanwhile, Virtu- 31 32 query in about 80(ms) for 5 months dataset. Following oso dominates only 74 (ms) which takes about 12% the 32 33 this is Stardog with 112(ms). Jena and RDF4J appear total cost. 33 34 to have the worst performance with average response The query parsing time only takes a small fraction 34 35 times are 50128(ms) and 1761(ms), respectively. We of the total cost. This is applied for all stores. Regard- 35 36 attribute this to the effectiveness of the query optimizer ing the cost of query optimization, we observe that this 36 37 in each system. Wrong execution plan might lead to cost is different across the systems. Specifically, Virtu- 37 38 a catastrophic performance. We will discuss this fur- oso spends significantly more time with respect to oth- 38 39 ther in the following subsection that is about the cost ers. An example of this is the Q6, Virtuoso spends 48 39 40 breakdown of query performance. (ms) over the total time 49 (ms) for the query optimiza- 40 41 Another aspect to be considered is the mixing of tion, taking almost 100%. In contrast, the optimization 41 42 spatial filter, time filter and full-text search queries cost of Q6 is 80 (ms) in Stardog, about 55% of the total 42 43 (Q3, Q4, Q5, Q9, Q10). According to the evaluation cost. 43 44 results, Virtuoso is still best ranked with less then 1(s) In the following, for further analyzing the cost 44 45 of execution time for 5 months dataset, followed by breakdown, we choose the Q10 as the representative 45 46 Stardog. With Jena, RDF4J and Strabon, we obtained for all queries. The reason for this is due to its high 46 47 an extremely high query execution time, greater than complexity, that covers almost all query characteristics 47 48 50(s). Analyzing the cost breakdown of these queries, and also demands a heavy computation. Analyzing the 48 49 we derive two possible explanations: (1) The growth of results in Figure 17, we observes that: (1) Optimiza- 49 50 the observation dataset size, which certainly requires tion costs of this query in Virtuoso, Stardog and Jena 50 51 more data processing operations (look up, read, etc) are fairly consistent and are not significantly affected 51 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 13

1 1 2 2 3 3 Exe. (ms) 4 4 5 5 6 6 Opt. (ms)

7 Strabon 7 8 8 9 9 Pars. (ms) 118 5512019 100 88 314021 1203 4859 59830 23 85 55427 30 9015 6 214717 815 1425 56817 33 3549 568 6 1187 76977 71636 19 3 10 10 11 11 12 12

13 Exe. (ms) 13 14 14 15 15

16 Opt. (ms) 16 Stardog 17 17 18 18

19 Pars. (ms) 211 1011 94 9 2 30851 42 3115 11410 1 73 10883 1 801 66 2042 64 71 3 1 105 41 3093 136 46 11201 67 19 20 20 21 21 22 22

23 Exe. (ms) 23 24 24 25 25 RDF4j 26 Opt. (ms) 26 27 27 28 28

29 Pars. (ms) 9616 54215 1100 1210 250015 127 1432 226414 16 90 309528 21 8138 14 143012 356 227404 21 1100 21134 868 1546 151653 144141 20 4 29 30 30 31 31 Exe. (ms) 32 32 33 33 34 34 Opt. (ms) 35 35 Virtuoso 36 36

37 Table 7: Cost breakdown of evaluated queries for 5 months dataset 37 Pars. (ms) 111 752 701 4 5001 12 560 74 2 50 27 1 481 4 722 0 651 4 110 4 635 8 15 8 0 38 38 39 39 40 40

41 Exe. (ms) 41 42 42 43 43 Jena 44 Opt. (ms) 44 45 45 46 46

47 Pars. (ms) 107 181415 48304 13 105 132014 53 1840 48833 13 1410 48305 20 163 48739 10 173116 10 159 48415 22 17745 373 1677 48388 48470 37 410 47 48 48 49 49 50 50 Query Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 51 51 14 Q1Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked SensorQ2 Data 1 1360 1 2 1260 2 3 1160 3 1060 4 12000 4 960 5 860 5 6 760 6 7 660 7 560 8 8 460 9 360 9

10 (ms) time QueryExecution 260 10 Query Execution time (ms) time QueryExecution 11 160 11 60 60 12 12 2 months 3 months 4 months 5 months 2 months 3 months 4 months 5 months 13 Dataset Dataset 13 14 Jena Virtuoso RDF4j Stardog Strabon Jena Virtuoso RDF4j Stardog Strabon 14 15 15 Fig. 6. Q1 execution time 16 Q3 Fig. 7. Q2 execution time 16 17 Q4 17 18 18 19 19 20 20 21 100000 21 22 22 12000 23 23 24 24 25 25 26 26 Query Execution time (ms) time QueryExecution 27 27 Query Execution time (ms) time QueryExecution 28 28 500 60 29 2 months 3 months 4 months 5 months 2 months 3 months 4 months 5 months 29 30 Dataset Dataset 30 31 Jena Virtuoso RDF4j Stardog Strabon Jena Virtuoso RDF4j Stardog Strabon 31 32 32 Fig. 8. Q3 execution time Fig. 9. Q4 execution time 33 Q5 Q6 33 34 34 50000 1430 35 35 36 1230 36 37 37 1030 38 5000 38 39 830 39 40 40 630 41 41 500 42 430 42 43 43 Query Execution time (ms) time QueryExecution 230 44 (ms) time QueryExecution 44

45 50 30 45 46 2 months 3 months 4 months 5 months 2 months 3 months 4 months 5 months 46 47 Dataset Dataset 47 48 Jena Virtuoso RDF4j Stardog Strabon Jena Virtuoso RDF4j Stardog Strabon 48 49 49 Fig. 10. Q5 execution time Fig. 11. Q6 execution time 50 50 51 51 HoanQ7 Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor DataQ8 15

1 2230 1 2 2030 2 3 1830 3 1630 4 4 1430 5 5 12000 1230 6 6 1030 7 7 830 8 8 630 9 9 430 10 10 230 Query Execution time (ms) time QueryExecution 11 (ms) time QueryExecution 11 60 30 12 2 months 3 months 4 months 5 months 2 months 3 months 4 months 5 months 12 13 Dataset Dataset 13 14 Jena Virtuoso RDF4j Stardog Strabon Jena Virtuoso RDF4j Stardog Strabon 14 15 15 Fig. 12. Q7 execution time Fig. 13. Q8 execution time 16 Q9 Q10 16 17 17 18 18 19 19 20 20 21 21

22 12000 12000 22 23 23 24 24 25 25 26 26

27 (ms) time QueryExecution 27 Query Execution time (ms) time QueryExecution 28 60 60 28 2 months 3 months 4 months 5 months 29 2 months 3 months 4 months 5 months 29 Dataset Dataset 30 30 Jena Virtuoso RDF4j Stardog Strabon Jena Virtuoso RDF4j Stardog Strabon 31 31 32 Fig. 14. Q9 execution time Q11 Fig. 15. Q10 execution time 32 33 33 34 34 35 460 35 36 410 36 37 360 37 38 310 38 39 39 260 40 40 210 41 41 160 42 42 110 43 43 60

44 (ms) time QueryExecution 44 45 10 45 2 months 3 months 4 months 5 months 46 46 Dataset 47 47 Jena Virtuoso RDF4j Stardog Strabon 48 48 49 Fig. 16. Q11 execution time 49 50 50 51 51 16 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 by the dataset size. We attribute this to the effective- 1 (join 2 2 ness of query execution plan caching and the statistical (quadpattern 3 approach taken in these systems. (2) On the contrary, (quad 3 ?station rdf:type got:ontology/WeatherStation) 4 4 RDF4J and Strabon clearly present a growth in opti- (quad 5 ?station geo:hasGeometry ?geoFeature) 5 mization costs with respect to the dataset size. This in- (propfunc spatial:withinCircle 6 6 dicates that these two systems have applied different ?geoFeature (59.783 5.35 20 "miles") 7 (table unit) 7 optimization techniques in the presence of different )) 8 8 workloads. 9 Listing 1: Jena optimization plan for Q1 9 10 Following these two observations mentioned above, 10 11 it still lacks of evidences to conclude which strategy 11 12 performs better. It is evidenced by the unpredictable 12 13 query processing behaviour that draws our attention 8.4. Discussion 13 14 on Q1. Note that, the complexity of this query is low 14 15 as it only requires simple spatial computation in con- Our performance study addresses a number of well- 15 16 junction with semantic triple matching. Among the know RDF stores such as Virtuoso, RDF4J, Apache 16 17 four stores, Virtuoso spends 98% query run-time for Jena, Stardog and Strabon in terms of data loading 17 18 performance and query execution behaviours on pro- 18 query optimization. For Q1, the total query run-time 19 cessing over Linked Sensor Data. In order to analyze 19 is 80(ms) and Virtuoso takes 75(ms) for query opti- 20 the systems strengths and weakness of these stores, 20 mization. However, the significant effort for optimiza- 21 we first carefully assess their data loading performance 21 22 tion pays off. Figure 18 illustrates Q1 execution costs on sensor data. In this regards, along with the general 22 23 across different stores, where Virtuoso significantly RDF data loading evaluation, we also study the spa- 23 24 outperforms other stores. In the meanwhile, RDF4J tial and text data loading speed. Generally, Virtuoso 24 25 spends only 30% time for query optimization on Q1, and Stardog perform better than others. Moreover, we 25 26 much less than Virtuoso, but the execution time is ex- also learn that they have better supports on bulk and 26 27 tremely high, which is 1210(ms), far worse than Virtu- near real-time data import. In our opinion, this is an 27 28 oso. advanced feature that should be considered when se- 28 29 In the same case of Q1, Apache Jena spends more lecting a RDF engine for sensor data. 29 30 time than RDF4J for query optimization, 1814(ms) Along with the different data aspects loading assess- 30 31 ment, corresponding query performance evaluations 31 in comparison with 542(ms). However, it results in a 32 are also discussed. Unlike the existing approaches that 32 catastrophic performance. Analyzing the Jena query 33 only observe the overall query performance, we also 33 processing log, we attribute this to the inappropriate 34 analyze the dynamics and query processing behaviours 34 35 query execution plan generated for Q1. According to of the test stores at a deeper detailed level. We split the 35 36 the data statistic, a most efficient way to execute Q1 is query execution process into a series of sub processes, 36 37 that the spatial filter with the low selectivity should be which are query parsing, query optimization and exe- 37 38 executed first. Thereafter the results will be joined with cution. Each sub process is then evaluated separately. 38 39 the semantic triple matching with higher selectivity. Through the evaluation results, with the general as- 39 40 This execution plan helps to eliminate all unnecessary sessment, Virtuoso outperforms all other stores. It is 40 41 intermediate results so that it will reduce the number followed by Stardog and Strabon. Moreover, we also 41 42 of join operations. Unfortunately, Jena executes this learn that the time cost for each process is different 42 43 query in the opposite way, as shown in Listing 1, that across these stores. For example, Virtuoso dedicates 43 44 significantly most time for query optimization process 44 requires a lot of join operations. This results in the dra- 45 with respect to the others. On the contrary, Jena and 45 matical increase of query processing time. We attribute 46 RDF4J mostly focus on the execution process. 46 47 this to a lack of spatio-temporal statistical information Beside the judgments about the overall engines per- 47 48 about the dataset. Although Jena already provided a formance, several further findings related to data in- 48 49 function to generate the statistic of a specified dataset, dex and query optimization have been derived: (1) Due 49 50 however, it only works for standard RDF dataset, not to the poor performance of the queries that requires 50 51 for spatio-temporal dataset. analytical computing on sensor observation data, such 51 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 17 Q1 execution time 1 3500 1 100000 2 3000 2 3 3 10000 4 2500 4 5 5 2000 1000 6 6 1500 7 7 100 8 1000 8 9 9 500 (ms)time Query execution 10 10 (ms)time Query optimization 10 11 0 11 2 months 3 months 4 months 5 months 1 12 RDF4j Virtuoso 7 Stardog Apache Jena Strabon 12 Dataset 13 RDF engines 13 Jena Virtuoso RDF4J Stardog Strabon 14 14 Fig. 18. Q1 execution costs for 5 months dataset (in logscale) 15 Fig. 17. The optimization time of Query 10 by varying increasing dataset 15 16 16 17 as aggregation, a proper time series index approach weaknesses of the currents RDF stores implementation 17 18 is needed. Strabon has a dedicated temporal index for when applying for Linke Sensor Data. 18 19 data with time-stamp, but it just focuses on querying For future work, we are currently planning more ex- 19 20 temporal relations of two temporal objects rather than periments on spatio-temporal distributed RDF stores. 20 21 on supporting efficient analytical functions . (2) Select- We want to have a more comprehensive understanding 21 22 ing a wrong query execution plan is potentially catas- about the spatio-temporal query processing behaviours 22 23 trophic. In our experiments, for Jena, failure in query in a distributed environment. Our long-term goal is to 23 24 planning results in the worse performance. (3) There develop of a scalable spatio-temporal query processing 24 25 is a lack of statistic data on spatial and temporal as- engine for Linked Sensor Data. 25 26 pects of sensor dataset. This is evidenced by the in- 26 27 efficient query planning (Q1,Q3, Q4, Q5) of the RDF 27 28 28 stores that use statistic optimization strategy such as Acknowledgements 29 Jena and RDF4J. 29 30 30 31 This paper has been funded in part by Science Foun- 31 32 9. Conclusion and Future Work dation Ireland under grant numbers SFI/12/RC/2289 32 33 and SFI/16/RC/3918 (co-funded by the European Re- 33 34 This paper presents our recent efforts to provide a gional Development Fund), the BIG-IoT project un- 34 35 comprehensive performance study of RDF stores for der grant number 688038, and the Marie Skłodowska- 35 36 Linked Sensor Data. One of the goals of our paper Curie Programme SMARTER project under grant 36 37 is to summarize the list of fundamental requirements number 661180. 37 38 of RDF stores so that they can be used for managing 38 39 and querying sensor data. We have also described the 39 40 abstract architecture design of current RDF database Appendix - Benchmark Queries SPARQL 40 41 technologies that support spatio-temporal queries. An- Representations 41 42 other main contribution of our work is to provide a 42 43 comparative analysis of data loading and query perfor- 43 44 mance of a representative set of RDF stores, namely 44 45 Apache Jena, Virtuoso, RDF4J, Stardog and Strabon. PREFIXES 45 46 In our evaluation, particular attentions have been given PREFIX sosa: 46 47 PREFIX geo: 47 on evaluating the performance of geo-spatial search, PREFIX wgs84: 48 temporal filter and full-text search over sensor data. PREFIX got: 48 PREFIX geoname: 49 Since such assessment aspects have not been fully con- 49 50 sidered and addressed by the existing works, our pa- 50 51 per gives valuable insights about the strengths and 51 18 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 Query 1. Given the latitude and longitude position, it 1 SELECT ?month (avg(?value) as ?avgTemp) 2 retrieves the nearest weather station within 10 miles WHERE 2 { 3 3 SELECT ?station ?coor ?sensor sosa:isHostedBy <$station URI$>. 4 WHERE ?sensor sosa:observes got:WindSpeedProperty. 4 { ?obs sosa:madebySensor ?sensor; 5 ?station a got:WeatherStation. sosa:resultTime ?time; 5 ?station geo:hasGeometry ?geoFeature. sosa:hasSimpleResult ?value. 6 ?geoFeature wgs84:geometry ?coor. filter (year(?time)=$year$) 6 7 FILTER ( } 7 (?coor, ($long$,$lat$),$radius$)). GROUP BY (month(?time) as ?month) 8 } ORDER BY ?month 8 9 9 10 10 11 Query 2. Given the country name, it retrieves the total 11 Query 6. Given a date, it retrieves the total number of 12 number of weather station deployed in this country 12 13 observation that were observed in California state 13 14 SELECT (count(?station) as ?total) 14 WHERE SELECT count(?obs) as ?number 15 { WHERE 15 ?station a got:WeatherStation. { 16 ?station geo:hasGeometry ?geoFeature. ?station a got:WeatherStation. 16 ?geoFeature wgs84:geometry ?coor. ?station geo:hasGeometry ?geoFeature. 17 ?geoFeature geoname:parentCountry"$country name$". ?geoFeature geoname:parentADM1"California". 17 18 } ?geoFeature geoname:parentCountry"United States". 18 ?sensor sosa:isHostedBy ?station. 19 ?sensor sosa:observes got:SurfaceTemperatureProperty. 19 ?obs sosa:madebySensor ?sensor; 20 sosa:resultTime ?time. 20 filter (year(?time)=$year$ && month(?time)=$month$ 21 Query 3. Given the country name and year, it detects && day(?time)=$day$). 21 22 the minimum temperature value that has been observed } 22 23 for that specified year 23 24 24 SELECT min(?value) as ?min ?station 25 WHERE 25 26 { Query 7. Given the latitude, longitude and radius , it 26 ?station a got:WeatherStation. 27 ?station geo:hasGeometry ?geoFeature. retrieves the latest visibility observation value of that 27 ?geoFeature geoname:parentCountry"$country name$". 28 ?sensor sosa:isHostedBy ?station. area 28 ?sensor sosa:observes got:SurfaceTemperatureProperty. 29 ?obs sosa:madebySensor ?sensor; SELECT ?value ?time 29 30 sosa:resultTime ?time; WHERE 30 sosa:hasSimpleResult ?value. { 31 filter (year(?time)=$year$). ?station a got:WeatherStation. 31 } ?station geo:hasGeometry ?geoFeature. 32 ?geoFeature wgs84:geometry ?coor. 32 FILTER ((?coor, 33 ($long$,$lat$),$radius$)). 33 34 ?sensor sosa:isHostedBy ?station. 34 ?sensor sosa:observes got:AtmosphericVisibilityProperty. 35 Query 4. Given the area location and radius, it detects ?obs sosa:madebySensor ?sensor; 35 sosa:resultTime ?time; 36 the hottest month of that area in given year sosa:hasSimpleResult ?value. 36 37 } 37 SELECT ?month (avg(?value) as ?avgTemp) ORDER BY DESC (?time) 38 WHERE LIMIT 1 38 { 39 ?station a got:WeatherStation. 39 ?station geo:hasGeometry ?geoFeature. 40 ?geoFeature wgs84:geometry ?coor. 40 41 FILTER ((?coor, 41 ($long$,$lat$),$radius$)). 42 ?sensor sosa:isHostedBy ?station. Query 8. Given a keyword, it retrieves all the places 42 ?sensor sosa:observes got:SurfaceTemperatureProperty. 43 ?obs sosa:madebySensor ?sensor; matching a keyword 43 44 sosa:resultTime ?time; 44 sosa:hasSimpleResult ?value. SELECT ?station ?place ?sc 45 filter (year(?time)=$year$). WHERE 45 } { 46 GROUP BY (month(?time) as ?month) ?station a got:WeatherStation. 46 ORDER BY DESC (avg(?value)) limit 1 ?station geo:hasGeometry ?geoFeature. 47 ?geoFeature geoname:parentADM1 ?place. 47 48 ?place bif:contains"’$keyword$’" OPTION (score ?sc). 48 }ORDER BY ?sc 49 49 50 Query 5. Given the station URI and year, it retrieves 50 51 the average wind speed for each month of year 51 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data 19

1 Query 9. Given a place name prefix, it summaries the [6] Tim Berners-Lee, James Hendler, and Ora Lassila. The seman- 1 2 number of observation of places that match a given tic web. Scientific american, 2001. 2 3 keyword. The results are grouped by place and ob- [7] Barry Bishop, Atanas Kiryakov, Damyan Ognyanoff, Ivan 3 Peikov, Zdravko Tashev, and Ruslan Velkov. Owlim: A family 4 served property 4 of scalable semantic repositories. Semantic Web, 2(1):33–42, 5 5 SELECT count(?obs) as ?totalNumber ?place ?observedType 2011. 6 WHERE 6 { [8] Christian Bizer, Tom Heath, Kingsley Idehen, and Tim 7 ?station a got:WeatherStation. Berners-Lee. Linked data on the web (ldow2008). In Proceed- 7 ?station geo:hasGeometry ?geoFeature. ings of the 17th World Wide Web. ACM, 2008. 8 ?geoFeature geoname:parentCountry ?place. 8 9 ?place bif:contains"’$name prefix$’*" OPTION (score ?sc). [9] Christian Bizer and Andreas Schultz. The berlin bench- 9 ?sensor sosa:isHostedBy ?station. mark. International Journal on Semantic Web and Information 10 ?sensor sosa:observes ?observedType. 10 ?obs sosa:madebySensor ?sensor. Systems (IJSWIS), 5(2):1–24, 2009. 11 }GROUP BY ?place ?observedType [10] Jeen Broekstra, Arjohn Kampman, and Frank Van Harmelen. 11 12 Sesame: A generic architecture for storing and querying rdf 12 13 and rdf schema. In International semantic web conference, 13 14 pages 54–68. Springer, 2002. 14 [11] Eugene Inseok Chong, Souripriya Das, George Eadon, and 15 Query 10. Given a keyword, it retrieves the average 15 humidity value for places that matches a keywords Jagannathan Srinivasan. An efficient sql-based rdf querying 16 scheme. In Proceedings of the 31st international conference on 16 17 since 2016 Very large data bases, pages 1216–1227. VLDB Endowment, 17 18 SELECT avg(?value) as ?avgValue ?place 2005. 18 19 WHERE [12] Michael Compton, Cory Andrew Henson, Laurent Lefort, Hol- 19 { 20 ?station a got:WeatherStation. ger Neuhaus, and Amit P Sheth. A survey of the semantic 20 ?station geo:hasGeometry ?geoFeature. specification of sensors. 2009. 21 21 ?geoFeature geoname:parentCountry ?place. [13] Michael Compton et al. The ssn ontology of the w3c seman- 22 ?place bif:contains"’$keyword$ *’" OPTION (score ?sc). 22 ?sensor sosa:isHostedBy ?station. tic sensor network incubator group. Web Semantics: Science, 23 ?sensor sosa:observes got:AtmosphericPressureProperty. Services and Agents on the World Wide Web, 17:25–32, 2012. 23 ?obs sosa:madebySensor ?sensor; 24 sosa:resultTime ?time; [14] Olivier Curé and Guillaume Blin. RDF database systems: 24 sosa:hasSimpleResult ?value. triples storage and SPARQL query processing. Morgan Kauf- 25 filter(year(?time)>=2016) 25 mann, 2014. 26 }GROUP BY ?place 26 [15] Orri Erling and Ivan Mikhailov. Virtuoso: Rdf support in a na- 27 tive rdbms. In Semantic Web Information Management, pages 27 28 501–519. Springer, 2010. 28 29 [16] D. Evans. The internet of things: How the next evolution of the 29 Query 11. It retrieves the total number of sensor for 30 internet is changing everything. 2011. 30 each observed properties 31 [17] David C Faye, Olivier Cure, and Guillaume Blin. A survey of 31 rdf storage approaches. Revue Africaine de la Recherche en 32 SELECT (count(?sensor) as ?number) ?obsType 32 WHERE Informatique et Mathématiques Appliquées, 15:11–35, 2012. 33 { [18] George Garbis, Kostis Kyzirakos, and Manolis Koubarakis. 33 ?sensor sosa:observes ?obsType 34 }GROUP BY ?obsType Geographica: A benchmark for geospatial rdf stores (long ver- 34 35 sion). In International Semantic Web Conference, pages 343– 35 36 359. Springer, 2013. 36 [19] Yuanbo Guo, Zhengxiang Pan, and Jeff Heflin. Lubm: A 37 37 benchmark for owl knowledge base systems. Web Semantics: 38 38 References Science, Services and Agents on the World Wide Web, 3(2- 39 3):158–182, 2005. 39 40 [1] Tdb optimizer. https://jena.apache.org/documentation/tdb/ [20] Stephen Harris and Nicholas Gibbins. 3store: Efficient bulk rdf 40 41 optimizer.html. storage. 2003. 41 42 [2] Jans Aasman. Allegro graph: Rdf triple database. Cidade: [21] Stephen Harris and Nigel Shadbolt. Sparql query processing 42 with conventional relational database systems. In International 43 Oakland Franz Incorporated, 17, 2006. 43 [3] Medha Atre, Jagannathan Srinivasan, and James A Hendler. Conference on Web Information Systems Engineering, pages 44 44 Bitmat: A main memory rdf triple store. Tetherless World Con- 235–244. Springer, 2005. 45 stellation, Rensselar Plytehcnic Institute, Troy NY, 2009. [22] Andreas Harth and Stefan Decker. Yet another rdf store: Per- 45 46 [4] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, fect index structures for storing semantic web data with con- 46 47 Richard Cyganiak, and Zachary Ives. Dbpedia: A nucleus for texts. Technical report, DERI Technical Report, 2004. 47 48 a web of open data. In The semantic web, pages 722–735. [23] Joseph M Hellerstein, Jeffrey F Naughton, and Avi Pfeffer. 48 Springer, 2007. Generalized search trees for database systems. September, 49 49 [5] Pierfrancesco Bellini and Paolo Nesi. Performance assessment 1995. 50 of rdf graph databases for smart city services. Journal of Visual [24] Krzysztof Janowicz, Armin Haller, Simon JD Cox, Danh 50 51 Languages & Computing, 45:24–38, 2018. Le Phuoc, and Maxime Lefrançois. Sosa: A lightweight ontol- 51 20 Hoan Nguyen et al. / A Performance Study of RDF Stores for Linked Sensor Data

1 ogy for sensors, observations, samples, and actuators. Journal [36] Eric Prud, Andy Seaborne, et al. Sparql query language for rdf. 1 2 of Web Semantics, 56:1–10, 2019. 2006. 2 3 [25] Pavel Klinov. How to read stardog query plans. https://www. [37] Roshan Punnoose, Adina Crainiceanu, and David Rapp. Rya: a 3 stardog.com/blog/how-to-read-stardog-query-plans/. 4 scalable rdf triple store for the clouds. In Proceedings of the 1st 4 [26] Dave Kolas. A benchmark for spatial semantic web systems. In 5 International Workshop on Scalable Semantic Web Knowledge International Workshop on Cloud Intelligence, page 4. ACM, 5 6 Base Systems, 2008. 2012. 6 7 [27] M. Koubarakis and K. Kyzirakos. Modeling and querying [38] Kurt Rohloff, Mike Dean, Ian Emmons, Dorene Ryder, and 7 8 metadata in the semantic sensor web: the model strdf and the John Sumner. An evaluation of triple-store technologies for 8 query language stsparql. In Proc. ESWC, volume 12, pages 9 large data stores. In OTM Confederated International Confer- 9 425–439, 2010. ences" On the Move to Meaningful Internet Systems", pages 10 [28] Kostis Kyzirakos et al. Strabon: a semantic geospatial dbms. 10 1105–1114. Springer, 2007. 11 In ISWC. Springer, 2012. 11 12 [29] Danh Le-Phuoc et al. The graph of things: A step towards [39] Michael Schmidt, Thomas Hornung, Norbert Küchlin, Georg 12 Lausen, and Christoph Pinkel. An experimental comparison of 13 the live knowledge graph of connected things. Journal of Web 13 Semantics, 37, 2016. rdf data management approaches in a sparql benchmark sce- 14 14 [30] Danh Le-Phuoc, Hoan Nguyen Mau Quoc, Josiane Xavier Par- nario. In International Semantic Web Conference, pages 82– 15 15 reira, and Manfred Hauswirth. The linked sensor middleware– 97. Springer, 2008. 16 connecting the real world and the semantic web. Proceedings 16 [40] Michael Schmidt, Thomas Hornung, Georg Lausen, and 17 of the Semantic Web Challenge, 152:22–23, 2011. 17 Christoph Pinkel. Spˆ 2bench: a sparql performance bench- 18 [31] Baolin Liu and Bo Hu. An evaluation of rdf storage systems for 18 large data applications. In 2005 First International Conference mark. In 2009 IEEE 25th International Conference on Data 19 19 on Semantics, Knowledge and Grid, pages 59–59. IEEE, 2005. Engineering, pages 222–233. IEEE, 2009. 20 [32] Akiyoshi Matono, Said Mirza Pahlevi, and Isao Kojima. Rd- [41] Michael Sintek and Malte Kiesel. Rdfbroker: A signature- 20 21 fcube: A p2p-based three-dimensional index for structural based high-performance rdf store. In European Semantic Web 21 22 joins on distributed triple stores. In Databases, Informa- Conference, pages 363–377. Springer, 2006. 22 tion Systems, and Peer-to-Peer Computing, pages 323–330. 23 [42] Evren Sirin, Bijan Parsia, Bernardo Cuenca Grau, Aditya 23 Springer, 2006. 24 Kalyanpur, and Yarden Katz. Pellet: A practical owl-dl rea- 24 [33] Thomas Neumann and Gerhard Weikum. Rdf-3x: a risc- 25 style engine for rdf. Proceedings of the VLDB Endowment, soner. Web Semantics: science, services and agents on the 25 26 1(1):647–659, 2008. World Wide Web, 5(2):51–53, 2007. 26 27 [34] Zhengxiang Pan and Jeff Heflin. Dldb: Extending relational [43] Cathrin Weiss, Panagiotis Karras, and Abraham Bernstein. 27 28 databases to support semantic web queries. Technical report, Hexastore: sextuple indexing for semantic web data manage- 28 LEHIGH UNIV BETHLEHEM PA DEPT OF COMPUTER 29 ment. Proceedings of the VLDB Endowment, 1(1):1008–1019, 29 SCIENCE AND ELECTRICAL ENGINEERING, 2004. 2008. 30 [35] Matthew Perry and John Herring. Ogc geosparql-a geographic 30 31 query language for rdf data. OGC implementation standard, [44] Kevin Wilkinson and Kevin Wilkinson. Jena property table 31 32 2012. implementation, 2006. 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44 44 45 45 46 46 47 47 48 48 49 49 50 50 51 51