Semantic Web 0 (0) 1, IOS Press

Systematic Performance Analysis of Distributed SPARQL Query Answering Using Spark-SQL

Mohamed Ragab*, Sadiq Eyvazov, Riccardo Tommasini and Sherif Sakr
Data Systems Group, Tartu University, Tartu, Estonia
E-mail: fi[email protected]
*Corresponding author. E-mail: fi[email protected].

Abstract. Recently, a wide range of Web applications utilize vast RDF knowledge bases (e.g. DBpedia, Uniprot, and Probase) and use the SPARQL query language. The continuous growth of these knowledge bases has led to the investigation of new paradigms and technologies for storing, accessing, and querying RDF data. In practice, modern big data systems such as Apache Spark can handle large data repositories. However, their application in the Semantic Web context is still limited. One possible reason is that such frameworks are not tailored for dealing with graph data models like RDF. In this paper, we present a systematic evaluation of the performance of the Spark-SQL engine for processing SPARQL queries. We configured the experiments using three relevant RDF relational schemas and two different storage backends, namely Hive and HDFS. In addition, we show the impact of using three different RDF-based partitioning techniques in our relational scenario. Moreover, we discuss the results of our experiments, showing interesting insights about the impact of different configuration combinations.

Keywords: Large RDF Graphs, SPARQL, Apache Spark, Spark-SQL, RDF Relational Schema, RDF Partitioning

1. Introduction

The Linked Data initiative is fostering the adoption of semantic technologies like never before [1, 2]. Vast Resource Description Framework (RDF) datasets (e.g. DBpedia, Uniprot, and Probase) are now publicly available, and the challenges of storing, managing, and querying large RDF datasets are becoming increasingly pressing. In this regard, the scalability of native triplestores such as RDF4J and RDF-3X is bound by their centralized architecture. Thus, the Semantic Web community is investigating how to leverage big data processing frameworks like Apache Spark [3] to achieve better performance when processing large RDF datasets [4, 5]. In fact, although big data frameworks are not tailored to perform native RDF processing, they have been successfully used to build engines for large-scale relational data processing, and several approaches exist for representing RDF data as relations [6–8].

To the best of our knowledge, a systematic analysis of the performance of big data frameworks when answering SPARQL Protocol and RDF Query Language (SPARQL) queries is still missing. Our research work focuses on filling this gap. In particular, we focus on Apache Spark which, with the Spark-SQL engine, is the de-facto standard for processing large datasets.

In the first phase of our work [9], we presented a systematic analysis of the performance of Spark-SQL on a centralized single machine. In particular, we measured the execution time required to answer SPARQL queries. In our evaluation we considered: (i) alternative relational schemas for RDF, i.e., Single Statement Tables (ST), Vertical Tables (VT), and Property Tables (PT); (ii) various storage backends, i.e., PostgreSQL, Hive, and HDFS; and (iii) different data formats (e.g. CSV, Avro, Parquet, ORC).
In this paper, we present the second phase of our investigation. Our experiments include larger datasets than before, in a distributed environment, and in the presence of data partitioning.


In particular, we evaluate the impact of three different RDF-based partitioning techniques (i.e., Subject-based, Predicate-based, and Horizontal partitioning) on our relational data. An additional contribution of the current paper is a deeper and prescriptive analysis of Spark-SQL performance. Hence, inspired by the work in [10], we analyze the experimental results in detail and provide a framework for deciding on the best configuration combinations of schema, partitioning, and storage when seeking better performance. In particular, this paper applies existing techniques for ranking experimental results, and discusses their pros and cons alongside their limitations. Last but not least, the paper shows how to combine the ranking criteria of the experimental dimensions (i.e. relational schemas, partitioning techniques, and storage backends) to better investigate the trade-offs that occur across these dimensions.

The remainder of the paper is organized as follows: Section 2 presents an overview of the knowledge required to understand the content of the paper. Section 3 describes the benchmarking scenario of our study. Section 4 describes the experimental setup of our benchmark, while Section 5 presents the methodology we followed to analyze the results. Section 6 discusses these results and various insights regarding them. We discuss the related work in Section 7, before concluding the paper in Section 8.

2. Background

In this section, we present the necessary background to understand the content of the paper. We assume that the reader is familiar with the RDF data model and the SPARQL query language.

2.1. Spark & Spark-SQL

Apache Spark [3] is an in-memory distributed computing framework for large-scale data processing. At Spark's core there are Resilient Distributed Datasets (RDDs), i.e., immutable distributed collections of data elements.

Spark supports different storage backends for reading and writing data. Those that are relevant for our performance evaluation are Apache Hive, i.e., a data warehouse built on top of Apache Hadoop for providing data query and analysis [11], and the Hadoop Distributed File System (HDFS). In particular, HDFS supports the following file formats: (i) Comma Separated Values (CSV), a human-readable and easy-to-debug file format; (ii) Parquet (https://parquet.apache.org/), which stores the data in a nested data structure and a flat columnar format that supports compression; (iii) Avro (https://avro.apache.org/), which contains data serialized in a compact binary format together with its schema in JSON format; and (iv) Optimized Row Columnar (ORC, https://orc.apache.org/), which provides a highly efficient way to store and process Hive data.

Last but not least, DataFrames are a convenient programming abstraction that adds to the flexibility of RDDs a specific named schema with typed columns, as in relational databases. Spark-SQL [12] is a high-level library for processing DataFrames in a relational manner. In particular, it allows querying DataFrames using an SQL-like language, and it relies on the Catalyst query optimizer (https://databricks.com/glossary/catalyst-optimizer).
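To make the DataFrame abstraction concrete, the following minimal Scala sketch builds a tiny triples DataFrame, registers it as a view, and queries it through Spark-SQL. All names and values here are our own illustration, not taken from the experimental code of the paper.

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlBackgroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-sql-background-sketch")
      .getOrCreate()
    import spark.implicits._

    // A DataFrame is an RDD plus a named, typed schema; here we build a
    // tiny triples table by hand (in the experiments the data would be
    // read from a storage backend instead, e.g. spark.read.parquet(...)).
    val triples = Seq(
      (":Journal1", "rdf:type", ":Journal"),
      (":Journal1", "dc:title", "Journal 1 (1940)"),
      (":Journal1", "dcterms:issued", "1940")
    ).toDF("subject", "predicate", "object")

    // Registering the DataFrame as a view makes it queryable with SQL;
    // Catalyst turns the SQL into an optimized execution plan.
    triples.createOrReplaceTempView("triples")
    spark.sql("SELECT subject FROM triples WHERE predicate = 'rdf:type'").show()

    spark.stop()
  }
}
```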

2.2. Relational RDF Schemas

Although triplestores can be used to efficiently store and manage RDF data [13], some research works suggest managing RDF data in relational stores [14]. In the following, we present three relational schemas that are suitable for representing RDF data. For each schema we give an example of data using Listing 1, and we provide the respective SQL translation of the SPARQL query in Listing 2.

    :Journal1 rdf:type :Journal ;
        dc:title "Journal 1 (1940)" ;
        dcterms:issued "1940" .
    :Article1 rdf:type :Article ;
        dc:title "richer dwelling scrapped" ;
        dcterms:issued "2019" ;
        :journal :Journal1 .

Listing 1: RDF example in N-Triples. Prefixes are omitted.

    SELECT ?yr
    WHERE { ?journal rdf:type bench:Journal .
            ?journal dc:title "Journal 1 (1940)" .
            ?journal dcterms:issued ?yr . }

Listing 2: SPARQL example against the RDF graph in Listing 1. Prefixes are omitted.

Single Statement Table Schema stores RDF triples in a single table with three columns representing the three components of an RDF triple, i.e., subject, predicate, and object. The ST schema is widely adopted [15, 16]; for instance, the major open-source triplestores, i.e., Apache Jena, RDF4J and Virtuoso, use the ST schema for storing RDF data. Figure 1 shows the ST schema representation of the sample in Listing 1, and the associated SQL translation of the SPARQL query in Listing 2.

Vertically Partitioned Tables Schema stores RDF triples in one two-column table (subject, object) per unique property in the RDF dataset. The VT schema was proposed to speed up queries over RDF triplestores [15]. Figure 2 shows the VT representation of the sample RDF graph in Listing 1, and the associated SQL translation of the SPARQL query in Listing 2.

Property Tables Schema clusters multiple RDF properties as the columns of an n-ary table for the same subject, grouping entities that are similar in structure. The PT schema works well with highly structured data, but not with poorly structured datasets [14], due to the high number of null values it might incur. Moreover, due to its sparse table representation, PT suffers from high storage overheads when a large number of predicates is present in the RDF data model [7]. Figure 3 shows the relational flattened property tables of the RDF graph in Listing 1 and the associated SQL translation of the SPARQL query in Listing 2.

It is worth mentioning that there are other variants of the relational schemas mentioned above, in particular the Wide Property Tables (WPT) schema [17] and Extended Vertical Partitioning (ExtVP) [18]. The former uses a single unified table for all the properties in the RDF graph; the latter is inspired by semi-join reductions of the possible VP-table join correlations that can occur among the SPARQL query triple patterns. However, we opt to use the three most common RDF relational schemas (i.e. ST, VT, and PT).
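As a rough illustration of how the query of Listing 2 differs across the three schemas, the sketch below phrases it in Spark-SQL once per schema. The table and column names are our own illustrative assumptions; the exact layouts are those of Figures 1–3 and the SP2Bench SQL translations.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Listing 2 rephrased per schema; all table/column names are assumptions.
def listing2Variants(spark: SparkSession): Map[String, DataFrame] = Map(
  // ST: one triples(s, p, o) table; every additional triple pattern
  // becomes a self-join on the subject column.
  "ST" -> spark.sql("""
    SELECT t3.o AS yr
    FROM triples t1 JOIN triples t2 ON t1.s = t2.s
                    JOIN triples t3 ON t2.s = t3.s
    WHERE t1.p = 'rdf:type'       AND t1.o = 'bench:Journal'
      AND t2.p = 'dc:title'       AND t2.o = 'Journal 1 (1940)'
      AND t3.p = 'dcterms:issued'"""),
  // VT: one (s, o) table per predicate, joined on the subject.
  "VT" -> spark.sql("""
    SELECT dcterms_issued.o AS yr
    FROM rdf_type
      JOIN dc_title       ON rdf_type.s = dc_title.s
      JOIN dcterms_issued ON dc_title.s = dcterms_issued.s
    WHERE rdf_type.o = 'bench:Journal'
      AND dc_title.o = 'Journal 1 (1940)'"""),
  // PT: the predicates of an entity are columns of one wide table,
  // so this query needs no join at all.
  "PT" -> spark.sql("""
    SELECT issued AS yr
    FROM journal
    WHERE title = 'Journal 1 (1940)'""")
)
```

Note how the join count drops from two self-joins (ST) to none (PT); this is exactly the effect that Table 1 quantifies query by query.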
2.3. RDF Data Partitioning

Last but not least, RDF data partitioning is another critical choice in our scenario. We have selected three partitioning techniques for RDF data that are suitable for our experiments on Spark-SQL (cf. Figure 4).

Horizontal-Based Partitioning (HP) partitions the data evenly across the machines of the cluster. In particular, it divides the relational tables into n equivalent chunks, where n is the number of machines in the cluster.

Subject-Based Partitioning (SBP) distributes triples to the various partitions according to the hash value computed on their subjects. As a result, all the triples that have the same subject reside on the same partition. In our scenario, we applied Spark partitioning using the subject as a key on our different relational schema tables/DataFrames.

Predicate-Based Partitioning (PBP) distributes triples, similarly to SBP, to the various partitions based on the hash value computed on their predicates. As a result, all the triples that have the same predicate reside on the same partition. In our scenario, we applied Spark partitioning using the predicate as a key on our different relational schema tables/DataFrames.

Also for partitioning, it is worth mentioning that other approaches exist in the literature [7, 10]. The techniques presented above, however, are suitable to work within the Spark-SQL framework, which is why we selected them from among the seven pure RDF-based partitioning techniques discussed in [10]. Techniques like Hierarchical Partitioning, by contrast, rely on the structure of the URIs, or are based on the k-way multi-level RDF partitioning strategy [19]; such approaches may require some re-design to fit our relational-based data processing scenario.

3. Benchmark Datasets & Queries

According to Jim Gray [20], a domain-specific benchmark must be Relevant, Portable, Scalable, and Simple. In our evaluation, we used SP2Bench (SPARQL Performance Benchmark) [21] because it meets these criteria; it is also one of the most popular RDF benchmarks.

SP2Bench is Simple, as it is centered around the Computer Science DBLP scenario, which is easy for researchers to understand. It is Scalable, because it comprises a data generator that enables the creation of arbitrarily large DBLP-like documents (in Notation3 format). It is Portable w.r.t. our scenario, as it provides a set of SPARQL queries with their translations into SQL for each of the relational schemas we selected. These queries have different complexities and a high diversity of features [22]; thus, SP2Bench is also a Relevant benchmark. Moreover, it has a reasonably low Structuredness score, making it closer to the structure of real-world RDF datasets [22].

Fig. 1. Single Statement Table Schema and an associated SQL query sample. Prefixes are omitted.

Fig. 2. Vertical Partitioned Tables Schema and an associated SQL query sample. Prefixes are omitted.

Fig. 3. Property Tables Schema and an associated SQL query sample. Prefixes are omitted.

Fig. 4. RDF partitioning techniques: (a) Horizontal Partitioning, (b) Subject-based Partitioning, (c) Predicate-based Partitioning.

Since the design of the ST and VT schemas is independent of the meaning of the RDF data, for the PT schema we reused a similar design inspired by the relational schema proposed by Schmidt et al. [14]. In their experiments, the SP2Bench RDF dataset contains ten different relational entities, namely Journal, Article, Book, Person, InProceeding, Proceeding, InCollection, PhDThesis, MasterThesis, and WWW documents. This schema is inspired by the original DBLP schema of the DBLP-like RDF data produced by the SP2Bench generator (http://dbis.informatik.uni-freiburg.de/forschung/projekte/SP2B/).

3.1. Queries

SP2Bench queries (http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/queries.php) cover a variety of SPARQL operators as well as various RDF access patterns. In our experiments, to be compliant with Spark-SQL, we use the SQL translations of these SPARQL queries, which are provided for each of the relational schemas, i.e., ST, VT, and PT (http://dbis.informatik.uni-freiburg.de/index.php?project=SP2B/translations.html). We evaluated all 11 queries of type SELECT, except that Q9 is not applicable (n/a) for the PT relational schema.

To give an indication of query complexity, we looked at the following query features: the number of joins, the number of filters, and the number of projected variables. Table 1 summarizes these complexity measures for the SP2Bench queries in SPARQL, and for the SQL translations related to each RDF relational schema. For instance, we use the number of projected variables in the SQL statements as an indicator when comparing the data formats of the storage backends in terms of being row-oriented (e.g., Avro) or columnar-oriented (e.g., Parquet or ORC).

            SPARQL                    ST-SQL         VT-SQL         PT-SQL
        #Joins #Filters #Proj     #Joins  #Sel   #Joins  #Sel   #Joins  #Sel
  Q1       3      0       1          8      6       5      2       2      2
  Q2       8      0      10         28      9      19      0       9      5
  Q3       1      1       1          5      2       3      2       2      2
  Q4       7      1       2         19     16      11      3       8      3
  Q5       5      1       2         16     11       9      3       7      3
  Q6       8      3       2         26     13      15      4       6      3
  Q7      12      2       1         26     15      16      5       2      1
  Q8      10      2       1         23     13      13      7       9      4
  Q9       3      0       1         11      4       5      3      n/a    n/a
  Q10      0      0       2          3      1       2      2       5      2
  Q11      0      0       1          2      1       1      0       2      0

Table 1. SP2Bench queries complexity analysis: the number of joins, filters, and projections/selections for each SPARQL query and for our three considered RDF relational schemas (ST, VT, and PT).

4. Experimental Setup

In this section, we describe our experimental environment, i.e., (i) we discuss how we configured our experimental hardware and software components; (ii) we describe how we prepared, partitioned, and stored the datasets; and (iii) we present the details of the experiment design.

Hardware and Software Configurations. Our experiments were executed on a bare-metal cluster of four machines running CentOS Linux V7, each with a 32-core AMD processor, 128 GB of memory, and a high-speed 2 TB SSD as its data drive. We used Spark V2.4 to fully support the Spark-SQL capabilities, and Hive V3.2.1. In particular, our Spark cluster consists of one master node and three worker machines, with YARN as the resource manager; in total it uses 330 GB of memory and 84 virtual processing cores.

Benchmark Datasets. Three datasets of 100M, 250M, and 500M triples were generated using SP2Bench in Notation3 (.n3) format. We ran our experiments on all three datasets to check that our results scale consistently. For the sake of conciseness, we show in the paper only the results for the 100M and 500M datasets. Nevertheless, all the results, including those for the intermediate 250M dataset, are available in our GitHub repository (https://datasystemsgrouput.github.io/SPARKSQLRDFBenchmarking/Results).

Data Partitioning. In Section 2, we described the partitioning techniques we selected, i.e., HP, SBP, and PBP. Partitioning impacts data distribution and thus affects Spark-SQL performance, especially when reading from the data backends and during data-joining operations. Therefore, when partitioning is required, the goal is to minimize data shuffling, and one should select the technique that best suits the workload, i.e., the queries to run. The partitioning techniques mentioned above were originally designed for RDF partitioning; hence, we defined their equivalent versions for the tabular RDF representations.

In Spark-SQL, join operations are equi-joins, i.e., they benefit from the join key being the partitioning key, which means that matching data resides on the same node. Thus, we prepared the data in two phases, as sketched below. First, we used custom Spark partitioners to create DataFrames that fulfil a given partitioning technique: depending on the technique of choice (i.e. SBP, PBP, or HP), we used the subject or the predicate as the partitioning key, or we applied the horizontal approach. Then, we persisted the DataFrames on HDFS. We fixed the data partition block size on HDFS to the default block size of Spark (128MB). HDFS also manages the replication of these partitioned blocks according to a configurable replication factor (RF); we used the default RF = 3.
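The following Scala sketch shows what this first preparation phase could look like for an ST-style triples DataFrame, under stated assumptions: the column names, the HDFS output path, and the use of DataFrame.repartition as the partitioner are illustrative, not the paper's exact code.

```scala
import org.apache.spark.sql.DataFrame

// Phase 1: shape the DataFrame partitions according to HP, SBP, or PBP.
// Phase 2: persist the result on HDFS (block size and replication factor
// are left to the Spark/HDFS defaults described above).
def partitionAndPersist(triples: DataFrame, technique: String,
                        numPartitions: Int): Unit = {
  val partitioned = technique match {
    // SBP: hash-partition on the subject, so all triples sharing a
    // subject land in the same partition and subject-joins avoid shuffles.
    case "SBP" => triples.repartition(numPartitions, triples("subject"))
    // PBP: the same idea, keyed on the predicate instead.
    case "PBP" => triples.repartition(numPartitions, triples("predicate"))
    // HP: round-robin redistribution into evenly sized chunks.
    case "HP"  => triples.repartition(numPartitions)
    case other => sys.error(s"unknown partitioning technique: $other")
  }
  partitioned.write.mode("overwrite")
    .parquet(s"hdfs:///rdf/st/$technique") // illustrative path
}
```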

Data Storage. In our experiments, we use two storage backends, i.e., HDFS and Hive (see Section 2); additionally, for HDFS we used multiple file formats. We used Spark to convert the data from the N3 format generated by SP2Bench into Avro, Parquet, and ORC. We used Spark for this conversion because of its ability to efficiently handle large files in memory, which is required when converting graph data; moreover, Spark supports reading and writing the different file formats from and into HDFS.

We used the same approach to load the data into the tables of the Apache Hive data warehouse, using three databases, one for each dataset size (100M, 250M, and 500M). Loading the data of the CSV files into the Hive data warehouse had to be done in a slightly different way. In particular, to store data into Hive tables, it is necessary to enable Hive support in the Spark session configuration using the enableHiveSupport function. Moreover, it is also necessary to provide the Hive metastore URI via the Thrift protocol, also specified in the SparkSession configuration, in addition to the warehouse location.

Experiments Design. We evaluated all the SP2Bench queries for all the combinations of schemas, backends/formats, and partitioning techniques. For each configuration, we ran the experiment five times, excluding the first cold-start run to avoid warm-up bias and computing the average of the other four run times. Figure 5 summarizes the experiment configurations and guides the reader through the naming scheme used in our analysis results and plots, i.e.,

{Schema}.{Partitioning_Technique}.{Storage_Backend}

For instance, (a.ii.4) corresponds to the ST schema, SBP partitioning, and the Parquet backend. We used the Spark.time function, passing the spark.sql(query) query execution function as a parameter; the output of this function is the running time of evaluating the SQL query in the Spark environment through the SparkSession interface.

Fig. 5. Experiments architecture: each relational schema, (a) Single Statement Table, (b) Vertical Tables, (c) Property Tables, is combined with a partitioning technique, (i) HP, (ii) SBP, (iii) PBP, and a storage backend, (1) Avro, (2) CSV, (3) ORC, (4) Parquet, (5) Hive.
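Putting the pieces of this section together, a minimal sketch of the Hive-enabled session and the timing call could look as follows; the metastore URI, paths, and database/table names are placeholders of our own, not values from the paper.

```scala
import org.apache.spark.sql.SparkSession

// Hive-backed SparkSession: Hive support plus the metastore URI and
// warehouse location, as described above (values are placeholders).
val spark = SparkSession.builder()
  .appName("sp2bench-hive-sketch")
  .config("hive.metastore.uris", "thrift://master-node:9083")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Loading a CSV file into a Hive-managed table.
spark.read.option("header", "true")
  .csv("hdfs:///rdf/csv/triples_100m.csv")
  .write.mode("overwrite")
  .saveAsTable("sp2bench_100m.triples")

// spark.time prints the wall-clock time of the enclosed expression; an
// action such as show() forces the query to actually execute.
val query = "SELECT COUNT(*) FROM sp2bench_100m.triples"
spark.time(spark.sql(query).show())
```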

5. Analysis Methodology

In this section, we describe the methodology we follow for analyzing our results. We structured our analysis according to Figure 6 (cf. https://www.gartner.com/en/newsroom/press-releases/2014-10-21-gartner-says-advanced-analytics-is-a-top-business-priority), which tries to quantify the cost of decision making at different levels of analysis.

Fig. 6. Analysis Methodology.

We advocate the need for a decision-making framework for making sense of the performance of big data systems. This becomes crucial especially when the solution space includes several different variables and unknown trade-offs, which is the standard case when benchmarking the performance of big-data frameworks, e.g. Spark in our scenario.

Before explaining how we structure each level of analysis, it is worth noting that the predictive analysis level is out of the scope of this work. Predictive analysis typically leverages statistical models in order to answer questions about the future; in our research, we focus on systematically applying a post-hoc evaluation of the performance.

5.1. Descriptive Analysis

On the top level, descriptive analysis allows answering factual questions. We extrapolate fine-grained insights, e.g., what is happening at the query evaluation level. In particular, we use descriptive analysis to identify which queries are long-running, medium-running, or short-running according to their average running times. In this phase of the analysis, we are also able to observe general performance dimensions: for each query, we can observe which schema, partitioning technique, and storage backend perform the best or the worst. However, at this level of analysis, we are unable to decide which configuration combination (i.e. schema, partitioning, and storage backend) shows the best performance overall. Moreover, some of the descriptive results contradict each other at this level: for instance, we will show that in some queries the VT schema is the best performing choice, whereas for the same query with another partitioning technique the PT schema performs better.

5.2. Diagnostic Analysis

Right below the descriptive analysis, diagnostic analysis allows answering why questions. At this level, we combine factual knowledge from the observed data with knowledge about the world to make sense of the results. We can enrich the descriptive analyses mentioned above with contextual information about the query complexity and the configuration. However, we are still unable to investigate the trade-offs among the dimensions affecting the performance. Thus, we advocate the need for better indicators that help investigate the impact of each dimension across all the queries.

5.3. Prescriptive Analysis

Last but not least, at the bottom level, prescriptive analysis provides actionable insights on which the analyst can decide. In practice, this means systematically investigating the impact of each dimension of the experiment, i.e. schema, partitioning, and storage, while discussing the trade-offs across these different dimensions, to the extent of identifying an optimal solution.

In this regard, ranking criteria, e.g. the one proposed in [10] for partitioning, help in giving a high-level view of the performance of a certain dimension across queries. Thus, we have extended the proposed ranking technique to schemas and storage backends. The following equation shows a generalized ranking formula for ranking our relational schemas, partitioning techniques, and storage backends:
43 ning, or short running according to their average run- 44 t 44 45 ning times. In this phase of analysis, we will also be X OD(r) ∗ (t − r) 45 RS D = , 0 < RS D 6 1 (1) 46 able to observe general performance dimensions. For b(t − 1) 46 r=1 47 each query, we can observe which schema, partition- 47

In Equation 1, RS_D defines the Rank Score (RS) of the ranked dimension D (a relational schema, partitioning technique, or storage backend), where t is the number of alternatives for the ranked dimension: ranking the relational schemas poses t = 3, as we have three different schemas in this paper; the same holds (t = 3) when ranking the partitioning techniques, as they are also three; whereas when ranking the storage backends we pose t = 5, as we have five different backends/file formats. The term b represents the total number of query executions, and we have 11 query executions in our SP2Bench benchmark (i.e. b = 11). Finally, O_D(r) denotes the total number of occurrences of a particular dimension value D at rank r.
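Equation 1 is mechanical enough to state as code. The following Scala sketch is our own illustration, with a hypothetical occurrence map as input:

```scala
// Sketch of Equation 1. occurrences(r) plays the role of O_D(r): the
// number of the b = 11 query executions that placed dimension value D
// at rank r (rank 1 = best).
def rankScore(occurrences: Map[Int, Int], t: Int, b: Int = 11): Double =
  (1 to t).map(r => occurrences.getOrElse(r, 0) * (t - r)).sum.toDouble /
    (b * (t - 1))

// t = 3 for schemas and partitioning techniques, t = 5 for storage
// backends. A schema ranked first in 9 queries and second in 2 scores
// (9 * (3 - 1) + 2 * (3 - 2)) / (11 * (3 - 1)) = 20 / 22 ≈ 0.91.
val example = rankScore(Map(1 -> 9, 2 -> 2), t = 3)
```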

Applying the ranking criteria independently for each dimension supports a better explanation of the results. Nevertheless, we observed that we are still unable to identify which configuration combination performs best, since the trade-offs between those dimensions remain uninvestigated. Therefore, we advocate combining the rankings in order to choose the best performing configuration combination. To this extent, we tested three alternative techniques that aim at combining the ranking dimensions into one unified ranking criterion.

– The Average (AVG) criterion combines the three dimension rankings by averaging them (cf. Equation 2), where R_f, R_p, and R_s are the rankings of the storage backends, partitioning techniques, and relational schemas, respectively:

    AVG = \frac{1}{3}(R_f + R_p + R_s)    (2)

– The Weighted Average (WAvg) criterion extends the Average by assigning a weight to each individual rank according to its impact in the experiments, i.e., we have 5 different storage backends but only 3 partitioning techniques and 3 relational schemas (cf. Equation 3):

    WAvg = \frac{1}{3}(5\,R_f + 3\,R_p + 3\,R_s)    (3)

– The Ranking Triangle Area (Rta) criterion leverages a geometric interpretation of the trade-off between the three experimental dimensions. It looks at the triangle subsumed by the three ranking scores (R_f, R_p, and R_s): the trade-off dimensions are represented by the axes of the triangle, and the criterion aims at maximizing the triangle's area. In other words, the bigger the area of this triangle, the better the performance of the three ranking dimensions taken together. The ideal case is represented by the red triangle in Figure 7, which has the maximum ranking score of 1 on all the vertices (as 0 < RS_D ≤ 1, cf. Equation 1).

Fig. 7. The Triangle Area (Rta) combined ranking criterion: a radar plot whose three axes, Schema (Rs), Partitioning (Rp), and Storage Format (Rf), range from 0.00 to 1.00.
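For concreteness, the sketch below computes the three criteria for one configuration. AVG and WAvg follow Equations 2 and 3 directly; the exact Rta formula is not spelled out in the text, so we assume the natural reading of Figure 7, namely the radar-chart triangle area (axes 120 degrees apart) normalized by the area of the ideal triangle. This assumed reading reproduces the Rta column of Table 2 on spot checks (e.g. a.ii.3 on 100M: Rf = 0.93, Rp = 0.95, Rs = 0.36 gives 0.52).

```scala
// Combined ranking criteria for one configuration's three rank scores.
case class Ranks(rf: Double, rp: Double, rs: Double) {
  // Equation 2.
  def avg: Double = (rf + rp + rs) / 3
  // Equation 3: storage gets weight 5, partitioning and schema weight 3.
  def wavg: Double = (5 * rf + 3 * rp + 3 * rs) / 3
  // Assumed Rta: each pair of adjacent radar axes spans a sub-triangle of
  // area 0.5 * sin(120 deg) * a * b; dividing by the ideal triangle's
  // area cancels the trigonometric constant, leaving the ratio below.
  def rta: Double = (rf * rp + rp * rs + rs * rf) / 3
}

val aii3 = Ranks(rf = 0.93, rp = 0.95, rs = 0.36)
// aii3.avg ≈ 0.75, aii3.wavg ≈ 2.86, aii3.rta ≈ 0.52, as in Table 2.
```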
6. Experimental Results

In this section, we discuss the results of our experiments at the different levels of our analysis. We present the insights that each analysis criterion unveils about the performance of the Spark-SQL query engine under the various relational RDF schemas, partitioning techniques, and storage backends.

6.1. Descriptive and Diagnostic Analysis

We start with the descriptive analysis by showing the average query runtimes for the benchmark queries. We then follow these descriptive results with the respective diagnostic analysis, answering the 'why' question behind them.

6.1.1. Query Performance Analysis

Figures 8 and 9 show the average execution times of the SP2Bench queries for the 100M and 500M datasets, respectively. We can immediately observe that queries Q1, Q3, Q10, and Q11 are the least impactful queries (they have the lowest running times); we call these short-running queries. On the other hand, queries Q2, Q4, and Q8 have the longest runtimes, while the remaining queries Q5, Q6, Q7, and Q9 are medium-running queries. In the following, we focus our analysis on the longest running queries, as they may hide interesting insights about the limitations of the approach.

Fig. 8. 100M dataset: average query execution run-times in seconds, per storage backend (Avro, CSV, Hive, ORC, Parquet). Panels: (a) ST schema/HP, (b) VT/HP, (c) PT/HP, (d) ST/PBP, (e) VT/PBP, (f) PT/PBP, (g) ST/SBP, (h) VT/SBP, (i) PT/SBP. Full-size figures are in the project GitHub repository.

Fig. 9. 500M dataset: average query execution run-times in seconds, with the same panel layout as Fig. 8. Full-size figures are in the project GitHub repository.

Query Q2 shows a low average execution time when using the ST or the VT schema. However, with the PT schema it has much higher runtimes (in some cases twice the runtime). This observation is confirmed in the 100M, 250M, and 500M datasets, regardless of the partitioning technique of choice. Therefore, we can conclude that the PT schema is not the best option for query Q2. This observation only holds if we neglect the bad performance of CSV and Avro under the ST schema; even so, it gives a clear indication of the impact of the storage backend on our observations. Query Q4 has the highest latency across all the relational schemas, all partitioning techniques, and all storage backends. Query Q8 immediately follows as the second longest running query across the different partitioning techniques and storage backends. Interestingly, Q8 shows a significant improvement when using the PT schema: we can observe this in the 100M dataset (Figure 8), and it becomes even clearer when scaling up to 250M (figures kept in the GitHub repository) and 500M (Figure 9).

Finally, we can also notice that the ST relational schema has the worst impact on the majority of the queries, while the VT schema is mostly the best performing one, directly followed by the PT schema. Regarding the storage backends, we generally observe that the columnar HDFS file formats (ORC and Parquet) are the best performing, followed by Hive, whereas the row-oriented Avro and the textual CSV file formats of HDFS are mostly the worst performing backends. Regarding the partitioning impact, the SBP approach tends to considerably outperform its opponents: it directly outperforms HP, leaving the PBP technique in the worst rank. The partitioning impact is shown more clearly in the ranking figures of the next section.

However, we cannot straightforwardly state that the VT schema outperforms the PT schema: as shown for Q8, the PT schema clearly outperforms the VT schema there. Similarly, we cannot state that the Avro file format is always the worst or second-worst performing storage backend (even though it is in the majority of queries), as Avro has also been the best performing storage backend several times.

6.1.2. Results Diagnostic Analysis

Moving to the diagnostic analysis, we try to explain the previous observations by analyzing the query complexity (cf. Table 1) and using our knowledge about Spark and the experimental dimensions, i.e. the relational schemas, partitioning techniques, and storage backends. We provide this diagnostic analysis along these dimensions rather than investigating each single query result.

Regarding the relational schema comparison, we observe that the ST schema is mostly the worst performing schema. Indeed, the ST schema is the one that requires the maximum number of self-joins (cf. Table 1, ST-SQL column); moreover, the single ST table is the largest table even after partitioning. In contrast, VT is mostly the best performing schema, especially when scaling up to larger datasets. The reason is that VT tables tend to be smaller than the tables of the other relational schemas; thus, Spark's query joins have smaller intermediate results in the shuffle operations. In addition, VT tends to become more and more efficient for queries with a small number of joins (i.e. of Basic Graph Pattern triple patterns). Meanwhile, the PT schema remains a strong competitor to the VT schema, since it is the schema that requires the minimum number of joins when translating SPARQL into SQL; indeed, PT achieved the highest ranks in the majority of query executions in the single-machine experiments [9]. However, when scaling up the experiment sizes, the PT schema starts to incur larger intermediate results with higher shuffling costs, which degrades its performance. Moreover, partitioning the PT schema by predicate (PBP) or horizontally (HP) has a negative effect on its performance, especially with the SP2Bench query set, which is highly 'subject'-oriented. We substantiate this when reasoning about the impact of partitioning in our experiments below.

Regarding the comparison of the partitioning techniques, in general the Subject-based partitioning approach tends to considerably outperform its partitioning opponents. In particular, it directly outperforms the HP approach, leaving the PBP technique in the last/worst rank. The reason is that most of the queries in SP2Bench are 'star'- or 'snowflake'-shaped, and thus mostly oriented towards the RDF subject as the joining key. Indeed, partitioning by subject allocates the triples with the same subject on the same machine, reducing data shuffling to the minimum and maximizing the level of parallelism across all workers. This is not satisfied by the Horizontal-based approach, which splits the tables randomly and only cares about distributing them in a balanced way, regardless of whether rows with the same subject end up on the same machine. Finally, the Predicate-based approach exhibits the highest degree of shuffling when joins are run by Spark-SQL in most of the SP2Bench queries. Therefore, the Predicate-based technique is not recommended when evaluating 'subject'-oriented queries.

Moreover, Predicate-based partitioning is the partitioning technique with the most unbalanced load and the highest data skewness [18], since our SP2Bench RDF dataset has some predicates with few triples/table entries (e.g. subClassOf), while others cover large portions of the RDF graph (e.g. creator, type, and homepage). This unbalanced nature leads to stragglers and inefficient join implementations in Spark-SQL.

Last but not least, let us consider the storage backends comparison. We observed that the columnar file formats of the HDFS storage backend outperform the others: ORC is the best performing storage format, followed by Parquet, with Hive consistently following them, whereas the textual uncompressed CSV and the row-oriented Avro file formats are the worst-performing backends. The reason behind these results is that most of the SP2Bench queries have a small number of projections, as shown in Table 1; columnar storage backends thus perform better, since they only have to scan a subset of the columns, filtering out the columns that are unnecessary for the query [23]. Interestingly, we can also observe that Avro outperforms the other storage formats under the PBP partitioning technique.

6.2. Prescriptive Analysis

In the following, we attempt to make the performance analysis prescriptive, i.e., we aim at identifying the optimal configuration combinations to use for SP2Bench, our RDF benchmarking scenario. To this extent, we use the alternative ranking criteria of Section 5 (cf. Equation 1). We start by showing the separate ranking analysis results for each experimental dimension; then, we discuss the results of the combined ranking criteria. We finally discuss which ranking criterion is the most relevant for choosing the optimal configurations, alongside a table of the best and worst configuration combinations for the queries. Notably, we keep all the intermediate ranking results/tables, from which we calculated the final scores of the relational schemas, the partitioning techniques, and the storage backends, in the GitHub repository of this project and on its web page (https://datasystemsgrouput.github.io/SPARKSQLRDFBenchmarking/).

6.2.1. Relational-Schemas Ranking Analysis

Figure 10 shows how many times a particular relational schema achieves the highest or the lowest ranking scores, considering the results of all experiments. Specifically, it reports the schema ranking scores for the 100M dataset (Figures 10 (a), (c), and (e)) and the 500M triples dataset (Figures 10 (b), (d), and (f)), per storage backend, for our three partitioning techniques (HP, SBP, and PBP). In these graphs, a particular relational schema outperforms the others when it has a higher ranking score, i.e., the higher the ranking score, the better the performance.

For the 100M dataset, we observe that the ST schema is always the worst performing one (100% of the cases) when HP or SBP is chosen as the partitioning technique. However, for the PBP partitioning, the ST schema has the second highest scores, after the VT schema, in 60% of the cases. On the other side, the VT schema has the highest ranking scores among the relational schemas in 100% of the cases, and the PT schema falls between the VT and ST schemas in more than 73% of the cases. Scaling up to the 250M and 500M triples datasets confirms these observations.

6.2.2. Partitioning Techniques Ranking Analysis

Figure 11 shows the ranking scores of our partitioning techniques (i.e. Horizontal, Subject-based, and Predicate-based partitioning) for the 100M (Figures 11 (a), (c), and (e)) and 500M (Figures 11 (b), (d), and (f)) triples datasets, for the relational schemas ST, VT, and PT respectively. Also for these graphs, the higher the rank score, the better: a particular partitioning technique outperforms the other techniques when it achieves a higher ranking score. Notably, when the score column of a partitioning technique is missing, its rank score is zero (i.e. it always comes in the last rank, the 3rd in this case).

For the 100M dataset, Subject-based partitioning is the best performing approach: it has the highest ranking scores in more than 93% of the ranking times. Predicate-based partitioning performs as the worst technique with roughly the same ratio (i.e. 93%).

Fig. 10. Relational schema ranking scores (VT, PT, ST) per storage backend (Avro, CSV, ORC, Parquet, Hive) for the 100M and 500M triples datasets: (a) 100M/HP, (b) 500M/HP, (c) 100M/SBP, (d) 500M/SBP, (e) 100M/PBP, (f) 500M/PBP. Reading key: the higher, the better.

Fig. 11. Partitioning technique ranking scores (SBP, HP, PBP) per storage backend (Avro, CSV, ORC, Parquet, Hive) for the 100M and 500M triples datasets: (a) 100M/ST, (b) 500M/ST, (c) 100M/VT, (d) 500M/VT, (e) 100M/PT, (f) 500M/PT. Reading key: the higher, the better.

1 other two techniques. However, HP outperforms the quet are sharing the best performing backend rank with 1 2 SBP having a higher rank with the PT schema and for 50% for each of them. Hive directly follows them in 2 3 Avro file format. the third best rank with 100%. HDFS CSV and Avro 3 4 Considering the 500M triples dataset, we still ob- file formats respectively keep performing the worst, 4 5 serve the same pattern of performance. That is, the having the lowest rank scores with 100% of these men- 5 6 Subject-based still outperforms the other techniques tioned cases. However, for the PBP partitioning in 6 7 with more than 73% of the ranking times. The Predicate- 100M with the VT schema, Avro is the best performing 7 8 based partitioning is still performing the worst in the backend, followed by CSV. For the 500M with the VT 8 9 majority of ranking times with more than 46%. How- schema, and for PBP technique, ORC and Parquet re- 9 10 ever, it outperformed the other techniques achieving spectively still in the best ranks, but Avro, interestingly, 10 11 the highest rank by three times with the ST schema outperforms Hive and CSV. 11 12 and for CSV file format, and with the VT schema for Looking at the results of the 100M and 500M triples 12 13 CSV and Avro file formats. datasets with the PT schema and for HP and SBP, we 13 14 can find that, the HDFS ORC and Parquet are still shar- 14 6.2.3. Storage Backends Ranking Analysis 15 ing the best performing backend with 50% for each of 15 Last but not least, we investigate how different stor- 16 them in this ranking group, immediately followed by 16 age backends impact the performance in our experi- 17 Hive as in the third best rank with 100%. CSV and Avro 17 ments. 18 are respectively the worst performing storage formats 18 Figure 12 shows how many times a particular stor- 19 with 100% of this mentioned ranking group. However, 19 age backend achieves the best or the lowest perfor- 20 for the PBP partitioning of both the 100M and 500M 20 mance. The figure presents the results of all experi- 21 datasets, Avro is significantly shown to be the best per- 21 ments, for different relational schema ST, VT, and PT 22 forming backend followed by ORC with 100% of the 22 respectively, and for our three partitioning techniques 23 cases. 23 24 (HP, SB, and PB). Specifically, ranking scores for 24 25 the 100M datasets are on the left, i.e., sub-figures 11 6.2.4. Combined Ranking Analysis 25 26 (a), (c), and (e)); ranking scores for the 500M triples Table 2 shows all our possible different configura- 26 27 dataset are on the right, i.e., sub-figures 11 (b), (d), and tion combinations as shown from Figure 5. For in- 27 28 ( f )). stance the (a.i.1) representing the ST schema, Hori- 28 29 The reading key is still the higher the rank the bet- zontally partitioned, and stored in Avro storage back- 29 30 ter it is. Thus, we indicate a particular storage back- end. From the table, we can see that we have different 30 31 end outperforms others when it achieves higher rank- 45 possible configuration combinations. The next three 31 32 ing score amongst them. Similarly, when a storage for- columns Rf, Rp, and Rs include ranking score values 32 33 mat score column is missing, this indicates its rank which are calculated for storage formats, partitioning 33 th 34 score is zero (i.e. it always comes at the last rank [5 techniques, and relational schemas, respectively. The 34 35 rank in our case]). 
colored cells indicate the top 3 best-performing con- 35 36 For both the 100M and 500M triples datasets with figurations for each column. 36 37 the ST relational schema and HP and SB partitioning Looking at Table 2, we observe that ranking over 37 38 techniques, we observe that HDFS ORC is the best one of the dimensions and ignoring the others ends 38 39 performing backend with 100%. Hive is immediately up with selecting different configuration. For instance, 39 40 following, and then Parquet by 100% of these rank- ranking over R f , i.e., storage backend/format; Rp, 40 41 ing cases. On the other hand, HDFS CSV and Avro i.e., partitioning technique, or Rs, i.e.,the relational 41 42 file formats are respectively the worst performing on schema, end up selecting different combinations of 42 43 HDFS with 100% of the mentioned cases. The 500M Schema, Partitioning and Storage backends. 43 44 ST schema dataset scores in the PBP has the same pre- Since focusing on one raking at time leads to con- 44 45 vious ranking results for the storage backends. How- tradicting results, as shown in Table 3, we opt for a 45 46 ever, for the 100M ST schema, and in the PBP parti- combined ranking criterion. In Section 5, we presented 46 47 tioning technique, Avro is the second best-performing three i.e., Rta, WAvg and AVG, that are shown in Ta- 47 48 storage format coming after ORC. ble 4. 48 49 While for 100M and 500M triples datasets with the In particular, we use Table 4 to assess each crite- 49 50 VT relational schema and HP and SBP partitioning rion by checking its ranking results (i.e., top results/- 50 51 techniques, we can still observe that ORC and Par- configurations) against the actual queries results with 51 16 Mohamed Ragab et al. / Systematic Performance Analysis of Distributed SPARQL Query Answering Using Spark-SQL

Fig. 12. Storage backend ranking scores (ORC, Parquet, Hive, Avro, CSV) per partitioning technique (HP, SBP, PBP) for the 100M and 500M triples datasets: (a) 100M/ST, (b) 500M/ST, (c) 100M/VT, (d) 500M/VT, (e) 100M/PT, (f) 500M/PT. Reading key: the higher, the better.

1 1

          ------------------ 100M ------------------     ------------------ 500M ------------------
          Rf     Rp     Rs     Rta    WAvg   AVG         Rf     Rp     Rs     Rta    WAvg   AVG
a.i.1     0.27   0.45   0.23   0.10   1.13   0.32        0.41   0.68   0.14   0.14   1.50   0.41
a.i.2     0.00   0.23   0.09   0.01   0.32   0.11        0.00   0.05   0.14   0.00   0.19   0.06
a.i.3     0.82   0.32   0.27   0.19   1.96   0.47        0.77   0.18   0.23   0.12   1.69   0.39
a.i.4     0.64   0.55   0.32   0.24   1.94   0.50        0.59   0.41   0.27   0.17   1.66   0.42
a.i.5     0.77   0.55   0.36   0.30   2.19   0.56        0.73   0.36   0.27   0.19   1.85   0.45
a.ii.1    0.25   0.86   0.18   0.14   1.46   0.43        0.30   0.55   0.14   0.09   1.19   0.33
a.ii.2    0.00   0.68   0.09   0.02   0.77   0.26        0.00   0.64   0.14   0.03   0.78   0.26
a.ii.3    0.93   0.95   0.36   0.52   2.86   0.75        1.00   0.73   0.36   0.45   2.76   0.70
a.ii.4    0.66   0.95   0.27   0.35   2.32   0.63        0.59   0.59   0.18   0.19   1.75   0.45
a.ii.5    0.68   0.95   0.36   0.41   2.44   0.66        0.61   0.64   0.32   0.26   1.98   0.52
a.iii.1   0.59   0.18   0.32   0.12   1.48   0.36        0.32   0.27   0.23   0.07   1.03   0.27
a.iii.2   0.05   0.59   0.18   0.05   0.85   0.27        0.00   0.82   0.18   0.05   1.00   0.33
a.iii.3   0.98   0.23   0.55   0.30   2.41   0.59        0.95   0.59   0.45   0.42   2.62   0.66
a.iii.4   0.45   0.00   0.41   0.06   1.16   0.29        0.59   0.50   0.41   0.25   1.89   0.50
a.iii.5   0.43   0.00   0.45   0.06   1.17   0.29        0.64   0.50   0.55   0.32   2.12   0.56
b.i.1     0.30   0.36   0.73   0.20   1.59   0.46        0.27   0.18   0.86   0.15   1.49   0.44
b.i.2     0.23   0.55   0.77   0.24   1.70   0.52        0.07   0.36   0.82   0.13   1.30   0.42
b.i.3     0.77   0.59   0.68   0.46   2.55   0.68        0.80   0.32   0.73   0.36   2.38   0.62
b.i.4     0.70   0.59   0.68   0.43   2.44   0.66        0.82   0.18   0.82   0.32   2.37   0.61
b.i.5     0.52   0.55   0.64   0.32   2.06   0.57        0.55   0.55   0.86   0.42   2.33   0.65
b.ii.1    0.39   0.86   0.73   0.42   2.24   0.66        0.32   0.59   0.82   0.31   1.94   0.58
b.ii.2    0.20   0.59   0.77   0.24   1.69   0.52        0.07   0.55   0.82   0.18   1.49   0.48
b.ii.3    0.57   0.73   0.68   0.43   2.36   0.66        0.80   0.59   0.73   0.50   2.65   0.71
b.ii.4    0.75   0.77   0.73   0.56   2.75   0.75        0.75   0.77   0.77   0.58   2.79   0.76
b.ii.5    0.61   0.91   0.68   0.53   2.61   0.73        0.57   0.91   0.77   0.55   2.63   0.75
b.iii.1   0.73   0.27   0.86   0.35   2.35   0.62        0.64   0.73   0.95   0.59   2.75   0.77
b.iii.2   0.57   0.41   0.91   0.38   2.27   0.63        0.09   0.59   0.95   0.23   1.69   0.54
b.iii.3   0.39   0.18   0.77   0.17   1.60   0.45        0.80   0.59   0.91   0.58   2.83   0.77
b.iii.4   0.57   0.14   0.86   0.23   1.95   0.52        0.75   0.55   0.95   0.55   2.75   0.75
b.iii.5   0.25   0.05   0.82   0.09   1.29   0.37        0.23   0.05   0.82   0.08   1.25   0.37
c.i.1     0.42   0.78   0.55   0.33   2.02   0.58        0.39   0.56   0.50   0.23   1.70   0.48
c.i.2     0.14   0.50   0.64   0.16   1.37   0.43        0.19   0.50   0.55   0.16   1.37   0.41
c.i.3     0.67   0.56   0.55   0.35   2.22   0.59        0.78   0.50   0.55   0.36   2.35   0.61
c.i.4     0.83   0.56   0.50   0.39   2.44   0.63        0.67   0.39   0.41   0.23   1.91   0.49
c.i.5     0.44   0.56   0.50   0.25   1.80   0.50        0.47   0.44   0.36   0.18   1.59   0.43
c.ii.1    0.28   0.61   0.59   0.23   1.66   0.49        0.31   0.89   0.55   0.31   1.95   0.58
c.ii.2    0.14   0.94   0.64   0.27   1.82   0.57        0.14   0.89   0.55   0.23   1.67   0.53
c.ii.3    0.86   0.83   0.45   0.49   2.72   0.71        0.78   0.94   0.41   0.48   2.65   0.71
c.ii.4    0.75   0.83   0.50   0.47   2.58   0.69        0.81   0.94   0.55   0.57   2.84   0.77
c.ii.5    0.47   0.83   0.45   0.33   2.07   0.59        0.47   0.89   0.41   0.33   2.09   0.59
c.iii.1   0.86   0.11   0.32   0.14   1.87   0.43        0.75   0.06   0.32   0.10   1.63   0.38
c.iii.2   0.50   0.06   0.41   0.09   1.30   0.32        0.14   0.11   0.36   0.04   0.70   0.20
c.iii.3   0.56   0.11   0.18   0.06   1.22   0.28        0.72   0.06   0.14   0.05   1.40   0.31
c.iii.4   0.28   0.11   0.23   0.04   0.80   0.21        0.44   0.17   0.14   0.05   1.05   0.25
c.iii.5   0.31   0.11   0.23   0.04   0.85   0.22        0.44   0.17   0.14   0.05   1.05   0.25

Table 2
Configuration ranking criteria for the 100M and 500M triples datasets
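The rank scores in Table 2 drive the configuration selection discussed next. As a minimal illustration (a sketch of ours, not the benchmarking code; the dictionary below copies a few values from the 100M Rf column of Table 2), the top three configurations under one criterion can be obtained as follows:

    # Hypothetical excerpt of the 100M Rf column of Table 2:
    # configuration combination -> rank score.
    rf_scores = {
        "a.iii.3": 0.98, "a.ii.3": 0.93, "c.ii.3": 0.86,
        "a.i.3": 0.82, "b.i.3": 0.77, "a.i.5": 0.77,
    }

    # The three configurations with the highest Rf rank score
    # (these are the coloured cells that feed Table 3 below).
    top3_rf = sorted(rf_scores, key=rf_scores.get, reverse=True)[:3]
    print(top3_rf)  # ['a.iii.3', 'a.ii.3', 'c.ii.3']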

these configurations. To achieve that, we pick the top three configurations (coloured in Table 2) with the highest rank score under each of the ranking criteria columns (i.e., Rf, Rp, Rs, Rta, AVG, and WAvg). Afterwards, we list the ranking of these configurations for each query (Q1,...,Q11) by calculating the performance rank position of that configuration for that query. For example, the configuration combination (a.iii.3) has the 29th performance rank position for Q1, the 20th for Q2, and so on.

          ------------ 100M ------------     ------------ 500M ------------
          1st        2nd        3rd          1st        2nd        3rd
Rf        a.iii.3    a.ii.3     c.ii.3       a.ii.3     a.iii.3    b.i.4
Rp        a.ii.3     a.ii.5     a.ii.4       c.ii.3     c.ii.4     b.ii.5
Rs        b.iii.2    b.iii.1    b.iii.4      b.iii.1    b.iii.2    b.iii.4

Table 3
Non-overlapping top three best-performing configuration combinations for the Rf, Rp, and Rs ranking criteria

The coloured cells in Table 4 indicate the top 15 out of the 45 ranking positions for the selected top configurations across the queries. In addition, we calculated, for each criterion, the average number of times a specific configuration combination falls within the top 15 ranking positions. For instance, under the Rf ranking criterion, the first selected configuration (a.iii.3) reaches a top-15 ranking position (the 11th) for only one query (Q11). The second selected configuration (a.ii.3) reaches the top 15 ranks four times, for queries Q2, Q4, Q9, and Q10, while the (c.ii.3) configuration achieves this five times, for queries Q3, Q4, Q5, Q6, and Q8. Therefore, the average of the Rf criterion for choosing the best configuration is 3 (we neglect the fractions), as shown in the table. The remaining ranking criteria and the combined ranking techniques are calculated in a similar way; a small sketch of this rank-position bookkeeping is given below.
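For illustration, the following Python fragment (ours; the runtimes are placeholders, not measured values) shows how the per-query rank position of a configuration and its "ranking<15" count in Table 4 are obtained:

    # Hypothetical runtimes (seconds) of configuration combinations for
    # two queries; the real experiments cover 45 combinations and Q1-Q11.
    runtimes = {
        "Q1": {"a.iii.3": 210.0, "a.ii.3": 175.4, "c.ii.3": 98.7},
        "Q2": {"a.iii.3": 88.1,  "a.ii.3": 64.3,  "c.ii.3": 66.0},
    }

    def rank_position(config, per_query):
        # 1-based performance rank of `config` for one query:
        # position 1 is the fastest configuration.
        ordered = sorted(per_query, key=per_query.get)
        return ordered.index(config) + 1

    def top15_count(config, runtimes):
        # The "ranking<15" column of Table 4: the number of queries for
        # which `config` lands in a rank position below the 15th.
        return sum(rank_position(config, q) < 15 for q in runtimes.values())

    print(top15_count("c.ii.3", runtimes))  # 2 in this toy example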
The next step is to calculate the accuracy of each criterion using the formula in Equation 4, in order to determine which ranking criterion should be used to choose the best configuration combination:

    Acc(cr) = \sum_{i=1}^{3} \frac{N(i)}{33}, \quad cr \in \{Rf, Rp, Rs, \dots\}    (4)

In Equation 4, i ranges over the three selected top configurations of a ranking criterion (cr), and N(i) is the number of times configuration i occurs in a query ranking position below the 15th rank; the denominator 33 corresponds to the 3 selected configurations times the 11 queries. Applying this formula to each criterion, we observed that, for 100M, the results are 30%, 33%, 39%, 64%, 55%, and 64% for the Rf, Rp, Rs, AVG, WAvg, and Rta ranking criteria, respectively. For 500M, the results of applying Equation 4 are 48%, 55%, 55%, 64%, 67%, and 76% for Rf, Rp, Rs, AVG, WAvg, and Rta, respectively; a worked check for the 100M figures follows.
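The sketch below (ours) plugs the "ranking<15" counts N(i) of Table 4 into Equation 4 and reproduces the reported 100M accuracies:

    # Top-15 counts of the three selected configurations per criterion
    # (the "ranking<15" column of Table 4, 100M dataset).
    N = {
        "Rf":   [1, 4, 5],   # a.iii.3, a.ii.3, c.ii.3
        "Rp":   [4, 4, 3],   # a.ii.3, a.ii.5, a.ii.4
        "Rs":   [2, 5, 6],   # b.iii.2, b.iii.1, b.iii.4
        "AVG":  [9, 4, 8],   # b.ii.4, a.ii.3, b.ii.5
        "WAvg": [4, 9, 5],   # a.ii.3, b.ii.4, c.ii.3
        "Rta":  [9, 8, 4],   # b.ii.4, b.ii.5, a.ii.3
    }

    for criterion, counts in N.items():
        # Equation 4: share of the 3 configurations x 11 queries = 33
        # query-rank slots in which a selected configuration is in the top 15.
        acc = sum(counts) / 33
        print(f"{criterion}: {acc:.0%}")
    # -> Rf: 30%, Rp: 33%, Rs: 39%, AVG: 64%, WAvg: 55%, Rta: 64%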

From the 100M and 500M results, it appears that the triangle area (Rta) criterion is the most accurate for measuring the performance of configurations in our SP2Bench query workload scenario. The average (AVG) and the weighted average (WAvg) follow right after. Notably, we used the 100M triples dataset table to describe how the ranking calculations are done; we omit the tables of the other datasets for the sake of conciseness, but the data and analysis can be found on the repository web page. Finally, Table 5 summarizes the experimental analysis, highlighting the best and worst combinations of schema, partitioning technique, and storage backend for each query of the SP2Bench workload.

Rf        Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11   ranking<15   AVG
a.iii.3   29  20  32  26  27  30  18  41  20   22   11        1
a.ii.3    26  14  18   1  17  25  17  17   5    9   16        4        3
c.ii.3    15  15   1   9   5   1  31   3  31   27   26        5

Rp        Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11   ranking<15   AVG
a.ii.3    26  14  18   1  17  25  17  17   5    9   16        4
a.ii.5    27  24  26   2  13  27  16  24   7   14   19        4        4
a.ii.4    23  23  24   4  24  26  19  23   8   13   21        3

Rs        Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11   ranking<15   AVG
b.iii.2    5  22  29  32  18  19  11  19  25   17   15        2
b.iii.1   18  17  17  41   9  18  12  30  14   12    9        5        4
b.iii.4   12   6  23  44  23  22  14  36   4    7   10        6

AVG       Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11   ranking<15   AVG
b.ii.4     9  10  12  27   2   7   6  18   1    8    2        9
a.ii.3    26  14  18   1  17  25  17  17   5    9   16        4        7
b.ii.5    16   2  10  28  14  14   7  16   2    6    4        8

WAvg      Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11   ranking<15   AVG
a.ii.3    26  14  18   1  17  25  17  17   5    9   16        4
b.ii.4     9  10  12  27   2   7   6  18   1    8    2        9        6
c.ii.3    15  15   1   9   5   1  31   3  31   27   26        5

Rta       Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10  Q11   ranking<15   AVG
b.ii.4     9  10  12  27   2   7   6  18   1    8    2        9
b.ii.5    16   2  10  28  14  14   7  16   2    6    4        8        7
a.ii.3    26  14  18   1  17  25  17  17   5    9   16        4

Table 4
100M triples dataset ranking criteria comparison (AVG is the per-criterion average of the ranking<15 counts, fractions neglected)

        ---------- 100M ----------     ---------- 500M ----------
        BEST           WORST           BEST           WORST
Q1      c.i.2          c.iii.4         c.ii.2         a.i.2
Q2      b.ii.3         a.i.2           b.iii.4        a.i.2
Q3      c.ii.3         a.i.2           c.ii.3         a.i.2
Q4      a.ii.3         b.iii.3         a.ii.3         a.i.2
Q5      c.i.5          a.i.2           b.iii.3        a.i.2
Q6      c.ii.3         a.iii.2         c.ii.3         a.i.2
Q7      b.ii.1         a.i.2           b.ii.4         a.i.2
Q8      c.iii.4        a.iii.5         c.iii.4        a.i.2
Q9      b.ii.4         a.i.2           b.iii.4        a.i.2
Q10     b.iii.3        c.iii.4         b.iii.3        c.iii.2
Q11     b.i.3, b.ii.4  c.iii.4         b.i.3, b.ii.5  c.iii.2

Table 5
Best and worst configurations for running SP2Bench queries

7. Related Work

Several experimental evaluations and comparisons of relational-based processing of SPARQL queries over RDF databases have been presented in the literature [8, 14]. For example, Schmidt et al. [14] performed an experimental comparison between existing RDF storage approaches using the SP2Bench performance suite and pure relational models of RDF data, namely the Single Triples relation, Flattened Tables of clustered properties, and Vertically Partitioned relations. In particular, they compared the native RDF scenario, using the Sesame SPARQL engine (currently known as RDF4J, https://rdf4j.eclipse.org/) on top of a native RDF store with the SP2Bench dataset, against a pure translation of the same SP2Bench scenario into relational database technologies. Another experimental comparison of the Single Triples Table and the vertically partitioned relational schemas was conducted by Alexaki et al. [24], which clearly shows the additional cost of unioning predicate tables in the vertically partitioned tables scenario. This experiment was similar to the ones performed by Abadi et al. [15] and, later, by Sidirourgos et al. [25], who used the Barton library catalog data scenario (http://simile.mit.edu/rdf-test-data/barton) to evaluate a similar comparison between the Single Triples schema and the Vertical schema. On another side, Owens et al. [26] performed benchmarking experiments comparing different RDF stores (e.g., AllegroGraph, https://franz.com/agraph/allegrograph3.3/, and BigOWLIM, http://www.proxml.be/products/bigowlim.html) using different RDF benchmarks (e.g., LUBM, http://swat.cse.lehigh.edu/projects/lubm/) and RDBMS benchmarks (e.g., the Transaction Processing Performance Council TPC-C benchmark, http://www.tpc.org/tpcc/). That work focused on a detailed comparison of pure RDF stores using SPARQL, without any relational schema implementations or comparisons. To the best of our knowledge, our benchmarking study [9] (which we extend here to a distributed deployment) was the first to evaluate and compare various relational-based schemas for processing RDF queries on top of the big data processing framework Spark, and to evaluate different backend storage techniques.

A parallel and recent research effort similar to this work was conducted by Arrascue Ayala et al. [27]. The authors performed similar experiments to evaluate the performance of the Spark-SQL engine querying four relational-based RDF schemas, namely a single Triples Table (TT), Vertically Partitioned tables (VP), a Domain-Dependent Schema (DDS), and, differently from our paper, the Wide Property Tables schema (WPT). They showed that the WPT relational schema outperforms the other relational approaches. However, they evaluated only the Hive-with-Parquet storage backend, whereas in this paper we also evaluate the alternative storage backends of Spark. Moreover, partitioning in that paper is performed by Spark on the subject only, while in this paper we evaluate three different partitioning strategies, both logically and physically (i.e., likewise performed by Spark), namely Subject-based, Predicate-based, and Horizontal partitioning. In addition, that work evaluates the Spark-SQL system only at the micro-benchmarking level.

8. Conclusion

Apache Spark is a prominent big data framework that offers a high-level SQL interface, Spark-SQL, optimized by means of the Catalyst query optimizer. In this paper, we performed a systematic evaluation of the performance of the Spark-SQL query engine for answering SPARQL queries over different relational encodings of RDF datasets in a distributed setup. In particular, we studied the performance of Spark-SQL using two different storage backends, namely Hive and HDFS. For HDFS, we compared four different data formats, namely CSV, ORC, Avro, and Parquet. We used SP2Bench to generate our experimental RDF datasets. We translated the benchmark queries into SQL and stored the RDF data using Spark's DataFrame abstraction. To this extent, we evaluated three different approaches for RDF relational storage, i.e., the Single Triples Table schema, the Vertically Partitioned Tables schema, and the Property Tables schema. We also showed the impact of partitioning these relational schemas with three different partitioning techniques, namely Horizontal-based, Subject-based, and Predicate-based partitioning, applied on our five mentioned storage formats; a minimal sketch of these strategies follows.
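The following PySpark sketch (ours, for illustration only; the paths, column names, partition count, and predicate IRIs are assumptions, not our exact benchmarking code) shows the three partitioning strategies over a Single Triples Table, and a translated SPARQL basic graph pattern answered as SQL:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdf-partitioning").getOrCreate()

    # Single Triples Table: one row per RDF triple (columns s, p, o).
    triples = spark.read.parquet("hdfs:///sp2bench/triples")

    # Horizontal partitioning: rows spread evenly, regardless of content.
    triples.repartition(48).write.mode("overwrite").parquet("hdfs:///out/horizontal")

    # Subject-based partitioning: triples sharing a subject are co-located,
    # which favours subject-subject joins (star-shaped patterns).
    triples.repartition(48, "s").write.mode("overwrite").parquet("hdfs:///out/subject")

    # Predicate-based partitioning: triples sharing a predicate are co-located.
    triples.repartition(48, "p").write.mode("overwrite").parquet("hdfs:///out/predicate")

    # A SPARQL basic graph pattern translated to SQL over the triples view:
    # a self-join on the subject (predicate IRIs shortened for readability).
    triples.createOrReplaceTempView("triples")
    spark.sql("""
        SELECT t1.s, t1.o AS title, t2.o AS year
        FROM triples t1 JOIN triples t2 ON t1.s = t2.s
        WHERE t1.p = 'dc:title' AND t2.p = 'dcterms:issued'
    """).show()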
As a future extension of this work, we aim to conduct our benchmarking study with other benchmarks, such as WatDiv and the state-of-the-art LDBC benchmarks, covering different types of query shapes and complexities.

References

[1] I. Abdelaziz, R. Harbi, S. Salihoglu, P. Kalnis and N. Mamoulis, SPARTex: A Vertex-Centric Framework for RDF Data Analytics, PVLDB 8(12) (2015), 1880–1883. doi:10.14778/2824032.2824091. http://www.vldb.org/pvldb/vol8/p1880-abdelaziz.pdf.
[2] M. Hallo, S. Luján-Mora, A. Maté and J. Trujillo, Current state of Linked Data in digital libraries, Journal of Information Science 42(2) (2016).
[3] M. Zaharia, R.S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M.J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker and I. Stoica, Apache Spark: a unified engine for big data processing, Commun. ACM 59(11) (2016), 56–65. doi:10.1145/2934664.
[4] G. Agathangelos, G. Troullinou, H. Kondylakis, K. Stefanidis and D. Plexousakis, RDF Query Answering Using Apache Spark: Review and Assessment, in: 34th IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2018, Paris, France, April 16-20, 2018, 2018, pp. 54–59. doi:10.1109/ICDEW.2018.00016.
[5] M. Wylot and S. Sakr, Framework-Based Scale-Out RDF Systems, in: Encyclopedia of Big Data Technologies, 2019. doi:10.1007/978-3-319-63962-8_225-1.
[6] S. Sakr, GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries, in: DASFAA, 2009.
[7] I. Abdelaziz, R. Harbi, Z. Khayyat and P. Kalnis, A survey and experimental comparison of distributed SPARQL engines for very large RDF data, Proceedings of the VLDB Endowment 10(13) (2017), 2049–2060.
[8] S. Sakr and G. Al-Naymat, Relational processing of RDF queries: a survey, ACM SIGMOD Record 38(4) (2010), 23–28.
[9] M. Ragab, R. Tommasini and S. Sakr, Benchmarking Spark-SQL under Alliterative RDF Relational Storage Backends (2019).
[10] A. Akhter, A.-C.N. Ngonga and M. Saleem, An empirical evaluation of RDF graph partitioning techniques, in: European Knowledge Acquisition Workshop, Springer, 2018, pp. 3–18.
[11] J. Camacho-Rodríguez, A. Chauhan, A. Gates, E. Koifman, O. O'Malley, V. Garg, Z. Haindrich, S. Shelukhin, P. Jayachandran, S. Seth, D. Jaiswal, S. Bouguerra, N. Bangarwa, S. Hariappan, A. Agarwal, J. Dere, D. Dai, T. Nair, N. Dembla, G. Vijayaraghavan and G. Hagleitner, Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, 2019, pp. 1773–1786. doi:10.1145/3299869.3314045.
[12] M. Armbrust, R.S. Xin, C. Lian, Y. Huai, D. Liu, J.K. Bradley, X. Meng, T. Kaftan, M.J. Franklin, A. Ghodsi and M. Zaharia, Spark SQL: Relational Data Processing in Spark, in: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31 - June 4, 2015, 2015, pp. 1383–1394. doi:10.1145/2723372.2742797.
[13] K. Alaoui, A categorization of RDF triplestores, in: Proceedings of the 4th International Conference on Smart City Applications, 2019, pp. 1–7.

[14] M. Schmidt, T. Hornung, N. Küchlin, G. Lausen and C. Pinkel, An Experimental Comparison of RDF Data Management Approaches in a SPARQL Benchmark Scenario, in: The Semantic Web - ISWC 2008, 7th International Semantic Web Conference, ISWC 2008, Karlsruhe, Germany, October 26-30, 2008, Proceedings, 2008, pp. 82–97. doi:10.1007/978-3-540-88564-1_6.
[15] D.J. Abadi, A. Marcus, S.R. Madden and K. Hollenbach, Scalable semantic web data management using vertical partitioning, in: VLDB, 2007.
[16] E.I. Chong, S. Das, G. Eadon and J. Srinivasan, An efficient SQL-based RDF querying scheme, in: VLDB, 2005.
[17] A. Schätzle, M. Przyjaciel-Zablocki, A. Neu and G. Lausen, Sempala: Interactive SPARQL query processing on Hadoop, in: International Semantic Web Conference, Springer, 2014, pp. 164–179.
[18] A. Schätzle, M. Przyjaciel-Zablocki, S. Skilevic and G. Lausen, S2RDF: RDF querying with SPARQL on Spark, Proceedings of the VLDB Endowment 9(10) (2016), 804–815.
[19] G. Karypis and V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM Journal on Scientific Computing 20(1) (1998), 359–392.
[20] J. Gray, Database and Transaction Processing Performance Handbook, 1993.
[21] M. Schmidt, T. Hornung, G. Lausen and C. Pinkel, SP^2Bench: A SPARQL Performance Benchmark, in: Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China, 2009, pp. 222–233. doi:10.1109/ICDE.2009.28.
[22] M. Saleem, G. Szárnyas, F. Conrads, S.A.C. Bukhari, Q. Mehmood and A.-C. Ngonga Ngomo, How Representative Is a SPARQL Benchmark? An Analysis of RDF Triplestore Benchmarks, in: The World Wide Web Conference, ACM, 2019, pp. 1623–1633.
[23] T. Ivanov and M. Pergolesi, The impact of columnar file formats on SQL-on-Hadoop engine performance: A study on ORC and Parquet, Concurrency and Computation: Practice and Experience (2019), e5523.
[24] S. Alexaki, V. Christophides, G. Karvounarakis and D. Plexousakis, On Storing Voluminous RDF Descriptions: The Case of Web Portal Catalogs, in: Proceedings of the Fourth International Workshop on the Web and Databases, WebDB 2001, Santa Barbara, California, USA, May 24-25, 2001, in conjunction with ACM PODS/SIGMOD 2001, Informal proceedings, 2001, pp. 43–48.
[25] L. Sidirourgos, R. Goncalves, M.L. Kersten, N. Nes and S. Manegold, Column-store support for RDF data management: not all swans are white, PVLDB 1(2) (2008), 1553–1563. doi:10.14778/1454159.1454227. http://www.vldb.org/pvldb/1/1454227.pdf.
[26] A. Owens, N. Gibbins et al., Effective benchmarking for RDF stores using synthetic data (2008).
[27] V.A. Arrascue Ayala, P. Koleva, A. Alzogbi, M. Cossu, M. Färber, P. Philipp, G. Schievelbein, I. Taxidou and G. Lausen, Relational schemata for distributed SPARQL query processing, in: Proceedings of the International Workshop on Semantic Big Data, 2019, pp. 1–6.
[28] M. Wylot, M. Hauswirth, P. Cudré-Mauroux and S. Sakr, RDF data storage and query processing schemes: A survey, ACM Computing Surveys (CSUR) 51(4) (2018), 84.
[29] T. Neumann and G. Weikum, RDF-3X: a RISC-style engine for RDF, Proceedings of the VLDB Endowment 1(1) (2008), 647–659.
[30] C. Weiss, P. Karras and A. Bernstein, Hexastore: sextuple indexing for semantic web data management, PVLDB 1(1) (2008).
[31] J.J. Levandoski and M.F. Mokbel, RDF data-centric storage, in: 2009 IEEE International Conference on Web Services, IEEE, 2009, pp. 911–918.
[32] A. Akhter, A.-C. Ngonga Ngomo and M. Saleem, An Empirical Evaluation of RDF Graph Partitioning Techniques, in: EKAW, 2018. doi:10.1007/978-3-030-03667-6_1.