
A Comparison of Different Graph Database Types


Jieru Yao

August 22, 2018

MSc in High Performance Computing with Data Science
The University of Edinburgh
Year of Presentation: 2018

Abstract

In the modern era of information, a great amount of data is produced and changed every day by different enterprises and individuals. The Database Management System (DBMS), as an effective and efficient means of data storage, management, maintenance and security, is highly popular in areas such as business, industry and education. The Relational Database Management System is the most familiar kind and is known to a great many people. A Relational Database uses 'Tables' as its basic storage units. Data of various types and categories in real life can be abstracted into different 'Tables' as entities, while a relationship is represented by a correlation between two entities, which means a correlation table is created whenever two entities are related. A negative consequence is that a great number of correlation tables are produced when there is more than one relationship between entities. In other words, if the relationships among entities are complex, it is difficult for designers to model the data using a Relational Database. Therefore, when such problems arise, an alternative type of database based on a graph structure needs to be introduced to improve performance: the Graph Database. A Graph Database uses nodes and edges to represent data with complex relationships. This basic concept helps the DBMS arrange and simplify the sophisticated relationships in massive data sets, which contributes to improving database performance. More detailed information is presented and discussed in later chapters. The dissertation mainly focuses on the performance of two different types of Graph Database (i.e. the RDF and LPG types). Their different storage data types are tested and analysed with different sizes of data set on a Windows 10 system and an Ubuntu 16.04 system.

Contents

Chapter 1 Introduction ...... 1

1.1 The importance of Graph Database ...... 1

1.2 Objectives ...... 4

Chapter 2 Literature Review ...... 5

2.1 Graph Databases ...... 5

2.2 RDF Graph Databases ...... 6

2.2.1 RDF graph ...... 6

2.2.2 SPARQL ...... 8

2.2.3 OpenLink Virtuoso ...... 8

2.3 Labeled Property Graph Database ...... 9

2.3.1 Labeled Property Graph ...... 9

2.3.2 Neo4j ...... 12

2.4 Open source data sets available in RDF ...... 12

2.4.1 DBpedia ...... 12

2.5 Introduction of two formats of RDF data sets ...... 13

2.5.1 Turtle ...... 13

2.5.2 N-triples ...... 13

Chapter 3 Research Methodology ...... 14

3.1 Data sets preparation from DBpedia ...... 14

3.2 Loading RDF data sets into OpenLink Virtuoso database ...... 15

3.2.1 Loading RDF data sets into OpenLink Virtuoso database on Windows 10 system ...... 15

3.2.2 Loading RDF data sets into OpenLink Virtuoso database on Ubuntu 16.04 system ...... 18

3.3 Loading RDF data sets into Neo4j ...... 19

3.3.1 Loading RDF data sets into Neo4j database on Windows 10 system ...... 19

3.3.2 Loading RDF data sets into Neo4j database on Ubuntu 16.04 system...... 20

3.4 Measuring loading times on Windows 10 system ...... 20

3.4.1 Measuring loading times of Virtuoso database on Windows 10 system ...... 21

3.4.2 Measuring loading times of Neo4j database on Windows 10 system ...... 23

3.5 Measuring loading times on Ubuntu 16.04 system ...... 24

3.5.1 Measuring loading times of Virtuoso database on Ubuntu 16.04 system .....25

3.5.2 Measuring loading times of Neo4j database on Ubuntu 16.04 system ...... 26

3.6 Measuring query times on Windows 10 system ...... 26

3.6.1 Measuring query times of Virtuoso database on Windows 10 system ...... 26

3.6.2 Measuring query times of Neo4j database on Windows 10 system ...... 27

3.7 Measuring query times on Ubuntu 16.04 system ...... 27

3.7.1 Measuring query times of Virtuoso database on Ubuntu 16.04 system ...... 27

3.7.2 Measuring query times of Neo4j database on Ubuntu 16.04 system ...... 27

Chapter 4 Experimental Work Carried Out ...... 28

4.1 Hardware and Software configurations of test systems ...... 28

4.1.1 Windows 10 system ...... 28

4.1.2 Ubuntu 16.04 system ...... 28

4.2 Virtuoso Installation ...... 29

4.2.1 Virtuoso Installation on Windows system ...... 29

4.2.2 Virtuoso Installation on Ubuntu 16.04 ...... 31

4.3 Neo4j Installation ...... 32

4.3.1 Neo4j installation on Windows 10 system ...... 32

4.3.2 Neo4j installation on Ubuntu 16.04 system ...... 33

4.4 Transformation process from RDF graph to LPG ...... 33

4.4.1 How to use the transformation plugin in Neo4j on Windows 10 system ...... 39

4.4.2 How to use the transformation plugin in Neo4j on Ubuntu 16.04 system ....41

Chapter 5 Results and Analysis ...... 43

5.1 Loading time of two database systems on Windows 10 system ...... 43

5.1.1 Loading time of Virtuoso on Windows 10 system...... 43

5.1.2 Loading time of Neo4j on Windows 10 system ...... 44

5.1.3 Comparison of loading time of both database systems on Windows system 45

5.2 Loading time of two database systems on Ubuntu system ...... 46

5.2.1 Loading time of Virtuoso on Ubuntu system ...... 46

5.2.2 Loading time of Neo4j on Ubuntu system ...... 47

5.2.3 Comparison of loading time of both database systems on Ubuntu system ...49

5.3 Comparison of loading time on Windows system and Ubuntu system for Virtuoso ...... 50

5.4 Comparison of loading time on Windows system and Ubuntu system for Neo4j ...... 51

5.5 Query time of Virtuoso on both systems ...... 52

5.5.1 Query time of Virtuoso on Windows system ...... 52

5.5.2 Query time of Virtuoso on Ubuntu system...... 53

5.5.3 Comparison of query time on Windows system and Ubuntu system for Virtuoso ...... 54

5.6 Query time of Neo4j on both systems ...... 54

5.6.1 Query time of Neo4j on Windows system ...... 54

5.6.2 Query time of Neo4j on Ubuntu system ...... 55

5.6.3 Comparison of query time on Windows system and Ubuntu system for Neo4j ...... 56

Chapter 6 Conclusion and Further Work ...... 57

Appendix A Database Installation Package………………………………………...59

A.1 OpenLink Virtuoso on Windows 10 system…………...……………...59

A.2 Neo4j on Windows 10 system…………...……………………….…...59

Appendix B Original data sets from DBpedia……………………………………...60

B.1 Original data sets from DBpedia…………………………………...…60

Appendix C Detailed measurement results………………………………………..61

C.1 Loading time measured in Virtuoso on Windows 10 system………….61

C.2 Loading time measured in Neo4j on Windows 10 system……………61

C.3 Loading time measured in Virtuoso on Ubuntu 16.04 system………...61

C.4 Loading time measured in Neo4j on Ubuntu 16.04 system……………62

C.5 Query time measured in Virtuoso on Windows 10 system……………62

C.6 Query time measured in Neo4j on Windows 10 system………………62

C.7 Query time measured in Virtuoso on Ubuntu 16.04 system…………..62

C.8 Query time measured in Neo4j on Ubuntu 16.04 system……………..63

C.9 Query results in Neo4j………………………………………………...63

References ...... 64

List of Tables

Table 1: Measuring function for loading time in Virtuoso database………………..…23

Table 2: Statements executed for measuring loading time in each measurement on Windows 10 system………………………………………………………………...…24

Table 3: Statements executed for measuring loading time in each measurement on Ubuntu 16.04 system…………………………………………………….……………26

Table 4: Representation of RDF triples based on XML type………………….……….38

List of Figures

Figure 1: ‘Film’ instance represented in a Relational Database……………………….2

Figure 2: Film instance modelling in a Relational Database………………………...... 3

Figure 3: Film instance modelling in a Graph Database……………………………….4

Figure 4: Relational Database based on ‘Film’ instance…………………………….....5

Figure 5: Graph Database based on ‘Film’ instance…………………………………...6

Figure 6: An instance of RDF graph……………………………………………...... 8

Figure 7: An instance of RDF graph………………………………………………….10

Figure 8: An instance of LPG…………………………………………………………10

Figure 9: An instance in LPG…………………………………………………………11

Figure 10: A segment of Turtle syntax………………………………………………13

Figure 11: A segment of N-triples format……………………………………………...13

Figure 12: A segment of the original RDF data set written in N-Triples type…………16

Figure 13: ‘Quad Store Upload’ function in Virtuoso Conductor…………………..…17

Figure 14: SPARQL Execution in Virtuoso Conductor……………………………….17

Figure 15: ‘DirsAllowed’ parameter in Virtuoso configuration file on Windows 10 system…………………………………………………………………………………19

Figure 16: ‘DirsAllowed’ parameter in Virtuoso configuration file on Ubuntu 16.04 system[28]……………………………………………….…………………………....19

Figure 17: An example of loading function statement in Neo4j browser on Windows 10 system…………………………………………………………………………………20

Figure 18: Loading status and results after executing the statement in Figure 17……..21

Figure 19: ‘DB.DBA.LOAD_LIST’ table after ld_dir() function……………………..22

Figure 20: Results of loading time of RDF data set ‘rdfdata1.ttl’……………………..23

Figure 21: Loading time result for loading RDF data set ‘rdfdata1.ttl’…………..……25

Figure 22: Query statement for counting the number of distinct ‘Subject’ in RDF triples………………………………………………………………………………….27

Figure 23: Cypher query for counting the number of distinct labels of nodes in RDF data sets…………………………………………………………………………………….28

Figure 24: Two Eleanor Cloud instances for test systems…………………………….30

Figure 25: Main files and directories listed in Virtuoso Home Directory on Windows 10 system…………………………………………………………………………………31

Figure 26: Commands to have operations on Windows service for Virtuoso………….32

Figure 27: Virtuoso conductor visited from localhost………………………………....32

Figure 28: Commands for Virtuoso installation on Ubuntu 16.04 system……………..33

Figure 29: UI of Neo4j Desktop version……………………………………………….34

Figure 30: Commands used in Ubuntu system for Neo4j installation…………………34

Figure 31: Principle 1(Basic) of Transformation from RDF graph to LPG……………35

Figure 32: Principle 2(Basic) of Transformation from RDF graph to LPG……………36

Figure 33: Principle 3(Basic) of Transformation from RDF graph to LPG……………36

Figure 34: An instance shown in RDF graph from W3C page[31]……………………37

Figure 35: Principle 4(Improvement) Transformation from RDF graph to LPG……....38

Figure 36: Transformation result for the specific instance mentioned above………….39

Figure 37: A part of UI of Neo4j Desktop version 3.3.4………………………………41

Figure 38: A part of UI of Neo4j Desktop version 3.3.4……………………………….41

Figure 39: The prefix listed that the transformation plugin requires…………………..42

Figure 40: Average loading time of Virtuoso on Windows system…………………...45

Figure 41: Average loading time of Neo4j on Windows system……………………...46

Figure 42: Average loading time of Virtuoso and Neo4j on Windows system………..47

Figure 43: Average loading time of Virtuoso on Ubuntu system……………………...48

Figure 44: Average loading time of Neo4j on Ubuntu system - (100k to 1.5m)……….49

Figure 45: Average loading time of Neo4j on Ubuntu system - (after 1.5m)…………..50

Figure 46: Average loading time of Virtuoso and Neo4j on Ubuntu system……….…..51

Figure 47: Average loading time in Virtuoso on Windows system and Ubuntu system…..……………………………………………………………………………..52

Figure 48: Average loading time in Neo4j on Windows system and Ubuntu system……………………………………………………………………………...….53

Figure 49: Average query time in Virtuoso on Windows system…………………..….54

Figure 50: Average query time in Virtuoso on Ubuntu system……..………………….54

Figure 51: Average query time in Virtuoso on Windows system and Ubuntu system….55

Figure 52: Average query time in Neo4j on Windows system ………………………...56

Figure 53: Average query time of Neo4j on Ubuntu system……………………….…..56

Figure 54: Average query time in Neo4j on Windows system and Ubuntu system…….57

Acknowledgements

Firstly, I would like to thank my supervisor Dr. Charaka Palansuriya; with his kind help and support, I completed this dissertation and gained a deeper understanding of the area of Graph Database Systems.

Next, I would like to thank all EPCC staff and my classmates for their patient help and companionship during this year of study.

Finally, I would like to thank my parents for their financial and spiritual support.

Chapter 1

Introduction

1.1 The importance of Graph Database

The Database Management System (DBMS), as an effective and efficient container and repository for data storage, data processing and querying, brings a large number of advantages and benefits to various enterprises and individuals. The three mainstream database systems common in the past were Relational Databases, Hierarchical Databases and Network Databases, while in recent years database systems are commonly divided into Relational Databases and NoSQL Databases [1][2]. In more detail, NoSQL databases have four basic types: Key-Value-based, Column-based, Document-based and Graph-based [3][6]. Given the topic of this dissertation, Graph-based NoSQL databases will be described and discussed, from abstract concepts to real practice and applications, together with the design and results of a series of experiments. At the outset, however, it is important to explain why Graph Databases are needed in some circumstances in preference to Relational Databases. Relational databases are still widely used [4]; for instance, they are popular for financial transactions, where ACID transactions are needed. Nevertheless, Relational Databases have limitations, for example in representing data with complex relationships. Relational Databases use tables as their data structure: different entities are stored in different tables and the diverse relationships between entities are maintained between tables, whereas in Graph Databases a record is represented as a node and the connections between nodes represent the relationships. There are two types of Graph Databases based on different data storage types, the Labeled Property Graph (LPG) type and the RDF type [5][6], which will be introduced in later sections.

In recent years, networked data has become common and a huge amount of data is naturally described in a network-based form. The 'film' domain is a simple example. In a film setting, a designer encounters a series of sophisticated problems during the design and modelling process when using a Relational Database. For instance, the people involved in a movie include leading actors, supporting actors and directors. In a Relational Database, the people in a movie are abstracted as a 'Person' type corresponding to a table for storage. At the same time, a director can be an actor in another movie or TV series, a singer, or even an investor in some film and television companies; and these companies are usually themselves the investors behind a series of movies and TV series. As we can see, the interconnected relationships are extremely complex. In addition, there are often multiple different relationships between two entities at the same time. Such relationships are shown in Figure 1.

Figure 1: ‘Film’ instance represented in a Relational Database

When designers try to use a Relational Database to model these relationships, they need to set up a series of tables to represent all the kinds of entities: in more detail, a table representing a person, a table representing a movie, a table representing a TV series and a table representing a film company all need to be defined. These tables then need to be associated with a series of association tables that record exactly which movies, TV series, songs and companies a person has been involved in. At the same time, the designer needs to create further correlation tables to record who is the leading actor in a movie, who is the supporting actor, who is the director, and who did the special effects. As we can see, the designer needs a large number of association tables to record this series of complex relationships. As more entities are introduced, more and more association tables are required, making the modelling and the resulting solution increasingly cumbersome and error-prone when a Relational Database is used.

The problem above is caused by Relational Databases themselves, which are designed around the basic idea of entity modelling. This design concept provides no direct support for the relationships between entities. When designers need to describe such relationships, they usually have to create an association table to record them, and these association tables rarely hold any data other than keys. A primary key uniquely identifies a record, i.e. a row in a table, and the same value can also act as a foreign key referencing a record in another table. In other words, association tables only simulate the relationships between entities through the existing mechanisms of Relational Databases. This simulation is very likely to lead to two negative consequences: the Relational Database has to maintain the relationships between entities indirectly through the association tables, which makes the execution of the database inefficient, and the number of association tables increases sharply.

Figure 2: Film instance modelling in a Relational Database

As Figure 2 shows, the model representing the film instance in a Relational Database is more complex than the equivalent model in a Graph Database, which is shown in Figure 3. As we can see in Figure 3, the relationships between nodes can themselves carry properties, which is convenient because no additional correlation table is needed.

Figure 3: Film instance modelling in a Graph Database
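For illustration, a minimal Cypher sketch in the spirit of Figure 3 could look as follows; the labels, relationship types and property names used here are illustrative and are not the exact ones used later in the project:

    CREATE (p:Person  {name: 'Tom Hanks'})
    CREATE (m:Movie   {title: 'Forrest Gump', released: 1994})
    CREATE (c:Company {name: 'Paramount Pictures'})
    // the relationships themselves carry properties, so no correlation table is needed
    CREATE (p)-[:ACTED_IN   {role: 'Forrest Gump'}]->(m)
    CREATE (c)-[:INVESTED_IN {share: 0.6}]->(m)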

1.2 Objectives

The aim of the project is to compare two popular Graph Database Management Systems, Neo4j and OpenLink Virtuoso, where the former is based on the Labeled Property Graph data structure while the latter uses the RDF graph. The comparison has two main parts: the data type transformation between these two Graph Databases, and performance tests with different sizes of data set.

Chapter 2

Literature Review

2.1 Graph Databases

A Graph Database is a type of NoSQL database that uses a nodes-and-edges model to represent data [1]. Compared with a Relational Database, a 'node' in a Graph Database corresponds to a row of a 'table' (an entity) in a Relational Database, while the 'edges' of a Graph Database describe relationships, which in a Relational Database would be held in correlation tables between 'tables'. Taking 'Film' as an example, Figure 4 shows the tables storing the different entities (rectangles) together with the corresponding correlation tables representing the relationships (rhombuses) between entities. For each table, the diverse properties represent the columns (ovals) stored in that table.

Figure 4: Relational Database based on ‘Film’ instance

On the other side, Figure 5 presents the same 'Film' instance in terms of a Graph Database. In Figure 5, the nodes and edges represent entities and relationships respectively. The structure shown in Figure 5 with a Graph Database is obviously simpler than that shown in Figure 4 with a Relational Database.

Figure 5: Graph Database based on ‘Film’ instance

Both Figure 4 and Figure 5 show an instance of limited size. If thousands or millions of records were stored using this model, a Relational Database would become far more complex, with a great many correlation tables, so that querying such a sophisticated model could become a heavy burden with relatively low efficiency. A Graph Database can be an effective alternative, and the query efficiency would be improved [7][8].

2.2 RDF Graph Databases

2.2.1 RDF graph

RDF is the abbreviation of Resource Description Framework, which is a W3C standard for describing web resources [9][10], for instance the titles of web pages, the authors of web pages, the modification dates of web pages, their contents and their copyright information. In general, RDF was created to describe resources on the Web, and it can be written with XML syntax [10][11]. In this context, metadata is data that describes other data, i.e. information about information. RDF is intended to be read by computers rather than displayed to users; with RDF, search engines can understand the exact meaning of metadata. However, RDF is not limited to Web resources [11]: it can also be applied to any data that can be identified on the Web.

Coming back to the concept of a graph, RDF forms a graph that has three types of nodes: resource nodes, literal nodes and blank nodes [10]. These categories represent different kinds of node. A resource node represents a thing as a resource, while a literal node contains a value such as a string. There is a strict principle that an edge may point from a resource node to either a resource node or a literal node [9][10]; conversely, a literal node is not able to point to a resource node [9][10]. Next come the concepts of RDF triples. Data in RDF is described as S-P-O triples [9], where 'S' refers to the 'Subject', 'P' represents the 'Predicate' and 'O' means the 'Object'. Subjects and Predicates are defined by URIs (Uniform Resource Identifiers). A URI identifies a resource uniquely. It is important to distinguish a URI from a URL (Uniform Resource Locator): all URLs are URIs, but a URI is not necessarily a URL.

For example, suppose there is a web page with the URL 'https://www.jackWebsite/homePage' that was created by 'Jack'. This natural-language statement can be expressed in RDF with the 'Subject' 'https://www.jackWebsite/homePage', the 'Predicate' 'Author' and the 'Object' 'Jack'. In an RDF graph, the parts of a triple are mapped to nodes and edges: in general, nodes represent the 'Subject' and the 'Object', whereas an edge refers to the 'Predicate'. The instance above, shown as an RDF graph, is displayed in Figure 6.


Figure 6: An instance of RDF graph

In terms of modelling languages for RDF data, both RDFS (RDF Schema) and OWL (Web Ontology Language) can be used for ontology description [12][13]. An ontology is important because it describes a thing with its structure, its inherent relationships and any constraints that exist, which makes it convenient to inspect intuitively. One difference between the two modelling languages is that OWL has a richer vocabulary for data description than RDFS [15]. Both the amount and the richness of the vocabulary have made OWL increasingly popular, and applications based on OWL are widespread [14].

2.2.2 SPARQL

SPARQL (SPARQL Protocol and RDF Query Language) is a query language for RDF [17]. It can be used in RDF-type graph databases such as OpenLink Virtuoso. Following the structure of an RDF graph, SPARQL queries are also based on S-P-O triple patterns. The main difference between SPARQL and other query languages, for instance SQL for relational databases, is that SPARQL can query and operate not only on data sets in local databases but also on open data sets reached through URLs over the network [16].
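As a small illustration of the triple-pattern style (a generic sketch rather than a query used later in the experiments), the following SPARQL query returns the first ten S-P-O triples in the default graph:

    SELECT ?s ?p ?o
    WHERE { ?s ?p ?o }
    LIMIT 10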

2.2.3 OpenLink Virtuoso

OpenLink Virtuoso is described as a next-generation universal server and service manager [18]. It provides a platform for the development and deployment of new-generation network applications, and it can connect different databases and merge data sources together. In addition, Virtuoso supports the RDF graph model, with SPARQL as its query language. The good performance of Virtuoso has been recognised by the academic community: a great number of evaluation results show that it is one of the best open source database systems for dealing with RDF data [18].

2.3 Labeled Property Graph Database

2.3.1 Labeled Property Graph

The Labeled Property Graph (LPG) is a data structure mainly used for data storage and data querying. An LPG has the same basic vertices-and-edges structure as an RDF graph; the vertices and edges of an LPG represent nodes and relationships respectively. In an LPG, each node and each relationship has a unique ID to identify it. An RDF graph has an equivalent unique identifier, the URI, for each element of its S-P-O triple structure. However, there are several differences between an LPG and an RDF graph. In an LPG, each node or relationship additionally has a set of key-value pairs and a type, which are used together with the unique ID to describe its features. This set of key-value pairs used for feature description in an LPG is called its internal structure, and it is the most significant difference compared with an RDF graph [19].

Certainly, there are other differences between an LPG and an RDF graph [19], but the remaining differences all derive from the most important one described above. Firstly, in an RDF graph, if a relationship of a given type already connects two nodes, further relationships of the same type cannot be added between the same two nodes, because relationship instances in an RDF graph cannot be distinguished from one another. For instance, take the natural-language sentence 'Jack loves Mary'. If we put this data into an RDF graph, the structure is the one shown in Figure 7, and there is no problem in representing the sentence with such a graph. However, if 'Jack loves Mary' is asserted another two times, a problem arises. Using SPARQL to create this relationship two more times appears to succeed, but a query that counts the number of times 'Jack loves Mary' still returns just 1. This undesirable behaviour occurs because multiple relationships of the same type cannot be identified uniquely in an RDF graph. Fortunately, the difficulty is resolved with an LPG. If we perform the same creation with Cypher in an LPG, the data takes the structure shown in Figure 8, and when the same counting query is run the result is three, which matches the data we actually created and stored in the LPG. The key point is that an LPG has an internal structure, a set of key-value pairs, that identifies each relationship uniquely [19].

Figure 7: An instance of RDF graph

Figure 8: An instance of LPG
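A minimal Cypher sketch of this example follows; the date property attached to each relationship is illustrative and is not part of the original example:

    CREATE (j:Person {name: 'Jack'})
    CREATE (m:Person {name: 'Mary'})
    CREATE (j)-[:LOVES {at: '2018-01-01'}]->(m)
    CREATE (j)-[:LOVES {at: '2018-02-14'}]->(m)
    CREATE (j)-[:LOVES {at: '2018-03-08'}]->(m)

    // counting the LOVES relationships now returns 3, because each
    // relationship instance is identified individually in an LPG
    MATCH (:Person {name: 'Jack'})-[r:LOVES]->(:Person {name: 'Mary'})
    RETURN count(r)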

Next comes the second difference between an RDF graph and an LPG. In an RDF graph, a particular connection between two vertices cannot be qualified or restricted. For instance, take the natural-language sentence 'Beijing is one thousand kilometres away from Shanghai and the cost of the trip between these two cities is 150 pounds.' If we put this data into an RDF graph, a problem appears when we try to attach the cost and distance to the connection between 'Beijing' and 'Shanghai'. The reason is that each predicate such as connection or cost is generalised and global for the whole graph in RDF, which means that if more cities are inserted into the graph we cannot match a distance or cost with the corresponding pair of cities. The problem can be worked around by introducing an intermediate vertex [19]; however, with too many redundant vertices added, the complexity and difficulty of querying rise dramatically and the performance gets worse. Fortunately, the problem can be solved directly with an LPG, as shown in Figure 9.

Figure 9: An instance in LPG

The next difference is that in an RDF graph the same 'Subject' and 'Predicate' can have several different 'Objects', so a multi-valued attribute can be represented simply with S-P-O triples, whereas in an LPG arrays need to be introduced if the equivalent effect is required [19].

Finally, in terms of query language, an RDF graph is queried with SPARQL, while an LPG (in Neo4j) is queried with Cypher.

2.3.2 Neo4j

Neo4j is a popular graph database based on the LPG [20]. At present it is one of the most widely used open source graph frameworks based on Java [21]. It has three basic editions: a Desktop version, a free Community edition and a paid Enterprise edition. The Desktop version has an embedded web browser which makes it convenient to see the results as an LPG after executing Cypher statements; with the Community edition, the corresponding results of a Cypher statement can be viewed through localhost in a web browser. Neo4j offers a variety of APIs that can be used for data creation and queries and are easy to use. In server mode, the server allows clients to send HTTP requests in JSON format via a REST API that supports Cypher queries, so clients on any platform can access a Neo4j server [22].

Cypher is designed to be a query language with good human readability, which makes it convenient for developers and operations professionals to write queries in Neo4j [23]. Cypher is a declarative language: it expresses what data to retrieve from a large data set rather than how to retrieve it, leaving the database to optimise the queries [22][23]. Cypher also supports parameterised queries, which makes queries easier to reuse without building query strings by hand.
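As a brief illustration (the label, relationship type and property names here are hypothetical), a parameterised Cypher query looks like the following, where the value of $name is supplied separately by the client or driver rather than being concatenated into the query string:

    MATCH (p:Person {name: $name})-[:LOVES]->(other:Person)
    RETURN other.name

In the Neo4j browser, for example, the parameter can be supplied beforehand with a command such as :param name => "Jack".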

2.4 Open source data sets available in RDF

2.4.1 DBpedia

DBpedia is an open source data set and a notable application of the semantic web: it extracts structured data from Wikipedia to improve search capabilities, and it allows external data sets to link to Wikipedia. DBpedia is one of the largest knowledge ontologies in the world, applied in multiple domains, and is also part of the Linked Open Data cloud. DBpedia [24] has been regarded as the best semantic web application service.

12 2.5 Introduction of two formats of RDF data sets

2.5.1 Turtle

The Terse RDF Triple Language (Turtle) is a serialisation format for RDF [25]. Turtle is widely regarded as comparatively human readable. It supports namespace prefixes, lists and shorthand notations for strings. An example Turtle segment is shown in Figure 10.

Figure 10: A segment of Turtle syntax
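To give a flavour of the syntax (this snippet is illustrative and is not taken from the DBpedia data sets used later), a Turtle segment with namespace prefixes and the semicolon shorthand for a repeated subject could look like:

    @prefix dbo: <http://dbpedia.org/ontology/> .
    @prefix ex:  <http://example.org/> .

    ex:ForrestGump dbo:director ex:RobertZemeckis ;
                   dbo:starring ex:TomHanks .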

2.5.2 N-triples

N-Triples is another serialisation format for RDF data sets. The N-Triples format is written line by line [25]. For instance, Figure 11 illustrates a segment in N-Triples syntax; as can be seen in the segment, each line holds one S-P-O triple.

Figure 11: A segment of N-triples format
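The same two illustrative triples from the Turtle example above, written as N-Triples, would each occupy one line, with every URI written out in full and each line terminated by a full stop:

    <http://example.org/ForrestGump> <http://dbpedia.org/ontology/director> <http://example.org/RobertZemeckis> .
    <http://example.org/ForrestGump> <http://dbpedia.org/ontology/starring> <http://example.org/TomHanks> .

Because every syntactically valid N-Triples document is also valid Turtle, files written in N-Triples can normally be parsed by Turtle parsers, which becomes relevant to the loading experiments in Chapter 3.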

Chapter 3

Research Methodology

Chapter 3 describes the research methodology used in our project. We identified a suitable open source data set (the DBpedia data set), loaded the data into both Virtuoso and Neo4j, and measured the loading performance. We then measured the query performance on each database and compared the results.

3.1 Data sets preparation from DBpedia

The data sets in our project were obtained from DBpedia, which, as an open source data set available in RDF, provides a great deal of data. We downloaded the file 'mappingbased_objects_wkd_uris_sv.ttl.bz2' (see Appendix B) from the official DBpedia website and unzipped it in a suitable location. The original data set is written in the N-Triples structure; a short segment of it is shown in Figure 12. Note that the file name has the suffix '.ttl', which suggests the file type is 'Turtle'. This is explained by the fact that the 'Turtle' format is compatible with the 'N-Triples' format to a large extent [26], which is confirmed in the following experiments when we load the RDF data into both the Neo4j and Virtuoso databases.


Figure 12: A segment of the original RDF data set written in N-Triples type

With respect to the original file itself, it contains 5,232,657 lines of real data, excluding the first and last lines, which describe the file. Each line contains one S-P-O triple, as can be seen in Figure 12. In the project we planned to run experiments with different sizes of data set, so we extracted different numbers of lines from the original data set, giving thirty sub-files in total. The smallest sub-file has 100,000 lines, equivalent to 100,000 S-P-O triples, while the largest contains 3,000,000 lines and the same number of triples. The sub-file size increases in steps of one hundred thousand lines, from one hundred thousand lines up to three million lines.
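The extraction itself is straightforward on the command line; for example (a sketch, assuming the unzipped file name below and skipping the leading description line), the smallest sub-file can be produced with:

    tail -n +2 mappingbased_objects_wkd_uris_sv.ttl | head -n 100000 > rdfdata1.ttl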

3.2 Loading RDF data sets into OpenLink Virtuoso database

3.2.1 Loading RDF data sets into OpenLink Virtuoso database on Windows 10 system

There are two methods of loading data sets into the Virtuoso database, and both were tried on the Windows 10 system in this project. The first is to use the Virtuoso Conductor. Figure 13 shows the user interface of the 'Quad Store Upload' function, which is reached by clicking 'Linked Data' and then the 'Quad Store Upload' button. For example, as Figure 13 shows, we selected a file named 'rdfdata1.ttl', which is the smallest data set with 100,000 S-P-O triples. In the 'Named Graph IRI' field we simply wrote the namespace 'http://example/data'. On clicking the 'Upload' button, the selected file is imported into the Virtuoso database under the corresponding namespace.

Figure 13: ‘Quad Store Upload’ function in Virtuoso Conductor

We could then run queries on the uploaded file from the 'SPARQL' page shown in Figure 14. The namespace given in Figure 13 has to be entered into the 'Default Graph IRI' field shown in Figure 14. Figure 14 shows an example SPARQL query counting the number of S-P-O triples, together with the query result displayed after clicking the 'Execute' button.

Figure 14: SPARQL Execution in Virtuoso Conductor

The second way to load data sets into the Virtuoso database is through the Virtuoso Interactive SQL tool, iSQL. We used iSQL from the CMD prompt on the Windows 10 system. In the CMD prompt we first changed to the location of the Virtuoso 'bin' directory and then entered the command 'isql' to start the Virtuoso Interactive SQL tool; this requires a Virtuoso service to be running. In iSQL we used the 'Bulk Loader' function [27] for loading the RDF data sets. Bulk loading is divided into two parts: registering files in the table 'DB.DBA.LOAD_LIST' and then loading the files listed in that table into the Virtuoso database. In the first part, we used the function 'ld_dir ('/path/to/files', 'File Types', 'Graph name');' to register the files in the table, which can be regarded as a waiting list; at this stage the files are not yet loaded into the Virtuoso database. In the second part, we performed the bulk load with the function 'rdf_loader_run();'. After this function has been executed in iSQL, the data sets are actually loaded into the Virtuoso database. This method can be used for loading a single file or many files, which is very convenient.

In the project we chose to use bulk loading for importing the RDF data sets into the database, since it is convenient from the command line. In the function 'ld_dir ('/path/to/files', 'File Types', 'Graph name');', the first parameter is the path to the files we want to load. This path has to be allowed in the Virtuoso configuration file, called 'virtuoso.ini', which is located in the directory 'virtuoso/database'. Figure 15 shows the line that needs to be modified for the file path. We created a folder called 'rdflist' on our computer at 'C:\Users\yaojr\Desktop\rdflist'. In the configuration file, the entry added to 'DirsAllowed' specifies the path that the bulk loading functions are permitted to access. After modifying it with the correct Windows path, we saved the configuration file and restarted the Virtuoso service. The command 'select cfg_item_value (virtuoso_ini_path (), 'Parameters','DirsAllowed');' can be used to check whether the path in the configuration file is in effect. Once the path was configured successfully, we loaded the first fifteen data sets of increasing size into the Virtuoso database, replacing the data set in the 'rdflist' folder on 'C:\Users\yaojr\Desktop' each time.


Figure 15: ‘DirsAllowed’ parameter in Virtuoso configuration file on Windows 10 system
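For reference, after the modification the relevant entry in 'virtuoso.ini' looked roughly like the following (a sketch; the other items in the list are installation defaults and may differ between versions):

    [Parameters]
    ...
    DirsAllowed = ., ../vad, C:\Users\yaojr\Desktop\rdflist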

3.2.2 Loading RDF data sets into OpenLink Virtuoso database on Ubuntu 16.04 system

In the project we also chose the Bulk Loader for loading the RDF data sets into the Virtuoso database on the Ubuntu 16.04 system, because our Ubuntu system is a remote virtual instance on which it is not convenient to open the Virtuoso Conductor from 'localhost'. The operation is similar to that on the Windows 10 system. The most important step is to choose a location as the source of the files; this was set to '/home/ubuntu'. We then uploaded the thirty RDF data sets one by one to that location through WinSCP, a graphical tool similar to the 'scp' command on a Linux system. The path allowed for the function 'ld_dir ('/path/to/files', 'File Types', 'Graph name');' is set in the configuration file 'virtuoso.ini', which is in the directory '/etc/virtuoso-opensource-6.1'. Figure 16 shows the specific modification of the 'DirsAllowed' parameter in 'virtuoso.ini'. After changing 'DirsAllowed' and saving the file, the running Virtuoso service has to be stopped and started again with the commands 'sudo /etc/init.d/virtuoso-opensource-6.1 stop' and 'sudo /etc/init.d/virtuoso-opensource-6.1 start'. Then we entered iSQL with the statement '/usr/bin/isql-vt' and loaded the thirty RDF data sets into the Virtuoso database in turn.

Figure 16: ‘DirsAllowed’ parameter in Virtuoso configuration file on Ubuntu 16.04 system[28]

3.3 Loading RDF data sets into Neo4j

3.3.1 Loading RDF data sets into Neo4j database on Windows 10 system

In the project we used the transformation plugin written by J. Barrasa [29]; how the plugin works is described in Chapter 4. On the Windows 10 system we used Neo4j Desktop and operated on the data in the Neo4j browser. The plugin provides several functions for different purposes; we used the 'semantics.importRDF("file:///...","DataType", {shortenUrls: true})' function to load the thirty RDF data sets. For instance, the first parameter of the function was set to the specific path 'file:///C:\\Users\\yaojr\\Desktop\\rdflist\\rdfdata1.ttl', where 'rdfdata1.ttl' is the RDF data set with 100,000 S-P-O triples. The second parameter was set to 'Turtle' according to the type of the RDF data sets we loaded; we could also have set the type to 'N-Triples', because 'Turtle' is largely compatible with 'N-Triples' as mentioned in the previous section. Both types can therefore be recognised by the function, but the suffix of the file has to match the type given in the second parameter: for example, 'rdfdata1.nt' should be used with the parameter 'N-Triples' and 'rdfdata1.ttl' with the parameter 'Turtle'.

After setting all the parameters of the function, we called it to load the RDF data set 'rdfdata1.ttl' with the statement shown in Figure 17. Figure 18 shows the loading status and results after executing that statement. To load the fifteen RDF data sets into the Neo4j database, only the file name in the statement needs to be changed each time.

Figure 17: An example of loading function statement in Neo4j browser on Windows 10 system
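Concretely, the statement shown in Figure 17 has the following form (reconstructed from the parameters described above; note the doubled backslashes required in the Windows path):

    CALL semantics.importRDF("file:///C:\\Users\\yaojr\\Desktop\\rdflist\\rdfdata1.ttl", "Turtle", {shortenUrls: true});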


Figure 18: Loading status and results after executing the statement in Figure 17.

3.3.2 Loading RDF data sets into Neo4j database on Ubuntu 16.04 system

We used the same transformation plugin for loading the RDF data sets into the Neo4j database on the Ubuntu 16.04 system. As on the Windows 10 system, we needed to adjust the parameters of the function semantics.importRDF("file:///...","DataType", {shortenUrls: true}). We uploaded the RDF data sets from the Windows 10 system to the directory '/home/ubuntu' on the Ubuntu system through the WinSCP tool. The full statement called in Cypher-Shell was therefore 'CALL semantics.importRDF("file:///home/ubuntu/rdfdata1.ttl","Turtle", {shortenUrls: true});', which loaded the data set 'rdfdata1.ttl' into the Neo4j database. The biggest difference in the use of the semantics.importRDF function between the two operating systems is the file path, that is, the first parameter of the function. To load the thirty RDF data sets one by one, only the file name in the first parameter of the loading function needs to be changed.

3.4 Measuring loading times on Windows 10 system

In the experiment we measured the loading times for different sizes of RDF data set, in order to find the trend and pattern of the loading time as the number of S-P-O triples increases. On the Windows 10 system we tested the RDF data sets whose size increases in steps of one hundred thousand lines, from one hundred thousand lines up to one and a half million lines.

3.4.1 Measuring loading times of Virtuoso database on Windows 10 system

We measured the loading times of the Bulk Loader in the Virtuoso database. As mentioned in the previous section, the 'DB.DBA.LOAD_LIST' table records the files together with their status. For example, we imported the RDF data set with 100,000 S-P-O triples into the empty database with the statements 'ld_dir ('C:\\Users\\yaojr\\Desktop\\rdflist', '*.ttl', 'http://dbpedia.org');' and 'rdf_loader_run();'. Next, we used the statement 'select * from DB.DBA.LOAD_LIST;' to inspect the table; the result of executing this statement is shown in Figure 19.

Figure 19: ‘DB.DBA.LOAD_LIST’ table after ld_dir() function

As Figure 19 shows, a value of '2' for 'll_state' indicates that the loading of the data set is completely finished. The columns 'll_started' and 'll_done' record the timestamps at which loading of a file started and completed. Therefore, the difference between 'll_started' and 'll_done' is the loading time resulting from executing the statement 'rdf_loader_run();'. The function we used to calculate the loading time is shown in Table 1; the graph-name placeholder in its 'like' clause should be replaced by the actual graph name used in the ld_dir() function. As Figure 19 shows, the graph name here is 'http://dbpedia.org'. The resulting loading time for the current file is shown in Figure 20: the value '2' of the 'delta' column is the loading time measured in seconds.

Measuring function for loading time in Virtuoso database

select min(ll_started) as start, max(ll_done) as finish, datediff('second', min(ll_started), max(ll_done)) as delta from load_list where ll_graph like '';

Table 1: Measuring function for loading time in Virtuoso database

Figure 20: Results of loading time of RDF data set ‘rdfdata1.ttl’

In order to obtain relatively accurate results, we measured the loading time for each RDF data set ten times and calculated the average. Before each measurement of an RDF data set we deleted all data in the current database to obtain a clean test environment, because the occupied and remaining memory space would otherwise influence the measured performance. We therefore used the statements shown in Table 2 for each measurement. In Table 2, line 1 deletes the current loading list to avoid conflicts with the next loading; line 2 registers the data set in the loading list and line 4 loads the data into the database. The statements in lines 3 and 5 check the status of the data sets in the loading list: the former confirms that the current data set has been added to the loading list, while the latter confirms that it has been completely loaded into the database. Line 6 is then executed to measure the loading time of the current loading process, and line 7 deletes all data in the database to leave a clean environment for the next measurement. These seven statements in Table 2 were executed for each measurement of an RDF data set. In the experiment we measured fifteen RDF data sets with sizes from 100,000 lines to one and a half million lines; each data set was measured ten times and the results are shown in Chapter 5.

Statements executed for measuring loading time in each measurement on Windows 10 system

1 delete from db.dba.load_list;

2 ld_dir ('C:\\Users\\yaojr\\Desktop\\rdflist', '*.ttl', 'http://dbpedia.org');

3 select * from DB.DBA.LOAD_LIST;

4 rdf_loader_run();

5 select * from DB.DBA.LOAD_LIST;

6 select min(ll_started) as start, max(ll_done) as finish, datediff('second', min(ll_started), max(ll_done)) as delta from load_list where ll_graph like 'http://dbpedia.org';

7 SPARQL CLEAR GRAPH <http://dbpedia.org>;

Table 2: Statements executed for measuring loading time in each measurement on Windows 10 system

3.4.2 Measuring loading times of Neo4j database on Windows 10 system

The approach for measuring the loading time of the RDF data sets in the Neo4j database is simpler and more intuitive. On the Windows 10 system the measurement was carried out in the Neo4j browser provided by Neo4j Desktop. For example, we imported the RDF data set 'rdfdata1.ttl' into the Neo4j database using the import function of the transformation plugin. As Figure 21 shows, the results are displayed below the execution window; the area marked with a red rectangle shows the time spent executing the import statement. We used the execution time of the statement directly because it is difficult to find a way to calculate the loading time more precisely. In this example the loading time was 15005 milliseconds, which is roughly 15.01 seconds. After each measurement we deleted all the data in the database with the Cypher statement 'MATCH (n) DETACH DELETE n', to guarantee a clean environment for the next loading. With this approach the remaining fourteen RDF data sets were tested one by one, ten times each, and the results are shown in Chapter 5.

Figure 21: Loading time result for loading RDF data set ‘rdfdata1.ttl’

3.5 Measuring loading times on Ubuntu 16.04 system

The loading time measurements on Ubuntu covered all thirty RDF data sets, with sizes from one hundred thousand S-P-O triples to three million triples. The maximum data set size on the Ubuntu system is double that on the Windows system because of the difference in hardware configuration between the two test systems.

3.5.1 Measuring loading times of Virtuoso database on Ubuntu 16.04 system

The method for measuring the loading times of the Virtuoso database on the Ubuntu 16.04 system is the same as on the Windows 10 system: we used the difference between the start and completion timestamps in the 'LOAD_LIST' table to calculate the loading time of the RDF data set currently listed in the table. The only difference in the executed statements on the Ubuntu system, compared with the Windows system, is the loading path parameter, which is shown in line 2 of Table 3. We then measured all thirty RDF data sets in the Virtuoso database, ten times each, and the corresponding results are shown in Chapter 5.

Statements executed for measuring loading time in each measurement on Ubuntu 16.04 system

1 delete from db.dba.load_list;

2 ld_dir ('/home/ubuntu', '*.ttl', 'http://dbpedia.org');

3 select * from DB.DBA.LOAD_LIST;

4 rdf_loader_run();

5 select * from DB.DBA.LOAD_LIST;

6 select min(ll_started) as start, max(ll_done) as finish, datediff('second', min(ll_started), max(ll_done)) as delta from load_list where ll_graph like 'http://dbpedia.org';

7 SPARQL CLEAR GRAPH <http://dbpedia.org>;

Table 3: Statements executed for measuring loading time in each measurement on Ubuntu 16.04 system

3.5.2 Measuring loading times of Neo4j database on Ubuntu 16.04 system

On the Ubuntu 16.04 system we took the same approach as on the Windows 10 system. The only difference is that we used the Neo4j browser on Windows 10, whereas we used Cypher-Shell on Ubuntu 16.04, because our Ubuntu system is a remote virtual instance on which it is difficult to open the Neo4j web browser from localhost. The difference between the Neo4j browser and Cypher-Shell is that the former has a visual user interface for displaying graphs, while the latter is command-line based and shows the results of a statement as text and tables.

3.6 Measuring query times on Windows 10 system

We planned to measure the time taken to count the number of S-P-O triples in the Virtuoso database and the number of distinct node labels in the Neo4j database, in order to find the trends and characteristics of these queries as the size of the RDF data sets increases. On the Windows system we tested fifteen RDF data sets, the largest of which contains 1.5 million S-P-O triples.

3.6.1 Measuring query times of Virtuoso database on Windows 10 system

For the Virtuoso database, we executed the SPARQL query statements in iSQL. In iSQL, every SPARQL query statement needs the prefix 'SPARQL', whereas this prefix is not required in the Virtuoso Conductor. The query statement for counting the number of triples is shown in Figure 22; the query result for each RDF data set should equal the size of the corresponding data set. In the experiment we ran this query ten times for each RDF data set and the results are shown in Chapter 5.

Figure 22: Query statement for counting the number of distinct ‘Subject’ in RDF triples
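A counting query of this kind has roughly the following shape (a sketch; the exact statement used in the experiments is the one in Figure 22, and the leading 'SPARQL' keyword is the iSQL prefix mentioned above):

    SPARQL SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o };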

3.6.2 Measuring query times of Neo4j database on Windows 10 system

In the Neo4j database, the queries were executed in the Neo4j browser on the Windows 10 system. We used the Cypher query shown in Figure 23 to count the number of distinct node labels; the query results are listed in Appendix C. In our experiment we ran this query ten times for each RDF data set and the timing results are shown in Chapter 5.

Figure 23: Cypher query for counting the number of distinct labels of nodes in RDF data sets
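A Cypher query of this kind has roughly the following shape (a sketch; the exact statement used is the one shown in Figure 23):

    MATCH (n) UNWIND labels(n) AS label RETURN count(DISTINCT label) AS distinctLabels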

3.7 Measuring query times on Ubuntu 16.04 system

On the Ubuntu 16.04 system we tested thirty RDF data sets, the largest of which contains 3 million lines.

3.7.1 Measuring query times of Virtuoso database on Ubuntu 16.04 system

We executed the same SPARQL query statement in the Virtuoso database on the Ubuntu system. The results for the thirty data sets are shown in Chapter 5.

3.7.2 Measuring query times of Neo4j database on Ubuntu 16.04 system

The same Cypher query statement was tested in the Neo4j database on the Ubuntu system, and the results for all data sets are shown in Chapter 5.

Chapter 4

Experimental Work Carried Out

4.1 Hardware and Software configurations of test systems

4.1.1 Windows 10 system

One of the test environments is a Windows 10 system on a Lenovo ThinkPad T450. In more detail, the operating system of this test system is 64-bit Windows 10 Home Edition, build 17134. The CPU of this computer is an Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz (4 logical CPUs, ~2.4GHz). The machine has 8192MB of RAM and a 1TB disk.

The test for connecting to Neo4j is implemented in Java, version 1.8.0_74. The Java(TM) SE Runtime Environment is 'build 1.8.0_74-b02' and the Java HotSpot(TM) 64-Bit Server VM is 'build 25.74-b02, mixed mode'. In addition, the process of loading the RDF data sets into Neo4j was tested using a driver obtained through Maven.

4.1.2 Ubuntu 16.04 system

The other test system for the experiments is a virtual machine running Ubuntu 16.04, created using the Eleanor Cloud service. In more detail, the two test instances for Virtuoso and Neo4j are created with the same software and hardware configurations to guarantee an equal environment. Each instance is configured with 16GB of RAM, 8 VCPUs and a 160GB disk.

The instance for the Virtuoso tests is called 'testForVirtuoso', while the instance for the Neo4j tests is named 'testForNeo4j'; both are shown in Figure 24. As Neo4j is implemented in Java, a Java Runtime Environment (JRE) is required. The Java version used on the 'testForNeo4j' instance is '1.8.0_181'. In addition, the OpenJDK Runtime Environment is 'build 1.8.0_181-8u181-b13-0ubuntu0.16.04.1-b13' and the OpenJDK 64-Bit Server VM is 'build 25.181-b13, mixed mode'.

Figure 24: Two Eleanor Cloud instances for test systems

4.2 Virtuoso Installation

4.2.1 Virtuoso Installation on Windows system

In the experiment we downloaded OpenLink Virtuoso version 7.2.4 for the Windows 10 system, which was the latest release at the time. The zip file, named 'virtuoso-opensource-win-x64-20160425' and listed in Appendix A, was downloaded from the official OpenLink website, and after decompression the package was moved to the 'Program Files' directory. The layout of the Virtuoso package is shown in Figure 25. Following the installation instructions, the Microsoft Visual C++ 2012 Redistributable Package was downloaded to satisfy the prerequisite environment, and after this prerequisite was set up the Windows 10 machine needed to be restarted for it to take effect. The system path environment variable also had to be set. After all of this preparation was configured successfully, we verified whether Virtuoso had been installed on the system by typing the command 'virtuoso-t -?' at the CMD prompt; the usage instructions printed back in the CMD prompt showed that the installation was successful.

In terms of the specific steps to start the service: first we navigated from the CMD prompt to the location of the Virtuoso installation, then changed to the 'database' directory shown in Figure 25 and typed 'virtuoso-t +service create +instance "New Instance Name" +configfile virtuoso.ini' to create a brand-new Windows service, replacing the content of "New Instance Name" accordingly.

Figure 25: Main files and directories listed in Virtuoso Home Directory on Windows 10 system

Next, we used the command line 'virtuoso-t +instance "Instance Name" +service start' to start the service we had just created, giving the correct service name. The remaining commands shown in Figure 26 can be used to list all of the services and to start, stop or delete any of them.

Figure 26: Commands to have operations on Windows service for Virtuoso

In addition, Virtuoso has a visual administration interface, the Conductor, which can be visited from localhost at the address shown in Figure 27.

Figure 27: Virtuoso conductor visited from localhost

4.2.2 Virtuoso Installation on Ubuntu 16.04

The Ubuntu system for the Virtuoso tests was created on the Eleanor Cloud. To connect to it, we first logged into the virtual machine by typing 'ssh -i key_file2 [email protected]' at the CMD prompt. On the system, we used the commands shown in Figure 28 to install Virtuoso step by step. The version of the Virtuoso database is 'virtuoso-opensource-6.1', and the initial username and password for the database are 'dba' and 'dba' respectively. To start the service we used the command 'sudo /etc/init.d/virtuoso-opensource-6.1 start', and then checked whether the service was running with 'sudo /etc/init.d/virtuoso-opensource-6.1 status'. Once the 'Active' field reported 'active (running)', the service was confirmed to be working. By typing the command '/usr/bin/isql-vt', we entered the isql binary for further operations on the Virtuoso database.

Figure 28: Commands for Virtuoso installation on Ubuntu 16.04 system
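On Ubuntu 16.04 the package is available from the standard repositories, so the installation commands in Figure 28 essentially amount to the following (a sketch; the exact commands used are those shown in Figure 28):

    sudo apt-get update
    sudo apt-get install virtuoso-opensource    # installs the 6.1 server on Ubuntu 16.04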

4.3 Neo4j Installation

4.3.1 Neo4j installation on Windows 10 system

In the experiments we chose the Neo4j Desktop version as the database under test on the Windows 10 system, because the Desktop version has an intuitive user interface, which is more convenient than the Community edition, for which the CMD prompt is required.

We downloaded the installer 'neo4j-desktop-offline-1.0.15-setup' from the official Neo4j website (see Appendix A), ran it and installed the Neo4j Desktop version easily. The UI of the database is shown in Figure 29. After creating a database for the tests, we started the service by clicking the 'Start' button and then entered the Neo4j browser.


Figure 29: UI of Neo4j Desktop version

4.3.2 Neo4j installation on Ubuntu 16.04 system

The Ubuntu system for the Neo4j tests was created on the Eleanor Cloud. We logged into the virtual machine by typing 'ssh -i key_file [email protected]' at the CMD prompt. On the system, we used the commands shown in Figure 30 to install Neo4j step by step. After restarting the Neo4j service, we could enter the Neo4j database with the default username and password; for security, we changed the password immediately.

Figure 30: Commands used in Ubuntu system for Neo4j installation
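At the time, the usual route was to install Neo4j from its Debian repository; a sketch of that procedure (the repository URLs shown here were the ones in common use in 2018 and may have changed since) is:

    wget -O - https://debian.neo4j.org/neotechnology.gpg.key | sudo apt-key add -
    echo 'deb https://debian.neo4j.org/repo stable/' | sudo tee /etc/apt/sources.list.d/neo4j.list
    sudo apt-get update
    sudo apt-get install neo4j
    sudo service neo4j restart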

4.4 Transformation process from RDF graph to LPG

As mentioned in Chapter 2, Neo4j is a Labeled Property Graph database, while RDF data is based on the RDF graph. Although the LPG and the RDF graph are both graphs with a similar structure of vertices and edges, there are still some differences between them in terms of their specific structure. Therefore, a transformation between these two graph types is required when RDF data needs to be loaded into Neo4j. The transformation method used here was written by J. Barrasa [29], who published it on his blog; in this project, the process of loading the RDF data sets into Neo4j is implemented based on his method.

As J. Barrasa describes, the first step is to map the components of RDF S-P-O triples onto the node-and-relationship structure of an LPG. In an RDF graph, every ‘Subject’ and ‘Predicate’ is a resource identified by a unique URI. An ‘Object’, however, can be a resource, a literal or a blank node. A literal cannot be the ‘Subject’ of an S-P-O triple, which means a literal cannot act as a root node and no further nodes can extend from it.

To make the process clear, three principles are used to map the parts of an S-P-O triple onto nodes, node properties and relationships in an LPG. The first principle is that a ‘Subject’ in an RDF triple maps to a node in the LPG, as shown in Figure 31.

Figure 31: Principle 1(Basic) of Transformation from RDF graph to LPG

The other two principles decide how a ‘Predicate’ is mapped onto LPG components, depending on the type of the ‘Object’. The second principle is that a ‘Predicate’ becomes a property of a node in the LPG when the ‘Object’ in the same RDF triple is a literal; this is shown in Figure 32. The third principle covers the case where the ‘Object’ of an S-P-O triple is a resource with its own URI: as shown in Figure 33, the ‘Predicate’ is then mapped onto the relationship between the corresponding nodes in the LPG, if and only if the ‘Object’ of the current S-P-O triple is a resource.

Figure 32: Principle 2(Basic) of Transformation from RDF graph to LPG

Figure 33: Principle 3(Basic) of Transformation from RDF graph to LPG

In order to make these principles easier to understand, we introduce an example RDF graph. Figure 34 shows an RDF model from a W3C page. The ovals and rectangles in Figure 34 represent resources and literals respectively. We can express the graph in Figure 34 in natural language as: ‘There is a page with URL http://www.w3.org/Home/Lassila that is created by a staff member with ID number 85740, and the name and email of this staff member are Ora Lassila and lassila@w3.org respectively’. From this sentence we can extract three RDF triples. The first triple is composed of ‘Subject’ – ‘http://www.w3.org/Home/Lassila’, ‘Predicate’ – ‘Creator’ and ‘Object’ – ‘http://www.w3.org/staffId/85740’. The second S-P-O triple consists of ‘Subject’ – ‘http://www.w3.org/staffId/85740’, ‘Predicate’ – ‘Name’ and ‘Object’ – ‘Ora Lassila’, while ‘Subject’ – ‘http://www.w3.org/staffId/85740’, ‘Predicate’ – ‘Email’ and ‘Object’ – ‘lassila@w3.org’ form the third RDF triple in Figure 34 [31].

35

Figure 34: An example instance shown as an RDF graph, from the W3C page [31]

The representation in natural language is understandable; however, the information should be structured in a more concise and systematic way. Therefore, we express these data in an XML-based form, shown in Table 4. In Table 4, the first line expresses the first triple, where ‘w3’ represents ‘http://www.w3.org/’, ‘c’ indicates the prefix URI of ‘Creator’ that is omitted in Figure 34, and ‘w3staff’ stands for ‘http://www.w3.org/staffId/’. The second line in Table 4 represents the second triple, where ‘n’ indicates the prefix URI of ‘Name’ omitted in Figure 34, and the last line expresses the third triple, where ‘e’ indicates the prefix URI of ‘Email’ omitted in Figure 34. Next, we can transform this RDF representation into a Labelled Property Graph with the three principles described above. For the first triple shown in Table 4, using the first and the third principles, the RDF data can be converted into ‘(:Resource { uri:"w3:Lassila"})-[:'c:creator']->(:Resource { uri:"w3staff:85740"})’ in Labelled Property Graph form. For the second triple, we can use the first two principles to obtain ‘(:Resource { uri:"w3staff:85740", 'n:name': "Ora Lassila"})’. Similarly, the third triple can be converted to ‘(:Resource { uri:"w3staff:85740", 'e:email': "lassila@w3.org"})’ through principles one and two.

        Subject -> Node      Predicate -> Relationship      Object (Literal) -> Properties
    1   w3:Lassila           c:creator                      w3staff:85740
    2   w3staff:85740        n:name                         "Ora Lassila"
    3   w3staff:85740        e:email                        "lassila@w3.org"

Table 4: Representation of RDF triples based on XML type
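To make the result of these mappings concrete, the following Cypher sketch builds the corresponding labelled property graph by hand. It is only an illustration of the three principles, not the statements the plugin actually generates; the relationship type and property keys are chosen for readability rather than taken from the plugin's output.

    // Principle 1: subjects become nodes identified by their URI
    MERGE (page:Resource {uri: 'w3:Lassila'})
    MERGE (staff:Resource {uri: 'w3staff:85740'})

    // Principle 3: a predicate whose object is a resource becomes a relationship
    MERGE (page)-[:creator]->(staff)

    // Principle 2: predicates whose objects are literals become node properties
    SET staff.name = 'Ora Lassila',
        staff.email = 'lassila@w3.org';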

In addition, the method has a fourth principle as an improvement. As mentioned in previous sections, RDF can express both data and metadata. Metadata, namely rdf:type statements in RDF, describe the classes to which the corresponding data belong. Such classes in an RDF graph are equivalent to categories in an LPG, and categories correspond to Labels in a Labeled Property Graph. This principle is also helpful when thousands or millions of records share the same category and we need to delete all of them. For example, suppose we have millions of records describing different people. When we want to remove all of these people from the database, we can delete them directly by using the Label ‘Person’. Alternatively, if we do not use a Label, each record can point to an external node that represents the ‘Person’ category; however, this is an effective but not an efficient way to handle a large amount of data. Therefore, principle 4 is a necessary improvement when there is a large amount of data in the database. It is shown in Figure 35 and sketched below.

Figure 35: Principle 4(Improvement) Transformation from RDF graph to LPG
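A rough Cypher sketch of the difference follows; the label, URI and property names are illustrative only. With principle 4 an rdf:type statement becomes a label, so a whole category can be removed in one statement, whereas without it the category is just another related node.

    // With principle 4, an rdf:type triple becomes a label on the node
    MERGE (p:Resource:Person {uri: 'w3staff:85740'});

    // All members of the category can then be removed in a single statement
    MATCH (p:Person)
    DETACH DELETE p;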

Finally, in order to present the result of the transformation more intuitively, we represent it as an LPG in Figure 36. As shown in Figure 36, the two circles labelled ‘Resource’ and ‘Person’ represent two nodes in the LPG, and the connection from node ‘Resource’ to node ‘Person’ expresses the relationship ‘Creator’ between them. The contents of the braces next to node ‘Person’ are its properties, which state that its name is ‘Ora Lassila’ and its staff ID number is ‘85740’. With this, we have completed the transformation from an RDF graph to an LPG through the specific example taken from the W3C page.

Figure 36: Transformation result for the specific instance mentioned above

At this point, it is worth recalling which components of an RDF graph we have mapped onto an LPG and which special cases need to be considered. In an RDF graph, the S-P-O triple is the basic structure. As mentioned before, a ‘Subject’ must be a resource with a uniquely identified URI. A ‘Predicate’ becomes either a relationship between two nodes or a property of a node, depending on the ‘Object’ it links to. An ‘Object’ is normally either a resource or a literal. However, an ‘Object’ may also be blank; in this case it is treated as a special resource called a blank node, which is given a specific URI during the transformation so that it can be distinguished within the resulting LPG.

The process above, including the three basic principles and the additional principle for improvement, is implemented by J. Barrasa [29]. The method is written in Java as a Neo4j plugin.

The Java plugin implemented by J. Barrasa can be downloaded via the link listed in reference [30]. There are three released versions of the code. Each version has some modifications compared with the previous one, including bug fixes and added principles, and each release is suited to corresponding versions of Neo4j. We used the transformation plugin on both the Windows 10 system and the Ubuntu 16.04 system, but there are some small differences in how the plugin is used on the two operating systems.

4.4.1 How to use the transformation plugin in Neo4j on Windows 10 system

In our project, the version of Neo4j on the Windows 10 system is 3.3.4. Therefore, the project uses the ‘neosemantics-3.3.0.2.jar’ package as the transformation plugin on Windows, because it is the release suitable for Neo4j version 3.3.x.

After downloading the ‘neosemantics-3.3.0.2.jar’ package from J. Barrasa’s repository [29] on GitHub [30], we started Neo4j Desktop version 3.3.4 and put ‘neosemantics-3.3.0.2.jar’ into the ‘/plugins’ directory. That directory is easy to find from the Neo4j Desktop UI: as shown in Figure 37, we click the ‘Manage’ button at the bottom of the user interface, which leads to the page shown in Figure 38, where the ‘/plugins’ directory is listed in the ‘Open Folder’ drop-down menu. We then stopped the Neo4j service, closed Neo4j Desktop and restarted it to make sure the plugin would be loaded.


Figure 37: A part of UI of Neo4j Desktop version 3.3.4

Figure 38: A part of UI of Neo4j Desktop version 3.3.4

After restarting the Neo4j service in Neo4j Desktop, we checked whether the plugin worked normally. Before using the functions in the plugin, we first added, through the Neo4j browser, the Cypher statements listed in Figure 39. These statements define the prefix list the plugin requires, as mentioned in the description of the transformation from RDF graph to LPG in the previous section. Then, we tested the function for importing RDF data into Neo4j with the statement ‘CALL semantics.importRDF("file:///...","Turtle", {shortenUrls: true})’. At this point we did not fill in a valid path as the first parameter, because we only wanted to test whether the plugin was available. Since the statement did not return a ‘Function Not Found’ error, the plugin was confirmed to be installed correctly.

Figure 39: The prefix list that the transformation plugin requires
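Once the plugin is loaded, an actual import looks roughly like the sketch below; the file path and the follow-up count query are illustrative examples rather than the exact statements used in the experiments.

    // Import a Turtle file from the local file system (hypothetical path)
    CALL semantics.importRDF("file:///C:/data/dbpedia_100k.ttl", "Turtle", {shortenUrls: true});

    // A simple check that triples arrived: count the imported resource nodes
    MATCH (n:Resource)
    RETURN count(n) AS importedResources;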

4.4.2 How to use the transformation plugin in Neo4j on Ubuntu 16.04 system

On Ubuntu we used Release 3.4.0.1 of the transformation plugin, which was the latest version at the time. We downloaded the ‘neosemantics-3.4.0.1.zip’ package from J. Barrasa’s repository on GitHub [30], unzipped the file and installed Maven on the Ubuntu system. Next, we built the project with the command ‘mvn clean package’. The project contains some embedded tests, which are executed when the project builds successfully. After a successful build, the outputs were a pom file and a ‘target’ folder in the current directory. We moved the produced plugin files into the directory /var/lib/neo4j/plugins, similar to what we did on the Windows 10 system. Next, we restarted the service with the commands ‘sudo service neo4j stop;’ and ‘sudo service neo4j start;’ to guarantee that the plugin would be loaded. Because our Ubuntu system is a remote virtual instance, it would be too slow, if possible at all, to access Neo4j via a GUI browser. Therefore, we relied entirely on Cypher Shell for data operations in the database, which differs from the Neo4j browser used with Neo4j Desktop on the Windows 10 system.
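A plausible command sequence for this build-and-install step is sketched below. The directory name is illustrative, and in a typical Maven build the plugin jar is produced under target/, so copying that jar is usually sufficient.

    # Build the plugin from the unzipped source (directory name is illustrative)
    cd neosemantics-3.4.0.1
    mvn clean package

    # Copy the built jar into Neo4j's plugin directory and restart the service
    sudo cp target/neosemantics-*.jar /var/lib/neo4j/plugins/
    sudo service neo4j stop
    sudo service neo4j start

    # Enter Cypher Shell (default credentials neo4j/neo4j, changed on first login)
    cypher-shell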

We used the command ‘cypher-shell’ to enter Cypher Shell on the Ubuntu system, where we were required to input the username and matching password. The initial username and password were both ‘neo4j’, and we changed them immediately for data security. Next, we checked whether the transformation plugin worked by calling the importer function with the statement ‘CALL semantics.importRDF("file:///...","Turtle", {shortenUrls: true});’. We did not supply a correct path for the first parameter, because we only wanted to test the availability of the plugin at that time; the specific usage is described where we load data into the database. Since Cypher Shell did not respond that no such function exists, the plugin was confirmed to work normally.

Chapter 5

Results and Analysis

5.1 Loading time of two database systems on Windows 10 system

5.1.1 Loading time of Virtuoso on Windows 10 system

Figure 40 illustrates the average loading times of the Virtuoso database on the Windows 10 system, measured in seconds. Each average for an RDF data set was calculated from ten measurements; the individual results are shown in Appendix C. As can be seen in Figure 40, the average loading time for the RDF data set with 100,000 triples is approximately 1 second, while the data set with 1.5 million triples takes about 12 seconds, which is over ten times the result for the 100,000-triple data set. The average loading time rises steadily as the size of the RDF data sets increases, although there are some fluctuations along the trend. These slight fluctuations may be caused by background processes on the Windows 10 system.


Figure 40: Average loading time of Virtuoso on Windows system

5.1.2 Loading time of Neo4j on Windows 10 system

Figure 41 shows the average loading time of the Neo4j database on the Windows 10 system. The original measurements are in milliseconds, but we converted them to seconds for convenience when comparing Virtuoso and Neo4j in the next section.

As can be seen in Figure 41, the RDF data set with 100,000 triples has the minimum average loading time of approximately 6 seconds, while the data set with the maximum size of 1.5 million triples reaches a peak of nearly 153 seconds, roughly 25 times the result for the smallest RDF data set. The average loading time shows a clear increasing trend as the size of the RDF data sets rises.


Figure 41: Average loading time of Neo4j on Windows system

5.1.3 Comparison of loading time of both database systems on Windows system

Figure 42 shows the average loading times of Virtuoso and Neo4j on the Windows system. As we can see in the graph, the times measured for Neo4j rise sharply, while the results for Virtuoso grow relatively steadily in comparison. At the beginning, the difference in loading time between Virtuoso and Neo4j is small: for the RDF data set with 100k triples, both are in the range from 0 to 10 seconds. However, the difference between the two database systems increases gradually as the size of the RDF data sets increases. As can be seen in the graph, the time for the data set with 500,000 triples in Virtuoso is about 4 seconds, still in the same range as the 100k data set, while the same data set takes approximately 45 seconds in Neo4j, nearly eight times the time Neo4j needs for the 100k data set. Thus, at the 500k-triple point, the time measured in Neo4j is nearly ten times that in Virtuoso, and for the data sets with 1 million and 1.5 million triples the Neo4j time reaches over ten times the corresponding Virtuoso time. The reason for this large difference in loading time is that Neo4j needs a transformation process to convert the data from RDF to LPG, while Virtuoso, as an RDF-type graph database, can load RDF data sets directly without any extra steps.


Figure 42: Average loading time of Virtuoso and Neo4j on Windows system

However, the growth in Neo4j between the data sets with 1.2 million and 1.5 million triples gradually levels off, which may be caused by the limitations of the hardware configuration of the Windows 10 system.

5.2 Loading time of two database systems on Ubuntu system

5.2.1 Loading time of Virtuoso on Ubuntu system

Figure 43 illustrates the loading times measured in the Virtuoso database on the Ubuntu system. As can be seen in the graph, the RDF data set with 100,000 triples takes 3.4 seconds on average, and the time keeps increasing to a peak of 106.6 seconds for the RDF data set containing 3 million triples. The maximum average loading time, for the data set with the most triples, is over 30 times the minimum average loading time, for the data set with the fewest triples. The increase in average loading time for Virtuoso on the Ubuntu system is relatively even.


Figure 43: Average loading time of Virtuoso on Ubuntu system

5.2.2 Loading time of Neo4j on Ubuntu system

Figure 44 and Figure 45 show the average loading times measured in Neo4j on the Ubuntu 16.04 system. In this experiment, we planned to measure all thirty RDF data sets, from 100,000 triples to 3 million triples. Figure 44 shows the results from the data set with 100,000 triples to the one with 1.5 million triples. As can be seen in Figure 44, the average loading time increases gradually over this range. The minimum loading time is approximately 5 seconds, while the maximum average loading time is about 75 seconds, fifteen times the minimum. The upward trend of the average loading time is steady, without notable fluctuations.


Figure 44: Average loading time of Neo4j on Ubuntu system - (100k to 1.5m)

However, the loading time measured for the RDF data set containing 1.6 million triples is extremely high compared with the time for the data set with 1.5 million triples.

As we can see in Figure 45, the data set with 1.6 million triples takes approximately 750 seconds, equivalent to about 12.5 minutes for a single measurement, while the average loading time for the data set with 1.5 million triples is only about one tenth of that. We continued the measurements to verify whether this phenomenon was a special case. We tested the data sets with 1.7 million and 1.8 million triples, but the average loading time remained high, at a level similar to the result for 1.6 million triples. As can be seen in Figure 45, the average loading time reaches about 950 seconds for the data set with 1.8 million triples. Therefore, we stopped measuring after the data set with 1.8 million triples, because larger data sets would be too time-consuming. The sudden, extreme increase after the data set with 1.5 million triples may be caused by Java itself: Neo4j is built on Java and the transformation plugin is written in Java, so when a data set with too many triples is loaded, the running time is affected by the memory allocated to the Java heap.


Figure 45: Average loading time of Neo4j on Ubuntu system - (after 1.5m)
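If the Java heap is indeed the bottleneck, one mitigation would be to raise the heap limits in neo4j.conf before loading large data sets. The values below are only a hypothetical starting point, not settings we tuned in these experiments.

    # neo4j.conf (Neo4j 3.x): enlarge the JVM heap used by the database and its plugins
    dbms.memory.heap.initial_size=2g
    dbms.memory.heap.max_size=4g

    # Optionally also adjust the page cache used for the store files
    dbms.memory.pagecache.size=2g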

5.2.3 Comparison of loading time of both database systems on Ubuntu system

Figure 46 shows the loading times measured in Virtuoso and Neo4j on the Ubuntu 16.04 system. Given the results in Section 5.2.2, we only plot the Neo4j results up to the data set with 1.5 million triples in Figure 46. As can be seen in the graph, the average loading times of both Virtuoso and Neo4j trend upwards as the size of the RDF data sets increases, but the two trends differ. For the smallest data set, the average loading time of the 100k-triple data set in Neo4j is about 5 seconds, only 2 seconds more than the corresponding result in Virtuoso. However, as the size of the data sets increases, the gap between Virtuoso and Neo4j widens. At the data set containing 1 million triples, the loading time measured in Neo4j is approximately 1.5 times the corresponding time in Virtuoso, and the difference reaches about 14 seconds, roughly 7 times the difference observed at the smallest data set. This growing difference is caused by the transformation process in Neo4j: as the size of the RDF data sets increases, the time spent converting from RDF to LPG in Neo4j grows, while RDF data sets can be loaded into Virtuoso directly.


Figure 46: Average loading time of Virtuoso and Neo4j on Ubuntu system

5.3 Comparison of loading time on Windows system and Ubuntu system for Virtuoso

Figure 47 shows the average loading times of the Virtuoso database on the two operating systems. As can be seen in Figure 47, the loading times on both operating systems trend upwards, but the growth on the Ubuntu system is larger than that on the Windows system. On Windows, background applications may influence loading, whereas the background environment on the Ubuntu system is simpler.


Figure 47: Average loading time in Virtuoso on Windows system and Ubuntu system

5.4 Comparison of loading time on Windows system and Ubuntu system for Neo4j

Figure 48 illustrates the average loading times of the Neo4j database on the two operating systems. As can be seen in Figure 48, the average loading times on both the Windows system and the Ubuntu system grow as the size of the data sets increases, but the difference between the two operating systems is obvious. There are several fluctuations in the Windows trend, which may be caused by the unstable background environment of the Windows 10 system, while the Ubuntu trend shows steady growth. In addition, the loading time on the Windows system is much longer than on the Ubuntu system. On Windows 10 we used the Neo4j Desktop version with the Neo4j browser, so the corresponding LPG is rendered after executing a statement, while on Ubuntu we used Cypher-Shell commands, where results are shown as text and tables. Rendering the LPG takes more time than a text-only result, which contributes to the difference shown in Figure 48.


Figure 48: Average loading time in Neo4j on Windows system and Ubuntu system

5.5 Query time of Virtuoso on both systems

5.5.1 Query time of Virtuoso on Windows system

Figure 49 illustrates the average query time of Virtuoso on the Windows 10 system. The query statement counts the number of triples in the RDF data set, so the result of each query equals the size of the corresponding data set. In this experiment, we tested the 15 RDF data sets from the minimum size of 100k triples to the maximum size of 1.5 million triples. As can be seen in Figure 49, the average query time is about 0.03 seconds for the data set containing 100,000 triples, and the times continue to grow with the size of the data sets, reaching a peak of about 0.52 seconds for the data set containing 1.5 million triples, which is over 15 times the time for the smallest data set.


Figure 49: Average query time in Virtuoso on Windows system
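The triple-counting query has roughly the following form when run from Virtuoso's isql client; the named graph IRI below is a hypothetical placeholder for whichever graph a data set was loaded into, and the COUNT(*) shorthand is Virtuoso's aggregate syntax.

    -- Run from the isql client; the graph IRI is a hypothetical example
    SPARQL
    SELECT COUNT(*)
    FROM <http://example.org/graphs/dataset_100k>
    WHERE { ?s ?p ?o };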

5.5.2 Query time of Virtuoso on Ubuntu system

Figure 50 illustrates the average query time in Virtuoso on the Ubuntu system. In this experiment, we tested all thirty RDF data sets, with the largest containing 3 million triples. The query time increases by roughly 0.02 seconds for every additional 100k triples.

Figure 50: Average query time in Virtuoso on Ubuntu system

5.5.3 Comparison of query time on Windows system and Ubuntu system for Virtuoso

Figure 51 shows the query times measured in Virtuoso on both operating systems. As can be seen in the graph, the query times on the two operating systems trend upwards in proportion to the number of triples in the corresponding data set (i.e. the query result). However, there are some fluctuations on the Windows 10 system, while the growth on the Ubuntu system is stable. This is because the background environment on the Windows 10 system is much more complex than that on the Ubuntu system.

Figure 51: Average query time in Virtuoso on Windows system and Ubuntu system

5.6 Query time of Neo4j on both systems

5.6.1 Query time of Neo4j on Windows system

Figure 52 shows the average query time in Neo4j on the Windows system. In this experiment, the Neo4j query counts the number of distinct labels in the data set. As can be seen at the start of the graph, the RDF data set with 100k triples takes about 0.125 seconds to query, while the data set with 1.5 million triples takes about 1.348 seconds, over ten times the corresponding time for the 100,000-triple data set. The overall query time rises steadily as the size of the data sets increases.


Figure 52: Average query time in Neo4j on Windows system
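One possible Cypher formulation of such a label-counting query is sketched below; it illustrates the idea rather than necessarily being the exact statement used in the measurements.

    // Count the distinct labels that occur on nodes in the database
    MATCH (n)
    UNWIND labels(n) AS label
    RETURN count(DISTINCT label) AS distinctLabels;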

5.6.2 Query time of Neo4j on Ubuntu system

Figure 53 shows the average query time measured in Neo4j on the Ubuntu system. As we can see in the graph, the query time grows slightly as the size of the data sets increases. The average query times for the thirty RDF data sets range from 3.5 seconds to just over 5.5 seconds.

Figure 53: Average query time of Neo4j on Ubuntu system

5.6.3 Comparison of query time on Windows system and Ubuntu system for Neo4j

Figure 54 shows the average query times measured in Neo4j on both operating systems. As can be seen in Figure 54, the trends on the two operating systems are similar, and both are proportional to the number of distinct labels in the corresponding data set (i.e. the query result, shown in Appendix C). The difference in query time between any two adjacent data sets is small. Consequently, the difference between the smallest and largest data sets is not large on either operating system, which means the query times do not increase much as the size of the data sets grows.

Figure 54: Average query time in Neo4j on Windows system and Ubuntu system

Finally, comparing Figure 51 and Figure 54, we can see that the query time measured in Virtuoso increases noticeably as the size of the data sets grows, while in Neo4j the query time changes only slightly.

Chapter 6

Conclusion and Further Work

In this project, RDF data sets of different sizes were loaded into the OpenLink Virtuoso database and Neo4j on both Windows 10 and Ubuntu 16.04 systems. In the Virtuoso database, RDF data sets were loaded directly because Virtuoso is an RDF-type database, while in Neo4j the data sets had to be transformed from RDF to LPG using a transformation plugin written in Java. The transformation process shows that the data in these two graph database systems is interoperable. In addition, we measured the loading and query times of RDF data sets of different sizes in Virtuoso and Neo4j on the two operating systems. These experiments aimed to determine how loading and query times change as the size of the RDF data sets increases. The results show that Neo4j needs more time to load RDF data sets because of the transformation process, on both Windows 10 and Ubuntu 16.04, while the query times in Neo4j grow only slightly compared with Virtuoso as the data sets get larger, which means the query performance of Neo4j is not greatly influenced by the increase in data set size.

In terms of the two operating systems, the Windows 10 system served as a local testing environment that loaded relatively small data sets, while the Ubuntu 16.04 system, a remote virtual instance on Eleanor Cloud, loaded data sets up to twice the size of those used on Windows. Regarding how the two graph database systems were operated, we used isql for Virtuoso on both operating systems, but different approaches for Neo4j on each. On Windows we chose the Neo4j browser, a visual UI that displays results as LPG graphs, while on Ubuntu we used Cypher-Shell commands. This is because the Neo4j Desktop version, with its embedded Neo4j browser, is simpler and more intuitive, whereas on the Ubuntu system, a remote virtual instance on Eleanor Cloud, it is difficult to open the Neo4j web browser from localhost, so we used Cypher-Shell instead. However, this difference in approach means that some queries on Windows take relatively longer because their results are rendered as LPG graphs.

For future work, we plan to use Cypher-Shell on both operating systems, i.e. to use the Neo4j Community version from the command prompt on Windows and execute Cypher statements through Cypher-Shell, so that the time spent rendering LPG results in the Neo4j browser can be eliminated from the measurements. In addition, to make the data sets more readable, we plan to use the default example graphs in Neo4j: by extracting and exporting such a default graph, which is an LPG, into an RDF file, we could import the same data into the Virtuoso database and see the differences between an LPG and an RDF representation more clearly. Furthermore, the code of the transformation plugin could be improved for efficiency and performance based on the existing version, and the memory allocated to Java should be examined more carefully in further experiments. We also plan to carry out more experiments with additional queries in Neo4j and Virtuoso, including more readable and practically meaningful queries over both data sets.

In conclusion, the experiments suggest that Neo4j is more suitable for managing data sets of moderate size; for instance, it may suit a new company, which does not have much data at the beginning. Neo4j is also good at querying data with a complex structure. Virtuoso, on the other hand, may be more suitable for large data sets. Both types of graph database systems therefore have their own suitable areas of application.

Appendix A

Database Installation Package

A.1 OpenLink Virtuoso on Windows 10 system

Microsoft Visual C++ 2012 Redistributable (x64) - 11.0.61030

virtuoso-opensource-win-x64-20160425

A.2 Neo4j on Windows 10 system

neo4j-desktop-offline-1.0.15-setup

Appendix B

Original data sets from DBpedia

B.1 Original data sets from DBPedia

mappingbased_objects_wkd_uris_sv.ttl.bz2

– (http://downloads.dbpedia.org/2016-10/tmp/data/sv/raw/)

Appendix C

Detailed measurements results

C.1 Loading time measured in Virtuoso on Windows 10 system

C.2 Loading time measured in Neo4j on Windows 10 system

C.3 Loading time measured in Virtuoso on Ubuntu 16.04 system

C.4 Loading time measured in Neo4j on Ubuntu 16.04 system

C.5 Query time measured in Virtuoso on Windows 10 system

C.6 Query time measured in Neo4j on Windows 10 system

C.7 Query time measured in Virtuoso on Ubuntu 16.04 system

C.8 Query time measured in Neo4j on Ubuntu 16.04 system

C.9 Query results in Neo4j

References

[1] Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., & Wilkins, D. (2010, April). A comparison of a graph database and a relational database: a data provenance perspective. In Proceedings of the 48th Annual Southeast Regional Conference (p. 42). ACM.
[2] Wotring, S. C., & Ripley, J. R. (2005). U.S. Patent No. 6,853,997. Washington, DC: U.S. Patent and Trademark Office.
[3] Nayak, A., Poriya, A., & Poojary, D. (2013). Type of NOSQL databases and its comparison with relational databases. International Journal of Applied Information Systems, 5(4), 16-19.
[4] Zhang, C., Naughton, J., DeWitt, D., Luo, Q., & Lohman, G. (2001, May). On supporting containment queries in relational database management systems. In ACM SIGMOD Record (Vol. 30, No. 2, pp. 425-436). ACM.
[5] Wood, D., Zaidman, M., Ruth, L., & Hausenblas, M. (2014). Linked Data. Manning Publications Co.
[6] Angles, R. (2012, April). A comparison of current graph database models. In Data Engineering Workshops (ICDEW), 2012 IEEE 28th International Conference on (pp. 171-177). IEEE.
[7] Riesen, K., & Bunke, H. (2008, December). IAM graph database repository for graph based pattern recognition and machine learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR) (pp. 287-297). Springer, Berlin, Heidelberg.
[8] Robinson, I., Webber, J., & Eifrem, E. (2013). Graph Databases. O'Reilly Media, Inc.
[9] Lassila, O., & Swick, R. R. (1999). Resource Description Framework (RDF) model and syntax specification.
[10] CambridgeSemantics page for RDF. https://www.cambridgesemantics.com/blog/semantic-university/learn-rdf/ - Accessed on 2nd August.
[11] CambridgeSemantics page for RDF and XML. https://supportcenter.cambridgesemantics.com/semantic-university/rdf-vs-xml - Accessed on 2nd August.

[12] Matono, A., Amagasa, T., Yoshikawa, M., & Uemura, S. (2003, September). An indexing scheme for RDF and RDF Schema based on suffix arrays. In SWDB (pp. 151-168).
[13] Antoniou, G., & Van Harmelen, F. (2004). Web Ontology Language: OWL. In Handbook on Ontologies (pp. 67-92). Springer, Berlin, Heidelberg.
[14] CambridgeSemantics page for OWL ontology. https://www.cambridgesemantics.com/blog/semantic-university/learn-owl-rdfs/owl-101/ - Accessed on 3rd August.
[15] CambridgeSemantics page for RDFS and OWL. https://www.cambridgesemantics.com/blog/semantic-university/learn-owl-rdfs/ - Accessed on 2nd August.
[16] Harris, S., Seaborne, A., & Prud'hommeaux, E. (2013). SPARQL 1.1 query language. W3C Recommendation, 21(10).
[17] CambridgeSemantics page for SPARQL. https://www.cambridgesemantics.com/blog/semantic-university/learn-sparql/ - Accessed on 4th August.
[18] Erling, O. (2012). Virtuoso, a hybrid RDBMS/graph column store. IEEE Data Eng. Bull., 35(1), 3-8.
[19] Neo4j blog page for LPG and RDF graph. https://neo4j.com/blog/rdf-triple-store-vs-labeled-property-graph-difference/ - Accessed on 3rd August.
[20] Developers, N. (2012). Neo4j. Graph NoSQL Database [online].
[21] Miller, J. J. (2013, March). Graph database applications and concepts with Neo4j. In Proceedings of the Southern Association for Information Systems Conference, Atlanta, GA, USA (Vol. 2324, p. 36).
[22] Hecht, R., & Jablonski, S. (2011, December). NoSQL evaluation: A use case oriented survey. In Cloud and Service Computing (CSC), 2011 International Conference on (pp. 336-341). IEEE.
[23] Wikipedia page of Neo4j. https://en.wikipedia.org/wiki/Neo4j - Accessed on 3rd August.
[24] DBpedia introduction page. https://wiki.dbpedia.org/about - Accessed on 3rd August.
[25] World Wide Web Consortium. (2014). RDF 1.1 concepts and abstract syntax.
[26] W3C page for Turtle and N-Triples. https://www.w3.org/TeamSubmission/turtle/ - Accessed on 5th August.
[27] OpenLink page for the Bulk Loader in Virtuoso. http://vos.openlinksw.com/owiki/wiki/VOS/VirtBulkRDFLoader - Accessed on 5th August.

[28] Virtuoso installation on Ubuntu. http://vos.openlinksw.com/owiki/wiki/VOS/VOSUbuntuNotes - Accessed on 5th August.
[29] Transformation plugin page. https://jbarrasa.com/2016/06/07/importing-rdf-data-into-neo4j/ - Accessed on 3rd August.
[30] GitHub page of the transformation plugin. https://github.com/jbarrasa/neosemantics/releases - Accessed on 3rd August.
[31] An example of an RDF graph. https://www.w3.org/TR/1999/REC-rdf-syntax-19990222/ - Accessed on 7th August.
