Representing and Storing Semantic Data in a Multi-Model Database
Simen Dyve Samuelsen

Thesis submitted for the degree of Master of Science in Programming and Networks
60 credits

Department of Informatics
Faculty of Mathematics and Natural Sciences
UNIVERSITY OF OSLO
Spring 2018

© 2018 Simen Dyve Samuelsen
http://www.duo.uio.no/
Printed: Reprosentralen, University of Oslo

Abstract

With the emergence of NoSQL multi-model databases (natively supporting scalable and unified storage and querying of various data models such as graph, document, key-value, and relational), new opportunities arise for efficient representation and storage of data. Whereas traditional systems rely on multiple databases and/or on databases that are not optimized for the data being stored, multi-model databases allow for more flexibility and are built around the concepts of database distribution and availability. The multi-model structure also allows one database to do what several databases are currently combined to do in a polyglot structure.

Semantic data, with its graph-oriented structure, is one type of data that could benefit from the use of multi-model databases, both for representing and for storing the data. RDF is a popular model for semantic data, but RDF management systems face challenges of scalability and generality, and the scalability challenge is particularly urgent. Working with RDF graphs, which are typically highly connected and distributed, results in querying large volumes of data, thus making the scalability issue more pressing.
Earlier approaches to improving storage for RDF data have used relational databases, but even though these are optimized for data handling, they are not very flexible, and semantic data does not necessarily fit within a predefined rigid schema inside a relational database. NoSQL databases allow for better flexibility and do not enforce any predefined schema on the data stored, thus better supporting the variety of data within the semantic data domain.

This thesis explores and defines different approaches to representing and storing RDF data within a multi-model NoSQL database. It identifies various aspects of mapping the RDF data structure onto a multi-model data structure and discusses their advantages and disadvantages. In addition, the thesis describes an approach to representing the semantic spacetime data model introduced by Mark Burgess, comparing how two different semantic models (RDF and spacetime) can be represented in the same multi-model database.

Furthermore, the thesis proposes a prototype implementation of the two representation and storage approaches in ArangoDB, a popular multi-model database.

Acknowledgements

I would like to thank everyone who helped and contributed to this thesis. I am very grateful for all the time invested.

First, I would like to thank my supervisors, Dumitru Roman (UiO and SINTEF) and Nikolay Nikolov (SINTEF), for all their guidance, motivation, technical and conceptual discussions, and contributions to the development and writing process. By including me in the project and the group, they have been essential to the thesis and the end result. In addition, thanks to Dumitru for inviting me to give presentations and hands-on sessions on NoSQL and multi-model databases at the University of Oslo, challenging me to present and discuss features of the technology.

I also want to thank everyone else in the Smart Data group at SINTEF for their contributions and discussions of the implementation.
I am also grateful to them for including me in the discussion of how the connection between ArangoDB and DataGraft could be implemented.

Second, I would especially like to thank my family for supporting me through this process and for being so patient with me.

Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Research questions
  1.4 Research design
  1.5 Thesis outline
2 Background
  2.1 Semantic data
    2.1.1 Semantic web
    2.1.2 Semantic spacetime
  2.2 Multi-model databases
    2.2.1 Overview of multi-model databases
    2.2.2 ArangoDB
3 Modeling RDF in ArangoDB
  3.1 Representing the RDF data model in the ArangoDB data model
    3.1.1 Direct representation
    3.1.2 Direct representation with edge values
    3.1.3 RDF flattening - a document representation of RDF
  3.2 Implementation in the DataGraft platform
    3.2.1 Overview of the DataGraft platform
    3.2.2 Extension to the DataGraft platform
    3.2.3 Implementation details
4 Modeling spacetime in ArangoDB
  4.1 Representing the spacetime data model in the ArangoDB data model
    4.1.1 Direct representation
    4.1.2 Flattened representation
    4.1.3 Direct representation with edge values
  4.2 Implementation using ArangoDB
    4.2.1 Storing spacetime data in ArangoDB
    4.2.2 Querying spacetime data in ArangoDB
    4.2.3 Implementation details
5 Evaluation
  5.1 Database evaluation
    5.1.1 Evaluation scenarios
    5.1.2 Test environment
    5.1.3 Results
  5.2 Evaluation of RDF data storage approaches in ArangoDB
    5.2.1 Evaluation scenarios
    5.2.2 Test environment
    5.2.3 Results and discussion
  5.3 Evaluation of spacetime data storage approaches in ArangoDB
    5.3.1 Evaluation scenarios
    5.3.2 Test environment
    5.3.3 Results / discussion
6 Conclusion and outlook
  6.1 Conclusion
  6.2 Future work
References
Appendix A

List of Figures

1.1 Iteration process
2.1 RDF graph
2.2 ArangoDB NoSQL benchmark results 14/02/18
3.1 DataGraft dashboard
3.2 DataGraft assets
3.3 Graph mapping in Grafterizer
3.4 DataGraft connection administration
3.5 DataGraft ArangoDB database administration
3.6 Localscript components
3.7 String hash function, to hash URIs
3.8 Web service components
3.9 RDF flattened representation in JSON
3.10 The different options to handle the results when using Grafterizer
4.1 Example spacetime data
4.2 Example spacetime data represented in ArangoDB
4.3 The association menu types
4.4 The STtypes as referred to from the association menu
4.5 AQL query retrieving all connected nodes from the node n
5.1 Survey question 1
5.2 Survey question 2
5.3 Survey question 3
5.4 Survey question 4
5.5 Survey question 5
5.6 Survey question 6
5.7 Survey question 7
5.8 Survey question 8
5.9 Survey question 9
6.1 Polyglot persistence

List of Tables

2.1 Spacetime STtypes
5.1 Benchmark results
5.2 Benchmark results RDF

Chapter 1

Introduction

1.1 Context

The adoption of the linked data paradigm and the RDF format (https://www.w3.org/RDF/) has grown significantly over the past decade: the Linked Open Data (LOD) cloud initiative (http://lod-cloud.net/) reports close to 1200 datasets (up from just 32 in 2008), and the current total size of the Data Web is estimated at almost 3000 distinct datasets and around 150 billion triples (http://stats.lod2.eu/). The linked data paradigm promotes the publishing of semantically enriched data on the Web through the use of self-describing data/relations and interlinking based on associating globally unique identifiers of data. Every entity or thing in RDF is represented by a Uniform Resource Identifier (URI) that can be dereferenced, which allows integration of data in a cross-domain graph.

Even though RDF data is gaining wider acceptance, there are still challenges with the practical use of RDF. “RDF data management systems are facing two challenges: namely, systems’ scalability and generality. The challenge of scalability is particularly urgent” [15]. Working with RDF graphs, which are typically highly connected and distributed, results in matching and querying large volumes of data, thus making the scalability issue more pressing.

The article [9] describes new consumer trends that triggered the need for new ways to store large amounts of data. The whole process is a self-reinforcing loop: more users generate more data, more data leads to better algorithms, better algorithms make for a better user experience that