University of Zurich Department of Informatics

Tygrstore: a Flexible Framework for High Performance Large Scale RDF Storage

Bachelor Thesis, March 18, 2011

Yannick Koechlin of Basel BS, Switzerland

Student-ID: 03-056-314 [email protected]

Advisor: Cosmin Basca
Prof. Abraham Bernstein, PhD
Department of Informatics, University of Zurich
http://www.ifi.uzh.ch/ddis

Acknowledgements

First and foremost I would like to thank my advisor Cosmin Basca for the faith, time and work he invested in this project with me. Without his help and his numerous recapitulations of some topics, I would not have achieved any working version of Tygrstore. I would also like to thank the fellow students who helped me, especially the SEAL master students in the project room, with whom I had a lot of fun. Last but not least, my girlfriend Anna Ehrenklau supported me whenever she could. Thank you! I also had a lot of fun learning Python/Cython and gaining knowledge about triplestores; both of these topics were new to me.

Abstract

This thesis describes the architecture of a highly flexible triplestore framework. Its main features are pluggable backend storage facilities, horizontal scalability, a simple API and the generation of endless result streams. Special attention has been paid to easy extensibility. First a detailed view of the architecture is given; later, more details on the actual implementation are revealed. Finally, two possible triplestore setups are benchmarked and profiled. It is shown that the currently limiting factors do not lie within the architecture but in the library code for the backends. In the end, possible solutions and enhancements to the framework are discussed.

Zusammenfassung

This thesis describes and documents Tygrstore. Tygrstore is a flexible, high-performance framework for building triplestores. Its main features are exchangeable backend storage solutions, a simple API and the ability to stream endless results. Tygrstore furthermore enables horizontal scaling. Special attention was paid to making every module easy to extend. First the architecture is discussed, followed by the details of the actual implementation. Two possible setups are then tested and analyzed in more detail. Finally, directions for further development are proposed.

Contents

1 Introduction
  1.1 Context
  1.2 Motivation

2 Tygrstore Architecture
  2.1 Architecture Overview
    2.1.1 Execution Model
  2.2 Index Manager
  2.3 Storage Backends
    2.3.1 Storage Backends
  2.4 Sparql Parser
  2.5 String Store
  2.6 QueryEngine
    2.6.1 Query Engine Architecture
  2.7 Installation and Configuration
    2.7.1 Dependencies
  2.8 Importer

3 Example Implementations
  3.1 Kyoto Cabinet Backend
    3.1.1 Data-model
    3.1.2 KVIndexKC Implementation
  3.2 MongoDB Backend
    3.2.1 About MongoDB
    3.2.2 KVIndexMongo Backend
    3.2.3 MongoDB Stringstore

4 Evaluation
  4.0.4 Dataset
  4.0.5 Hardware
  4.0.6 Software
  4.0.7 Benchmarking
  4.1 Profiling

5 Limitations
  5.1 Architectural Limitations
  5.2 Functional Limitations
  5.3 Limits caused by Evolution

6 Future Work
  6.1 Internal Enhancements
  6.2 Ecosystem

7 Conclusions

A Appendix
  A.1 Glossary
  A.2 Detailed Benchmark Results
    A.2.1 Results
  A.3 LUBM Queries

List of Figures

Bibliography

1 Introduction

This thesis introduces, documents and evaluates a new triplestore called Tygrstore. Besides being a triplestore which evaluates SPARQL queries, Tygrstore is also designed to serve as a foundation framework for rapid prototyping of new techniques in RDF storage. Accordingly, the general focus was on extensibility rather than on optimal behavior for a specific case. Tygrstore is written in Cython1, which permits fine grained performance optimization written in C while retaining the full benefits of Python's dynamic nature.

Based on the TokyoTyGR triple store (used in [Basca and Bernstein, 2010]), Tygrstore implements the Hexastore model described in [Weiss et al., 2008] and [Weiss and Bernstein, ]. In Section 1.1 the context of triple stores is briefly discussed, as it is the basis of the motivation for this thesis. The detailed architecture of Tygrstore is covered throughout Chapter 2. While Tygrstore can work in the same way as TokyoTyGR, it can also easily be adapted to storage facilities other than Tokyo Cabinet. This flexibility is shown in Chapter 3, where the two main backend storage drivers are explained in detail. Further characteristics of Tygrstore are explored in the evaluation in Chapter 4.

1 http://cython.org/

1.1 Context

In recent years, the semantic web has become a vivid topic in research. Especially the community around graph databases was revitalized. The characteristics of RDF Schema as a graph model and SPARQL as a query language gave rise to new problems. Among these topics, the scalability, processing speed and optimization of triple stores are still heavily discussed. RDF storage research has concentrated on three main areas: join processing [Vidal et al., 2010], [Senn, 2010], [Kim et al., 2009], query optimization [Schmidt et al., 2010] and data-structures [Weiss and Bernstein, ], [Senn, ], [Atre and Hendler, ] and [Atre et al., 2010]. The combined question still is how these are going to scale. Brewer's CAP theorem [Gilbert and Lynch, 2002] clearly reveals that traditional RDBMSs will not solve the problems at hand. Meanwhile, the Lean Startup methodology2 and the advance of the internet on mobile (and other) platforms resulted in today's abundance of SaaS and PaaS offerings3. These systems in turn have scaling needs beyond the possibilities of ACID-compliant RDBMSs. This resulted in a new generation of DBMSs, often subsumed as NoSQL stores. This typology usually includes:

• Key/Value Stores

• Distributed Key/Value Stores

• Graph Oriented Stores

• Column Oriented Stores

• Document Oriented Stores

• Map-Reduce based Systems

• Mixtures of the above

Each of these has at least a handful of implementations4. So this business-oriented community solves at least similar problems as the semantic web community. Where they differ is that, while the semantic web works towards a standardization of each component via the W3C, the interfaces in the NoSQL world lack any such standard. This leads to the many proprietary query languages in existence and the fallback to (compared to SPARQL) relatively primitive data retrieval protocols (REST or JSON).

1.2 Motivation

These two communities are solving similar problems, and each of them offers methods the other lacks and needs. In that sense, Tygrstore could provide a framework to help those two worlds converge, while not changing either side at its core. In this thesis I show an architecture which allows us to leverage the already highly optimized, production-ready NoSQL stores and supply them with a standardized way to access the data therein. Furthermore it should be possible to rapidly prototype and evaluate new concepts in other areas, especially in query optimization and join processing.

2 http://en.wikipedia.org/wiki/Lean_Startup
3 this includes social networks, google apps, social games etc.
4 a list can be found on http://nosql-database.org/

2 Tygrstore Architecture

2.1 Architecture Overview

The core of Tygrstore consists of five main classes: the Sparql parser, the QueryEngine, the Stringstore, the IndexManager and the Index. Put together, these modules allow the execution of SPARQL SELECT queries. Figure 2.1 shows the call hierarchy of these classes. TygrstoreServer and the shell script tygrstore.sh are simple wrapper scripts that make the QueryEngine's functionality available via HTTP, msgpack1 and in a POSIX console. The Sparql parser is a separate Python package called cysparql. It contains Cython bindings to Rasqal 0.9.252 as well as methods to save data. This package was kindly provided by Cosmin Basca.

2.1.1 Execution Model

After the query has been parsed, the QueryEngine uses the Stringstore to convert the RDF literals to their internal representation (database IDs). As we will see, this database ID is usually a hash of the RDF literal in n3 notation. The query is also optimized by looking up the selectivities of each variable used in the WHERE clause. The query variables are then resolved using recursion. In doing so, the QueryEngine operates on the IndexManager, which decides which of the available indices is consulted. The IndexManager returns generators for all triple patterns containing the current variable. These generators yield database IDs, which are then merge-joined by the QueryEngine. The resulting intersection defines the space in which the variable's solutions can possibly reside. For each of these possible solutions a recursion step is taken and the remaining variables are solved. This means that results are generated continuously and can be streamed by the client.
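To make this execution model concrete, here is a small, self-contained toy sketch (not Tygrstore code) that mimics the recursion over an in-memory list of ID-encoded triples. The variable order is fixed by the caller instead of being chosen by selectivity, and a plain set intersection stands in for the streaming merge join; all names are invented for illustration.

def matching_ids(pattern, var, bindings, triples):
    """Sorted IDs that `var` can take so that `pattern` matches some triple.
    `pattern` mixes integer IDs and variable names (strings)."""
    resolved = tuple(bindings.get(p, p) for p in pattern)   # substitute known vars
    hits = set()
    for triple in triples:
        if all(r == t for r, t in zip(resolved, triple) if not isinstance(r, str)):
            hits.add(triple[pattern.index(var)])
    return sorted(hits)

def evaluate(patterns, variables, triples, bindings=None):
    """Bind one variable per recursion level; yield every complete solution."""
    bindings = bindings or {}
    if not variables:
        yield dict(bindings)
        return
    var, rest = variables[0], variables[1:]
    relevant = [p for p in patterns if var in p]
    # intersect the candidate streams of all patterns containing `var`
    candidates = set(matching_ids(relevant[0], var, bindings, triples))
    for p in relevant[1:]:
        candidates &= set(matching_ids(p, var, bindings, triples))
    for candidate in sorted(candidates):
        bindings[var] = candidate
        for solution in evaluate(patterns, rest, triples, bindings):
            yield solution
        del bindings[var]                    # backtrack before trying the next ID

# toy dataset, already encoded as integer IDs (s, p, o)
TRIPLES = [(1, 10, 100), (1, 10, 101), (2, 10, 100), (1, 11, 200), (2, 11, 100)]
# ?x 10 100  AND  ?x 11 ?n
for solution in evaluate([("x", 10, 100), ("x", 11, "n")], ["x", "n"], TRIPLES):
    print(solution)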

1 http://msgpack.org
2 http://librdf.org/rasqal/

Figure 2.1: Tygrstore Architecture Overview.

2.2 Index Manager

The IndexManager instance provides methods to maintain and look up indexes. It serves as the central dispatcher and single point of entry to the indexes. Thus the QueryEngine should never call indexes directly but always go through the IndexManager.

IndexManager API methods (a short usage sketch follows this list):

General Remark: ttriple means an instance of the Tygrstore Triple class. A natural is one character of the natural ordering, so usually ‘s’ or ‘p’ or ‘o’.

• selectivity_for_triple(statement_as_tuple) statement_as_tuple is a tuple filled with database keys or None, in the order of self.naturals. It proxies the request to the correct indexes and returns an integer with an estimation of how many triples match the pattern. If multiple indexes can be used, it returns the minimal value.

• ids_for_triple(statement_as_tuple, searched_natural) returns a generator which yields database keys matched by the triple pattern in statement_as_tuple. searched_natural is a single character denoting the position of the unbound variable which should be resolved. E.g. if the triple pattern (statement_as_tuple) is ('31337', None, '38317', None) and the searched natural is 'c', the generator will yield all contexts which have a triple in which the subject and the object correspond to the above keys.

• ids_for_ttriple(ttriple, var) returns a generator which yields database keys matched by the triple pattern in ttriple. var is a string with the name of the SPARQL variable, but without the leading ? or $. ids_for_triple should be preferred over this method, since it is more generic.

• add_to_all_indexes(statement_as_tuple) used to add a statement to the store. This should only be used for small amounts of statements. Bulk insertion should be done via a specialized importer script which talks directly to the backend, see Section 2.8.

• close() calls close() on all indexes

• len() returns the total number of statements in the store
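As a hypothetical illustration of this API (the keys and counts below are invented), a caller such as the QueryEngine might use the IndexManager roughly like this:

# assuming `iman` is an initialized IndexManager and the store uses the
# natural ordering 'spo'; the keys below are placeholders
pattern = ('31337', None, '38317')          # subject and object bound, predicate unbound

estimate = iman.selectivity_for_triple(pattern)
print("about %d matching triples" % estimate)

# stream the database IDs of all predicates linking the two bound terms
for predicate_id in iman.ids_for_triple(pattern, 'p'):
    print(predicate_id)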

IndexManager Internals:

The IndexManager instance keeps a list of all indexes as well as a mapping dictionary to find suitable indexes. Initialized in the __init__() function of the IndexManager, this simple data-structure is the foundation of the IndexManager:

'''here we instantiate one index class for every permutation of the naturals
string. we also add a (tuple as) key for every sub-tuple (e.g. s, and s,p and
s,p,o for s,p,o,c) with the index as value. they go into unique_indexes and
indexes'''
def build_indexes(self):
    #build a list of tuples of all permutations of the naturals
    #e.g. 'spo' -> ('s','p','o'), ('s','o','p') etc.
    for p in permutations(self.naturals):
        #the name (in the index known as internal_ordering), e.g. ('s','o','p') -> "sop"
        index_name = "".join(p)
        #instantiate a new index,
        #equal to e.g.: an_index = KVIndexTC(config_instance, name="spo")
        an_index = self.index_class(self.config, name=index_name)
        #maintain a list of all indexes in unique_indexes
        self.unique_indexes.append(an_index)
        #generate the keys for the mapping dict self.indexes and set the index as value
        #e.g. for 'spo' -> ('s',), ('s','p'), ('s','p','o') as keys
        for i in range(1, len(p)+1):
            self.indexes[p[0:i]] = an_index
        #for (None,), (None,None,...) set any index
        for none_tuple_length in range(1, len(p)+1):
            none_tuple = (None,) * none_tuple_length
            #if we have no given vars it does not matter which index we take
            self.indexes[none_tuple] = an_index
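For naturals 'spo', this builds six index instances and a lookup dictionary whose keys are tuples of solved naturals; a hypothetical excerpt (index objects abbreviated, and the exact instance behind a sub-tuple depends on which permutation was registered last for it):

# illustrative excerpt of self.indexes after build_indexes() for naturals 'spo'
#   ('s',)           -> index 'spo' (or 'sop')
#   ('s', 'p')       -> index 'spo'
#   ('s', 'p', 'o')  -> index 'spo'
#   ('o', 'p')       -> index 'ops'
#   ('p', 'o')       -> index 'pos'
#   (None,), (None, None), (None, None, None) -> any index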

When the QueryEngine calls the ids_for_ttriple method, the correct index is internally determined with the index_for_ttriple method. This method puts together the already solved variables and determines which unsolved component of the triple pattern we are searching for (the natural). With that information at hand it can look up the index in the dictionary.

'''return the corresponding index for a statement'''
@memoized
def index_for_ttriple(self, ttriple, var):
    #get all solved naturals in the triple
    idx_name = tuple(i for i in self.compress(self.naturals, ttriple.ids_as_tuple()))
    #add the natural where the variable we are searching for is
    idx_name += (self.naturals[ttriple.variables_tuple.index(var)],)
    #look the index up in the dict
    return self.indexes[idx_name]

'''return the corresponding indexes for a statement'''
@memoized
def indexes_for_tuple(self, triple):
    #generator for all possible key combinations
    naturals = self.compress(self.naturals, triple)
    #generator for all permutations
    all_indexes = permutations(x for x in naturals)
    #list of all matching indexes
    return [self.indexes[i] for i in all_indexes]

Instead of caching this method with memoized, it should be re-factored and the information about the naturals should be pushed into the Triple class. This was omitted because the Triple wrapper class is a later addition to the framework and its interface is still unstable.

All other methods are straightforward implementations of their respective task. Only self.update_only_one has a slight effect, in that it signals that all indexes in self.unique_indexes are possibly the same instance of a KVIndex. The reasons for its existence are further described in Section 3.2.2.

2.3 Storage Backends

To store the actual indexes of the triple store, a storage backend is needed. These backends can be anything from optimized in-memory or on-disk data-structures to network accessible databases. The interface to implement such a backend is kept to a bare minimum. This allows easy and fast development of new adapters. As described in more detail in Section 2.3.1, a functional read-only index could be implemented by providing just three API methods, namely ids_for_triple, selectivity_for_triple and __init__. Nonetheless it is recommended to implement the whole API described below. Currently there are adapters for Tokyo Cabinet, Kyoto Cabinet, Redis and MongoDB available, each providing unique features not found in the others. This can be exploited to give Tygrstore special performance or scaling characteristics. Changing the backend storage is a matter of reloading the database. Mixing different types of backends has not been tested and is currently not supported by the configuration subsystem. As the API provides unambiguous methods it should certainly be possible, since there is absolutely no coupling between different index instances.

2.3.1 Storage Backends

All storage backends need to subclass the KVIndex class. This superclass provides the general housekeeping functionality common to all types of backend storage. The order of the index should not matter in the actual implementation. It is determined or defined at creation time, where a decorator adjusts the parameters of the API calls to match the current index. This is the almost 'magic' input ordering decorator and its setup function:

def setup_reordering_decorators(self):
    self.reordering = []
    for i in self.internal_ordering:
        self.reordering.append(self.input_ordering.find(i))
    self.logger.info("reordering: %s" % str(self.reordering))
    self.ids_for_triple = self.input_reorder_wrapper(self.ids_for_triple)
    self.add_triple = self.input_reorder_wrapper(self.add_triple)
    self.selectivity_for_triple = self.input_reorder_wrapper(self.selectivity_for_triple)

def input_reorder_wrapper(self, original_func):
    def reorder(the_tuple, **kwargs):
        try:
            reordered_tuple = []
            for x in self.reordering:
                reordered_tuple.append(the_tuple[x])
            reordered_tuple = tuple(reordered_tuple)
            return original_func(reordered_tuple, **kwargs)
        except IndexError:
            self.logger.error("defect tuple!")
            self.logger.error(the_tuple)
    return reorder

setup_reordering_decorators wraps the methods which receive triple patterns with input_reorder_wrapper. For each call made to these methods, the incoming tuple is rearranged to match the order of the receiving index. This is a trick to make the code of the KVIndex implementations much more concise.
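As a hedged illustration of what the wrapper does (the keys are invented), consider an index with internal_ordering 'ops' receiving a pattern in the input ordering 'spo':

# hypothetical example: input_ordering = 'spo', internal_ordering = 'ops'
# reordering becomes [2, 1, 0] (position of 'o', 'p', 's' in 'spo')
incoming = ('key_s', None, 'key_o')          # (s, p, o) as delivered by the IndexManager
reordered = tuple(incoming[x] for x in [2, 1, 0])
print(reordered)                             # ('key_o', None, 'key_s') -> (o, p, s)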

KVIndex API Methods

A KVIndex implementation has to provide the following methods to be able to work:

• __init__

• open

• close

• add_triple

• selectivity_for_triple

• ids_for_triple

__init__(config_file, name="spo")

config_file is an instance of ConfigParser.RawConfigParser and enables direct access to the tygrstore.cfg configuration file. Each backend type should use a separate section to read configuration options which cannot be shared. All other configuration shall be read from the [index] section. The name parameter is a string that denotes the ordering of the index and can be read from the property self.internal_ordering. This association is done in the base class KVIndex. Therefore it is necessary to initialize the base class by calling super(KlassName, self).__init__(self.config_file, name), where KlassName is the name of the subclass in which the call is made. Also a call to the open method should be made to assure that the connection to the actual backend works.

open() returns: INDEX_OPEN. All calls needed to initialize the backend(s) should be made here instead of in the __init__ function, so backends can be closed and reopened at runtime. If the backend fails to open, an exception should be thrown. In case the index uses different levels, they should be stored in the list self.levels.

close() returns: INDEX_CLOSED. This function should close the backend in a way that it can be safely opened by another instance; all network traffic to the backend is stopped.

add_triple(tuple_of_ids) adds a single triple to the index. The parameter is a tuple which contains IDs in the order described by self.input_ordering, e.g. 'spo'.

__len__() returns: an integer, the total number of entries in the index.

selectivity_for_triple(tuple_of_ids) returns: an integer, the selectivity estimation for the tuple_of_ids triple pattern. In the best case it returns the effective number of triples matching the input pattern. If this is not possible, another algorithm which mimics these numbers should be implemented, or self.selectivity_estimation should be set to false.

ids_for_triple(tuple_of_ids, num_records=-1, searched_key=-1) returns: a generator which yields IDs for the first key which is None. E.g. if tuple_of_ids is ('id1', 'id2', None, None) and we are in a spoc index, then the generator should yield keys of type 'o'. If num_records is positive it should yield sets of IDs, where each set contains the number of records this variable has. If searched_key is positive it should contain the index of the key which should be resolved, thus 2 for the above example, or 3 if keys of type 'c' are looked for.

The following is a list of properties maintained by the base class; a minimal example backend is sketched after the list:

• self.updateable: if this is set to False the store does not need to accept calls to the add_triple() function.

• self.keylength: the length of the database ID in bytes

• self.input_ordering: the ordering in which requests come in, usually 'spo' or 'spoc'

• self.internal_ordering: the ordering of the index, e.g. 'ops'
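To illustrate how small a backend can be, here is a hedged sketch of a read-only, in-memory KVIndex subclass backed by a sorted list of concatenated keys. It is not part of Tygrstore: the bisect-based lookup merely mimics the prefix scans a real B+Tree backend would provide, the INDEX_OPEN/INDEX_CLOSED constants and the KVIndex base class are assumed to be importable from the framework, and num_records/searched_key are ignored for brevity.

from bisect import bisect_left

class KVIndexMemory(KVIndex):
    """Hypothetical read-only backend over a sorted list of concatenated IDs."""

    def __init__(self, config_file, name="spo"):
        super(KVIndexMemory, self).__init__(config_file, name)
        self.keys = []            # sorted list of concatenated fixed-length IDs
        self.open()

    def open(self):
        self.keys.sort()
        return INDEX_OPEN         # framework status constant (assumed)

    def close(self):
        return INDEX_CLOSED       # framework status constant (assumed)

    def selectivity_for_triple(self, tuple_of_ids):
        prefix = "".join(i for i in tuple_of_ids if i is not None)
        start = bisect_left(self.keys, prefix)
        return sum(1 for k in self.keys[start:] if k.startswith(prefix))

    def ids_for_triple(self, tuple_of_ids, num_records=-1, searched_key=-1):
        prefix = "".join(i for i in tuple_of_ids if i is not None)
        start = bisect_left(self.keys, prefix)
        for key in self.keys[start:]:
            if not key.startswith(prefix):
                break
            #the searched ID directly follows the prefix, one keylength wide
            yield key[len(prefix):len(prefix) + self.keylength]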

2.4 Sparql Parser

Sparql is parsed by Rasqal3 via its Cython bindings cysparql. Since these bindings are new and not finished, I have created wrapper classes for easier prototyping and debugging.

3 http://librdf.org

The helpers module provides three of these wrapper classes around cysparql: Triple, BGP and ResultSet. These exist to maintain and aggregate properties needed to resolve the query. While they could (and eventually should) be moved into the cysparql module, there are reasons to keep them. First of all, they can be seen as a prototype of which functionality Tygrstore expects. Furthermore, they help to give a more detailed view when profiling, since they can be run in pure Python mode. The most important class is ResultSet; it provides the methods on which the QueryEngine algorithm operates. It has the following properties and methods:

• triples a list of all triples in the Basic Graph Pattern

• variables a list of all variables in those triples

• unsolved_variables a list of all variables that are unsolved in the current recursion step

• solutions a dictionary with the variables as keys and IDs as values

• triples_with_var(var) a list of all triples which contain a certain variable

• unsolved_triples() all triples with unsolved variables

• get_most_selective_var() the next variable to solve

• triples_with_var(var) all triples containing the variable supplied

• resolve(var, solution) resolve a variable with an id

• unresolve(var) unresolve a variable

It is important to note that this API will most certainly change, as the interplay between cysparql, ResultSet and QueryEngine is not yet optimal. Chapter 6 explains possible solutions for that.

2.5 String Store

The Stringstore instance maintains the mappings from database IDs to all RDF literals (URIs and string literals) in the store. The base class Stringstore provides the function get_new_id(any_string), which returns a new ID. If a hash function is used, this function is just an alias to the hash function; otherwise it uses the API methods of Stringstore, which must be implemented.

Stringstore API Methods:

id2s(an_id): returns: a string in n3 notation. This is used when the database ID gets converted back to a result for a variable.

s2id(a_string): returns: a database ID. Used when the triple patterns are encoded to database IDs. If self.numeric_ids is false, this function can be either an alias or a wrapper for get_new_id(any_string). In case the string is not in the database, the function should raise a LookupError. This can be exploited to abort the query if necessary triple patterns will not match at all.

contains_string(a_string): returns: a Boolean. Should return true if the RDF term is already in the database.

add_string(a_string): returns: the new ID. This will add a string to the database. It should use get_new_id(any_string) to obtain the new ID.

next_id(self): this is used only if numeric_ids in [general] is set to true/yes in tygrstore.cfg. It should return the next free database ID which can be used for a new entry.

get_or_add_string(a_string): returns: an ID. This will add a string to the database or return the ID if the string is already in the store. This is provided by the base class but can be overridden for performance reasons.
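As a hedged sketch (not Tygrstore code), a minimal hash-based Stringstore could keep the ID-to-string mapping in a Python dictionary; here MD5 digests play the role of the database IDs, mirroring the hashing convention used by the Kyoto Cabinet setup described in Chapter 3.

import hashlib

class DictStringstore(object):
    """Hypothetical in-memory Stringstore using MD5 hashes as database IDs."""

    def __init__(self):
        self.id_to_n3 = {}

    def get_new_id(self, any_string):
        return hashlib.md5(any_string.encode("utf-8")).digest()

    def s2id(self, a_string):
        new_id = self.get_new_id(a_string)
        if new_id not in self.id_to_n3:
            raise LookupError("term not in store: %r" % a_string)
        return new_id

    def id2s(self, an_id):
        return self.id_to_n3[an_id]

    def contains_string(self, a_string):
        return self.get_new_id(a_string) in self.id_to_n3

    def add_string(self, a_string):
        new_id = self.get_new_id(a_string)
        self.id_to_n3[new_id] = a_string
        return new_id

    def get_or_add_string(self, a_string):
        return self.add_string(a_string)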

2.6 QueryEngine

The QueryEngine orchestrates the process of resolving the provided SPARQL query string. It does so in three stages:

• Parsing the SPARQL Query and preparing its Data-structure

• Conversion between RDF Terms and Database keys and vice versa

• Recursive resolving of the Variables

QueryEngine API methods:

• execute(query) query is a SPARQL query as a string. The method returns a generator which yields hashes with the variables as keys and their solutions as values.

2.6.1 Query Engine Architecture

Currently the engine only supports SELECT queries with one BGP (Basic Graph Pattern). Let us look at the steps from Section 2.6 in more detail. The query is parsed by instantiating a cysparql.Query(query_as_string) with the query in SPARQL notation. In the current Tygrstore version, only the topmost Basic Graph Pattern is selected. Each RDF statement contained in the WHERE clause is wrapped within a Triple object.

These are kept in a separate list. Each of these statements is also enriched with the corresponding database keys and a selectivity estimate. This is done by first using the Stringstore instance's get_ids_from_tuple method. After that, a valid tuple can be sent to the IndexManager's selectivity_for_tuple method. Then an (empty) ResultSet is instantiated with the list of all statements and a list of all variables within these. Now the first recursive step of variable resolving is entered by a call to evaluate(empty_result_set, firstvar). The first variable is given by the ResultSet's get_most_selective_var() method. The evaluate method then merge-joins all statements in which the variable being resolved is contained. For each of the keys yielded by the join operation, the current variable is set as resolved. Thereafter, a new recursion step is taken to solve the next most selective variable. When the recursion returns, the variable is reset as unresolved again. The depth of the recursion is limited by the number of variables present in the BGP. If all variables are solved, the solutions property of the ResultSet is yielded and the recursion depth decreases. Using Python generators, the merge join can actually merge multiple generators in a chain. multi_merge_join takes a list of generators (such as those provided by the IndexManager's ids_for_ttriple method) and merges them with the merge_join_with_jump(left_generator, right_generator) method. As the name suggests, this merge-join implementation provides the possibility to jump within the index by using the send() method of generators. The bigger the difference in size between the merging sets, the higher the chance that this jump operation has a huge impact on performance. Of course the jumping needs to be possible in the backend; if not, it can be emulated as shown in Section 3.2.2. The actual merge join operation uses three methods to build up the generator chain:

The first step is the mergejoin_ids method. It just collects the different generators and passes them on.

def mergejoin_ids(self, triples_with_var, var):
    id_generators = []
    if len(triples_with_var) == 1:
        return self.index_manager.ids_for_ttriple(triples_with_var[0], var)
    for triple in triples_with_var:
        id_generators.insert(0, self.index_manager.ids_for_ttriple(triple, var))
    #self.logger.debug("joining %s generators" % len(id_generators))
    return self.multi_merge_join(id_generators)

For each triple pattern we add a generator (which yields matching results) to a list. This list is sent to multi_merge_join, which returns the generator for that keyspace:

def multi_merge_join(self, generators):
    #generators = list(generators)
    result = generators.pop()
    if self.jump_btree:
        while len(generators) > 0:
            result = self.merge_join_with_jump(result, generators.pop())
    else:
        while len(generators) > 0:
            result = self.merge_join(result, generators.pop())
    return result

Figure 2.2: Tygrstore Query Processing.

multi_merge_join chains the different generators in that it always calls merge_join_with_jump on two generators, one of them being the generator returned from the previous merge_join_with_jump call. This is where the actual merging happens. It uses the standard identity comparator. The only special thing here is that it sends the next possible lowest database ID to the generator. Of course this approach needs the input to be sorted. The whole process is illustrated by example in Figure 2.2.

def merge_join_with_jump(self, left_generator, right_generator):
    left = left_generator.next()
    right = right_generator.next()
    while left_generator and right_generator:
        comparison = cmp(right, left)
        if comparison == 0:
            yield left
            left = left_generator.next()
            right = right_generator.next()
        elif comparison > 0:
            left = left_generator.send(right)
        else:
            right = right_generator.send(left)
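To see how the send()-based jumping interacts with the merge join, here is a small self-contained toy (not Tygrstore code): a sorted list is wrapped in a generator that, when sent a value, skips ahead to the first element not smaller than it, which is what the B+Tree cursors described in Chapter 3 provide.

from bisect import bisect_left

def jumpable(sorted_ids):
    """Generator over sorted IDs; send(x) jumps to the first ID >= x."""
    i = 0
    while i < len(sorted_ids):
        jump_to = yield sorted_ids[i]
        if jump_to is not None:
            i = bisect_left(sorted_ids, jump_to)
        else:
            i += 1

def toy_merge_join_with_jump(left_gen, right_gen):
    """Intersect two sorted, jumpable ID streams (simplified sketch)."""
    left, right = next(left_gen), next(right_gen)
    try:
        while True:
            if left == right:
                yield left
                left, right = next(left_gen), next(right_gen)
            elif left < right:
                left = left_gen.send(right)    # jump the left stream forward
            else:
                right = right_gen.send(left)   # jump the right stream forward
    except StopIteration:
        return

a = jumpable([1, 3, 5, 7, 9, 11])
b = jumpable([2, 3, 4, 9, 10, 11])
print(list(toy_merge_join_with_jump(a, b)))    # [3, 9, 11]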

2.7 Installation and Configuration

2.7.1 Dependencies

The following libraries and applications should be present:

• Raptor ≥ 2.0

• Rasqal ≥ 0.9.24

• git (for cython, not needed for pure python mode)

• python = 2.6

• Kyoto Cabinet, Tokyo Cabinet or MongoDB

The following Python packages should be present:

• cysparql (provided)

• cython (a version which supports generators, see below)

• kyotocabinet4

• tc5

• pymongo6

Currently the released Cython version 0.14 does not support generators, but a branch exists on GitHub7. It can be installed with the following commands:

git clone https://github.com/vitek/cython.git
cd cython
sudo python setup.py install

After that, the libraries from the libs directory should be installed:

cd Tygrstore/libs/cysparql
sudo python setup.py install

If all prerequisites are met, Tygrstore can be used either in pure Python mode or as a compiled Cython module. To switch between the modes, the bash scripts tocython.sh and topython.sh can be used; they just rename the Tygrstore files from .pyx to .py and vice versa. To compile the .pyx files, the standard setup.py script or the build.sh file can be used. Assuming a database and the configuration files are present, Tygrstore can be used within Python in the following way:

import ConfigParser
import logging
from kyoto_cabinet_stringstore import *
import index_manager as im
import query_engine

LOG_FILENAME = 'logs/test.log'
logging.basicConfig(filename=LOG_FILENAME, level=logging.DEBUG)
config = ConfigParser.RawConfigParser()
config.read("cfgs/benchmark1-kc-kc.cfg")
logging.debug("starting stringstore")
stringstore = KyotoCabinetStringstore(config)
logging.debug("starting index_manager")
iman = im.IndexManager(config)
logging.debug("starting QueryEngine")
qe = query_engine.QueryEngine(stringstore, iman, config)
#query holds any SPARQL SELECT query string, e.g.:
query = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"
for result in qe.execute(query):
    print str(result)

4 http://fallabs.com/kyotocabinet/pythonlegacypkg/
5 http://pypi.python.org/pypi/tc
6 http://api.mongodb.org/python/1.9/index.html
7 https://github.com/vitek/cython

2.8 Importer

To import RDF statements into Tygrstore in an efficient way, it is best to save them into the backend store directly and without any overhead. A set of scripts resides in the importer directory of the project to accomplish this. They follow the Unix philosophy: each script does only one specific task, but they can be connected with pipes. To parse ntriples files, a regular expression that just separates the components of each statement is feasible. To parse RDF/XML and other formats, the Python module RDF is used; this module includes bindings to Raptor. Since importing into a B+Tree can be achieved in constant time if the input is sorted, sorted input should be preferred. The general approach is a three-step operation:

1. Parse the ntriples file with nt2hxid.py. This generates the output for one hxid index file.
2. Sort the hxid files with gnusort.
3. Import the hxid file into the database with hxid2kc or hxid2mongo.

nt2hxid.py reads an ntriples file from stdin and outputs one hxid file. The hxid file contains the concatenated hex values of the hashed RDF terms of each statement. The script needs to know which hashing function should be used (md5/sha1), which order the output should have (spo, sop, osp, ...) and the name of the output file. You can also supply the total number of triples you want to insert; this will yield status updates while the process goes on. While this data format uses double the space of its binary representation, it can leverage the sorting speed of gnusort. Note: the regexp in the ntriples parser is incomplete; it does not recognize blank nodes correctly.
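For illustration, a hedged sketch of the hxid idea (not the actual nt2hxid.py, whose option handling and parsing are more involved): each term of a parsed statement is hashed and the hex digests are concatenated in the requested order.

import hashlib

def hxid_line(subject, predicate, obj, order="spo"):
    """Concatenate the md5 hex digests of the terms in the given order."""
    terms = {"s": subject, "p": predicate, "o": obj}
    return "".join(hashlib.md5(terms[n].encode("utf-8")).hexdigest() for n in order)

# hypothetical n3-style terms
line = hxid_line("<http://example.org/alice>",
                 "<http://example.org/knows>",
                 "<http://example.org/bob>",
                 order="spo")
print(line)   # 96 hex characters: three concatenated 32-character md5 digests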

hxid2kc.py reads a sorted hxid file from stdin and writes it into Kyoto Cabinet B+Trees. It generates all levels of the index and updates the selectivity estimates as it goes along. The needed parameters are: the bnum for the B+Tree, the name of the index (e.g. SPO will result in the files SPO0.kct, SPO1.kct, SPO2.kct) and the total number of keys (line count of the input file) for status updates.

hxid2mongo.py does the same as hxid2kc but for mongodb. It needs the hostname, database and collection as parameters instead of the bnum and name.

nt2sskc.py imports an ntriples file into the stringstore hash-table. Parameters are: the bnum, the name of the output file and the total number of keys to expect. This script implements a caching strategy: the maximal cache-size parameter denotes how many keys should be read before the keys are written to the B+Tree and the cache is cleared.

There is another prototype of an importer script for Kyoto Cabinet. The script nt2kc.py writes the input stream to a number of small trees. These can then be joined into one final big file by the merge operation of the Kyoto Cabinet API. Overall this is slightly faster than the above approach, but it does not produce reusable hxid files. I also tried to implement a fan-out approach with the Python multiprocessing tools. The use of queues (i.e. the serialization) slowed down the process more than what was gained from multi-core computing power. Instead, the above scripts could be slightly adapted to read/write to stdout so that the processes can work via standard Unix pipes and tee to named pipes. This will also work to read RDF/XML with librdf's rapper.

3 Example Implementations

3.1 Kyoto Cabinet Backend

Kyoto Cabinet is the successor to Tokyo Cabinet. Both of these DBM libraries offer different types of data-structures; most notably, both support hash-tables and B+Trees. For the main Tygrstore implementation Kyoto Cabinet was chosen, since it offers a cleaner API and is actively developed. It also features more data-structures, such as in-memory versions of the disk-based B+Tree/hash-table. The configuration and the library allow easy switching between the different data-structures, mainly by supplying an alternative filename extension.

3.1.1 Data-model

To save the RDF statements on disk we use the File Tree Database, Kyoto Cabinet's on-disk B+Tree implementation. Each index gets as many trees as there are components in the statement; thus normal triples are stored in 3 trees, to which we will refer as levels. Figure 3.1 shows which information the different levels maintain in an 'spo' type index. The database IDs are saved as fixed-length binary strings within the B+Tree. Each level's keys consist of one more concatenated key component than its parent level. This leverages the possibility to jump to prefixes of keys. The lower levels also act as storage for the selectivity estimation: if we increment the occurrence count of each key combination while filling the tree, we have the exact numbers for the query optimizer. The deepest level does not save any information in its value fields. That value could be used to store information associated with a triple; to preserve disk space, only one index can be chosen for this, i.e. 'spoc'.

Level/File    KEY          VALUE
spo.0.kct     s            number of triples with this s
spo.1.kct     s+p          number of triples with this s,p
spo.2.kct     s+p+o        empty

Figure 3.1: Index structure on the different B+Trees.

3.1.2 KVIndexKC Implementation

The Kyoto Cabinet index backend resides in the KVIndexKC class. There is also a similar implementation for Tokyo Cabinet (KVIndexTC) which mainly differs in the names of the library calls. All parameters can be set within the [KC] (respectively [TC] for Tokyo Cabinet) section of tygrstore.cfg. The option indexconfig is the Kyoto Cabinet configuration string which is appended after the filename1. The bnum refers to the number of buckets and should roughly be 2-4x the actual number of statements that will be inserted into the tree.

1 http://fallabs.com/kyotocabinet/pythonlegacydoc/kyotocabinet.DB-class.#open

The actual generator for the keys requested by a triple pattern via ids_for_triple is the method generator_for_searchstring_with_jump. The parameter searchstring is the actual prefix in the B+Tree, and loffset and roffset define where the searched database ID lies within the tree's keys (in bytes): it spans from the end of the searchstring up to one key length (the length of a hash) to the right.

def generator_for_searchstring_with_jump(self, searchstring, loffset=0, roffset=16):
    #select the deepest level
    cur = self.levels[-1].cursor()
    #jump to the lowest possible key
    cur.jump(searchstring)
    while 1:
        try:
            next = cur.next()
            #check if we are still within the correct keyspace
            if next[0:loffset] == (searchstring):
                #yield the key and receive the next possible lowest key (jumpto)
                jumpto = yield(next[loffset:roffset])
                if jumpto:
                    #advance the cursor
                    cur.jump("".join((searchstring, jumpto)))
            else:
                #keyspace is exhausted
                raise StopIteration
        except KeyError:
            self.logger.error("key error for: %s" % str(searchstring))
            cur = self.levels[-1].cursor()
            cur.jump(next)

The selectivity estimation is just a lookup in one of the levels (besides the topmost):

'''get the selectivity count'''
def selectivity_for_triple(self, triple):
    if triple == (None,) * len(triple):
        return len(self)
    else:
        triple_without_none = [x for x in triple if x != None]
        key_prefix = "".join(triple_without_none)
        long_big_endian = self.levels[len(triple_without_none)-1][key_prefix]
        return struct.unpack_from(">q", long_big_endian)[0]

Note the clumsy retrieval of the integer. This is actually a limitation of the Kyoto Cabinet library (the same goes for Tokyo Cabinet, where the function is called addint). The library intends the increment method to be used to retrieve integers. This has the severe side effect that it requires the database to be opened with the DB.OWRITER flag. Whenever the application crashes and the file is then opened in DB.OWRITER mode, the B+Tree is repaired automatically by the Kyoto Cabinet library. Unfortunately this copies the whole file while blocking all access to the B+Tree. The above procedure instead converts the binary string (which contains the 8-byte signed integer in big endian) to a Python long.
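For reference, a hedged two-line illustration of that conversion in plain Python (not the driver code); the counters are stored as 8-byte big-endian signed integers:

import struct

stored = struct.pack(">q", 138318414)            # how a counter value would be laid out on disk
print(struct.unpack_from(">q", stored)[0])       # 138318414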

3.2 MongoDB Backend

3.2.1 About MongoDB

MongoDB2 is a relatively new, schema-free, document-oriented database which has gained a strong community in recent months. It is Open Source3 but backed by a VC-financed startup named 10gen. It is designed as a scalable, distributed database with extensive query capabilities. By relaxing transactional semantics it gains performance and horizontal scalability in comparison to traditional RDBMSs. Queries are written in JavaScript and evaluated in the database with the built-in SpiderMonkey JavaScript engine. Besides document matching and altering, MongoDB also supports aggregation functions such as MapReduce or grouping.

MongoDB works over the network and provides libraries for the most common programming languages.

MongoDB is already widely used in large-scale production systems4. Among many others, the New York Times5, CERN [Kuznetsov et al., 2010] and Springer6 use it. The most impressive documented setup is currently at Wordnik: they host over 12 billion records (3TB of data) within their MongoDB cluster7 8.

Two important features make it especially suitable as a Tygrstore backend: cursors for query results and compound secondary indexes. The cursor allows us to seamlessly stream the result set and also pushes operations such as distinctness and sorting directly down to the store. The indices are internally implemented as B-Trees9. Thus MongoDB gives us a scalable B-Tree and even does the maintenance of the indices for us.

A mongod server can host multiple databases. Each database can contain any number of so-called collections (think tables in the RDBMS world), within which you can store documents. These documents are stored in BSON10 format and usually processed as JSON within the client library. It supports the most common datatypes but also allows custom datatypes, which are stored as binary strings. For our purpose the types ObjectID and Binary are relevant. The former is a 12-byte GUID (4-byte timestamp, 3-byte machine id, 2-byte process id and 3-byte counter) and the latter can save binary data of arbitrary length.

2 https://mongodb.org
3 GNU AFFERO GENERAL PUBLIC LICENSE Version 3.0
4 http://www.mongodb.org/display/DOCS/Production+Deployments
5 http://hackshackers.com/2010/07/28/a-behind-the-scenes-look-at-the-new-york-times-moment-in-time-project/
6 http://realtime.springer.com/
7 http://blog.wordnik.com/b-is-for-billion
8 http://blog.wordnik.com/12-months-with-mongodb
9 http://www.mongodb.org/display/DOCS/Indexes
10 http://www.bsonspec.org/

For Binary there are also subtypes; we use the MD5_SUBTYPE for md5 keys, which uses exactly 16 bytes within the db. The Mongo wire protocol11 directly contains the data in the same format as it is written on disk. It is especially designed to minimize overhead. The on-disk file format consists of a .ns namespace file and a sequence of files prefixed with the database name. Their size is doubled with each new file created (starting with 256MiB). Each new document is stored after the next, but a padding factor leaves empty space, so that documents can grow without the need to move their physical location.

Limitations

Currently it is not possible to jump with a cursor. This means that we either have to throw away intermediate results or create a new cursor to model jumping. The latter can be done by adding a conditional operator for the keyspace we are searching, effectively setting a minimal value for the searched keyspace. While this works in practice, it brings a certain amount of overhead, which slows down the merge join operation considerably.
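A hedged pymongo sketch of this cursor-based jump emulation (collection and field names are placeholders; the real adapter code is shown in Section 3.2.2):

from pymongo import Connection   # pymongo 1.9-era API; newer versions use MongoClient

collection = Connection("localhost")["tygrstore"]["spo"]

def ids_with_jump(bound, searched, jump_to=None):
    """Emulate a forward jump by re-issuing the query with a $gte condition."""
    query = dict(bound)                      # e.g. {'s': <key>, 'p': <key>}
    if jump_to is not None:
        query[searched] = {"$gte": jump_to}  # restrict the searched keyspace
    return collection.find(query, {searched: 1}).sort(searched)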

3.2.2 KVIndexMongo Backend

The MongoDB index backend is implemented in the class KVIndexMongo. The main difference to the Tokyo/Kyoto store is that the indices are handled by the database and the selectivity estimation is also retrieved from there. While this simplifies maintenance, it needed some tweaking of the initial architecture. But it also proved the architecture decisions to be right, as the API changes were minimal. They could even have been avoided entirely, albeit with some performance hits. The details are discussed in the following description of the triple store data model used by the KVIndexMongo implementation.

Datamodel

Each triple is stored as one document, with a node id (_id) and one key for each natural (e.g. s, p, o or s, p, o, c). As values for the keys we define a class name in tygrstore.cfg. Currently supported classes are bson.ObjectID, bson.binary.Binary and long. Note however that the ObjectID datatype only makes sense if you use the provided MongoDB Stringstore [see 3.2.3]. You can also define custom classes which serialize to binary. To plug into our MongoDB adapter, a custom type would need an instance property called binary, which provides the key as a string, as well as the standard comparison method __cmp__(self, other).

11 http://www.mongodb.org/display/DOCS/Mongo+Wire+Protocol

Using Binary as the key type, an example document would look like the following in JSON notation:

{
  '_id': ObjectId('133700000000000000038317'),
  's': Binary('13371b5a421405f522000815', 2),
  'p': Binary('13371b5a421405f522000815', 2),
  'o': Binary('13371b5a421405f522000815', 2)
}
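A hedged sketch of such a custom key class (purely illustrative, not shipped with Tygrstore):

class Md5Key(object):
    """Hypothetical custom key type wrapping a 16-byte md5 digest."""

    def __init__(self, digest):
        self.digest = digest

    @property
    def binary(self):
        #the adapter expects the raw key bytes under this property name
        return self.digest

    def __cmp__(self, other):
        #Python 2 style comparison, as expected by the adapter
        return cmp(self.binary, other.binary)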

Each MongoDB document needs a unique _id key, which may be of any datatype. It needs to be unique within the collection and it is also used for sharding by default. Another approach would have been to store the whole triple as a custom datatype and use MongoDB as a sharded B-Tree only. This would have resulted in the same approach used by the Kyoto/Tokyo implementation. It has not been followed because I wanted to push the maintenance of the compound indexes and the selectivity estimation down to the database. As we now have no need to define the indices ourselves, there would be no need for the IndexManager to hold all 6 (for a fully indexed triple store) KVIndex instances. Reordering the triples before entering a KVIndex instance is made unnecessary by the searched_natural parameter.

Indexing

To achieve fast lookups we want 6-way indexing, as used with the Kyoto Cabinet backend. This needs to be done manually by indexing the collection with compound keys. If our triples are stored in a collection named 'spo' in the 'tygrstore' database, the following commands start the creation of the indexes in a mongo shell:

use tygrstore
db.spo.ensureIndex({s:1, p:1, o:1});
db.spo.ensureIndex({s:1, o:1, p:1});
db.spo.ensureIndex({o:1, p:1, s:1});
db.spo.ensureIndex({o:1, s:1, p:1});
db.spo.ensureIndex({p:1, o:1, s:1});
db.spo.ensureIndex({p:1, s:1, o:1});

Note that these operations are blocking unless the optional parameter {background: true} is supplied. The creation of an index for 130 million triples takes about 20 minutes.

API Changes

Maintaining the indexes directly within the database required some adaptation of the way the Kyoto/Tokyo adapters worked. Thus the following KVIndex API changes were introduced:

• The property update_only_one (configured via tygrstore.cfg) can be set to true. If this is the case, a call to add_to_all_indexes(triple) will insert the triple only into the first index. Note also that the IndexManager still maintains 6 different KVIndexMongoDB instances within unique_indexes. This can be exploited to provide parallelism across different mongo routing servers. A configuration option to define hosts/ports per server would need to be introduced for that.

• The superclass KVIndex is initialized with a new optional parameter reorder=False which disables the reordering decorator.

• The function ids_for_triple got an optional parameter named searched_natural. This is a single character (e.g. 's', 'p' or 'o') which describes the key being searched for. This was done because that information was easily accessible in the IndexManager class but would have been cumbersome to reconstruct at the KVIndex level. It would have been possible through the use of the internal_ordering property and enabled reordering.

Implementation

The actual implementation of KVIndexMongo was then very straightforward. Still, the internals shall briefly be discussed to hint at the (current) limitations. To save or search a MongoDB document, the find function needs to be provided with a JSON object including the relevant properties. The IndexManager (with disabled reordering) delivers a tuple of the form ('keyS', 'keyP', 'keyO'), where each of these keys can be None. To generate the needed JSON structure the function to_mongo_hash is provided:

def to_mongo_hash(self, triple):
    mongo_hash = {}
    for i in range(len(self.input_ordering)):
        if triple[i]:
            mongo_hash[self.input_ordering[i]] = self.id_class(triple[i])
    return mongo_hash

self.id_class here contains a reference to the class of which type the keys are (e.g. Binary, ObjectId). Thus each key of the input tuple is wrapped in an object as needed for MongoDB. The second noteworthy function is generator_for_searchstring_with_jump. It is the generator which is returned by ids_for_triple.

def generator_for_searchstring_with_jump(self, mongo_hash, searched_natural, num_records=-1):
    cursor = self.collection.find(mongo_hash, {searched_natural:1})
    #we can inform mongodb with a hint about the compound index it should choose
    #cursor.hint([(self.internal_ordering[0],1), (self.internal_ordering[1],1), (self.internal_ordering[2],1)])
    cursor.sort(searched_natural)
    nextid = None
    result = None
    while 1:
        mongo_doc = cursor.next()
        result = mongo_doc[searched_natural]
        #primitive jump emulation!
        if nextid and nextid > result.binary:
            mongo_hash.update({searched_natural: {"$gte": Binary(nextid, bson.binary.MD5_SUBTYPE)}})
            #jump emulation by creating a new cursor. slow!
            cursor = self.collection.find(mongo_hash, {searched_natural:1})
            #cursor.hint([(self.internal_ordering[0],1), (self.internal_ordering[1],1), (self.internal_ordering[2],1)])
            cursor.sort(searched_natural)
        if type(result) == long:
            nextid = yield(result)
        else:
            nextid = yield(result.binary)

The sort operation on the cursor allows MongoDB to use the correct index; it is only costly if such an index is not present. Since the cursor does not support forward jumping, this is emulated, which is a severe performance limitation. The emulation starts a new query with the $gte (greater or equal) condition set for the key12.

3.2.3 MongoDB Stringstore

The class MongoDBStringstore implements a Stringstore model to save the ID-to-string conversion and vice versa. Each RDF term is saved as a document like the following:

{
  '_id': ObjectId('13371b5a421405f522038317'),
  'n3': ''
}

or with custom ids

{
  '_id': Binary('13371b5a421405f522000815', 2),
  'n3': ''
}

12 a ticket in the mongodb bugtracker has been opened

MongoDB Scaling

MongoDB can be set up as a cluster. Each such cluster consists of 3 types of nodes: shard servers (mongod), config servers (mongod) and routing processes (mongos). Figure 3.2 visualizes such a setup. A shard can in fact consist of multiple replicated servers to guarantee availability. Each shard saves chunks of a sharded collection; a chunk is defined as all elements within a range of the sharding key. This information and other central metadata is kept on the config servers. Routing servers act as a gateway that splits requests across the shards and merges their results. From the client's point of view there is no distinction between talking to a mongod server or to a routing server. All this is absolutely hidden from the client; it sees no difference whatsoever between a sharded setup and a single mongod instance.

Figure 3.2: MongoDB sharding concept.

4 Evaluation

In this chapter we will explore the performance characteristics of Tygrstore. First, Tygrstore is benchmarked against ttlite (a TokyoTyGR version with Cython bindings), which is written almost entirely in C. In Section 4.1, Tygrstore with both backends is profiled to get more insight into where time is lost.

4.0.4 Dataset

As the primary benchmark the Lehigh University Benchmark (LUBM) [Guo et al., 2005] has been chosen. Two datasets have been prepared and converted into the ntriples1 format. Generating data for 1000 universities resulted in 138'318'414 statements, and 100 universities generated 1'273'248 statements. The current performance limitations of the importer scripts would have made larger datasets a time-consuming task to import.

4.0.5 Hardware

The hardware platform was a standard PC with an Intel Core2Quad [email protected] CPU and 8GB of main memory. The operating system was Arch Linux2 running a standard 2.6.37-ARCH kernel. If not noted otherwise, the data was stored on a RAID 0 disk array of two Corsair CSSD-F120GB2 120GB SSD drives.

4.0.6 Software

Kyoto Cabinet 1.2.34; MongoDB v1.9.0-pre from git a6105a2855132288324. Python modules: kyotocabinet-python-legacy-1.13, tc 0.7.2, pymongo 1.9+.

1 http://www.w3.org/2001/sw/RDFCore/ntriples/
2 http://www.archlinux.org/

4.0.7 Benchmarking

The first benchmark setup runs a list of LUBM queries, each of them six times, for a maximum of 1, 10, 100, 1000, 10'000 and 50'000 results. The detailed results can be found in Appendix A.2. Tygrstore is compared against ttlite (Cython bindings to TokyoTyGR). This peer was chosen because Tygrstore's query engine was constructed in a similar manner to TokyoTyGR's, so we can relate what the added flexibility (more dynamic code) and modularization (pluggable backend) cost us.

Each query has been run twice, and before the first run the disk cache was cleared with echo 3 > /proc/sys/vm/drop_caches.

Figure 4.1: Warm and Cold runs in milliseconds

We see that ttlite is still much faster for small queries. As the result set gets bigger, Tygrstore catches up and is sometimes even faster, as in Queries 3 and 4. This is probably due to the fact that the data-structure of the parsed query is not optimal, while the Python generators and the comparison of IDs are quite efficient. As that data-structure is only heavily used in the beginning, it has less effect the longer the query takes.

Some optimizations have already been added by reformulating certain code, but the biggest impact was achieved by caching the index_for_ttriple method of the IndexManager class. For that, the @memoized decorator has been added to the method. This resulted in performance gains of up to 75%; the minimum gain measured was 16%. The details can be read from Table 4.1. Still, some optimizations are not done yet or should only be done for specific reasons when prototyping. Chapter 6 explains some of them.
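The @memoized decorator lives in the helpers module; a typical implementation along these lines (a sketch, possibly differing in detail from the actual helpers.py version) caches results keyed by the call arguments:

import functools

class memoized(object):
    """Cache a method's return value for each set of (hashable) arguments."""

    def __init__(self, func):
        self.func = func
        self.cache = {}

    def __call__(self, *args):
        if args not in self.cache:
            self.cache[args] = self.func(*args)
        return self.cache[args]

    def __get__(self, obj, objtype):
        #support instance methods by binding the wrapped function to the instance
        return functools.partial(self.__call__, obj)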

Tygrstore vs. ttlite

Query   Number of Results   Tygrstore   Tygrstore with @memoized   gain   ttlite
lq1     10                  1.28        0.91                       41%    0.40
lq1     100                 6.47        3.9                        66%    1.25
lq1     1000                59.11       33.78                      75%    11.30
lq2     10                  4.90        3.19                       54%    1.77002
lq2     100                 42.41       25.14                      69%    14.89
lq2     1000                399.73      245.30                     63%    158.55
lq3     10                  3.76        2.72                       38%    2.70
lq3     100                 30.62       18.41                      66%    13.32
lq3     1000                282.32      175.54                     61%    88.12
lq4     10                  199.79      172.41                     16%    4.71
lq4     100                 238.45      198.02                     20%    25.75
lq4     1000                620.10      466.78                     33%    146.68

Table 4.1: Queries LQ1 - LQ4 on the LUBM Dataset with 100 Universities. Query Time in milliseconds

Figure 4.2 shows the results of Query 4 on the 138 million triple dataset. We see that the result rate is consistently around 2000 to 2500 results per second. Only for fewer than 1000 results is this not the case, probably due to the setup costs.

Figure 4.2: Query Time and requests/s on the 138 Mio. Dataset (Warm/Cold)

4.1 Profiling

The mixed numbers from the benchmarks suggest that some work is needed to get on par with ttlite. With the following profiling, the driving factors behind the lack of performance should be explored. But let us start the story a bit earlier, though:

The first attempt to profile Tygrstore produced really strange results. Running query LQ5 on the 100-universities benchmark through cProfile resulted in the profile shown in Table 4.2.

12285534 function calls (9551702 primitive calls) in 17.104 CPU seconds

ncalls           tottime   %        cumtime   p/c     filename:lineno(function)
3248172          9.314     54.57%   11.188    0.000   indexkc.py:109(generator_for_searchstring_with_jump)
1528290/28122    3.383     19.77%   15.780    0.001   query_engine.py:171(merge_join_with_jump)
608494           1.651     9.65%    1.651     0.000   method 'jump' of 'kyotocabinet.Cursor' objects
3191901          0.755     4.41%    0.755     0.000   cmp
1677670/458118   0.474     2.77%    14.958    0.000   method 'send' of 'generator' objects
28137            0.184     1.07%    0.365     0.000   index_manager.py:80(index_for_ttriple)
608500           0.150     0.87%    0.150     0.000   method 'join' of 'str' objects
14120/8          0.136     0.79%    17.101    2.138   query_engine.py:74(evaluate)
28143            0.097     0.56%    0.243     0.000   index.py:38(reorder)

Table 4.2: Profile of Query LQ5

selectivity   subject       predicate              object
411501        ?student      ub:advisor             ?advisor
72302         ?advisor      ub:worksFor            ?department
32043         ?department   ub:subOrganizationOf   ?university
2148830       ?student      ub:name                ?name
1120834       ?student      ub:telephone           ?tel
7             ?student      ub:takesCourse         GraduateCourse33

Table 4.3: Selectivities of Query LQ5

15 seconds and 608'494 jumps in the B+Tree to generate just 7 results: that was bad. After some time in the debugger it was clear that the path of the most selective variable was not followed. Table 4.3 shows the selectivities and the triples. The QueryEngine chose to resolve university as the second variable, because the selectivity estimation was not updated. After changing that, the order in which the variables were resolved was correct, but query time only went down to about 14 seconds. The first result set was delivered after some milliseconds, and then there were 10 seconds without results.

The reason was subtle: the order of the merge join generator chain does matter! After the first result set was generated, a new generator was appended to the chain, but there were still unexhausted generators before it. The chaining was therefore changed so that new generators are inserted at the front of the list. Table 4.4 shows the same cProfile benchmark with the now corrected QueryEngine; this looks like a reasonable result. A minimal sketch of the chaining fix follows Table 4.4.

6789 function calls (6703 primitive calls) in 76 CPU milliseconds

ncalls  tottime [ms]  cumtime [ms]  filename:lineno(function)
95      36            36            method 'increment' of 'kyotocabinet.DB' objects
88      28            28            method 'jump' of 'kyotocabinet.Cursor' objects
8       1             76            query_engine.py:26(execute)
179     1             29            indexkc.py:110(generator_for_searchstring_with_jump)
78/8    1             72            query_engine.py:75(evaluate)
95      1             1             index_manager.py:67(index_for_tuple)
148     1             37            index.py:38(reorder)

Table 4.4: Profile of Query LQ5 (with the corrected QueryEngine)
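The code of merge_join_with_jump is not reproduced here; the following is a minimal sketch of the chaining change only, assuming the chain is kept as a plain Python list of generators. The function and variable names are illustrative, not the actual Tygrstore identifiers.

def extend_chain(generator_chain, new_generator):
    """Add a freshly created generator to the merge-join chain.

    Appending the new generator at the end (the original behaviour) leaves
    unexhausted generators in front of it, which delays further results.
    Inserting it at the front fixes the ordering problem described above.
    """
    # old, problematic behaviour:
    # generator_chain.append(new_generator)

    # corrected behaviour: new generators go to the front of the list
    generator_chain.insert(0, new_generator)
    return generator_chain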

So let us analyze a slower query to get a more detailed view. Query LQ4 was chosen, since it is the slowest. The detailed profile is shown in Table 4.5. It took 1044 milliseconds in pure Python cProfile mode; compiled with Cython, the same code took 540 ms, i.e. the dynamically interpreted code is nearly twice as slow as the compiled version.

Figure 4.3: Distribution of calls in milliseconds

About one third (340 ms) of the overall time is lost in the Kyoto Cabinet driver. The same percentage, applied to the compiled version, is less than it took ttlite for the very same

564954 function calls (560292 primitive calls) in 1044 milliseconds

ncalls     tottime [ms]  cumtime [ms]  filename:lineno(function)
52843      245           499           indexkc.py:110(generator_for_searchstring_with_jump)
46521      206           206           method 'jump' of 'kyotocabinet.Cursor' objects
3576       103           103           method 'increment' of 'kyotocabinet.DB' objects
5663/1001  53            1005          query_engine.py:79(evaluate)
3691       48            572           query_engine.py:187(merge_join_with_jump)
96402      36            36            cmp
6901       26            180           index.py:38(reorder)
9048       23            208           helpers.py:15(__call__)
43196      19            447           method 'send' of 'generator' objects
4000       18            18            method 'get' of 'kyotocabinet.DB' objects
3576       17            32            index_manager.py:67(index_for_tuple)
50097      16            16            method 'join' of 'str' objects
3576       14            129           indexkc.py:92(selectivity_for_triple)
10632      14            21            helpers.py:129(unresolve)
3325       13            56            index_manager.py:101(ids_for_ttriple)
3325       12            12            method 'cursor' of 'kyotocabinet.DB' objects
6901       12            18            filter
4691       10            16            helpers.py:123(resolve)

Table 4.5: Profile of Query LQ4, 1000 Results, Pure Python

Figure 4.4: Time spent in and below a method in seconds

query. This suggests that the current CPython Kyoto Cabinet driver is a limiting factor. Here we also see that the cost of the reordering decorator is less than 3%. With optimized bindings, the generator_for_searchstring_with_jump function would also become faster. All non-Kyoto-Cabinet adapter method calls add up to 210 milliseconds. Subtracting the roughly 50% performance hit caused by profiling and pure Python mode, we are around 15% below the time of ttlite. Although this remains to be proven, it roughly shows the dimension of the loss caused by the kyotocabinet Python module.

MongoDB Backend Profile

The MongoDB backend adapter was much slower than the Kyoto Cabinet one. The profile in Table 4.6 shows a pretty clear picture; note that a dashed line represents omitted lines. Significant time is spent on serialization of the JSON hashes. In theory this is completely unnecessary, as an optimal backend for Tygrstore could work with binary data. Figure 4.5 gives a rough overview of Table 4.6. To achieve decent query times a new MongoDB adapter would be needed. I did not run many more benchmarks with the MongoDB adapter since the results were too skewed by the serialization. Besides, the aggregates are not cached, so looking up selectivities also took very long. A setup with 4 shards, one config server and one routing server was set up and queries ran as expected.

28058548 function calls (28053889 primitive calls) in 46.382 seconds

ncalls     tottime  cumtime  percall  filename:lineno(function)
60068      13.282   13.282   0.000    method 'recv' of '_socket.socket' objects
30029      8.035    22.879   0.001    bson._cbson.decode_all
2541899    4.140    6.614    0.000    binary.py:83(__new__)
2503005    3.502    6.342    0.000    objectid.py:140(__validate)
10570444   3.345    3.345    0.000    isinstance
2503005    2.018    8.360    0.000    objectid.py:54(__init__)
2541899    1.656    1.656    0.000    built-in method __new__ of type object at 0x7fb55131d9a0
75755      1.382    41.850   0.001    indexmongo.py:118(generator_for_searchstring_with_jump)
30029      0.326    37.978   0.001    cursor.py:511(__send_message)
77667      0.318    41.008   0.001    cursor.py:597(next)
31653      0.198    40.397   0.001    cursor.py:549(_refresh)
30029      0.171    13.681   0.000    connection.py:662(__receive_message_on_socket)
5660/1001  0.170    44.108   0.044    query_engine.py:79(evaluate)
60058      0.158    13.470   0.000    connection.py:648(__receive_data_on_socket)
30029      0.117    14.168   0.000    connection.py:676(__send_and_receive)
3688       0.107    41.578   0.011    query_engine.py:187(merge_join_with_jump)
30029      0.084    14.424   0.000    connection.py:685(_send_message_with_response)
66120      0.042    37.840   0.001    method 'send' of 'generator' objects
1001       0.004    46.380   0.046    query_engine.py:28(execute)
1          0.002    46.382   46.382   benchmark2mongo.py:40(testrun)

Table 4.6: Profile of Query LQ4, 1000 Results, MongoDB Backend, Pure Python

Figure 4.5: MongoDB Adapter, times spent in seconds

5 Limitations

5.1 Architectural Limitations

The limits of the proposed architecture lie within the layering of the different modules. While this gives us the flexibility needed for pluggable backend stores, it also abstracts them away. It further forces some decisions to be made by the IndexManager which could be pushed down to the store. We saw that with the MongoDB implementation, where we actually have to fake the existence of multiple indexes. While in this case it has no side effect, it is strictly speaking not necessary. Choosing generators to stream the results limits performance, as it will not be possible to rewrite or optimize this streaming. But Python/Cython generators have very efficient implementations, so this should be the last concern, given the beauty and ease of use they lend to the architecture.

5.2 Functional Limitations

Most of the missing features will be discussed in Section 6.1, where possible solutions will also be presented. In general, the QueryEngine has the most severe limitations of all components. Currently only a single Basic Graph Pattern can be resolved. Although the query is correctly parsed by cysparql, it is not yet transformed into a clear, concise data structure. Since such a data structure (and a more universal algebra to solve it) will cost performance, it should be pluggable, and the very basic approach currently in existence should remain possible. Any more advanced SPARQL operations such as filtering or aggregates will have some impact on the architecture and the internal APIs. Another missing feature is awareness of datatypes/RDF types. Tygrstore currently stores the statements as strings in their N3 notation only.

Loading databases is currently slow and the regex-based parser cannot detect RDF types.

5.3 Limits caused by Evolution

The biggest limit in the current implementation is probably the fact that the author was new to Python. Since Python coders have their very specific ways of implementing patterns, some of them may have been violated. The domain of triple stores was also new, and the exact interaction of the different modules was unclear at the beginning. Despite several near-complete rewrites, some artifacts of previous iterations persist in the codebase.

6 Future Work

As we have seen in the previous chapters, there is still a lot of room for improvement. The current Tygrstore implementation can be seen as an MVP (Minimum Viable Product), or in other words a proof that the architecture can work without unsolvable limitations.

In this chapter some ideas for future enhancements within and around Tygrstore are presented. The next section shows where Tygrstore could be optimized internally; it can be regarded as an unordered roadmap for further development. Section 6.2 widens the view and shows some broader solutions and ideas of which Tygrstore could be a part.

6.1 Internal Enhancements

There are mainly three areas where Tygrstore could be improved internally: perfor- mance optimizations, feature implementations and backend adapters.

Performance Optimizations

A lot of performance is lost in the current backend implementations. The CPython Kyoto Cabinet driver has some overhead, so rewriting it with Cython bindings would probably speed up the jump operations considerably. This would also open the way to a custom key type with a custom comparison function. Database IDs derived from larger hash functions could be factorized: if a hash function with uniform distribution is used, the key can be decomposed into integers which fit into registers. A hash-table implementation with a pluggable hashing function should then also be used for the Stringstore, so hashes do not get computed twice.
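As an illustration of this decomposition idea, the following sketch splits a 128-bit hash-based ID into two 64-bit integers. The choice of MD5 and the big-endian key layout are assumptions for the example, not the hashing actually used by Tygrstore.

import hashlib
import struct

def id_for_string(n3_string):
    """Derive a fixed-size binary ID for an RDF term (assumption: MD5, 16 bytes)."""
    return hashlib.md5(n3_string.encode("utf-8")).digest()

def factorize_id(raw_id):
    """Decompose a 128-bit ID into two unsigned 64-bit integers.

    With a uniformly distributed hash, the two halves can be compared as
    native machine words instead of as byte strings.
    """
    high, low = struct.unpack(">QQ", raw_id)
    return high, low

# Example usage:
# high, low = factorize_id(id_for_string("<http://example.org/resource>"))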

On the other side, the SPARQL processing and the variable updating mechanism are not optimal. One option is to move further away from cysparql by enhancing the Python classes for query processing. This is slower, but would allow faster prototyping of query optimization strategies. Alternatively, the query processing could be pushed down to cysparql, and a more defined interface would emerge into which other SPARQL parsing libraries could be plugged.

Then there are different layers where concurrency could be added. Different graph patterns, and especially unions, could be resolved in parallel. Hash joins or variants thereof could be used when different variable paths are explored in parallel.
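A minimal sketch of what parallel resolution of union branches could look like follows, using a thread pool. The resolve_pattern callable and the shape of its results are assumptions; because of the GIL this mainly pays off when the backend releases it, e.g. during disk or network I/O (on Python 2 the futures backport would be required).

from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def resolve_union_in_parallel(resolve_pattern, union_branches, max_workers=4):
    """Evaluate every branch of a UNION concurrently and chain the results.

    resolve_pattern is assumed to take one basic graph pattern and return an
    iterable of solution mappings; each branch is materialized inside its
    worker thread.
    """
    def evaluate_branch(pattern):
        return list(resolve_pattern(pattern))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        branch_results = list(pool.map(evaluate_branch, union_branches))
    return chain.from_iterable(branch_results)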

Importing, conversion and exporting are currently too slow and involve too many steps. This could be optimized by writing a custom Cython adaptor for raptor (the provided Python bindings are too slow) in order to reformat and sort the triples directly there and achieve maximum throughput.

Missing features

Feature-wise there is also much work to be done to support more of SPARQL. For example, a specialized full-text engine such as Apache Lucene [1] could be used to support regex filtering. Maintaining knowledge about the RDF type would be needed for that.

Regarding the MongoDB backend, an extensive study on the true scalability of MongoDB for triple stores should be made. After that, the jumping (and even the merge joining) could be implemented directly within a MongoDB node. Selectivity estimation should also become faster, either by maintaining a sharded key-value store or by implementing a heuristic function for the QueryEngine. After that a Cython driver for the simple BSON-based MongoDB protocol should be written to minimize network overhead. This would also make it possible to work directly on binary BSON data and keep all serialization aside. As for the server, various caching techniques could be implemented. Especially keeping unexhausted generators, or storing variable offsets for SPARQL queries with an OFFSET parameter, would reduce the overhead of such operations, as sketched below. Adding a WebSocket-based client with live streams should be fairly trivial, but would bring the idea of an 'infinite stream' out to the client side. Last but not least there should be more backend adapters. Especially Riak [2] (it uses Erlang's GB tree implementation) could be an interesting candidate.

[1] http://lucene.apache.org/
[2] http://wiki.basho.com/An-Introduction-to-Riak.html
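The generator-caching idea for OFFSET queries could look roughly like the following sketch. The cache layout and the execute entry point are assumptions; a real server would additionally need eviction and thread safety.

from itertools import islice

class OffsetCache(object):
    """Keep partially consumed result generators keyed by the query string.

    When a client asks for the next OFFSET window of the same query, the
    cached generator is resumed instead of re-evaluating from scratch.
    """

    def __init__(self, query_engine):
        self.query_engine = query_engine
        self._generators = {}  # query string -> (generator, results yielded so far)

    def fetch(self, query, offset, limit):
        generator, position = self._generators.get(query, (None, 0))
        if generator is None or offset < position:
            # No cached state, or the client went backwards: start over.
            generator, position = self.query_engine.execute(query), 0
        # Skip forward to the requested offset, then take `limit` results.
        skipped = offset - position
        window = list(islice(generator, skipped, skipped + limit))
        self._generators[query] = (generator, offset + len(window))
        return window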

Inter-mixing different KVIndex implementations within one IndexManager could also result in new performance characteristics. Alternatively, a hybrid KVIndex implementation could be built, e.g. maintaining levels 0 and 1 within Kyoto Cabinet and level 2 within MongoDB.

6.2 Ecosystem

As real-world use cases for RDF will often need to rely on data already stored in existing databases, there should be a way to minimize the overhead. The obvious way would be to export the data into a triple store according to some ontology. If the non-RDF storage system is a document-oriented store, the ontology already lies partly within the document structure. Hence (generic) algorithms that generate ad-hoc RDF statements on top of documents are conceivable; a small sketch of this idea follows below. Possibly, with a map/reduce approach, the RDF data would adapt to the flexible structure of the documents. Next, the map/reduce functionality of MongoDB could be exploited on the data model described in Section 3.2. Work in this area has been done in [Myung et al., 2010]. As more and more computing power is available on the client side, it would make sense to distribute the most costly operations to where the data is actually consumed. With today's fast ECMAScript interpreters and WebCL [3] (OpenCL for the browser) being drafted, it could become possible to move the query engine to the client side. Tygrstore would then only serve to maintain and deliver the generators and to abstract the backend storage. Together with a security model that narrows or widens the 'open world', this would give true, standardized data accessibility. Even better, it would cut costs for operators while acting as a driver for client-side hardware sales. It would also simplify the creation of mashups enormously. To accomplish this, a SPARQL parser in JavaScript, a port of the QueryEngine and a WebSocket server for the IndexManager would be needed.

[3] http://www.khronos.org/news/press/releases/khronos-releases-final--1.0-specification
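As an illustration of generating ad-hoc RDF statements on top of documents, the following sketch maps a flat JSON-like document to N3-style triples. The namespace, the subject URI scheme and the field-to-predicate mapping are assumptions for the example only.

def document_to_triples(document, base_uri="http://example.org/resource/",
                        ns="http://example.org/schema#"):
    """Yield N3-style (subject, predicate, object) strings for one document.

    Hypothetical mapping: the document id becomes the subject URI and every
    other field name becomes a predicate in the given namespace; list values
    produce one statement per element.
    """
    subject = "<%s%s>" % (base_uri, document["_id"])
    for field, value in document.items():
        if field == "_id":
            continue
        predicate = "<%s%s>" % (ns, field)
        values = value if isinstance(value, list) else [value]
        for item in values:
            yield (subject, predicate, '"%s"' % item)

# Example usage:
# doc = {"_id": "person1", "name": "Anna", "languages": ["de", "fr"]}
# for triple in document_to_triples(doc):
#     print(" ".join(triple) + " .")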

7 Conclusions

In this thesis I have shown that a flexible architecture for RDF storage is actually possible and imposes only modest concessions with regard to performance. Furthermore, a way was sketched in which a triplestore can be scaled horizontally given that architecture. To become a truly usable solution, some work is still needed: refinements of the API and especially faster backend adapters. There is also no clean SPARQL parser interface yet and no universal algebra/plan to solve a query.

I was very astonished by how simple the solution was in the end. Being able to resolve Basic Graph Patterns in less than a millisecond with my own code was inconceivable to me until I did it. Python with Cython was a very good fit for this project, since it is one of the few dynamic languages able to seamlessly go from highly expressive, duck-typed code to routines written in C. This opens up the possibility to improve code with regard to performance when necessary; before that happens, prototyping can be done very comfortably. On top of that, it is fast and painless to interface with already written C code.

I truly hope that the Tygrstore framework can serve as a base for further research. Personally, I am wondering what an elegant but still performant algebra for the QueryEngine would look like.

A Appendix

A.1 Glossary

• Statement: An RDF Statement as defined by [Brickley, 2004], but including the Context for Named Graph relations.

• Triple: Same as Statement if not highlighted otherwise.

• Tuple: A Python tuple

• Natural: A component of a Statement, thus a Subject, Predicate, Object or Context; or one character out of the natural ordering.

A.2 Detailed Benchmark Results

Tygrstore: LUBM Benchmark Details

q    ms            no. results  time/result  result/second  QueryEngine setup time  warm/cold
lq1  1.794100      1            1.794100     557.382591     0.512123                c
lq1  0.988007      1            0.988007     1012.138996    0.324965                w
lq1  1.790047      10           0.179005     5586.446457    0.331879                c
lq1  1.276970      10           0.127697     7831.038088    0.332832                w
lq1  11.554956     100          0.115550     8654.294852    0.303030                c
lq1  6.438017      100          0.064380     15532.733400   0.352859                w
lq1  100.844145    1000         0.100844     9916.292134    0.336170                c
lq1  58.902979     1000         0.058903     16977.070069   0.355005                w
lq1  108.805895    1514         0.071867     13914.687270   0.342846                c
lq1  86.724997     1514         0.057282     17457.481233   0.398874                w
lq2  11.681080     1            11.681080    85.608523      1.173019                c
lq2  1.261950      1            1.261950     792.424712     0.489950                w
lq2  31.748056     10           3.174806     314.979912     0.488043                c
lq2  4.919052      10           0.491905     2032.911981    0.515223                w
lq2  313.961983    100          3.139620     318.509901     0.496149                c
lq2  42.371035     100          0.423710     2360.102860    0.550032                w
lq2  2923.185825   1000         2.923186     342.092518     0.507116                c
lq2  402.168989    1000         0.402169     2486.516929    0.521183                w
lq2  3397.074938   2007         1.692613     590.802392     0.504971                c
lq2  803.925037    2007         0.400561     2496.501423    0.524044                w
lq3  11.338949     1            11.338949    88.191594      1.721144                c
lq3  1.219988      1            1.219988     819.680281     0.451088                w
lq3  29.592991     10           2.959299     337.917855     0.482082                c
lq3  3.762007      10           0.376201     2658.155777    0.509977                w
lq3  189.826965    100          1.898270     526.795547     0.521183                c
lq3  30.470848     100          0.304708     3281.825295    0.519037                w
lq3  1144.536018   1000         1.144536     873.716496     0.502110                c
lq3  281.628847    1000         0.281629     3550.772622    0.513077                w
lq3  747.119904    1434         0.521004     1919.370630    0.504017                c
lq3  399.667025    1434         0.278708     3587.986778    0.505924                w
lq4  443.204165    1            443.204165   2.256296       0.691175                c
lq4  197.706938    1            197.706938   5.057991       0.476837                w
lq4  290.827036    10           29.082704    34.384699      0.537157                c
lq4  200.902939    10           20.090294    49.775280      0.522137                w
lq4  704.112053    100          7.041121     142.022849     0.513077                c
lq4  239.194870    100          2.391949     418.069167     0.526905                w
lq4  2263.238907   1000         2.263239     441.844649     0.545025                c
lq4  624.146938    1000         0.624147     1602.186823    0.526905                w
lq4  4455.750942   10000        0.445575     2244.290610    0.514030                c
lq4  4439.900160   10000        0.443990     2252.302899    0.516891                w
lq4  47754.329920  100000       0.477543     2094.050951    0.513077                c
lq4  42808.356047  100000       0.428084     2335.992531    0.524998                w
lq5  10.285854     1            10.285854    97.220898      1.472950                c
lq5  1.781940      1            1.781940     561.185978     0.658989                w
lq5  24.019003     7            3.431286     291.435911     0.658989                c
lq5  6.473064      7            0.924723     1081.404346    0.673056                w

Table A.1: Queries LQ1 - LQ5 on the LUBM Dataset with 100 Universities

A.2.1 Results

A.3 LUBM Queries

queries = {}

prefix = """
PREFIX rdf:
PREFIX ub:
"""

queries["lq1"] = ’’’ SELECT ?department WHERE { ?researchGroups ub:subOrganizationOf ?department . ?department ub:name ’"Department1"’ . } LIMIT 100’’’

queries["lq2"] = ’’’ SELECT ?mail ?phone ?doctor WHERE { ?professor ub:name ’"FullProfessor1"’ . ?professor ub:emailAddress ?mail . ?professor ub:telephone ?phone . ?professor ub:doctoralDegreeFrom ?doctor . }’’’

queries["lq3"] = ’’’ SELECT ?studentName ?courseName WHERE { ?student ub:takesCourse ?course . ?course ub:name ?courseName . ?student ub:name ?studentName . ?student ub:memberOf . }’’’

queries["lq4"] = ’’’ SELECT ?publication ?author ?department ?university WHERE { ?publication ub:name ’"Publication0"’ . ?publication ub:publicationAuthor ?author . ?author ub:worksFor ?department . ?department ub:subOrganizationOf ?university . } LIMIT 100 ’’’

queries["lq5"] = ’’’ SELECT ?university ?name ?tel WHERE { ?student ub:advisor ?advisor . ?advisor ub:worksFor ?department . ?department ub:subOrganizationOf ?university . ?student ub:name ?name . 50 Appendix A. Appendix

?student ub:telephone ?tel . ?student ub:takesCourse . }LIMIT 100’’’ List of Figures

2.1 Tygrstore Architecture Overview...... 6 2.2 Tygrstore Query Processing...... 15

3.1 Index structure on different B+Trees...... 22 3.2 MongoDB sharding concept...... 30

4.1 Query1 ...... 32 4.2 Query Time and requests/s on the 138 Mio. Dataset (Warm/Cold) ... 33 4.3 Distribution of calls in milliseconds ...... 35 4.4 Time spent in and below a method in seconds ...... 36 4.5 MongoDB Adapter, times spent in seconds ...... 38

Bibliography

[Atre et al., 2010] Atre, M., Chaoji, V., Zaki, M., and Hendler, J. (2010). Matrix Bit loaded: a scalable lightweight join query processor for RDF data. In Proceedings of the 19th international conference on World wide web, pages 41–50. ACM.

[Atre and Hendler, ] Atre, M. and Hendler, J. BitMat: A Main Memory Bit-matrix of RDF Triples. In The 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), page 33.

[Basca and Bernstein, 2010] Basca, C. and Bernstein, A. (2010). Avalanche: Putting the Spirit of the Web back into Semantic Web Querying. In Proceedings of the 6th International Workshop on Scalable Semantic Web Knowledge Base Systems.

[Brickley, 2004] Brickley, D. (2004). RDF vocabulary description language 1.0: RDF schema. http://www.w3.org/tr/rdf-schema/.

[Gilbert and Lynch, 2002] Gilbert, S. and Lynch, N. (2002). Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services. ACM SIGACT News, 33(2):59.

[Guo et al., 2005] Guo, Y., Pan, Z., and Heflin, J. (2005). LUBM: A benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 3(2-3):158–182.

[Kim et al., 2009] Kim, C., Kaldewey, T., Lee, V., Sedlar, E., Nguyen, A., Satish, N., Chhugani, J., Di Blas, A., and Dubey, P. (2009). Sort vs. Hash revisited: fast join implementation on modern multi-core CPUs. Proceedings of the VLDB Endowment, 2(2):1378–1389.

[Kuznetsov et al., 2010] Kuznetsov, V., Evans, D., and Metson, S. (2010). The CMS data aggregation system. Procedia Computer Science, 1(1):1529–1537.

[Myung et al., 2010] Myung, J., Yeon, J., and Lee, S. (2010). SPARQL basic graph pattern processing with iterative MapReduce. In Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, pages 1–6. ACM.

[Schmidt et al., 2010] Schmidt, M., Meier, M., and Lausen, G. (2010). Foundations of SPARQL query optimization. In Proceedings of the 13th International Conference on Database Theory, pages 4–33. ACM.

[Senn, ] Senn, J. String-Based Semantic Web Data Management Using Ternary B-Trees.

[Senn, 2010] Senn, J. (2010). Parallel Join Processing on Graphics Processors for the Resource Description Framework. ARCS 2010.

[Vidal et al., 2010] Vidal, M., Ruckhaus, E., Lampo, T., Martínez, A., Sierra, J., and Polleres, A. (2010). Efficiently Joining Group Patterns in SPARQL Queries. The Semantic Web: Research and Applications, pages 228–242.

[Weiss and Bernstein, ] Weiss, C. and Bernstein, A. On-disk storage techniques for Semantic Web data - Are B-Trees always the optimal solution? In The 5th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS2009), page 49.

[Weiss et al., 2008] Weiss, C., Karras, P., and Bernstein, A. (2008). Hexastore: sextuple indexing for semantic web data management. Proceedings of the VLDB Endowment, 1(1):1008–1019.