Reducing storage requirements of multi-version graph databases using forward and reverse deltas

Thibault Mahieu

Supervisor: Prof. dr. ir. Ruben Verborgh Counsellors: Ir. Ruben Taelman, Dr. ir. Miel Vander Sande

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Electronics and Information Systems Chair: Prof. dr. ir. Koen De Bosschere Faculty of Engineering and Architecture Academic year 2017-2018

Acknowledgments

I would like to thank my promoter prof. dr. ir. Ruben Verborgh and my supervisors Ruben Taelman and dr. ir. Miel Vander Sande for their help and advice during the course of this Master's Dissertation. Their constant guidance and support helped shape this Master's Dissertation into what it is today. I would also like to thank my friends and family, who have supported me throughout the years. In particular, I thank my brother, Christof Mahieu, to whom I could always turn for support.

Thibault Mahieu May 31, 2018

Usage

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Thibault Mahieu May 31, 2018

Reducing Storage Requirements of Multi-Version Graph Databases Using Forward and Reverse Deltas

by

Thibault Mahieu

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Academic year 2017–2018

Supervisor: Prof. dr. ir. Ruben Verborgh Counsellors: Ir. Ruben Taelman, Dr. ir. Miel Vander Sande Faculty of Engineering and Architecture University of Ghent

Department of Electronics and Information Systems Chairman: Prof. dr. ir. Koen De Bosschere

Summary

This master's dissertation presents a potential storage optimization for change-based multi-version graph databases. This storage optimization is then applied to an existing RDF archive called OSTRICH. Finally, the implementation of the presented storage optimization is compared with the original OSTRICH RDF archive.

Keywords

RDF Versioning, RDF Archiving, Semantic Data Versioning, Bidirectional Delta Chain

Reducing Storage Requirements of Multi-Version Graph Databases Using Forward and Reverse Deltas

Thibault Mahieu

Supervisors: prof. dr. ir. Ruben Verborgh, ir. Ruben Taelman, dr. ir. Miel Vander Sande

Abstract—Linked Datasets evolve over time for numerous reasons, such as the addition of new data. Capturing this evolution via versioned data archives can provide new insights. This master's dissertation presents a potential storage optimization for change-based multi-version graph databases. This storage optimization is then applied to an existing RDF archive called OSTRICH, for which the implementation is called COBRA. Finally, COBRA is compared with the original OSTRICH RDF archive. Our experiments show that COBRA lowers the storage size compared to OSTRICH, but not for every benchmark.

Keywords—RDF Versioning, RDF Archiving, Bidirectional Delta Chain

I. PREFACE

Datasets change over time for numerous reasons, such as the addition of new information. Capturing this evolution allows for historical analyses which can provide new insights. Linked Data is no exception to this. In fact, archiving Linked Open Data has been an area of research for a few years [2].

One particular research focus is enabling offsettable query streams for RDF archives, since query streams are more memory-efficient for large query results and the offset allows for faster queries when only a subset is needed. OSTRICH [3] is state-of-the-art when it comes to offset-enabled RDF archives. OSTRICH stores versions in a delta chain that starts with a fully materialized snapshot, followed by a series of changesets relative to the snapshot, referred to as deltas. However, OSTRICH has a large ingestion time for large dataset versions. The ingestion time can be reduced by introducing additional snapshots; however, this can in turn increase the storage size.

In this work, we will explore if we can reduce the resulting storage size increase of the multiple-snapshot approach, while maintaining the ingestion time reduction, by restructuring the delta chain.

II. BACKGROUND

A. Linked Data

In 2001, Tim Berners-Lee, the inventor of the World Wide Web, proposed the idea of the Semantic Web [4]. The goal of the Semantic Web is to make data on the Web understandable to machines so that they can perform complex tasks. In order to make this vision a reality, Linked Data (LD) was introduced. As described by Bizer et al. [5], LD refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can, in turn, be linked to from external data sets. The standard for representing LD is RDF, a graph-based data model that uses a triple structure. RDF data can be queried using SPARQL, a graph-based pattern matching query language, where the graph-based patterns are made up of triple patterns consisting of a subject, predicate and object.

B. RDF Stores

RDF stores are storage systems designed for storing RDF data.

HDT [6] is an RDF store focused on compression that consists of three parts:
• Header - contains metadata and serves as an entry point to the data
• Dictionary - mapping between triple components and unique identifiers, referred to as dictionary encoding
• Triples - structure of the underlying RDF graph after dictionary encoding

HDT resolves queries on the compressed data, but only has one index (SP-O), making certain triple patterns hard to resolve. In addition, HDT stores are by design immutable after creation, making them unsuitable for volatile datasets.

HDT-FoQ [7] is an extension of HDT [6] that focuses on resolving queries faster. For this reason, HDT-FoQ adds two additional indexes, namely PS-O and OP-S, to cover more access patterns. The PS-O index makes use of a wavelet tree, while the OP-S index uses adjacency lists, similar to the SP-O index.

C. Non-RDF Archives

Many techniques from non-RDF archives and Version Control Systems (VCS) can be repurposed for versioning RDF archives.

RCS [8] is a delta-based VCS, wherein each delta consists of insertions and deletions of lines. The latest version is stored completely and older revisions are stored in so-called reverse deltas, resulting in quick access to the latest version. To add a new revision, the system stores the latest revision completely and replaces the previous revision by its delta, keeping the rest of the chain intact.

D. RDF Archives

RDF archives are versioned RDF stores. Fernández et al. [2] distinguish three archiving policies for Linked Open Data (LOD):
• Independent Copies (IC) - Every version is stored fully materialized.
• Change-Based (CB) - Only changes between versions are stored (a minimal changeset sketch follows Section D.1).
• Timestamp-Based (TB) - Triples are annotated with their temporal validity.

D.1 Independent Copies Archive Policy

SemVersion [9] is an IC versioning system for RDF that tries to emulate classical Concurrent Versions System (CVS) workflows for version management. Each version is stored separately in RDF stores that conform to a certain API, which manages said versions.
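To make the change-based policy concrete, here is a minimal sketch of a changeset and its application, assuming triples and changesets are modeled as plain sorted sets; the type and function names are illustrative and are not taken from any of the systems cited above. COBRA itself is implemented in C++, so the sketches in this abstract use C++ as well.

```cpp
#include <set>
#include <string>
#include <tuple>

// A triple as a (subject, predicate, object) tuple; std::set keeps
// triples sorted, which the later sort-merge sketches rely on.
using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;

// A change-based version: only the differences to a base version.
struct Changeset {
    TripleSet additions;
    TripleSet deletions;
};

// Materialize a version by applying its changeset to a base version.
TripleSet apply(TripleSet base, const Changeset& delta) {
    for (const Triple& t : delta.deletions) base.erase(t);
    for (const Triple& t : delta.additions) base.insert(t);
    return base;
}
```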

D.2 Change-Based Archive Policy

Cassidy et al. [10] propose a CB RDF archive that is built on the theory of patches [11], a mathematical model that describes how patches can be manipulated in order to obtain the desired version in the context of software. This model describes fundamental operations, such as the commute operation, the revert operation and the merge operation. Cassidy et al. adapt these operations so that they are applicable to RDF stores as well.

Im et al. [12] introduce a CB store built on a Relational Database Management System (RDBMS). They propose an aggregated deltas approach wherein not only the delta between a parent and child, but all possible deltas are stored. This results in an increased storage overhead, but a decreased version materialization cost compared to the classic sequential delta chain.

Vander Sande et al. [13] introduce R&WBase, a distributed CB RDF archive wherein versions are stored as consecutive deltas. Deltas between versions consist of an addition set and a deletion set, respectively listing which triples have been added and deleted. Since deltas are stored in the same graph, triples are annotated with a context number, indicating to which version the triple belongs and whether it was added or deleted. In particular, an even context number indicates the triple is an addition, an uneven context number indicates the triple is a deletion. Queries can be handled efficiently by looking at the highest context number: if that number is even, the triple is present for that version; if it is uneven, the triple is not present for that version. Finally, R&WBase also supports tagging, branching and merging of datasets.

R43ples [14] is another CB RDF archive, since it groups additions and deletions in named graphs. R43ples allows manipulation of revisions with SPARQL, by introducing new keywords such as REVISION, TAG and BRANCH. Versions are materialized by starting from the head of the branch and applying all prior additions/deletions.

D.3 Timestamp-Based Archive Policy

Hauptmann et al. [15] propose a delta-based store similar to R43ples, including complete graphs and manipulation via SPARQL. However, in Hauptmann's approach, each triple is virtually annotated with version information that is cached using a hash table, making it a TB approach.

x-RDF-3X [16] extends RDF-3X [17] with versioning support. Each triple is annotated with a creation timestamp and, when appropriate, a deletion timestamp, making it a TB approach.

v-RDFCSA [18] is a TB archiving extension of RDFCSA [19], a compact self-indexing RDF storage structure based on suffix arrays.

Dydra [20] is an RDF archive that stores versions as named graphs in a quad store, which can be queried using the REVISION SPARQL keyword. Dydra uses B+-trees with six indexes: GSPO, GPOS, GOSP, SPOG, POSG, OSPG. The B+-tree values indicate in which revisions a particular quad is visible, making it a TB system.

D.4 Hybrid Archive Policy

TailR [21] interleaves fully materialized versions (snapshots) in between the delta chain, as seen in Figure 1. The snapshots reset the version materialization cost but can lead to a higher storage requirement.

Fig. 1: Non-aggregated unidirectional delta chain, as done in TailR.

OSTRICH [3] is another hybrid solution that interleaves fully materialized snapshots in between the delta chain, as seen in Figure 2. However, unlike TailR, OSTRICH uses aggregated deltas [12]: deltas that directly refer to the snapshot instead of the previous version. Moreover, the delta chain is stored by annotating each triple with version information, making it an IC, CB and TB hybrid. Ingestion can be done using an in-memory batch algorithm or a streaming algorithm. OSTRICH supports offsettable query result streams. In addition, OSTRICH also provides query count estimation functionality, which can be used as a basis for query optimization in query engines [22].

Fig. 2: Aggregated unidirectional delta chain where all deltas are relative to the snapshot at the beginning of the chain, as done in OSTRICH.

E. Query Atoms

Fernández et al. [2] also distinguish five types of queries, called query atoms:
• Version Materialization (VM) queries retrieve data from a single version.
• Delta Materialization (DM) queries retrieve the differences between two versions.
• Version Query (VQ) annotates query results with the version numbers wherein the data exists.
• Cross-Version join (CV) joins the results of two queries over two different versions.
• Change Materialization (CM) returns a list of versions in which a given query produces consecutively different results.

Some storage policies are better suited for some query atoms than others. The IC approach is best suited for VM queries, since the versions are stored completely and do not need to be reconstructed. The CB approach is particularly effective for DM queries between neighboring versions, since exactly these changes are stored. The TB approach is very efficient in resolving VQ queries, since triples are naturally annotated with the version numbers wherein they exist.
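As a rough illustration of the first three query atoms, which are the ones OSTRICH and COBRA support, the following interface sketch shows what each atom consumes and returns; CV and CM are omitted for brevity, and all type and method names here are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// A triple and a triple pattern; an empty pattern component is a variable.
using Triple = std::tuple<std::string, std::string, std::string>;
struct TriplePattern { std::string s, p, o; };

// Hypothetical interface over a versioned RDF archive.
struct VersionedArchive {
    virtual ~VersionedArchive() = default;

    // VM: all triples matching the pattern in one version.
    virtual std::vector<Triple>
    versionMaterialization(const TriplePattern& pattern, int version) = 0;

    // DM: matching triples changed between two versions,
    // flagged true for additions and false for deletions.
    virtual std::vector<std::pair<Triple, bool>>
    deltaMaterialization(const TriplePattern& pattern, int from, int to) = 0;

    // VQ: matching triples annotated with the versions they exist in.
    virtual std::map<Triple, std::set<int>>
    versionQuery(const TriplePattern& pattern) = 0;
};
```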

F. RDF Archiving Benchmarks

BEAR [2] is an RDF archiving benchmark based on real-world data from three different domains:
• BEAR-A - 58 weekly snapshots from the Dynamic Linked Data Observatory [23].
• BEAR-B - the 100 most volatile resources from DBpedia Live [24] at three different granularities: instant, hourly and daily.
• BEAR-C - 32 weekly snapshots from the Open Data Portal Watch project [25].

BEAR-A provides triple pattern queries and their results for seven triple patterns. BEAR-B provides triple pattern queries and their results for ?PO and ?P? triple patterns, which are based on the most frequent triple patterns from the DBpedia query set. BEAR-C provides 10 complex queries that, although they cannot be efficiently resolved with current archiving strategies, could help foster the development of new query resolution algorithms.

III. STORAGE OPTIMIZATION: BIDIRECTIONAL DELTA CHAIN

As seen in previous works [21], [3], a delta chain consists of a fully materialized snapshot followed by a series of deltas. The main idea behind our storage optimization is moving the snapshot from the front of the delta chain to the middle of the delta chain, in order to potentially reduce the overall storage size. This transforms the delta chain into a bidirectional delta chain, which divides the original delta chain into two smaller delta chains, i.e. the reverse delta chain and the forward delta chain. Figures 3 and 4 show two example bidirectional delta chains.

Fig. 3: A simplified non-aggregated bidirectional delta chain.

Fig. 4: A simplified aggregated bidirectional delta chain.

A. Non-Aggregated Bidirectional Delta Chain

In a non-aggregated delta chain, all deltas reference the closest preceding version. So in order to materialize a version, all preceding deltas need to be applied until the fully materialized snapshot is reached. As stated above, a bidirectional delta chain divides the original delta chain into two smaller delta chains. Moreover, the size of the deltas remains the same, since the reverse delta chain is just the inverse of the original deltas. Therefore, the worst-case materialization cost for bidirectional delta chains is half of that for unidirectional delta chains. On the other hand, bidirectional non-aggregated delta chains could also potentially reduce storage size while maintaining a similar version materialization time. Indeed, if we compare a series of two unidirectional delta chains with a single bidirectional delta chain, one fewer snapshot needs to be stored.

B. Aggregated Bidirectional Delta Chain

In an aggregated delta chain, all deltas reference the snapshot, which means that an aggregated delta contains all the changes from all preceding deltas. In this work, we assume that a higher distance between versions results in a bigger aggregated delta. This assumption holds for datasets that steadily grow over time by adding new triples, because later versions will have more and more new triples compared to earlier versions. It follows that reducing the average distance between the snapshot and the versions results in smaller aggregated deltas, thus reducing the overall storage size. Bidirectional delta chains reduce the average distance between the snapshot and the other versions. Therefore, bidirectional delta chains should have a lower storage size compared to unidirectional delta chains for growing datasets.

C. Bidirectional Delta Chain Disadvantages

In-order ingestion is the biggest drawback of bidirectional delta chains. Indeed, to ingest a version in the reverse delta chain, we would need to calculate the delta between the version and a snapshot that is not known yet. However, a fix-up algorithm can be used to build the bidirectional delta chain: all versions are first stored in a forward delta chain, and once the future snapshot is inserted, the forward delta chain is converted into a reverse delta chain. RCS [8] presents an alternative to the fix-up algorithm. In this approach, the latest version is always stored fully materialized. To add a new version, the system stores the new version completely and replaces the previous version by its delta, keeping the rest of the chain intact.
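A minimal sketch of version materialization in a non-aggregated bidirectional delta chain, under the assumption that deltas are stored as plain addition/deletion sets: versions left of the snapshot walk the reverse chain and versions right of it walk the forward chain, so at most half of the chain's deltas are ever applied. The data layout is illustrative, not COBRA's actual storage structure.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;
struct Changeset { TripleSet additions, deletions; };

// Non-aggregated bidirectional chain: the snapshot sits in the middle,
// with deltas stepping away from it in both directions.
struct BidirectionalChain {
    std::vector<Changeset> reverse;  // steps 1..k back from the snapshot
    TripleSet snapshot;
    std::vector<Changeset> forward;  // steps 1..k forward from the snapshot
};

static TripleSet applyDelta(TripleSet version, const Changeset& delta) {
    for (const Triple& t : delta.deletions) version.erase(t);
    for (const Triple& t : delta.additions) version.insert(t);
    return version;
}

// offset 0 is the snapshot itself; negative offsets walk the reverse
// chain, positive offsets the forward chain.
TripleSet materialize(const BidirectionalChain& chain, int offset) {
    TripleSet version = chain.snapshot;
    for (int i = 0; i < -offset; ++i) version = applyDelta(version, chain.reverse[i]);
    for (int i = 0; i < offset; ++i)  version = applyDelta(version, chain.forward[i]);
    return version;
}
```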
IV. BIDIRECTIONAL RDF ARCHIVE

A. Storage Overview

The storage structure for the bidirectional delta chain can be seen in Figure 5. The storage structure is similar to OSTRICH [3]; the reverse delta chain has the same storage structure as the forward delta chain.

Fig. 5: An overview of the storage structure of a bidirectional delta chain. Figure adapted from OSTRICH [3].

B. Multiple Snapshots

The fix-up algorithm requires multiple snapshots, but OSTRICH only supports a single snapshot. Therefore, we modify OSTRICH so that multiple snapshots are supported. Supporting multiple snapshots comes down to finding the corresponding snapshot for a given version. We calculate the greatest lower bound and the least upper bound of all the snapshots for the given version. If the upper bound snapshot does not have a reverse delta chain, the version is stored in a forward delta chain and the corresponding snapshot is the lower bound snapshot. If the upper bound snapshot does have a reverse delta chain, the corresponding snapshot is the snapshot closest to the version.

C. Ingestion

As mentioned before, in-order ingestion is difficult in a bidirectional delta chain. Therefore, we first discuss out-of-order ingestion, before discussing in-order ingestion.

C.1 Out-of-order Ingestion

Ingesting versions out-of-order in a reverse delta chain is similar to OSTRICH's forward ingestion process; we simply need to transform the input changesets. Firstly, since the forward ingestion algorithm expects the input changeset to reference the snapshot, we reverse the input changeset by swapping the additions and deletions so that the input changeset references the snapshot. Secondly, since the forward ingestion algorithm expects the version closest to the snapshot to be inserted first, we insert the versions in reverse order.
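The first transformation can be sketched in a few lines, assuming the same set-based changeset model as in the earlier sketches; swapping additions and deletions inverts the direction of a delta.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <utility>

using Triple = std::tuple<std::string, std::string, std::string>;
struct Changeset { std::set<Triple> additions, deletions; };

// Swapping additions and deletions turns a delta that leads away from
// the snapshot into one that leads toward it, after which the existing
// forward ingestion algorithm can be reused unchanged.
Changeset reverseChangeset(Changeset delta) {
    std::swap(delta.additions, delta.deletions);
    return delta;
}
```

The second transformation is purely an ordering concern: the reversed changesets are handed to the forward ingestion algorithm starting from the version closest to the snapshot.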
C.2 In-order Ingestion

For in-order ingestion, we utilize a fix-up algorithm, which starts ingesting versions in a temporary forward delta chain. Once the system decides a new delta chain needs to be initiated, for example because the delta chain size exceeds a certain threshold, the system stores the next version once in the temporary forward delta chain and again as the snapshot of the new permanent delta chain. The reason behind storing the version twice is to simplify the input extraction, which is explained below. Figure 6 shows the resulting delta chains.

Fig. 6: State of the delta chains before the fix-up algorithm is applied.

Once the system has some idle time, the fix-up process can be performed. The fix-up process starts by extracting the original input changesets from the temporary delta chain. Hence, the algorithm iterates over the version information for every triple in the temporary delta chain. If the previous version is present in this version information, the triple was already added in a previous version and therefore was not part of the input changeset of the current version. If the previous version is not present, the triple was first added in the current version and should be part of the input changeset. The temporary delta chain can then be deleted and a new permanent reverse delta chain can be constructed out-of-order from the extracted input changesets.
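A sketch of this extraction step, under the simplifying assumption that the version information of the temporary delta chain is available as a map from each triple to the set of versions in which it is present; only additions are shown, and deletions would be handled analogously.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>

using Triple = std::tuple<std::string, std::string, std::string>;

// Versions in which each triple is present, as recorded in the
// temporary (aggregated) forward delta chain.
using VersionInfo = std::map<Triple, std::set<int>>;

// Extract the triples that were *first* added in `version`: a triple
// belongs to the original input changeset of `version` only if it is
// present in `version` but not in `version - 1`.
std::set<Triple> extractAdditions(const VersionInfo& info, int version) {
    std::set<Triple> additions;
    for (const auto& [triple, versions] : info)
        if (versions.count(version) && !versions.count(version - 1))
            additions.insert(triple);
    return additions;
}
```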
D. Queries

VM queries retrieve data from a single version. VM queries are handled exactly the same as in OSTRICH, even for versions stored in the reverse delta chain, since we stored inverse deltas.

DM queries retrieve the differences between two versions and annotate whether each triple is an addition or a deletion. In this work, we focus on DM queries for a single snapshot and its corresponding reverse and forward delta chains. We can discern three cases for DM queries, namely: a DM query between a snapshot and a delta, a DM query between two deltas in the same delta chain (intra-delta), and a DM query between two deltas in opposite delta chains (inter-delta). The first and second case are similar to OSTRICH. In the third case, we resolve the DM query by splitting up the requested delta into two sequential deltas that are relative to the snapshot and then merging these sequential deltas back together. In other words, using the Darcs [11] patch notation, with $o$ being the start version, $e$ being the end version and $s$ being the snapshot: ${}^{o}D^{e} = {}^{o}D_{1}^{s}\,{}^{s}D_{2}^{e}$. This strategy is quite efficient, since the deltas relative to the snapshot are exactly what is stored. Furthermore, since the snapshot-relative deltas are sorted, they can be merged in a sort-merge fashion; a sketch follows at the end of this section. It is difficult to give an exact count of the results for inter-delta DM queries. However, an estimate of the result count can be calculated by summing the counts of both deltas relative to the snapshot. This can overestimate the actual count if triples are present in both deltas.

VQ queries annotate triples with the version numbers in which they exist. We present a VQ algorithm for a single snapshot and its corresponding reverse and forward delta chains, based on the VQ algorithm of OSTRICH. The algorithm starts by iterating over all the triples in the snapshot for the given triple pattern. Next, the deletion trees are probed for each triple. If the triple is not present in the deletion trees, the triple is present in all versions. If the triple is present in a deletion tree, the corresponding versions are erased from the version annotation. After all the snapshot triples have been processed, the algorithm iterates over the addition triples stored in the addition trees in a sort-merge join fashion. As was the case with snapshot triples, the deletion trees are probed for each triple. If the triple is not present in the deletion trees, the triple is present in all versions ranging from the version that introduced the triple to the last version. If the triple is present in a deletion tree, those versions are erased from the annotation. Result streams can be partially offset by offsetting the snapshot iterator of HDT [6].
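A hedged sketch of the inter-delta merge ${}^{o}D_{1}^{s}\,{}^{s}D_{2}^{e}$, assuming both snapshot-relative deltas are available as sorted addition/deletion sets; triples added in one delta and deleted in the other cancel out.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <tuple>

using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;
struct Changeset { TripleSet additions, deletions; };

static TripleSet difference(const TripleSet& a, const TripleSet& b) {
    TripleSet out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
    return out;
}

static TripleSet unite(const TripleSet& a, const TripleSet& b) {
    TripleSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}

// Merge the sequential deltas oD1s (start -> snapshot) and sD2e
// (snapshot -> end) into oDe. A triple added in one delta and deleted
// in the other cancels out and is emitted in neither set.
Changeset mergeSequential(const Changeset& d1, const Changeset& d2) {
    return {
        unite(difference(d1.additions, d2.deletions),
              difference(d2.additions, d1.deletions)),
        unite(difference(d1.deletions, d2.additions),
              difference(d2.deletions, d1.additions)),
    };
}
```

Because std::set iterates in sorted order, each set operation is a single sort-merge pass over both deltas, matching the strategy described above. The sketch also shows why the count estimate |D1| + |D2| can overestimate: a cancelled triple is counted in both deltas but never emitted.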

V. EVALUATION

COBRA (Change-Based Offset-Enabled Bidirectional RDF Archive) refers to the C++ software implementation of our storage optimization. COBRA uses the same software libraries as OSTRICH [3].

A. Experimental Setup

We will evaluate the ingestion and query resolution capabilities of COBRA. For this we will use the BEAR [2] benchmark, in particular BEAR-A, BEAR-B daily and BEAR-B hourly. The ingestion process will be evaluated on storage size and ingestion time. For BEAR-A we will only ingest the first eight versions due to memory constraints. Similarly, for BEAR-B hourly, we will only ingest the first 400 versions. For BEAR-B daily, we will ingest all 89 versions. We will perform the ingestion evaluation for multiple storage layouts and ingestion orders, namely:
• OSTRICH-1F: OSTRICH with one forward delta chain, as seen in Figure 2.
• OSTRICH-2F: OSTRICH with two forward delta chains.
• COBRA-PRE FIX UP: COBRA's pre-fix-up state, as seen in Figure 6.
• COBRA-POST FIX UP: COBRA's bidirectional delta chain post fix-up, as seen in Figure 4.
• COBRA-OUT OF ORDER: COBRA's bidirectional delta chain, as seen in Figure 4, but ingested out-of-order (snapshot, then the reverse delta chain, then the forward delta chain).

BEAR also provides query sets, which will be evaluated as VM queries for all versions, DM queries between all versions and a VQ query. Since neither OSTRICH nor COBRA supports multiple snapshots for all query atoms, we limit the query experiments to OSTRICH's unidirectional storage layout and COBRA's bidirectional storage layout.

B. Results

As can be seen in Figures 7, 8 and 9, no approach has the lowest storage size for all benchmarks. Indeed, COBRA has the lowest storage size for BEAR-A, OSTRICH-1F has the lowest storage size for BEAR-B daily and OSTRICH-2F has the lowest storage size for BEAR-B hourly. For all benchmarks, COBRA-OUT OF ORDER reduces the storage increase caused by initializing a second delta chain. However, this does not always result in an overall storage size reduction, due to the size difference between the first delta chain and the reverse delta chain.

Fig. 7: Cumulative storage size for the first eight versions of BEAR-A.
Fig. 8: Cumulative storage size for all versions of BEAR-B daily.
Fig. 9: Cumulative storage size for the first 400 versions of BEAR-B hourly.

Table I shows the ingestion times of the different configurations for all three benchmarks. OSTRICH-1F has the highest ingestion time. We also see that COBRA-PRE FIX UP has a higher ingestion time than OSTRICH-2F due to the additional version.

TABLE I: Ingestion times of the different configurations for all three benchmarks. The ingestion time of COBRA-POST FIX UP is represented as the sum of the ingestion time of COBRA-PRE FIX UP and the fix-up time, since COBRA-POST FIX UP uses the fix-up algorithm.

configuration        BEAR-A (min)     BEAR-B daily (min)  BEAR-B hourly (min)
OSTRICH-1F           1419.27          6.53                34.47
OSTRICH-2F           686.87           3.18                15.20
COBRA-PRE FIX UP     775.31           3.28                14.87
COBRA-POST FIX UP    775.31 + 502.75  3.28 + 2.48         14.87 + 11.41
COBRA-OUT OF ORDER   877.52           4.24                18.30

Figures 10, 13 and 16 display the mean VM query durations for the three benchmarks. VM queries are resolved faster in COBRA than in OSTRICH, even though the same VM algorithm was used.

As can be seen in Figures 11, 14 and 17, COBRA also resolves DM queries faster than OSTRICH. The reason for this is that intra-delta DM queries are faster in smaller delta chains.

Figures 12, 15 and 18 display the mean VQ query durations for the three benchmarks. The VQ durations are roughly similar for COBRA and OSTRICH, which means COBRA's altered VQ algorithm does not cause significant overhead.

Fig. 10: Mean BEAR-A VM query duration of all versions for all triple patterns.
Fig. 11: Mean BEAR-A DM query duration between all versions for all triple patterns.
Fig. 12: Mean BEAR-A VQ query duration for all triple patterns.
Fig. 13: Mean BEAR-B daily VM query duration of all versions for all triple patterns.
Fig. 14: Mean BEAR-B daily DM query duration between all versions for all triple patterns.
Fig. 15: Mean BEAR-B daily VQ query duration for all triple patterns.
Fig. 16: Mean BEAR-B hourly VM query duration of all versions for all triple patterns.
Fig. 17: Mean BEAR-B hourly DM query duration between all versions for all triple patterns.
Fig. 18: Mean BEAR-B hourly VQ query duration for all triple patterns.

VI. CONCLUSION

In this work, we presented bidirectional delta chains as a potential storage optimization for CB RDF archives. We applied this storage optimization to an existing RDF archive named OSTRICH [3]. For this purpose, we modified OSTRICH so that multiple snapshots could be supported. Next, we presented an in-order ingestion algorithm. Moreover, we presented a novel DM query algorithm for inter-delta versions. Finally, we altered the existing VQ query algorithm so that bidirectional delta chains are supported.

We evaluated different storage configurations and concluded that no storage configuration has the lowest storage size for all benchmarks. We recommend initializing a new delta chain when the latest delta chain becomes too large. We also recommend merging two forward delta chains into a bidirectional delta chain if the first delta chain is more similar to the second snapshot than to the first snapshot. We also confirmed that initiating a new delta chain is a viable method for reducing the ingestion time. Finally, we evaluated VM, DM and VQ queries for OSTRICH and COBRA and observed that VM and DM queries were faster in COBRA, while VQ queries were equally fast.

In conclusion, bidirectional delta chains are not the all-round storage optimization technique we set out to find at the start of this work; however, they are a viable tool for reducing the overall storage size in certain cases.

On this topic, there are many opportunities for future research. First, there needs to be a reliable way of predicting whether a delta chain is more similar to the preceding snapshot or the future snapshot. Second, future work could devise a novel input extraction algorithm for the fix-up algorithm so that the middle version does not need to be stored twice. Finally, additional research is needed to expand the current DM and VQ algorithms to multiple snapshots and to allow for more efficient offsets.

REFERENCES

[1] Cyril Schoreels, Brian Logan, and Jonathan M. Garibaldi, "Agent based genetic algorithm employing financial technical analysis for making trading decisions using historical equity market data," in Intelligent Agent Technology (IAT 2004), Proceedings, IEEE/WIC/ACM International Conference on. IEEE, 2004, pp. 421–424.
[2] Javier D. Fernández, Jürgen Umbrich, Axel Polleres, and Magnus Knuth, "Evaluating query and storage strategies for RDF archives," in Proceedings of the 12th International Conference on Semantic Systems (SEMANTiCS 2016), New York, NY, USA, 2016, pp. 41–48, ACM.
[3] Ruben Taelman, Ruben Verborgh, and Erik Mannens, "Exposing RDF archives using triple pattern fragments," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017.
[4] Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web," Scientific American, vol. 284, no. 5, pp. 34–43, 2001.
[5] Christian Bizer, Tom Heath, and Tim Berners-Lee, "Linked data - the story so far," International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.
[6] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias, "Binary RDF representation for publication and exchange (HDT)," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 19, pp. 22–41, 2013.
[7] Miguel A. Martínez-Prieto, Mario Arias Gallego, and Javier D. Fernández, "Exchange and consumption of huge RDF data," in The Semantic Web: Research and Applications, Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho, and Valentina Presutti, Eds., Berlin, Heidelberg, 2012, pp. 437–452, Springer Berlin Heidelberg.
[8] Walter F. Tichy, "RCS - a system for version control," Software: Practice and Experience, vol. 15, no. 7, pp. 637–654, 1985.
[9] Max Völkel and Tudor Groza, "SemVersion: An RDF-based Ontology Versioning System," in Proceedings of the IADIS International Conference on WWW/Internet (IADIS 2006), Miguel Baptista Nunes, Ed., Murcia, Spain, October 2006, pp. 195–202.
[10] Steve Cassidy and James Ballantine, "Version control for RDF triple stores," in ICSOFT 2007 - 2nd International Conference on Software and Data Technologies, Proceedings, 2007, pp. 5–12.
[11] David Roundy, "Darcs: Distributed version management in Haskell," in Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, New York, NY, USA, 2005, Haskell '05, pp. 1–4, ACM.
[12] Dong-Hyuk Im, Sang-Won Lee, and Hyoung-Joo Kim, "A version management framework for RDF triple stores," International Journal of Software Engineering and Knowledge Engineering, vol. 22, no. 01, pp. 85–106, 2012.
[13] Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Sam Coppens, Erik Mannens, and Rik Van de Walle, "R&WBase: git for triples," in Proceedings of the 6th Workshop on Linked Data on the Web, Christian Bizer, Tom Heath, Tim Berners-Lee, Michael Hausenblas, and Sören Auer, Eds., May 2013, vol. 996 of CEUR Workshop Proceedings.
[14] Markus Graube, Stephan Hensel, and Leon Urbas, "R43ples: Revisions for triples - an approach for version control in the semantic web," in CEUR Workshop Proceedings, 2014.
[15] Claudius Hauptmann, Michele Brocco, and Wolfgang Wörndl, "Scalable semantic version control for linked data management," in LDQ@ESWC, 2015.
[16] Thomas Neumann and Gerhard Weikum, "x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 256–263, Sept. 2010.
[17] Thomas Neumann and Gerhard Weikum, "RDF-3X: A RISC-style engine for RDF," Proc. VLDB Endow., vol. 1, no. 1, pp. 647–659, Aug. 2008.
[18] A. Cerdeira-Pena, A. Fariña, J. D. Fernández, and M. A. Martínez-Prieto, "Self-indexing RDF archives," in 2016 Data Compression Conference (DCC), March 2016, pp. 526–535.
[19] Nieves R. Brisaboa, Ana Cerdeira-Pena, Antonio Fariña, and Gonzalo Navarro, "A compact RDF store using suffix arrays," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015.
[20] James Anderson and Arto Bendiken, "Transaction-time queries in Dydra," in Joint proceedings of the 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2017) and the 4th Workshop on Linked Data Quality (LDQ 2017), co-located with the 14th European Semantic Web Conference (ESWC 2017), 2017.
[21] Paul Meinhardt, Magnus Knuth, and Harald Sack, "TailR: a platform for preserving history on the web of data," in Proceedings of the 11th International Conference on Semantic Systems. ACM, 2015, pp. 57–64.
[22] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, and Pieter Colpaert, "Triple pattern fragments: a low-cost knowledge graph interface for the web," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37, pp. 184–206, 2016.
[23] Tobias Käfer, Ahmed Abdelrahman, Jürgen Umbrich, Patrick O'Byrne, and Aidan Hogan, "Exploring the dynamics of linked data," in The Semantic Web: ESWC 2013 Satellite Events, Philipp Cimiano, Miriam Fernández, Vanessa Lopez, Stefan Schlobach, and Johanna Völker, Eds., Berlin, Heidelberg, 2013, pp. 302–303, Springer Berlin Heidelberg.
[24] Mohamed Morsey, Jens Lehmann, Sören Auer, Claus Stadler, and Sebastian Hellmann, "DBpedia and the live extraction of structured data from Wikipedia," Program, vol. 46, no. 2, pp. 157–181, 2012.
[25] Jürgen Umbrich, Sebastian Neumaier, and Axel Polleres, "Quality assessment and evolution of open data portals," in Future Internet of Things and Cloud (FiCloud), 2015 3rd International Conference on. IEEE, 2015, pp. 404–411.

Table of Contents

Acknowledgments iv

Usage v

Summary vi

Extended Abstract vii

Table of Contents xiii

List of Figures xvi

List of Tables xix

1 Preface 1
1.1 Introduction ...... 1
1.2 Research Question ...... 1
1.3 Outline ...... 2

2 Background 3
2.1 Semantic Web ...... 3
2.2 RDF ...... 3
2.3 SPARQL ...... 4
2.4 RDF Storage Layout ...... 4
2.4.1 RDBMS Storage ...... 4
2.4.1.1 Triple Table ...... 5
2.4.1.2 Property Table ...... 5
2.4.1.3 Vertical Partitioning ...... 5
2.4.2 NoSQL Storage ...... 7
2.4.3 Native Storage ...... 7
2.5 Archiving ...... 8
2.5.1 Non-RDF Archives ...... 8
2.5.2 RDF Archives ...... 8
2.5.2.1 Independent Copies ...... 8
2.5.2.2 Change-Based ...... 9
2.5.2.3 Timestamp-Based ...... 9
2.5.2.4 Hybrid ...... 9
2.6 Query Types ...... 10
2.6.1 Independent Copies ...... 10
2.6.2 Change-Based ...... 11
2.6.3 Timestamp-Based ...... 11
2.7 RDF Archive Benchmarks ...... 11
2.7.1 BEAR ...... 11
2.7.2 EvoGen ...... 11
2.7.3 SPBv ...... 11


3 Use Case: Friend Network 13
3.1 Use Case ...... 13
3.2 Requirements ...... 14
3.3 Need ...... 14

4 Storage Optimization: Bidirectional Delta Chain 15
4.1 Advantages Bidirectional Delta Chain ...... 15
4.1.1 Advantages Non-Aggregated Bidirectional Delta Chain ...... 15
4.1.2 Advantages Aggregated Bidirectional Delta Chain ...... 16
4.2 Disadvantages Bidirectional Delta Chain ...... 16
4.3 Hypotheses ...... 18

5 OSTRICH Overview 19
5.1 Storage Structure ...... 19
5.1.1 Snapshot Storage ...... 19
5.1.2 Delta Chain Dictionary ...... 19
5.1.3 Delta Storage ...... 20
5.1.3.1 Local Change Flags ...... 20
5.1.3.2 Deletion Relative Position ...... 20
5.1.3.3 Multiple Indexes ...... 21
5.1.3.4 Addition Counts ...... 21
5.1.3.5 Deletion Counts ...... 21
5.1.3.6 Metadata ...... 21
5.2 Ingestion ...... 21
5.3 Queries ...... 22
5.3.1 Version Materialization Query ...... 22
5.3.1.1 Query ...... 22
5.3.1.2 Result Count ...... 23
5.3.2 Delta Materialization Query ...... 23
5.3.2.1 Query ...... 23
5.3.2.2 Result Count ...... 24
5.3.3 Version Query ...... 24
5.3.3.1 Query ...... 24
5.3.3.2 Result Count ...... 24

6 Bidirectional RDF Archive 25
6.1 Storage Structure ...... 25
6.2 Multiple Snapshots ...... 25
6.3 Ingestion ...... 25
6.3.1 Out-of-order Ingestion ...... 26
6.3.2 In-order Ingestion: Fix-Up Algorithm ...... 26
6.4 Query ...... 28
6.4.1 Version Materialized Query ...... 28
6.4.2 Delta Materialized Query ...... 29
6.4.2.1 Query ...... 29
6.4.2.2 Result Count ...... 30
6.4.3 Version Query ...... 30
6.4.3.1 Query ...... 30
6.4.3.2 Result Count ...... 32

7 Evaluation 33
7.1 COBRA Implementation ...... 33
7.2 Experimental Setup ...... 33
7.2.1 Ingestion ...... 33
7.2.2 Query ...... 34
7.3 Results ...... 34
7.3.1 Ingestion Results ...... 34
7.3.2 Query Results ...... 34

7.4 Discussion ...... 39
7.4.1 Ingestion Evaluation ...... 46
7.4.1.1 Storage Size ...... 46
7.4.1.2 Ingestion Time ...... 47
7.4.2 Query Evaluation ...... 47
7.4.3 Hypotheses Evaluation ...... 47

8 Conclusion and Future Work 49
8.1 Conclusion ...... 49
8.2 Future Work ...... 50

Bibliography 51

Appendices 54
A BEAR-A Query Results ...... 55
B BEAR-B daily Query Results ...... 61
C BEAR-B hourly Query Results ...... 62

List of Figures

2.1 An example RDF graph. ...... 4
2.2 Unidirectional delta chain, as done in TailR. ...... 10
2.3 Unidirectional delta chain where all deltas are relative to the snapshot at the beginning of the chain, as done in OSTRICH. ...... 10

3.1 Friend Network Example ...... 13

4.1 A simplified non-aggregated bidirectional delta chain. ...... 15
4.2 A simplified aggregated bidirectional delta chain. ...... 16
4.3 An example to showcase unidirectional and bidirectional non-aggregated delta chains. Triples are represented by numbers. ...... 17
4.4 An example to showcase unidirectional and bidirectional aggregated delta chains. Triples are represented by numbers. ...... 17

5.1 An overview of the storage structure used in OSTRICH...... 20

6.1 An overview of the storage structure of a bidirectional delta chain. Figure adapted from OSTRICH. ...... 26
6.2 An illustration of the fix-up algorithm. ...... 28
6.3 State of the delta chains before the fix-up algorithm is applied. ...... 28

7.1 Comparison of the cumulative storage sizes (in GB) per version for the first eight versions of the BEAR-A benchmark. ...... 35
7.2 Comparison of the cumulative ingestion times (in hours) per version for the first eight versions of the BEAR-A benchmark. ...... 35
7.3 Comparison of the individual ingestion times (in minutes) per version for the first eight versions of the BEAR-A benchmark. ...... 36
7.4 Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B daily benchmark. ...... 36
7.5 Comparison of the cumulative ingestion time (in min) per version of the BEAR-B daily benchmark. ...... 37
7.6 Comparison of the individual ingestion time (in min) per version of the BEAR-B daily benchmark. ...... 37
7.7 Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B hourly benchmark. ...... 38
7.8 Comparison of the cumulative ingestion time (in min) per version of the BEAR-B hourly benchmark. ...... 38
7.9 Comparison of the individual ingestion time (in min) per version of the BEAR-B hourly benchmark. ...... 40
7.10 Average VM query durations for all triple patterns in BEAR-A. ...... 40
7.11 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-A. ...... 41
7.12 Average DM query durations between all versions for all triple patterns in BEAR-A. ...... 41
7.13 Average VQ query durations for all triple patterns in BEAR-A. ...... 42
7.14 Average VM query durations for all provided triple patterns in BEAR-B daily. ...... 42


7.15 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B daily. ...... 43
7.16 Average DM query durations between all versions for all triple patterns in BEAR-B daily. ...... 43
7.17 Average VQ query durations for all provided triple patterns in BEAR-B daily. ...... 44
7.18 Average VM query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly. ...... 44
7.19 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B hourly. ...... 45
7.20 Average DM query durations between all versions for all triple patterns in BEAR-B hourly. ...... 45
7.21 Average VQ query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly. ...... 46

1 Average VM query durations for SPO triple patterns in the first eight versions of BEAR-A. ...... 55
2 Average VM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A. ...... 55
3 Average VM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 55
4 Average VM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 55
5 Average VM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 56
6 Average VM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 56
7 Average VM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 56
8 Average VM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 56
9 Average VM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 56
10 Average VM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 56
11 Average VM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 57
12 Average VM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 57
13 Average DM query durations for SPO triple patterns in the first eight versions of BEAR-A. ...... 57
14 Average DM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A. ...... 57
15 Average DM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 57
16 Average DM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 57
17 Average DM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 58
18 Average DM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 58
19 Average DM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 58
20 Average DM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 58
21 Average DM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 58
22 Average DM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 58

23 Average DM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 59
24 Average DM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 59
25 Average VQ query durations for SPO triple patterns in the first eight versions of BEAR-A. ...... 59
26 Average VQ query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A. ...... 59
27 Average VQ query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 59
28 Average VQ query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 59
29 Average VQ query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 60
30 Average VQ query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 60
31 Average VQ query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 60
32 Average VQ query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 60
33 Average VQ query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 60
34 Average VQ query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 60
35 Average VQ query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 61
36 Average VQ query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 61
37 Average VM query durations for ?P? triple patterns in BEAR-B daily. ...... 61
38 Average DM query durations for ?P? triple patterns in BEAR-B daily. ...... 61
39 Average VQ query durations for ?P? triple patterns in BEAR-B daily. ...... 61
40 Average VM query durations for ?PO triple patterns in BEAR-B daily. ...... 62
41 Average DM query durations for ?PO triple patterns in BEAR-B daily. ...... 62
42 Average VQ query durations for ?PO triple patterns in BEAR-B daily. ...... 62
43 Average VM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly. ...... 62
44 Average DM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly. ...... 62
45 Average VQ query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly. ...... 63
46 Average VM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly. ...... 63
47 Average DM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly. ...... 63
48 Average VQ query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly. ...... 63

List of Tables

2.1 An example triple table. ...... 5
2.2 An example property table. ...... 6
2.3 An example vertical partitioning table. ...... 6

5.1 Overview of which index OSTRICH uses for each triple pattern...... 21

7.1 Storage sizes and ingestion times of the different approaches for all three benchmarks. COBRA-POST FIX UP represents the in-order ingestion of the bidirectional delta chain using the fix-up algorithm. Therefore, the ingestion time is the sum of the ingestion time of COBRA-PRE FIX UP and the fix-up time. ...... 39

Acronyms

CVS Concurrent Versions System. ix, 8

CB Change-Based. ix, 8, 9, 11, 13, 15
CM Change Materialization. ix, 10, 14
COBRA Change-based Offset-enabled Bidirectional RDF Archive. ix, 17, 20
CSA Compressed Suffix Array. ix
CV Cross-Version join. ix, 10

DBMS Database Management Systems. ix, 4
DM Delta Materialization. ix, 10, 14, 18, 19

FOAF Friend Of A Friend. ix, 13

IC Independent Copies. ix, 8–11

LUBM Lehigh University Benchmark. ix, 11

RDBMS Relational Database Management Systems. ix, 4, 8, 13
RDF Resource Description Framework. vii, ix, 2–9, 11, 13–15

SCM Software Configuration Management. ix, 7
SPARQL SPARQL Protocol And RDF Query Language. ix, 2, 3, 9

TB Timestamp-Based. ix, 9, 11, 18

VCS Version Control System. ix, 7
VM Version Materialization. ix, 10, 11, 14, 15, 17, 18
VQ Version Query. ix, 10, 11, 14, 18, 19

Chapter 1

Preface

1.1 Introduction

Datasets change over time for numerous reasons, such as the addition of new information or the correction of erroneous information. Capturing this evolution allows for historical analyses which can provide new insights. For example, historical market data can be used to predict future market trends [2]. Linked Data is no exception to this. Linked Data is data that is structured in such a way that it is understandable for machines. The standard way of modeling Linked Data is RDF, which uses a triple structure. This triple structure can be used to form graphs, where the subject and object are nodes linked together via the predicate. Archiving RDF data has been an area of research for a few years [3]. The archiving strategies can be categorized into three groups [3]:

• Independent Copies (IC) - Every version is stored fully materialized.
• Change-Based (CB) - Only changes between versions are stored.
• Timestamp-Based (TB) - Triples are annotated with their temporal validity.

One particular research focus is enabling offsettable query streams for RDF archives, since query streams are more memory-efficient for large query results and the offset allows for faster queries when only a subset is needed. OSTRICH [1] is the state of the art when it comes to offset-enabled RDF archives. OSTRICH is an IC, CB and TB hybrid RDF archive that stores versions in a delta chain that starts with a fully materialized snapshot, followed by a series of deltas that all reference this snapshot; a minimal sketch of this layout follows below. However, OSTRICH has a large ingestion time for large dataset versions. The ingestion time can be reduced by introducing additional snapshots; however, this can in turn increase the storage size, since snapshots are fully materialized. In this work, we will explore if we can reduce the resulting storage size increase of the multiple-snapshot approach, while maintaining the ingestion time reduction, by restructuring the delta chain.
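As a minimal illustration of this layout, the sketch below models an OSTRICH-style delta chain with aggregated, snapshot-referencing deltas in C++ (the language the implementation in this work uses); the plain set-based types are simplified stand-ins for OSTRICH's actual compressed storage structures.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;

// An aggregated delta: all changes of a version relative to the snapshot.
struct Delta { TripleSet additions, deletions; };

struct DeltaChain {
    TripleSet snapshot;        // version 0, fully materialized
    std::vector<Delta> deltas; // versions 1..n, all relative to the snapshot
};

// With aggregated deltas, any version is one delta application away
// from the snapshot, regardless of its position in the chain.
TripleSet materialize(const DeltaChain& chain, std::size_t version) {
    TripleSet result = chain.snapshot;
    if (version > 0) {
        const Delta& d = chain.deltas[version - 1];
        for (const Triple& t : d.deletions) result.erase(t);
        for (const Triple& t : d.additions) result.insert(t);
    }
    return result;
}
```

Because every delta references the snapshot directly, materializing any version costs a single delta application, at the price of deltas that keep growing with their distance to the snapshot; that growing storage cost is exactly what this work tries to reduce.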

1.2 Research Question

The research question is as follows: “How much can we reduce storage usage of change-based RDF archives by restructuring the delta chain?”


1.3 Outline

The remainder of this thesis is structured as follows. Chapter 2 will explain all the necessary concepts, terms and techniques needed to discuss RDF archives. Next, Chapter 3 will showcase a use case in order to highlight the need for an RDF archive for large datasets. Chapter 4 will then present our storage optimization and the corresponding hypotheses. We will apply this storage optimization to OSTRICH, so Chapter 5 will first give a detailed overview of OSTRICH. The implementation of the storage optimization will then be discussed in Chapter 6. In Chapter 7, we evaluate our implementation and compare it with OSTRICH. Finally, a conclusion is presented in Chapter 8, alongside an overview of possible future work.

Chapter 2

Background

In order to fully understand this thesis, it is important to explain certain concepts and list which technologies are available. First, the Semantic Web will be explained, followed by its key technologies RDF and SPARQL. Second, popular RDF storage techniques will be listed. Third, RDF archiving techniques will be outlined. Fourth, an overview of query types will be given. Finally, the most popular versioned RDF benchmarks will be listed.

2.1 Semantic Web

In 2001, Tim Berners-Lee, the inventor of the World Wide Web, proposed the idea of the Semantic Web [4]. The goal of the Semantic Web is to make data on the Web understandable to machines so that they can perform complex tasks. In order to make this vision a reality, Linked Data was introduced. As described by Bizer et al. [5], Linked Data (LD) refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can be linked to from external data sets. RDF and SPARQL are two key technologies of Linked Data, which will be explained in the following sections.

2.2 RDF

RDF [6] is a key technology of Linked Data for representing data. RDF uses triples to organize data. An RDF triple is interpreted as follows: the subject has a certain property (the predicate) with value object. RDF triples form statements, which can be represented by a directed graph wherein the object and subject are nodes that are linked by a predicate. The subject and predicate are represented as resource URIs, while the object can be either a resource URI, a literal or a blank node. Figure 2.1 is an example RDF graph; for the sake of brevity, no URIs are used. Figure 2.1 describes me and contains the following triples:


Figure 2.1: An example RDF graph.
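A triple can be modeled directly as a three-field record. The sketch below builds a tiny hypothetical graph; the FOAF-style IRIs are illustrative and are not the triples of Figure 2.1.

```cpp
#include <iostream>
#include <string>
#include <vector>

// An RDF triple: subject and predicate are resource IRIs; the object
// may be an IRI, a literal, or a blank node (all modeled as strings here).
struct Triple {
    std::string subject, predicate, object;
};

int main() {
    // A tiny graph: nodes linked together via predicates.
    std::vector<Triple> graph = {
        {"ex:alice", "foaf:name",  "\"Alice\""},
        {"ex:alice", "foaf:knows", "ex:bob"},
        {"ex:bob",   "foaf:name",  "\"Bob\""},
    };
    for (const Triple& t : graph)
        std::cout << t.subject << ' ' << t.predicate << ' ' << t.object << " .\n";
}
```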

2.3 SPARQL

Although other RDF query languages exist, such as RQL [7] and (i)RDQL [8], SPARQL is the W3C's recommended RDF query language [9]. SPARQL is a graph-based pattern matching query language for RDF data. The graph-based patterns we will focus on are made up of triple patterns, which like RDF triples contain a subject, predicate and object. However, in this case, the subject, predicate and object can each be either fixed or variable. The query processor will try to match these patterns with elements of the domain, by adhering to the fixed components and filling in the variable components. SPARQL has four query forms [9]:
• SELECT - Returns all, or a subset of, the variables bound in a query pattern match.
• CONSTRUCT - Returns an RDF graph constructed by substituting variables in a set of triple templates.
• ASK - Returns a boolean indicating whether a query pattern matches or not.
• DESCRIBE - Returns an RDF graph that describes the resources found.
These query forms are typically followed by a 'WHERE' clause that limits the results. The 'WHERE' clause uses pattern matching on the triples, as mentioned above. As an illustration, the following query selects all book titles [9]:

Listing 2.1: An example SPARQL query that fetches all book titles.
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE { ?book dc:title ?title }
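The pattern matching described above can be sketched compactly; encoding a variable component as an empty string is a choice of this sketch, not something mandated by SPARQL.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct Triple { std::string s, p, o; };

// A triple pattern: an empty component acts as a variable,
// a non-empty component must match exactly.
struct Pattern { std::string s, p, o; };

bool matches(const Triple& t, const Pattern& q) {
    return (q.s.empty() || q.s == t.s) &&
           (q.p.empty() || q.p == t.p) &&
           (q.o.empty() || q.o == t.o);
}

int main() {
    std::vector<Triple> graph = {
        {"ex:book1", "dc:title",   "\"SPARQL Tutorial\""},
        {"ex:book2", "dc:creator", "ex:alice"},
    };
    Pattern allTitles{"", "dc:title", ""};  // ?book dc:title ?title
    for (const Triple& t : graph)
        if (matches(t, allTitles))
            std::cout << t.s << " has title " << t.o << '\n';
}
```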

2.4 RDF Storage Layout

This section gives an overview of general RDF storage layouts. They can be divided into two groups, namely native and non-native storage techniques [10, 11]. Native storage techniques are dedicated storage techniques that have been built from scratch, while non-native techniques make use of existing Database Management Systems (DBMSs). We further divide the non-native techniques into Relational Database Management Systems (RDBMSs) and NoSQL storage techniques.

2.4.1 RDBMS Storage

RDBMSs have been around for decades and have therefore become extremely optimized. Consequently, many techniques have been proposed to map RDF data onto an RDBMS [12–14]. Three techniques will be discussed, namely triple tables, property tables and vertical partitioning.

SUBJECT  PROPERTY      OBJECT
ID1      Name          "John Smith"
ID1      Salary        1800
ID1      Department    "Human Resources"
ID2      Name          "Jane Tully"
ID2      Salary        2000
ID2      Department    "Management"
ID2      Bonus         100
ID2      Phone Number  (251) 546-9442
ID3      Name          "Rob Phelps"
ID3      Salary        1600
ID3      Department    "Human Resources"

Table 2.1: An example triple table.

2.4.1.1 Triple Table

The triple table technique maps triples into a single table with three columns for the subject, property and object. While this technique is very flexible, it has some performance issues. Since all triples are stored in a single table, queries require a lot of expensive self-joins for certain triple patterns [14]. Furthermore, this single table can quickly become too large to fit in memory, making queries even slower. Table 2.1 is an example triple table for an employee database.

2.4.1.2 Property Table

Wilkinson et al. [12, 13] proposed property tables as a solution for the scalability problems of triple tables. Property tables try to group related RDF nodes in order to reduce query time and storage requirements. They consist of a subject column and several property columns. Triples that cannot be grouped are simply stored in a leftover triple table. Table 2.2 is an example property table for the same employee database. Wilkinson et al. [13] also discuss property-class tables, which are a special case of property tables. The idea behind property-class tables is to store nodes of the same class together; in essence, this corresponds to storing the value of 'rdf:type' in a property table. The most important advantage of property tables over triple tables is the faster query time. The speed-up is due to the fact that some self-joins can be avoided, since related nodes are stored in the same row. In addition, the storage size is generally lower than with the triple table approach. The disadvantages are that the tables can become very sparse with NULLs due to unknown values, and that property tables cannot handle multi-valued attributes efficiently. Due to these disadvantages and their general complexity, property tables have not been widely adopted except in specialized cases [14].

2.4.1.3 Vertical Partitioning

Abadi et al. [14] proposed a new storage solution called SW-Store. SW-Store utilizes a technique called vertical partitioning which, similar to a property table, groups triples. For every predicate, there is a two-column table which contains the subjects and objects of the matching triples. Unlike property tables, multi-valued attributes are handled by simply storing the subject with all possible objects. Furthermore, NULL values, i.e. unknown values, do not need to be stored. Table 2.3 is an example vertical partitioning layout for the same employee database.

SUBJECT   NAME           SALARY   DEPARTMENT          PHONE NUMBER
ID1       "John Smith"   1800     "Human Resources"   NULL
ID2       "Jane Tully"   2000     "Management"        (251) 546-9442
ID3       "Rob Phelps"   1600     "Human Resources"   NULL

(a) Property Table

SUBJECT   PROPERTY   OBJECT
ID2       Bonus      100

(b) Leftover Triple Table

Table 2.2: An example property table.

(a) Name Table
SUBJECT   OBJECT
ID1       "John Smith"
ID2       "Jane Tully"
ID3       "Rob Phelps"

(b) Salary Table
SUBJECT   OBJECT
ID1       1800
ID2       2000
ID3       1600

(c) Department Table
SUBJECT   OBJECT
ID1       "Human Resources"
ID2       "Management"
ID3       "Human Resources"

(d) Phone Number Table
SUBJECT   OBJECT
ID2       (251) 546-9442

(e) Bonus Table
SUBJECT   OBJECT
ID2       100

Table 2.3: An example vertical partitioning table.

2.4.2 NoSQL Storage

With the rising popularity of NoSQL databases, multiple RDF mappings onto NoSQL stores have been proposed. Most of these systems use popular NoSQL stores as a backbone. Three major groups of NoSQL stores can be distinguished, namely key-value stores, document stores and column stores [15].

Key-value stores are NoSQL stores that manage a dictionary. Since RDF uses triples and key-value stores only deal with pairs, indexing can be difficult. The AWETO [16] system tries to solve this indexing problem by using four index orders: S-PO, P-SO, P-OS, and O-PS.

Document stores are more complex key-value stores, as they allow encapsulating (key, value)-pairs in documents. Typically, RDF document stores rely on popular document stores such as MongoDB and CouchDB.

Column stores store and retrieve data by column instead of by row. Unique keys are used to connect related column data. In addition, data can be indexed both row-wise and column-wise. For instance, Rya [17] is an RDF store that uses Accumulo, a column store similar to Google Bigtable, as a backbone.

2.4.3 Native Storage

Native storage techniques are dedicated storage techniques that use RDF’s triple structure to their benefit.

YARS [18] is an optimized multi-index system that is designed for fast queries. The RDF triple is extended with context information that refers to the provenance of the data and is referred to as a quad. Since each quad component can be fixed or variable, there are 2^4, or 16, access patterns. These access patterns can be covered by only six indexes that utilize B+-trees. Furthermore, the string representation of each quad component is mapped to a short integer ID. These mappings are stored in a dictionary, which is used to convert from ID to string representation and vice versa. By working with IDs instead of string representations, the indexes take less storage space. Furthermore, queries are faster since IDs can be compared more efficiently with each other. YARS2 [19] extends YARS to a distributed system; in particular, distributed indexing methods and parallel query evaluation methods are presented.

Hexastore [20] uses six indexes, one for each permutation of the triple, namely SPO, SOP, OPS, PSO, OSP, and POS. Moreover, a dictionary is also used to map the RDF terms to keys. As an example, in the SPO index, a subject key is linked to a sorted vector of property keys, which all point to a list of object keys.

RDF-3X [21] stores all triples in the leaf pages of a compressed clustered B+-tree. Fifteen indexes are used in total: six for all permutations of the triple, six for the aggregated indexes and three for the one-valued indexes. By storing the triples lexicographically, SPARQL queries can be converted into range scans. In addition, the string literals are replaced with IDs using a dictionary, resulting in faster queries and a lower storage size.

BitMat [22] is a compact bit matrix structure for representing a large number of RDF triples. The data is represented as a bit cube with subject, predicate and object as dimensions, wherein each cell represents whether the triple exists or not. This binary matrix allows for efficient joins with the use of binary AND/OR operations.

TripleBit [23] is a compact RDF store that relies on a bit matrix storage structure. The RDF triples are represented as a two-dimensional bit matrix with RDF links (properties) as columns and RDF nodes (subjects, objects) as rows. Each cell consists of a boolean that indicates whether a triple is present or not, resulting in a sparse matrix that can be compressed efficiently. In addition, dictionary encoding is used to reduce the storage requirements even further. In order to speed up queries, two auxiliary indexes are used, namely an ID-Chunk matrix and an ID-Predicate bit matrix. The ID-Chunk matrix is used to quickly find the chunks matching an RDF node (subject, object). The ID-Predicate bit matrix is used to find the related predicates for a given RDF node (subject, object).

HDT [24] is a binary representation for exchanging RDF data, so compression is the main focus. HDT consists of three parts:
• header - contains metadata and serves as an entry point to the data
• dictionary - mapping between triple components and unique identifiers, referred to as dictionary encoding
• triples - structure of the underlying RDF graph after dictionary encoding
HDT resolves queries on the compressed data, but only has one index (SP-O), making certain triple patterns hard to resolve. In addition, by design HDT stores are immutable after creation, making them unsuitable for volatile datasets.

HDT-FoQ [25] is an extension of HDT [24] that focuses on resolving queries faster. For this reason, HDT-FoQ adds two additional indexes, namely PS-O and OP-S, to cover more access patterns. The PS-O index makes use of a wavelet tree, while the OP-S index uses adjacency lists, similar to the SP-O index.

Waterfowl [26] builds on HDT-FoQ [25] by using wavelet trees in the SP-O index, instead of adjacency lists. To the best of our knowledge, no data can be found on Waterfowl’s performance compared to HDT-FoQ.

2.5 Archiving

Version control, also called versioning in this document, refers to the management of changes to a collection of information, for example a dataset or codebase. Version control for codebases has been around for over four decades [27], proving how invaluable rolling back to a previous version is. Many version control techniques can be reused in archiving techniques. We will consider both non-RDF archiving techniques and RDF archiving techniques.

2.5.1 Non-RDF Archives

Many techniques from non-RDF archives and Version Control Systems (VCSs) can be repurposed for versioning RDF archives. RCS [29] is a delta-based VCS, wherein each delta consists of insertions and deletions of lines. The latest version is stored completely and older revisions are stored in so-called reverse deltas, resulting in quick access to the latest version. To add a new revision, the system stores the latest revision completely and replaces the previous revision by its delta, keeping the rest of the chain intact.

2.5.2 RDF Archives

This section gives an overview of existing RDF archive approaches. Fernández et al. [30] distinguish three groups of archive storage policies.

2.5.2.1 Independent Copies

In the IC approach the dataset is stored independently for every version. Since triples are repeated many times across different versions, IC systems typically have a higher storage requirement than other approaches. Due to its straightforward and simple nature, the IC approach is used in popular systems such as the Dynamic Linked Data Observatory [31] and DBpedia [30]. SemVersion [32] is an IC versioning system for RDF that tries to emulate classical Concurrent Versions System (CVS) systems for version management. Each version is stored separately in RDF stores that conform to a certain API, which manages said versions.

2.5.2.2 Change-Based

The CB approach tries to solve the large storage requirement of the IC approach by only storing the changes between versions. These changes, sometimes called deltas, typically consist of a set of additions and a set of deletions. However, only storing the changes introduces a version materialization cost: the cost of reconstructing a database version by applying all the changes up to that point. It is clear that this cost will increase with every new version of the dataset.

Cassidy et al. [33] propose a CB RDF archive that is built on Darcs’ theory of patches [51] - a mathematical model that describes how patches can be manipulated in order to get the desired version in the context of software. This model describes fundamental operations, such as the commute operation, the revert operation, and the merge operation. Cassidy et al. adapt these operations so that they are applicable to RDF stores as well.

Im et al. [34] introduced a CB store on top of an RDBMS. They propose an aggregated deltas approach wherein not only the delta between a parent and child, but all possible deltas are stored. This results in an increased storage overhead, but a decreased version materialization cost compared to the classic sequential delta chain.

Vander Sande et al. [35] introduce R&WBase - a distributed CB RDF archive, wherein versions are stored as consecutive deltas. Deltas between versions consist of an addition set and a deletion set, respectively listing which triples have been added and deleted. Since deltas are stored in the same graph, triples are annotated with a context number, indicating to which version the triple belongs and whether it was added or deleted. In particular, an even context number indicates the triple is an addition and an uneven context number indicates the triple is a deletion. Queries can be handled efficiently by looking at the highest context number: if that context number is even, the triple is present in that version; if it is uneven, the triple is not present in that version (a small sketch of this rule follows at the end of this subsection). Finally, R&WBase also supports tagging, branching, and merging of datasets.

R43ples [36] is another CB RDF archive, since it groups additions and deletions in named graphs. R43ples allows manipulation of revisions with SPARQL, by introducing new keywords such as REVISION, TAG and BRANCH. Versions are materialized by starting from the head of the branch and applying all prior additions/deletions.
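The following is a minimal sketch of the R&WBase context-number rule described above; it is our own illustration, not R&WBase's actual code, and it assumes that version v annotates additions with context number 2v and deletions with context number 2v + 1.

#include <algorithm>
#include <vector>

// Decide whether a triple, annotated with the given context numbers, is
// present in the requested version: take the highest relevant context
// number; even means "added", uneven means "deleted".
bool triple_present(const std::vector<int>& contexts, int version) {
    int highest = -1;
    for (int c : contexts)
        if (c / 2 <= version)                 // context belongs to a version <= v
            highest = std::max(highest, c);   // keep the highest context number
    return highest >= 0 && highest % 2 == 0;  // even: the triple is present
}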

2.5.2.3 Timestamp-Based

In the TB approach triples are annotated with creation and deletion timestamps. These annotations ensure that no triples are stored more than once.

Hauptmann et al. [37] propose a delta-based store similar to R43ples, including complete graphs and version control via SPARQL. However, in Hauptmann’s approach, each triple is virtually annotated with version information that is cached using a hash table, making it a TB approach.

x-RDF-3X [38] extends RDF-3X [21] with versioning support. Each triple is annotated with a creation timestamp and, when appropriate, a deletion timestamp, making it a TB approach.

v-RDFCSA [39] is a TB archiving extension of RDFCSA [40], a compact self-indexing RDF store that is based on suffix arrays.

Dydra [41] is an RDF archive that stores versions as named graphs in a quad store, which can be queried using the REVISION SPARQL keyword. Dydra uses B+-trees with six indexes: GSPO, GPOS, GOSP, SPOG, POSG, OSPG. The B+-tree values indicate for which revisions a particular quad is visible, making it a TB system.

Figure 2.2: Unidirectional delta chain, as done in TailR.

Figure 2.3: Unidirectional delta chain where all deltas are relative to the snapshot at the beginning of the chain, as done in OSTRICH.

2.5.2.4 Hybrid

In the hybrid approach, the three aforementioned archiving strategies are combined. TailR [42] interleaves fully materialized versions (snapshots) in between the delta chain. The snapshots reset the version materialization cost but can lead to a higher storage requirement. OSTRICH [1] is another hybrid solution that interleaves fully materialized snapshots in between the delta chain, as seen in Figure 2.3. However, unlike TailR, OSTRICH uses aggregated deltas [34], i.e. deltas that refer directly to the snapshot instead of to the previous version. Moreover, the delta chain is stored by annotating each triple with version information, making it an IC, CB and TB hybrid. OSTRICH focuses on providing memory-efficient query streams which can be offsetted. In addition, OSTRICH also provides query count estimation functionality, which can be used as a basis for query optimization in query engines [43].

2.6 Query Types

Fernández et al. [3] identified five fundamental query groups, referred to as query atoms:
• Version Materialization (VM) queries retrieve data from a single version. For example, “Which paintings are on display today?”.
• Delta Materialization (DM) queries retrieve the differences between two versions. For example, “Which paintings were added or removed between yesterday and today?”.
• Version Query (VQ) annotates query results with the version numbers wherein the data exists. For example, “When was the ’Mona Lisa’ on display?”.
• Cross-Version join (CV) joins the results of two queries over two different versions. For example, “Which paintings were on display yesterday and today?”.
• Change Materialization (CM) returns a list of versions in which a given query produces consecutively different results. For example, “When was the ’Mona Lisa’ put on display or removed from display?”.
Although other query classifications exist [44], we will only refer to the above-mentioned query atoms, for the sake of simplicity. Some storage policies are better suited for certain query atoms than others.

2.6.1 Independent Copies

Since all versions are fully materialized and indexed in the IC approach, VM queries are relatively simple. DM and CV queries are moderately complex because two queries need to be executed. VQ and CM queries are very complex because all versions need to be queried.

2.6.2 Change-Based

Due to the version materialization cost, as discussed in Section 2.5.2.2, VM queries are more complex in the CB approach than in the IC approach. On the other hand, DM queries between neighboring versions are very efficient in the CB approach since exactly those changesets are stored.

2.6.3 Timestamp-Based

In the TB approach VQ queries are particularly efficient because the triples are naturally annotated with the version numbers wherein they exist. However, other query atoms are typically slower than in the IC approach due to the extra checks whether a triple is valid for a given version.

2.7 RDF Archive Benchmarks

A benchmark is a set of tests that measure the performance of a system. More importantly, it allows us to easily compare systems. In this section, three RDF archive benchmarks are presented, namely BEAR [3, 45], EvoGen [46] and SPBv [44].

2.7.1 BEAR

BEAR [3, 45] is a benchmark for RDF archives that utilizes real data from three different domains:
• BEAR-A - 58 weekly snapshots from the Dynamic Linked Data Observatory [31].
• BEAR-B - the 100 most volatile resources from DBpedia Live [47] over the course of three months, at three different granularities: instant, hourly and daily.
• BEAR-C - 32 weekly snapshots of the Open Data Portal Watch project [48].
The data is stored under four different policies. Under the IC policy each version is stored in a separate N-Triples file, while in the CB policy, only additions and deletions of triples are stored in separate N-Triples files. Under the TB policy, a named graph annotates the triples with versions, while the CBTB policy only annotates the triples which have changed. The BEAR benchmark also provides triple pattern queries and their corresponding results. BEAR-A provides triple pattern queries and their results for the following triple patterns: S??, ?P?, ??O, SP?, ?PO, S?O and SPO. BEAR-B provides triple pattern queries and their results for ?PO and ?P? triple patterns for the hourly and daily granularities. These triple patterns are based on the most frequent triple patterns from the DBpedia query set. BEAR-C provides 10 complex queries that cannot be efficiently resolved with current archiving strategies, but that could help foster the development of new query resolution algorithms.

2.7.2 EvoGen

EvoGen [46] is a highly configurable benchmark suite that generates synthetic and evolving RDF data. EvoGen is an extension of the Lehigh University Benchmark (LUBM) synthetic dataset, adding additional classes and properties to enable schema evolution. Parameters can be used to configure: instance evolution, schema evolution, query workload generation and archiving strategy.

2.7.3 SPBv

SPBv [44] is another highly configurable benchmark that generates RDF data based on the BBC’s media organization data, which they refer to as creative works. Creative works consist of properties such as: title, shortTitle, description, dateCreated, audience and format. The data generator tries to simulate the natural evolution of these creative works, by storing the creative works in different versions according to their creation date. SPBv can also be used to generate queries. However, unlike Fernández et al. [3], Papakonstantinou et al. consider eight query types:
• Modern version materialization queries fully materialize the latest version.
• Modern single-version structured queries are performed on the latest version.
• Historical version materialization queries fully materialize a version in the past.
• Historical single-version structured queries are performed on a version in the past.
• Delta materialization queries retrieve the delta between two versions.
• Single-delta structured queries are performed on the delta of two consecutive versions.
• Cross-delta structured queries are evaluated on the changes of multiple versions.
• Cross-version structured joins join the results of queries on several versions, thereby retrieving information common to several versions.

Chapter 3

Use Case: Friend Network

This chapter describes a use case that highlights the need for an RDF archive for datasets with large versions.

3.1 Use Case

The use case is a Friend Of A Friend (FOAF) network in social media, which captures information such as who is friends with whom. As an example, consider the following raw FOAF data:

ex:Trevor foaf:knows ex:John
ex:John foaf:knows ex:Trevor
ex:John foaf:knows ex:Amy
ex:Amy foaf:knows ex:John

RDF’s triple structure is a good choice to model this data [18]. Figure 3.1 depicts the resulting RDF graph. Since people tend to acquire and lose friends over time, FOAF networks evolve over time. In order to capture this evolution, we can version the data. For example, the system could periodically take a snapshot of the network and store it as a new version. Typically, a large part of the data will remain unchanged between versions. Moreover, versions can be perfectly described with additions and deletions of friends with respect to another version, making CB an excellent storage policy. As for storage size, the friend networks could become enormous when we are dealing with social media such as Facebook, which has 2.2 billion active accounts at the time of writing [49], with an average of 255 friends per account [50]. Therefore, our solution needs to be storage efficient. Finally, VM, DM and VQ queries could be performed on these friend networks. An example VM query could be: “Who were my friends in sixth grade?”. An example DM query could be: “Which friends did I add after switching schools?”. An example VQ query could be: “How long have I been friends with someone?”. Users could interact with the system using a web-based SPARQL endpoint.

Figure 3.1: Friend Network Example

Since we are dealing with high-volume data, query results could easily become too large to be displayed on a single web page. This means that when a page is loaded, only a subset of the query results is needed, making offsettable queries very efficient.

3.2 Requirements

From our use case we identify the following requirements for our system:
• an efficient RDF archive storage technique
• an efficient offsettable VM query stream algorithm
• an efficient offsettable DM query stream algorithm
• an efficient offsettable VQ query stream algorithm
• low storage requirements

3.3 Need

As previously mentioned, the use case handles a dataset with very large versions. Therefore, we need a storage-efficient solution with a low ingestion time. On the other hand, we also need to support fast and offsettable queries. In Chapter 2, we saw that OSTRICH [1] is the state-of-the-art in terms of offsettable RDF archives. While OSTRICH is storage-efficient compared to other RDF archives, OSTRICH has a large ingestion time that increases with the size of the delta chain. Taelman et al. suggest that additional snapshots can be used to limit the ingestion time; however this, in turn, could increase the storage size since snapshots are fully materialized. Therefore, there is a need for a storage solution that maintains the ingestion time of OSTRICH with multiple snapshots while also keeping the resulting storage size increase down.

Chapter 4

Storage Optimization: Bidirectional Delta Chain

As outlined in Section 3.3, there is a need for a storage solution that maintains the ingestion time of OSTRICH with multiple snapshots while also keeping the resulting storage size increase down. In this work we propose a storage optimization for CB RDF archives that is based on restructuring the delta chain. As seen in previous works [1, 42], a delta chain consists of a fully materialized snapshot followed by a series of deltas. The main idea behind our storage optimization is moving the snapshot from the front of the delta chain to the middle, in order to potentially reduce the overall storage size. This transforms the delta chain into a bidirectional delta chain, which divides the original delta chain into two smaller delta chains, i.e. the reverse delta chain and the forward delta chain. Figures 4.1 and 4.2 show two example bidirectional delta chains. In this chapter we will discuss the advantages and disadvantages of bidirectional delta chains.

4.1 Advantages of the Bidirectional Delta Chain

The advantages of the bidirectional delta chain depend on whether it is applied to an aggregated or a non-aggregated delta chain; both cases are explained hereafter.

4.1.1 Advantages of the Non-Aggregated Bidirectional Delta Chain

In a non-aggregated delta chain, all deltas reference the closest preceding version. So in order to materialize a version, all preceding deltas need to be applied until the fully materialized snapshot is reached. It follows that the version materialization cost scales with the length of the delta chain and the size of the deltas. As stated above, a bidirectional delta chain divides the original delta chain into two smaller delta chains. Moreover, the size of the deltas remains the same, since the reverse delta chain is just the inverse of the original deltas. Therefore, the worst-case materialization cost for bidirectional delta chains is half of that for unidirectional delta chains.

Figure 4.1: A simplified non-aggregated bidirectional delta chain.

Figure 4.2: A simplified aggregated bidirectional delta chain.

Figure 4.3 gives an example of both a unidirectional and a bidirectional non-aggregated delta chain. As can be seen, the reverse delta chain is simply the inverse of the original forward delta chain, so the sizes of the deltas remain equal. Moreover, in the bidirectional delta chain, the lengths of the delta chains have been halved, thus reducing the average version materialization cost. Bidirectional non-aggregated delta chains could also potentially reduce the storage size while maintaining a similar version materialization time. Indeed, if we compare a series of two unidirectional delta chains with a single bidirectional delta chain, one fewer snapshot needs to be stored.
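As a back-of-the-envelope illustration of this halving (our own cost model, counting one unit of work per applied delta), consider N versions stored as one snapshot plus N - 1 deltas. The worst-case number of delta applications is then:

C_max(unidirectional) = N - 1        C_max(bidirectional) = ceil((N - 1) / 2)

For example, for N = 9 versions the worst case drops from 8 applied deltas to 4.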

4.1.2 Advantages of the Aggregated Bidirectional Delta Chain

In an aggregated delta chain, all deltas reference a single snapshot, which means that an aggregated delta contains all the changes from all preceding deltas. In this work, we assume that a higher distance between versions results in a bigger aggregated delta. This assumption holds for datasets that steadily grow over time by adding new triples, because later versions will have more and more new triples compared to earlier versions. It follows that reducing the average distance between the snapshot and the versions results in smaller aggregated deltas, thus reducing the overall storage size. Bidirectional delta chains reduce the average distance between the snapshot and the other versions. Therefore, bidirectional delta chains should have a lower storage size compared to unidirectional delta chains for growing datasets.

Figure 4.4 gives an example of both a unidirectional and a bidirectional aggregated delta chain. As seen in Subfigure 4.4a, newer versions gradually increase in size, due to the addition of new triples. The corresponding unidirectional aggregated delta chain is shown in Subfigure 4.4b; as can be seen, these newer triples are repeated in every subsequent delta. However, in the bidirectional aggregated delta chain, shown in Subfigure 4.4c, the deltas become smaller since the triples are repeated for fewer versions, e.g. triple 4. Another way of reducing the average distance between the snapshot and the other versions is introducing an additional snapshot, as seen in Figure 6.2a. However, bidirectional delta chains have the advantage that they only need to store a single snapshot.

4.2 Disadvantages of the Bidirectional Delta Chain

A bidirectional delta chain contains a reverse delta chain - a delta chain where the deltas precede the reference snapshot. However, building such a delta chain is difficult when we need to insert versions in-order and do not know the future snapshot. Indeed, we cannot calculate the delta between the version we need to insert and the future snapshot if the future snapshot is not yet known. A fix-up algorithm is a potential way of solving this issue. In the fix-up algorithm, all versions are first stored in a forward delta chain. Once the future snapshot is inserted, the forward delta chain can be converted into a reverse delta chain. As discussed in Subsection 2.5.1, RCS [29] presents an incremental algorithm to build the reverse delta chain without the need for a fix-up algorithm. For this algorithm, the latest version is always stored fully materialized.


(a) A fully materialized example data set.


(b) An example unidirectional non-aggregated delta chain.


(c) An example bidirectional non-aggregated delta chain.

Figure 4.3: An example to showcase unidirectional and bidirectional non-aggregated delta chains. Triples are represented by numbers.

(a) A fully materialized example data set.


(b) An example unidirectional aggregated delta chain.


(c) An example bidirectional aggregated delta chain.

Figure 4.4: An example to showcase unidirectional and bidirectional aggregated delta chains. Triples are represented by numbers.

To add a new version, the system stores the new version completely and replaces the previous version by its delta, keeping the rest of the chain intact.
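The following is a minimal sketch of this RCS-style incremental scheme, assuming a version is a plain set of triples; all names are our own illustration and do not correspond to RCS's actual implementation.

#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

struct Delta { std::set<std::string> additions, deletions; };
using Version = std::set<std::string>;

// The delta that transforms 'from' into 'to'.
Delta diff(const Version& from, const Version& to) {
    Delta d;
    std::set_difference(to.begin(), to.end(), from.begin(), from.end(),
                        std::inserter(d.additions, d.additions.end()));
    std::set_difference(from.begin(), from.end(), to.begin(), to.end(),
                        std::inserter(d.deletions, d.deletions.end()));
    return d;
}

struct ReverseChain {
    std::vector<Delta> reverse_deltas;  // oldest version first
    Version head;                       // latest version, fully materialized

    void ingest(const Version& new_version) {
        // Replace the current head by the reverse delta that reconstructs
        // it from the new head; the rest of the chain stays intact.
        reverse_deltas.push_back(diff(new_version, head));
        head = new_version;
    }
};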

4.3 Hypotheses

In this section we propose five hypotheses regarding aggregated unidirectional delta chains and aggregated bidirectional delta chains.

The first hypothesis states: “Disk space will be significantly lower for a bidirectional delta chain compared to a unidirectional delta chain.” For the reasoning behind this hypothesis, we refer to Section 4.1.2.

The second hypothesis states: “In-order ingestion time will be lower for a unidirectional delta chain compared to a bidirectional delta chain.” This hypothesis stems from the fact that the fix-up algorithm needs to insert the versions in a temporary forward delta chain before they can be inserted in the reverse delta chain, and RCS needs to calculate a delta before a new version can be inserted.

The third hypothesis states: “The mean VM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.” The reasoning behind this hypothesis is that a VM query comes down to applying the stored aggregated delta to the snapshot, so whether a delta was stored in a reverse delta chain or a forward delta chain should not affect the VM query time.

The fourth hypothesis states: “The mean DM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.” We state this hypothesis because both the unidirectional and bidirectional delta chains store aggregated deltas.

The fifth hypothesis states: “The mean VQ query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.” We state this hypothesis because a VQ query has to iterate over every version to gather the version information of each triple, so whether a delta was stored in a reverse delta chain or a forward delta chain should not affect the VQ query time.

Chapter 5

OSTRICH Overview

We will apply the storage optimization discussed in Chapter 4 to OSTRICH [1]. Therefore, we give a detailed overview of OSTRICH in this chapter. The chapter is outlined as follows. First, we will give an overview of the storage structure. Next, we will give an overview of the ingestion process. Finally, we will give an overview of how VM, DM and VQ queries are handled.

5.1 Storage Structure

In this section, we will explain the storage structure of OSTRICH [1] in more detail. Figure 5.1 gives an overview of all the components in a delta chain, which will be explained in more detail in the following subsections.

5.1.1 Snapshot Storage

In OSTRICH, the first version is always stored as a fully materialized snapshot using HDT(-FoQ) [24, 25]. HDT is a good solution for storing snapshots since it has a low storage requirement. Moreover, HDT enables fast VM queries due to the fact that HDT stores are immutable and have multiple indexes. Furthermore, in HDT, query results can be represented as triple streams which can be offsetted. Finally, HDT also provides cardinality estimation for the query results, making it an excellent solution for snapshot storage.

5.1.2 Delta Chain Dictionary

A delta chain consists of two dictionaries that are used to encode the triple components, namely the snapshot dictionary and the delta dictionary. The snapshot dictionary is the dictionary used in HDT and stores all mappings for triple components present in the snapshot. The snapshot dictionary is static, so it can be sorted and compressed efficiently. The delta dictionary stores the triple components of newly added triples that were not present in the snapshot. The delta dictionary is volatile, since a new version can introduce new mappings. When a triple component needs to be encoded, the snapshot dictionary is probed first, followed by the delta dictionary in case there was no match in the snapshot dictionary. If neither dictionary contains a mapping, a new entry is created. When a triple component needs to be decoded, a reserved bit is checked that indicates whether the mapping is stored in the snapshot dictionary or the delta dictionary.
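A minimal sketch of this two-dictionary lookup is given below; the names and the exact reserved-bit layout are our own assumptions, not OSTRICH's actual code.

#include <cstdint>
#include <string>
#include <unordered_map>

// Most significant bit marks mappings that live in the delta dictionary.
constexpr uint64_t DELTA_BIT = 1ULL << 63;

struct DeltaChainDictionary {
    std::unordered_map<std::string, uint64_t> snapshot_dict;  // static
    std::unordered_map<std::string, uint64_t> delta_dict;     // volatile
    std::unordered_map<uint64_t, std::string> snapshot_rev, delta_rev;

    uint64_t encode(const std::string& component) {
        auto s = snapshot_dict.find(component);
        if (s != snapshot_dict.end()) return s->second;  // probe snapshot dict first
        auto d = delta_dict.find(component);
        if (d != delta_dict.end()) return d->second;     // then the delta dict
        uint64_t id = DELTA_BIT | delta_dict.size();     // neither: new delta entry
        delta_dict.emplace(component, id);
        delta_rev.emplace(id, component);
        return id;
    }

    const std::string& decode(uint64_t id) const {
        // The reserved bit tells which dictionary holds the mapping.
        return (id & DELTA_BIT) ? delta_rev.at(id) : snapshot_rev.at(id);
    }
};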


Figure 5.1: An overview of the storage structure used in OSTRICH [1].

5.1.3 Delta Storage

OSTRICH stores subsequent versions in an aggregated delta chain. However, aggregated deltas often contain duplicate changes across the deltas, since they contain all previous deltas; therefore a TB-like approach is used to compress the deltas. Unlike in the regular TB approach, where triples are annotated with timestamps, OSTRICH annotates the triples with the versions wherein the triple exists, meaning triples are stored only once. A delta chain consists of a set of triple additions and a set of triple deletions, which are stored separately due to the requirements of certain query algorithms. Both additions and deletions are stored and indexed in B+-trees, with the encoded triple acting as key and the corresponding version information as value. The version information consists of the triple timestamp information, local change flags and, in the case of deletions, also the relative position of the triple inside the delta chain. The latter two will be explained in the following sections.
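Conceptually, the value stored per triple key could look like the following sketch; the field names are our own illustration, and OSTRICH's actual layout differs in its details.

#include <cstdint>
#include <vector>

struct VersionInfo {
    std::vector<uint32_t> versions;       // versions in which this change holds
    std::vector<bool>     local_change;   // per version: local change flag
    // Deletions only: position of this triple among all sorted deletions,
    // used for offsetting results and for deriving deletion counts.
    std::vector<uint32_t> relative_positions;
};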

5.1.3.1 Local Change Flags

The local change flags indicate whether a triple is a local change. A local change refers to a series of triple instances in the delta chain that negate each other, for example a triple that is deleted in version 1 and added again in version 2. Since it is difficult to determine whether a change is local, OSTRICH stores this information as a local change flag for each version of a triple during ingestion. Local change flags improve VM query evaluation, since local changes can be filtered out.

5.1.3.2 Deletion Relative Position

The relative position of a deletion is the position the deletion triple would have if all deletion triples in the delta chain were sorted. The benefits of storing the relative position for deletions are twofold. First, it allows query algorithms to efficiently offset query results. Second, as we will describe in Section 5.1.3.5, it allows us to efficiently find the deletion count for any triple pattern.

Triple pattern   Index
S P O            SPO
S P ?            SPO
S ? O            OSP
S ? ?            SPO
? P O            POS
? P ?            POS
? ? O            OSP
? ? ?            SPO

Table 5.1: Overview of which index OSTRICH uses for each triple pattern.

5.1.3.3 Multiple Indexes

OSTRICH [1] stores each triple in three different component orders, namely SPO, POS and OSP. As seen in Table 5.1, this is sufficient to resolve any triple pattern. Since OSTRICH stores additions and deletions separately, there are a total of six B+ trees per delta chain.
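The mapping of Table 5.1 can be expressed as a simple selection function; the sketch below is our own illustration, where the boolean flags indicate which components of the triple pattern are bound.

enum class Index { SPO, POS, OSP };

// Pick the index whose prefix matches the bound components (Table 5.1).
Index choose_index(bool s, bool p, bool o) {
    if (s && !p && o)   return Index::OSP;  // S ? O
    if (!s && p)        return Index::POS;  // ? P O and ? P ?
    if (!s && !p && o)  return Index::OSP;  // ? ? O
    return Index::SPO;                      // S P O, S P ?, S ? ?, ? ? ?
}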

5.1.3.4 Addition Counts

In order to handle queries more efficiently, OSTRICH [1] stores a mapping from triple pattern and version to the number of matching additions. These addition counts are calculated and stored during ingestion. However, since the number of mappings can grow very quickly, OSTRICH only stores addition counts that exceed a certain threshold. If an addition count is below the threshold, and thus not stored in the mapping, it is calculated on the fly.
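A sketch of this thresholded lookup follows; the names and the threshold value are our own illustration, not OSTRICH's actual code.

#include <cstdint>
#include <map>
#include <utility>

struct AdditionCounts {
    // (pattern id, version) -> count, persisted during ingestion.
    std::map<std::pair<uint64_t, uint32_t>, uint64_t> stored;
    uint64_t threshold = 200;  // arbitrary example value

    uint64_t count(uint64_t pattern, uint32_t version) const {
        auto it = stored.find({pattern, version});
        if (it != stored.end()) return it->second;   // large count: stored
        return count_by_scanning(pattern, version);  // small count: on the fly
    }

    // Placeholder for the on-the-fly scan of the addition trees.
    uint64_t count_by_scanning(uint64_t, uint32_t) const { return 0; }
};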

5.1.3.5 Deletion Counts

As discussed in Section 5.1.3.2, every deletion is annotated with its relative position in the delta chain. To determine the deletion count for a given triple pattern, OSTRICH starts by performing a backward search in the deletion trees in order to find the largest matching triple. Once OSTRICH has a match, it looks up the relative position of that triple. Since it is the largest triple for the given triple pattern, it is also the last deletion in the sorted list, so its relative position corresponds to the deletion count for the given triple pattern.
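A sketch of this lookup under a hypothetical deletion-tree interface (our own names, not OSTRICH's actual code):

#include <cstdint>
#include <optional>

struct Deletion { uint64_t relative_position; };

struct DeletionTree {
    // Backward search: the largest deletion matching the pattern in the
    // given version, if any. Placeholder body for the sketch.
    std::optional<Deletion> largest_match(uint64_t pattern, uint32_t version) const {
        return std::nullopt;
    }
};

uint64_t deletion_count(const DeletionTree& tree, uint64_t pattern, uint32_t version) {
    auto last = tree.largest_match(pattern, version);
    // The relative position of the last matching deletion equals the number
    // of matching deletions (assuming 1-based positions).
    return last ? last->relative_position : 0;
}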

5.1.3.6 Metadata

In order to get a view of all versions stored in the delta chain, OSTRICH stores a list of stored versions as metadata. This also enables the system to easily find the version count.

5.2 Ingestion

Ingestion refers to inserting new versions into the storage. OSTRICH [1] focuses on ingesting unidirectional non-aggregated changesets. In other words, the ingestion transforms the input as seen in Figure 2.2 into the storage structure seen in Figure 2.3. In short, the algorithm performs a sort-merge join over the addition stream, the deletion stream and the input stream, which are all sorted in SPO order. The algorithm iterates over all three streams until they are finished. In each iteration, the smallest triple among the three streams is processed. There are seven cases (a skeleton of this merge loop is sketched below):

1. deletion < input and deletion < addition
2. addition < input and addition < deletion
3. input < addition and input < deletion
4. input == deletion and input < addition
5. input == addition and input < deletion
6. addition == deletion and addition < input
7. addition == input == deletion

In the first case, where the deletion is the smallest triple, the deletion information is copied to the new version; the relative positions are updated in this case and all other cases. In the second case, where the addition is the smallest triple, the addition information is copied to the new version. In the third case the input is the smallest triple, which means the triple was not added or deleted in previous versions. In this case, OSTRICH adds the triple as either a non-local change addition or a non-local change deletion. In the fourth case, the input triple already exists as a deletion. If the input triple is an addition, it is added as a local change. Similarly, in the fifth case, the triple already exists as an addition. If the input triple is a deletion, it is added as a local change. In the sixth case the triple already existed as both an addition and a deletion. In this case the triple is carried over to the next version. In the seventh case the input triple already existed as both an addition and a deletion. In this case, if the input triple is an addition it becomes a deletion, and vice versa, and the local change flag is carried over.
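The following skeleton (our own illustration, not OSTRICH's actual code) shows the structure of this three-way merge; it only advances the streams according to the seven cases and leaves out the actual bookkeeping of version information, local change flags and relative positions.

#include <cstdint>
#include <queue>

using Triple = uint64_t;                  // dictionary-encoded triple key
constexpr Triple SENTINEL = UINT64_MAX;   // marks an exhausted stream

struct Stream {
    std::queue<Triple> triples;           // sorted in SPO order
    Triple head() const { return triples.empty() ? SENTINEL : triples.front(); }
    void advance() { if (!triples.empty()) triples.pop(); }
};

void ingest(Stream& addition, Stream& deletion, Stream& input) {
    while (addition.head() != SENTINEL || deletion.head() != SENTINEL ||
           input.head() != SENTINEL) {
        Triple a = addition.head(), d = deletion.head(), i = input.head();
        if (d < i && d < a) {             // case 1: copy deletion info
            deletion.advance();
        } else if (a < i && a < d) {      // case 2: copy addition info
            addition.advance();
        } else if (i < a && i < d) {      // case 3: brand-new triple
            input.advance();
        } else if (i == d && i < a) {     // case 4: addition becomes local change
            input.advance(); deletion.advance();
        } else if (i == a && i < d) {     // case 5: deletion becomes local change
            input.advance(); addition.advance();
        } else if (a == d && a < i) {     // case 6: carry the triple over
            addition.advance(); deletion.advance();
        } else {                          // case 7: flip addition/deletion
            input.advance(); addition.advance(); deletion.advance();
        }
    }
}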

5.3 Queries

OSTRICH [1] supports three query atoms, namely VM, DM and VQ.

5.3.1 Version Materialization Query

As described by Fernández et al. [3], VM queries retrieve data from a single version. In the following sections, we will describe how VM queries are resolved in OSTRICH [1], as well as how the cardinality of the result stream can be estimated.

5.3.1.1 Query

Algorithm 5.1 shows how VM queries are resolved. First, the corresponding snapshot is retrieved. Next, the snapshot is queried for the given triple pattern and the offset is applied. If the requested version is the snapshot itself, the algorithm returns HDT’s snapshot iterator. The algorithm continues by initializing the addition and deletion streams to the start position for the given triple pattern and version. The addition and deletion streams both filter out local changes, since they do not affect the final result, as explained in Section 5.1.3.1. The algorithm always returns snapshot triples first before returning additions. Therefore, determining the offset for the snapshot, addition and deletion streams can be split into two cases: either the offset lies within the range of the snapshot count minus the deletion count, or it lies within the range of the addition triples. In the first case, where the offset is within the range of the snapshot count minus the deletion count, the algorithm starts a loop that converges to the actual snapshot offset. The loop starts by looking at the triple at the current snapshot offset. The algorithm then offsets the deletion stream with that snapshot triple. This triple offset is done by navigating the deletion tree to the smallest triple before or equal to the offset triple. The offset within the deletion stream is then stored. The loop continues until the sum of the original offset and the deletion offset equals the current snapshot offset. In the second case, where the offset lies within the addition range, the algorithm terminates the snapshot iterator. The algorithm then applies an offset to the addition iterator. This offset is the original offset minus the snapshot count, incremented with the number of deletions.

Finally, the algorithm returns an offsettable iterator that combines the snapshot iterator, the deletion iterator and the addition iterator. This iterator performs a sort-merge join operation to delete triples from the snapshot iterator that also appear in the deletion iterator. Once the snapshot and deletions have been resolved, the iterator will emit all additions.

queryVm(store, tp, version, originalOffset) {
  snapshot = store.getSnapshot(version).query(tp, originalOffset)
  if (snapshot.getVersion() == version) {
    return snapshot
  }
  additions = store.getAdditionsStream(tp, version)
  deletions = store.getDeletionStream(tp, version)
  offset = 0
  if (originalOffset < snapshot.count(tp) - deletions.exactCount(tp)) {
    do {
      snapshot.offset(originalOffset + offset)
      offsetTriple = snapshot.peek()
      deletions = store.getDeletionsStream(tp, version, offsetTriple)
      offset = deletions.getOffset(tp)
    } while (snapshot.getCurrentOffset() != originalOffset + offset)
  } else {
    snapshot.offset(snapshot.count(tp))
    additions.offset(originalOffset - snapshot.count(tp) + deletions.exactCount(tp))
  }
  return PatchedSnapshotIterator(snapshot, deletions, additions)
}

Algorithm 5.1: Version Materialization Algorithm from OSTRICH [1]

5.3.1.2 Result Count

The result count for VM queries is the number of snapshot triples for a given triple pattern, plus the addition count, minus the deletion count. As explained in Subsection 5.1.3.4, large addition counts are calculated and stored during ingestion, so the result count can be calculated efficiently.
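Written as a formula (a restatement of the rule above in our own notation), for a triple pattern tp and version v:

count_VM(tp, v) = count_snapshot(tp) + count_additions(tp, v) - count_deletions(tp, v)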

5.3.2 Delta Materialization Query

As described by Fernández et al. [3], DM queries retrieve the differences between two versions and annotate whether each difference is an addition or a deletion. OSTRICH supports DM queries for a single snapshot and forward delta chain. We can thus discern two cases: a DM query between a snapshot and a delta, and a DM query between two deltas in the same delta chain. In the following sections, we will describe how OSTRICH resolves DM queries, as well as how the cardinality of the result stream is estimated.

5.3.2.1 Query

The first case, a DM query between a snapshot and a delta, is trivial since OSTRICH stores aggregated deltas, so all deltas are relative to the snapshot. The second case is a DM query between two deltas in the same delta chain. In this case, the algorithm iterates over the triples inside the addition tree and deletion tree for the given triple pattern, in a sort-merge join fashion. Triples are only emitted if they have a different addition/deletion flag for the two versions.

5.3.2.2 Result Count

In the first case, a DM query between a snapshot and a delta, the result count is exact: it is the number of additions and deletions for the given triple pattern. The second case is a DM query between two deltas in the same delta chain. In this case, OSTRICH estimates the result count by summing up the additions and deletions for the given triple pattern in both versions. This can overestimate the actual count if triples are changed such that they negate each other inside the version range.

5.3.3 Version Query

As described by Fernández et al. [3], VQ annotates triples with the version numbers in which they exist. OSTRICH [1] supports VQ queries for a single snapshot and forward delta chain. In the following sections, we will describe how version queries are resolved in OSTRICH, as well as how the cardinality of the result stream can be estimated.

5.3.3.1 Query

The algorithm starts by iterating over all the triples in the snapshot for the given triple pattern. Next, the deletion tree is probed for each triple. If the triple is not present in the deletion tree, the triple is present in all versions. If the triple is present in the deletion tree, the corresponding versions are erased from the annotations. After all snapshot triples have been processed, the algorithm iterates over the addition triples stored in the addition tree. As was the case with snapshot triples, the deletion tree is probed again for each triple. If the triple is present in the deletion tree, the versions are erased from the annotations. If the triple is not present, the triple is present in all versions ranging from the version that introduced the triple to the last version. Result streams can be partially offsetted by offsetting the snapshot iterator.

5.3.3.2 Result Count

The result count can be calculated by retrieving the count for the requested triple pattern in the snapshot and adding the addition count for the requested triple pattern.

Chapter 6

Bidirectional RDF Archive

As mentioned in Chapter 5, we will apply the potential storage optimization that was presented in Chapter 4 to OSTRICH [1]. The chapter is outlined as follows. First, we will give an overview of the storage structure. Second, we will explain how we expanded OSTRICH to support multiple snapshots. Third, we will give an overview of the ingestion process. Finally, we will give an overview of how VM, DM and VQ queries are handled.

6.1 Storage Structure

As seen in Figure 5.1 and Figure 6.1, the storage structure is similar to OSTRICH’s storage structure, which was explained in Section 5.1. Indeed, the storage structure of the bidirectional delta chain is just a snapshot with two delta chains: the reverse delta chain and the forward delta chain.

6.2 Multiple Snapshots

In Section 4.2, we presented two in-order ingestion algorithms for bidirectional delta chains which utilize multiple snapshots. However, OSTRICH only supports one snapshot, so we need to expand OSTRICH to support multiple snapshots. Supporting multiple snapshots comes down to finding the corresponding snapshot for a given version. We can discern three cases: the store consists of only forward delta chains, the store consists of only bidirectional delta chains, or the store consists of a combination of bidirectional and forward delta chains.

In the first case, assuming versions are in ascending order, the corresponding snapshot of a version is the greatest lower bound of all the snapshots, i.e. the largest snapshot that is still smaller than the version. In the second case, if we assume the delta chains are of equal length and all versions are in ascending order, the corresponding snapshot for a version is the snapshot that is closest to the version. In the third case, we calculate the greatest lower bound and the least upper bound of all the snapshots for the given version. If the upper bound snapshot does not have a reverse delta chain, our version is stored in a forward delta chain and the corresponding snapshot is the lower bound snapshot. If the upper bound snapshot does have a reverse delta chain, the corresponding snapshot is the snapshot closest to the version.
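The selection rule for the third, general case could be sketched as follows; the names are our own illustration, not the actual implementation.

#include <algorithm>
#include <functional>
#include <vector>

// 'snapshots' holds the snapshot versions in ascending order;
// has_reverse_chain tells whether a snapshot owns a reverse delta chain.
int corresponding_snapshot(const std::vector<int>& snapshots,
                           const std::function<bool(int)>& has_reverse_chain,
                           int version) {
    // First snapshot strictly greater than the version (least upper bound).
    auto ub = std::upper_bound(snapshots.begin(), snapshots.end(), version);
    // Largest snapshot <= version (greatest lower bound); the first version
    // is always a snapshot, so this is assumed to exist.
    int lower = *(ub - 1);

    if (ub == snapshots.end() || !has_reverse_chain(*ub))
        return lower;  // the version lives in a forward delta chain
    int upper = *ub;
    // The upper snapshot has a reverse chain: pick the closest snapshot.
    return (version - lower <= upper - version) ? lower : upper;
}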

6.3 Ingestion

Ingestion refers to inserting new versions into the storage. In this work, we focus on ingesting unidirectional non-aggregated changesets for the delta chains and fully materialized versions for the snapshot.


Figure 6.1: An overview of the storage structure of a bidirectional delta chain. Figure adapted from OSTRICH [1].

In other words, the ingestion transforms the input as seen in Figure 2.2 into the storage structure seen in Figure 6.2b. We will first explain how we ingest versions in a forward delta chain. Next, we briefly explain how we can insert versions out-of-order in a reverse delta chain. Finally, we explain how we can insert versions in-order in a reverse delta chain.

6.3.1 Out-of-order Ingestion

Out-of-order ingestion refers to ingesting versions in non-chronological order. Out-of-order ingestion is difficult in a realistic setting since it requires the system to somehow buffer the input changesets. However, we will see that if we ignore this impracticality, we can easily ingest versions in the reverse delta chain. Ingesting versions in a reverse delta chain is similar to ingesting in a forward delta chain; we simply need to transform the input changesets. Firstly, since the forward ingestion algorithm expects the input changeset to reference the snapshot, we reverse the input changeset by swapping the additions and deletions, so that the reversed changeset references the snapshot. For example, Figure 4.3b shows an example input changeset and Figure 4.3c shows the reversed input changeset. Secondly, since the forward ingestion algorithm expects the version closest to the snapshot to be inserted first, we insert the versions in reverse order.
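A minimal sketch of this transformation, assuming changesets are simple addition/deletion sets (the names are our own illustration):

#include <algorithm>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Changeset { std::set<std::string> additions, deletions; };

std::vector<Changeset> reverse_changesets(std::vector<Changeset> chain) {
    for (Changeset& cs : chain)
        std::swap(cs.additions, cs.deletions);  // invert each delta
    std::reverse(chain.begin(), chain.end());   // closest-to-snapshot first
    return chain;
}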

6.3.2 In-order Ingestion: Fix-Up Algorithm

As mentioned in Section 6.3.1, out-of-order ingestion requires the system to buffer the input changeset, which is not always practical. Therefore we also propose an in-order ingestion algorithm. As discussed in Section 4.2, a potential way of inserting versions in-order in a reverse delta chain is the fix-up algorithm. In this section, we will expand upon this idea further.

The in-order ingestion process starts by inserting the first half of the delta chain in a temporary forward delta chain, as discussed in Section 5.2. Once the system decides a new delta chain needs to be initiated, for example because the delta chain size exceeds a certain threshold, the system stores the next version once in the temporary forward delta chain and stores it again as the snapshot for the new permanent delta chain. The reason behind storing the version twice is to simplify the input extraction, which will be explained below. Figure 6.3 shows the resulting delta chains. Once the system has some idle time, the fix-up process can be performed. It is important to note that the fix-up process can be performed at any time, since the temporary forward delta chain is fully functional. In summary, the fix-up process extracts the original input changeset from the temporary delta chain. The temporary delta chain can then be deleted and a new permanent reverse delta chain can be constructed out-of-order with the extracted input changeset. Subfigure 6.2b shows the final result.

The input changeset is extracted from the temporary delta chain using Algorithm 6.1 to extract the additions and Algorithm 6.2 to extract the deletions. Algorithm 6.1 starts by iterating over every addition in the main addition tree of the delta chain. As discussed in Subsection 5.1.3, additions are annotated with version information. Since this version information represents an aggregated delta chain, we need to transform it in order to get the non-aggregated input changeset. The algorithm does this by iterating over the version information. If the previous version is present in the version information, the triple was already added in a previous version and therefore was not present in the input addition changeset of the current version. If the previous version is not present, the triple was first added in the current version and should be present in the input changeset. We write the triples to file in order to limit memory usage; however, deserializing the triples later brings additional overhead. Algorithm 6.2 extracts the input deletion changeset and works similarly to Algorithm 6.1. Once the input changeset is recovered, the temporary delta chain is deleted. Finally, the extracted changeset is inserted out-of-order into a reverse delta chain, as discussed in Subsection 6.3.1.

extract_additions(store) {
  main_addition_tree = store->spo_additions_index();
  addition_iterator = main_addition_tree->get_cursor();

  while (current_addition = addition_iterator->next()) {
    versions = current_addition->get_versions();
    for (i = 0; i < versions.size(); i++) {
      // The triple belongs to the input addition changeset of versions[i]
      // only if it was not already present in the previous version.
      if (i == 0 || versions[i-1] != versions[i] - 1) {
        // Append the triple to the file "<versions[i]>/additions.nt".
        versions[i]/additions.nt << current_addition->get_triple();
      }
    }
  }
}

Algorithm 6.1: Algorithm to extract addition n-triple files from a delta chain.


(a) State of the delta chains before the fix-up algorithm is applied.


(b) State of the delta chains after the fix-up algorithm is applied.

Figure 6.2: An illustration of the fix-up algorithm.



Figure 6.3: State of the delta chains before the fix-up algorithm is applied.

extract_deletions(store) {
  main_deletion_tree = store->spo_deletions_index();
  deletion_iterator = main_deletion_tree->get_cursor();

  while (current_deletion = deletion_iterator->next()) {
    versions = current_deletion->get_versions();
    for (i = 0; i < versions.size(); i++) {
      // The triple belongs to the input deletion changeset of versions[i]
      // only if it was not already deleted in the previous version.
      if (i == 0 || versions[i-1] != versions[i] - 1) {
        // Append the triple to the file "<versions[i]>/deletions.nt".
        versions[i]/deletions.nt << current_deletion->get_triple();
      }
    }
  }
}

Algorithm 6.2: Algorithm to extract deletion n-triple files from a delta chain.

6.4 Query

In this section we will explain how VM, DM and VQ are resolved in bidirectional delta chains.

6.4.1 Version Materialization Query

As described by Fernández et al. [3], VM queries retrieve data from a single version. VM queries are handled exactly the same as in OSTRICH [1], as described in Subsection 5.3.1. Indeed, even when the version is stored in the reverse delta chain, the algorithm is the same, since inverse deltas were ingested.

6.4.2 Delta Materialization Query

As described by Fernández et al. [3], DM queries retrieve the differences between two versions and annotate whether each difference is an addition or a deletion. In this work, we will focus on DM queries for a single snapshot and its corresponding reverse and forward delta chains. We can discern three cases: a DM query between a snapshot and a delta, a DM query between two deltas in the same delta chain (intra-delta DM query) and a DM query between two deltas in different delta chains (inter-delta DM query). The first and second case are handled exactly the same as in OSTRICH [1], see Subsection 5.3.2. The third case is a DM query between two versions where one version is stored in the reverse delta chain and the other version in the forward delta chain. In summary, we resolve this case by splitting up the delta into two sequential deltas that are relative to the snapshot and then merging these sequential deltas together. In other words, using Darcs’ patch notation [51], with o the start version, e the end version and s the snapshot:

oDe = oD1s · sD2e

where oD1s denotes the delta from o to the snapshot s and sD2e the delta from the snapshot s to e. This strategy is quite efficient, since the deltas relative to the snapshot are stored directly. Furthermore, since the snapshot deltas are sorted, they can be merged in a sort-merge fashion.

6.4.2.1 Query

The algorithm starts by calculating the deltas relative to the snapshot, which corresponds to the first case of the DM query algorithm that was explained in Subsection 5.3.2. We refer to the delta iterator between the version in the reverse delta chain and the snapshot as the reverse delta iterator. Similarly, we refer to the delta iterator between the snapshot and the version in the forward delta chain as the forward delta iterator. The algorithm continues by iterating over the two delta iterators in a sort-merge join fashion, as seen in Algorithm 6.3. If the triples at the heads of the streams are equal, we do not emit a triple, since we have an addition and a deletion that cancel each other out. If the triple at the head of the reverse delta iterator is smaller, meaning the triple is not present in the forward delta iterator, that triple and its change flag are emitted. Similarly, if the triple at the head of the forward delta iterator is the smaller one, that triple and its change flag are emitted.

next_delta_triple() {
  while (forward_it->has_next() || reverse_it->has_next()) {
    if (reverse_it->peek_head() == forward_it->peek_head()) {
      // Addition and deletion cancel each other out: skip both.
      reverse_it->next();
      forward_it->next();
    } else if (reverse_it->peek_head() < forward_it->peek_head()) {
      return reverse_it->next();
    } else {
      return forward_it->next();
    }
  }
}

Algorithm 6.3: Sort-merge join algorithm for merging two delta iterators.
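The following is a minimal, self-contained C++ sketch of the merge in Algorithm 6.3, under the simplifying assumption that the two snapshot-relative deltas are materialized as sorted vectors of (triple, is-addition) pairs; the names and types are illustrative and do not correspond to COBRA's actual classes.

#include <string>
#include <utility>
#include <vector>

// (triple, is_addition): true for an addition, false for a deletion
using DeltaEntry = std::pair<std::string, bool>;

// Merges the snapshot-relative deltas o->s and s->e into the delta o->e.
// Both inputs must be sorted by triple; equal triples carry opposite
// change flags and therefore cancel out.
std::vector<DeltaEntry> merge_deltas(const std::vector<DeltaEntry>& reverse_delta,
                                     const std::vector<DeltaEntry>& forward_delta) {
    std::vector<DeltaEntry> result;
    std::size_t r = 0, f = 0;
    while (r < reverse_delta.size() && f < forward_delta.size()) {
        if (reverse_delta[r].first == forward_delta[f].first) {
            ++r; ++f; // addition and deletion cancel: emit nothing
        } else if (reverse_delta[r].first < forward_delta[f].first) {
            result.push_back(reverse_delta[r++]);
        } else {
            result.push_back(forward_delta[f++]);
        }
    }
    // one stream may be exhausted before the other: emit the remainder
    result.insert(result.end(), reverse_delta.begin() + r, reverse_delta.end());
    result.insert(result.end(), forward_delta.begin() + f, forward_delta.end());
    return result;
}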

6.4.2.2 Result Count

It is difficult to give an exact result count for inter-delta DM queries. An estimation of the result count can be calculated by summing the counts of both deltas relative to the snapshot. This can overestimate the actual count, however, since triples that are present in both deltas are counted twice but emitted at most once.
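A sketch of this estimation, with hypothetical parameter names for the per-delta counts:

// Upper bound for an inter-delta DM result count. Triples present in both
// deltas are counted twice, so the estimate can exceed the actual count.
size_t estimate_inter_delta_dm_count(size_t reverse_delta_count,
                                     size_t forward_delta_count) {
    return reverse_delta_count + forward_delta_count;
}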

6.4.3 Version Query

As described by Fernández et al. [3], VQ queries annotate triples with the version numbers in which they exist. In this work, we will only present an algorithm for a single snapshot and its corresponding reverse and forward delta chains. The algorithm is based on the VQ algorithm of OSTRICH, which was explained in Subsection 5.3.3. In the following sections, we will describe how version queries are resolved, as well as how the cardinality of the result stream can be estimated.

6.4.3.1 Query

As seen in Algorithm 6.4, the algorithm starts by iterating over all the triples in the snapshot for the given triple pattern. Next, the deletion trees are probed for each triple. If the triple is not present in a deletion tree, the triple is present in all versions. If the triple is present in a deletion tree, the corresponding versions are erased from the version annotation, as seen in Algorithm 6.6. After all the snapshot triples have been processed, the algorithm iterates over the addition triples stored in the addition trees in a sort-merge join fashion, as seen in Algorithm 6.5. As was the case with the snapshot triples, the deletion trees are probed for each triple. If the triple is not present in a deletion tree, the triple is present in all versions ranging from the version that introduced the triple to the last version. If the triple is present in a deletion tree, those versions are erased from the annotation. Result streams can be partially offset by offsetting the snapshot iterator of HDT [24].

next_VQ_triple() {
    // first, emit all snapshot triples with their version annotations
    if (snapshot_it->has_next()) {
        result_triple = snapshot_it->next();
        result_versions = erase_deleted_versions(result_triple, null, null);
        return TripleVersion(result_triple, result_versions);
    }
    // then, iterate over the additions of both delta chains in a
    // sort-merge join fashion and erase the deleted versions
    if (has_next_addition()) {
        (reverse_addition, forward_addition) = next_addition();
        result_triple = reverse_addition ? reverse_addition : forward_addition;
        result_versions = erase_deleted_versions(null, reverse_addition, forward_addition);
        return TripleVersion(result_triple, result_versions);
    }
    return false;
}

Algorithm 6.4: Version Query Algorithm

next_addition() {
    if (reverse_it->has_next() || forward_it->has_next()) {
        if (reverse_it->peek_head() == forward_it->peek_head()) {
            // the triple is added in both delta chains
            return (reverse_it->next(), forward_it->next());
        } else if (reverse_it->peek_head() < forward_it->peek_head()) {
            return (reverse_it->next(), null);
        } else {
            return (null, forward_it->next());
        }
    } else {
        return (null, null);
    }
}

Algorithm 6.5: Algorithm to merge forward and reverse addition iterators.

erase_deleted_versions(snapshot_triple, reverse_addition, forward_addition) {
    // initialise the version annotation optimistically
    smallest_version = reverse_patch_tree->get_min();
    largest_version = forward_patch_tree->get_max();
    if (snapshot_triple) {
        triple = snapshot_triple;
        // vector from smallest_version to largest_version with step size 1
        versions = [smallest_version : largest_version];
    } else if (reverse_addition && forward_addition) {
        triple = reverse_addition.get_triple();
        versions_reverse = [smallest_version : reverse_addition.get_largest_version()];
        versions_forward = [forward_addition.get_smallest_version() : largest_version];
        versions = versions_reverse + versions_forward;
    } else if (reverse_addition) {
        triple = reverse_addition.get_triple();
        versions = [smallest_version : reverse_addition.get_largest_version()];
    } else if (forward_addition) {
        triple = forward_addition.get_triple();
        versions = [forward_addition.get_smallest_version() : largest_version];
    }

    // erase the versions deleted in the reverse delta chain
    deletion_versions = reverse_patch_tree->get_deletion(triple);
    versions.erase(deletion_versions);

    // erase the versions deleted in the forward delta chain
    deletion_versions = forward_patch_tree->get_deletion(triple);
    versions.erase(deletion_versions);

    return versions;
}

Algorithm 6.6: Algorithm to calculate version annotations in VQ queries.
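For instance, assume a bidirectional delta chain over versions 0 to 7 with the snapshot in the middle (hypothetical numbers). A snapshot triple is optimistically annotated with versions [0 : 7]; if the reverse deletion tree records it as deleted in version 1 and the forward deletion tree records it as deleted in versions 6 and 7, the final annotation becomes {0, 2, 3, 4, 5}.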

6.4.3.2 Result Count

The result count can be estimated by retrieving the count for the requested triple pattern in the snapshot and adding the addition counts for the requested triple pattern from the reverse and forward delta chains. This can overestimate the actual result count if a triple is added in both delta chains, since the triple will only appear once in the result stream.

Chapter 7

Evaluation

In this chapter, we will evaluate COBRA and compare it with OSTRICH [1]. We start by describing the software implementation of the bidirectional RDF archive described in Chapter 6. Next, we outline our experimental setup and report the results from our experiments. Finally, we discuss and interpret these results.

7.1 COBRA Implementation

COBRA (Change-Based Offset-Enabled Bidirectional RDF Archive) refers to the C++ software implementation of the storage described in Chapter 6 and can be found at https://github.ugent.be/tpmahieu/COBRA. COBRA uses the same technologies as OSTRICH [1]. COBRA uses HDT(-FoQ) [24, 25] for storing the snapshots. Moreover, the extended dictionary is compressed with gzip. For our B+ tree indexes, we use the B+ tree implementation from Kyoto Cabinet (http://fallabs.com/kyotocabinet/), which is memory-mapped and can be easily compressed. From Kyoto Cabinet, we also use the Hash Database implementation, which is also memory-mapped, for storing the addition counts.
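As an illustration of this storage stack, the following minimal C++ sketch opens a compressed, memory-mapped Kyoto Cabinet B+ tree and stores a single record, using Kyoto Cabinet's documented API; the file name and tuning values are arbitrary examples rather than COBRA's actual configuration.

#include <iostream>
#include <kctreedb.h>

int main() {
    kyotocabinet::TreeDB db;
    // enable transparent compression of the B+ tree pages (before open)
    db.tune_options(kyotocabinet::TreeDB::TCOMPRESS);
    // map part of the database file into memory; 64 MB chosen arbitrarily
    db.tune_map(64LL << 20);
    if (!db.open("example.kct",
                 kyotocabinet::TreeDB::OWRITER | kyotocabinet::TreeDB::OCREATE)) {
        std::cerr << "open error: " << db.error().name() << std::endl;
        return 1;
    }
    db.set("key", "value"); // records are kept in key order in the B+ tree
    std::string value;
    if (db.get("key", &value)) std::cout << value << std::endl;
    db.close();
    return 0;
}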

7.2 Experimental Setup

In this work, we will evaluate the ingestion capabilities and the query resolution capabilities of COBRA. For this, we will use the BEAR [45] benchmark, which can be found at https://aic.ai.wu.ac.at/qadlod/bear.html. In particular, we will use the BEAR-A, BEAR-B daily and BEAR-B hourly benchmarks. All experiments were performed on a 64-bit Ubuntu 14.04 machine with a 6-core 2.40 GHz CPU and 48 GB of RAM.

7.2.1 Ingestion

The ingestion process will be evaluated on storage size and ingestion time. For BEAR-A, we will only ingest the first eight versions due to memory constraints. Similarly, for BEAR-B hourly, we will only ingest the first 400 versions. For BEAR-B daily, we will ingest all 89 versions. We will do the ingestion evaluation for multiple storage layouts and ingestion orders, namely:

• OSTRICH-1F: OSTRICH with one forward delta chain, as seen in Figure 2.3.
• OSTRICH-2F: OSTRICH with two forward delta chains, as seen in Figure 6.2a.
• COBRA-PRE FIX UP: COBRA's pre fix-up state, as seen in Figure 6.3.
• COBRA-POST FIX UP: COBRA's bidirectional delta chain post fix-up, as seen in Figure 6.2b.
• COBRA-OUT OF ORDER: COBRA's bidirectional delta chain, as seen in Figure 6.2b, but ingested out-of-order (snapshot, then reverse delta chain, then forward delta chain).

7.2.2 Query

The BEAR benchmark also provides query sets that are grouped by triple pattern. BEAR-A provides seven query sets containing around 100 triple patterns each, which are further divided into high result cardinality and low result cardinality. BEAR-B only provides two query sets, which contain ?P? and ?PO queries. These queries will be evaluated as VM queries for all versions, DM queries between all versions and a VQ query. In order to minimize outliers, we replicate the queries five times and take the mean results. Furthermore, we also perform a warm-up period before the first query of each triple pattern, as sketched below. Since neither OSTRICH nor COBRA supports multiple snapshots for all query atoms, we limit our experiments to OSTRICH's unidirectional storage layout and COBRA's bidirectional storage layout.
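The sketch below illustrates this measurement methodology in C++; run_query stands in for an arbitrary VM, DM or VQ invocation and is hypothetical.

#include <chrono>
#include <functional>

// Returns the mean duration (in ms) of `replications` timed runs of a query,
// preceded by one unmeasured warm-up run.
double mean_query_duration_ms(const std::function<void()>& run_query,
                              int replications = 5) {
    run_query(); // warm-up run, not measured
    double total_ms = 0.0;
    for (int i = 0; i < replications; ++i) {
        auto start = std::chrono::steady_clock::now();
        run_query();
        auto end = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(end - start).count();
    }
    return total_ms / replications;
}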

7.3 Results

In this section, we present the results from our experiments. The raw results can be found at https://github.ugent.be/tpmahieu/COBRA. A discussion of these results will be presented in the next section.

7.3.1 Ingestion Results

Table 7.1a displays the storage sizes and ingestion times of the different delta chain configurations for the first eight versions of the BEAR-A benchmark. Figure 7.1 and Figure 7.2 show the cumulative storage size and cumulative ingestion time per version, while Figure 7.3 shows the individual ingestion time per version. For BEAR-A, OSTRICH-1F has the highest ingestion time and requires more storage than COBRA-OUT OF ORDER. OSTRICH-2F ingests the fastest but also requires more storage than COBRA-OUT OF ORDER. COBRA-PRE FIX UP requires the most storage space and ingests faster than OSTRICH-1F but slower than OSTRICH-2F. Table 7.1b shows the storage sizes and ingestion times of the different approaches for BEAR-B daily. Figure 7.4 and Figure 7.5 display the cumulative storage size and cumulative ingestion time, while Figure 7.6 shows the individual ingestion time per version. For BEAR-B daily, OSTRICH-1F has the lowest storage size but also the highest ingestion time. OSTRICH-2F has the lowest ingestion time. COBRA-PRE FIX UP has a similar ingestion time and storage size as OSTRICH-2F. COBRA-OUT OF ORDER has the highest storage size. Table 7.1c shows the storage sizes and ingestion times of the different approaches for the first 400 versions of BEAR-B hourly. Figure 7.7 and Figure 7.8 display the cumulative storage size and cumulative ingestion time for every version, while Figure 7.9 shows the individual ingestion time per version. For BEAR-B hourly, OSTRICH-2F has the lowest storage size and the lowest ingestion time. COBRA-PRE FIX UP has a similar ingestion time and storage size as OSTRICH-2F. COBRA-OUT OF ORDER has a higher ingestion time and storage size compared to OSTRICH-2F. OSTRICH-1F has the highest storage size and ingestion time.

7.3.2 Query Results

Figures 7.10, 7.12 and 7.13 show the mean VM, DM and VQ query durations for all triple patterns provided by the BEAR-A benchmark. In order to let outlier query durations influence the results, we opted for the mean over the median. Appendix A lists the average query durations per triple pattern.

Figure 7.1: Comparison of the cumulative storage sizes (in GB) per version for the first eight versions of the BEAR-A benchmark.

Figure 7.2: Comparison of the cumulative ingestion times (in hours) per version for the first eight versions of the BEAR-A benchmark.

Figure 7.3: Comparison of the individual ingestion times (in minutes) per version for the first eight versions of the BEAR-A benchmark.

Figure 7.4: Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B daily benchmark.

Figure 7.5: Comparison of the cumulative ingestion time (in min) per version of the BEAR-B daily benchmark.

Figure 7.6: Comparison of the individual ingestion time (in min) per version of the BEAR-B daily benchmark.

Figure 7.7: Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B hourly benchmark.

Figure 7.8: Comparison of the cumulative ingestion time (in min) per version of the BEAR-B hourly benchmark.

Approach              Storage Size (GB)   Ingestion Time (hours)
OSTRICH-1F            3.92                23.66
OSTRICH-2F            3.83                11.45
COBRA-PRE FIX UP      4.31                12.92
COBRA-POST FIX UP     3.36                12.92 + 8.38
COBRA-OUT OF ORDER    3.40                14.63

(a) Storage sizes and ingestion times of the different approaches for BEAR-A.

Storage Layout        Storage Size (MB)   Ingestion Time (min)
OSTRICH-1F            19.37               6.53
OSTRICH-2F            25.90               3.18
COBRA-PRE FIX UP      26.01               3.28
COBRA-POST FIX UP     29.15               3.28 + 2.48
COBRA-OUT OF ORDER    28.44               4.24

(b) Storage sizes and ingestion times of the different approaches for BEAR-B daily.

Storage Layout        Storage Size (MB)   Ingestion Time (min)
OSTRICH-1F            61.02               34.47
OSTRICH-2F            46.40               14.85
COBRA-PRE FIX UP      46.42               14.87
COBRA-POST FIX UP     55.42               14.87 + 11.41
COBRA-OUT OF ORDER    53.26               18.30

(c) Storage sizes and ingestion times of the different approaches for BEAR-B hourly.

Table 7.1: Storage sizes and ingestion times of the different approaches for all three benchmarks. COBRA-POST FIX UP represents the in-order ingestion of the bidirectional delta chain using the fix-up algorithm. Therefore, the ingestion time is the sum of the ingestion time of COBRA-PRE FIX UP and the fix-up time.

For BEAR-A, COBRA resolves VM queries faster than OSTRICH. Similarly, COBRA resolves DM queries slightly faster than OSTRICH. Finally, VQ queries are resolved slightly faster in OSTRICH compared to COBRA. Figures 7.14, 7.16 and 7.17 show the average VM, DM and VQ query durations for all triple patterns provided by the BEAR-B benchmark for all versions of the BEAR-B daily dataset. Appendix B lists the average query durations per triple pattern. For BEAR-B daily, COBRA resolves VM queries and DM queries faster than OSTRICH. VQ query durations are similar for OSTRICH and COBRA. Figures 7.18, 7.20 and 7.21 show the average VM, DM and VQ query durations for all triple patterns provided by the BEAR-B benchmark for the first 400 versions of the BEAR-B hourly dataset. Appendix C lists the average query durations per triple pattern. For BEAR-B hourly, COBRA resolves VM and DM queries faster than OSTRICH. VQ query durations are similar for OSTRICH and COBRA.

7.4 Discussion

In this section, we discuss the results presented in Section 7.3. First, we will discuss the ingestion results. Second, we will discuss the query results. Finally, we evaluate the hypotheses that were presented in Section 4.3.

Figure 7.9: Comparison of the individual ingestion time (in min) per version of the BEAR-B hourly benchmark.

Figure 7.10: Average VM query durations for all triple patterns in BEAR-A.

Figure 7.11: Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-A.

Figure 7.12: Average DM query durations between all versions for all triple patterns in BEAR-A.

Figure 7.13: Average VQ query durations for all triple patterns in BEAR-A.

Figure 7.14: Average VM query durations for all provided triple patterns in BEAR-B daily.

Figure 7.15: Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B daily.

Figure 7.16: Average DM query durations between all versions for all triple patterns in BEAR-B daily.

Figure 7.17: Average VQ query durations for all provided triple patterns in BEAR-B daily.

Figure 7.18: Average VM query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly.

Figure 7.19: Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B hourly.

Figure 7.20: Average DM query durations between all versions for all triple patterns in BEAR-B hourly.

Figure 7.21: Average VQ query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly.

7.4.1 Ingestion Evaluation

7.4.1.1 Storage Size

There is no approach that has the lowest storage size for all the benchmarks. Indeed, COBRA has the lowest storage size for BEAR-A, OSTRICH-1F has the lowest storage size for BEAR-B daily and OSTRICH-2F has the lowest storage size for BEAR-B hourly. As mentioned above, COBRA has the lowest storage size for BEAR-A. In Figure 7.1, we can identify two causes for this storage size reduction. First, we see that the reverse delta chain, version 0 up to 4, has a lower storage size than the forward delta chains of OSTRICH-1F, OSTRICH-2F and COBRA-PRE FIX UP. Second, we see that for version 4, OSTRICH-2F and COBRA-PRE FIX UP have a storage increase due to the additional snapshot, which does not need to be stored for OSTRICH-1F and COBRA-OUT OF ORDER. For BEAR-B daily, OSTRICH-1F has the lowest storage size, due to a large storage increase for the other approaches, which can be seen in Figure 7.4. This large storage increase is the result of the additional delta chain of OSTRICH-2F, COBRA-PRE FIX UP and COBRA-OUT OF ORDER, which is initialized in version 4. We can also see that this storage increase is smaller for COBRA-OUT OF ORDER compared to COBRA-PRE FIX UP and OSTRICH-2F, due to the additional snapshot of the latter two. However, in this case, COBRA-OUT OF ORDER does not have a smaller storage size than OSTRICH-2F and COBRA-PRE FIX UP, due to the storage size of the reverse delta chain. For BEAR-B hourly, OSTRICH-2F has the lowest storage size. In Figure 7.7, we can again see the large storage increase for OSTRICH-2F, COBRA-PRE FIX UP and COBRA-OUT OF ORDER, which is again smaller for COBRA-OUT OF ORDER. However, for BEAR-B hourly, OSTRICH-1F has the largest storage size, which implies that additional snapshots can reduce the total storage size.

7.4.1.2 Ingestion Time

In Figure 7.3 and Figure 7.6, we observe that for all storage configurations, the ingestion time increases with the number of versions until a new delta chain is initiated. The reason for this behaviour is that the ingestion algorithm needs to iterate over the addition and deletion trees, so the ingestion time increases with the size of the addition and deletion trees. Therefore, OSTRICH-1F ingests slower than the other approaches. In Figures 7.2, 7.5 and 7.8, we see that OSTRICH-2F has the lowest ingestion time for all the benchmarks. COBRA-PRE FIX UP has similar ingestion times to OSTRICH-2F for BEAR-B daily and BEAR-B hourly, but ingests much slower for BEAR-A. As explained in Subsection 6.3.2, COBRA-PRE FIX UP stores the middle version twice, in order to speed up and simplify the fix-up step. For datasets with many small versions, such as BEAR-B, the resulting storage size and ingestion time increase is negligible. However, for datasets with only a few large versions, such as BEAR-A, the storage size and ingestion time increase is non-negligible. In Tables 7.1a, 7.1b and 7.1c, we see that the fix-up time is quite large, but the in-order ingestion duration for COBRA is still lower than the ingestion duration for OSTRICH-1F, for all evaluated benchmarks.

7.4.2 Query Evaluation

VM queries are resolved faster by COBRA compared to OSTRICH, even though COBRA and OSTRICH have the same VM algorithm. We attribute this discrepancy to the smaller delta chains of COBRA. DM queries are resolved faster by COBRA compared to OSTRICH for all three benchmarks. In Figures 7.11, 7.15 and 7.19, we can see that intra-delta DM queries are resolved faster in COBRA compared to OSTRICH. Indeed, as discussed in Subsection 5.3.2, intra-delta DM queries rely on iterating over the deletion and addition trees, so smaller addition and deletion trees result in faster intra-delta DM queries. Therefore, since COBRA has halved OSTRICH's delta chain, COBRA has also halved the intra-delta DM query time. In Figures 7.11, 7.15 and 7.19, we also see that COBRA handles inter-delta DM queries as efficiently as OSTRICH handles intra-delta DM queries. VQ query durations are roughly equal for both OSTRICH and COBRA, with OSTRICH being faster for BEAR-A but slower for BEAR-B daily and BEAR-B hourly. As discussed in Subsection 5.3.3 and Subsection 6.4.3, COBRA's VQ algorithm is similar to OSTRICH's VQ algorithm. COBRA only has a limited extra overhead compared to OSTRICH, namely the merging of the reverse and forward addition trees, and a potential additional look-up in the reverse deletion tree, which, as evidenced by Figures 7.13, 7.17 and 7.21, only adds a limited cost.

7.4.3 Hypotheses Evaluation

In Section 4.3, we proposed five hypotheses, which we will evaluate in this section based on the experimental results of our software implementation. The first hypothesis states that: "Disk space will be significantly lower for a bidirectional delta chain compared to a unidirectional delta chain.". In Subsection 7.4.1, we discussed that COBRA has a lower storage size than OSTRICH for BEAR-A and BEAR-B hourly, but that COBRA has a higher storage size for BEAR-B daily. Therefore, we reject this hypothesis. The second hypothesis states that: "In-order ingestion time will be lower for a unidirectional delta chain compared to a bidirectional delta chain.". In Subsection 7.4.1, we discussed that OSTRICH-1F ingests the slowest for all benchmarks. Therefore, we reject this hypothesis. The third hypothesis states that: "The mean VM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.". As discussed in Subsection 7.4.2, the average VM query duration is lower for COBRA compared to OSTRICH, due to the smaller delta chains of COBRA. Therefore, we reject this hypothesis. The fourth hypothesis states that: "The mean DM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.". As discussed in Subsection 7.4.2, the average DM query is resolved faster in COBRA compared to OSTRICH, due to the lower intra-delta DM query duration. Therefore, we reject this hypothesis. The fifth hypothesis states that: "The mean VQ query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.". As discussed in Subsection 7.4.2, the average VQ query duration is roughly equal for both OSTRICH and COBRA. Therefore, we accept this hypothesis.

Chapter 8

Conclusion and Future Work

In this chapter, we will give a conclusion for this thesis and list potential future work.

8.1 Conclusion

In this work, we presented bidirectional delta chains as a potential storage optimization for CB RDF archives. We applied this storage optimization to an existing RDF archive named OSTRICH [1]. For this purpose, we modified OSTRICH so that multiple snapshots could be supported. Next, we presented an in-order ingestion algorithm using a fix-up strategy. Moreover, we presented a novel DM query algorithm for inter-delta versions. Finally, we altered the existing VQ query algorithm so that bidirectional chains are supported. In our ingestion results presented in Section 7.4.1, we confirmed that multiple snapshots are a viable method of reducing the ingestion time for OSTRICH, as Taelman et al. [1] suggested. Moreover, we discovered that for all three evaluated benchmarks, COBRA has a lower ingestion time than OSTRICH, even with the additional ingestion cost due to the fix-up time. Finally, we saw that COBRA does not reduce the storage size for all three evaluated benchmarks, but only for the first eight versions of BEAR-A. From the results of BEAR-B hourly, we conclude that OSTRICH with one snapshot performs poorly for datasets with many versions. So, in this case, we recommend breaking up the delta chain by introducing an additional snapshot or using a bidirectional delta chain. However, from the results of BEAR-B daily, we conclude that for smaller datasets the delta chain should be sufficiently large before a new delta chain is initiated, so that the initial cost of a new delta chain does not dominate the storage size. Finally, we saw that for all benchmarks COBRA reduced the storage increase from the second delta chain, because COBRA does not need to store the second snapshot. However, this did not always result in an overall storage size reduction, due to the size of COBRA's reverse delta chain. Therefore, we recommend transforming two unidirectional delta chains into a bidirectional delta chain only if the first delta chain is more similar to the second snapshot, so that the resulting reverse delta chain will be smaller than the current forward delta chain. In our query results presented in Section 7.4.2, we saw that COBRA reduces the VM query durations. We attributed this to the smaller delta chains of COBRA, so we expect similar results for OSTRICH with two snapshots. Similarly, for the DM queries, we observed that COBRA has a reduced inter-delta DM cost compared to OSTRICH, which we also attributed to the smaller delta chains, so similar gains can be expected for OSTRICH with two snapshots. Finally, we saw that VQ query durations are roughly equal for both OSTRICH and COBRA. In conclusion, bidirectional delta chains are not the all-round storage optimization technique we set out to find at the start of this work; however, they are a viable tool for reducing the overall storage size in certain cases, in particular for merging two unidirectional delta chains when the first delta chain is more similar to the second snapshot.


8.2 Future Work

In this work, there are a number of potentially interesting research opportunities left for future work. First, as mentioned in the previous section, there needs to be a reliable way of predicting whether a delta chain is more similar to the preceding snapshot or the future snapshot, before two unidirectional delta chains can be transformed into a bidirectional delta chain. In Chapter 4, we also presented an alternative algorithm for ingesting versions in-order in a bidirectional delta chain, which could be implemented and evaluated. Moreover, in Section 7.4, we discussed that storing an additional version for COBRA's pre fix-up state has a non-negligible overhead for large versions, so future work could research how to extract the input changeset from the snapshot so that the additional version would not need to be stored twice. Finally, additional research is needed to expand the current DM and VQ algorithms to multiple snapshots and to allow for more efficient offsets.

Bibliography

[1] R. Taelman, R. Verborgh, and E. Mannens, “Exposing RDF archives using triple pattern fragments,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017. [2] C. Schoreels, B. Logan, and J. M. Garibaldi, “Agent based genetic algorithm employing financial technical analysis for making trading decisions using historical equity market data,” in Intelligent Agent Technology, 2004 (IAT 2004). Proceedings. IEEE/WIC/ACM International Conference on, pp. 421–424, IEEE, 2004. [3] J. D. Fernández, J. Umbrich, A. Polleres, and M. Knuth, “Evaluating query and storage strategies for RDF archives,” in Proceedings of the 12th International Conference on Semantic Systems, SEMANTiCS 2016, (New York, NY, USA), pp. 41–48, ACM, 2016. [4] T. Berners-Lee, J. Hendler, and O. Lassila, “The Semantic Web,” Scientific American, vol. 284, no. 5, pp. 34–43, 2001. [5] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data - the story so far,” International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009. [6] F. Manola, E. Miller, B. McBride, et al., “RDF primer,” W3C Recommendation, vol. 10, no. 1-107, p. 6, 2004. [7] G. Karvounarakis, A. Magganaraki, S. Alexaki, V. Christophides, D. Plexousakis, M. Scholl, and T. Tolle, “Querying the Semantic Web with RQL,” Computer Networks, 2003. [8] A. Bernstein and C. Kiefer, “Imprecise RDQL: towards generic retrieval in ontologies using similarity joins,” in Proceedings of the 2006 ACM Symposium on Applied Computing, SAC ’06, (New York, NY, USA), pp. 1684–1689, ACM, 2006. [9] W3C SPARQL Working Group et al., “SPARQL 1.1 overview, W3C recommendation 21 March 2013,” 2012. [10] D. C. Faye, O. Curé, and G. Blin, “A survey of RDF storage approaches,” Revue Africaine de la Recherche en Informatique et Mathématiques Appliquées, vol. 15, pp. 11–35, 2012. [11] O. Curé and G. Blin, RDF database systems: triples storage and SPARQL query processing. Morgan Kaufmann, 2014. [12] K. J. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds, “Efficient RDF storage and retrieval in Jena2,” in SWDB, 2003. [13] K. Wilkinson, “Jena Property Table Implementation,” in SSWS, (Athens, Georgia, USA), pp. 35–46, 2006. [14] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach, “SW-Store: a vertically partitioned DBMS for semantic web data management,” The VLDB Journal, vol. 18, pp. 385–406, Apr. 2009. [15] O. Curé and G. Blin, “An update strategy for the WaterFowl RDF data store,” in International Semantic Web Conference (Posters and Demos) (M. Horridge, M. Rospocher, and J. van Ossenbruggen, eds.), vol. 1272 of CEUR Workshop Proceedings, pp. 377–380, CEUR-WS.org, 2014.


[16] X. Pu, J. Wang, Z. Song, P. Luo, and M. Wang, “Efficient incremental update and querying in AWETO RDF storage system,” Data and Knowledge Engineering, vol. 89, pp. 55–75, 2014. [17] R. Punnoose, A. Crainiceanu, and D. Rapp, “Rya: a scalable RDF triple store for the clouds,” in Proceedings of the 1st International Workshop on Cloud Intelligence, Cloud-I ’12, (New York, NY, USA), pp. 4:1–4:8, ACM, 2012. [18] A. Harth and S. Decker, “Optimized index structures for querying RDF from the web,” in Proceedings of the Third Latin American Web Congress, LA-WEB ’05, (Washington, DC, USA), pp. 71–, IEEE Computer Society, 2005. [19] A. Harth, J. Umbrich, A. Hogan, and S. Decker, “YARS2: A federated repository for querying graph structured data from the Web,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2007. [20] C. Weiss, P. Karras, and A. Bernstein, “Hexastore: sextuple indexing for semantic web data management,” Proc. VLDB Endow., vol. 1, pp. 1008–1019, Aug. 2008. [21] T. Neumann and G. Weikum, “RDF-3X: a RISC-style engine for RDF,” Proc. VLDB Endow., vol. 1, pp. 647–659, Aug. 2008. [22] M. Atre, J. Srinivasan, and J. A. Hendler, “BitMat: a main-memory bit matrix of RDF triples for conjunctive triple pattern queries,” in International Semantic Web Conference, 2008. [23] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu, “TripleBit: a fast and compact system for large scale RDF data,” Proc. VLDB Endow., vol. 6, pp. 517–528, May 2013. [24] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias, “Binary RDF representation for publication and exchange (HDT),” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 19, pp. 22–41, 2013. [25] M. A. Martínez-Prieto, M. Arias Gallego, and J. D. Fernández, “Exchange and consumption of huge RDF data,” in The Semantic Web: Research and Applications (E. Simperl, P. Cimiano, A. Polleres, O. Corcho, and V. Presutti, eds.), (Berlin, Heidelberg), pp. 437–452, Springer Berlin Heidelberg, 2012. [26] O. Curé, G. Blin, D. Revuz, and D. C. Faye, “WaterFowl: a compact, self-indexed and inference-enabled immutable RDF store,” in The Semantic Web: Trends and Challenges (V. Presutti, C. d’Amato, F. Gandon, M. d’Aquin, S. Staab, and A. Tordai, eds.), (Cham), pp. 302–316, Springer International Publishing, 2014. [27] M. J. Rochkind, “The source code control system,” IEEE Transactions on Software Engineering, vol. SE-1, pp. 364–370, Dec. 1975. [28] C. Schneider, A. Zündorf, and J. Niere, “CoObRA - a small step for development tools to collaborative environments,” in Workshop on Directions in Software Engineering Environments, 2004. [29] W. F. Tichy, “RCS — a system for version control,” Software: Practice and Experience, vol. 15, no. 7, pp. 637–654, 1985. [30] J. D. Fernández, A. Polleres, and J. Umbrich, “Towards efficient archiving of dynamic linked open data,” in CEUR Workshop Proceedings, 2015. [31] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan, “Exploring the dynamics of linked data,” in The Semantic Web: ESWC 2013 Satellite Events (P. Cimiano, M. Fernández, V. Lopez, S. Schlobach, and J. Völker, eds.), (Berlin, Heidelberg), pp. 302–303, Springer Berlin Heidelberg, 2013. [32] M. Völkel and T. Groza, “SemVersion: An RDF-based Ontology Versioning System,” in Proceedings of IADIS International Conference on WWW/Internet (IADIS 2006) (M. B. Nunes, ed.), (Murcia, Spain), pp. 195–202, October 2006.

[33] S. Cassidy and J. Ballantine, “Version control for RDF triple stores,” in ICSOFT 2007 - 2nd International Conference on Software and Data Technologies, Proceedings, pp. 5–12, 2007. [34] D.-H. Im, S.-W. Lee, and H.-J. Kim, “A version management framework for RDF triple stores,” International Journal of Software Engineering and Knowledge Engineering, vol. 22, no. 01, pp. 85–106, 2012. [35] M. Vander Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens, and R. Van de Walle, “R&Wbase: git for triples,” in Proceedings of the 6th Workshop on Linked Data on the Web (C. Bizer, T. Heath, T. Berners-Lee, M. Hausenblas, and S. Auer, eds.), vol. 996 of CEUR Workshop Proceedings, May 2013. [36] M. Graube, S. Hensel, and L. Urbas, “R43ples: revisions for triples - an approach for version control in the semantic web,” in CEUR Workshop Proceedings, 2014. [37] C. Hauptmann, M. Brocco, and W. Wörndl, “Scalable semantic version control for linked data management,” in LDQ@ESWC, 2015. [38] T. Neumann and G. Weikum, “x-RDF-3X: fast querying, high update rates, and consistency for RDF databases,” Proc. VLDB Endow., vol. 3, pp. 256–263, Sept. 2010. [39] A. Cerdeira-Pena, A. Fariña, J. D. Fernández, and M. A. Martínez-Prieto, “Self-indexing RDF archives,” in 2016 Data Compression Conference (DCC), pp. 526–535, March 2016. [40] N. R. Brisaboa, A. Cerdeira-Pena, A. Fariña, and G. Navarro, “A compact RDF store using suffix arrays,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015. [41] J. Anderson and A. Bendiken, “Transaction-time queries in Dydra,” in Joint proceedings of the 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2017) and the 4th Workshop on Linked Data Quality (LDQ 2017) co-located with 14th European Semantic Web Conference (ESWC 2017), 2016. [42] P. Meinhardt, M. Knuth, and H. Sack, “TailR: a platform for preserving history on the web of data,” in Proceedings of the 11th International Conference on Semantic Systems, pp. 57–64, ACM, 2015. [43] R. Verborgh, M. Vander Sande, O. Hartig, J. Van Herwegen, L. De Vocht, B. De Meester, G. Haesendonck, and P. Colpaert, “Triple pattern fragments: a low-cost knowledge graph interface for the web,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37, pp. 184–206, 2016. [44] V. Papakonstantinou, G. Flouris, I. Fundulaki, K. Stefanidis, and Y. Roussakis, “SPBv: benchmarking linked data archiving systems,” in Joint Proceedings of BLINK2017: 2nd International Workshop on Benchmarking Linked Data and NLIWoD3: Natural Language Interfaces for the Web of Data co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 21st to 22nd, 2017, 2017. [45] J. D. F. Garcia, J. Umbrich, and A. Polleres, “BEAR: benchmarking the efficiency of RDF archiving,” Working Papers on Information Systems, Information Business and Operations 02/2015, Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, Vienna, 2015. [46] M. Meimaris and G. Papastefanatos, “The EvoGen benchmark suite for evolving RDF data,” in CEUR Workshop Proceedings, 2016. [47] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann, “DBpedia and the live extraction of structured data from Wikipedia,” Program, vol. 46, no. 2, pp. 157–181, 2012. [48] J. Umbrich, S. Neumaier, and A. Polleres, “Quality assessment and evolution of open data portals,” in Future Internet of Things and Cloud (FiCloud), 2015 3rd International Conference on, pp. 404–411, IEEE, 2015.

[49] “Number of Facebook users worldwide 2008-2018 — Statistic.” https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/. Accessed: 2018-05-24. [50] R. I. M. Dunbar, “Do online social media cut through the constraints that limit the size of offline social networks?,” Royal Society Open Science, 2016. [51] D. Roundy, “Darcs: distributed version management in Haskell,” in Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, Haskell ’05, (New York, NY, USA), pp. 1–4, ACM, 2005.

Appendices

A BEAR-A Query Results

In this section we list the average query duration for all the triple patterns of the BEAR-A benchmark.

Figure 3: Average VM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 1: Average VM query durations for SPO triple patterns in the first eight versions of BEAR-A.

Figure 2: Average VM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.

Figure 4: Average VM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.


Figure 5: Average VM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 8: Average VM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 6: Average VM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 9: Average VM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 7: Average VM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 10: Average VM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 11: Average VM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 14: Average DM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.

Figure 12: Average VM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 15: Average DM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 13: Average DM query durations for SPO triple patterns in the first eight versions of BEAR-A.

Figure 16: Average DM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 17: Average DM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 20: Average DM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 18: Average DM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 21: Average DM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 19: Average DM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 22: Average DM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 23: Average DM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 26: Average VQ query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.

Figure 24: Average DM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 27: Average VQ query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 25: Average VQ query durations for SPO triple patterns in the first eight versions of BEAR-A.

Figure 28: Average VQ query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 29: Average VQ query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 32: Average VQ query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 30: Average VQ query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 33: Average VQ query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 31: Average VQ query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 34: Average VQ query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 35: Average VQ query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 37: Average VM query durations for ?P? triple patterns in BEAR-B daily.

Figure 38: Average DM query durations for ?P? triple patterns in BEAR-B daily.

Figure 36: Average VQ query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.

B BEAR-B daily Query Results

In this section we list the average query duration for the ?P? and ?PO triple patterns of the BEAR-B daily benchmark.

Figure 39: Average VQ query durations for ?P? triple patterns in BEAR-B daily.

C BEAR-B hourly Query Results

In this section we list the average query duration for the ?P? and ?PO triple patterns in the first 400 versions of the BEAR-B hourly benchmark.

Figure 40: Average VM query durations for ?PO triple patterns in BEAR-B daily.

Figure 43: Average VM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.

Figure 41: Average DM query durations for ?PO triple patterns in BEAR-B daily.

Figure 42: Average VQ query durations for ?PO triple patterns in BEAR-B daily.

Figure 44: Average DM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.

Figure 45: Average VQ query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.

Figure 47: Average DM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.

Figure 46: Average VM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.

Figure 48: Average VQ query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.