Reducing storage requirements of multi-version graph databases using forward and reverse deltas

Thibault Mahieu

Supervisor: Prof. dr. ir. Ruben Verborgh Counsellors: Ir. Ruben Taelman, Dr. ir. Miel Vander Sande

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Department of Electronics and Information Systems Chair: Prof. dr. ir. Koen De Bosschere Faculty of Engineering and Architecture Academic year 2017-2018

Acknowledgments

I would like to thank my promoter prof. dr. ir. Ruben Verborgh and my supervisors Ruben Taelman and dr. ir. Miel Vander Sande for their help and advice during the course of this Master's Dissertation. Their constant guidance and support helped shape this Master's Dissertation into what it is today. I would also like to thank my friends and family, who have supported me throughout the years. In particular, I thank my brother, Christof Mahieu, to whom I could always turn for support.

Thibault Mahieu May 31, 2018

Usage

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Thibault Mahieu May 31, 2018

Reducing Storage Requirements of Multi-Version Graph Databases Using Forward and Reverse Deltas

by

Thibault Mahieu

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Computer Science Engineering

Academic year 2017–2018

Supervisor: Prof. dr. ir. Ruben Verborgh Counsellors: Ir. Ruben Taelman, Dr. ir. Miel Vander Sande Faculty of Engineering and Architecture University of Ghent

Department of Electronics and Information Systems Chairman: Prof. dr. ir. Koen De Bosschere

Summary

This master's dissertation presents a potential storage optimization for change-based multi-version graph databases. This storage optimization is then applied to an existing RDF archive called OSTRICH. Finally, the implementation of the presented storage optimization is compared with the original OSTRICH RDF archive.

Keywords

RDF Versioning, RDF Archiving, Semantic Data Versioning, Bidirectional Delta Chain

Reducing Storage Requirements of Multi-Version Graph Databases Using Forward and Reverse Deltas

Thibault Mahieu

Supervisors: prof. dr. ir. Ruben Verborgh, ir. Ruben Taelman, dr. ir. Miel Vander Sande

Abstract—Linked Datasets evolve over time for numerous reasons, such as the addition of new data. Capturing this evolution via versioned data archives can provide new insights. This master's dissertation presents a potential storage optimization for change-based multi-version graph databases. This storage optimization is then applied to an existing RDF archive called OSTRICH, for which the implementation is called COBRA. Finally, COBRA is compared with the original OSTRICH RDF archive. Our experiments show that COBRA lowers the storage size compared to OSTRICH, but not for every benchmark.

Keywords—RDF Versioning, RDF Archiving, Bidirectional Delta Chain

I. PREFACE

Datasets change over time for numerous reasons, such as the addition of new information. Capturing this evolution allows for historical analyses which can provide new insights. Linked Data is no exception to this. In fact, archiving Linked Open Data has been an area of research for a few years [2].

One particular research focus is enabling offsettable query streams for RDF archives, since query streams are more memory-efficient for large query results and the offset allows for faster queries when only a subset is needed. OSTRICH [3] is state-of-the-art when it comes to offset-enabled RDF archives. OSTRICH stores versions in a delta chain that starts with a fully materialized snapshot, followed by a series of changesets relative to the snapshot, referred to as deltas. However, OSTRICH has a large ingestion time for large dataset versions. The ingestion time can be reduced by introducing additional snapshots; however, this can in turn increase the storage size.

In this work, we will explore if we can reduce the resulting storage size increase of the multiple-snapshot approach, while maintaining the ingestion time reduction, by restructuring the delta chain.

II. BACKGROUND

A. Linked Data

In 2001, Tim Berners-Lee, the inventor of the World Wide Web, proposed the idea of the Semantic Web [4]. The goal of the Semantic Web is to make data on the Web understandable to machines so that they can perform complex tasks. In order to make this vision a reality, Linked Data (LD) was introduced. As described by Bizer et al. [5], LD refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can, in turn, be linked to from external data sets. The standard for representing LD is RDF, a graph-based data model that uses a triple structure. RDF data can be queried using SPARQL, a graph-based pattern matching query language, where the graph-based patterns are made up of triple patterns consisting of a subject, predicate and object.

B. RDF Stores

RDF stores are storage systems designed for storing RDF data.

HDT [6] is an RDF store focused on compression that consists of three parts:
• Header - contains metadata and serves as an entry point to the data
• Dictionary - mapping between triple components and unique identifiers, referred to as dictionary encoding
• Triples - structure of the underlying RDF graph after dictionary encoding

HDT resolves queries on the compressed data, but only has one index (SP-O), making certain triple patterns hard to resolve. In addition, HDT stores are by design immutable after creation, making them unsuitable for volatile datasets.

HDT-FoQ [7] is an extension of HDT [6] that focuses on resolving queries faster. For this reason, HDT-FoQ adds two additional indexes, namely PS-O and OP-S, to cover more access patterns. The PS-O index makes use of a wavelet tree, while the OP-S index uses adjacency lists, similar to the SP-O index.

C. Non-RDF Archives

Many techniques from non-RDF archives and Version Control Systems (VCS) can be repurposed for versioning RDF archives.

RCS [8] is a delta-based VCS, wherein each delta consists of insertions and deletions of lines. The latest version is stored completely and older revisions are stored in so-called reverse deltas, resulting in quick access to the latest version. To add a new revision, the system stores the latest revision completely and replaces the previous revision by its delta, keeping the rest of the chain intact.

D. RDF Archives

RDF archives are versioned RDF stores. Fernández et al. [2] distinguish three archiving policies for Linked Open Data (LOD):
• Independent Copies (IC) - Every version is stored fully materialized.
• Change-Based (CB) - Only changes between versions are stored (a minimal changeset sketch follows Section D.1).
• Timestamp-Based (TB) - Triples are annotated with their temporal validity.

D.1 Independent Copies Archive Policy

SemVersion [9] is an IC versioning system for RDF that tries to emulate classical Concurrent Versions System (CVS) workflows for version management. Each version is stored separately in RDF stores that conform to a certain API, which manages said versions.
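To make the change-based policy concrete, here is a minimal sketch of a changeset and its application, assuming triples and changesets are modeled as plain sorted sets; the type and function names are illustrative and are not taken from any of the systems cited above. COBRA itself is implemented in C++, so the sketches in this abstract use C++ as well.

```cpp
#include <set>
#include <string>
#include <tuple>

// A triple as a (subject, predicate, object) tuple; std::set keeps
// triples sorted, which the later sort-merge sketches rely on.
using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;

// A change-based version: only the differences to a base version.
struct Changeset {
    TripleSet additions;
    TripleSet deletions;
};

// Materialize a version by applying its changeset to a base version.
TripleSet apply(TripleSet base, const Changeset& delta) {
    for (const Triple& t : delta.deletions) base.erase(t);
    for (const Triple& t : delta.additions) base.insert(t);
    return base;
}
```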

D.2 Change-Based Archive Policy

Cassidy et al. [10] propose a CB RDF archive that is built on the theory of patches [11], a mathematical model that describes how patches can be manipulated in order to obtain the desired version in the context of software. This model describes fundamental operations, such as the commute operation, the revert operation and the merge operation. Cassidy et al. adapt these operations so that they are applicable to RDF stores as well.

Im et al. [12] introduce a CB store built on a Relational Database Management System (RDBMS). They propose an aggregated deltas approach wherein not only the delta between a parent and child, but all possible deltas are stored. This results in an increased storage overhead, but a decreased version materialization cost compared to the classic sequential delta chain.

Vander Sande et al. [13] introduce R&WBase, a distributed CB RDF archive wherein versions are stored as consecutive deltas. Deltas between versions consist of an addition set and a deletion set, respectively listing which triples have been added and deleted. Since deltas are stored in the same graph, triples are annotated with a context number, indicating to which version the triple belongs and whether it was added or deleted. In particular, an even context number indicates the triple is an addition, an uneven context number indicates the triple is a deletion. Queries can be handled efficiently by looking at the highest context number: if that number is even, the triple is present for that version; if it is uneven, the triple is not present for that version. Finally, R&WBase also supports tagging, branching and merging of datasets.

R43ples [14] is another CB RDF archive, since it groups additions and deletions in named graphs. R43ples allows manipulation of revisions with SPARQL, by introducing new keywords such as REVISION, TAG and BRANCH. Versions are materialized by starting from the head of the branch and applying all prior additions/deletions.

D.3 Timestamp-Based Archive Policy

Hauptmann et al. [15] propose a delta-based store similar to R43ples, including complete graphs and manipulation via SPARQL. However, in Hauptmann's approach, each triple is virtually annotated with version information that is cached using a hash table, making it a TB approach.

x-RDF-3X [16] extends RDF-3X [17] with versioning support. Each triple is annotated with a creation timestamp and, when appropriate, a deletion timestamp, making it a TB approach.

v-RDFCSA [18] is a TB archiving extension of RDFCSA [19], a compact self-indexing RDF storage structure based on suffix arrays.

Dydra [20] is an RDF archive that stores versions as named graphs in a quad store, which can be queried using the REVISION SPARQL keyword. Dydra uses B+-trees with six indexes: GSPO, GPOS, GOSP, SPOG, POSG, OSPG. The B+-tree values indicate in which revisions a particular quad is visible, making it a TB system.

D.4 Hybrid Archive Policy

TailR [21] interleaves fully materialized versions (snapshots) in between the delta chain, as seen in Figure 1. The snapshots reset the version materialization cost but can lead to a higher storage requirement.

Fig. 1: Non-aggregated unidirectional delta chain, as done in TailR.

OSTRICH [3] is another hybrid solution that interleaves fully materialized snapshots in between the delta chain, as seen in Figure 2. However, unlike TailR, OSTRICH uses aggregated deltas [12]: deltas that directly refer to the snapshot instead of the previous version. Moreover, the delta chain is stored by annotating each triple with version information, making it an IC, CB and TB hybrid. Ingestion can be done using an in-memory batch algorithm or a streaming algorithm. OSTRICH supports offsettable query result streams. In addition, OSTRICH also provides query count estimation functionality, which can be used as a basis for query optimization in query engines [22].

Fig. 2: Aggregated unidirectional delta chain where all deltas are relative to the snapshot at the beginning of the chain, as done in OSTRICH.

E. Query Atoms

Fernández et al. [2] also distinguish five types of queries, called query atoms:
• Version Materialization (VM) queries retrieve data from a single version.
• Delta Materialization (DM) queries retrieve the differences between two versions.
• Version Query (VQ) annotates query results with the version numbers wherein the data exists.
• Cross-Version join (CV) joins the results of two queries over two different versions.
• Change Materialization (CM) returns a list of versions in which a given query produces consecutively different results.

Some storage policies are better suited for some query atoms than others. The IC approach is best suited for VM queries, since the versions are stored completely and do not need to be reconstructed. The CB approach is particularly effective for DM queries between neighboring versions, since exactly these changes are stored. The TB approach is very efficient in resolving VQ queries, since triples are naturally annotated with the version numbers wherein they exist.
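As a rough illustration of the first three query atoms, which are the ones OSTRICH and COBRA support, the following interface sketch shows what each atom consumes and returns; CV and CM are omitted for brevity, and all type and method names here are hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// A triple and a triple pattern; an empty pattern component is a variable.
using Triple = std::tuple<std::string, std::string, std::string>;
struct TriplePattern { std::string s, p, o; };

// Hypothetical interface over a versioned RDF archive.
struct VersionedArchive {
    virtual ~VersionedArchive() = default;

    // VM: all triples matching the pattern in one version.
    virtual std::vector<Triple>
    versionMaterialization(const TriplePattern& pattern, int version) = 0;

    // DM: matching triples changed between two versions,
    // flagged true for additions and false for deletions.
    virtual std::vector<std::pair<Triple, bool>>
    deltaMaterialization(const TriplePattern& pattern, int from, int to) = 0;

    // VQ: matching triples annotated with the versions they exist in.
    virtual std::map<Triple, std::set<int>>
    versionQuery(const TriplePattern& pattern) = 0;
};
```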

F. RDF Archiving Benchmarks

BEAR [2] is an RDF archiving benchmark based on real-world data from three different domains:
• BEAR-A - 58 weekly snapshots from the Dynamic Linked Data Observatory [23].
• BEAR-B - the 100 most volatile resources from DBpedia Live [24] at three different granularities: instant, hourly and daily.
• BEAR-C - 32 weekly snapshots from the Open Data Portal Watch project [25].

BEAR-A provides triple pattern queries and their results for seven triple patterns. BEAR-B provides triple pattern queries and their results for ?PO and ?P? triple patterns, which are based on the most frequent triple patterns from the DBpedia query set. BEAR-C provides 10 complex queries that, although they cannot be efficiently resolved with current archiving strategies, could help foster the development of new query resolution algorithms.

III. STORAGE OPTIMIZATION: BIDIRECTIONAL DELTA CHAIN

As seen in previous works [21], [3], a delta chain consists of a fully materialized snapshot followed by a series of deltas. The main idea behind our storage optimization is moving the snapshot from the front of the delta chain to the middle of the delta chain, in order to potentially reduce the overall storage size. This transforms the delta chain into a bidirectional delta chain, which divides the original delta chain into two smaller delta chains, i.e. the reverse delta chain and the forward delta chain. Figures 3 and 4 show two example bidirectional delta chains.

Fig. 3: A simplified non-aggregated bidirectional delta chain.

Fig. 4: A simplified aggregated bidirectional delta chain.

A. Non-Aggregated Bidirectional Delta Chain

In a non-aggregated delta chain, all deltas reference the closest preceding version. So in order to materialize a version, all preceding deltas need to be applied until the fully materialized snapshot is reached. As stated above, a bidirectional delta chain divides the original delta chain into two smaller delta chains. Moreover, the size of the deltas remains the same, since the reverse delta chain is just the inverse of the original deltas. Therefore, the worst-case materialization cost for bidirectional delta chains is half of that for unidirectional delta chains. On the other hand, bidirectional non-aggregated delta chains could also potentially reduce storage size while maintaining a similar version materialization time. Indeed, if we compare a series of two unidirectional delta chains with a single bidirectional delta chain, one fewer snapshot needs to be stored.

B. Aggregated Bidirectional Delta Chain

In an aggregated delta chain, all deltas reference the snapshot, which means that an aggregated delta contains all the changes from all preceding deltas. In this work, we assume that a higher distance between versions results in a bigger aggregated delta. This assumption holds for datasets that steadily grow over time by adding new triples, because later versions will have more and more new triples compared to earlier versions. It follows that reducing the average distance between the snapshot and the versions results in smaller aggregated deltas, thus reducing the overall storage size. Bidirectional delta chains reduce the average distance between the snapshot and the other versions. Therefore, bidirectional delta chains should have a lower storage size compared to unidirectional delta chains for growing datasets.

C. Bidirectional Delta Chain Disadvantages

In-order ingestion is the biggest drawback of bidirectional delta chains. Indeed, to ingest a version in the reverse delta chain, we would need to calculate the delta between the version and a snapshot that is not known yet. However, a fix-up algorithm can be used to build the bidirectional delta chain: all versions are first stored in a forward delta chain, and once the future snapshot is inserted, the forward delta chain is converted into a reverse delta chain. RCS [8] presents an alternative to the fix-up algorithm. In this approach, the latest version is always stored fully materialized. To add a new version, the system stores the new version completely and replaces the previous version by its delta, keeping the rest of the chain intact.
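A minimal sketch of version materialization in a non-aggregated bidirectional delta chain, under the assumption that deltas are stored as plain addition/deletion sets: versions left of the snapshot walk the reverse chain and versions right of it walk the forward chain, so at most half of the chain's deltas are ever applied. The data layout is illustrative, not COBRA's actual storage structure.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;
struct Changeset { TripleSet additions, deletions; };

// Non-aggregated bidirectional chain: the snapshot sits in the middle,
// with deltas stepping away from it in both directions.
struct BidirectionalChain {
    std::vector<Changeset> reverse;  // steps 1..k back from the snapshot
    TripleSet snapshot;
    std::vector<Changeset> forward;  // steps 1..k forward from the snapshot
};

static TripleSet applyDelta(TripleSet version, const Changeset& delta) {
    for (const Triple& t : delta.deletions) version.erase(t);
    for (const Triple& t : delta.additions) version.insert(t);
    return version;
}

// offset 0 is the snapshot itself; negative offsets walk the reverse
// chain, positive offsets the forward chain.
TripleSet materialize(const BidirectionalChain& chain, int offset) {
    TripleSet version = chain.snapshot;
    for (int i = 0; i < -offset; ++i) version = applyDelta(version, chain.reverse[i]);
    for (int i = 0; i < offset; ++i)  version = applyDelta(version, chain.forward[i]);
    return version;
}
```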
IV. BIDIRECTIONAL RDF ARCHIVE

A. Storage Overview

The storage structure for the bidirectional delta chain can be seen in Figure 5. The storage structure is similar to OSTRICH [3]; the reverse delta chain has the same storage structure as the forward delta chain.

Fig. 5: An overview of the storage structure of a bidirectional delta chain. Figure adapted from OSTRICH [3].

B. Multiple Snapshots

The fix-up algorithm requires multiple snapshots, but OSTRICH only supports a single snapshot. Therefore, we modify OSTRICH so that multiple snapshots are supported. Supporting multiple snapshots comes down to finding the corresponding snapshot for a given version. We calculate the greatest lower bound and the least upper bound of all the snapshots for the given version. If the upper bound snapshot does not have a reverse delta chain, the version is stored in a forward delta chain and the corresponding snapshot is the lower bound snapshot. If the upper bound snapshot does have a reverse delta chain, the corresponding snapshot is the snapshot closest to the version.

C. Ingestion

As mentioned before, in-order ingestion is difficult in a bidirectional delta chain. Therefore, we first discuss out-of-order ingestion, before discussing in-order ingestion.

C.1 Out-of-order Ingestion

Ingesting versions out-of-order in a reverse delta chain is similar to OSTRICH's forward ingestion process; we simply need to transform the input changesets. Firstly, since the forward ingestion algorithm expects the input changeset to reference the snapshot, we reverse the input changeset by swapping the additions and deletions so that the input changeset references the snapshot. Secondly, since the forward ingestion algorithm expects the version closest to the snapshot to be inserted first, we insert the versions in reverse order.
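The first transformation can be sketched in a few lines, assuming the same set-based changeset model as in the earlier sketches; swapping additions and deletions inverts the direction of a delta.

```cpp
#include <set>
#include <string>
#include <tuple>
#include <utility>

using Triple = std::tuple<std::string, std::string, std::string>;
struct Changeset { std::set<Triple> additions, deletions; };

// Swapping additions and deletions turns a delta that leads away from
// the snapshot into one that leads toward it, after which the existing
// forward ingestion algorithm can be reused unchanged.
Changeset reverseChangeset(Changeset delta) {
    std::swap(delta.additions, delta.deletions);
    return delta;
}
```

The second transformation is purely an ordering concern: the reversed changesets are handed to the forward ingestion algorithm starting from the version closest to the snapshot.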
C.2 In-order Ingestion

For in-order ingestion, we utilize a fix-up algorithm, which starts ingesting versions in a temporary forward delta chain. Once the system decides a new delta chain needs to be initiated, for example because the delta chain size exceeds a certain threshold, the system stores the next version once in the temporary forward delta chain and again as the snapshot of the new permanent delta chain. The reason behind storing the version twice is to simplify the input extraction, which is explained below. Figure 6 shows the resulting delta chains.

Fig. 6: State of the delta chains before the fix-up algorithm is applied.

Once the system has some idle time, the fix-up process can be performed. The fix-up process starts by extracting the original input changesets from the temporary delta chain. Hence, the algorithm iterates over the version information for every triple in the temporary delta chain. If the previous version is present in this version information, the triple was already added in a previous version and therefore was not part of the input changeset of the current version. If the previous version is not present, the triple was first added in the current version and should be part of the input changeset. The temporary delta chain can then be deleted and a new permanent reverse delta chain can be constructed out-of-order from the extracted input changesets.
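A sketch of this extraction step, under the simplifying assumption that the version information of the temporary delta chain is available as a map from each triple to the set of versions in which it is present; only additions are shown, and deletions would be handled analogously.

```cpp
#include <map>
#include <set>
#include <string>
#include <tuple>

using Triple = std::tuple<std::string, std::string, std::string>;

// Versions in which each triple is present, as recorded in the
// temporary (aggregated) forward delta chain.
using VersionInfo = std::map<Triple, std::set<int>>;

// Extract the triples that were *first* added in `version`: a triple
// belongs to the original input changeset of `version` only if it is
// present in `version` but not in `version - 1`.
std::set<Triple> extractAdditions(const VersionInfo& info, int version) {
    std::set<Triple> additions;
    for (const auto& [triple, versions] : info)
        if (versions.count(version) && !versions.count(version - 1))
            additions.insert(triple);
    return additions;
}
```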
D. Queries

VM queries retrieve data from a single version. VM queries are handled exactly the same as in OSTRICH, even for versions stored in the reverse delta chain, since we stored inverse deltas.

DM queries retrieve the differences between two versions and annotate whether each triple is an addition or a deletion. In this work, we focus on DM queries for a single snapshot and its corresponding reverse and forward delta chains. We can discern three cases for DM queries, namely: a DM query between a snapshot and a delta, a DM query between two deltas in the same delta chain (intra-delta), and a DM query between two deltas in opposite delta chains (inter-delta). The first and second case are similar to OSTRICH. In the third case, we resolve the DM query by splitting up the requested delta into two sequential deltas that are relative to the snapshot and then merging these sequential deltas back together. In other words, using the Darcs [11] patch notation, with $o$ being the start version, $e$ being the end version and $s$ being the snapshot: ${}^{o}D^{e} = {}^{o}D_{1}^{s}\,{}^{s}D_{2}^{e}$. This strategy is quite efficient, since the deltas relative to the snapshot are exactly what is stored. Furthermore, since the snapshot-relative deltas are sorted, they can be merged in a sort-merge fashion; a sketch follows at the end of this section. It is difficult to give an exact count of the results for inter-delta DM queries. However, an estimate of the result count can be calculated by summing the counts of both deltas relative to the snapshot. This can overestimate the actual count if triples are present in both deltas.

VQ queries annotate triples with the version numbers in which they exist. We present a VQ algorithm for a single snapshot and its corresponding reverse and forward delta chains, based on the VQ algorithm of OSTRICH. The algorithm starts by iterating over all the triples in the snapshot for the given triple pattern. Next, the deletion trees are probed for each triple. If the triple is not present in the deletion trees, the triple is present in all versions. If the triple is present in a deletion tree, the corresponding versions are erased from the version annotation. After all the snapshot triples have been processed, the algorithm iterates over the addition triples stored in the addition trees in a sort-merge join fashion. As was the case with snapshot triples, the deletion trees are probed for each triple. If the triple is not present in the deletion trees, the triple is present in all versions ranging from the version that introduced the triple to the last version. If the triple is present in a deletion tree, those versions are erased from the annotation. Result streams can be partially offset by offsetting the snapshot iterator of HDT [6].
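A hedged sketch of the inter-delta merge ${}^{o}D_{1}^{s}\,{}^{s}D_{2}^{e}$, assuming both snapshot-relative deltas are available as sorted addition/deletion sets; triples added in one delta and deleted in the other cancel out.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <tuple>

using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;
struct Changeset { TripleSet additions, deletions; };

static TripleSet difference(const TripleSet& a, const TripleSet& b) {
    TripleSet out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
    return out;
}

static TripleSet unite(const TripleSet& a, const TripleSet& b) {
    TripleSet out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::inserter(out, out.end()));
    return out;
}

// Merge the sequential deltas oD1s (start -> snapshot) and sD2e
// (snapshot -> end) into oDe. A triple added in one delta and deleted
// in the other cancels out and is emitted in neither set.
Changeset mergeSequential(const Changeset& d1, const Changeset& d2) {
    return {
        unite(difference(d1.additions, d2.deletions),
              difference(d2.additions, d1.deletions)),
        unite(difference(d1.deletions, d2.additions),
              difference(d2.deletions, d1.additions)),
    };
}
```

Because std::set iterates in sorted order, each set operation is a single sort-merge pass over both deltas, matching the strategy described above. The sketch also shows why the count estimate |D1| + |D2| can overestimate: a cancelled triple is counted in both deltas but never emitted.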

V. EVALUATION

COBRA (Change-Based Offset-Enabled Bidirectional RDF Archive) refers to the C++ software implementation of our storage optimization. COBRA uses the same software libraries as OSTRICH [3].

A. Experimental Setup

We will evaluate the ingestion and query resolution capabilities of COBRA. For this we will use the BEAR [2] benchmark, in particular BEAR-A, BEAR-B daily and BEAR-B hourly. The ingestion process will be evaluated on storage size and ingestion time. For BEAR-A we will only ingest the first eight versions due to memory constraints. Similarly, for BEAR-B hourly, we will only ingest the first 400 versions. For BEAR-B daily, we will ingest all 89 versions. We will perform the ingestion evaluation for multiple storage layouts and ingestion orders, namely:
• OSTRICH-1F: OSTRICH with one forward delta chain, as seen in Figure 2.
• OSTRICH-2F: OSTRICH with two forward delta chains.
• COBRA-PRE FIX UP: COBRA's pre-fix-up state, as seen in Figure 6.
• COBRA-POST FIX UP: COBRA's bidirectional delta chain post fix-up, as seen in Figure 4.
• COBRA-OUT OF ORDER: COBRA's bidirectional delta chain, as seen in Figure 4, but ingested out-of-order (snapshot, then the reverse delta chain, then the forward delta chain).

BEAR also provides query sets, which will be evaluated as VM queries for all versions, DM queries between all versions and a VQ query. Since neither OSTRICH nor COBRA supports multiple snapshots for all query atoms, we limit the query experiments to OSTRICH's unidirectional storage layout and COBRA's bidirectional storage layout.

B. Results

As can be seen in Figures 7, 8 and 9, no approach has the lowest storage size for all benchmarks. Indeed, COBRA has the lowest storage size for BEAR-A, OSTRICH-1F has the lowest storage size for BEAR-B daily and OSTRICH-2F has the lowest storage size for BEAR-B hourly. For all benchmarks, COBRA-OUT OF ORDER reduces the storage increase caused by initializing a second delta chain. However, this does not always result in an overall storage size reduction, due to the size difference between the first delta chain and the reverse delta chain.

Fig. 7: Cumulative storage size for the first eight versions of BEAR-A.
Fig. 8: Cumulative storage size for all versions of BEAR-B daily.
Fig. 9: Cumulative storage size for the first 400 versions of BEAR-B hourly.

Table I shows the ingestion times of the different configurations for all three benchmarks. OSTRICH-1F has the highest ingestion time. We also see that COBRA-PRE FIX UP has a higher ingestion time than OSTRICH-2F due to the additional version.

TABLE I: Ingestion times of the different configurations for all three benchmarks. The ingestion time of COBRA-POST FIX UP is represented as the sum of the ingestion time of COBRA-PRE FIX UP and the fix-up time, since COBRA-POST FIX UP uses the fix-up algorithm.

configuration        BEAR-A (min)     BEAR-B daily (min)  BEAR-B hourly (min)
OSTRICH-1F           1419.27          6.53                34.47
OSTRICH-2F           686.87           3.18                15.20
COBRA-PRE FIX UP     775.31           3.28                14.87
COBRA-POST FIX UP    775.31 + 502.75  3.28 + 2.48         14.87 + 11.41
COBRA-OUT OF ORDER   877.52           4.24                18.30

Figures 10, 13 and 16 display the mean VM query durations for the three benchmarks. VM queries are resolved faster in COBRA than in OSTRICH, even though the same VM algorithm was used.

As can be seen in Figures 11, 14 and 17, COBRA also resolves DM queries faster than OSTRICH. The reason for this is that intra-delta DM queries are faster in smaller delta chains.

Figures 12, 15 and 18 display the mean VQ query durations for the three benchmarks. The VQ durations are roughly similar for COBRA and OSTRICH, which means COBRA's altered VQ algorithm does not cause significant overhead.

Fig. 10: Mean BEAR-A VM query duration of all versions for all triple patterns.
Fig. 11: Mean BEAR-A DM query duration between all versions for all triple patterns.
Fig. 12: Mean BEAR-A VQ query duration for all triple patterns.
Fig. 13: Mean BEAR-B daily VM query duration of all versions for all triple patterns.
Fig. 14: Mean BEAR-B daily DM query duration between all versions for all triple patterns.
Fig. 15: Mean BEAR-B daily VQ query duration for all triple patterns.
Fig. 16: Mean BEAR-B hourly VM query duration of all versions for all triple patterns.
Fig. 17: Mean BEAR-B hourly DM query duration between all versions for all triple patterns.
Fig. 18: Mean BEAR-B hourly VQ query duration for all triple patterns.

VI. CONCLUSION

In this work, we presented bidirectional delta chains as a potential storage optimization for CB RDF archives. We applied this storage optimization to an existing RDF archive named OSTRICH [3]. For this purpose, we modified OSTRICH so that multiple snapshots could be supported. Next, we presented an in-order ingestion algorithm. Moreover, we presented a novel DM query algorithm for inter-delta versions. Finally, we altered the existing VQ query algorithm so that bidirectional delta chains are supported.

We evaluated different storage configurations and concluded that no storage configuration has the lowest storage size for all benchmarks. We recommend initializing a new delta chain when the latest delta chain becomes too large. We also recommend merging two forward delta chains into a bidirectional delta chain if the first delta chain is more similar to the second snapshot than to the first snapshot. We also confirmed that initiating a new delta chain is a viable method for reducing the ingestion time. Finally, we evaluated VM, DM and VQ queries for OSTRICH and COBRA and observed that VM and DM queries were faster in COBRA, while VQ queries were equally fast.

In conclusion, bidirectional delta chains are not the all-round storage optimization technique we set out to find at the start of this work; however, they are a viable tool for reducing the overall storage size in certain cases.

On this topic, there are many opportunities for future research. First, there needs to be a reliable way of predicting whether a delta chain is more similar to the preceding snapshot or the future snapshot. Second, future work could devise a novel input extraction algorithm for the fix-up algorithm so that the middle version does not need to be stored twice. Finally, additional research is needed to expand the current DM and VQ algorithms to multiple snapshots and to allow for more efficient offsets.

REFERENCES

[1] Cyril Schoreels, Brian Logan, and Jonathan M. Garibaldi, "Agent based genetic algorithm employing financial technical analysis for making trading decisions using historical equity market data," in Intelligent Agent Technology (IAT 2004), Proceedings, IEEE/WIC/ACM International Conference on. IEEE, 2004, pp. 421–424.
[2] Javier D. Fernández, Jürgen Umbrich, Axel Polleres, and Magnus Knuth, "Evaluating query and storage strategies for RDF archives," in Proceedings of the 12th International Conference on Semantic Systems (SEMANTiCS 2016), New York, NY, USA, 2016, pp. 41–48, ACM.
[3] Ruben Taelman, Ruben Verborgh, and Erik Mannens, "Exposing RDF archives using triple pattern fragments," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017.
[4] Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web," Scientific American, vol. 284, no. 5, pp. 34–43, 2001.
[5] Christian Bizer, Tom Heath, and Tim Berners-Lee, "Linked data - the story so far," International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009.
[6] Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, and Mario Arias, "Binary RDF representation for publication and exchange (HDT)," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 19, pp. 22–41, 2013.
[7] Miguel A. Martínez-Prieto, Mario Arias Gallego, and Javier D. Fernández, "Exchange and consumption of huge RDF data," in The Semantic Web: Research and Applications, Elena Simperl, Philipp Cimiano, Axel Polleres, Oscar Corcho, and Valentina Presutti, Eds., Berlin, Heidelberg, 2012, pp. 437–452, Springer Berlin Heidelberg.
[8] Walter F. Tichy, "RCS - a system for version control," Software: Practice and Experience, vol. 15, no. 7, pp. 637–654, 1985.
[9] Max Völkel and Tudor Groza, "SemVersion: An RDF-based Ontology Versioning System," in Proceedings of the IADIS International Conference on WWW/Internet (IADIS 2006), Miguel Baptista Nunes, Ed., Murcia, Spain, October 2006, pp. 195–202.
[10] Steve Cassidy and James Ballantine, "Version control for RDF triple stores," in ICSOFT 2007 - 2nd International Conference on Software and Data Technologies, Proceedings, 2007, pp. 5–12.
[11] David Roundy, "Darcs: Distributed version management in Haskell," in Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, New York, NY, USA, 2005, Haskell '05, pp. 1–4, ACM.
[12] Dong-Hyuk Im, Sang-Won Lee, and Hyoung-Joo Kim, "A version management framework for RDF triple stores," International Journal of Software Engineering and Knowledge Engineering, vol. 22, no. 01, pp. 85–106, 2012.
[13] Miel Vander Sande, Pieter Colpaert, Ruben Verborgh, Sam Coppens, Erik Mannens, and Rik Van de Walle, "R&WBase: git for triples," in Proceedings of the 6th Workshop on Linked Data on the Web, Christian Bizer, Tom Heath, Tim Berners-Lee, Michael Hausenblas, and Sören Auer, Eds., May 2013, vol. 996 of CEUR Workshop Proceedings.
[14] Markus Graube, Stephan Hensel, and Leon Urbas, "R43ples: Revisions for triples - an approach for version control in the semantic web," in CEUR Workshop Proceedings, 2014.
[15] Claudius Hauptmann, Michele Brocco, and Wolfgang Wörndl, "Scalable semantic version control for linked data management," in LDQ@ESWC, 2015.
[16] Thomas Neumann and Gerhard Weikum, "x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases," Proc. VLDB Endow., vol. 3, no. 1-2, pp. 256–263, Sept. 2010.
[17] Thomas Neumann and Gerhard Weikum, "RDF-3X: A RISC-style engine for RDF," Proc. VLDB Endow., vol. 1, no. 1, pp. 647–659, Aug. 2008.
[18] A. Cerdeira-Pena, A. Fariña, J. D. Fernández, and M. A. Martínez-Prieto, "Self-indexing RDF archives," in 2016 Data Compression Conference (DCC), March 2016, pp. 526–535.
[19] Nieves R. Brisaboa, Ana Cerdeira-Pena, Antonio Fariña, and Gonzalo Navarro, "A compact RDF store using suffix arrays," in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015.
[20] James Anderson and Arto Bendiken, "Transaction-time queries in Dydra," in Joint proceedings of the 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2017) and the 4th Workshop on Linked Data Quality (LDQ 2017), co-located with the 14th European Semantic Web Conference (ESWC 2017), 2017.
[21] Paul Meinhardt, Magnus Knuth, and Harald Sack, "TailR: a platform for preserving history on the web of data," in Proceedings of the 11th International Conference on Semantic Systems. ACM, 2015, pp. 57–64.
[22] Ruben Verborgh, Miel Vander Sande, Olaf Hartig, Joachim Van Herwegen, Laurens De Vocht, Ben De Meester, Gerald Haesendonck, and Pieter Colpaert, "Triple pattern fragments: a low-cost knowledge graph interface for the web," Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37, pp. 184–206, 2016.
[23] Tobias Käfer, Ahmed Abdelrahman, Jürgen Umbrich, Patrick O'Byrne, and Aidan Hogan, "Exploring the dynamics of linked data," in The Semantic Web: ESWC 2013 Satellite Events, Philipp Cimiano, Miriam Fernández, Vanessa Lopez, Stefan Schlobach, and Johanna Völker, Eds., Berlin, Heidelberg, 2013, pp. 302–303, Springer Berlin Heidelberg.
[24] Mohamed Morsey, Jens Lehmann, Sören Auer, Claus Stadler, and Sebastian Hellmann, "DBpedia and the live extraction of structured data from Wikipedia," Program, vol. 46, no. 2, pp. 157–181, 2012.
[25] Jürgen Umbrich, Sebastian Neumaier, and Axel Polleres, "Quality assessment and evolution of open data portals," in Future Internet of Things and Cloud (FiCloud), 2015 3rd International Conference on. IEEE, 2015, pp. 404–411.

Table of Contents

Acknowledgments iv

Usage v

Summary vi

Extended Abstract vii

Table of Contents xiii

List of Figures xvi

List of Tables xix

1 Preface 1
1.1 Introduction ...... 1
1.2 Research Question ...... 1
1.3 Outline ...... 2

2 Background 3
2.1 Semantic Web ...... 3
2.2 RDF ...... 3
2.3 SPARQL ...... 4
2.4 RDF Storage Layout ...... 4
2.4.1 RDBMS Storage ...... 4
2.4.1.1 Triple Table ...... 5
2.4.1.2 Property Table ...... 5
2.4.1.3 Vertical Partitioning ...... 5
2.4.2 NoSQL Storage ...... 7
2.4.3 Native Storage ...... 7
2.5 Archiving ...... 8
2.5.1 Non-RDF Archives ...... 8
2.5.2 RDF Archives ...... 8
2.5.2.1 Independent Copies ...... 8
2.5.2.2 Change-Based ...... 9
2.5.2.3 Timestamp-Based ...... 9
2.5.2.4 Hybrid ...... 9
2.6 Query Types ...... 10
2.6.1 Independent Copies ...... 10
2.6.2 Change-Based ...... 11
2.6.3 Timestamp-Based ...... 11
2.7 RDF Archive Benchmarks ...... 11
2.7.1 BEAR ...... 11
2.7.2 EvoGen ...... 11
2.7.3 SPBv ...... 11


3 Use Case: Friend Network 13
3.1 Use Case ...... 13
3.2 Requirements ...... 14
3.3 Need ...... 14

4 Storage Optimization: Bidirectional Delta Chain 15
4.1 Advantages Bidirectional Delta Chain ...... 15
4.1.1 Advantages Non-Aggregated Bidirectional Delta Chain ...... 15
4.1.2 Advantages Aggregated Bidirectional Delta Chain ...... 16
4.2 Disadvantages Bidirectional Delta Chain ...... 16
4.3 Hypotheses ...... 18

5 OSTRICH Overview 19
5.1 Storage Structure ...... 19
5.1.1 Snapshot Storage ...... 19
5.1.2 Delta Chain Dictionary ...... 19
5.1.3 Delta Storage ...... 20
5.1.3.1 Local Change Flags ...... 20
5.1.3.2 Deletion Relative Position ...... 20
5.1.3.3 Multiple Indexes ...... 21
5.1.3.4 Addition Counts ...... 21
5.1.3.5 Deletion Counts ...... 21
5.1.3.6 Metadata ...... 21
5.2 Ingestion ...... 21
5.3 Queries ...... 22
5.3.1 Version Materialization Query ...... 22
5.3.1.1 Query ...... 22
5.3.1.2 Result Count ...... 23
5.3.2 Delta Materialization Query ...... 23
5.3.2.1 Query ...... 23
5.3.2.2 Result Count ...... 24
5.3.3 Version Query ...... 24
5.3.3.1 Query ...... 24
5.3.3.2 Result Count ...... 24

6 Bidirectional RDF Archive 25
6.1 Storage Structure ...... 25
6.2 Multiple Snapshots ...... 25
6.3 Ingestion ...... 25
6.3.1 Out-of-order Ingestion ...... 26
6.3.2 In-order Ingestion: Fix-Up Algorithm ...... 26
6.4 Query ...... 28
6.4.1 Version Materialized Query ...... 28
6.4.2 Delta Materialized Query ...... 29
6.4.2.1 Query ...... 29
6.4.2.2 Result Count ...... 30
6.4.3 Version Query ...... 30
6.4.3.1 Query ...... 30
6.4.3.2 Result Count ...... 32

7 Evaluation 33
7.1 COBRA Implementation ...... 33
7.2 Experimental Setup ...... 33
7.2.1 Ingestion ...... 33
7.2.2 Query ...... 34
7.3 Results ...... 34
7.3.1 Ingestion Results ...... 34
7.3.2 Query Results ...... 34

7.4 Discussion ...... 39
7.4.1 Ingestion Evaluation ...... 46
7.4.1.1 Storage Size ...... 46
7.4.1.2 Ingestion Time ...... 47
7.4.2 Query Evaluation ...... 47
7.4.3 Hypotheses Evaluation ...... 47

8 Conclusion and Future Work 49
8.1 Conclusion ...... 49
8.2 Future Work ...... 50

Bibliography 51

Appendices 54
A BEAR-A Query Results ...... 55
B BEAR-B daily Query Results ...... 61
C BEAR-B hourly Query Results ...... 62

List of Figures

2.1 An example RDF graph. ...... 4
2.2 Unidirectional delta chain, as done in TailR. ...... 10
2.3 Unidirectional delta chain where all deltas are relative to the snapshot at the beginning of the chain, as done in OSTRICH. ...... 10

3.1 Friend Network Example ...... 13

4.1 A simplified non-aggregated bidirectional delta chain. ...... 15
4.2 A simplified aggregated bidirectional delta chain. ...... 16
4.3 An example to showcase unidirectional and bidirectional non-aggregated delta chains. Triples are represented by numbers. ...... 17
4.4 An example to showcase unidirectional and bidirectional aggregated delta chains. Triples are represented by numbers. ...... 17

5.1 An overview of the storage structure used in OSTRICH...... 20

6.1 An overview of the storage structure of a bidirectional delta chain. Figure adapted from OSTRICH. ...... 26
6.2 An illustration of the fix-up algorithm. ...... 28
6.3 State of the delta chains before the fix-up algorithm is applied. ...... 28

7.1 Comparison of the cumulative storage sizes (in GB) per version for the first eight versions of the BEAR-A benchmark. ...... 35
7.2 Comparison of the cumulative ingestion times (in hours) per version for the first eight versions of the BEAR-A benchmark. ...... 35
7.3 Comparison of the individual ingestion times (in minutes) per version for the first eight versions of the BEAR-A benchmark. ...... 36
7.4 Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B daily benchmark. ...... 36
7.5 Comparison of the cumulative ingestion time (in min) per version of the BEAR-B daily benchmark. ...... 37
7.6 Comparison of the individual ingestion time (in min) per version of the BEAR-B daily benchmark. ...... 37
7.7 Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B hourly benchmark. ...... 38
7.8 Comparison of the cumulative ingestion time (in min) per version of the BEAR-B hourly benchmark. ...... 38
7.9 Comparison of the individual ingestion time (in min) per version of the BEAR-B hourly benchmark. ...... 40
7.10 Average VM query durations for all triple patterns in BEAR-A. ...... 40
7.11 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-A. ...... 41
7.12 Average DM query durations between all versions for all triple patterns in BEAR-A. ...... 41
7.13 Average VQ query durations for all triple patterns in BEAR-A. ...... 42
7.14 Average VM query durations for all provided triple patterns in BEAR-B daily. ...... 42


7.15 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B daily. ...... 43
7.16 Average DM query durations between all versions for all triple patterns in BEAR-B daily. ...... 43
7.17 Average VQ query durations for all provided triple patterns in BEAR-B daily. ...... 44
7.18 Average VM query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly. ...... 44
7.19 Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B hourly. ...... 45
7.20 Average DM query durations between all versions for all triple patterns in BEAR-B hourly. ...... 45
7.21 Average VQ query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly. ...... 46

1 Average VM query durations for SPO triple patterns in the first eight versions of BEAR-A. ...... 55
2 Average VM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A. ...... 55
3 Average VM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 55
4 Average VM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 55
5 Average VM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 56
6 Average VM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 56
7 Average VM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 56
8 Average VM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 56
9 Average VM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 56
10 Average VM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 56
11 Average VM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 57
12 Average VM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 57
13 Average DM query durations for SPO triple patterns in the first eight versions of BEAR-A. ...... 57
14 Average DM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A. ...... 57
15 Average DM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 57
16 Average DM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 57
17 Average DM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 58
18 Average DM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 58
19 Average DM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 58
20 Average DM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 58
21 Average DM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 58
22 Average DM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 58

23 Average DM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 59
24 Average DM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 59
25 Average VQ query durations for SPO triple patterns in the first eight versions of BEAR-A. ...... 59
26 Average VQ query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A. ...... 59
27 Average VQ query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 59
28 Average VQ query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A. ...... 59
29 Average VQ query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 60
30 Average VQ query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A. ...... 60
31 Average VQ query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 60
32 Average VQ query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A. ...... 60
33 Average VQ query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 60
34 Average VQ query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A. ...... 60
35 Average VQ query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 61
36 Average VQ query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A. ...... 61
37 Average VM query durations for ?P? triple patterns in BEAR-B daily. ...... 61
38 Average DM query durations for ?P? triple patterns in BEAR-B daily. ...... 61
39 Average VQ query durations for ?P? triple patterns in BEAR-B daily. ...... 61
40 Average VM query durations for ?PO triple patterns in BEAR-B daily. ...... 62
41 Average DM query durations for ?PO triple patterns in BEAR-B daily. ...... 62
42 Average VQ query durations for ?PO triple patterns in BEAR-B daily. ...... 62
43 Average VM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly. ...... 62
44 Average DM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly. ...... 62
45 Average VQ query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly. ...... 63
46 Average VM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly. ...... 63
47 Average DM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly. ...... 63
48 Average VQ query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly. ...... 63

List of Tables

2.1 An example triple table. ...... 5
2.2 An example property table. ...... 6
2.3 An example vertical partitioning table. ...... 6

5.1 Overview of which index OSTRICH uses for each triple pattern...... 21

7.1 Storage sizes and ingestion times of the different approaches for all three benchmarks. COBRA-POST FIX UP represents the in-order ingestion of the bidirectional delta chain using the fix-up algorithm. Therefore, the ingestion time is the sum of the ingestion time of COBRA-PRE FIX UP and the fix-up time. ...... 39

Acronyms

CVS Concurrent Versions System. ix, 8

CB Change-Based. ix, 8, 9, 11, 13, 15
CM Change Materialization. ix, 10, 14
COBRA Change-based Offset-enabled Bidirectional RDF Archive. ix, 17, 20
CSA Compressed Suffix Array. ix
CV Cross-Version join. ix, 10

DBMS Database Management Systems. ix, 4
DM Delta Materialization. ix, 10, 14, 18, 19

FOAF Friend Of A Friend. ix, 13

IC Independent Copies. ix, 8–11

LUBM Lehigh University Benchmark. ix, 11

RDBMS Relational Database Management Systems. ix, 4, 8, 13
RDF Resource Description Framework. vii, ix, 2–9, 11, 13–15

SCM Software Configuration Management. ix, 7
SPARQL SPARQL Protocol And RDF Query Language. ix, 2, 3, 9

TB Timestamp-Based. ix, 9, 11, 18

VCS Version Control System. ix, 7
VM Version Materialization. ix, 10, 11, 14, 15, 17, 18
VQ Version Query. ix, 10, 11, 14, 18, 19

Chapter 1

Preface

1.1 Introduction

Datasets change over time for numerous reasons, such as the addition of new information or the correction of erroneous information. Capturing this evolution allows for historical analyses which can provide new insights. For example, historical market data can be used to predict future market trends [2]. Linked Data is no exception to this. Linked Data is data that is structured in such a way that it is understandable for machines. The standard way of modeling Linked Data is RDF, which uses a triple structure. This triple structure can be used to form graphs, where the subject and object are nodes linked together via the predicate. Archiving RDF data has been an area of research for a few years [3]. The archiving strategies can be categorized into three groups [3]:

• Independent Copies (IC) - Every version is stored fully materialized.
• Change-Based (CB) - Only changes between versions are stored.
• Timestamp-Based (TB) - Triples are annotated with their temporal validity.

One particular research focus is enabling offsettable query streams for RDF archives, since query streams are more memory-efficient for large query results and the offset allows for faster queries when only a subset is needed. OSTRICH [1] is the state of the art when it comes to offset-enabled RDF archives. OSTRICH is an IC, CB and TB hybrid RDF archive that stores versions in a delta chain that starts with a fully materialized snapshot, followed by a series of deltas that all reference this snapshot; a minimal sketch of this layout follows below. However, OSTRICH has a large ingestion time for large dataset versions. The ingestion time can be reduced by introducing additional snapshots; however, this can in turn increase the storage size, since snapshots are fully materialized. In this work, we will explore if we can reduce the resulting storage size increase of the multiple-snapshot approach, while maintaining the ingestion time reduction, by restructuring the delta chain.
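As a minimal illustration of this layout, the sketch below models an OSTRICH-style delta chain with aggregated, snapshot-referencing deltas in C++ (the language the implementation in this work uses); the plain set-based types are simplified stand-ins for OSTRICH's actual compressed storage structures.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <tuple>
#include <vector>

using Triple = std::tuple<std::string, std::string, std::string>;
using TripleSet = std::set<Triple>;

// An aggregated delta: all changes of a version relative to the snapshot.
struct Delta { TripleSet additions, deletions; };

struct DeltaChain {
    TripleSet snapshot;        // version 0, fully materialized
    std::vector<Delta> deltas; // versions 1..n, all relative to the snapshot
};

// With aggregated deltas, any version is one delta application away
// from the snapshot, regardless of its position in the chain.
TripleSet materialize(const DeltaChain& chain, std::size_t version) {
    TripleSet result = chain.snapshot;
    if (version > 0) {
        const Delta& d = chain.deltas[version - 1];
        for (const Triple& t : d.deletions) result.erase(t);
        for (const Triple& t : d.additions) result.insert(t);
    }
    return result;
}
```

Because every delta references the snapshot directly, materializing any version costs a single delta application, at the price of deltas that keep growing with their distance to the snapshot; that growing storage cost is exactly what this work tries to reduce.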

1.2 Research Question

The research question is as follows: “How much can we reduce storage usage of change-based RDF archives by restructuring the delta chain?”


1.3 Outline

The remainder of this thesis is structured as follows. Chapter 2 will explain all the necessary concepts, terms and techniques needed to discuss RDF archives. Next, Chapter 3 will showcase a use case in order to highlight the need for an RDF archive for large datasets. Chapter 4 will then present our storage optimization and the corresponding hypotheses. We will apply this storage optimization to OSTRICH, so Chapter 5 will first give a detailed overview of OSTRICH. The implementation of the storage optimization will then be discussed in Chapter 6. In Chapter 7, we evaluate our implementation and compare it with OSTRICH. Finally, a conclusion is presented in Chapter 8, alongside an overview of possible future work.

Chapter 2

Background

In order to fully understand this thesis, it is important to explain certain concepts and list which technologies are available. First, the Semantic Web will be explained, followed by its key technologies RDF and SPARQL. Second, popular RDF storage techniques will be listed. Third, RDF archiving techniques will be outlined. Fourth, an overview of query types will be given. Finally, the most popular versioned RDF benchmarks will be listed.

2.1 Semantic Web

In 2001, Tim Berners-Lee, the inventor of the World Wide Web, proposed the idea of the Semantic Web [4]. The goal of the Semantic Web is to make data on the Web understandable to machines so that they can perform complex tasks. In order to make this vision a reality, Linked Data was introduced. As described by Bizer et al. [5], Linked Data (LD) refers to data published on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external data sets, and it can be linked to from external data sets. RDF and SPARQL are two key technologies of Linked Data, which will be explained in the following sections.

2.2 RDF

RDF [6] is a key technology of Linked Data for representing data. RDF uses triples to organize data. An RDF triple is interpreted as follows: the subject has a certain property (the predicate) with value object. RDF triples form statements, which can be represented by a directed graph wherein the object and subject are nodes that are linked by a predicate. The subject and predicate are represented as resource URIs, while the object can be either a resource URI, a literal or a blank node. Figure 2.1 is an example RDF graph; for the sake of brevity, no URIs are used. Figure 2.1 describes me and contains the following triples:


Figure 2.1: An example RDF graph.
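A triple can be modeled directly as a three-field record. The sketch below builds a tiny hypothetical graph; the FOAF-style IRIs are illustrative and are not the triples of Figure 2.1.

```cpp
#include <iostream>
#include <string>
#include <vector>

// An RDF triple: subject and predicate are resource IRIs; the object
// may be an IRI, a literal, or a blank node (all modeled as strings here).
struct Triple {
    std::string subject, predicate, object;
};

int main() {
    // A tiny graph: nodes linked together via predicates.
    std::vector<Triple> graph = {
        {"ex:alice", "foaf:name",  "\"Alice\""},
        {"ex:alice", "foaf:knows", "ex:bob"},
        {"ex:bob",   "foaf:name",  "\"Bob\""},
    };
    for (const Triple& t : graph)
        std::cout << t.subject << ' ' << t.predicate << ' ' << t.object << " .\n";
}
```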

2.3 SPARQL

Although other RDF query languages exist, such as RQL [7] and (i)RDQL [8], SPARQL is the W3C's recommended RDF query language [9]. SPARQL is a graph-based pattern matching query language for RDF data. The graph-based patterns we will focus on are made up of triple patterns, which like RDF triples contain a subject, predicate and object. However, in this case, the subject, predicate and object can each be either fixed or variable. The query processor will try to match these patterns with elements of the domain, by adhering to the fixed components and filling in the variable components. SPARQL has four query forms [9]:
• SELECT - Returns all, or a subset of, the variables bound in a query pattern match.
• CONSTRUCT - Returns an RDF graph constructed by substituting variables in a set of triple templates.
• ASK - Returns a boolean indicating whether a query pattern matches or not.
• DESCRIBE - Returns an RDF graph that describes the resources found.
These query forms are typically followed by a 'WHERE' clause that limits the results. The 'WHERE' clause uses pattern matching on the triples, as mentioned above. As an illustration, the following query selects all book titles [9]:

Listing 2.1: An example SPARQL query that fetches all book titles.
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?title
WHERE { ?book dc:title ?title }
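The pattern matching described above can be sketched compactly; encoding a variable component as an empty string is a choice of this sketch, not something mandated by SPARQL.

```cpp
#include <iostream>
#include <string>
#include <vector>

struct Triple { std::string s, p, o; };

// A triple pattern: an empty component acts as a variable,
// a non-empty component must match exactly.
struct Pattern { std::string s, p, o; };

bool matches(const Triple& t, const Pattern& q) {
    return (q.s.empty() || q.s == t.s) &&
           (q.p.empty() || q.p == t.p) &&
           (q.o.empty() || q.o == t.o);
}

int main() {
    std::vector<Triple> graph = {
        {"ex:book1", "dc:title",   "\"SPARQL Tutorial\""},
        {"ex:book2", "dc:creator", "ex:alice"},
    };
    Pattern allTitles{"", "dc:title", ""};  // ?book dc:title ?title
    for (const Triple& t : graph)
        if (matches(t, allTitles))
            std::cout << t.s << " has title " << t.o << '\n';
}
```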

2.4 RDF Storage Layout

This section gives an overview of general RDF storage layouts. They can be divided into two groups, namely native and non-native storage techniques [10, 11]. Native storage techniques are dedicated storage techniques that have been built from scratch, while non-native techniques make use of existing Database Management Systems (DBMSs). We further divide the non-native techniques into Relational Database Management Systems (RDBMSs) and NoSQL storage techniques.

2.4.1 RDBMS Storage

RDBMSs have been around for decades and have therefore become extremely optimized. Consequently, many techniques have been proposed to map RDF data onto an RDBMS [12–14]. Three techniques will be discussed, namely triple tables, property tables and vertical partitioning.

SUBJECT  PROPERTY      OBJECT
ID1      Name          "John Smith"
ID1      Salary        1800
ID1      Department    "Human Resources"
ID2      Name          "Jane Tully"
ID2      Salary        2000
ID2      Department    "Management"
ID2      Bonus         100
ID2      Phone Number  (251) 546-9442
ID3      Name          "Rob Phelps"
ID3      Salary        1600
ID3      Department    "Human Resources"

Table 2.1: An example triple table.

2.4.1.1 Triple Table

The triple table technique maps triples into a single table with three columns for the subject, property and object. While this technique is very flexible, it has some performance issues. Since all triples are stored in a single table, queries require a lot of expensive self-joins for certain triple patterns [14]. Furthermore, this single table can quickly become too large to fit in memory, making queries even slower. Table 2.1 is an example triple table for an employee database.

2.4.1.2 Property Table

Wilkinson et al. [12, 13] proposed property tables as a solution for the scalability problems of triple tables. Property tables try to group related RDF nodes in order to reduce query time and storage requirements. They consist of a subject column and several property columns. Triples that cannot be grouped are simply stored in a leftover triple table. Table 2.2 is an example property table for the same employee database. Wilkinson et al. [13] also discuss property-class tables, which are a special case of property tables. The idea behind property-class tables is to store nodes of the same class together; in essence, this corresponds to storing the value of 'rdf:type' in a property table. The most important advantage of property tables over triple tables is the faster query time. The speed-up is due to the fact that some self-joins can be avoided, since related nodes are stored in the same row. In addition, the storage size is generally lower than with the triple table approach. The disadvantages are that the tables can become very sparse with NULLs due to unknown values, and that property tables cannot handle multi-valued attributes efficiently. Due to these disadvantages and their general complexity, property tables have not been widely adopted except in specialized cases [14].

2.4.1.3 Vertical Partitioning

Abadi et al. [14] proposed a new storage solution called SW-Store. SW-Store utilizes a technique called vertical partitioning which, similar to a property table, groups triples. For every predicate, there is a two-column table which contains the subjects and objects of the matching triples. Unlike property tables, multi-valued attributes are handled by simply storing the subject with all possible objects. Furthermore, NULL values, i.e. unknown values, do not need to be stored. Table 2.3 is an example vertical partitioning layout for the same employee database.

SUBJECT   NAME           SALARY   DEPARTMENT          PHONE NUMBER
ID1       "John Smith"   1800     "Human Resources"   NULL
ID2       "Jane Tully"   2000     "Management"        (251) 546-9442
ID3       "Rob Phelps"   1600     "Human Resources"   NULL

(a) Property Table

SUBJECT   PROPERTY   OBJECT
ID2       Bonus      100

(b) Leftover Triple Table

Table 2.2: An example property table.

(a) Name Table
SUBJECT   OBJECT
ID1       "John Smith"
ID2       "Jane Tully"
ID3       "Rob Phelps"

(b) Salary Table
SUBJECT   OBJECT
ID1       1800
ID2       2000
ID3       1600

(c) Department Table
SUBJECT   OBJECT
ID1       "Human Resources"
ID2       "Management"
ID3       "Human Resources"

(d) Phone Number Table
SUBJECT   OBJECT
ID2       (251) 546-9442

(e) Bonus Table
SUBJECT   OBJECT
ID2       100

Table 2.3: An example vertical partitioning table.

2.4.2 NoSQL Storage

With the rising popularity of NoSQL databases, multiple RDF mappings onto NoSQL stores have been proposed. Most of these systems use popular NoSQL stores as a backbone. Three major groups of NoSQL stores can be distinguished, namely key-value stores, document stores and column stores [15].

Key-value stores are NoSQL stores that manage a dictionary. Since RDF uses triples and key-value stores only deal with pairs, indexing can be difficult. The AWETO [16] system tries to solve this indexing problem by using four index orders: S-PO, P-SO, P-OS, and O-PS.

Document stores are more complex key-value stores, as they allow encapsulating (key, value)-pairs in documents. Typically, RDF document stores rely on popular document stores such as MongoDB and CouchDB.

Column stores store and retrieve data by column instead of by row. Unique keys are used to connect related column data. In addition, data can be indexed both row-wise and column-wise. For instance, Rya [17] is an RDF store that uses Accumulo, a column store similar to Google Bigtable, as a backbone.

2.4.3 Native Storage

Native storage techniques are dedicated storage techniques that use RDF’s triple structure to their benefit.

YARS [18] is an optimized multi-index system that is designed for fast queries. The RDF triple is extended with context information that refers to the provenance of the data and is referred to as a quad. Since each quad component can be fixed or variable, there are 2^4, or 16, access patterns. These access patterns can be covered by only six indexes that utilize B+-trees. Furthermore, the string representation of each quad component is mapped to a short integer ID. These mappings are stored in a dictionary, which is used to convert from ID to string representation and vice versa. By working with IDs instead of string representations, the indexes take less storage space. Furthermore, queries are faster since IDs can be compared more efficiently with each other. YARS2 [19] extends YARS to a distributed system; in particular, distributed indexing methods and parallel query evaluation methods are presented.

Hexastore [20] uses six indexes, one for each permutation of the triple, namely SPO, SOP, OPS, PSO, OSP, and POS. Moreover, a dictionary is also used to map the RDF terms to keys. As an example, in the SPO index, a subject key is linked to a sorted vector of property keys, which all point to a list of object keys.

RDF-3X [21] stores all triples in the leaf pages of a compressed clustered B+-tree. Fifteen indexes are used in total: six for all permutations of the triple, six for the aggregated indexes and three for the one-valued indexes. By storing the triples lexicographically, SPARQL queries can be converted into range scans. In addition, the string literals are replaced with IDs using a dictionary, resulting in faster queries and a lower storage size.

BitMat [22] is a compact bit matrix structure for representing a large number of RDF triples. The data is represented as a bit cube with subject, predicate and object as dimensions, wherein each cell represents whether the triple exists or not. This binary matrix allows for efficient joins with the use of binary AND/OR operations.

TripleBit [23] is a compact RDF store that relies on a bit matrix storage structure. The RDF triples are represented as a two-dimensional bit matrix with RDF links (properties) as columns and RDF nodes (subjects, objects) as rows. Each cell consists of a boolean that indicates whether a triple is present or not, resulting in a sparse matrix that can be compressed efficiently. In addition, dictionary encoding is used to reduce the storage requirements even further. In order to speed up queries, two auxiliary indexes are used, namely an ID-Chunk matrix and an ID-Predicate bit matrix. The ID-Chunk matrix is used to quickly find the chunks matching an RDF node (subject, object). The ID-Predicate bit matrix is used to find the related predicates for a given RDF node (subject, object).

HDT [24] is a binary representation for exchanging RDF data, so compression is the main focus. HDT consists of three parts:
• header - contains metadata and serves as an entry point to the data
• dictionary - mapping between triple components and unique identifiers, referred to as dictionary encoding
• triples - structure of the underlying RDF graph after dictionary encoding
HDT resolves queries on the compressed data, but only has one index (SP-O), making certain triple patterns hard to resolve. In addition, by design HDT stores are immutable after creation, making them unsuitable for volatile datasets.

HDT-FoQ [25] is an extension of HDT [24] that focuses on resolving queries faster. For this reason, HDT-FoQ adds two additional indexes, namely PS-O and OP-S, to cover more access patterns. The PS-O index makes use of a wavelet tree, while the OP-S index uses adjacency lists, similar to the SP-O index.

Waterfowl [26] builds on HDT-FoQ [25] by using wavelet trees in the SP-O index, instead of adjacency lists. To the best of our knowledge, no data can be found on Waterfowl’s performance compared to HDT-FoQ.

2.5 Archiving

Version control, also called versioning in this document, refers to the management of changes to a collection of information, for example a dataset or codebase. Version control for codebases has been around for over four decades [27], proving how invaluable rolling back to a previous version is. Many version control techniques can be reused in archiving techniques. We will consider both non-RDF archiving techniques and RDF archiving techniques.

2.5.1 Non-RDF Archives

Many techniques from non-RDF archives and Version Control Systems (VCSs) can be repurposed for versioning RDF archives. RCS [29] is a delta-based VCS, wherein each delta consists of insertions and deletions of lines. The latest version is stored completely and older revisions are stored in so-called reverse deltas, resulting in quick access to the latest version. To add a new revision, the system stores the latest revision completely and replaces the previous revision by its delta, keeping the rest of the chain intact.

2.5.2 RDF Archives

This section gives an overview of existing RDF archive approaches. Fernández et al. [30] distinguish three groups of archive storage policies.

2.5.2.1 Independent Copies

In the IC approach the dataset is stored independently for every version. Since triples are repeated many times across different versions, IC systems typically have a higher storage requirement than other approaches. Due to its straightforward and simple nature, the IC approach is used in popular systems such as the Dynamic Linked Data Observatory [31] and DBpedia [30]. SemVersion [32] is an IC versioning system for RDF that tries to emulate classical Concurrent Versions System (CVS) systems for version management. Each version is stored separately in RDF stores that conform to a certain API, which manages said versions.

2.5.2.2 Change-Based

The CB approach tries to solve the large storage requirement of the IC approach by only storing the changes between versions. These changes, sometimes called deltas, typically consist of a set of additions and a set of deletions. However, only storing the changes introduces a version materialization cost: the cost of reconstructing a database version by applying all the changes up to that point. It is clear that this cost will increase with every new version of the dataset.

Cassidy et al. [33] propose a CB RDF archive that is built on Darcs’ theory of patches [51] - a mathematical model that describes how patches can be manipulated in order to get the desired version in the context of software. This model describes fundamental operations, such as the commute operation, the revert operation, and the merge operation. Cassidy et al. adapt these operations so that they are applicable to RDF stores as well.

Im et al. [34] introduced a CB store on top of an RDBMS. They propose an aggregated deltas approach wherein not only the delta between a parent and child, but all possible deltas are stored. This results in an increased storage overhead, but a decreased version materialization cost compared to the classic sequential delta chain.

Vander Sande et al. [35] introduce R&WBase - a distributed CB RDF archive, wherein versions are stored as consecutive deltas. Deltas between versions consist of an addition set and a deletion set, respectively listing which triples have been added and deleted. Since deltas are stored in the same graph, triples are annotated with a context number, indicating to which version the triple belongs and whether it was added or deleted. In particular, an even context number indicates the triple is an addition and an uneven context number indicates the triple is a deletion. Queries can be handled efficiently by looking at the highest context number: if that context number is even, the triple is present in that version; if it is uneven, the triple is not present in that version (a small sketch of this rule follows at the end of this subsection). Finally, R&WBase also supports tagging, branching, and merging of datasets.

R43ples [36] is another CB RDF archive, since it groups additions and deletions in named graphs. R43ples allows manipulation of revisions with SPARQL, by introducing new keywords such as REVISION, TAG and BRANCH. Versions are materialized by starting from the head of the branch and applying all prior additions/deletions.
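The following is a minimal sketch of the R&WBase context-number rule described above; it is our own illustration, not R&WBase's actual code, and it assumes that version v annotates additions with context number 2v and deletions with context number 2v + 1.

#include <algorithm>
#include <vector>

// Decide whether a triple, annotated with the given context numbers, is
// present in the requested version: take the highest relevant context
// number; even means "added", uneven means "deleted".
bool triple_present(const std::vector<int>& contexts, int version) {
    int highest = -1;
    for (int c : contexts)
        if (c / 2 <= version)                 // context belongs to a version <= v
            highest = std::max(highest, c);   // keep the highest context number
    return highest >= 0 && highest % 2 == 0;  // even: the triple is present
}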

2.5.2.3 Timestamp-Based

In the TB approach triples are annotated with creation and deletion timestamps. These annotations ensure that no triples are stored more than once.

Hauptmann et al. [37] propose a delta-based store similar to R43ples, including complete graphs and version control via SPARQL. However, in Hauptmann’s approach, each triple is virtually annotated with version information that is cached using a hash table, making it a TB approach.

x-RDF-3X [38] extends RDF-3X [21] with versioning support. Each triple is annotated with a creation timestamp and, when appropriate, a deletion timestamp, making it a TB approach.

v-RDFCSA [39] is a TB archiving extension of RDFCSA [40], a compact self-indexing RDF store that is based on suffix arrays.

Dydra [41] is an RDF archive that stores versions as named graphs in a quad store, which can be queried using the REVISION SPARQL keyword. Dydra uses B+-trees with six indexes: GSPO, GPOS, GOSP, SPOG, POSG, OSPG. The B+-tree values indicate for which revisions a particular quad is visible, making it a TB system.

Figure 2.2: Unidirectional delta chain, as done in TailR.

Figure 2.3: Unidirectional delta chain where all deltas are relative to the snapshot at the beginning of the chain, as done in OSTRICH.

2.5.2.4 Hybrid

In the hybrid approach, the three aforementioned archiving strategies are combined. TailR [42] interleaves fully materialized versions (snapshots) in between the delta chain. The snapshots reset the version materialization cost but can lead to a higher storage requirement. OSTRICH [1] is another hybrid solution that interleaves fully materialized snapshots in between the delta chain, as seen in Figure 2.3. However, unlike TailR, OSTRICH uses aggregated deltas [34], i.e. deltas that refer directly to the snapshot instead of to the previous version. Moreover, the delta chain is stored by annotating each triple with version information, making it an IC, CB and TB hybrid. OSTRICH focuses on providing memory-efficient query streams which can be offsetted. In addition, OSTRICH also provides query count estimation functionality, which can be used as a basis for query optimization in query engines [43].

2.6 Query Types

Fernández et al. [3] identified five fundamental query groups, referred to as query atoms:
• Version Materialization (VM) queries retrieve data from a single version. For example, “Which paintings are on display today?”.
• Delta Materialization (DM) queries retrieve the differences between two versions. For example, “Which paintings were added or removed between yesterday and today?”.
• Version Query (VQ) annotates query results with the version numbers wherein the data exists. For example, “When was the ’Mona Lisa’ on display?”.
• Cross-Version join (CV) joins the results of two queries over two different versions. For example, “Which paintings were on display yesterday and today?”.
• Change Materialization (CM) returns a list of versions in which a given query produces consecutively different results. For example, “When was the ’Mona Lisa’ put on display or removed from display?”.
Although other query classifications exist [44], we will only refer to the above-mentioned query atoms, for the sake of simplicity. Some storage policies are better suited for certain query atoms than others.

2.6.1 Independent Copies

Since all versions are fully materialized and indexed in the IC approach, VM queries are relatively simple. DM and CV queries are moderately complex because two queries need to be executed. VQ and CM queries are very complex because all versions need to be queried.

2.6.2 Change-Based

Due to the version materialization cost, as discussed in Section 2.5.2.2, VM queries are more complex in the CB approach than in the IC approach. On the other hand, DM queries between neighboring versions are very efficient in the CB approach since exactly those changesets are stored.

2.6.3 Timestamp-Based

In the TB approach VQ queries are particularly efficient because the triples are naturally annotated with the version numbers wherein they exist. However, other query atoms are typically slower than in the IC approach due to the extra checks whether a triple is valid for a given version.

2.7 RDF Archive Benchmarks

A benchmark is a set of tests that measure the performance of a system. More importantly, it allows us to easily compare systems. In this section, three RDF archive benchmarks are presented, namely BEAR [3, 45], EvoGen [46] and SPBv [44].

2.7.1 BEAR

BEAR [3, 45] is a benchmark for RDF archives that utilizes real data from three different domains:
• BEAR-A - 58 weekly snapshots from the Dynamic Linked Data Observatory [31].
• BEAR-B - the 100 most volatile resources from DBpedia Live [47] over the course of three months, at three different granularities: instant, hourly and daily.
• BEAR-C - 32 weekly snapshots of the Open Data Portal Watch project [48].
The data is stored under four different policies. Under the IC policy each version is stored in a separate N-Triples file, while in the CB policy, only additions and deletions of triples are stored in separate N-Triples files. Under the TB policy, a named graph annotates the triples with versions, while the CBTB policy only annotates the triples which have changed. The BEAR benchmark also provides triple pattern queries and their corresponding results. BEAR-A provides triple pattern queries and their results for the following triple patterns: S??, ?P?, ??O, SP?, ?PO, S?O and SPO. BEAR-B provides triple pattern queries and their results for ?PO and ?P? triple patterns for the hourly and daily granularities. These triple patterns are based on the most frequent triple patterns from the DBpedia query set. BEAR-C provides 10 complex queries that cannot be efficiently resolved with current archiving strategies, but that could help foster the development of new query resolution algorithms.

2.7.2 EvoGen

EvoGen [46] is a highly configurable benchmark suite that generates synthetic and evolving RDF data. EvoGen is an extension of the Lehigh University Benchmark (LUBM) synthetic dataset, adding additional classes and properties to enable schema evolution. Parameters can be used to configure: instance evolution, schema evolution, query workload generation and archiving strategy.

2.7.3 SPBv

SPBv [44] is another highly configurable benchmark that generates RDF data based on the BBC’s media organization data, which they refer to as creative works. Creative works consist of properties such as: title, shortTitle, description, dateCreated, audience and format. The data generator tries to simulate the natural evolution of these creative works, by storing the creative works in different versions according to their creation date. SPBv can also be used to generate queries. However, unlike Fernández et al. [3], Papakonstantinou et al. consider eight query types:
• Modern version materialization queries fully materialize the latest version.
• Modern single-version structured queries are performed on the latest version.
• Historical version materialization queries fully materialize a version in the past.
• Historical single-version structured queries are performed on a version in the past.
• Delta materialization queries retrieve the delta between two versions.
• Single-delta structured queries are performed on the delta of two consecutive versions.
• Cross-delta structured queries are evaluated on the changes of multiple versions.
• Cross-version structured joins join the results of queries on several versions, thereby retrieving information common to several versions.

Chapter 3

Use Case: Friend Network

This chapter describes a use case that highlights the need for an RDF archive for datasets with large versions.

3.1 Use Case

The use case is a Friend Of A Friend (FOAF) network in social media, which captures information such as who is friends with whom. As an example, consider the following raw FOAF data:

ex:Trevor foaf:knows ex:John
ex:John foaf:knows ex:Trevor
ex:John foaf:knows ex:Amy
ex:Amy foaf:knows ex:John

RDF’s triple structure is a good choice to model this data [18]. Figure 3.1 depicts the resulting RDF graph. Since people tend to acquire and lose friends over time, FOAF networks evolve over time. In order to capture this evolution, we can version the data. For example, the system could periodically take a snapshot of the network and store it as a new version. Typically, a large part of the data will remain unchanged between versions. Moreover, versions can be perfectly described with additions and deletions of friends with respect to another version, making CB an excellent storage policy. As for storage size, the friend networks could become enormous when we are dealing with social media such as Facebook, which has 2.2 billion active accounts at the time of writing [49], with an average of 255 friends per account [50]. Therefore, our solution needs to be storage efficient. Finally, VM, DM and VQ queries could be performed on these friend networks. An example VM query could be: “Who were my friends in sixth grade?”. An example DM query could be: “Which friends did I add after switching schools?”. An example VQ query could be: “How long have I been friends with someone?”. Users could interact with the system using a web-based SPARQL endpoint.

Figure 3.1: Friend Network Example

Since we are dealing with high-volume data, query results could easily become too large to be displayed on a single web page. This means that when a page is loaded, only a subset of the query results is needed, making offsettable queries very efficient.

3.2 Requirements

From our use case we identify the following requirements for our system:
• an efficient RDF archive storage technique
• an efficient offsettable VM query stream algorithm
• an efficient offsettable DM query stream algorithm
• an efficient offsettable VQ query stream algorithm
• low storage requirements

3.3 Need

As previously mentioned, the use case handles a dataset with very large versions. Therefore, we need a storage-efficient solution with a low ingestion time. On the other hand, we also need to support fast and offsettable queries. In Chapter 2, we saw that OSTRICH [1] is the state-of-the-art in terms of offsettable RDF archives. While OSTRICH is storage-efficient compared to other RDF archives, OSTRICH has a large ingestion time that increases with the size of the delta chain. Taelman et al. suggest that additional snapshots can be used to limit the ingestion time; however this, in turn, could increase the storage size since snapshots are fully materialized. Therefore, there is a need for a storage solution that maintains the ingestion time of OSTRICH with multiple snapshots while also keeping the resulting storage size increase down.

Chapter 4

Storage Optimization: Bidirectional Delta Chain

As outlined in Section 3.3, there is a need for a storage solution that maintains the ingestion time of OSTRICH with multiple snapshots while also keeping the resulting storage size increase down. In this work we propose a storage optimization for CB RDF archives that is based on restructuring the delta chain. As seen in previous works [1, 42], a delta chain consists of a fully materialized snapshot followed by a series of deltas. The main idea behind our storage optimization is moving the snapshot from the front of the delta chain to the middle, in order to potentially reduce the overall storage size. This transforms the delta chain into a bidirectional delta chain, which divides the original delta chain into two smaller delta chains, i.e. the reverse delta chain and the forward delta chain. Figures 4.1 and 4.2 show two example bidirectional delta chains. In this chapter we will discuss the advantages and disadvantages of bidirectional delta chains.

4.1 Advantages of the Bidirectional Delta Chain

The advantages of the bidirectional delta chain depend on whether it is applied to an aggregated or a non-aggregated delta chain; both cases are explained hereafter.

4.1.1 Advantages of the Non-Aggregated Bidirectional Delta Chain

In a non-aggregated delta chain, all deltas reference the closest preceding version. So in order to materialize a version, all preceding deltas need to be applied until the fully materialized snapshot is reached. It follows that the version materialization cost scales with the length of the delta chain and the size of the deltas. As stated above, a bidirectional delta chain divides the original delta chain into two smaller delta chains. Moreover, the size of the deltas remains the same, since the reverse delta chain is just the inverse of the original deltas. Therefore, the worst-case materialization cost for bidirectional delta chains is half of that for unidirectional delta chains.

Figure 4.1: A simplified non-aggregated bidirectional delta chain.

Figure 4.2: A simplified aggregated bidirectional delta chain.

Figure 4.3 gives an example of both a unidirectional and a bidirectional non-aggregated delta chain. As can be seen, the reverse delta chain is simply the inverse of the original forward delta chain, so the sizes of the deltas remain equal. Moreover, in the bidirectional delta chain, the lengths of the delta chains have been halved, thus reducing the average version materialization cost. Bidirectional non-aggregated delta chains could also potentially reduce the storage size while maintaining a similar version materialization time. Indeed, if we compare a series of two unidirectional delta chains with a single bidirectional delta chain, one fewer snapshot needs to be stored.
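As a back-of-the-envelope illustration of this halving (our own cost model, counting one unit of work per applied delta), consider N versions stored as one snapshot plus N - 1 deltas. The worst-case number of delta applications is then:

C_max(unidirectional) = N - 1        C_max(bidirectional) = ceil((N - 1) / 2)

For example, for N = 9 versions the worst case drops from 8 applied deltas to 4.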

4.1.2 Advantages of the Aggregated Bidirectional Delta Chain

In an aggregated delta chain, all deltas reference a single snapshot, which means that an aggregated delta contains all the changes from all preceding deltas. In this work, we assume that a higher distance between versions results in a bigger aggregated delta. This assumption holds for datasets that steadily grow over time by adding new triples, because later versions will have more and more new triples compared to earlier versions. It follows that reducing the average distance between the snapshot and the versions results in smaller aggregated deltas, thus reducing the overall storage size. Bidirectional delta chains reduce the average distance between the snapshot and the other versions. Therefore, bidirectional delta chains should have a lower storage size compared to unidirectional delta chains for growing datasets.

Figure 4.4 gives an example of both a unidirectional and a bidirectional aggregated delta chain. As seen in Subfigure 4.4a, newer versions gradually increase in size, due to the addition of new triples. The corresponding unidirectional aggregated delta chain is shown in Subfigure 4.4b; as can be seen, these newer triples are repeated in every subsequent delta. However, in the bidirectional aggregated delta chain, shown in Subfigure 4.4c, the deltas become smaller since the triples are repeated for fewer versions, e.g. triple 4. Another way of reducing the average distance between the snapshot and the other versions is introducing an additional snapshot, as seen in Figure 6.2a. However, bidirectional delta chains have the advantage that they only need to store a single snapshot.

4.2 Disadvantages of the Bidirectional Delta Chain

A bidirectional delta chain contains a reverse delta chain - a delta chain where the deltas precede the reference snapshot. However, building such a delta chain is difficult when we need to insert versions in-order and do not know the future snapshot. Indeed, we cannot calculate the delta between the version we need to insert and the future snapshot if the future snapshot is not yet known. A fix-up algorithm is a potential way of solving this issue. In the fix-up algorithm, all versions are first stored in a forward delta chain. Once the future snapshot is inserted, the forward delta chain can be converted into a reverse delta chain. As discussed in Subsection 2.5.1, RCS [29] presents an incremental algorithm to build the reverse delta chain without the need for a fix-up algorithm. For this algorithm, the latest version is always stored fully materialized.


(a) A fully materialized example data set.


(b) An example unidirectional non-aggregated delta chain.


(c) An example bidirectional non-aggregated delta chain.

Figure 4.3: An example to showcase unidirectional and bidirectional non-aggregated delta chains. Triples are represented by numbers.

(a) A fully materialized example data set.


(b) An example unidirectional aggregated delta chain.


(c) An example bidirectional aggregated delta chain.

Figure 4.4: An example to showcase unidirectional and bidirectional aggregated delta chains. Triples are represented by numbers.

To add a new version, the system stores the new version completely and replaces the previous version by its delta, keeping the rest of the chain intact.
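The following is a minimal sketch of this RCS-style incremental scheme, assuming a version is a plain set of triples; all names are our own illustration and do not correspond to RCS's actual implementation.

#include <algorithm>
#include <iterator>
#include <set>
#include <string>
#include <vector>

struct Delta { std::set<std::string> additions, deletions; };
using Version = std::set<std::string>;

// The delta that transforms 'from' into 'to'.
Delta diff(const Version& from, const Version& to) {
    Delta d;
    std::set_difference(to.begin(), to.end(), from.begin(), from.end(),
                        std::inserter(d.additions, d.additions.end()));
    std::set_difference(from.begin(), from.end(), to.begin(), to.end(),
                        std::inserter(d.deletions, d.deletions.end()));
    return d;
}

struct ReverseChain {
    std::vector<Delta> reverse_deltas;  // oldest version first
    Version head;                       // latest version, fully materialized

    void ingest(const Version& new_version) {
        // Replace the current head by the reverse delta that reconstructs
        // it from the new head; the rest of the chain stays intact.
        reverse_deltas.push_back(diff(new_version, head));
        head = new_version;
    }
};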

4.3 Hypotheses

In this section we propose five hypotheses regarding aggregated unidirectional delta chains and aggregated bidirectional delta chains.

The first hypothesis states: “Disk space will be significantly lower for a bidirectional delta chain compared to a unidirectional delta chain.” For the reasoning behind this hypothesis, we refer to Section 4.1.2.

The second hypothesis states: “In-order ingestion time will be lower for a unidirectional delta chain compared to a bidirectional delta chain.” This hypothesis stems from the fact that the fix-up algorithm needs to insert the versions in a temporary forward delta chain before they can be inserted in the reverse delta chain, and RCS needs to calculate a delta before a new version can be inserted.

The third hypothesis states: “The mean VM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.” The reasoning behind this hypothesis is that a VM query comes down to applying the stored aggregated delta to the snapshot, so whether a delta was stored in a reverse delta chain or a forward delta chain should not affect the VM query time.

The fourth hypothesis states: “The mean DM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.” We state this hypothesis because both the unidirectional and bidirectional delta chains store aggregated deltas.

The fifth hypothesis states: “The mean VQ query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.” We state this hypothesis because a VQ query has to iterate over every version to gather the version information of each triple, so whether a delta was stored in a reverse delta chain or a forward delta chain should not affect the VQ query time.

Chapter 5

OSTRICH Overview

We will apply the storage optimization discussed in Chapter 4 to OSTRICH [1]. Therefore, we give a detailed overview of OSTRICH in this chapter. The chapter is outlined as follows. First, we will give an overview of the storage structure. Next, we will give an overview of the ingestion process. Finally, we will give an overview of how VM, DM and VQ queries are handled.

5.1 Storage Structure

In this section, we will explain the storage structure of OSTRICH [1] in more detail. Figure 5.1 gives an overview of all the components in a delta chain, which will be explained in more detail in the following subsections.

5.1.1 Snapshot Storage

In OSTRICH, the first version is always stored as a fully materialized snapshot using HDT(-FoQ) [24, 25]. HDT is a good solution for storing snapshots since it has a low storage requirement. Moreover, HDT enables fast VM queries due to the fact that HDT stores are immutable and have multiple indexes. Furthermore, in HDT, query results can be represented as triple streams which can be offsetted. Finally, HDT also provides cardinality estimation for the query results, making it an excellent solution for snapshot storage.

5.1.2 Delta Chain Dictionary

A delta chain consists of two dictionaries that are used to encode the triple components, namely the snapshot dictionary and the delta dictionary. The snapshot dictionary is the dictionary used in HDT and stores all mappings for triple components present in the snapshot. The snapshot dictionary is static, so it can be sorted and compressed efficiently. The delta dictionary stores the triple components of newly added triples that were not present in the snapshot. The delta dictionary is volatile, since a new version can introduce new mappings. When a triple component needs to be encoded, the snapshot dictionary is probed first, followed by the delta dictionary in case there was no match in the snapshot dictionary. If neither dictionary contains a mapping, a new entry is created. When a triple component needs to be decoded, a reserved bit is checked that indicates whether the mapping is stored in the snapshot dictionary or the delta dictionary.
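A minimal sketch of this two-dictionary lookup is given below; the names and the exact reserved-bit layout are our own assumptions, not OSTRICH's actual code.

#include <cstdint>
#include <string>
#include <unordered_map>

// Most significant bit marks mappings that live in the delta dictionary.
constexpr uint64_t DELTA_BIT = 1ULL << 63;

struct DeltaChainDictionary {
    std::unordered_map<std::string, uint64_t> snapshot_dict;  // static
    std::unordered_map<std::string, uint64_t> delta_dict;     // volatile
    std::unordered_map<uint64_t, std::string> snapshot_rev, delta_rev;

    uint64_t encode(const std::string& component) {
        auto s = snapshot_dict.find(component);
        if (s != snapshot_dict.end()) return s->second;  // probe snapshot dict first
        auto d = delta_dict.find(component);
        if (d != delta_dict.end()) return d->second;     // then the delta dict
        uint64_t id = DELTA_BIT | delta_dict.size();     // neither: new delta entry
        delta_dict.emplace(component, id);
        delta_rev.emplace(id, component);
        return id;
    }

    const std::string& decode(uint64_t id) const {
        // The reserved bit tells which dictionary holds the mapping.
        return (id & DELTA_BIT) ? delta_rev.at(id) : snapshot_rev.at(id);
    }
};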


Figure 5.1: An overview of the storage structure used in OSTRICH [1].

5.1.3 Delta Storage

OSTRICH stores subsequent versions in an aggregated delta chain. However, aggregated deltas often contain duplicate changes across the deltas, since they contain all previous deltas; therefore a TB-like approach is used to compress the deltas. Unlike in the regular TB approach, where triples are annotated with timestamps, OSTRICH annotates the triples with the versions wherein the triple exists, meaning triples are stored only once. A delta chain consists of a set of triple additions and a set of triple deletions, which are stored separately due to the requirements of certain query algorithms. Both additions and deletions are stored and indexed in B+-trees, with the encoded triple acting as key and the corresponding version information as value. The version information consists of the triple timestamp information, local change flags and, in the case of deletions, also the relative position of the triple inside the delta chain. The latter two will be explained in the following sections.
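Conceptually, the value stored per triple key could look like the following sketch; the field names are our own illustration, and OSTRICH's actual layout differs in its details.

#include <cstdint>
#include <vector>

struct VersionInfo {
    std::vector<uint32_t> versions;       // versions in which this change holds
    std::vector<bool>     local_change;   // per version: local change flag
    // Deletions only: position of this triple among all sorted deletions,
    // used for offsetting results and for deriving deletion counts.
    std::vector<uint32_t> relative_positions;
};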

5.1.3.1 Local Change Flags

The local change flags indicate whether a triple is a local change. A local change refers to a series of triple instances in the delta chain that negate each other, for example a triple that is deleted in version 1 and added again in version 2. Since it is difficult to determine whether a change is local, OSTRICH stores this information as a local change flag for each version of a triple during ingestion. Local change flags improve VM query evaluation, since local changes can be filtered out.

5.1.3.2 Deletion Relative Position

The relative position of a deletion is the position the deletion triple would have if all deletion triples in the delta chain were sorted. The benefits of storing the relative position for deletions are twofold. First, it allows query algorithms to efficiently offset query results. Second, as we will describe in Section 5.1.3.5, it allows us to efficiently find the deletion count for any triple pattern.

Triple pattern   Index
S P O            SPO
S P ?            SPO
S ? O            OSP
S ? ?            SPO
? P O            POS
? P ?            POS
? ? O            OSP
? ? ?            SPO

Table 5.1: Overview of which index OSTRICH uses for each triple pattern.

5.1.3.3 Multiple Indexes

OSTRICH [1] stores each triple in three different component orders, namely SPO, POS and OSP. As seen in Table 5.1, this is sufficient to resolve any triple pattern. Since OSTRICH stores additions and deletions separately, there are a total of six B+ trees per delta chain.
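The mapping of Table 5.1 can be expressed as a simple selection function; the sketch below is our own illustration, where the boolean flags indicate which components of the triple pattern are bound.

enum class Index { SPO, POS, OSP };

// Pick the index whose prefix matches the bound components (Table 5.1).
Index choose_index(bool s, bool p, bool o) {
    if (s && !p && o)   return Index::OSP;  // S ? O
    if (!s && p)        return Index::POS;  // ? P O and ? P ?
    if (!s && !p && o)  return Index::OSP;  // ? ? O
    return Index::SPO;                      // S P O, S P ?, S ? ?, ? ? ?
}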

5.1.3.4 Addition Counts

In order to handle queries more efficiently, OSTRICH [1] stores a mapping from triple pattern and version to the number of matching additions. These addition counts are calculated and stored during ingestion. However, since the number of mappings can grow very quickly, OSTRICH only stores addition counts that exceed a certain threshold. If an addition count is below the threshold, and thus not stored in the mapping, it is calculated on the fly.
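A sketch of this thresholded lookup follows; the names and the threshold value are our own illustration, not OSTRICH's actual code.

#include <cstdint>
#include <map>
#include <utility>

struct AdditionCounts {
    // (pattern id, version) -> count, persisted during ingestion.
    std::map<std::pair<uint64_t, uint32_t>, uint64_t> stored;
    uint64_t threshold = 200;  // arbitrary example value

    uint64_t count(uint64_t pattern, uint32_t version) const {
        auto it = stored.find({pattern, version});
        if (it != stored.end()) return it->second;   // large count: stored
        return count_by_scanning(pattern, version);  // small count: on the fly
    }

    // Placeholder for the on-the-fly scan of the addition trees.
    uint64_t count_by_scanning(uint64_t, uint32_t) const { return 0; }
};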

5.1.3.5 Deletion Counts

As discussed in Section 5.1.3.2, every deletion is annotated with its relative position in the delta chain. To determine the deletion count for a given triple pattern, OSTRICH starts by performing a backward search in the deletion trees in order to find the largest matching triple. Once OSTRICH has a match, it looks up the relative position of that triple. Since it is the largest triple for the given triple pattern, it is also the last deletion in the sorted list, so its relative position corresponds to the deletion count for the given triple pattern.
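A sketch of this lookup under a hypothetical deletion-tree interface (our own names, not OSTRICH's actual code):

#include <cstdint>
#include <optional>

struct Deletion { uint64_t relative_position; };

struct DeletionTree {
    // Backward search: the largest deletion matching the pattern in the
    // given version, if any. Placeholder body for the sketch.
    std::optional<Deletion> largest_match(uint64_t pattern, uint32_t version) const {
        return std::nullopt;
    }
};

uint64_t deletion_count(const DeletionTree& tree, uint64_t pattern, uint32_t version) {
    auto last = tree.largest_match(pattern, version);
    // The relative position of the last matching deletion equals the number
    // of matching deletions (assuming 1-based positions).
    return last ? last->relative_position : 0;
}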

5.1.3.6 Metadata

In order to get a view of all versions stored in the delta chain, OSTRICH stores a list of stored versions as metadata. This also enables the system to easily find the version count.

5.2 Ingestion

Ingestion refers to inserting new versions into the storage. OSTRICH [1] focuses on ingesting unidirectional non-aggregated changesets. In other words, the ingestion transforms the input as seen in Figure 2.2 into the storage structure seen in Figure 2.3. In short, the algorithm performs a sort-merge join over the addition stream, the deletion stream and the input stream, which are all sorted in SPO order. The algorithm iterates over all three streams until they are finished. In each iteration, the smallest triple among the three streams is processed. There are seven cases (a skeleton of this merge loop is sketched below):

1. deletion < input and deletion < addition
2. addition < input and addition < deletion
3. input < addition and input < deletion
4. input == deletion and input < addition
5. input == addition and input < deletion
6. addition == deletion and addition < input
7. addition == input == deletion

In the first case, where the deletion is the smallest triple, the deletion information is copied to the new version; the relative positions are updated in this case and all other cases. In the second case, where the addition is the smallest triple, the addition information is copied to the new version. In the third case the input is the smallest triple, which means the triple was not added or deleted in previous versions. In this case, OSTRICH adds the triple as either a non-local change addition or a non-local change deletion. In the fourth case, the input triple already exists as a deletion. If the input triple is an addition, it is added as a local change. Similarly, in the fifth case, the triple already exists as an addition. If the input triple is a deletion, it is added as a local change. In the sixth case the triple already existed as both an addition and a deletion. In this case the triple is carried over to the next version. In the seventh case the input triple already existed as both an addition and a deletion. In this case, if the input triple is an addition it becomes a deletion, and vice versa, and the local change flag is carried over.
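The following skeleton (our own illustration, not OSTRICH's actual code) shows the structure of this three-way merge; it only advances the streams according to the seven cases and leaves out the actual bookkeeping of version information, local change flags and relative positions.

#include <cstdint>
#include <queue>

using Triple = uint64_t;                  // dictionary-encoded triple key
constexpr Triple SENTINEL = UINT64_MAX;   // marks an exhausted stream

struct Stream {
    std::queue<Triple> triples;           // sorted in SPO order
    Triple head() const { return triples.empty() ? SENTINEL : triples.front(); }
    void advance() { if (!triples.empty()) triples.pop(); }
};

void ingest(Stream& addition, Stream& deletion, Stream& input) {
    while (addition.head() != SENTINEL || deletion.head() != SENTINEL ||
           input.head() != SENTINEL) {
        Triple a = addition.head(), d = deletion.head(), i = input.head();
        if (d < i && d < a) {             // case 1: copy deletion info
            deletion.advance();
        } else if (a < i && a < d) {      // case 2: copy addition info
            addition.advance();
        } else if (i < a && i < d) {      // case 3: brand-new triple
            input.advance();
        } else if (i == d && i < a) {     // case 4: addition becomes local change
            input.advance(); deletion.advance();
        } else if (i == a && i < d) {     // case 5: deletion becomes local change
            input.advance(); addition.advance();
        } else if (a == d && a < i) {     // case 6: carry the triple over
            addition.advance(); deletion.advance();
        } else {                          // case 7: flip addition/deletion
            input.advance(); addition.advance(); deletion.advance();
        }
    }
}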

5.3 Queries

OSTRICH [1] supports three query atoms, namely VM, DM and VQ.

5.3.1 Version Materialization Query

As described by Fernández et al. [3], VM queries retrieve data from a single version. In the following sections, we will describe how VM queries are resolved in OSTRICH [1], as well as how the cardinality of the result stream can be estimated.

5.3.1.1 Query

Algorithm 5.1 shows how VM queries are resolved. First, the corresponding snapshot is retrieved. Next, the snapshot is queried for the given triple pattern and the offset is applied. If the requested version is the snapshot itself, the algorithm returns HDT’s snapshot iterator. The algorithm continues by initializing the addition and deletion streams to the start position for the given triple pattern and version. The addition and deletion streams both filter out local changes, since they do not affect the final result, as explained in Section 5.1.3.1. The algorithm always returns snapshot triples first before returning additions. Therefore, determining the offset for the snapshot, addition and deletion streams can be split into two cases: either the offset lies within the range of the snapshot count minus the deletion count, or it lies within the range of the addition triples. In the first case, where the offset is within the range of the snapshot count minus the deletion count, the algorithm starts a loop that converges to the actual snapshot offset. The loop starts by looking at the triple at the current snapshot offset. The algorithm then offsets the deletion stream with that snapshot triple. This triple offset is done by navigating the deletion tree to the smallest triple before or equal to the offset triple. The offset within the deletion stream is then stored. The loop continues until the sum of the original offset and the deletion offset equals the current snapshot offset. In the second case, where the offset lies within the addition range, the algorithm terminates the snapshot iterator. The algorithm then applies an offset to the addition iterator. This offset is the original offset minus the snapshot count, incremented with the number of deletions.

Finally, the algorithm returns an offsettable iterator that combines the snapshot iterator, the deletion iterator and the addition iterator. This iterator performs a sort-merge join operation to delete triples from the snapshot iterator that also appear in the deletion iterator. Once the snapshot and deletions have been resolved, the iterator will emit all additions.

queryVm(store, tp, version, originalOffset) {
  snapshot = store.getSnapshot(version).query(tp, originalOffset)
  if (snapshot.getVersion() == version) {
    return snapshot
  }
  additions = store.getAdditionsStream(tp, version)
  deletions = store.getDeletionStream(tp, version)
  offset = 0
  if (originalOffset < snapshot.count(tp) - deletions.exactCount(tp)) {
    do {
      snapshot.offset(originalOffset + offset)
      offsetTriple = snapshot.peek()
      deletions = store.getDeletionsStream(tp, version, offsetTriple)
      offset = deletions.getOffset(tp)
    } while (snapshot.getCurrentOffset() != originalOffset + offset)
  } else {
    snapshot.offset(snapshot.count(tp))
    additions.offset(originalOffset - snapshot.count(tp) + deletions.exactCount(tp))
  }
  return PatchedSnapshotIterator(snapshot, deletions, additions)
}

Algorithm 5.1: Version Materialization Algorithm from OSTRICH [1]

5.3.1.2 Result Count

The result count for VM queries is the number of snapshot triples for a given triple pattern, plus the addition count, minus the deletion count. As explained in Subsection 5.1.3.4, large addition counts are calculated and stored during ingestion, so the result count can be calculated efficiently.
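Written as a formula (a restatement of the rule above in our own notation), for a triple pattern tp and version v:

count_VM(tp, v) = count_snapshot(tp) + count_additions(tp, v) - count_deletions(tp, v)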

5.3.2 Delta Materialization Query

As described by Fernández et al. [3], DM queries retrieve the differences between two versions and annotate whether each difference is an addition or a deletion. OSTRICH supports DM queries for a single snapshot and forward delta chain. We can thus discern two cases: a DM query between a snapshot and a delta, and a DM query between two deltas in the same delta chain. In the following sections, we will describe how OSTRICH resolves DM queries, as well as how the cardinality of the result stream is estimated.

5.3.2.1 Query

The first case, a DM query between a snapshot and a delta, is trivial since OSTRICH stores aggregated deltas, so all deltas are relative to the snapshot. The second case is a DM query between two deltas in the same delta chain. In this case, the algorithm iterates over the triples inside the addition tree and deletion tree for the given triple pattern, in a sort-merge join fashion. Triples are only emitted if they have a different addition/deletion flag for the two versions.

5.3.2.2 Result Count

In the first case, a DM query between a snapshot and a delta, the result count is exact: it is the number of additions and deletions for the given triple pattern. The second case is a DM query between two deltas in the same delta chain. In this case, OSTRICH estimates the result count by summing up the additions and deletions for the given triple pattern in both versions. This can overestimate the actual count if triples are changed such that they negate each other inside the version range.

5.3.3 Version Query

As described by Fernández et al. [3], VQ annotates triples with the version numbers in which they exist. OSTRICH [1] supports VQ queries for a single snapshot and forward delta chain. In the following sections, we will describe how version queries are resolved in OSTRICH, as well as how the cardinality of the result stream can be estimated.

5.3.3.1 Query

The algorithm starts by iterating over all the triples in the snapshot for the given triple pattern. Next, the deletion tree is probed for each triple. If the triple is not present in the deletion tree, the triple is present in all versions. If the triple is present in the deletion tree, the corresponding versions are erased from the annotations. After all snapshot triples have been processed, the algorithm iterates over the addition triples stored in the addition tree. As was the case with snapshot triples, the deletion tree is probed again for each triple. If the triple is present in the deletion tree, the versions are erased from the annotations. If the triple is not present, the triple is present in all versions ranging from the version that introduced the triple to the last version. Result streams can be partially offsetted by offsetting the snapshot iterator.

5.3.3.2 Result Count

The result count can be calculated by retrieving the count for the requested triple pattern in the snapshot and adding the addition count for the requested triple pattern.

Chapter 6

Bidirectional RDF Archive

As mentioned in Chapter 5, we will apply the potential storage optimization that was presented in Chapter 4 to OSTRICH [1]. The chapter is outlined as follows. First, we will give an overview of the storage structure. Second, we will explain how we expanded OSTRICH to support multiple snapshots. Third, we will give an overview of the ingestion process. Finally, we will give an overview of how VM, DM and VQ queries are handled.

6.1 Storage Structure

As seen in Figure 5.1 and Figure 6.1, the storage structure is similar to OSTRICH’s storage structure, which was explained in Section 5.1. Indeed, the storage structure of the bidirectional delta chain is just a snapshot with two delta chains: the reverse delta chain and the forward delta chain.

6.2 Multiple Snapshots

In Section 4.2, we presented two in-order ingestion algorithms for bidirectional delta chains which utilize multiple snapshots. However, OSTRICH only supports one snapshot, so we need to expand OSTRICH to support multiple snapshots. Supporting multiple snapshots comes down to finding the corresponding snapshot for a given version. We can discern three cases: the store consists of only forward delta chains, the store consists of only bidirectional delta chains, or the store consists of a combination of bidirectional and forward delta chains.

In the first case, assuming versions are in ascending order, the corresponding snapshot of a version is the greatest lower bound of all the snapshots, i.e. the largest snapshot that is still smaller than the version. In the second case, if we assume the delta chains are of equal length and all versions are in ascending order, the corresponding snapshot for a version is the snapshot that is closest to the version. In the third case, we calculate the greatest lower bound and the least upper bound of all the snapshots for the given version. If the upper bound snapshot does not have a reverse delta chain, our version is stored in a forward delta chain and the corresponding snapshot is the lower bound snapshot. If the upper bound snapshot does have a reverse delta chain, the corresponding snapshot is the snapshot closest to the version.
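The selection rule for the third, general case could be sketched as follows; the names are our own illustration, not the actual implementation.

#include <algorithm>
#include <functional>
#include <vector>

// 'snapshots' holds the snapshot versions in ascending order;
// has_reverse_chain tells whether a snapshot owns a reverse delta chain.
int corresponding_snapshot(const std::vector<int>& snapshots,
                           const std::function<bool(int)>& has_reverse_chain,
                           int version) {
    // First snapshot strictly greater than the version (least upper bound).
    auto ub = std::upper_bound(snapshots.begin(), snapshots.end(), version);
    // Largest snapshot <= version (greatest lower bound); the first version
    // is always a snapshot, so this is assumed to exist.
    int lower = *(ub - 1);

    if (ub == snapshots.end() || !has_reverse_chain(*ub))
        return lower;  // the version lives in a forward delta chain
    int upper = *ub;
    // The upper snapshot has a reverse chain: pick the closest snapshot.
    return (version - lower <= upper - version) ? lower : upper;
}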

6.3 Ingestion

Ingestion refers to inserting new versions into the storage. In this work, we focus on ingesting unidirectional non-aggregated changesets for the delta chains and fully materialized versions for the snapshot.


Figure 6.1: An overview of the storage structure of a bidirectional delta chain. Figure adapted from OSTRICH [1].

In other words, the ingestion transforms the input as seen in Figure 2.2 into the storage structure seen in Figure 6.2b. We will first explain how we ingest versions in a forward delta chain. Next, we briefly explain how we can insert versions out-of-order in a reverse delta chain. Finally, we explain how we can insert versions in-order in a reverse delta chain.

6.3.1 Out-of-order Ingestion

Out-of-order ingestion refers to ingesting versions in non-chronological order. Out-of-order ingestion is difficult in a realistic setting since it requires the system to somehow buffer the input changesets. However, we will see that if we ignore this impracticality, we can easily ingest versions in the reverse delta chain. Ingesting versions in a reverse delta chain is similar to ingesting in a forward delta chain; we simply need to transform the input changesets. Firstly, since the forward ingestion algorithm expects the input changeset to reference the snapshot, we reverse the input changeset by swapping the additions and deletions, so that the reversed changeset references the snapshot. For example, Figure 4.3b shows an example input changeset and Figure 4.3c shows the reversed input changeset. Secondly, since the forward ingestion algorithm expects the version closest to the snapshot to be inserted first, we insert the versions in reverse order.
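A minimal sketch of this transformation, assuming changesets are simple addition/deletion sets (the names are our own illustration):

#include <algorithm>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Changeset { std::set<std::string> additions, deletions; };

std::vector<Changeset> reverse_changesets(std::vector<Changeset> chain) {
    for (Changeset& cs : chain)
        std::swap(cs.additions, cs.deletions);  // invert each delta
    std::reverse(chain.begin(), chain.end());   // closest-to-snapshot first
    return chain;
}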

6.3.2 In-order Ingestion: Fix-Up Algorithm

As mentioned in Section 6.3.1, out-of-order ingestion requires the system to buffer the input changeset, which is not always practical. Therefore we also propose an in-order ingestion algorithm. As discussed in Section 4.2, a potential way of inserting versions in-order in a reverse delta chain is the fix-up algorithm. In this section, we will expand upon this idea further.

The in-order ingestion process starts by inserting the first half of the delta chain in a temporary forward delta chain, as discussed in Section 5.2. Once the system decides a new delta chain needs to be initiated, for example because the delta chain size exceeds a certain threshold, the system stores the next version once in the temporary forward delta chain and stores it again as the snapshot for the new permanent delta chain. The reason behind storing the version twice is to simplify the input extraction, which will be explained below. Figure 6.3 shows the resulting delta chains. Once the system has some idle time, the fix-up process can be performed. It is important to note that the fix-up process can be performed at any time, since the temporary forward delta chain is fully functional. In summary, the fix-up process extracts the original input changeset from the temporary delta chain. The temporary delta chain can then be deleted and a new permanent reverse delta chain can be constructed out-of-order with the extracted input changeset. Subfigure 6.2b shows the final result.

The input changeset is extracted from the temporary delta chain using Algorithm 6.1 to extract the additions and Algorithm 6.2 to extract the deletions. Algorithm 6.1 starts by iterating over every addition in the main addition tree of the delta chain. As discussed in Subsection 5.1.3, additions are annotated with version information. Since this version information represents an aggregated delta chain, we need to transform it in order to get the non-aggregated input changeset. The algorithm does this by iterating over the version information. If the previous version is present in the version information, the triple was already added in a previous version and therefore was not present in the input addition changeset of the current version. If the previous version is not present, the triple was first added in the current version and should be present in the input changeset. We write the triples to file in order to limit memory usage; however, deserializing the triples later brings additional overhead. Algorithm 6.2 extracts the input deletion changeset and works similarly to Algorithm 6.1. Once the input changeset is recovered, the temporary delta chain is deleted. Finally, the extracted changeset is inserted out-of-order into a reverse delta chain, as discussed in Subsection 6.3.1.

extract_additions(store) {
  main_addition_tree = store->spo_additions_index();
  addition_iterator = main_addition_tree->get_cursor();

  while (current_addition = addition_iterator->next()) {
    versions = current_addition->get_versions();
    for (i = 0; i < versions.size(); i++) {
      // The triple belongs to the input addition changeset of versions[i]
      // only if it was not already present in the previous version.
      if (i == 0 || versions[i-1] != versions[i] - 1) {
        // Append the triple to the file "<versions[i]>/additions.nt".
        versions[i]/additions.nt << current_addition->get_triple();
      }
    }
  }
}

Algorithm 6.1: Algorithm to extract addition n-triple files from a delta chain.


(a) State of the delta chains before the fix-up algorithm is applied.


(b) State of the delta chains after the fix-up algorithm is applied.

Figure 6.2: An illustration of the fix-up algorithm.



Figure 6.3: State of the delta chains before the fix-up algorithm is applied.

extract_deletions(store) {
  main_deletion_tree = store->spo_deletions_index();
  deletion_iterator = main_deletion_tree->get_cursor();

  while (current_deletion = deletion_iterator->next()) {
    versions = current_deletion->get_versions();
    for (i = 0; i < versions.size(); i++) {
      // The triple belongs to the input deletion changeset of versions[i]
      // only if it was not already deleted in the previous version.
      if (i == 0 || versions[i-1] != versions[i] - 1) {
        // Append the triple to the file "<versions[i]>/deletions.nt".
        versions[i]/deletions.nt << current_deletion->get_triple();
      }
    }
  }
}

Algorithm 6.2: Algorithm to extract deletion n-triple files from a delta chain.

6.4 Query

In this section we will explain how VM, DM and VQ are resolved in bidirectional delta chains.

6.4.1 Version Materialization Query

As described by Fernández et al. [3], VM queries retrieve data from a single version. VM queries are handled exactly the same as in OSTRICH [1], as described in Subsection 5.3.1. Indeed, even when the version is stored in the reverse delta chain, the algorithm is the same, since inverse deltas were ingested.

6.4.2 Delta Materialization Query

As described by Fernández et al. [3], DM queries retrieve the differences between two versions and annotate whether each difference is an addition or a deletion. In this work, we will focus on DM queries for a single snapshot and its corresponding reverse and forward delta chains. We can discern three cases: a DM query between a snapshot and a delta, a DM query between two deltas in the same delta chain (intra-delta DM query) and a DM query between two deltas in different delta chains (inter-delta DM query). The first and second case are handled exactly the same as in OSTRICH [1], see Subsection 5.3.2. The third case is a DM query between two versions where one version is stored in the reverse delta chain and the other version in the forward delta chain. In summary, we resolve this case by splitting up the delta into two sequential deltas that are relative to the snapshot and then merging these sequential deltas together. In other words, using Darcs’ patch notation [51], with o the start version, e the end version and s the snapshot:

oDe = oD1s · sD2e

where oD1s denotes the delta from o to the snapshot s and sD2e the delta from the snapshot s to e. This strategy is quite efficient, since the deltas relative to the snapshot are stored directly. Furthermore, since the snapshot deltas are sorted, they can be merged in a sort-merge fashion.

6.4.2.1 Query

The algorithm starts by calculating the deltas relative to the snapshot, which corresponds to the first case of the DM query algorithm that was explained in Subsection 5.3.2. We refer to the delta iterator between the version in the reverse delta chain and the snapshot as the reverse delta iterator. Similarly, we refer to the delta iterator between the snapshot and the version in the forward delta chain as the forward delta iterator. The algorithm continues by iterating over the two delta iterators in a sort-merge join fashion, as seen in Algorithm 6.3. If the triples at the heads of the streams are equal, we do not emit a triple, since we have an addition and a deletion that cancel each other out. If the triple at the head of the reverse delta iterator is smaller, meaning the triple is not present in the forward delta iterator, that triple and its change flag are emitted. Similarly, if the triple at the head of the forward delta iterator is the smaller one, that triple and its change flag are emitted.

next_delta_triple() {
  while (forward_it->has_next() || reverse_it->has_next()) {
    if (reverse_it->peek_head() == forward_it->peek_head()) {
      // Addition and deletion cancel each other out: skip both.
      reverse_it->next();
      forward_it->next();
    } else if (reverse_it->peek_head() < forward_it->peek_head()) {
      return reverse_it->next();
    } else {
      return forward_it->next();
    }
  }
}

Algorithm 6.3: Sort-merge join algorithm for merging two delta iterators.
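The following is a minimal, self-contained C++ sketch of the merge in Algorithm 6.3, under the simplifying assumption that the two snapshot-relative deltas are materialized as sorted vectors of (triple, is-addition) pairs; the names and types are illustrative and do not correspond to COBRA's actual classes.

#include <string>
#include <utility>
#include <vector>

// (triple, is_addition): true for an addition, false for a deletion
using DeltaEntry = std::pair<std::string, bool>;

// Merges the snapshot-relative deltas o->s and s->e into the delta o->e.
// Both inputs must be sorted by triple; equal triples carry opposite
// change flags and therefore cancel out.
std::vector<DeltaEntry> merge_deltas(const std::vector<DeltaEntry>& reverse_delta,
                                     const std::vector<DeltaEntry>& forward_delta) {
    std::vector<DeltaEntry> result;
    std::size_t r = 0, f = 0;
    while (r < reverse_delta.size() && f < forward_delta.size()) {
        if (reverse_delta[r].first == forward_delta[f].first) {
            ++r; ++f; // addition and deletion cancel: emit nothing
        } else if (reverse_delta[r].first < forward_delta[f].first) {
            result.push_back(reverse_delta[r++]);
        } else {
            result.push_back(forward_delta[f++]);
        }
    }
    // one stream may be exhausted before the other: emit the remainder
    result.insert(result.end(), reverse_delta.begin() + r, reverse_delta.end());
    result.insert(result.end(), forward_delta.begin() + f, forward_delta.end());
    return result;
}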

6.4.2.2 Result Count

It is difficult to give an exact result count for inter-delta DM queries. An estimation of the result count can be calculated by summing the counts of both deltas relative to the snapshot. This can overestimate the actual count, however, since triples that are present in both deltas are counted twice but emitted at most once.
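A sketch of this estimation, with hypothetical parameter names for the per-delta counts:

// Upper bound for an inter-delta DM result count. Triples present in both
// deltas are counted twice, so the estimate can exceed the actual count.
size_t estimate_inter_delta_dm_count(size_t reverse_delta_count,
                                     size_t forward_delta_count) {
    return reverse_delta_count + forward_delta_count;
}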

6.4.3 Version Query

As described by Fernández et al. [3], VQ queries annotate triples with the version numbers in which they exist. In this work, we will only present an algorithm for a single snapshot and its corresponding reverse and forward delta chains. The algorithm is based on the VQ algorithm of OSTRICH, which was explained in Subsection 5.3.3. In the following sections, we will describe how version queries are resolved, as well as how the cardinality of the result stream can be estimated.

6.4.3.1 Query

As seen in Algorithm 6.4, the algorithm starts by iterating over all the triples in the snapshot for the given triple pattern. Next, the deletion trees are probed for each triple. If the triple is not present in a deletion tree, the triple is present in all versions. If the triple is present in a deletion tree, the corresponding versions are erased from the version annotation, as seen in Algorithm 6.6. After all the snapshot triples have been processed, the algorithm iterates over the addition triples stored in the addition trees in a sort-merge join fashion, as seen in Algorithm 6.5. As was the case with the snapshot triples, the deletion trees are probed for each triple. If the triple is not present in a deletion tree, the triple is present in all versions ranging from the version that introduced the triple to the last version. If the triple is present in a deletion tree, those versions are erased from the annotation. Result streams can be partially offset by offsetting the snapshot iterator of HDT [24].

next_VQ_triple() {
    // first, emit all snapshot triples with their version annotations
    if (snapshot_it->has_next()) {
        result_triple = snapshot_it->next();
        result_versions = erase_deleted_versions(result_triple, null, null);
        return TripleVersion(result_triple, result_versions);
    }
    // then, iterate over the additions of both delta chains in a
    // sort-merge join fashion and erase the deleted versions
    if (has_next_addition()) {
        (reverse_addition, forward_addition) = next_addition();
        result_triple = reverse_addition ? reverse_addition : forward_addition;
        result_versions = erase_deleted_versions(null, reverse_addition, forward_addition);
        return TripleVersion(result_triple, result_versions);
    }
    return false;
}

Algorithm 6.4: Version Query Algorithm

next_addition() {
    if (reverse_it->has_next() || forward_it->has_next()) {
        if (reverse_it->peek_head() == forward_it->peek_head()) {
            // the triple is added in both delta chains
            return (reverse_it->next(), forward_it->next());
        } else if (reverse_it->peek_head() < forward_it->peek_head()) {
            return (reverse_it->next(), null);
        } else {
            return (null, forward_it->next());
        }
    } else {
        return (null, null);
    }
}

Algorithm 6.5: Algorithm to merge forward and reverse addition iterators.

erase_deleted_versions(snapshot_triple, reverse_addition, forward_addition) {
    // initialise the version annotation optimistically
    smallest_version = reverse_patch_tree->get_min();
    largest_version = forward_patch_tree->get_max();
    if (snapshot_triple) {
        triple = snapshot_triple;
        // vector from smallest_version to largest_version with step size 1
        versions = [smallest_version : largest_version];
    } else if (reverse_addition && forward_addition) {
        triple = reverse_addition.get_triple();
        versions_reverse = [smallest_version : reverse_addition.get_largest_version()];
        versions_forward = [forward_addition.get_smallest_version() : largest_version];
        versions = versions_reverse + versions_forward;
    } else if (reverse_addition) {
        triple = reverse_addition.get_triple();
        versions = [smallest_version : reverse_addition.get_largest_version()];
    } else if (forward_addition) {
        triple = forward_addition.get_triple();
        versions = [forward_addition.get_smallest_version() : largest_version];
    }

    // erase the versions deleted in the reverse delta chain
    deletion_versions = reverse_patch_tree->get_deletion(triple);
    versions.erase(deletion_versions);

    // erase the versions deleted in the forward delta chain
    deletion_versions = forward_patch_tree->get_deletion(triple);
    versions.erase(deletion_versions);

    return versions;
}

Algorithm 6.6: Algorithm to calculate version annotations in VQ queries.
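For instance, assume a bidirectional delta chain over versions 0 to 7 with the snapshot in the middle (hypothetical numbers). A snapshot triple is optimistically annotated with versions [0 : 7]; if the reverse deletion tree records it as deleted in version 1 and the forward deletion tree records it as deleted in versions 6 and 7, the final annotation becomes {0, 2, 3, 4, 5}.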

6.4.3.2 Result Count

The result count can be estimated by retrieving the count for the requested triple pattern in the snapshot and adding the addition counts for the requested triple pattern from the reverse and forward delta chains. This can overestimate the actual result count if a triple is added in both delta chains, since the triple will only appear once in the result stream.

Chapter 7

Evaluation

In this chapter, we will evaluate COBRA and compare it with OSTRICH [1]. We start by describing the software implementation of the bidirectional RDF archive described in Chapter 6. Next, we outline our experimental setup and report the results from our experiments. Finally, we discuss and interpret these results.

7.1 COBRA Implementation

COBRA (Change-Based Offset-Enabled Bidirectional RDF Archive) refers to the C++ software implementation of the storage described in Chapter 6 and can be found at https://github.ugent.be/tpmahieu/COBRA. COBRA uses the same technologies as OSTRICH [1]. COBRA uses HDT(-FoQ) [24, 25] for storing the snapshots. Moreover, the extended dictionary is compressed with gzip. For our B+ tree indexes, we use the B+ tree implementation from Kyoto Cabinet (http://fallabs.com/kyotocabinet/), which is memory-mapped and can be easily compressed. From Kyoto Cabinet, we also use the Hash Database implementation, which is also memory-mapped, for storing the addition counts.
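As an illustration of this storage stack, the following minimal C++ sketch opens a compressed, memory-mapped Kyoto Cabinet B+ tree and stores a single record, using Kyoto Cabinet's documented API; the file name and tuning values are arbitrary examples rather than COBRA's actual configuration.

#include <iostream>
#include <kctreedb.h>

int main() {
    kyotocabinet::TreeDB db;
    // enable transparent compression of the B+ tree pages (before open)
    db.tune_options(kyotocabinet::TreeDB::TCOMPRESS);
    // map part of the database file into memory; 64 MB chosen arbitrarily
    db.tune_map(64LL << 20);
    if (!db.open("example.kct",
                 kyotocabinet::TreeDB::OWRITER | kyotocabinet::TreeDB::OCREATE)) {
        std::cerr << "open error: " << db.error().name() << std::endl;
        return 1;
    }
    db.set("key", "value"); // records are kept in key order in the B+ tree
    std::string value;
    if (db.get("key", &value)) std::cout << value << std::endl;
    db.close();
    return 0;
}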

7.2 Experimental Setup

In this work, we will evaluate the ingestion capabilities and the query resolution capabilities of COBRA. For this, we will use the BEAR [45] benchmark, which can be found at https://aic.ai.wu.ac.at/qadlod/bear.html. In particular, we will use the BEAR-A, BEAR-B daily and BEAR-B hourly benchmarks. All experiments were performed on a 64-bit Ubuntu 14.04 machine with a 6-core 2.40 GHz CPU and 48 GB of RAM.

7.2.1 Ingestion

The ingestion process will be evaluated on storage size and ingestion time. For BEAR-A, we will only ingest the first eight versions due to memory constraints. Similarly, for BEAR-B hourly, we will only ingest the first 400 versions. For BEAR-B daily, we will ingest all 89 versions. We will do the ingestion evaluation for multiple storage layouts and ingestion orders, namely:

• OSTRICH-1F: OSTRICH with one forward delta chain, as seen in Figure 2.3.
• OSTRICH-2F: OSTRICH with two forward delta chains, as seen in Figure 6.2a.
• COBRA-PRE FIX UP: COBRA's pre fix-up state, as seen in Figure 6.3.
• COBRA-POST FIX UP: COBRA's bidirectional delta chain post fix-up, as seen in Figure 6.2b.
• COBRA-OUT OF ORDER: COBRA's bidirectional delta chain, as seen in Figure 6.2b, but ingested out-of-order (snapshot, then reverse delta chain, then forward delta chain).

7.2.2 Query

The BEAR benchmark also provides query sets that are grouped by triple pattern. BEAR-A provides seven query sets containing around 100 triple patterns each, which are further divided into high result cardinality and low result cardinality. BEAR-B only provides two query sets, which contain ?P? and ?PO queries. These queries will be evaluated as VM queries for all versions, DM queries between all versions and a VQ query. In order to minimize outliers, we replicate the queries five times and take the mean results. Furthermore, we also perform a warm-up period before the first query of each triple pattern, as sketched below. Since neither OSTRICH nor COBRA supports multiple snapshots for all query atoms, we limit our experiments to OSTRICH's unidirectional storage layout and COBRA's bidirectional storage layout.
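The sketch below illustrates this measurement methodology in C++; run_query stands in for an arbitrary VM, DM or VQ invocation and is hypothetical.

#include <chrono>
#include <functional>

// Returns the mean duration (in ms) of `replications` timed runs of a query,
// preceded by one unmeasured warm-up run.
double mean_query_duration_ms(const std::function<void()>& run_query,
                              int replications = 5) {
    run_query(); // warm-up run, not measured
    double total_ms = 0.0;
    for (int i = 0; i < replications; ++i) {
        auto start = std::chrono::steady_clock::now();
        run_query();
        auto end = std::chrono::steady_clock::now();
        total_ms += std::chrono::duration<double, std::milli>(end - start).count();
    }
    return total_ms / replications;
}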

7.3 Results

In this section, we present the results from our experiments. The raw results can be found at https://github.ugent.be/tpmahieu/COBRA. A discussion of these results will be presented in the next section.

7.3.1 Ingestion Results

Table 7.1a displays the storage sizes and ingestion times of the different delta chain configurations for the first eight versions of the BEAR-A benchmark. Figure 7.1 and Figure 7.2 show the cumulative storage size and cumulative ingestion time per version, while Figure 7.3 shows the individual ingestion time per version. For BEAR-A, OSTRICH-1F has the highest ingestion time and requires more storage than COBRA-OUT OF ORDER. OSTRICH-2F ingests the fastest but also requires more storage than COBRA-OUT OF ORDER. COBRA-PRE FIX UP requires the most storage space and ingests faster than OSTRICH-1F but slower than OSTRICH-2F. Table 7.1b shows the storage sizes and ingestion times of the different approaches for BEAR-B daily. Figure 7.4 and Figure 7.5 display the cumulative storage size and cumulative ingestion time, while Figure 7.6 shows the individual ingestion time per version. For BEAR-B daily, OSTRICH-1F has the lowest storage size but also the highest ingestion time. OSTRICH-2F has the lowest ingestion time. COBRA-PRE FIX UP has a similar ingestion time and storage size as OSTRICH-2F. COBRA-OUT OF ORDER has the highest storage size. Table 7.1c shows the storage sizes and ingestion times of the different approaches for the first 400 versions of BEAR-B hourly. Figure 7.7 and Figure 7.8 display the cumulative storage size and cumulative ingestion time for every version, while Figure 7.9 shows the individual ingestion time per version. For BEAR-B hourly, OSTRICH-2F has the lowest storage size and the lowest ingestion time. COBRA-PRE FIX UP has a similar ingestion time and storage size as OSTRICH-2F. COBRA-OUT OF ORDER has a higher ingestion time and storage size compared to OSTRICH-2F. OSTRICH-1F has the highest storage size and ingestion time.

7.3.2 Query Results

Figures 7.10, 7.12 and 7.13 show the mean VM, DM and VQ query durations for all triple patterns provided by the BEAR-A benchmark. In order to let outlier query durations influence the results, we opted for the mean over the median. Appendix A lists the average query durations per triple pattern.

Figure 7.1: Comparison of the cumulative storage sizes (in GB) per version for the first eight versions of the BEAR-A benchmark.

Figure 7.2: Comparison of the cumulative ingestion times (in hours) per version for the first eight versions of the BEAR-A benchmark.

Figure 7.3: Comparison of the individual ingestion times (in minutes) per version for the first eight versions of the BEAR-A benchmark.

Figure 7.4: Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B daily benchmark.

Figure 7.5: Comparison of the cumulative ingestion time (in min) per version of the BEAR-B daily benchmark.

Figure 7.6: Comparison of the individual ingestion time (in min) per version of the BEAR-B daily benchmark.

Figure 7.7: Comparison of the cumulative storage sizes (in MB) per version of the BEAR-B hourly benchmark.

Figure 7.8: Comparison of the cumulative ingestion time (in min) per version of the BEAR-B hourly benchmark.

Approach              Storage Size (GB)   Ingestion Time (hours)
OSTRICH-1F            3.92                23.66
OSTRICH-2F            3.83                11.45
COBRA-PRE FIX UP      4.31                12.92
COBRA-POST FIX UP     3.36                12.92 + 8.38
COBRA-OUT OF ORDER    3.40                14.63

(a) Storage sizes and ingestion times of the different approaches for BEAR-A.

Storage Layout        Storage Size (MB)   Ingestion Time (min)
OSTRICH-1F            19.37               6.53
OSTRICH-2F            25.90               3.18
COBRA-PRE FIX UP      26.01               3.28
COBRA-POST FIX UP     29.15               3.28 + 2.48
COBRA-OUT OF ORDER    28.44               4.24

(b) Storage sizes and ingestion times of the different approaches for BEAR-B daily.

Storage Layout        Storage Size (MB)   Ingestion Time (min)
OSTRICH-1F            61.02               34.47
OSTRICH-2F            46.40               14.85
COBRA-PRE FIX UP      46.42               14.87
COBRA-POST FIX UP     55.42               14.87 + 11.41
COBRA-OUT OF ORDER    53.26               18.30

(c) Storage sizes and ingestion times of the different approaches for BEAR-B hourly.

Table 7.1: Storage sizes and ingestion times of the different approaches for all three benchmarks. COBRA-POST FIX UP represents the in-order ingestion of the bidirectional delta chain using the fix-up algorithm. Therefore, the ingestion time is the sum of the ingestion time of COBRA-PRE FIX UP and the fix-up time.

For BEAR-A, COBRA resolves VM queries faster than OSTRICH. Similarly, COBRA resolves DM queries slightly faster than OSTRICH. Finally, VQ queries are resolved slightly faster in OSTRICH compared to COBRA. Figures 7.14, 7.16 and 7.17 show the average VM, DM and VQ query durations for all triple patterns provided by the BEAR-B benchmark for all versions of the BEAR-B daily dataset. Appendix B lists the average query durations per triple pattern. For BEAR-B daily, COBRA resolves VM queries and DM queries faster than OSTRICH. VQ query durations are similar for OSTRICH and COBRA. Figures 7.18, 7.20 and 7.21 show the average VM, DM and VQ query durations for all triple patterns provided by the BEAR-B benchmark for the first 400 versions of the BEAR-B hourly dataset. Appendix C lists the average query durations per triple pattern. For BEAR-B hourly, COBRA resolves VM and DM queries faster than OSTRICH. VQ query durations are similar for OSTRICH and COBRA.

7.4 Discussion

In this section, we discuss the results presented in Section 7.3. First, we will discuss the ingestion results. Second, we will discuss the query results. Finally, we evaluate the hypotheses that were presented in Section 4.3.

Figure 7.9: Comparison of the individual ingestion time (in min) per version of the BEAR-B hourly benchmark.

Figure 7.10: Average VM query durations for all triple patterns in BEAR-A.

Figure 7.11: Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-A.

Figure 7.12: Average DM query durations between all versions for all triple patterns in BEAR-A.

Figure 7.13: Average VQ query durations for all triple patterns in BEAR-A.

Figure 7.14: Average VM query durations for all provided triple patterns in BEAR-B daily.

Figure 7.15: Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B daily.

Figure 7.16: Average DM query durations between all versions for all triple patterns in BEAR-B daily.

Figure 7.17: Average VQ query durations for all provided triple patterns in BEAR-B daily.

Figure 7.18: Average VM query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly.

Figure 7.19: Average DM query durations between version 3 and all other versions for all triple patterns in BEAR-B hourly.

Figure 7.20: Average DM query durations between all versions for all triple patterns in BEAR-B hourly.

Figure 7.21: Average VQ query durations for all provided triple patterns in the first 400 versions of BEAR-B hourly.

7.4.1 Ingestion Evaluation

7.4.1.1 Storage Size

There is no approach that has the lowest storage size for all the benchmarks. Indeed, COBRA has the lowest storage size for BEAR-A, OSTRICH-1F has the lowest storage size for BEAR-B daily and OSTRICH-2F has the lowest storage size for BEAR-B hourly. As mentioned above, COBRA has the lowest storage size for BEAR-A. In Figure 7.1, we can identify two causes for this storage size reduction. First, we see that the reverse delta chain, version 0 up to 4, has a lower storage size than the forward delta chains of OSTRICH-1F, OSTRICH-2F and COBRA-PRE FIX UP. Second, we see that for version 4, OSTRICH-2F and COBRA-PRE FIX UP have a storage increase due to the additional snapshot, which does not need to be stored for OSTRICH-1F and COBRA-OUT OF ORDER. For BEAR-B daily, OSTRICH-1F has the lowest storage size, due to a large storage increase for the other approaches, which can be seen in Figure 7.4. This large storage increase is the result of the additional delta chain of OSTRICH-2F, COBRA-PRE FIX UP and COBRA-OUT OF ORDER, which is initialized in version 4. We can also see that this storage increase is smaller for COBRA-OUT OF ORDER compared to COBRA-PRE FIX UP and OSTRICH-2F, due to the additional snapshot of the latter two. However, in this case, COBRA-OUT OF ORDER does not have a smaller storage size than OSTRICH-2F and COBRA-PRE FIX UP, due to the storage size of the reverse delta chain. For BEAR-B hourly, OSTRICH-2F has the lowest storage size. In Figure 7.7, we can again see the large storage increase for OSTRICH-2F, COBRA-PRE FIX UP and COBRA-OUT OF ORDER, which is again smaller for COBRA-OUT OF ORDER. However, for BEAR-B hourly, OSTRICH-1F has the largest storage size, which implies that additional snapshots can reduce the total storage size.

7.4.1.2 Ingestion Time

In Figure 7.3 and Figure 7.6, we observe that for all storage configurations, the ingestion time increases with the number of versions until a new delta chain is initiated. The reason for this behaviour is that the ingestion algorithm needs to iterate over the addition and deletion trees, so the ingestion time increases with the size of the addition and deletion trees. Therefore, OSTRICH-1F ingests slower than the other approaches. In Figures 7.2, 7.5 and 7.8, we see that OSTRICH-2F has the lowest ingestion time for all the benchmarks. COBRA-PRE FIX UP has similar ingestion times to OSTRICH-2F for BEAR-B daily and BEAR-B hourly, but ingests much slower for BEAR-A. As explained in Subsection 6.3.2, COBRA-PRE FIX UP stores the middle version twice, in order to speed up and simplify the fix-up step. For datasets with many small versions, such as BEAR-B, the resulting storage size and ingestion time increase is negligible. However, for datasets with only a few large versions, such as BEAR-A, the storage size and ingestion time increase is non-negligible. In Tables 7.1a, 7.1b and 7.1c, we see that the fix-up time is quite large, but the in-order ingestion duration for COBRA is still lower than the ingestion duration for OSTRICH-1F, for all evaluated benchmarks.

7.4.2 Query Evaluation

VM queries are resolved faster by COBRA compared to OSTRICH, even though COBRA and OSTRICH have the same VM algorithm. We attribute this discrepancy to the smaller delta chains of COBRA. DM queries are resolved faster by COBRA compared to OSTRICH for all three benchmarks. In Figures 7.11, 7.15 and 7.19, we can see that intra-delta DM queries are resolved faster in COBRA compared to OSTRICH. Indeed, as discussed in Subsection 5.3.2, intra-delta DM queries rely on iterating over the deletion and addition trees, so smaller addition and deletion trees result in faster intra-delta DM queries. Therefore, since COBRA has halved OSTRICH's delta chain, COBRA has also halved the intra-delta DM query time. In Figures 7.11, 7.15 and 7.19, we also see that COBRA handles inter-delta DM queries as efficiently as OSTRICH handles intra-delta DM queries. VQ query durations are roughly equal for both OSTRICH and COBRA, with OSTRICH being faster for BEAR-A but slower for BEAR-B daily and BEAR-B hourly. As discussed in Subsection 5.3.3 and Subsection 6.4.3, COBRA's VQ algorithm is similar to OSTRICH's VQ algorithm. COBRA only has a limited extra overhead compared to OSTRICH, namely the merging of the reverse and forward addition trees, and a potential additional look-up in the reverse deletion tree, which, as evidenced by Figures 7.13, 7.17 and 7.21, only adds a limited cost.

7.4.3 Hypotheses Evaluation

In Section 4.3, we proposed five hypotheses, which we will evaluate in this section based on the experimental results of our software implementation. The first hypothesis states that: "Disk space will be significantly lower for a bidirectional delta chain compared to a unidirectional delta chain.". In Subsection 7.4.1, we discussed that COBRA has a lower storage size than OSTRICH for BEAR-A and BEAR-B hourly, but that COBRA has a higher storage size for BEAR-B daily. Therefore, we reject this hypothesis. The second hypothesis states that: "In-order ingestion time will be lower for a unidirectional delta chain compared to a bidirectional delta chain.". In Subsection 7.4.1, we discussed that OSTRICH-1F ingests the slowest for all benchmarks. Therefore, we reject this hypothesis. The third hypothesis states that: "The mean VM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.". As discussed in Subsection 7.4.2, the average VM query duration is lower for COBRA compared to OSTRICH, due to the smaller delta chains of COBRA. Therefore, we reject this hypothesis. The fourth hypothesis states that: "The mean DM query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.". As discussed in Subsection 7.4.2, the average DM query is resolved faster in COBRA compared to OSTRICH, due to the lower intra-delta DM query duration. Therefore, we reject this hypothesis. The fifth hypothesis states that: "The mean VQ query duration will be equal for both a unidirectional delta chain and a bidirectional delta chain.". As discussed in Subsection 7.4.2, the average VQ query duration is roughly equal for both OSTRICH and COBRA. Therefore, we accept this hypothesis.

Chapter 8

Conclusion and Future Work

In this chapter, we will give a conclusion for this thesis and list potential future work.

8.1 Conclusion

In this work, we presented bidirectional delta chains as a potential storage optimization for CB RDF archives. We applied this storage optimization to an existing RDF archive named OSTRICH [1]. For this purpose, we modified OSTRICH so that multiple snapshots could be supported. Next, we presented an in-order ingestion algorithm using a fix-up strategy. Moreover, we presented a novel DM query algorithm for inter-delta versions. Finally, we altered the existing VQ query algorithm so that bidirectional chains are supported. In our ingestion results presented in Section 7.4.1, we confirmed that multiple snapshots are a viable method of reducing the ingestion time for OSTRICH, as Taelman et al. [1] suggested. Moreover, we discovered that for all three evaluated benchmarks, COBRA has a lower ingestion time than OSTRICH, even with the additional ingestion cost due to the fix-up time. Finally, we saw that COBRA does not reduce the storage size for all three evaluated benchmarks, but only for the first eight versions of BEAR-A. From the results of BEAR-B hourly, we conclude that OSTRICH with one snapshot performs poorly for datasets with many versions. So, in this case, we recommend breaking up the delta chain by introducing an additional snapshot or using a bidirectional delta chain. However, from the results of BEAR-B daily, we conclude that for smaller datasets the delta chain should be sufficiently large before a new delta chain is initiated, so that the initial cost of a new delta chain does not dominate the storage size. Finally, we saw that for all benchmarks COBRA reduced the storage increase from the second delta chain, because COBRA does not need to store the second snapshot. However, this did not always result in an overall storage size reduction, due to the size of COBRA's reverse delta chain. Therefore, we recommend transforming two unidirectional delta chains into a bidirectional delta chain only if the first delta chain is more similar to the second snapshot, so that the resulting reverse delta chain will be smaller than the current forward delta chain. In our query results presented in Section 7.4.2, we saw that COBRA reduces the VM query durations. We attributed this to the smaller delta chains of COBRA, so we expect similar results for OSTRICH with two snapshots. Similarly, for the DM queries, we observed that COBRA has a reduced inter-delta DM cost compared to OSTRICH, which we also attributed to the smaller delta chains, so similar gains can be expected for OSTRICH with two snapshots. Finally, we saw that VQ query durations are roughly equal for both OSTRICH and COBRA. In conclusion, bidirectional delta chains are not the all-round storage optimization technique we set out to find at the start of this work; however, they are a viable tool for reducing the overall storage size in certain cases, in particular for merging two unidirectional delta chains when the first delta chain is more similar to the second snapshot.


8.2 Future Work

In this work, there are a number of potentially interesting research opportunities left for future work. First, as mentioned in the previous section, there needs to be a reliable way of predicting whether a delta chain is more similar to the preceding snapshot or the future snapshot, before two unidirectional delta chains can be transformed into a bidirectional delta chain. In Chapter 4, we also presented an alternative algorithm for ingesting versions in-order in a bidirectional delta chain, which could be implemented and evaluated. Moreover, in Section 7.4, we discussed that storing an additional version for COBRA's pre fix-up state has a non-negligible overhead for large versions, so future work could research how to extract the input changeset from the snapshot so that the additional version would not need to be stored twice. Finally, additional research is needed to expand the current DM and VQ algorithms to multiple snapshots and to allow for more efficient offsets.

Bibliography

[1] R. Taelman, R. Verborgh, and E. Mannens, “Exposing RDF archives using triple pattern fragments,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2017. [2] C. Schoreels, B. Logan, and J. M. Garibaldi, “Agent based genetic algorithm employing financial technical analysis for making trading decisions using historical equity market data,” in Intelligent Agent Technology, 2004 (IAT 2004). Proceedings. IEEE/WIC/ACM International Conference on, pp. 421–424, IEEE, 2004. [3] J. D. Fernández, J. Umbrich, A. Polleres, and M. Knuth, “Evaluating query and storage strategies for RDF archives,” in Proceedings of the 12th International Conference on Semantic Systems, SEMANTiCS 2016, (New York, NY, USA), pp. 41–48, ACM, 2016. [4] T. Berners-Lee, J. Hendler, and O. Lassila, “The Semantic Web,” Scientific American, vol. 284, no. 5, pp. 34–43, 2001. [5] C. Bizer, T. Heath, and T. Berners-Lee, “Linked data - the story so far,” International Journal on Semantic Web and Information Systems, vol. 5, no. 3, pp. 1–22, 2009. [6] F. Manola, E. Miller, B. McBride, et al., “RDF primer,” W3C Recommendation, vol. 10, no. 1-107, p. 6, 2004. [7] G. Karvounarakis, A. Magganaraki, S. Alexaki, V. Christophides, D. Plexousakis, M. Scholl, and T. Tolle, “Querying the Semantic Web with RQL,” Computer Networks, 2003. [8] A. Bernstein and C. Kiefer, “Imprecise RDQL: towards generic retrieval in ontologies using similarity joins,” in Proceedings of the 2006 ACM Symposium on Applied Computing, SAC ’06, (New York, NY, USA), pp. 1684–1689, ACM, 2006. [9] W3C SPARQL Working Group et al., “SPARQL 1.1 overview, W3C recommendation 21 March 2013,” 2012. [10] D. C. Faye, O. Curé, and G. Blin, “A survey of RDF storage approaches,” Revue Africaine de la Recherche en Informatique et Mathématiques Appliquées, vol. 15, pp. 11–35, 2012. [11] O. Curé and G. Blin, RDF database systems: triples storage and SPARQL query processing. Morgan Kaufmann, 2014. [12] K. J. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds, “Efficient RDF storage and retrieval in Jena2,” in SWDB, 2003. [13] K. Wilkinson, “Jena Property Table Implementation,” in SSWS, (Athens, Georgia, USA), pp. 35–46, 2006. [14] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach, “SW-Store: a vertically partitioned DBMS for semantic web data management,” The VLDB Journal, vol. 18, pp. 385–406, Apr. 2009. [15] O. Curé and G. Blin, “An update strategy for the WaterFowl RDF data store,” in International Semantic Web Conference (Posters and Demos) (M. Horridge, M. Rospocher, and J. van Ossenbruggen, eds.), vol. 1272 of CEUR Workshop Proceedings, pp. 377–380, CEUR-WS.org, 2014.


[16] X. Pu, J. Wang, Z. Song, P. Luo, and M. Wang, “Efficient incremental update and querying in AWETO RDF storage system,” Data and Knowledge Engineering, vol. 89, pp. 55–75, 2014. [17] R. Punnoose, A. Crainiceanu, and D. Rapp, “Rya: a scalable RDF triple store for the clouds,” in Proceedings of the 1st International Workshop on Cloud Intelligence, Cloud-I ’12, (New York, NY, USA), pp. 4:1–4:8, ACM, 2012. [18] A. Harth and S. Decker, “Optimized index structures for querying RDF from the web,” in Proceedings of the Third Latin American Web Congress, LA-WEB ’05, (Washington, DC, USA), pp. 71–, IEEE Computer Society, 2005. [19] A. Harth, J. Umbrich, A. Hogan, and S. Decker, “YARS2: A federated repository for querying graph structured data from the Web,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2007. [20] C. Weiss, P. Karras, and A. Bernstein, “Hexastore: sextuple indexing for semantic web data management,” Proc. VLDB Endow., vol. 1, pp. 1008–1019, Aug. 2008. [21] T. Neumann and G. Weikum, “RDF-3X: a RISC-style engine for RDF,” Proc. VLDB Endow., vol. 1, pp. 647–659, Aug. 2008. [22] M. Atre, J. Srinivasan, and J. A. Hendler, “BitMat: a main-memory bit matrix of RDF triples for conjunctive triple pattern queries,” in International Semantic Web Conference, 2008. [23] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, and L. Liu, “TripleBit: a fast and compact system for large scale RDF data,” Proc. VLDB Endow., vol. 6, pp. 517–528, May 2013. [24] J. D. Fernández, M. A. Martínez-Prieto, C. Gutiérrez, A. Polleres, and M. Arias, “Binary RDF representation for publication and exchange (HDT),” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 19, pp. 22–41, 2013. [25] M. A. Martínez-Prieto, M. Arias Gallego, and J. D. Fernández, “Exchange and consumption of huge RDF data,” in The Semantic Web: Research and Applications (E. Simperl, P. Cimiano, A. Polleres, O. Corcho, and V. Presutti, eds.), (Berlin, Heidelberg), pp. 437–452, Springer Berlin Heidelberg, 2012. [26] O. Curé, G. Blin, D. Revuz, and D. C. Faye, “WaterFowl: a compact, self-indexed and inference-enabled immutable RDF store,” in The Semantic Web: Trends and Challenges (V. Presutti, C. d’Amato, F. Gandon, M. d’Aquin, S. Staab, and A. Tordai, eds.), (Cham), pp. 302–316, Springer International Publishing, 2014. [27] M. J. Rochkind, “The source code control system,” IEEE Transactions on Software Engineering, vol. SE-1, pp. 364–370, Dec. 1975. [28] C. Schneider, A. Zündorf, and J. Niere, “CoObRA - a small step for development tools to collaborative environments,” in Workshop on Directions in Software Engineering Environments, 2004. [29] W. F. Tichy, “RCS — a system for version control,” Software: Practice and Experience, vol. 15, no. 7, pp. 637–654, 1985. [30] J. D. Fernández, A. Polleres, and J. Umbrich, “Towards efficient archiving of dynamic linked open data,” in CEUR Workshop Proceedings, 2015. [31] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan, “Exploring the dynamics of linked data,” in The Semantic Web: ESWC 2013 Satellite Events (P. Cimiano, M. Fernández, V. Lopez, S. Schlobach, and J. Völker, eds.), (Berlin, Heidelberg), pp. 302–303, Springer Berlin Heidelberg, 2013. [32] M. Völkel and T. Groza, “SemVersion: An RDF-based Ontology Versioning System,” in Proceedings of IADIS International Conference on WWW/Internet (IADIS 2006) (M. B. Nunes, ed.), (Murcia, Spain), pp. 195–202, October 2006.

[33] S. Cassidy and J. Ballantine, “Version control for RDF triple stores,” in ICSOFT 2007 - 2nd International Conference on Software and Data Technologies, Proceedings, pp. 5–12, 2007. [34] D.-H. Im, S.-W. Lee, and H.-J. Kim, “A version management framework for RDF triple stores,” International Journal of Software Engineering and Knowledge Engineering, vol. 22, no. 01, pp. 85–106, 2012. [35] M. Vander Sande, P. Colpaert, R. Verborgh, S. Coppens, E. Mannens, and R. Van de Walle, “R&Wbase: git for triples,” in Proceedings of the 6th Workshop on Linked Data on the Web (C. Bizer, T. Heath, T. Berners-Lee, M. Hausenblas, and S. Auer, eds.), vol. 996 of CEUR Workshop Proceedings, May 2013. [36] M. Graube, S. Hensel, and L. Urbas, “R43ples: revisions for triples - an approach for version control in the semantic web,” in CEUR Workshop Proceedings, 2014. [37] C. Hauptmann, M. Brocco, and W. Wörndl, “Scalable semantic version control for linked data management,” in LDQ@ESWC, 2015. [38] T. Neumann and G. Weikum, “x-RDF-3X: fast querying, high update rates, and consistency for RDF databases,” Proc. VLDB Endow., vol. 3, pp. 256–263, Sept. 2010. [39] A. Cerdeira-Pena, A. Fariña, J. D. Fernández, and M. A. Martínez-Prieto, “Self-indexing RDF archives,” in 2016 Data Compression Conference (DCC), pp. 526–535, March 2016. [40] N. R. Brisaboa, A. Cerdeira-Pena, A. Fariña, and G. Navarro, “A compact RDF store using suffix arrays,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2015. [41] J. Anderson and A. Bendiken, “Transaction-time queries in Dydra,” in Joint proceedings of the 3rd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2017) and the 4th Workshop on Linked Data Quality (LDQ 2017) co-located with 14th European Semantic Web Conference (ESWC 2017), 2016. [42] P. Meinhardt, M. Knuth, and H. Sack, “TailR: a platform for preserving history on the web of data,” in Proceedings of the 11th International Conference on Semantic Systems, pp. 57–64, ACM, 2015. [43] R. Verborgh, M. Vander Sande, O. Hartig, J. Van Herwegen, L. De Vocht, B. De Meester, G. Haesendonck, and P. Colpaert, “Triple pattern fragments: a low-cost knowledge graph interface for the web,” Web Semantics: Science, Services and Agents on the World Wide Web, vol. 37, pp. 184–206, 2016. [44] V. Papakonstantinou, G. Flouris, I. Fundulaki, K. Stefanidis, and Y. Roussakis, “SPBv: benchmarking linked data archiving systems,” in Joint Proceedings of BLINK2017: 2nd International Workshop on Benchmarking Linked Data and NLIWoD3: Natural Language Interfaces for the Web of Data co-located with 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 21st to 22nd, 2017, 2017. [45] J. D. F. Garcia, J. Umbrich, and A. Polleres, “BEAR: benchmarking the efficiency of RDF archiving,” Working Papers on Information Systems, Information Business and Operations 02/2015, Department für Informationsverarbeitung und Prozessmanagement, WU Vienna University of Economics and Business, Vienna, 2015. [46] M. Meimaris and G. Papastefanatos, “The EvoGen benchmark suite for evolving RDF data,” in CEUR Workshop Proceedings, 2016. [47] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann, “DBpedia and the live extraction of structured data from Wikipedia,” Program, vol. 46, no. 2, pp. 157–181, 2012. [48] J. Umbrich, S. Neumaier, and A. Polleres, “Quality assessment and evolution of open data portals,” in Future Internet of Things and Cloud (FiCloud), 2015 3rd International Conference on, pp. 404–411, IEEE, 2015.

[49] “Number of Facebook users worldwide 2008-2018 — Statistic.” https://www.statista.com/statistics/264810/number-of-monthly-active-facebook-users-worldwide/. Accessed: 2018-05-24. [50] R. I. M. Dunbar, “Do online social media cut through the constraints that limit the size of offline social networks?,” Royal Society Open Science, 2016. [51] D. Roundy, “Darcs: distributed version management in Haskell,” in Proceedings of the 2005 ACM SIGPLAN Workshop on Haskell, Haskell ’05, (New York, NY, USA), pp. 1–4, ACM, 2005.

Appendices

A BEAR-A Query Results

In this section we list the average query duration for all the triple patterns of the BEAR-A benchmark.

Figure 3: Average VM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 1: Average VM query durations for SPO triple patterns in the first eight versions of BEAR-A.

Figure 2: Average VM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.

Figure 4: Average VM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.


Figure 5: Average VM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 8: Average VM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 6: Average VM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 9: Average VM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 7: Average VM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 10: Average VM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 11: Average VM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 14: Average DM query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.

Figure 12: Average VM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 15: Average DM query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 13: Average DM query durations for SPO triple patterns in the first eight versions of BEAR-A.

Figure 16: Average DM query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 17: Average DM query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 20: Average DM query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 18: Average DM query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 21: Average DM query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 19: Average DM query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 22: Average DM query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 23: Average DM query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 26: Average VQ query durations for low cardinality S?O triple patterns in the first eight versions of BEAR-A.

Figure 24: Average DM query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 27: Average VQ query durations for low cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 25: Average VQ query durations for SPO triple patterns in the first eight versions of BEAR-A.

Figure 28: Average VQ query durations for high cardinality SP? triple patterns in the first eight versions of BEAR-A.

Figure 29: Average VQ query durations for low cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 32: Average VQ query durations for high cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 30: Average VQ query durations for high cardinality ?PO triple patterns in the first eight versions of BEAR-A.

Figure 33: Average VQ query durations for low cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 31: Average VQ query durations for low cardinality ??O triple patterns in the first eight versions of BEAR-A.

Figure 34: Average VQ query durations for high cardinality ?P? triple patterns in the first eight versions of BEAR-A.

Figure 35: Average VQ query durations for low cardinality S?? triple patterns in the first eight versions of BEAR-A.

Figure 37: Average VM query durations for ?P? triple patterns in BEAR-B daily.

Figure 38: Average DM query durations for ?P? triple patterns in BEAR-B daily.

Figure 36: Average VQ query durations for high cardinality S?? triple patterns in the first eight versions of BEAR-A.

B BEAR-B daily Query Results

In this section we list the average query duration for the ?P? and ?PO triple patterns of the BEAR-B daily benchmark.

Figure 39: Average VQ query durations for ?P? triple patterns in BEAR-B daily.

C BEAR-B hourly Query Results

In this section we list the average query duration for the ?P? and ?PO triple patterns in the first 400 versions of the BEAR-B hourly benchmark.

Figure 40: Average VM query durations for ?PO triple patterns in BEAR-B daily.

Figure 43: Average VM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.

Figure 41: Average DM query durations for ?PO triple patterns in BEAR-B daily.

Figure 42: Average VQ query durations for ?PO triple patterns in BEAR-B daily.

Figure 44: Average DM query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.

Figure 45: Average VQ query durations for ?P? triple patterns in the first 400 versions of BEAR-B hourly.

Figure 47: Average DM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.

Figure 46: Average VM query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.

Figure 48: Average VQ query durations for ?PO triple patterns in the first 400 versions of BEAR-B hourly.