Compressing Provenance Graphs Abstract

Compressing Provenance Graphs Yulai Xieyz, Kiran-Kumar Muniswamy-Reddyx, Darrell D. E. Longz Ahmed Amer{, Dan Fengy, Zhipeng Tany yHuazhong University of Wuhan National Laboratory zUniversity of California, xHarvard {Santa Clara Science and Technology for Optoelectronics Santa Cruz University University Abstract two new techniques to improve the provenance graph The provenance community has built a number of sys- compression. tems to collect provenance, most of which assume that We test our hypothesis by compressing the provenance provenance will be retained indefinitely. However, it is graphs generated by the PASS [3] system. Our results not cost-effective to retain provenance information inef- show that our web compression algorithms can compress ficiently. Since provenance can be viewed as a graph, we such graphs, and that our improved algorithms can fur- note the similarities to web graphs and draw upon tech- ther raise the compression ratio up to 3.31:1. niques from the web compression domain to provide our own novel and improved graph compression solutions for provenance graphs. Our preliminary results show that 2 Web & Provenance Graphs adapting web compression techniques results in a compression ratio of 2.12:1 to 2.71:1, which we can improve A web graph is a directed graph where each URL is rep- upon to reach ratios of up to 3.31:1. resented as a node, and the link from one page to the other page is a directed edge. There are some key prop- erties exploited by current web graph compression algo- 1 Introduction rithms: ² Locality: Few links would go across URL domain, Provenance, though extremely valuable, can take up sub- and therefore the vast majority tend to point to pages stantial storage space. For instance, in the PReServ [9] nearby. provenance store, the original data was 100 KB, while the provenance reached 1 MB. For MiMI [10], an on- ² Similarity: Pages that are not far from each other line protein database, the provenance expands to 6 GB have common neighbors with high probability. while the base data is only 270 MB. Similar results are observed in other systems [3, 11]. This makes prove- ² Consecutiveness: The node numbers of successors nance data an increasing overhead on the storage subsys- of a page are in sequential order. tem. A provenance dataset can be represented as a prove- Provenance graphs also have a similar organizational nance graph [12]. Thus, efficient representation of prove- structure and characteristics as web graphs. Figure 1 nance graphs can fundamentally speed up provenance shows the conversion from a snapshot of a NetBSD queries. While, inefficient multi-GB provenance graphs provenance trace (generated in the PASS system [3]) to will not fit in limited memory and dramatically reduce an adjacency list that represents the provenance graph. query efficiency. The notation “A INPUT[ANC] B” in the provenance We propose to adapt web compression algorithms to trace means that B is an ancestor of A, indicating that compress provenance graphs. Our motivation comes there exists a directed edge from A pointing to B. In this from our observation that provenance graphs and web way, a provenance graph is also a directed graph and each graphs have similar structure and some common essen- node (e.g., node 2 or 3) has a series of out-neighbors. tial characteristics, i.e., similarity, locality and consec- Provenance nodes 2 and 3 are similar, as they have utiveness. We have further discovered that provenance common successors in the form of nodes 4, 7, 9 and 14. graphs have their own special features, and we propose The reason for this is that many header files or library 1 …… 2.0 NAME / bin/cp 2.1 INPUT [ANC] 4.0 2.1 INPUT [ANC] 5.0 8 10 2.1 INPUT [ANC] 6.0 2.1 INPUT [ANC] 7.0 11 2.1 INPUT [ANC] 8.0 6 2.1 INPUT [ANC] 9.0 12 2.1 INPUT [ANC] 10.0 2.1 INPUT [ANC] 11.0 2 2.1 INPUT [ANC] 12.0 Node Out-degree Successors 5 13 2.1 INPUT [ANC] 13.0 2.1 INPUT [ANC] 14.0 2.1 INPUT [ANC] 15.0 …… …… …… 15 2 12 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 … 4 7 3.0 NAME /disk/scripts/bulkbuild 3 9 4, 7, 9, 14, 17, 18, 19, 20, 21 3.1 INPUT [ANC] 4.0 9 …… …… …… . 14 3.1 INPUT [ANC] 7.0 3.1 INPUT [ANC] 9.0 3.1 INPUT [ANC] 14.0 3.1 INPUT [ANC] 17.0 17 3 21 3.1 INPUT [ANC] 18.0 3.1 INPUT [ANC] 19.0 3.1 INPUT [ANC] 20.0 18 20 3.1 INPUT [ANC] 21.0 19 ... Provenance trace Provenance graph Adjacency list Figure 1: Mapping from the provenance trace to adjacency list that represents the provenance graph. The expression “2.1 INPUT[ANC] 4.0” indicates that node 4 is an ancestor of node 2, resulting in a directed edge from node 2 pointing to node 4. This figure also shows that provenance graph exhibits the similar characteristics (i.e., similarity, locality and consecutiveness) as web graph. files that are represented as nodes like 4, 7, 9 and 14 are can only be applied to a flat data model, i.e., where each repeatedly used as input by many processes (e.g., nodes 2 node has complete provenance, and therefore cannot be and 3). Nodes 2 and 3 also exhibit locality. The succes- used to compress provenance generated by a provenance sors of provenance node 2 are only between 4 and 15, system such as PASS [3]. Our methods are more general and the successors of node 3 are only between 4 and 21. and are hence applicable to wider range of systems. This is because many header files or library files (e.g., There has been considerable work [5, 6, 7, 8] in the the successors of nodes 2 and 3) that are used as input domain of web graph compression. Adler et al. [5] pro- by a process are probably in the same directory, so the posed to utilize reference compression to compress web ordering (and therefore assigned numbers) of these files graphs. Randall et al. [6] suggested a set of technolo- are usually close to each other. Consecutiveness is also gies such as delta codes and variable-length bit encoding clear in the successors of nodes 2 and 3, from 4 to 15 to compress a database providing fast access to web hy- and from 17 to 21 respectively. The existence of consec- perlinks. The critical web graph compression framework utiveness is because a process or file may be generated was presented by Boldi and Vigna [7]. They obtained by many files that do not belong to a PASS volume, and good compression performance by fully exploiting the such files are usually numbered sequentially. locality and similarity of web pages. The algorithm we use is based on this framework. 3 Related Work There are also classical techniques like LZ-based compression algorithms [2]. These techniques present an up- per bound on the compression that is possible. However, Barga et al. [1] presented a layered provenance model since they do not preserve the structure of the data, the re- for workflow systems that can efficiently reduce prove- sulting compressed graph will not be amenable to query- nance size by minimizing the repeated operation infor- ing. mation during provenance generation. This is similar to our approach that exploits similarity between the neighbors of different nodes representing the same manipula- 4 Compression Algorithms tion. Chapman et al. [4] proposed two classes of algorithms to compress provenance graphs: provenance fac- Three critical ideas lie behind web compression algo- torization and provenance inheritance. The former aims rithms [7]: First, encoding the successor list of one node to find common subtrees between different nodes and the by using the similar successors of another node as a ref- latter focuses on find similarities between data items that erence, thus efficiently avoiding encoding the duplicate have ancestry relationships or belong to a particular type. data; Second, encoding consecutive numbers by only These algorithms achieve good compression ratios, but recording the start number and length, reducing the num- 2 Node Out-degree Successors Node Name Successors 15 5 3, 11, 13, 14, 17 16 7 11, 14, 19, 20, 21, 31, 33 … … … Reference compression 15 /bin/cp 19, 21, 32 Node Out-degree Bit list Extra nodes W=3 16 /bin/bash 4, 9, 13, 17 15 5 3, 11, 13, 14, 17 17 /bin/su 19, 20, 23 16 7 01010 19, 20, 21, 31, 33 18 /usr/bin 3, 11, 13, 14, 17 Find consecutive numbers 19 /bin/hostname 3, 10, 13, 17 Node Out-degree Bit list Left extreme Length Residuals 20 /sbin/consoletype 4, 8, 9, 11 15 5 13 2 3, 11, 17 W=3 21 /meminfo 4, 8, 11, 15 16 7 01010 19 3 31, 33 22 /usr/bin/id 5, 7, 11, 12 Encode gaps 23 /usr/bin 3, 11, 13, 14, 18 24 /bin/sed 4, 6 Node Out-degree Bit list Left extreme Length Residuals 25 /usr/bin 3, 11, 13, 14, 19 15 5 13 2 -12, 8, 6 16 7 01010 19 3 15, 2 ... ... ... Figure 2: An example on web compression algorithm Figure 3: Name-identified reference list ber of successors to be encoded; and Third, encoding the Our Improved Approach gap between the successors of a node rather than the suc- We now describe two improvements beyond existing cessors themselves, which typically requires fewer bits web compression algorithms.

Load more