Compressing Provenance Graphs


Yulai Xie†‡, Kiran-Kumar Muniswamy-Reddy§, Darrell D. E. Long‡, Ahmed Amer¶, Dan Feng†, Zhipeng Tan†

†Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology; ‡University of California, Santa Cruz; §Harvard University; ¶Santa Clara University

Abstract

The provenance community has built a number of systems to collect provenance, most of which assume that provenance will be retained indefinitely. However, it is not cost-effective to retain provenance information inefficiently. Since provenance can be viewed as a graph, we note the similarities to web graphs and draw upon techniques from the web compression domain to provide our own novel and improved graph compression solutions for provenance graphs. Our preliminary results show that adapting web compression techniques yields a compression ratio of 2.12:1 to 2.71:1, which our improvements raise to as much as 3.31:1.

1 Introduction

Provenance, though extremely valuable, can take up substantial storage space. For instance, in the PReServ [9] provenance store, the original data was 100 KB while the provenance reached 1 MB. For MiMI [10], an online protein database, the provenance expands to 6 GB while the base data is only 270 MB. Similar results are observed in other systems [3, 11]. This makes provenance data an increasing overhead on the storage subsystem.

A provenance dataset can be represented as a provenance graph [12]. An efficient representation of provenance graphs can thus fundamentally speed up provenance queries, whereas inefficient multi-GB provenance graphs will not fit in limited memory and will dramatically reduce query efficiency.

We propose to adapt web compression algorithms to compress provenance graphs. Our motivation comes from our observation that provenance graphs and web graphs have similar structure and share essential characteristics, namely similarity, locality, and consecutiveness. We have further discovered that provenance graphs have special features of their own, and we propose two new techniques that exploit them to improve provenance graph compression.

We test our hypothesis by compressing the provenance graphs generated by the PASS [3] system. Our results show that web compression algorithms can compress such graphs, and that our improved algorithms can further raise the compression ratio up to 3.31:1.

2 Web & Provenance Graphs

A web graph is a directed graph in which each URL is represented as a node and each link from one page to another is a directed edge. Current web graph compression algorithms exploit several key properties of such graphs:

• Locality: Few links cross URL domains; the vast majority point to nearby pages.

• Similarity: Pages that are not far from each other have common neighbors with high probability.

• Consecutiveness: The node numbers of a page's successors tend to run in sequential order.

Provenance graphs have a similar organizational structure and similar characteristics. Figure 1 shows the conversion from a snapshot of a NetBSD provenance trace (generated by the PASS system [3]) to an adjacency list that represents the provenance graph. The notation "A INPUT [ANC] B" in the trace means that B is an ancestor of A, indicating that there is a directed edge from A pointing to B. A provenance graph is thus also a directed graph in which each node (e.g., node 2 or 3) has a series of out-neighbors.

[Figure 1: Mapping from the provenance trace to the adjacency list that represents the provenance graph. In the example, node 2 (/bin/cp) has out-degree 12 and successors 4–15, and node 3 (/disk/scripts/bulkbuild) has out-degree 9 and successors 4, 7, 9, 14, 17–21. The expression "2.1 INPUT [ANC] 4.0" indicates that node 4 is an ancestor of node 2, resulting in a directed edge from node 2 to node 4. The figure also shows that provenance graphs exhibit the same characteristics (similarity, locality, and consecutiveness) as web graphs.]

Provenance nodes 2 and 3 are similar, as they have common successors in the form of nodes 4, 7, 9, and 14. The reason is that many header files or library files, represented here by nodes such as 4, 7, 9, and 14, are repeatedly used as input by many processes (e.g., nodes 2 and 3). Nodes 2 and 3 also exhibit locality: the successors of node 2 all fall between 4 and 15, and the successors of node 3 between 4 and 21. This is because the header or library files used as input by a process (e.g., the successors of nodes 2 and 3) usually reside in the same directory, so their ordering (and therefore their assigned node numbers) tends to be close together. Consecutiveness is also clear in the successors of nodes 2 and 3, which run from 4 to 15 and from 17 to 21 respectively. It arises because a process or file may be generated from many files that do not belong to a PASS volume, and such files are usually numbered sequentially.

3 Related Work

Barga et al. [1] presented a layered provenance model for workflow systems that efficiently reduces provenance size by minimizing repeated operation information during provenance generation. This is similar in spirit to our approach, which exploits the similarity between the neighbors of different nodes representing the same manipulation. Chapman et al. [4] proposed two classes of algorithms to compress provenance graphs: provenance factorization, which finds common subtrees shared by different nodes, and provenance inheritance, which finds similarities between data items that have ancestry relationships or belong to a particular type. These algorithms achieve good compression ratios, but they apply only to a flat data model, i.e., one in which each node carries its complete provenance, and therefore cannot compress provenance generated by a system such as PASS [3]. Our methods are more general and hence applicable to a wider range of systems.

There has been considerable work [5, 6, 7, 8] in the domain of web graph compression. Adler et al. [5] proposed using reference compression to compress web graphs. Randall et al. [6] suggested a set of techniques, such as delta codes and variable-length bit encoding, for compressing a database that provides fast access to web hyperlinks. The pivotal web graph compression framework was presented by Boldi and Vigna [7], who obtained good compression performance by fully exploiting the locality and similarity of web pages. The algorithm we use is based on this framework.

There are also classical techniques such as LZ-based compression algorithms [2]. These techniques represent an upper bound on the achievable compression; however, since they do not preserve the structure of the data, the resulting compressed graph is not amenable to querying.

4 Compression Algorithms

Three critical ideas lie behind web compression algorithms [7]. First, encode the successor list of one node using the similar successor list of another node as a reference, thereby avoiding encoding duplicate data. Second, encode runs of consecutive numbers by recording only the start number and the run length, reducing the number of successors that must be encoded explicitly. Third, encode the gaps between a node's successors rather than the successors themselves, which typically requires fewer bits.

[Figure 2: An example of the web compression algorithm with reference window W = 3. Node 15 has successors 3, 11, 13, 14, 17 and node 16 has successors 11, 14, 19, 20, 21, 31, 33. Reference compression encodes node 16 against node 15 with the bit list 01010 (copying successors 11 and 14) plus the extra nodes 19, 20, 21, 31, 33. Finding consecutive numbers then collapses the run 19, 20, 21 to left extreme 19 and length 3 (and 13, 14 of node 15 to left extreme 13 and length 2), leaving residuals 31, 33 and 3, 11, 17 respectively. Finally, encoding gaps turns the residuals of node 15 into -12, 8, 6 and those of node 16 into 15, 2.]

[Figure 3: A name-identified reference list, in which each node carries its name (e.g., /bin/cp, /bin/bash, /usr/bin) alongside its successor list; nodes with the same name (e.g., the several /usr/bin entries) have nearly identical successor lists.]

Our Improved Approach

We now describe two improvements beyond existing web compression algorithms.
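The Figure 1 mapping from trace records to an adjacency list can be sketched in a few lines of Python. This is a minimal sketch, not PASS code: it assumes each ancestry record has the whitespace-separated form "<node>.<version> INPUT [ANC] <ancestor>.<version>" as shown in the figure, and ignores all other records (e.g., NAME records).

```python
def trace_to_adjacency(trace_lines):
    """Build an adjacency list from PASS-style trace records.

    Assumes ancestry records of the whitespace-separated form
    "<node>.<ver> INPUT [ANC] <anc>.<ver>"; other records (e.g. NAME)
    are ignored. "2.1 INPUT [ANC] 4.0" adds the edge 2 -> 4.
    """
    adj = {}
    for line in trace_lines:
        parts = line.split()
        if len(parts) == 4 and parts[1] == "INPUT" and parts[2] == "[ANC]":
            node = int(parts[0].split(".")[0])  # "2.1" -> node 2
            anc = int(parts[3].split(".")[0])   # "4.0" -> node 4
            adj.setdefault(node, []).append(anc)
    return adj

trace = [
    "2.0 NAME /bin/cp",
    "2.1 INPUT [ANC] 4.0",
    "2.1 INPUT [ANC] 5.0",
    "3.1 INPUT [ANC] 4.0",
]
print(trace_to_adjacency(trace))  # {2: [4, 5], 3: [4]}
```

Each node's successor list comes out in trace order; renumbering or sorting it (as the figure's adjacency list does) is what exposes the locality and consecutiveness the compressor relies on.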
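The three ideas behind the web compression algorithms, applied in sequence to the Figure 2 example, can be illustrated with the following toy encoder. This is a simplified sketch of a Boldi–Vigna-style scheme, not the paper's actual codec: the tuple layout mirrors the figure's columns, whereas a real implementation bit-packs these fields and searches a reference window of W candidate nodes; the function and parameter names are our own.

```python
def encode(node_id, succ, ref_succ=None):
    """Encode one successor list using the three ideas of Section 4.

    Returns (bit_list, run_start, run_length, residual_gaps), mirroring
    the columns of Figure 2.
    """
    bits, rest = "", list(succ)
    if ref_succ is not None:
        # Idea 1: reference compression -- a bit per reference successor
        # marks which ones are shared, so they need not be re-encoded.
        for s in ref_succ:
            if s in rest:
                bits += "1"
                rest.remove(s)
            else:
                bits += "0"
    # Idea 2: replace the longest run of consecutive numbers with its
    # left extreme and length.
    run_start, run_len = None, 0
    i = 0
    while i < len(rest):
        j = i
        while j + 1 < len(rest) and rest[j + 1] == rest[j] + 1:
            j += 1
        if j - i + 1 > run_len:
            run_start, run_len = rest[i], j - i + 1
        i = j + 1
    if run_len < 2:  # a "run" of one number is not worth collapsing
        run_start, run_len = None, 0
    run = range(run_start, run_start + run_len) if run_start is not None else ()
    residuals = [s for s in rest if s not in run]
    # Idea 3: gap-encode the residuals; the first gap is taken relative
    # to the node's own id (hence it can be negative).
    gaps, prev = [], node_id
    for r in residuals:
        gaps.append(r - prev)
        prev = r
    return bits, run_start, run_len, gaps

# Reproducing Figure 2 (nodes 15 and 16):
print(encode(15, [3, 11, 13, 14, 17]))
# -> ('', 13, 2, [-12, 8, 6])
print(encode(16, [11, 14, 19, 20, 21, 31, 33], ref_succ=[3, 11, 13, 14, 17]))
# -> ('01010', 19, 3, [15, 2])
```

The gains compound: node 16's seven successors shrink to five reference bits, one (start, length) pair, and two small gaps, which short variable-length codes represent in far fewer bits than seven absolute node numbers.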