A Graph Model of Data and Workflow Provenance

A graph model of data and workflow provenance Umut Acar Peter Buneman Max-Planck Institute for Software Systems University of Edinburgh James Cheney Jan Van den Bussche Natalia Kwasnikowska University of Edinburgh Hasselt University Hasselt University Stijn Vansummeren Universite´ Libre de Bruxelles Abstract that these models share a great deal of structure once they are defined in a common language and data model [8, 6]. Provenance has been studied extensively in both database Provenance models have also been developed for a va- and workflow management systems, so far with little riety of workflow systems, such as Chimera [10], Tav- convergence of definitions or models. Provenance in erna [20], Kepler [1], Karma [24], and ZOOM [26]; databases has generally been defined for relational or also, many other systems such as PASS and PASOA em- complex object data, by propagating fine-grained an- ploy similar ideas [14, 23]. These systems model and notations or algebraic expressions from the input to record provenance as a directed acyclic graph that, in- the output. This kind of provenance has been found formally, describes the macroscopic computation steps useful in other areas of computer science: annotation (e.g., whole program executions) performed in construct- databases, probabilistic databases, schema and data in- ing intermediate and final results. Recently, the Open tegration, etc. In contrast, workflow provenance aims to Provenance Model (OPM) [22] has been developed as a capture a complete description of evaluation – or enact- consensus exchange format for representing provenance ment – of a workflow, and this is crucial to verification graphs. in scientific computation. Workflows and their prove- Workflow systems employ a much wider variety of nance are often presented using graphical notation, mak- programming constructs than databases, including con- ing them easy to visualize but complicating the formal currency, procedures, service calls, and queries to exter- semantics that relates their run-time behavior with their nal databases. However, these systems are seldom ac- provenance records. We bridge this gap by extending a companied by formal specifications of their intended se- previously-developed dataflow language which supports mantics, with or without provenance. As a result, it can both database-style querying and workflow-style batch be hard to understand provenance information produced processing steps to produce a workflow-style provenance by a workflow system, since the meaning intended by the graph that can be explicitly queried. We define and implementer may not match the expectations of the user. describe the model through examples, present queries This is a particularly vexing problem because some users that extract other forms of provenance, and give an exe- and implementers might not even be aware of the possi- cutable definition of the graph semantics of dataflow ex- bility of misinterpretation, leading to further confusion. pressions. The scarcity of clear specifications of the semantics and provenance behavior of workflow systems makes it 1 Introduction difficult to integrate database and workflow provenance or compare provenance graphs generated by different A number of standard database provenance models systems. Therefore, we believe that it is essential to study tailored to relational, complex-object or XML query the semantics of workflow provenance models and relate languages have emerged. These models include lin- them to existing models of database provenance. eage [9], where-provenance [3, 2], why-provenance [3, To compare and unify these different techniques, we 4], and more recent innovations such as dependency- need to define a common provenance model. Database provenance [7], how-provenance [13, 11], and prove- provenance models can be visualized as graphs. Where- nance traces [6]. These models have been presented in provenance, lineage, and dependency-provenance can be a number of different ways and founded on several dif- visualized as bipartite graphs linking parts of the out- ferent motivations. Recently, further study has revealed put with parts of the input. How-provenance and why- provenance are more complex, but can also be visualized such as images or data files. Functions include primitive as directed acyclic graphs linking parts of the output to operations on basic data types, such as integer addition parts of the input, where nodes are labeled with symbolic and equality. Furthermore, functions can also represent algebraic operations. Graphs provide a natural common large computational steps such as external program or formalism for workflow and database provenance. service calls: for example, to model the first Provenance We also need a common language that can express Challenge workflow we might use base types such as both database queries and workflows. In this paper, we Image, Header, or WarpFile and function symbols such use a core calculus for dataflows, called DFL, based on as align warp : Image × Header × Image × Header ! the Nested Relational Calculus (NRC). DFL has been WarpFile or reslice : WarpFile ! Image × Header to previously introduced by Hidders et al. [16, 17] and we represent the macroscopic computation steps. also build upon some prior work on provenance in this The remaining syntactic constructs above are standard setting [19]. We develop a graphical model of prove- components of the Nested Relational Calculus: we in- nance for both database queries and simple workflows clude record and field projection operations, booleans in a uniform way. This should provide a foundation for and conditionals, and set operations. We employ the syn- studying more complex workflow language features such tax fe0(x) j x 2 eg for the “for-loop” or set comprehen- as nondeterminism, concurrency and while-loops. sion operation which evaluates e to a set fv1; : : : ; vng 0 0 The structure of the rest of this paper is as follows. In and returns the set of values fe (v1); : : : ; e (vn)g ob- 0 Section 2 we review the dataflow calculus DFL. In Sec- tained by evaluating e with x bound to each vi. The tion 3 we describe the structure of provenance graphs and expression S e flattens a nested collection. The expres- give examples showing how typical dataflows are trans- sion empty?(e) tests whether collection e is empty. lated to graphs. In Section 4 we describe the provenance We will use ordered-pair syntax (e1; e2) to abbreviate semantics of DFL programs, and give an executable, yet hfst : e1; snd : e2i, and write fst(e) or snd(e) instead of still high-level implementation in Haskell. Section 5 dis- πfst(e) or πsnd(e), respectively, for the first and second cusses how to express queries over the graphs, partic- projections of an ordered pair. We also assume a fixed, ularly inspired by where- and and why-provenance in finite set of attribute names Attr. databases, and outlines some future directions. Section 6 DFL and NRC are statically typed languages with an discusses related work and Section 7 concludes. arbitrary but fixed collection of atomic types, and an ar- Note. Our graphical model is fundamentally very bitrary but fixed signature that assigns types to the con- similar to the trace model developed in prior, unpub- stants and function symbols [5]. The static typing disci- lished work with Acar and Ahmed [6]. However, we pline ensures that expressions are always well-defined on make a different contribution: namely, we feel our graph- input values of the correct type. For ease of presentation theoretic presentation is more widely accessible than in what follows, we will ignore typing issues and silently the syntactic traces and operational semantics rules em- restrict attention to expressions and evaluations that are ployed to simplify proofs of their main results in [6]. well-defined in the conventional sense [5]. So whenever we apply, for example, e1 [ e2, e1 and e2 are assumed to correctly evaluate to sets. 2 Background The dataflow language DFL [17] is an extension of the 3 Value, evaluation and provenance graphs Nested Relational Calculus (NRC) that includes atomic values and functions. As we can only briefly introduce DFL expressions are normally evaluated over complex DFL here due to space reasons, we encourage readers values, which are nested combinations of atomic data unfamiliar with DFL and NRC to consult [5, 17] for more values d, tuples of complex values hA1 : v1;:::;An : details. In brief, the syntax of DFL is as follows: vni, and sets of complex values fv1; : : : ; vng. As we show in Section 3.1, we can easily represent complex 0 0 e; e ::= x j let x = e in e j c j f(e1; : : : ; en) values as trees or (with sharing) as directed acyclic graphs. Using such value graphs, we are going to rep- j πA(e) j hA1:e1;:::;An:eni j empty?(e) resent the evaluation of a DFL expression by means of j True j False j if e then e else e 1 2 a provenance graph in Section 3.3. A provenance graph 0 [ j ; j feg j e1 [ e2 j fe j x 2 eg j e is a two-sorted graph, consisting of a value graph and an evaluation graph (introduced in Section 3.2), that docu- Here, c denotes a constant atomic data value, drawn from ments the evaluation of a program. Moreover, there is a a set D, and f denotes a function. Atomic data values connection between the evaluation graph and the value may be values of base types such as integers or booleans graph in that each evaluation node is linked to a value or strings, but they may also be more complicated objects node. 2 3.1 Value graphs As with value graphs, we write labl(m) to indicate that l 0 A value graph G is a directed acyclic graph (V; E) with node x has label l and m ! m to indicate that nodes m 0 labels on the nodes and edges.

A Graph Model of Data and Workflow Provenance

Are We Losing Our Ability to Think Critically?

Autumn 2Copy2:First Draft.Qxd

The Best Nurturers in Computer Science Research

Curating the CIA World Factbook 29 the International Journal of Digital Curation Issue 3, Volume 4 | 2009

Transforming Databases with Recursive Data Structures

A Survey on Scientific Data Management

Towards a Multi-Discipline Network Perspective

O. Peter Buneman Curriculum Vitæ – Jan 2008

The Hyperview Approach to the Integration of Semistructured Data

File Size for Two Types of Query As the Retrieved Resultset Increases

Using Links to Prototype a Database Wiki

RSE Fellows Ordered by Area of Expertise As at 11/10/2016