A graph model of data and workflow provenance

Umut Acar Peter Buneman Max-Planck Institute for Software Systems James Cheney Jan Van den Bussche Natalia Kwasnikowska University of Edinburgh Hasselt University Hasselt University Stijn Vansummeren Université Libre de Bruxelles

Abstract

Provenance has been studied extensively in both database and workflow management systems, so far with little convergence of definitions or models. Provenance in databases has generally been defined for relational or complex object data, by propagating fine-grained annotations or algebraic expressions from the input to the output. This kind of provenance has been found useful in other areas of computer science: annotation databases, probabilistic databases, schema and data integration, etc. In contrast, workflow provenance aims to capture a complete description of evaluation (or enactment) of a workflow, and this is crucial to verification in scientific computation. Workflows and their provenance are often presented using graphical notation, making them easy to visualize but complicating the formal semantics that relates their run-time behavior with their provenance records. We bridge this gap by extending a previously-developed dataflow language which supports both database-style querying and workflow-style batch processing steps to produce a workflow-style provenance graph that can be explicitly queried. We define and describe the model through examples, present queries that extract other forms of provenance, and give an executable definition of the graph semantics of dataflow expressions.

1 Introduction

A number of standard database provenance models tailored to relational, complex-object or XML query languages have emerged. These models include lineage [9], where-provenance [3, 2], why-provenance [3, 4], and more recent innovations such as dependency-provenance [7], how-provenance [13, 11], and provenance traces [6]. These models have been presented in a number of different ways and founded on several different motivations. Recently, further study has revealed that these models share a great deal of structure once they are defined in a common language and data model [8, 6].

Provenance models have also been developed for a variety of workflow systems, such as Chimera [10], Taverna [20], Kepler [1], Karma [24], and ZOOM [26]; also, many other systems such as PASS and PASOA employ similar ideas [14, 23]. These systems model and record provenance as a directed acyclic graph that, informally, describes the macroscopic computation steps (e.g., whole program executions) performed in constructing intermediate and final results. Recently, the Open Provenance Model (OPM) [22] has been developed as a consensus exchange format for representing provenance graphs.

Workflow systems employ a much wider variety of programming constructs than databases, including concurrency, procedures, service calls, and queries to external databases. However, these systems are seldom accompanied by formal specifications of their intended semantics, with or without provenance. As a result, it can be hard to understand provenance information produced by a workflow system, since the meaning intended by the implementer may not match the expectations of the user. This is a particularly vexing problem because some users and implementers might not even be aware of the possibility of misinterpretation, leading to further confusion. The scarcity of clear specifications of the semantics and provenance behavior of workflow systems makes it difficult to integrate database and workflow provenance or compare provenance graphs generated by different systems. Therefore, we believe that it is essential to study the semantics of workflow provenance models and relate them to existing models of database provenance.

To compare and unify these different techniques, we need to define a common provenance model. Database provenance models can be visualized as graphs. Where-provenance, lineage, and dependency-provenance can be visualized as bipartite graphs linking parts of the output with parts of the input. How-provenance and why-provenance are more complex, but can also be visualized as directed acyclic graphs linking parts of the output to parts of the input, where nodes are labeled with symbolic algebraic operations. Graphs provide a natural common formalism for workflow and database provenance.

We also need a common language that can express both database queries and workflows. In this paper, we use a core calculus for dataflows, called DFL, based on the Nested Relational Calculus (NRC). DFL has been previously introduced by Hidders et al. [16, 17] and we also build upon some prior work on provenance in this setting [19]. We develop a graphical model of provenance for both database queries and simple workflows in a uniform way. This should provide a foundation for studying more complex workflow language features such as nondeterminism, concurrency and while-loops.

The structure of the rest of this paper is as follows. In Section 2 we review the dataflow calculus DFL. In Section 3 we describe the structure of provenance graphs and give examples showing how typical dataflows are translated to graphs. In Section 4 we describe the provenance semantics of DFL programs, and give an executable, yet still high-level implementation in Haskell. Section 5 discusses how to express queries over the graphs, particularly inspired by where- and why-provenance in databases, and outlines some future directions. Section 6 discusses related work and Section 7 concludes.

Note. Our graphical model is fundamentally very similar to the trace model developed in prior, unpublished work with Acar and Ahmed [6]. However, we make a different contribution: namely, we feel our graph-theoretic presentation is more widely accessible than the syntactic traces and operational semantics rules employed to simplify proofs of their main results in [6].

2 Background

The dataflow language DFL [17] is an extension of the Nested Relational Calculus (NRC) that includes atomic values and functions. As we can only briefly introduce DFL here due to space reasons, we encourage readers unfamiliar with DFL and NRC to consult [5, 17] for more details. In brief, the syntax of DFL is as follows:

    e, e′ ::= x | let x = e in e′ | c | f(e1, ..., en)
            | πA(e) | ⟨A1 : e1, ..., An : en⟩ | empty?(e)
            | True | False | if e then e1 else e2
            | ∅ | {e} | e1 ∪ e2 | {e′ | x ∈ e} | ⋃ e

Here, c denotes a constant atomic data value, drawn from a set D, and f denotes a function. Atomic data values may be values of base types such as integers or booleans or strings, but they may also be more complicated objects such as images or data files. Functions include primitive operations on basic data types, such as integer addition and equality. Furthermore, functions can also represent large computational steps such as external program or service calls: for example, to model the first Provenance Challenge workflow we might use base types such as Image, Header, or WarpFile and function symbols such as align_warp : Image × Header × Image × Header → WarpFile or reslice : WarpFile → Image × Header to represent the macroscopic computation steps.

The remaining syntactic constructs above are standard components of the Nested Relational Calculus: we include record and field projection operations, booleans and conditionals, and set operations. We employ the syntax {e′(x) | x ∈ e} for the "for-loop" or set comprehension operation, which evaluates e to a set {v1, ..., vn} and returns the set of values {e′(v1), ..., e′(vn)} obtained by evaluating e′ with x bound to each vi. The expression ⋃ e flattens a nested collection. The expression empty?(e) tests whether collection e is empty.

We will use ordered-pair syntax (e1, e2) to abbreviate ⟨fst : e1, snd : e2⟩, and write fst(e) or snd(e) instead of πfst(e) or πsnd(e), respectively, for the first and second projections of an ordered pair. We also assume a fixed, finite set of attribute names Attr.

DFL and NRC are statically typed languages with an arbitrary but fixed collection of atomic types, and an arbitrary but fixed signature that assigns types to the constants and function symbols [5]. The static typing discipline ensures that expressions are always well-defined on input values of the correct type. For ease of presentation in what follows, we will ignore typing issues and silently restrict attention to expressions and evaluations that are well-defined in the conventional sense [5]. So whenever we apply, for example, e1 ∪ e2, the expressions e1 and e2 are assumed to correctly evaluate to sets.

3 Value, evaluation and provenance graphs

DFL expressions are normally evaluated over complex values, which are nested combinations of atomic data values d, tuples of complex values ⟨A1 : v1, ..., An : vn⟩, and sets of complex values {v1, ..., vn}. As we show in Section 3.1, we can easily represent complex values as trees or (with sharing) as directed acyclic graphs. Using such value graphs, we are going to represent the evaluation of a DFL expression by means of a provenance graph in Section 3.3. A provenance graph is a two-sorted graph, consisting of a value graph and an evaluation graph (introduced in Section 3.2), that documents the evaluation of a program. Moreover, there is a connection between the evaluation graph and the value graph in that each evaluation node is linked to a value node.
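To make the DFL grammar of Section 2 and the notion of complex values concrete, the following is a minimal Haskell sketch of an abstract syntax and a plain evaluator that computes values only, with no provenance. All names here (Val, Expr, eval, prim, queryA, sumSquares) are illustrative assumptions and are not taken from the implementation discussed later in the paper; sets are approximated by duplicate-free lists, so equality of sets is order-sensitive in this sketch. The encoding of the SQL query as a DFL expression is one plausible translation, not necessarily the one used to produce Figure 1.

```haskell
import Data.List (nub)

-- Complex values: atoms, records, and sets (kept as duplicate-free lists).
data Val = VInt Int | VBool Bool | VRec [(String, Val)] | VSet [Val]
  deriving (Eq, Show)

-- DFL expressions, one constructor per production of the grammar.
data Expr
  = Var String                 -- x
  | Let String Expr Expr       -- let x = e in e'
  | Const Val                  -- c (also covers True and False)
  | Prim String [Expr]         -- f(e1, ..., en)
  | Proj String Expr           -- projection on attribute A
  | Rec [(String, Expr)]       -- record construction
  | IsEmpty Expr               -- empty?(e)
  | If Expr Expr Expr          -- if e then e1 else e2
  | Empty                      -- the empty set
  | Sng Expr                   -- {e}
  | Union Expr Expr            -- e1 union e2
  | For Expr String Expr       -- For body x src  stands for  {body | x in src}
  | Flatten Expr               -- big union of a set of sets
  deriving Show

type Env = [(String, Val)]

mkSet :: [Val] -> Val
mkSet = VSet . nub

-- Hypothetical interpretation of a few primitive function symbols.
prim :: String -> [Val] -> Val
prim "+" [VInt a, VInt b] = VInt (a + b)
prim "*" [VInt a, VInt b] = VInt (a * b)
prim "=" [a, b]           = VBool (a == b)
prim f   _                = error ("unsupported primitive: " ++ f)

-- Elements of a set value; ill-typed input is a run-time error here.
elems :: Val -> [Val]
elems (VSet vs) = vs
elems v         = error ("expected a set, got " ++ show v)

eval :: Env -> Expr -> Val
eval env expr = case expr of
  Var x       -> maybe (error ("unbound variable: " ++ x)) id (lookup x env)
  Let x e b   -> eval ((x, eval env e) : env) b
  Const c     -> c
  Prim f es   -> prim f (map (eval env) es)
  Proj a e    -> case eval env e of
    VRec fs | Just v <- lookup a fs -> v
    v                               -> error ("cannot project " ++ a ++ " from " ++ show v)
  Rec fs      -> VRec [(a, eval env e) | (a, e) <- fs]
  IsEmpty e   -> VBool (null (elems (eval env e)))
  If c t f    -> case eval env c of
    VBool True  -> eval env t
    VBool False -> eval env f
    v           -> error ("expected a boolean, got " ++ show v)
  Empty       -> VSet []
  Sng e       -> VSet [eval env e]
  Union a b   -> mkSet (elems (eval env a) ++ elems (eval env b))
  For b x src -> mkSet [eval ((x, v) : env) b | v <- elems (eval env src)]
  Flatten e   -> mkSet (concatMap elems (elems (eval env e)))

-- One plausible encoding of: SELECT a+b FROM R WHERE a=b, with R = {(1,2),(1,1)}.
queryA :: Expr
queryA = Let "R" (Const (VSet [pair 1 2, pair 1 1])) (Flatten (For body "x" (Var "R")))
  where
    pair a b = VRec [("fst", VInt a), ("snd", VInt b)]
    fstX     = Proj "fst" (Var "x")
    sndX     = Proj "snd" (Var "x")
    body     = If (Prim "=" [fstX, sndX]) (Sng (Prim "+" [fstX, sndX])) Empty

-- let x = 3 in let y = 4 in x * x + y * y
sumSquares :: Expr
sumSquares = Let "x" (Const (VInt 3)) (Let "y" (Const (VInt 4))
  (Prim "+" [Prim "*" [Var "x", Var "x"], Prim "*" [Var "y", Var "y"]]))
```

Evaluating `eval [] queryA` in this sketch yields the singleton set containing 2, and `eval [] sumSquares` yields 25. The provenance graphs described next record how such results were obtained, which this value-only evaluator discards.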

3.1 Value graphs

A value graph G is a directed acyclic graph (V, E) with labels on the nodes and edges. The nodes are labeled using the alphabet {{}, ⟨⟩, copy} ∪ D. The edges are optionally labeled using the alphabet {elem} ∪ Attr. We use the formula lab_l(n) to indicate that n has label l in G and n →^l n′ to indicate that there is an edge (n, n′) with label l in G.

To illustrate, the following graph represents the value {⟨A : 1, B : 2⟩, ⟨A : 1, B : 3⟩}:

[Figure: value graph for {⟨A : 1, B : 2⟩, ⟨A : 1, B : 3⟩}]

We restrict our attention in what follows to legal value graphs that can be constructed using the following rules:

[Figure: construction rules for value graphs]

The meaning of these patterns is that a value graph can be constructed from another valid graph by adding new nodes and edges (shown using solid lines) linked to some existing nodes (shown using dotted lines). Sharing among the nodes of the value graph is allowed. Also, the empty graph is valid and the union of two disjoint value graphs is valid.

A tree-shaped value graph is called a value tree. Clearly, one can canonically represent any complex value by the root node of a value tree. Moreover, any value graph can be converted to a value tree by duplicating shared nodes and by merging copy nodes with their targets. In any value graph, any node whose unraveling yields this canonical value tree is also said to represent the same complex value.

In what follows, we often say "n is a copy of n′ (in G)" as shorthand meaning that n is a copy node and its (sole, unlabeled) outgoing edge leads to n′.

3.2 Evaluation graphs

An evaluation graph G = (V, E) is a labeled directed acyclic graph with node labels drawn from the set

    {x, c, f, ⟨⟩, πA, let x, if, ∅, {}, ∪, ⋃, for x}

and optional edge labels drawn from the set

    {A, head, body, test, then, else, 1, 2, ...}

As with value graphs, we write lab_l(m) to indicate that node m has label l and m →^l m′ to indicate that nodes m and m′ are linked by an edge labeled l.

A valid evaluation graph is one that can be constructed using the following rules:

[Figure: construction rules for evaluation graphs]

where, again, the meaning of each pattern is that we can extend the graph by adding new nodes and edges (shown using solid lines) by linking to existing nodes (shown using dotted lines). The existing nodes need not be disjoint, so sharing can occur in evaluation graphs. Also, the empty graph is valid and the union of two disjoint graphs is valid.

Finally, we introduce the following terminology: A node labeled let x or for x is said to bind x. We say that a variable node e_x labeled by x is in the scope of a node e that binds x, if there is a path from e to e_x that does not pass through another node that binds x. We require each variable node to be in the scope of at most one binding node.

For example, the following is a valid evaluation graph:

[Figure: evaluation graph of a conditional with test x = 1 and else-branch x ∗ 5]

Essentially, this graph says that a value was obtained by doing a conditional test x = 1 which failed, and then evaluating the else-branch to return x ∗ 5. Note that there is no information about the actual value of x or the result of the computation, although we can infer that x ≠ 1 holds.

3.3 Provenance Graphs

A provenance graph G = (V, E, val) is a directed acyclic graph with nodes and edges labeled with either value or evaluation labels, such that:

1. V = V_V ⊎ V_E,
2. prov is a partial injective function from V_V to V_E so that each evaluation node e has a unique value node val(e),
3. G is a value graph if we disregard the evaluation graph structure, and
4. G is an evaluation graph if we merge each pair of nodes (e, val(e)) and disregard the value structure.

In the following examples, by convention, we highlight parts of the input expression that are considered "inputs" using gray boxes.

Here is a simple example, showing the computation 3 + 4 = 7:

[Figure: provenance graph for 3 + 4 = 7]

Here is a more complicated example, showing the evaluation of expression π2(3 + 4, 5) = 5 involving constructing a pair and then selecting the second argument:

[Figure: provenance graph for π2(3 + 4, 5) = 5]

Finally, here is a larger example demonstrating let-binding, showing the evaluation of an expression let x = 3 in let y = 4 in x ∗ x + y ∗ y.

[Figure: provenance graph for let x = 3 in let y = 4 in x ∗ x + y ∗ y, with result 25]

Figure 1(a) and (b) are two larger examples, corresponding to simple SQL queries.

[Figure 1: Examples (a) SELECT a+b FROM R WHERE a=b, where R = {(1, 2), (1, 1)} and (b) R MINUS S where R = {1, 2} and S = {1}. Ovals are value nodes; boxes are evaluation nodes; gray edges are value-graph edges.]

Given a provenance graph G and an evaluation node e, there is a natural notion of (G, e) being locally consistent, with the intuition that the computation depicted in the part of the evaluation graph reachable from e matches the assignment of value nodes to evaluation nodes by the val-edges. The graphs used as examples so far are all consistent in this sense; however, the following graphs are inconsistent:

[Figure: two locally inconsistent graphs. Left: a + node with arguments 2 and 2 and claimed result 5. Right: a conditional whose test value is True but whose result is copied from the else-branch.]

The left example is obviously silly: the claimed result of a function should be consistent with the function's meaning. The right example is more subtle: the labels of the branches in conditional nodes need to match the boolean value of the test.

Moreover, we expect the evaluation graph to be globally consistent, in the sense that the whole trace is an "unfolding" of the evaluation of a particular expression e. All of the graphs we have seen so far are globally consistent with an expression. However, the following graph is globally inconsistent:

[Figure: a globally inconsistent graph, a for-loop with two different bodies]

The inconsistency here is between the two evaluation bodies of the for-loop: the body of a comprehension cannot be both a constant 3 and a constant 4. In a globally consistent graph the control flow leading to different values for different iterations must be made explicit.

It is possible to enforce local consistency using first-order constraints on the provenance graph. Moreover, global consistency can also be defined using first-order constraints by induction on the structure of expressions e. We omit the actual constraints due to space limits.
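Since the actual constraints are omitted, the following Haskell sketch only illustrates the flavor of a local consistency check for two node kinds, addition and conditionals. It is an assumption-laden simplification, not the first-order constraints themselves: the graph is represented as a tree in which every evaluation node carries its claimed value val(e), and the names PNode, Branch, valOf and locallyConsistent are hypothetical.

```haskell
-- Atomic values attached to evaluation nodes (integers and booleans only).
data Val = I Int | B Bool deriving (Eq, Show)

-- Label of the branch edge leaving a conditional node.
data Branch = Then | Else deriving (Eq, Show)

-- An evaluation node together with its claimed value val(e).
data PNode
  = Const Val                  -- constant node; its value is the constant
  | Plus PNode PNode Val       -- argument edges 1 and 2, claimed result
  | If PNode Branch PNode Val  -- test edge, branch label, branch taken, claimed result
  deriving Show

valOf :: PNode -> Val
valOf (Const v)    = v
valOf (Plus _ _ v) = v
valOf (If _ _ _ v) = v

-- The claimed result of + must equal the sum of the argument values.
sumMatches :: Val -> Val -> Val -> Bool
sumMatches (I x) (I y) v = v == I (x + y)
sumMatches _     _     _ = False

-- Check every node reachable from the given node.
locallyConsistent :: PNode -> Bool
locallyConsistent node = case node of
  Const _     -> True
  Plus a b v  ->
    locallyConsistent a && locallyConsistent b && sumMatches (valOf a) (valOf b) v
  If t br e v ->
    locallyConsistent t && locallyConsistent e
      && valOf t == B (br == Then)   -- branch label must match the test value
      && v == valOf e                -- result is a copy of the branch value

-- The consistent example 3 + 4 = 7 and the two inconsistent examples.
good, sillyPlus, wrongBranch :: PNode
good        = Plus (Const (I 3)) (Const (I 4)) (I 7)
sillyPlus   = Plus (Const (I 2)) (Const (I 2)) (I 5)
wrongBranch = If (Const (B True)) Else (Const (I 2)) (I 2)
```

Here `locallyConsistent good` holds, while `sillyPlus` fails the function-meaning check and `wrongBranch` fails the branch-label check, mirroring the two inconsistent graphs discussed above. Global consistency is not addressed by this sketch.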

4 The provenance graph semantics

In this section we will show how to construct a consistent provenance graph by evaluating DFL expressions to construct both a value and evaluation graph, with appropriate links. Figure 2 illustrates the semantics schematically using some graph rewriting rules, where the rounded boxes describe the expression structure. Each rule schematically shows how an expression locally evaluates to values and expression nodes. These rules can be applied to build a provenance graph "bottom-up".
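As a rough illustration of bottom-up construction, the sketch below builds a graph for a tiny fragment (integer literals, addition, variables and let) by threading a node counter and a node list explicitly through the traversal. It is a simplified assumption and not the definition given in this paper: the names Expr, ECon, VCon, link, build and intAt are hypothetical, the graph is a plain list of triples, and each variable occurrence is given its own node whose value is a copy of the bound node, which is only one of several possible design choices.

```haskell
type Node = Int

-- A tiny expression fragment: literals, addition, variables, let.
data Expr = Lit Int | Add Expr Expr | Var String | Let String Expr Expr

-- Evaluation and value constructors for this fragment.
data ECon = ELit Int | EAdd Node Node | EVar String | ELet String Node Node
  deriving (Eq, Show)
data VCon = VInt Int | VCopy Node
  deriving (Eq, Show)

type Graph = [(Node, ECon, VCon)]   -- node id, evaluation label, value label
type Env   = [(String, Node)]       -- variable name -> node holding its value
type St    = (Int, Graph)           -- next fresh id and the graph built so far

-- Add a node with the next fresh identifier and return that identifier.
link :: ECon -> VCon -> St -> (Node, St)
link ec vc (next, g) = (next, (next + 1, g ++ [(next, ec, vc)]))

-- Follow copy links until an integer value is reached.
intAt :: Graph -> Node -> Int
intAt g n = case [vc | (m, _, vc) <- g, m == n] of
  (VInt i  : _) -> i
  (VCopy m : _) -> intAt g m
  []            -> error ("unknown node " ++ show n)

-- Build subgraphs for the subexpressions first, then link the parent node.
build :: Env -> Expr -> St -> (Node, St)
build env expr st = case expr of
  Lit i -> link (ELit i) (VInt i) st
  Var x -> case lookup x env of
    Just n  -> link (EVar x) (VCopy n) st
    Nothing -> error ("unbound variable: " ++ x)
  Add a b ->
    let (na, st1) = build env a st
        (nb, st2) = build env b st1
        total     = intAt (snd st2) na + intAt (snd st2) nb
    in  link (EAdd na nb) (VInt total) st2
  Let x e body ->
    let (n1, st1) = build env e st
        (n2, st2) = build ((x, n1) : env) body st1
    in  link (ELet x n1 n2) (VCopy n2) st2
```

For example, `build [] (Let "x" (Lit 3) (Add (Var "x") (Var "x"))) (0, [])` produces five nodes, and following the copy link from the root let-node reaches the value 6. The explicit passing of `st` through every case is the kind of bookkeeping that a monadic formulation hides.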

[Figure 2: Graph rewriting rules for constructing provenance graphs]

4.1 An executable definition

Defining an algorithm to actually construct a provenance graph can be tricky due to the large number of details we need to manage. We present a definition using Haskell [18], a functional language that provides sophisticated facilities for defining side-effecting operations such as those involved in building a graph. Using these features we can define a function that traverses an expression, evaluates it and constructs the associated provenance graph in only a few hundred lines of code. The Haskell program is a precise definition that is still relatively readable and clear. Moreover, it is an executable specification that can be used to generate small examples and experiment with alternative definitions.

In general while evaluating an expression with free variables, we need to keep track of the values associated with variables, so we will introduce a little more notation to help with this bookkeeping.

For a finite set X of variables, a value assignment for X is a mapping σ : X → Val that assigns to each x ∈ X a complex value. We can model value assignments in a provenance graph G as follows. Let γ : X → G_E be a function mapping variable names to variable nodes of the evaluation graph of G, such that for each x ∈ X, we have lab_x(γ(x)). Then for each x, the value node γ(x) corresponds to a complex value. In particular, given an ordinary assignment σ : X → Val, we can always construct a graph G that defines value nodes for the values of σ(x) and whose evaluation nodes correspond exactly to the variables in X.

In Haskell, it is more convenient to represent the graph as a collection of finite maps from nodes to datatypes that we shall call constructors. In our implementation, a single node will correspond to a pair (m, n) of an expression and value node in the previous development. We map each node to a value constructor and optionally an evaluation constructor. Constructors encode both the node and edge labels. For example, we use a constructor EIf True n1 n2 to represent an if node with test-edge to n1, a test value of True, and then-edge to n2. This approach builds many of the basic validity properties of provenance graphs into the Haskell type system, making it easier to avoid trivial bugs. Of course, the constructor-based graph can be translated to the explicitly edge-labeled graphs used earlier. Figure 8 shows the basic datatypes for nodes, variables, contexts, and evaluation and value constructors in Haskell. Note that for simplicity we use Haskell's built-in list type for collections; we also restrict attention to ordered pairs rather than general records. These differences are inessential.

    data Node = Node Int deriving (Eq,Ord,Show)
    type Var = String
    type Ctx = Var -> Node
    data VCon = VInt Int | VBool Bool
              | VPair Node Node
              | VSet [Node] | VCopy Node
    data ECon = EInt Int | EFun String [Node]
              | ELet Var Node Node | EVar Var
              | EPair Node Node | EProj Int Node
              | EBool Bool | EIf Bool Node Node
              | EEmpty | ESng Node | EUnion Node Node
              | EFor Node Var [Node] | EFlatten Node
    data Graph = Graph {emap :: Map Node (Maybe ECon),
                        vmap :: Map Node VCon}

Figure 8: Haskell code defining provenance graphs. The type Map a b consists of finite maps from type a to b, from the Haskell standard library.

We also employ a feature of Haskell called monads [21] to structure the computation of the provenance graph for a given expression. Basically, a monad is a generic type M a. A value of type M a is a computation that produces a value of type a and may have some side-effects. Because Haskell is a pure functional language, all side-effects need to be encapsulated within a monad. Monads can also handle contextual information such as tracking the current values of variables.

Our monad will employ the type:

    type M a = Ctx -> Int -> Graph -> (Int,Graph,a)

Here, Ctx represents the variable context, the Int parameter/result is a counter used for generating fresh node ids, and the Graph parameter/result is the graph being built. The definition of the monad type and its operations in Haskell is slightly different for technical reasons. For presentation reasons we suppress these differences.

A monad is always equipped with two operations, here called ">>=" (or "bind") and "return". The "return" operation simply takes a value of type a and produces a monad returning that value:

    return :: a → M a
    return a = λγ.λi.λG.(i, G, a)

Furthermore, the "bind" operation takes an M a (i.e., a computation producing values of type a) and a function from a to M b and produces a computation M b returning a value of type b. Its definition is as follows:

    >>= :: M a → (a → M b) → M b
    f >>= g = λγ.λi.λG.
                let (i′, G′, a) = f γ i G in (g a) γ i′ G′

Finally, we define a number of operations that allow us to read the current state of the computation or perform a side-effecting operation. We give some examples in detail and then just describe the remaining operations.

To read the current value of a variable in the context, we define the lookup operation:

    lookup :: Var → M Node
    lookup x = λγ.λi.λG.(i, G, γ(x))

To create a fresh node, we define the following monadic operation that creates a new node using the current index and increments the index:

    fresh :: M Node
    fresh = λγ.λi.λG.(i + 1, G, Node(i))

Similar operations can be defined to access the current context, add a variable binding to the context, and so on. These operations are shown in Figure 3. We also include monadic versions of primitive functions (interp) and complex operations such as flattening (⋃). The operation link econ vcon extends the graph with a new node n bound to the constructors econ and vcon, returning n.

    bindVar :: Var → Node → M a → M a
    getVCon :: Node → M VCon
    link    :: ECon → VCon → M Node
    interp  :: String → VCon → M VCon
    flatten :: [Node] → M [Node]

Figure 3: Graph monad operations

Figures 4, 6 and 7 show how to evaluate an expression in a provenance graph, producing a value node in an augmented provenance graph. The helper functions ⟦−⟧† and ⟦−⟧∗ shown in Figure 5 help simplify the definition. First, ⟦e⟧† simply evaluates an expression e to its value node and also returns the node's constructor. Second, ⟦e⟧∗_{x∈vs} evaluates e repeatedly, with x bound in turn to each element of vs. The examples in Figure 1 were generated using the Haskell implementation and the graphviz Unix tool (http://www.graphviz.org).

    ⟦−⟧ :: Expr → M Node
    ⟦x⟧ = lookup x
    ⟦let x = e1 in e2⟧ = ⟦e1⟧ >>= λn1.
                         bindVar x n1 ⟦e2⟧ >>= λn2.
                         link (ELet x n1 n2) (VCopy n2)
    ⟦i⟧ = link (EInt i) (VInt i)
    ⟦f(e)⟧ = ⟦e⟧† >>= λ(n, v).
             interp f (v) >>= λv′.
             link (EFun f (n)) (v′)

Figure 4: Monadic semantics for building provenance graphs, part 1 (variables, let, primitive functions)

    ⟦−⟧† :: Expr → M (Node, VCon)
    ⟦e⟧† = ⟦e⟧ >>= λn.
           getVCon(n) >>= λv.
           return (n, v)
    ⟦−⟧∗_{x∈[]} = return []
    ⟦e⟧∗_{x∈n0:ns} = bindVar x n0 ⟦e⟧ >>= λn0′.
                     ⟦e⟧∗_{x∈ns} >>= λns′.
                     return (n0′ : ns′)

Figure 5: Helper functions ⟦e⟧† and ⟦e⟧∗_{x∈vs}.

5 Querying the provenance graph

The provenance graph is a relational structure, and as such there are a wide variety of languages available for querying the graph, ranging from simple path or reachability queries, to SQL-like relational queries, to more expressive languages supporting recursive queries, such as Datalog. Thus, in a sense the problem of querying the provenance graph is already solved by known techniques for querying arbitrary graphs that happen to be provenance.

However, it still seems to be a challenge to define known forms of provenance in databases in terms of provenance graphs. In this section, we sketch preliminary ideas towards this goal, using Datalog over the raw provenance graphs.

In order to define forms of where-provenance and why-provenance, we need a partial order on evaluation nodes. We start by ordering the evaluation nodes in G based on the following two orders:

1. Child(n1, n2) if n2 →^l n1 (i.e., there is an edge from n2 to n1)
2. Left(n1, n2) if
   • n1 is the test-node of a conditional and n2 is one of the branches
   • n1 is the head-node of a let or for-node n and n2 is a body-node of n

The partial order Before is defined as follows:

    Below(e, e) ← Eval(e)
    Below(e, f) ← Below(e, g), Child(g, f)
    Before(e, f) ← Below(e, f)
    Before(e, f) ← Before(e, e′), Below(e′, g), Left(g, h), Before(f, h)

where in the first rule we ensure safety by constraining e to be an evaluation node using a predicate Eval(e) that lists all evaluation nodes.

We also define the Copy relation on value nodes as the reflexive, transitive closure of the copy edge relation. We use the predicate Value(n) in the first rule to constrain n to be a value node, ensuring safety:

    Copy(n, n) ← Value(n)
    Copy(n, n′) ← Copy(n, n″), n″ →^copy n′

We will define where-provenance and why-provenance queries on pairs (e, v) such that v is reachable by a directed path (possibly including copy-links) from e. We will call such pairs instances.

7 5.2 Why-provenance (e1, e2) = e1 >>= λn1. J K J K e2 >>= λn2. Why-provenance was defined in [3] using witnesses. J K link(EPair(n1, n2)) (VPair(n1, n2))There, a witness to the existence of a part p of the output † πi(e) = e >>= λ(n, VPair(n1, n2)) of a query Q on input data d was defined as a subtree of J K J K link(EProj i n)(VCopy ni) the input d such that rerunning Q on the subtree still pro- b = link(EBool b)(VBool b) duces output part p. We generalize this idea as follows. J K † Let G be a provenance graph with a distinguished eval- if e then e1 else e2 = e >>= λ(n, VBool b). J K Jif bK uation node r whose value is v. Let U be a connected 0 then e1 >>= λn . subset of the result value nodes that contains v. Then a link(EIfJ K True n n0)(VCopy n0) witness to U in G is a consistent subgraph of G that con- 0 else e2 >>= λn . tains r and U. The why-provenance of V is then the set link(JEIfK False n n0)(VCopy n0) of all minimal witnesses to U in G. Alternatively, we could give a low-level definition that traverses the graph to construct a witness starting from Figure 6: Monadic semantics for building provenance a set of output value nodes, following similar lines to graphs, part 2 (pairs, conditionals) the low-level definition of Where above. We omit the details; the main differences are in rules for conditionals ∅ = link(EEmpty)(VSet []) and primitive functions where we continue tracking the J K dependencies of the test or input values respectively. {e} = e >>= λn. J K linkJ K (ESng n)(VSet [n]) † e1 ∪ e2 = e1 >>= λ(n1, VSet(vs1)). 5.3 Discussion J K J K† e2 >>= λ(n2, VSet(vs2)). This discussion suggests a number of interesting obser- JlinkK(EUnion n1 n2)(VSet(vs1 ++ vs2)) vations and questions which we will not attempt to re- {e | x ∈ e0} = e0 >>= λ(n0, VSet(ns)). J K J ∗K 0 solve here, including: e x∈vs >>= λns . 0 0 linkJ K (EFor n0 x ns )(VSet ns ) • The first definition of Where appears equivalent to S e = e >>= λ(n, VSet(ns)). 
the where-provenance model of [2]. Can we make this J K flattenJ K (ns) >>= λns0. connection precise? Likewise, can we formally relate link (EFlatten n)(VSet ns0) the why-provenance subgraphs to definitions of why- provenance or lineage for NRC (e.g. [11])? • The above definitions characterize where- Figure 7: Monadic semantics for building provenance provenance structurally by following a chain of graphs, part 3 (collections) copies from the input to the output (or vice versa). This exhibits a symmetry between querying the provenance graph “forward” vs. “backward”. Is this a unique feature 5.1 Where-provenance of where-provenance or are there other provenance We now can define a form of where-provenance on in- queries that have this property? stances (e, v) as follows: • In some prior work (e.g. [2]), forms of provenance have been defined by translating NRC queries to queries

Where((e1, v1), (e2, v2)) ← Before(e1, e2), that explicitly manage their own provenance informa- tion. Is this possible for provenance graphs? Copy(v2, v1) • We explored a number of design choices that could Intuitively, this says that the where-provenance of the be revisited. For example, we considered treating let value v2 returned by the evaluation ending at e2 is the transparently and avoiding using explicitly labeled vari- same as that of the value v1 returned by e1. Clearly, there able nodes. Another controversial choice was the use of should be a unique least (e1, v1) with respect to Before copy-links rather than directly sharing value nodes. Are so we can define the where-provenance as that instance these differences important, and how can we determine (e1, v1). This definition relies on the fact that our prove- this? nance graph already corresponds closely to the high- • How should updates be modeled? level view of where-provenance defined via annotation- We conclude this section with some sheer speculation. propagation in previous work [3, 2]. One important dif- Sensible provenance queries Consider all queries on ference is that here, we can refer to intermediate steps in provenance graphs expressible in, say, relational calculus the provenance graph. or Datalog. Clearly, this includes many strange queries

that test properties of the graph that seem irrelevant to provenance. For example, a query might select all value nodes that appear exactly three times in the graph, or all graphs that contain fewer than seventeen nodes, or all graphs such that function f is called an even number of times. Is there a natural characterization of “sensible” provenance queries?

Provenance query answerability. Consider the following problem: given two provenance graph queries q1, q2, can we answer q2 using the results of q1 (without access to the original graph, expression or input data)? If so, it is reasonable to say that q1 is more general than q2. This problem is an instance of the problem of answering queries using views, which has been studied for relational and XML databases. Can these results be applied or adapted to provenance queries?

Query rewriting and provenance query equivalence. Many equivalent queries become inequivalent in the provenance graph semantics; for example, union is no longer commutative. Does this matter, or is it acceptable to optimize queries using ordinary equivalence rules? Under what conditions is this reasonable?

Efficient provenance querying. Provenance graphs may be too large to construct or retain in practice. Given a provenance graph query q, can we compute the answer to q more efficiently and without materializing the full provenance graph? Can we compute provenance graphs “lazily” or for just a part of the result, in response to a query, rather than “eagerly” computing the whole graph?

6 Related Work

We have attempted to remain close to de facto standards for visualizing provenance as graphs, particularly the Open Provenance Model [22], which distinguishes between “process” (evaluation) and “artifact” (value) nodes. However, we have not included other aspects of OPM and have not tried to make our graphs fit OPM exactly. Collections are not modeled directly in OPM 1.0, although a proposal for describing the provenance of collections in OPM is under development [15]. OPM is an exchange and representation format for provenance information, and so some of the semantic issues we investigate are outside its scope.

Semantics and models of provenance have been studied in formal detail for some systems. Sroka et al. [25] develop a semantics for Taverna workflows based on a core language similar to NRC over lists, but including implicit coercions from elements to collections and also incorporating operations such as zip that are not expressible in plain NRC. Missier et al. [20] discuss lightweight lineage annotations for Taverna workflows but do not fully detail how to handle the full language presented in [25].

Graphical notations for provenance have been used extensively in many systems. Recent work of interest in some of these systems includes work on consistency and optimization for user views (abstractions) of provenance graphs [26] and on efficiently representing and storing provenance graphs over nested collections [1]. Again, however, these papers focus on structural aspects of provenance graphs as produced by some workflow system, whereas we are studying the relationship between a provenance graph and the computation performed by the system.

Our work is inspired partly by “provenance traces” [6], an approach to provenance for NRC in which evaluation of expressions yields both a value and a “trace”, or detailed record of evaluation. Traces in that work are complete in the sense that they can be used to replay the computation under arbitrary changes to the input; our provenance graphs do not try to support replay under arbitrary changes, but may therefore be more compact. Moreover, our provenance graphs explicate the relationship between workflow and database provenance models, a question not addressed in [6].

Another closely related line of work is on the NRC-based dataflow language DFL, starting with Hidders et al. [17]. Hidders et al. [16] developed a run semantics for dataflow calculus programs along with a sketch of how to extract provenance from runs. Subsequently, Kwasnikowska and van den Bussche [19] showed how to represent the run semantics using OPM. This work also modeled nondeterministic service (external function) calls and modeled procedures using OPM accounts. These refinements have been left out in our presentation but can easily be handled. Our work improves on this approach by directly defining a provenance graph semantics that seems closer to typical workflow provenance, and by avoiding some technical complications involved in the previous approach, particularly the use of paths to refer to parts of values and expressions indirectly. On the other hand, our pervasive use of freshly generated graph nodes introduces complications of its own, which we have addressed using the Haskell monadic programming style. We also discuss specific provenance queries defined using Datalog queries on the provenance graph, although we only have preliminary results and many open questions are evident.

Although the semantics we propose here may not be universally acceptable or usable off-the-shelf in these other settings, we believe the methods we outline are re-usable. In particular, the idea of using Haskell-style monads to define a precise semantics is very flexible, since monads can incorporate many other kinds of side-effects besides fresh node identifier generation. It may be worthwhile to model the provenance semantics of more realistic workflow languages in Haskell, facilitating direct comparisons of the behavior and provenance of different workflow languages.

7 Conclusions

Although provenance has been studied in both database and workflow settings for more than a decade, little has been done to relate the approaches. In this paper we make two contributions in this direction, in the context of the previously-introduced dataflow calculus DFL. First, we detail a semantics that evaluates dataflow calculus expressions to provenance graphs containing values, evaluation nodes, and links showing how the expression evaluated, and we discuss interesting kinds of queries on the resulting structure, related to where-provenance and why-provenance in databases. Second, we present a concise and precise formal version of this model implemented using Haskell, a high-level functional language.

We believe our work helps bridge a gap between the theoretical approaches that so far have largely been employed for database provenance and the practical, but sometimes loosely-specified, techniques developed in workflow systems. We also identified a number of interesting research questions concerning where- and why-provenance queries over the provenance graph and their relationships to these forms of provenance in databases. We are investigating these in ongoing work. Although there is still room for debate about the particular design choices made in our approach, our formal model at least makes it easier to hold such a debate, and to experiment with alternatives.

Acknowledgments We have discussed ideas related to this work with Amal Ahmed. This work has been supported by EPSRC grant EP/F028288/1.

References

[1] Anand, M. K., Bowers, S., McPhillips, T., and Ludäscher, B. Efficient provenance storage over nested data collections. In EDBT (New York, NY, USA, 2009), ACM, pp. 958–969.
[2] Buneman, P., Cheney, J., and Vansummeren, S. On the expressiveness of implicit provenance in query and update languages. ACM Transactions on Database Systems 33, 4 (November 2008), 28.
[3] Buneman, P., Khanna, S., and Tan, W. Why and where: A characterization of data provenance. In ICDT (2001), no. 1973 in LNCS, Springer, pp. 316–330.
[4] Buneman, P., Khanna, S., and Tan, W. On propagation of deletions and annotations through views. In PODS (2002), pp. 150–158.
[5] Buneman, P., Naqvi, S. A., Tannen, V., and Wong, L. Principles of programming with complex objects and collection types. Theor. Comp. Sci. 149, 1 (1995), 3–48.
[6] Cheney, J., Acar, U. A., and Ahmed, A. Provenance traces. CoRR abs/0812.0564 (2008).
[7] Cheney, J., Ahmed, A., and Acar, U. A. Provenance as dependency analysis. In DBPL 2007 (Vienna, Austria, September 2007), M. Arenas and M. I. Schwartzbach, Eds., no. 4797 in LNCS, Springer-Verlag, pp. 139–153.
[8] Cheney, J., Chiticariu, L., and Tan, W. C. Provenance in databases: Why, how, and where. Foundations and Trends in Databases 1, 4 (2009), 379–474.
[9] Cui, Y., Widom, J., and Wiener, J. L. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst. 25, 2 (2000), 179–227.
[10] Foster, I., Vockler, J., Wilde, M., and Zhao, Y. Chimera: A virtual data system for representing, querying, and automating data derivation. In SSDBM (July 2002), pp. 1–10.
[11] Foster, J. N., Green, T. J., and Tannen, V. Annotated XML: queries and provenance. In PODS (2008), pp. 271–280.
[12] Freire, J., Koop, D., and Moreau, L., Eds. IPAW (2008), vol. 5272 of Lecture Notes in Computer Science, Springer.
[13] Green, T. J., Karvounarakis, G., and Tannen, V. Provenance semirings. In PODS (2007), ACM, pp. 31–40.
[14] Groth, P., Luck, M., and Moreau, L. A protocol for recording provenance in service-oriented grids. In OPODIS ’04 (2004).
[15] Groth, P., Miles, S., Missier, P., and Moreau, L. A proposal for handling collections in the open provenance model, 2009.
[16] Hidders, J., Kwasnikowska, N., Sroka, J., Tyszkiewicz, J., and Van den Bussche, J. A formal model of dataflow repositories. In DILS (2007), vol. 4544 of LNCS, Springer, pp. 105–121.
[17] Hidders, J., Kwasnikowska, N., Sroka, J., Tyszkiewicz, J., and Van den Bussche, J. DFL: A dataflow language based on petri nets and nested relational calculus. Inf. Syst. 33, 3 (2008), 261–284.
[18] Hutton, G. Programming in Haskell. Cambridge University Press, 2007.
[19] Kwasnikowska, N., and Van den Bussche, J. Mapping the NRC dataflow model to the open provenance model. In Freire et al. [12], pp. 3–16.
[20] Missier, P., Belhajjame, K., Zhao, J., Roos, M., and Goble, C. A. Data lineage model for taverna workflows with lightweight annotation requirements. In Freire et al. [12], pp. 17–30.
[21] Moggi, E. Notions of computation and monads. Inf. Comput. 93, 1 (1991), 55–92.
[22] Moreau, L., Freire, J., Futrelle, J., McGrath, R. E., Myers, J., and Paulson, P. The open provenance model: An overview. In Freire et al. [12], pp. 323–326.
[23] Muniswamy-Reddy, K.-K., Holland, D. A., Braun, U., and Seltzer, M. Provenance-aware storage systems. In USENIX ATC (June 2006), USENIX, pp. 43–56.
[24] Simmhan, Y. L., Plale, B., and Gannon, D. Karma2: Provenance management for data-driven workflows. Int. J. Web Service Res. 5, 2 (2008), 1–22.
[25] Sroka, J., Hidders, J., Missier, P., and Goble, C. A formal semantics for the taverna 2 workflow model. Journal of Computer and System Sciences In Press, Corrected Proof (2009).
[26] Sun, P., Liu, Z., Davidson, S. B., and Chen, Y. Detecting and resolving unsound workflow views for correct provenance analysis. In SIGMOD (New York, NY, USA, 2009), ACM, pp. 549–562.

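As an illustration of the where-provenance rule stated before Section 6, the following is a minimal Haskell sketch, not the implementation described in the paper. It assumes that Before and Copy are available as finite, already transitively closed relations, and that an "instance" is a pair of an evaluation node and the value it returns. The names Graph, whereRel, whereProv, and example, and the use of integers as node identifiers, are hypothetical choices made for this sketch.

```haskell
import Data.List (nub)

-- Hypothetical node identifiers; the paper's graphs use freshly generated nodes.
type Eval     = Int
type Value    = Int
type Instance = (Eval, Value)   -- assumed: the value is returned by the evaluation

data Graph = Graph
  { instances :: [Instance]       -- all (evaluation, returned value) pairs
  , before    :: [(Eval, Eval)]   -- Before(e1, e2), assumed transitively closed
  , copy      :: [(Value, Value)] -- Copy(v2, v1): v2 is a copy of v1, assumed closed
  }

-- Where((e1, v1), (e2, v2)) <- Before(e1, e2), Copy(v2, v1)
whereRel :: Graph -> [(Instance, Instance)]
whereRel g =
  [ ((e1, v1), (e2, v2))
  | (e1, v1) <- instances g
  , (e2, v2) <- instances g
  , (e1, e2) `elem` before g
  , (v2, v1) `elem` copy g
  ]

-- The where-provenance of an instance: the unique least source with
-- respect to Before, or Nothing if no such unique source exists.
whereProv :: Graph -> Instance -> Maybe Instance
whereProv g target =
  case [ c | c <- cands, all (leq c) cands ] of
    [c] -> Just c
    _   -> Nothing
  where
    cands   = nub [ src | (src, tgt) <- whereRel g, tgt == target ]
    leq c d = c == d || (fst c, fst d) `elem` before g

-- A three-step chain: value 10 is copied to 11, which is copied to 12.
example :: Graph
example = Graph
  { instances = [(1, 10), (2, 11), (3, 12)]
  , before    = [(1, 2), (2, 3), (1, 3)]
  , copy      = [(11, 10), (12, 11), (12, 10)]
  }

main :: IO ()
main = do
  print (whereProv example (3, 12))  -- prints: Just (1,10)
  print (whereProv example (1, 10))  -- prints: Nothing
```

In this example the instance (3, 12) has two candidate sources, (1, 10) and (2, 11). The first is least with respect to Before, so it is reported as the where-provenance. The intermediate candidate (2, 11) remains visible in whereRel, which reflects the remark that intermediate steps can be referred to in the provenance graph.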