Dynamic Provenance for SPARQL Updates using Named Graphs

Harry Halpin James Cheney Consortium University of Edinburgh [email protected] [email protected]

Abstract nance for collections or other structured artifacts. The workflow community has largely focused on static prove- The (Semantic) Web currently does not have an offi- nance for atomic artifacts, whereas much of the work on cial or de facto standard way exhibit provenance infor- provenance in databases has focused on static or dynamic mation. While some provenance models and annota- provenance for collections (e.g. tuples in relations). tion techniques originally developed with databases or The workflow community has largely focused on workflows in mind transfer readily to RDF, RDFS and declaratively describing causality or derivation steps of SPARQL, these techniques do not readily adapt to de- processes to aid repeatability for scientific experiments, scribing changes in dynamic RDF datasets over time. and these requirements have been a key motivation for Named graphs have been introduced to RDF motivated the Open Provenance Model (OPM) [12, 11], a vocab- as a way of grouping triples in order to facilitate anno- ulary and data model for describing processes including tation, provenance and other descriptive . Al- (but certainly not limited to) runs of scientific workflows. though their semantics is not yet officially part of RDF, While important in its own right, and likely to form the there appears to be a consensus based on their syntax basis for a W3C standard interchange format for prove- and semantics in SPARQL queries. Meanwhile, up- nance, OPM does not address the semantics of the prove- dates are being introduced as part of the next version nance represented in it — one could in principle use it ei- of SPARQL. In this paper we explore how to adapt the ther to represent static provenance for processes that con- dynamic copy-paste provenance model of Buneman et struct new data from existing artifacts, or to represent dy- al. [2] to RDF datasets that change over time in re- namic provenance that represents how data change over sponse to SPARQL updates, how to represent the result- time. Most applications of OPM, however, seem to focus ing provenance records themselves as RDF using named on static provenance, and so it may require further exten- graphs, and how the provenance information can be pro- sion to handle dynamic provenance. Conversely, many vided as a SPARQL end-point. elements of the OPM vocabulary may not be necessary for describing changes to RDF data over time. 1 Introduction It is an open question on how to best access prove- nance data for RDF. Currently provenance techniques are It is becoming routine to publish scientific and govern- not easily usable to the vast majority of developers and mental data on the Web as RDF (the Resource Descrip- other people working in open data. If we want RDF to tion Framework, a W3C standard for structured data on become a leading format for open data, having an easy the Web). In doing so, it is crucial not to lose track of and straightforward manner to access provenance data its provenance, including both its derivation process and could be crucial. We believe that dynamic provenance changes to it over time. There are many different kinds of techniques should be as easy to use as version control. provenance that possibly could be associated with RDF. Version control is a classic and well-studied problem for There is a difference between static provenance for pro- information systems ranging from source code manage- cesses that create new artifacts from existing ones, ver- ment systems to temporal databases and data archiving. sus dynamic provenance that describes how artifacts have Version control systems such as CVS, Subversion, and evolved over time. Second, there is a difference between Git have a solid track record of tracking versions and provenance for atomic artifacts that expose no internal (coarse-grained) provenance for text (e.g. source code) structure as part of the provenance record, versus prove- over time. Recent work on provenance in relational databases proposals for versioning and provenance-tracking for has aimed precisely at “provenance as version control”. RDF graphs and SPARQL updates. Foundational work on provenance for queries distin- guishes between where-provenance, which is the “loca- 2 Named graphs for dynamic provenance tions in the source databases from which the data was ex- tracted,” and why-provenance, which is “the source data From an early stage it was recognized that provenance that had some influence on the existence of the data” [3]. management in RDF requires the ability to refer to triples There is less work considering provenance for updates; or graphs. Hence, there is unofficial yet widely imple- our previous work focused on simple atomic update oper- mented support for named graphs, where each graph G ations: insertion, deletion, and copy [2]. A provenance- is identified with a name URI [5]. The idea of using aware database should support queries that permit users named graphs for provenance is not new, but previously to find the ultimate or proximate ‘sources’ of some data, has been restricted to making and propagating trust mea- explore its change history over time, or apply data prove- sures [9] or authorship assertions/signatures [5] through nance techniques to understand why a given part of the inferences or queries. We believe that the named graph data was inserted or deleted by a given update. A simple approach is also well-suited to recording detailed prove- approach that records the full provenance of every part of nance traces for dynamic data. the data during every update would likely lead to an ex- Our proposal is that the provenance information plosion of database size, so current research is studying should be accessible from the name URI of a named ways of optimizing provenance storage and new formal graph by asking for the provenance content associated models for relational data [2]. with the graph. Obviously, the provenance store for How can we track provenance for dynamic RDF each graph should be accessible, ideally using Linked data using current components of the Semantic Web? Data principles, from the name URI of the named graph. Static provenance techniques have now been investigated Moreover, named graphs can be used to represent both for RDF, including provenance-tracking for SPARQL past versions of a graph and changes to the graph in or- queries (see [15, 10]) and for RDFS inferences (see [4, der to support both versioning and dynamic provenance. 8]) over RDF datasets. Some of this work considers up- In practice on the Web of Data, the graph name is usually dates, particularly Flouris et al. [8], who consider the the URI used to access the the RDF graph or the domain problem of how to maintain provenance information for name of the server. For example, a group of RDF triples RDFS inferences when tuples are inserted or deleted, us- from DBPedia about Paris may be named by http:// ing the coherence semantics. In this paper we consider www.dbpedia.org/data/Paris. One could imagine a provenance model for SPARQL queries and updates that past versions of this graph might be named http: to data stores involving named graphs, whose purpose //www.dbpedia.org/data/Paris v1, etc., with addi- is to provide a record of how the raw data in a dataset tional URIs representing the updates that have been ap- has changed over time, and then tackle the problem of plied in a machine-readable format, along with conven- recording and providing (queryable) access to the de- tional provenance metadata about the updates such as au- tailed change history of an RDF graph that is updated thor, time, and the text of the update. over time via SPARQL updates. However, one problem with minting a new name Our hypothesis is that a simple vocabulary, composed URI for each past version of the graph would be that of insert, delete, and copy operations as introduced by there would soon be an overwhelming number of name Buneman et al. [2], along with explicit identifiers for up- URIs, with one for each change to the graph. While they date steps, versioning relationships, and metadata about could still be connected to each other via the provenance updates provides a flexible format for data provenance data, a higher-level alternative would be to use query on the Semantic Web. The harder problem of devis- parameters to qualify the base URI. So one could have ing a combined vocabulary for describing both static and http://www.dbpedia.org/data/Paris?version=1 dynamic provenance remains; nevertheless our proposal to retrieve the first version, http://www.dbpedia. sheds some light on the issues. A primary advantage of org/data/Paris?version=2 to retrieve the second, our methodology is it keeps the changes to raw data sepa- and so on. A “date” parameter to retrieve by date would rate from the changes in metadata, so legacy applications also be useful, such as http://www.dbpedia.org/ will continue to work and the cost of storing and provid- data/Paris?date=2010-08-30T21%3A33%2A50. ing access to provenance can be isolated from that of the To get the provenance metadata, one could then add raw data. a special “provenance” parameter like http://www. We will briefly review named graphs and their rela- dbpedia.org/data/Paris?provenance&version=2 tionship to versioning and provenance, then introduce and http://www.dbpedia.org/data/Paris? SPARQL queries and updates, and finally describe our provenance&date=2010-08-30T21%3A33%2A50.

2 PREFIX dc: HTTP/1.1 200 OK SELECT ?book ?who Date: Fri, 01 July 2011 20:55:12 GMT WHERE { ?book dc:creator ?who } Server: Apache/1.3.29 (Unix) PHP/4.3.4 DAV/1.0.3 Connection: close Content-Type: application/-results+ Figure 1: SPARQL Query Example. PREFIX dc: DELETE DATA { GRAPH { dc:creator "Harry Halpin" } } INSERT DATA { GRAPH { http://www.example/book/book5 dc:creator "Henry Story" Harry Halpin } } ... Figure 2: SPARQL Update Example

GET /sparql/?query=EncodedQuery \ Figure 4: SPARQL Response for Provenance Informa- &provenance-date=2010-08-30 \ tion &http://www.example/bookstore Host: www.example User-agent: sparql-client/0.1 Figure 2) for the name of the book before the edits were made on August 30th 2010. The user transmits this via HTTP GET, specifying the necessary provenance date of Figure 3: Asking for provenance using HTTP GET August 30th 2010 in Figure 3. Then the SPARQL end- point gives response with a return date using a prove- nance element in Figure 4. Note that multiple name URIs constructed with different parameters may result in the same graph.Furthermore, the same conventions could also be added to SPARQL 4 SPARQL Background Protocol for RDF [6] to allow easy access of RDF as shown in an example explained below in Section 3. Queries Let Lit be a set of literals (e.g. strings), let Id be a set of resource identifiers, and let Var be a set of variables usually written ?X. We write Atom = Lit ∪ Id 3 Strawman Example for the set of atomic values, that is literals or ids. The syntax of core algebra for SPARQL discussed in [1] is as Our approach is complementary to techniques for pub- follows: lishing or disseminating versioned Web resources (e.g. the Memento protocol of Van de Sompel et al. [16]) or A,B,C ::= l ∈ Lit | ι ∈ Id | ?X ∈ Var accompanying static provenance or metadata (e.g. Cop- t ::= hABCi pens et al. [7]). Our advantage is to give a formal seman- 0 C ::= {t1,...,tn} | GRAPH A {t1,...,tn} | CC tics that could be used to optimize this provenance, and a R ::= BOUND(?x) | A = B | R ∧ R0 | R ∨ R0 | ¬R more simple query-parameter based method for access- 0 0 0 ing provenance that does not require new HTTP headers P ::= C | P . P | P UNION P | P OPT P and subsequent changes to server and client software. | P FILTER R Suppose we discover that the author of a book is Q ::= SELECT ?~X WHERE P | CONSTRUCT C WHERE P wrong in an RDF graph, and it needs to be changed. This is transmitted via this use of SPARQL Update to the Here, C denotes basic graph (or dataset) patterns that end point of the named graph for the bookstore, http:// may contain variables; R denotes conditions; P denotes example.com/bookStore, as given in Figure 1. Then patterns, and Q denotes queries. We do not distinguish another user-agent wishes to query (using the query in between subject, predicate and object components of

3 triples, so this is a mild generalization of [1], since real upd:load, upd:clear, upd:create, or SPARQL does not permit literals to appear in the predi- upd:drop. cate position or as the name of a graph in the GRAPH A {P} • For all updates except create, an upd:input edge pattern. We also do not consider blank nodes, which pose linking ui to G_vi. complications especially when updates are concerned. • For all updates except drop, an upd:output edge Another point to note is that we allow named graph pat- linking ui to G_vi+1. terns only as part of basic patterns. The semantics of core • For insert and delete updates, an edge SPARQL algebra (from [1]) is given in the appendix. ui upd:data G_ui where G_ui is a named graph containing the triples that were inserted or Updates We will employ a simplified core language deleted by ui. for atomic updates, based upon [14]: • Edges ui upd:source n linking each update to each named graph n that was consulted by ui. For U ::= INSERT {C} WHERE P | DELETE {C} WHERE P an insert or delete, this includes all graphs that were | LOAD g INTO g0 | CLEAR g | CREATE g | DROP g consulted while evaluating P (note that this may only be known at run time); for a load update, this is The SPARQL Update working draft specifies that trans- the name of the graph whose contents were loaded; actions consisting of multiple updates should be applied create, drop and clear updates have no sources. atomically, but leaves some semantic questions unre- • Additional edges from ui providing metadata (such solved, such as whether aborted transactions have to roll- as author, commit time, log message, or the source back partial changes. It also does not specify whether up- text of the update) for the update; possibly using dates in a transaction are applied sequentially (as in most a standard vocabulary such as Dublin Core, or us- imperative languages), or using a snapshot semantics (as ing OPM-style agent nodes and wasControlledBy in most database update languages). Both alternatives edges. We omit the details of this metadata. pose complications, so in this paper we focus on trans- Note that this representation does not directly link actions consisting of single atomic updates, leaving the triples in a given version to places from which they were general case for future work. “copied”. However, it does provide enough information to recover this on request. Moreover, if we retain the 5 Provenance model source text of the update statements performed by each update, as well as all intermediate versions of the graph, A single SPARQL update can read from and write to sev- we can trace backwards through the update sequence to eral named graphs (and possibly also the default graph). identify places where triples were inserted or copied into For simplicity, we restrict attention to the problem of or deleted from the graph. tracking the provenance of updates to a single (possi- This is a high-level, logical model of the provenance of bly named) RDF graph. All operations may still use a graph as it evolves over time, along with all of its inter- the default graph or other named graphs in the dataset mediate versions. It is not a concrete proposal for how to as sources. The general case can be handled using the store or query this graph information efficiently. We ex- same ideas as for a single anonymous graph, only with pect that sharing techniques similar to those used by ver- more bureaucracy to account for versioning of all of the sion control systems or temporal databases can be used named graphs managed in a given dataset. to store the sequence of versions efficiently; once this is We model the provenance of a single RDF graph G done, the contents of the named graphs G_ui that store that is updated over time as a set of history records, inserted or deleted triples can be represented efficiently including a special graph named prov_G and addi- by just storing the insert or delete statements themselves tional auxiliary named graphs such as G_v0,. . . ,G_vn and recovering the graphs lazily on demand. However, and G_u1...,G_um. Intuitively, G_vi is the named graph we have not implemented this model and developing a showing G’s state in version i and G_ui is another named practical implementation is future work. graph showing the triples inserted into or deleted from G by update i. 6 Provenance semantics The provenance graph prov_G includes several kinds of nodes and edges: For queries, we consider a simple form of provenance • G_vi upd:nextVersion G_vi+1 edges that which calculates a set of named graphs “consulted” by show the sequence of versions; the query. (For the purposes of this paper, this is an ad • nodes u1,. . . ,un representing the updates that have hoc, syntactic notion.) This could be viewed as a simple been applied to G, along with a upd:type edge form of why-provenance, analogous to determining the linking to one of upd:insert, upd:delete, names of the relations mentioned in a database query.

4 r r r a b a b a b t s t t u c d d d u

g_v0 g_v1 g_v2

input input u1 output u2 output meta insert delete data m2 G_u1 m1 G_u2 5pm 4pm Harry INSERT { g {?x u ?y } James DELETE WHERE { } WHERE { g {?x s ?y . g {?x t ?y} a a ?y t ?z } } u s } u prov c d d

Figure 5: Example provenance record

Note, however, that unlike in a relational language, the as follows, symmetrically to creation: names of the graphs consulted by a query are dependent on the data, since a pattern such as GRAPH ?X {ha b ci} DROP g; can consult any graph that happens to contain ha b ci. DELETE WHERE {GRAPH prov {hg currentg vi }i}; We define the provenance of an atomic update by INSERT DATA {GRAPH prov { translation to a sequence of updates that, in addition hui type dropi,hui input g vii, to performing the requested updates to a given named hui meta mii,(metadata) graph, also constructs some auxiliary named graphs and }} triples in a special named graph for provenance informa- tion called prov. where g vi is the current version of g. Note that since this We consider only special cases of insert and delete operation deletes g, after this step the URI g no longer operations that target a single, statically known, named names a graph in the store; it is possible to create a new graph g; full SPARQL Updates (according to the current graph named g, which will result in a new sequence of graph) can simultaneously update several named graphs versions being created for it. The old chain of versions along with the default graph. will still be linked to g via the version edges, but there A graph creation CREATE g is translated to will be a gap in the chain. A clear graph operation CLEAR g is handled as follows: CREATE g; CREATE g v0; CLEAR g; INSERT DATA {GRAPH prov { DELETE WHERE {GRAPH prov {hg currentg vi }i}; hg version g v0i,hg current g v0i, INSERT DATA {GRAPH prov { hu1 type createi,hu1 output g v0i, hg version g v0i,hg current g v0i, hu1 meta mii,(metadata) hui type cleari,hui input g vii, }} hui output g vi+1i,hui meta mii, (metadata) A drop operation (deleting a graph) DROP g is handled }}

5 A load graph operation LOAD h INTO g is handled as References follows: [1] ARENAS,M.,GUTIERREZ,C., AND PEREZ, J. On the seman- tics of SPARQL. In Semantic Web Information Management: A LOAD h INTO g; Model Based Perspective, 1st ed. Springer, 2009. DELETE WHERE {GRAPH prov {hg currentg vi }i}; [2] BUNEMAN, P., CHAPMAN,A., AND CHENEY, J. Provenance INSERT DATA {GRAPH prov { management in curated databases. In Proceedings of the 2006 hg version g vi+1i,hg current g vi+1i, ACM SIGMOD international conference on Management of data hui type loadi,hui input g vii, (New York, NY, USA, 2006), SIGMOD ’06, ACM, pp. 539–550. hui output g vi+1i,hui source h ji, [3] BUNEMAN, P., KHANNA,S., AND TAN, W. Why and where: A characterization of data provenance. In ICDT (2001), no. 1973 in hui meta mii,(metadata) }} LNCS, pp. 316–330. [4] BUNEMAN, P., AND KOSTYLEV, E. Annotation algebras for RDFS. In SWPM (2010). where h j is the current version of h. [5] CARROLL,J.J.,BIZER,C.,HAYES, P., AND STICKLER, P. An insertion INSERT {GRAPH g {C}} WHERE P is trans- Named graphs. Web Semant. 3 (December 2005), 247–267. lated to a sequence of updates that creates a new version [6] CHARBONEAU,D., AND FEIGENBAUM, L. SPARQL 1.1 proto- and links it to URIs representing the update, as well as col for RDF. W3C Working Draft, January 2010. http://www. links to the source graphs identified by the query prove- w3.org/TR/2010/WD-sparql11-protocol-20100126/. nance semantics and a named graph containing the in- [7] COPPENS,S.,MANNENS,E., VAN DEURSEN,D.,HOCHSTEN- serted triples: BACH, P., JANSSENS,B., AND VAN DE WALLE, R. Publishing provenance information on the web using the Memento datetime content negotiation. In LDOW (2011). CREATE g ui; INSERT {GRAPH g ui {C}} WHERE P; [8] FLOURIS,G.,FUNDULAKI,I.,PEDIADITIS, P., THEOHARIS, Y., AND CHRISTOPHIDES, V. Coloring RDF triples to capture INSERT {GRAPH g {C}} WHERE P; provenance. In Proceedings of the 8th International Semantic CREATE g vi+1; Web Conference (Berlin, Heidelberg, 2009), ISWC ’09, Springer- LOAD g INTO g vi+1; Verlag, pp. 196–212. DELETE DATA {GRAPH prov {hg current g vii}}; [9] HARTIG, O. Querying trust in rdf data with tsparql. In Pro- INSERT DATA {GRAPH prov { ceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications (Berlin, Heidelberg, hg version g vi+1i,hg current g vi+1i, 2009), ESWC 2009 Heraklion, Springer-Verlag, pp. 5–20. hui input g vii,hui output g vi+1i, [10] LOPES,N.,POLLERES,A.,STRACCIA,U., AND ZIMMER- hui type inserti,hui data g uii MANN, A. AnQL: SPARQLing up annotated RDFS. In Pro- hui source S1i,...,hui source Smi, ceedings of the 9th international semantic web conference - Part I hui meta mii,(metadata)}} (Berlin, Heidelberg, 2010), ISWC’10, Springer-Verlag, pp. 518– 533.

where s1,...,sm are the source graph names of P. [11] MOREAU, L. Provenance-based reproducibility in the semantic A deletion DELETE {GRAPH g {C}} WHERE P is handled web. Journal of Web Semantics (2011). To appear. similarly to an insert, except for the update type annota- [12] MOREAU,L.,CLIFFORD,B.,FREIRE,J.,FUTRELLE,J.,GIL, tion. Y., GROTH, P., KWASNIKOWSKA,N.,MILES,S.,MISSIER, P., MYERS,J.,PLALE,B.,SIMMHAN, Y., STEPHAN,E., AND VANDEN BUSSCHE, J. The open provenance model core speci- fication (v1.1). Future Generation Computer Systems 27, 6 (June 7 Conclusion 2011), 743–756. [13] SCHENK,S.,GEARON, P., AND PASSANT, A. SPARQL 1.1 Provenance is a challenging problem for RDF. While Update. W3C Working Draft, October 2010. http://www.w3. some progress has been made on provenance and anno- org/TR/2010/WD-sparql11-update-20101014. tation for RDFS inferences and SPARQL queries, so far [14] UDREA,O.,RECUPERO,D.R., AND SUBRAHMANIAN, V. S. there has not been work on provenance for SPARQL Up- Annotated RDF. ACM Trans. Comput. Logic 11 (January 2010), A10. dates, partly because the update language itself is work in progress. In this paper we have outlined an approach to [15] VANDE SOMPEL,H.,SANDERSON,R.,NELSON,M.L.,BAL- AKIREVA,L.L.,SHANKAR,H., AND AINSWORTH, S. An the problem drawing on similar work in database archiv- HTTP-based versioning mechanism for linked data. In LDOW ing and copy-paste provenance, and sketched how this (2010). approach can be incorporated into existing protocols for accessing RDF datasets. We hope this will contribute to discussion of how to standardize descriptions of changes to RDF datasets, and possibly provide a way to translate changes to underlying (e.g. relational or XML) databases to RDF representations.

6