Dynamic Provenance for SPARQL Updates Using Named Graphs
Total Page:16
File Type:pdf, Size:1020Kb
Dynamic Provenance for SPARQL Updates using Named Graphs Harry Halpin James Cheney World Wide Web Consortium University of Edinburgh [email protected] [email protected] Abstract nance for collections or other structured artifacts. The workflow community has largely focused on static prove- The (Semantic) Web currently does not have an offi- nance for atomic artifacts, whereas much of the work on cial or de facto standard way exhibit provenance infor- provenance in databases has focused on static or dynamic mation. While some provenance models and annota- provenance for collections (e.g. tuples in relations). tion techniques originally developed with databases or The workflow community has largely focused on workflows in mind transfer readily to RDF, RDFS and declaratively describing causality or derivation steps of SPARQL, these techniques do not readily adapt to de- processes to aid repeatability for scientific experiments, scribing changes in dynamic RDF datasets over time. and these requirements have been a key motivation for Named graphs have been introduced to RDF motivated the Open Provenance Model (OPM) [12, 11], a vocab- as a way of grouping triples in order to facilitate anno- ulary and data model for describing processes including tation, provenance and other descriptive metadata. Al- (but certainly not limited to) runs of scientific workflows. though their semantics is not yet officially part of RDF, While important in its own right, and likely to form the there appears to be a consensus based on their syntax basis for a W3C standard interchange format for prove- and semantics in SPARQL queries. Meanwhile, up- nance, OPM does not address the semantics of the prove- dates are being introduced as part of the next version nance represented in it — one could in principle use it ei- of SPARQL. In this paper we explore how to adapt the ther to represent static provenance for processes that con- dynamic copy-paste provenance model of Buneman et struct new data from existing artifacts, or to represent dy- al. [2] to RDF datasets that change over time in re- namic provenance that represents how data change over sponse to SPARQL updates, how to represent the result- time. Most applications of OPM, however, seem to focus ing provenance records themselves as RDF using named on static provenance, and so it may require further exten- graphs, and how the provenance information can be pro- sion to handle dynamic provenance. Conversely, many vided as a SPARQL end-point. elements of the OPM vocabulary may not be necessary for describing changes to RDF data over time. 1 Introduction It is an open question on how to best access prove- nance data for RDF. Currently provenance techniques are It is becoming routine to publish scientific and govern- not easily usable to the vast majority of developers and mental data on the Web as RDF (the Resource Descrip- other people working in open data. If we want RDF to tion Framework, a W3C standard for structured data on become a leading format for open data, having an easy the Web). In doing so, it is crucial not to lose track of and straightforward manner to access provenance data its provenance, including both its derivation process and could be crucial. We believe that dynamic provenance changes to it over time. There are many different kinds of techniques should be as easy to use as version control. provenance that possibly could be associated with RDF. Version control is a classic and well-studied problem for There is a difference between static provenance for pro- information systems ranging from source code manage- cesses that create new artifacts from existing ones, ver- ment systems to temporal databases and data archiving. sus dynamic provenance that describes how artifacts have Version control systems such as CVS, Subversion, and evolved over time. Second, there is a difference between Git have a solid track record of tracking versions and provenance for atomic artifacts that expose no internal (coarse-grained) provenance for text (e.g. source code) structure as part of the provenance record, versus prove- over time. Recent work on provenance in relational databases proposals for versioning and provenance-tracking for has aimed precisely at “provenance as version control”. RDF graphs and SPARQL updates. Foundational work on provenance for queries distin- guishes between where-provenance, which is the “loca- 2 Named graphs for dynamic provenance tions in the source databases from which the data was ex- tracted,” and why-provenance, which is “the source data From an early stage it was recognized that provenance that had some influence on the existence of the data” [3]. management in RDF requires the ability to refer to triples There is less work considering provenance for updates; or graphs. Hence, there is unofficial yet widely imple- our previous work focused on simple atomic update oper- mented support for named graphs, where each graph G ations: insertion, deletion, and copy [2]. A provenance- is identified with a name URI [5]. The idea of using aware database should support queries that permit users named graphs for provenance is not new, but previously to find the ultimate or proximate ‘sources’ of some data, has been restricted to making and propagating trust mea- explore its change history over time, or apply data prove- sures [9] or authorship assertions/signatures [5] through nance techniques to understand why a given part of the inferences or queries. We believe that the named graph data was inserted or deleted by a given update. A simple approach is also well-suited to recording detailed prove- approach that records the full provenance of every part of nance traces for dynamic data. the data during every update would likely lead to an ex- Our proposal is that the provenance information plosion of database size, so current research is studying should be accessible from the name URI of a named ways of optimizing provenance storage and new formal graph by asking for the provenance content associated models for relational data [2]. with the graph. Obviously, the provenance store for How can we track provenance for dynamic RDF each graph should be accessible, ideally using Linked data using current components of the Semantic Web? Data principles, from the name URI of the named graph. Static provenance techniques have now been investigated Moreover, named graphs can be used to represent both for RDF, including provenance-tracking for SPARQL past versions of a graph and changes to the graph in or- queries (see [15, 10]) and for RDFS inferences (see [4, der to support both versioning and dynamic provenance. 8]) over RDF datasets. Some of this work considers up- In practice on the Web of Data, the graph name is usually dates, particularly Flouris et al. [8], who consider the the URI used to access the the RDF graph or the domain problem of how to maintain provenance information for name of the server. For example, a group of RDF triples RDFS inferences when tuples are inserted or deleted, us- from DBPedia about Paris may be named by http:// ing the coherence semantics. In this paper we consider www.dbpedia.org/data/Paris. One could imagine a provenance model for SPARQL queries and updates that past versions of this graph might be named http: to data stores involving named graphs, whose purpose //www.dbpedia.org/data/Paris v1, etc., with addi- is to provide a record of how the raw data in a dataset tional URIs representing the updates that have been ap- has changed over time, and then tackle the problem of plied in a machine-readable format, along with conven- recording and providing (queryable) access to the de- tional provenance metadata about the updates such as au- tailed change history of an RDF graph that is updated thor, time, and the text of the update. over time via SPARQL updates. However, one problem with minting a new name Our hypothesis is that a simple vocabulary, composed URI for each past version of the graph would be that of insert, delete, and copy operations as introduced by there would soon be an overwhelming number of name Buneman et al. [2], along with explicit identifiers for up- URIs, with one for each change to the graph. While they date steps, versioning relationships, and metadata about could still be connected to each other via the provenance updates provides a flexible format for data provenance data, a higher-level alternative would be to use query on the Semantic Web. The harder problem of devis- parameters to qualify the base URI. So one could have ing a combined vocabulary for describing both static and http://www.dbpedia.org/data/Paris?version=1 dynamic provenance remains; nevertheless our proposal to retrieve the first version, http://www.dbpedia. sheds some light on the issues. A primary advantage of org/data/Paris?version=2 to retrieve the second, our methodology is it keeps the changes to raw data sepa- and so on. A “date” parameter to retrieve by date would rate from the changes in metadata, so legacy applications also be useful, such as http://www.dbpedia.org/ will continue to work and the cost of storing and provid- data/Paris?date=2010-08-30T21%3A33%2A50. ing access to provenance can be isolated from that of the To get the provenance metadata, one could then add raw data. a special “provenance” parameter like http://www. We will briefly review named graphs and their rela- dbpedia.org/data/Paris?provenance&version=2 tionship to versioning and provenance, then introduce and http://www.dbpedia.org/data/Paris? SPARQL queries and updates, and finally describe our provenance&date=2010-08-30T21%3A33%2A50.