WEBSIG: A DIGITAL SIGNATURE FRAMEWORK FOR THE WEB

By

James P. McCusker

A Dissertation Submitted to the Graduate Faculty of Rensselaer Polytechnic Institute in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY Major Subject: COMPUTER SCIENCE

Examining Committee:

Deborah L. McGuinness, Dissertation Adviser

James Hendler, Member

Peter Fox, Member

Michel Dumontier, Member

Rensselaer Polytechnic Institute Troy, New York

July 2015 (For Graduation August 2015)

© Copyright 2015 by James P. McCusker
All Rights Reserved

CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENT
ABSTRACT

1. Introduction
   1.1 Background
   1.2 Legal Framework
   1.3 Use Case
   1.4 Introducing WebSig
   1.5 Organization

2. Repudiation, Trust, and Computability in Web Documents
   2.1 Trusting Computable Documents on the Web
      2.1.1 Sufficient Qualities for Verifiable Computable Signature Schemes
   2.2 Repudiation of a Signature
      2.2.1 Signatory Identification
      2.2.2 Intention to Sign
      2.2.3 Adoption of Document
   2.3 Minimizing Repudiation
      2.3.1 Signatory Identification
      2.3.2 Intention to Sign
      2.3.3 Adoption of Document

3. Related Digital Signature Schemes
   3.1 Non-Cryptographic Signatures
   3.2 Basic Digital Signatures
   3.3 XML Digital Signatures
   3.4 RDF Digital Signatures
   3.5 Conclusions

4. Parallel Identities for Managing Open Government Data
   4.1 Introduction
      4.1.1 Use case: Trusting Integrated Data
   4.2 Related Work
      4.2.1 RDF Conversion Tools
      4.2.2 Current Provenance Models
      4.2.3 Models from Library Science
      4.2.4 Existing Content-Based Cryptographic Digests
   4.3 Approach
   4.4 Methods
   4.5 Results
   4.6 Discussion
      4.6.1 Future Work
   4.7 Conclusions

5. Information Resource Provenance on the Web
   5.1 Introduction
      5.1.1 A Weather Example
   5.2 Background: Existing W3C Recommendations
   5.3 The semiotics of HTTP URLs
   5.4 FRBR and FRIR
   5.5 Explaining HTTP with FRBR, FRIR, and PROV-O
   5.6 Implementation
   5.7 Discussion
   5.8 Conclusion

6. RDF Graph Digest Algorithm 1
   6.1 Introduction
   6.2 Related Work
      6.2.1 Graph Canonicalization
   6.3 Implementation
   6.4 Evaluation Methods
   6.5 Evaluation
      6.5.1 Complexity
      6.5.2 Algorithm Portability
      6.5.3 Benchmark Results
   6.6 Discussion
      6.6.1 Future Work
   6.7 Conclusion

7. WebSig: A Digital Signature Framework for the Web
   7.1 Introduction
   7.2 The WebSig Signature Scheme
   7.3 Implementing WebSig
   7.4 Evaluation
      7.4.1 Use Case
      7.4.2 WebSig is Linkable
      7.4.3 WebSig is Attributable
      7.4.4 WebSig is Portable
      7.4.5 WebSig is Revisable
      7.4.6 WebSig is Verifiable
      7.4.7 Performance
   7.5 Conclusion

8. Discussion
   8.1 Future Work

9. Conclusion

References

LIST OF TABLES

3.1 Digital signature schemes and their properties.

4.1 Different levels of abstraction in the FRBR stack and how they are identified.

5.1 Class mappings between FRBR and PROV-O.

6.1 178 ontologies could not be loaded from Bioportal using RDFlib.

LIST OF FIGURES

1.1 The semantic web technology stack, or “layer cake,” as of 2007.

4.1 A simple use case where a data consumer must choose between the government’s original data or one of five data files offered by third parties.

4.2 The data products from the use case.

4.3 FRBR provenance when Data Integrators E and W retrieve two different URLs.

4.4 FRBR provenance when Data Integrator E converts the CSV to raw RDF.

4.5 FRBR provenance of the CSV, raw RDF, and a conversion of the raw RDF into RDF/XML.

4.6 FRBR provenance applying enhancement parameters to the CSV’s conversion to RDF.

5.1 The relationships between identifier, resource, and representation from Architecture of the World Wide Web.

5.2 AWWW’s URL and Resource correspond to the semiotic triangle’s Symbol and Referent, respectively.

5.3 Relating URIs, Resources, and Representations using FRIR, FRBR, and the semiotic triangle.

5.4 Results of applying pcurl.py to retrieve the weather result example.

5.5 An example of transcoding a histogram image from a large JPEG to a small thumbnail PNG.

5.6 An example of mirroring content between web sites.

6.1 The runtime of graphs without blank nodes.

6.2 The runtime of graphs with blank nodes.

6.3 Ti performance for non-blank node and blank node graphs, where Ti ≈ 0.00145S (R > 0.98, p < 2.22 × 10^−77).

6.4 Tc performance for blank node graphs, where Tc ≈ 0.00032B^1.88 (R > 0.95, p < 3.0 × 10^−94).

7.1 A web signature is a special kind of nanopublication, where the Assertion and PublicationInfo are identified by their graph digest.

7.2 Verification of a web signature.

7.3 The sequence used by the signer, signature requester, and signature service agents when a signer is actively interacting with the signature requester.

7.4 The sequence used by the signer, signature requester, and signature service agents when a signer is not actively interacting with the signature requester.

7.5 An example web signature assertion that has been signed via the Provenance and PublicationInfo graphs.

7.6 An example web signature PublicationInfo graph that includes an attribution of the Assertion graph to the Signing Agent.

7.7 An example web signature Provenance graph, with the signature itself, the public key and PublicationInfo graph it was derived from, and the Signing Agent it was attributed to.

7.8 The new agreement that allows Bob to access Alice’s date of birth, because it was removed from the list of restricted fields.

7.9 The submission request for the new agreement includes a statement that the new assertions are a revision of the old one, and that the old assertion was invalidated on 8/3/2014.

7.10 A screen shot of the signed PPO document with a description of the activities it will be used in.

7.11 Web signatures are legal signatures because they follow a chain of proof.

ACKNOWLEDGMENT

Proposition. ∀s ∈ PhDStudent → ∃f hasFamily(s, f) ∧ PatientFamily(f)
Behind every PhD student is a patient family.

Corollary. ∀s ∈ (Married ∧ PhDStudent) → ∃p hasSpouse(s, p) ∧ VeryPatientSpouse(p)
Behind every married PhD student is a very patient spouse.

I would like to dedicate this dissertation to my wife, Sarah McCusker, and my son, Ian McCusker. I wouldn’t have been able to do this without their support, understanding, and their ability to turn a blind eye to countless late nights and groggy mornings. I would also like to thank my mother, Patricia McCusker, and my grandmother, Clara Nowak, who have always believed in me even when I come up with crazy ideas like working while getting a PhD and raising a family. I would like to thank my brother and sister, Jason McCusker and Katie Chykirda, for putting up with a know-it-all big brother, and their spouses, Kate McCusker and Dan Chykirda, for letting me bore them with my work now and again at family gatherings. I would also like to thank family members who aren’t with us anymore, especially my father, Jim McCusker Jr., who helped me develop my interest in math and science.

I’d also like to thank the graduate students at Yale’s Pathology Informatics group for making it look so easy that I thought I had a chance to get a PhD done too. Thanks also to the US cancer informatics community and members of the World Wide Web Consortium Healthcare and Life Sciences Special Interest Group for encouraging my efforts. I wouldn’t have been able to do this without the support of my employers, Michael Krauthammer at Yale School of Medicine and Will FitzHugh at 5AM Solutions.

A big thanks to my fellow members of the RPI Tetherless World Constellation, especially my fellow students, Tim Lebo, Alvaro Graves, Dom Difranzo, and many others, for collaborations, ideas, and shoulders to whine on. Big thanks also to the staff at TWC, especially Jacky Carley, Patrick West, and John Erickson, who all make sure the lab keeps functioning. If I ever get a lab of my own, I hope to find staff at least half as good as you guys. An extra big thanks to my advisor, Deborah L. McGuinness, and my doctoral committee, Jim Hendler, Peter Fox, and Michel Dumontier, for providing wonderful and challenging feedback.

I would especially like to thank K. Krasnow Waterman for assisting me with Chapter 2, especially in guiding me through the methods of legal research and providing feedback on the requirements for minimal repudiability. As this dissertation evolved into the development and evaluation of a legally binding digital signature scheme for the web, we realized that, in order to evaluate the legal aspects of creating legally binding digital signatures on computable documents, we would need to consult a lawyer who was a technology expert and also well versed in semantics. We consulted K. because she is an internationally recognized expert in law, technology, privacy, and security, and is a visiting fellow at MIT. She has published papers with Dr. McGuinness on privacy, Dr. Hendler on big data analytics, and Dr. Tim Berners-Lee on semantic representations of legal policy. Even in our initial conversations, she immediately saw the value of this dissertation topic and provided valuable feedback on direction, and she has been an invaluable contributor to the process since.

I would like to thank the librarians at the CT State Library Law and Legislative Reference Unit, who helped immensely with my reference questions. I would like to thank Brendan McKay and Adolfo Piperno for helping me to understand the traces algorithm. I would also like to thank the RDFlib team, especially Urs Holzer and Jörn Hees, for their timely and thorough feedback on incorporating the RGDA1 implementation into RDFlib. Finally, I would like to thank Michel Dumontier for encouraging me to work towards a complete graph digest solution for all possible RDF graphs.

This dissertation was funded by the Weissman Family Graduate Fellowship, the Population Sciences Grid, the Semantic Sea Ice Interoperability Initiative, the Foresight and Understanding from Scientific Exposition (FUSE) project, the Repurposing Drugs with Semantics (ReDrugS) project, and 5AM Solutions, Inc.

ABSTRACT

WebSig is a digital signature scheme for the web that uses Resource Description Framework (RDF) graphs to express its documents, document metadata, and signature data in a way that leverages existing trustable digital signature schemes to create signatures on computable documents that are trustable and minimally repudiable. WebSig is a proof of concept that shows that a digital signature scheme for RDF can be trustable across any possible representation of an RDF document and minimize the opportunities for repudiation of those signatures. We demonstrate this by showing how digital signature schemes that are attributable, verifiable, linkable, revisable, and portable are also computable and trustable digital signature schemes. We also introduce evaluation criteria for those five qualities and demonstrate how WebSig provides all five. WebSig supports the verifiable signing of any RDF graph through the use of another contribution, the Functional Requirements for Information Resources (FRIR) information identity framework. FRIR is a provenance-driven identity framework that can provide interrelated identities for RDF graphs and other information resources. The RDF Graph Digest Algorithm 1 (RGDA1), a third contribution, provides an algorithm that can create platform-independent, cryptographically secure, reproducible identifiers for all RDF graphs. FRIR and the RGDA1 both supply the means to securely identify the signed document and any supporting RDF graphs, and are essential to supplying all five qualities needed to provide computable and trustable signatures. WebSig builds on existing technologies and vocabularies from the domains of cryptography, computer security, semantic web services, semantic publishing, library science, and provenance.

This dissertation’s contributions will be presented as follows: 1) a sufficiency proof that attributable, verifiable, portable, linkable, revisable digital signature schemes are trustable, computable, and minimally repudiable; 2) Functional Requirements for Information Resources (FRIR), a provenance-enabled, trustable, computable identity framework for information resources; 3) experimental evidence that RDF Graph Digest Algorithm 1 (RGDA1) provides reproducible identifiers for all RDF graphs in average-case polynomial time; and 4) WebSig, a framework that lets users create legally binding electronic documents that are both trustable and computable.

CHAPTER 1
Introduction

In this dissertation we set out to test the following hypothesis: WebSig is a digital signature scheme that preserves trust in digital signatures for RDF graphs across varied representations and revisions. It also provides a generalized mechanism for complex signatures, including signature proxies and witnesses, minimizing the opportunities for legal challenge to WebSig signatures. In Chapter 2 we will show that five qualities, attributability, linkability, portability, revisability, and verifiability, are sufficient to make digital signature schemes trustable, computable, and minimally repudiable. We also review the literature of existing secure digital signature schemes in Chapter 3 and evaluate them against our evaluation criteria for being trustable, computable, and minimally repudiable. In Chapters 4 and 5 we will present Functional Requirements for Information Resources (FRIR), an information resource vocabulary and identification strategy that helps WebSig be portable and revisable. Chapter 6 presents RDF Graph Digest Algorithm 1 (RGDA1), the first secure graph digest algorithm that can compute a unique, reproducible digest for all RDF graphs. RGDA1 is evaluated using a collection of 170 biomedical ontologies and an efficient average-case run time is reported. FRIR, RGDA1, and the RSA digital signature algorithm allow WebSig to be portable and verifiable. Chapter 7 presents the WebSig digital signature scheme and signature request protocol and shows how WebSig realizes all five qualities that we prove are sufficient to be trustable, computable, and minimally repudiable.

Since the inception of the Semantic Web, there has been a desire to provide a level of trust around data that is expressed in the web. A key feature enabling that trust has been digital signatures and cryptography (see Figure 1.1). However, conventional digital signatures have not taken hold within the semantic web. As the Resource Description Framework (RDF) evolved and became a larger goal, it became clear that graphs were more important than documents. This is partly because the data graph of RDF has become decoupled from any one data format: what was originally a focus on documents in the eXtensible Markup Language (XML) [1] quickly evolved into a need to provide any number of formats, including embedding RDF graphs in HTML [2] and JavaScript Object Notation (JSON) [3]. Further, Application Programming Interfaces (APIs) have been developed around Linked Data principles, where the data provided is serialized into formats on an as-needed basis using content negotiation [4]. This new environment makes it very difficult to simply sign a byte stream, as is usual practice in conventional digital signatures.

The groundwork was laid, however, when efforts began to formally identify graphs of data instead of serializations of data, resulting in initial attempts at RDF graph digest algorithms [6, 7]. These algorithms faced challenges because it is possible to construct RDF graphs that are difficult to identify unambiguously; in fact, a digest that identifies all RDF graphs must solve the graph isomorphism problem, for which no polynomial-time algorithm is known [7]. To solve this, in Chapter 6 we contribute a new RDF graph digest algorithm that can compute reproducible digests for a subset of RDF graphs we call Graphs of Practical Interest. These graphs can be efficiently signed by our algorithm in polynomial time and contain all graphs that are likely to be published as Linked Data.

Another major issue with digital signatures in the semantic web is that digital signatures do not contain clear semantics. Digital signature schemes by themselves do not provide a way to link to documents, signatures, or signatories in a web-friendly way. They also do not make it possible to easily issue revisions and revocations, determine if the signatory is actually a party to the document they have signed, or determine if the signatory is signing as a witness, signing on behalf of another, or serving in some other role.

RDF makes it possible to unambiguously express policies and contracts in the domains of user privacy [8], social media authorization [9], and public policy [10]. These efforts have focused on the ability to compute, or act, on various enabling or disabling statements. However, all of these systems rely on trusted import mechanisms to provide the actual policies to act on. In a decentralized environment like the web, it is very important to know that these authorizations have come from valid sources, have not been forged, and have not been revoked or modified.

Figure 1.1: The semantic web technology stack, or “layer cake,” as of 2007 [5]. Cryptography has remained a missing component in the semantic web stack, especially digital signatures.

In Chapters 4 and 5 we will integrate the World Wide Web Consortium’s Provenance Ontology (PROV-O) [11] with Functional Requirements for Bibliographic Records (FRBR) [12] to create Functional Requirements for Information Resources (FRIR). FRIR is a new way to express provenance about information resources, especially ones on the web. Chapter 6 introduces RGDA1, the first RDF graph digest algorithm that covers all RDF graphs. In Chapter 7 we will combine FRIR, conventional digital signatures, our new RDF graph digest algorithm, and a new convention for expressing assertions, as well as publication information and provenance about those assertions (called nanopublications [13]), to create a new digital signature scheme, WebSig. WebSig is semantic web-native and addresses the issues above, providing a means to make the semantic web simultaneously trustable and computable.

1.1 Background

Modern digital signature schemes are based on the concepts introduced by Diffie and Hellman [14] and expanded by Rivest et al. [15], Goldwasser et al. [16], and others. In a simplified form, signatories create and use public/private key pairs and then use the private key to encrypt the cryptographic digest, or unique identifying number, for a message or document. The details of how this digest is created depend on the algorithm, but generally, it must be a one-way (trapdoor) function [14] in order to prevent attackers from inferring the contents of a document from its digest. Further, for digest algorithms to be secure, it must be astronomically unlikely to produce digest collisions using different documents, either by chance or by attack. A digest collision occurs when the same digest number is computed from different documents. When this happens, it is possible to claim that the digital signature was actually produced from the alternative document, which was not the case.

There are significant differences between physical signatures and digital signatures, as they are currently constituted. According to Reed [17], signatures provide evidence for three legal matters:

• the identity of the signatory;

• that the signatory intended the signature to be his signature; and

• that the signatory approves of and adopts the contents of the document.[17]

Current digital signature schemes do not fully address the first and third requirements. For instance, the identity of the signatory is held in the public key, and any link between key and signatory is left for individual applications. Usually, these links take the form of Public Key Infrastructures (PKI) such as X.509 [18], or the web of trust. X.509 relies on a Trusted Third Party, called a certificate authority, to pass down identity verification. Conversely, the web of trust requires that users cross-validate their identities by signing each other’s public keys. All PKI infrastructures have disadvantages of one form or another [19], but the main problem with respect to the requirements for signatures is that the identifier used in PKI infrastructures is rarely used outside of the infrastructure. For instance, one cannot access the Distinguished Name (DN) of an X.509 subject and learn about which public keys they use, or how one might contact the subject, securely or otherwise. Further, because the structure of signed documents is simply a byte stream, it is impossible to know when a particular document actually mentions the signatory (most signed documents mention the signatory in some way or role). Because of this, it is hard to tell under what authority a signatory has signed a document.

Additionally, because digital signatures are almost entirely mathematical abstractions (products of a particular mathematical function), until now they have not been able to directly express the relationship of witnessing a signature, signing on behalf of another agent, or any other specific semantics associated with the meaning of a signature. Existing proxy digital signature schemes involve a markedly different signature algorithm from non-proxy signatures (see Boldyreva et al. [20], for example). An infrastructure that takes advantage of these signatures would therefore need to support multiple schemes, and there seems to be no way to merge proxy and non-proxy signature schemes as they currently exist [21].

For a digital signature scheme to work in the way that physical signatures do, it will need to support not just the authentication of a signature against a public key, but also the authorization the signatory has to perform that signature. Authorization happens at two levels: is the signatory a party to the document, and if not, are they authorized to act on behalf of one? This fits in with the third requirement, “that the signatory approves of and adopts the contents of the document” [17]. Alice cannot approve and adopt the contents of a document that she is not a party to unless she has been given authorization to do so on behalf of someone who is.

RDF is, in its simplest form, a way of talking about things (resources) in a way that interlinks them and describes them at the same time [22]. RDF data is composed of sets of statements, where a subject, always a resource, is linked to an object, which can be a resource or a literal value (strings, numbers, dates, etc.), via another resource, called a predicate. On the web, documents link to each other without any labels on those links: it is simply a → b, with no description of that link. On the semantic web, resources link to other resources (and literals) and the links are labeled using predicates. A statement in the semantic web might look like Alice →knows Bob, where Alice is the subject, knows is the predicate, and Bob is the object. In RDF, resources are identified by Uniform Resource Identifiers (URIs), or are anonymous. Predicates cannot be anonymous, and subjects cannot be literals. RDF graphs, or sets of statements, can be collected and published on the web and in different kinds of RDF graph databases. These graphs can sometimes be identified using URIs as well, breaking up a more “global” RDF graph into local subgraphs that address a particular topic.

Linked Data is a set of principles, and the data that follow those principles [4]. The first principle is that data should be identified using URLs that can be dereferenced, or accessed, to gain information. The second is that the information the URL returns should say something about the thing that is identified by that URL. Finally, that information should be returned in a way that is easy for machines to interpret, using open standards. Here, plain text is difficult for machines, while spreadsheets are easier, and RDF is the easiest. Microsoft Excel workbooks might be hard for many machines to interpret without help from a human because the format is proprietary and difficult to support, while formats such as Comma Separated Value (CSV) are much easier to support.
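The Alice →knows Bob statement above can be written down and read back directly. The following is a minimal sketch, assuming the Python rdflib library; the ex: namespace is hypothetical, and foaf:knows is a term from the FOAF vocabulary:

    # A minimal sketch, assuming the Python rdflib library. The ex: namespace
    # is hypothetical; foaf:knows is a term from the FOAF vocabulary.
    from rdflib import Graph

    g = Graph()
    g.parse(data="""
        @prefix ex:   <http://example.org/> .
        @prefix foaf: <http://xmlns.com/foaf/0.1/> .

        ex:Alice foaf:knows ex:Bob .
    """, format="turtle")

    # Each statement is a (subject, predicate, object) triple.
    for subject, predicate, obj in g:
        print(subject, predicate, obj)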

1.2 Legal Framework

The legal force of electronic signatures in private industry, with respect to any transaction in or affecting interstate or foreign commerce, was firmly established in 2000 with the E-SIGN Act [23], which states in 15 USC §7001(a)(1-2):

(1) a signature, contract, or other record relating to such transaction may not be denied legal effect, validity, or enforceability solely because it is in electronic form; and

(2) a contract relating to such transaction may not be denied legal ef- fect, validity, or enforceability solely because an electronic signature or electronic record was used in its formation.

This has been upheld in the case law since, as shown in [24, 25, 26, 27, 28, 29]. Digital signatures are a cryptographically driven form of electronic signature, and are accepted as legally binding signatures in the United States, the European Union, and the United Kingdom [30]. More specifically, digital signature schemes attempt to prove that the resulting signatures are non-repudiable. McCullagh and Caelli [31] provide a comprehensive discussion of non-repudiation in the digital world, but criticize this perspective as misguided and not grounded in realistic legal frameworks.

The content structures of digitally signed documents are not usually addressed in digital signature schemes, usually being left to specific applications. If the semantics of signed documents could be more closely defined, the identity of an agent (beyond any one key pair) could be discussed within the document itself. This would be an ideal situation, rather than using keys directly to identify agents, since agent identifiers are presumably more durable than key pairs: keys of a given strength gradually become weaker as the computational power needed to break them grows. Existing digital signature schemes also assume an application-specific mechanism to link key pairs with agents, including the XML Digital Signature standard [32], which means that these digital signature schemes cannot validate the relationship between the signer of a document and the document itself.

1.3 Use Case

We will use the following use case scenario to illustrate our requirements. This scenario requires a way to produce document signatures that are:

• authentic, where one can verify (a) the identity of the signatory and (b) that the signatory intended the signature to be his signature;

• authorized, where one can verify that (a) the signatory approves of and (b) adopts the contents of the document, and that the signing agent has standing in the document to do so; and

• computable, where software agents can reason on the contents of the document without further intervention from a human agent.

Alice has been keeping track of her weight, sleep quality, and resting pulse for a year by pushing data from various instruments and apps into an RDF store. When she enrolls in a medical study, the investigator, Bob, sends her a document that has a privacy preference written in the Privacy Preferences Ontology (PPO) [8] to let his organization, CureLab, access part of her data. It also includes a prov:Activity [11] that represents the study and describes (in human-readable form) how the data will be used in it and the organization that is conducting it. Before sending it, Bob signs the document on behalf of CureLab, assuring that Alice’s data will be used in a particular way. Bob then sends the document to Alice, who also signs it. Alice then loads the signed document into her RDF store, which is able to verify that the document was signed by all parties to the document. Later on, Bob realizes that the study needs additional information in order to proceed. He sends a revision of the original document to Alice, who signs it. A year later, Alice has a change of heart, and decides to revoke the contract, which suspends CureLab’s access to her data.
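The agreement document in this scenario is itself an RDF graph. The sketch below, assuming rdflib, shows one possible shape for it; the ex: names are hypothetical, and the PPO property names are illustrative assumptions rather than a normative WebSig or PPO usage.

    # A hedged sketch of the agreement document from the use case, assuming
    # rdflib. The ex: names are hypothetical, and the PPO properties shown
    # are illustrative assumptions, not the exact terms a real agreement uses.
    from rdflib import Graph

    agreement = Graph()
    agreement.parse(data="""
        @prefix ex:   <http://example.org/> .
        @prefix ppo:  <http://vocab.deri.ie/ppo#> .
        @prefix prov: <http://www.w3.org/ns/prov#> .

        ex:agreement1 a ppo:PrivacyPreference ;
            ppo:appliesToResource ex:AliceHealthData .

        ex:study1 a prov:Activity ;
            prov:wasAssociatedWith ex:CureLab .
    """, format="turtle")

    print(agreement.serialize(format="turtle"))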

1.4 Introducing WebSig

WebSig is a web signature scheme that is trustable, computable, and minimally repudiable, and is therefore appropriate for use in a distributed web environment. It does this by building on existing public key infrastructure and standards developed in the semantic web community for provenance, document publication, and semantic web services. In the same way that concurrency research has used semaphores as primitives to build locks and synchronization methods [33], we use the RSA digital signature scheme as a primitive to build the WebSig digital signature scheme.

WebSig uses the Resource Description Framework (RDF) and the Web Ontology Language (OWL) to provide a foundation that establishes the semantics of a signature. These standards allow for unambiguous interpretation of signatures, identities for signers, and the roles they play within the document itself. WebSig provides a framework for trusting computable documents, but does not itself attempt any computations on documents beyond what is needed to validate them. It is also able to detect that changes have been made to computable documents, but does not provide a means to determine what those differences are. Other technologies, such as RDF Delta [34], already provide those capabilities.

WebSig relies on three other contributions: a proof of sufficiency that linkable, verifiable, portable, revisable, attributable digital signature schemes are also trustable and computable; FRIR; and RGDA1. The first contribution, in Chapter 2, is a proof that certain qualities of digital signature schemes are sufficient for them to be trustable, computable, and minimally repudiable. The second contribution, FRIR, is a foundational contribution that integrates provenance across different levels of abstraction. FRIR is introduced in Chapters 4 and 5. FRIR provides a means to reliably identify the content of the documents being signed, as well as the original bytes of the document. It unifies provenance expressed at high levels, such as version-invariant provenance or content-specific provenance, and low levels, such as specific serializations of data or individual copies. This is demonstrated with the use case of open government data. FRIR was previously used to provide provenance traces for transactions on the web [35]. The third contribution, the RGDA1, is introduced in Chapter 6. It produces cryptographically secure identifiers for RDF graphs that are unique and reproducible across all RDF graphs. This makes it possible to sign the contents of a graph in a reproducible way, rather than simply signing a particular serialization of information. FRIR and RGDA1 enable essential qualities of trustable, computable, and minimally repudiable digital signature schemes, as together they allow WebSig to be portable, verifiable, and revisable. The fourth contribution, described and evaluated in Chapter 7, is WebSig itself. We will demonstrate how WebSig is a digital signature scheme for the web that has all five of the qualities, discussed above, that are sufficient for a computable, trustable, minimally repudiable digital signature framework for the web.
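As a preview of the reproducibility that RGDA1 provides, the sketch below computes the same content-based digest from two different serializations of one graph. It assumes a version of rdflib whose rdflib.compare module includes the graph canonicalization work discussed in Chapter 6.

    # A minimal sketch, assuming rdflib's rdflib.compare canonicalization.
    # The digest is computed from the graph's content, not from a byte stream.
    from rdflib import Graph
    from rdflib.compare import to_isomorphic

    turtle = "@prefix ex: <http://example.org/> . ex:Alice ex:knows ex:Bob ."
    ntriples = ("<http://example.org/Alice> <http://example.org/knows> "
                "<http://example.org/Bob> .")

    g1 = Graph().parse(data=turtle, format="turtle")
    g2 = Graph().parse(data=ntriples, format="nt")

    # Both serializations identify the same graph, so the digests agree.
    assert to_isomorphic(g1).graph_digest() == to_isomorphic(g2).graph_digest()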

1.5 Organization

This dissertation is organized as follows: Chapter 2 provides this dissertation’s first contribution, a proof showing that a small set of qualities of a digital signature scheme make it able to produce signatures that are authentic, authorized, and computable. It also lays out evaluation criteria for each quality discussed. Two contributions provide foundational support for WebSig in modeling and graph digest algorithms. Chapters 4 and 5 provide the second contribution, an information structure that supports identification and expression of complex provenance on information resources, especially RDF documents. Chapter 6 provides a third contribution, RGDA1, a content digest algorithm for all RDF graphs. Chapter 7 provides the fourth contribution, the specification and evaluation of WebSig, a digital signature scheme for the web. The evaluation consists of assessments of WebSig against the criteria set forth in Chapter 2 and examples of WebSig in action implementing the use case described above. Chapter 8 provides a discussion of the consequences of WebSig, how its application focus differs from conventional digital signatures, and future work in applying WebSig to informed consent management. Chapter 9 concludes the dissertation and lays out the contributions and claims.

CHAPTER 2
Repudiation, Trust, and Computability in Web Documents

We set out a set of qualities that enable users to trust and compute on documents on the web. We also show how some of those qualities minimize the opportunities to repudiate those documents in a court of law. For this chapter, we sought the help of an expert in technology and law. Her role in this dissertation is discussed in more detail in the Acknowledgment. For a digital signature to be useful on the web, it must be possible for agents to interpret the content of a signed document in a way that they can act on it while knowing that they are using the same interpretation that any other agent would use. Further, if digital signatures are used to sign legally binding documents, they must be able to actually serve as a signature. Winn [36], and McCullagh and Caelli [31], have argued that digital signatures are not up to this task. Winn poses three questions about digital signatures:

1. Does the metaphor of “signature” make sense for asymmetric cryptography and public key infrastructures?

2. Why do signatures matter in traditional contracting practices?

3. What does “non-repudiation” mean?

Winn starts by asking if a digital signature can be considered a signature in the legal sense at all. He cites the Restatement (Second) of Contracts [37] as defining a signature as “any symbol made or adopted with an intention, actual or apparent, to authenticate the writing as that of the signer.” Winn suggests that:

Under appropriate circumstances, the act of affixing a digital signature certificate to a message that has been signed by the private key associated with that certificate might actually constitute a signature, but anyone making such a claim would have to be able to establish a connection between the mental state of the individual to be bound and the act of affixing the certificate and digital signature. The magnitude and complexity of the network architecture and information system security operating at each node on the network necessary to make that connection in a reliable, routine manner is one of the major obstacles now impeding the implementation of digital signature technologies [36].

Winn points out that, to interpret the use of a private key to generate legally binding digital signatures, the signer will need to keep the private key out of the hands of unauthorized agents. The information system that manages the private key, including the software implementation that generates the signature, needs to guarantee that the signature could not be created without the private key. Finally, anyone who interprets the signature would need to be able to determine if the signature was made to indicate that the signed document is attributable to the signer. Winn’s second question raises the history of physical signatures. He suggests that signatures were non-controversial mechanisms for contractual binding. In fact, signatures themselves were not necessary for contractual binding under common law if other evidence was available to indicate that parties to an agreement had bound themselves to it. Winn speculates that:

It is possible that the common law of contracts came to accept a signature as part of the proof that should be offered of intent to be bound so many centuries ago, and that the practice has continued for so long with relatively little change, that the topic scarcely seemed worthy of discussion. [36].

However, it is important to note that there must be some indication of intent to be bound by a given signed document, so that the signer can be held to the terms of that document.

Finally, Winn [36] discusses the concept of “non-repudiation”, a common concept in digital signature research. He suggests that the concept, as used in cryptography and digital signature research, “is not a term that currently has any significance in contract law.” Contract law instead discusses forms of “anticipatory repudiation”. Anticipatory repudiation occurs when a party to the contract breaks it before its terms begin [38]. This does suggest that, in certain circumstances, contracts can be broken or modified unilaterally or bilaterally, removing the binding of that document to the signers. However, repudiation does not generally cover the process of denying that a signature is authentic and authorized.

Winn’s discussion fits neatly with an analysis by Reed [17], who states that signatures provide evidence for three legal matters:

• the identity of the signatory;

• that the signatory intended the signature to be his signature; and

• that the signatory approves of and adopts the contents of the document.[17]

Reed separately derives these matters from the key cases of L’Estrange v. Graucob [39] and Saunders v Anglia Building Society [40] in UK law. These requirements, as discussed by Winn, suggest that if digital signatures are to serve the same role as physical signatures, they must be able to perform as evidence for all three matters. For each of these matters, a signature must be both authentic and authorized. Otherwise, the signer can declare that the signature is not valid, or is no longer valid. In such a system there might always be opportunities to repudiate a signature [31], but a signature scheme (digital or otherwise) must attempt to minimize the ability of a signatory to challenge a signature, in ways already established by case law, by providing a framework for evaluating such repudiation challenges. We describe these challenges in Section 2.2.

We therefore identify three qualities that will enable legally binding, computable documents on the web. To perform in this role, a digital signature scheme must be: trustable, computable, and minimally repudiable. We will formally define the first two qualities and show how three concrete qualities can be used to satisfy trustability and computability. We will then give a set of likely repudiation methods based on Reed’s framework of signatures as legal evidence and show how adding two more qualities minimizes the ability to challenge digital signatures in legal matters.

2.1 Trusting Computable Documents on the Web

We will begin by defining what we mean by a document, what makes a document computable, and what makes its signature trustable. We will then define three qualities – linkability, verifiability, and portability – that enable a digital signature scheme to produce signatures that are trustable on documents that are computable.

Definition: Document Documents D are abstract information entities that have concrete representations R:

∀d ∈ D ↔ (∃r represents (r, d))

All representations have specific byte sequences.

Definition: Formal Semantic Interpretation A formal semantic interpretation If is a conceptual information model that has some known systematic method of truth value assignment based on a Tarski-based system of semantics Sf [41].

Definition: Computable Document Computable documents Dc are documents D with concrete representations Rc that are expressed in a formal language Lf and have a formal semantic interpretation (If) that uses unambiguous denotation:

∀d ∈ Dc ↔ (∃!ic ∈ If ∀rc represents (rc, d) → interprets (ic, rc))

Unambiguous denotation is required to limit the document to a single formal semantic interpretation. Unambiguous denotation is where, in a system of denotation, any given symbol or identifier can only refer to a singular referent.

Lemma 2.1.1. All non-computable documents {D ∧ ¬Dc} can have only one representation:

∀d ∈ {D ∧ ¬Dc} ↔ ∃!r ∈ R represents(r, d)

Proof. A document d with a representation r that has a formal semantic interpretation i ∈ If is a computable document (d ∈ Dc). Without a formal semantic interpretation, it is impossible to know, formally, whether two representations contain the same information. Representations that do not have formal semantic interpretations can only be identified by their byte streams, and are therefore opaque to interpretation. Any process that interprets such a document does so using its own methods, which are specific to that process. If a process transforms an opaque representation from r to r′, it is impossible to determine whether the new representation contains the same or different information, because the interpretation that the process uses has not been formalized. We must therefore assume that the new representation r′ represents an entirely new document d′, because any two processes that might create an r′ may create them based on different interpretations and produce different results. If there are no transformations of a given r that can be formally shown to produce the same interpretation, then any given document d without a formal semantic interpretation i ∉ If can only have that one representation r.

Definition: Computable Digital Signature Scheme Digital signature schemes that can sign computable documents are computable:

∀ds ∈ DSc ↔ ∀d ∈ Dc ∃s ∈ S ∧ wasGeneratedBy(s, ds) ∧ used(ds, d)

Definition: Existential Forgery An existential forgery is a digital signature that is produced for some document given only the public key. The contents of the document do not have to be anything in particular, but instead simply have to be some sequence of bytes [16].

Definition: Trustable Digital Signature Scheme A trustable digital signature scheme a) produces signatures that are impervious to existential forgery (Se) and b) always produces the same signature for all representations R of any document d:

∀ds ∈ DSt ↔ (∀d ∈ D ∃!s ∈ St ∧ wasGeneratedBy(s, ds) ∧ (∀r represents(r, d) → s ∈ Se ∧ signatureOf(s, r)))

As shown above, only computable documents can have multiple representations. Non-computable documents have only one opaque representation (a byte stream). If a document is computable (the formal semantic interpretation is known), then it may have many different representations. A scheme that assumes an opaque representation would produce different signatures for each representation, and so would not be trustable on computable documents.

A digital signature scheme can only be trustable if it always produces the same signature for all representations of a document, and it can only be computable if it has a formal semantic interpretation with unambiguous denotation. Any digital signature scheme that is trustable and computable must therefore always produce the same signature for transparent documents that may have many different representations. Because they rely on message digests, conventional digital signature schemes only sign bytewise serializations of documents [16]. The signatures of such schemes can only be considered trustable if we pretend that the documents have no formal semantic interpretation, and can therefore have only one representation. That would make the document (and the scheme) non-computable. Conversely, if we acknowledge that a document can have multiple representations but only a single, unambiguous formal semantic interpretation, then we can no longer trust the signature on the document, but only on a single representation of the document as identified by its byte stream.
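A small, concrete illustration of this point, assuming rdflib and Python’s standard hashlib: two serializations of the same one-triple graph hash to different values, so a conventional digest-based signature covers only one representation of the document.

    # The same RDF document, two byte streams, two different message digests.
    import hashlib
    from rdflib import Graph

    g = Graph().parse(
        data="@prefix ex: <http://example.org/> . ex:Alice ex:knows ex:Bob .",
        format="turtle")

    def sha256_of(serialization):
        # Accept str (rdflib >= 6) or bytes (older rdflib releases).
        if isinstance(serialization, str):
            serialization = serialization.encode("utf-8")
        return hashlib.sha256(serialization).hexdigest()

    print(sha256_of(g.serialize(format="turtle")))  # one digest
    print(sha256_of(g.serialize(format="nt")))      # different digest, same graph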

2.1.1 Sufficient Qualities for Verifiable Computable Signature Schemes

Definition: Verifiable Document For all verifiable documents d in Dv, there exists a representation r that can be signed by a signature s that can be verified in a way that is impervious to existential forgery:

∀d ∈ Dv ↔ (∃r∃s ∈ Se ∧ represents (r, d) ∧ signatureOf (s, r))

Definition: Linkable Digital Signature Scheme A linkable digital signature scheme follows and supports the principles of Linked Data [4]. For the purposes of this thesis, however, we only need to consider that the scheme can sign documents represented in RDF:

• For all linkable signatures s in Sl, s is represented in RDF.

• For all signatories a in Al, a is represented in RDF.

• For all linkable documents d in Dl, d is represented in RDF.

Definition: Portable Digital Signature Scheme For all portable digital signature schemes ds in DSp, a signatory’s public key k in Pk, and computable documents d in Dc, there exists a single signature s in Sp that can be generated from any representation of d:

∀ds ∈ DSp ∀d ∈ Dc ∀k ∈ Pk ∃!s ∈ Sp wasGeneratedBy(s, ds) ∧ wasDerivedFrom(s, k) → (∀r represents(r, d) ∧ signatureOf(s, r))

Lemma 2.1.2. All linkable documents are computable: Dl ⊆ Dc

Proof. All linkable documents are expressed in RDF. RDF is a formal language that parses to a Directed Labeled Graph. It has a formal semantic interpretation [42] with unambiguous denotation (through the use of URIs as identifiers). Because computable documents are documents with a formal semantic interpretation with unambiguous denotation, all linkable documents are computable.

Corollary 2.1.3. All linkable digital signature schemes are computable: DSl ⊆ DSc

Proof. If a computable digital signature scheme is one that can sign computable documents, and all linkable documents are computable, then all linkable digital signature schemes are computable.

Lemma 2.1.4. All digital signature schemes that are portable and verifiable are trustable: DSp ∩ DSv ⊆ DSt

Proof. By definition, portable digital signature schemes produce the same signature from any representation of a given document, the first requirement of a trustable digital signature scheme. A scheme that is verifiable is impervious to existential forgery, the second requirement of a trustable digital signature scheme. Any digital signature scheme that is both portable and verifiable is therefore trustable.

Theorem 2.1.5. All digital signature schemes that are linkable, portable, and verifiable are computable and trustable:

DSl ∩ DSp ∩ DSv ⊆ DSc ∩ DSt

Proof. If a linkable digital signature scheme is computable (DSl ⊆ DSc) and a portable, verifiable digital signature scheme is trustable (DSp ∩ DSv ⊆ DSt), then a digital signature scheme that is linkable, portable, and verifiable is both trustable and computable (DSl ∩ DSp ∩ DSv ⊆ DSc ∩ DSt).
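For reference, the results of this section can be restated compactly in the notation used above:

    DSl ⊆ DSc                          (linkable schemes are computable, Lemma 2.1.2)
    DSp ∩ DSv ⊆ DSt                    (portable, verifiable schemes are trustable, Lemma 2.1.4)
    DSl ∩ DSp ∩ DSv ⊆ DSc ∩ DSt        (Theorem 2.1.5)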

While this kind of digital signature scheme would be a significant improvement on existing solutions, there are still ways in which these kinds of schemes can be circumvented through the use of repudiation claims. We will therefore introduce two additional qualities of digital signature schemes, attributability and revisability, in order to minimize the ability for signatories to make false repudiation claims.

2.2 Repudiation of a Signature

There are a number of ways in which a document signature can be repudiated, mostly revolving around challenging the authenticity and authorization of the signature itself or the acceptance of the content. Reed [17] suggests three forms of evidence: the identity of the signatory, their intent to treat the signature as a signature, and the approval and adoption of the document’s content. McCullagh argues that guarantees of non-repudiation are misguided and not a viable legal strategy [31]. However, the ways in which signatures can be repudiated can guide us to ways that can strengthen digital signature schemes.

2.2.1 Signatory Identification

Signatures must identify a signatory. An alleged signatory can challenge a signature on several fronts.

Authentication Challenge: Someone Else The first way to challenge a signature is to claim that the signature does not identify the alleged signatory. From a technical perspective, anyone can sign a document, but from a legal perspective, the signature should be tied to the relationship of the signer to the document, such as grantor, grantee, or witness. Any other signatures would have no effect. Within digital signatures, a claim could be made that the public key used to sign the document belongs to a person who has such a relationship. If that claim is false, then the signature has falsely identified the signatory, and can be challenged.

Authorization Challenge: Proxies A proxy signatory could claim that they are signing the document on behalf of the signatory, but if the proxy cannot produce documents that authorize them to do so, the original signatory can claim that they did not authorize the signature on their behalf. A signatory could also have revoked proxy authorization before the signature was made, or the proxy authorization may have expired when the signature was created.

2.2.2 Intention to Sign

The act of signing is taken to mean that the signatory intended to sign the document.

Authentication Challenge: Forgery Forgeries can exist, which disrupts the intention of the signatory to sign the document. Signatures can be challenged on their authenticity. Forgeries always identify the right person, but make it look like the signatory intended to sign when they did not.

Authorization Challenge: Identity Theft A valid digital signature that the signatory did not intend can potentially be produced if the private key of the signatory was stolen.

2.2.3 Adoption of Document

Signatures also indicate that the signatory “approves of and adopts the contents of the document” [17]. This approval and adoption are often directly encoded in contracts.

Authentication Challenge: Interpretation Signatories may want to sign a document for numerous reasons, not just to claim the contents are true. Whether or not a witness claims a document to be true can be irrelevant, as they are witnessing that the signatory believes so, and are only signing to indicate that they in turn believe that the signature is valid. Sometimes the witness needs to assert that the signatory is “of sound mind and body”, but these assertions are separate from what the signatory themselves is asserting and adopting. Alternatively, a signatory may wish to indicate that they have seen, read, reviewed, or simply received the document. None of these signatures should be construed to state that the signatory has approved of and adopted the contents of the document.

Authorization Challenge: Revision and Revocation Signatories signing new revisions of a document may also revoke their approval of prior versions. Older versions of the document cease to be in force. If a document does not contain clear semantics, then it is possible for a signatory to repudiate interpretations of the document. Further, signatories who have decided to revoke their approval of the document can repudiate it. Finally, many documents may expire after a certain time period; after that date, signatories can repudiate the document.

We will attempt to find appropriate channels for each of these forms of repudiation, so that if those channels are not used, it is harder (but not impossible) to repudiate a signed document.

2.3 Minimizing Repudiation

To minimize the opportunities to repudiate digital signatures, we introduce two additional qualities of digital signatures, attributability and revisability, to complement the previous three qualities (linkability, portability, and verifiability) that provide computability and trustability.

Definition A digital signature scheme is attributable if and only if:

1. the identity of the signatory is denoted by some unambiguous identifier,

2. that identifier is used to provide evidence supporting signature authentication (such as a public key),

3. the identifier can be used to express what the signatory is indicating by signing the document,

4. the identifier can be used to express why the signatory is authorized to sign the document, and

5. the identifier can be used to refer to the signatory within the document.

Definition Digital signature schemes that are revisable provide a means to express changes, revocations, expiration, and other modifications to signed documents and their signatures.

We can also show that each method of repudiation discussed in Section 2.2 can be minimized through combinations of our five digital signature qualities.

2.3.1 Signatory Identification

In digital signature schemes that are linkable and attributable, the signatory is unambiguously identified. Such a signatory therefore cannot deny that identity without relinquishing the identifier completely. Proxy signatures can be validated by checking against other linked, signed documents that themselves contain signatures by the authorizing agent. If a signatory is not a party to a document, or if the signatory claims without authorization to sign on behalf of an agent who is, attributable digital signature schemes will be able to detect that and issue a warning that the signatory is not a proxy for any party to the document.

2.3.2 Intention to Sign

Forgeries of digital signatures are, generally, very hard to produce without access to the private key of the target. Verifiable digital signature schemes ensure that a signature cannot be produced on a particular document except by using a particular private key. Identity theft, on the other hand, can only be guarded against by the signatory themselves, by limiting access to their private key. Signers can repudiate a signature by claiming identity theft, but any other signatures signed with the same private key in the same timeframe would also become suspect. Verifiable signatures are therefore as strong as can be expected against unintentional signatures.
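A minimal sketch of this verification property, using the Python cryptography package as an assumed stand-in for whichever RSA primitive an implementation chooses; verification fails for any content the private key holder did not sign:

    # A sketch: RSA-PSS signing and verification with the `cryptography`
    # package (an assumption; WebSig layers semantics over such a primitive).
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding, rsa

    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    signature = key.sign(b"original document", pss, hashes.SHA256())
    public = key.public_key()
    public.verify(signature, b"original document", pss, hashes.SHA256())  # passes

    try:
        public.verify(signature, b"tampered document", pss, hashes.SHA256())
    except InvalidSignature:
        print("signature does not match the presented content")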

2.3.3 Adoption of Document

An attributable digital signature scheme allows the signatory’s identifier to “... be used to express what the signatory is indicating by signing the document” (from the definition of attributable digital signature schemes). Linkable digital signatures in turn make it possible for the signatory to be identified inside the signed document, linking the document itself to the signatory in an interpretable way. Further, revisable digital signatures provide a way to revoke approval of documents and issue revisions. This means that a signatory can only repudiate older versions if a newer version has been issued and signed, or if the document has expired. More specifically, a signatory must sign something declaring the older document to be null and void. This can either stand on its own (revocation) or be part of the signature of a newer version of the document (revision).

CHAPTER 3
Related Digital Signature Schemes

WebSig is by no means the first digital signature scheme, nor the first to address these challenges. This chapter reviews existing work in digital signature schemes and evaluates them based on their ability to be computable, trustable, and minimally repudiable. These qualities are defined in Chapter 2. We will start by evaluating physical signatures and non-cryptographic electronic signatures as a baseline. The digital signature schemes are categorized as follows: basic digital signatures, like RSA and PGP; XML-based digital signatures, like the ones based on XML Digital Signature; and RDF-based digital signatures. Table 3.1 provides a summary evaluation for all of the signature schemes reviewed, including how many criteria are met for computability, trustability, and minimal repudiability.

In order to limit the scope of this review, we focus our efforts on digital signature schemes that can potentially be trustable, computable, and minimally repudiable. Menezes et al. [43, p. 426] define them as follows: “A digital signature scheme (or mechanism) consists of a signature generation algorithm and an associated verification algorithm.” A well-known cryptographic framework that uses digital signature schemes, but is not itself one, is the Security Assertion Markup Language (SAML) [44]. SAML is restricted in its scope to identifying agents and providing authentication and authorization statements about them; it does not provide a signature generation algorithm for general documents, nor does it have a verification algorithm. SAML assertions may be signed using an XML Digital Signature, but there is no way to sign anything other than SAML documents, which are limited in scope. Another example is the X.509 protocol [18], which uses digital signature schemes to exchange certificates for use in standards like TLS. The focus here, again, is on authentication of clients and servers to initiate encrypted communication, rather than providing general-purpose digital signatures. Additionally, we do not review digital signature schemes that have known attacks that are existential or stronger [16]. An exception is made for the seminal digital signature scheme by Diffie and Hellman [14].


Table 3.1: Digital signature schemes and their properties. T : this scheme is trustable. C: this scheme is computable. MR: this scheme is minimally repudiable for n of 7 challenges identified. WebSig, described in later chapters, minimizes repudiability in all seven challenges, and is trustable and computable.

Scheme                            T    C    MR

Non-cryptographic
  Physical signature              N    N    5/7
  Electronic signature            N    N    5/7
Basic
  RSA [15]                        Y    N    0/7
  Diffie-Hellman [14]             Y    N    0/7
  DSA [45]                        Y    N    0/7
  Identity-based Signatures [46]  Y    N    1/7
  ECDSA [47]                      Y    N    1/7
  Proxy Signature Schemes [20]    Y    N    1/7
XML
  XML Signature [32]              Y    N    1/7
RDF
  Kasten and Scherp [48, 49]      Y    N    1/7
  Cloran and Irwin [50]           Y    N    1/7
  Carroll [7]                     Y    N    1/7
  Tummarello and Morbindoni [51]  Y    N    1/7
  web-payments.org [52]           Y    N    1/7
WebSig                            Y    Y    7/7

3.1 Non-Cryptographic Signatures

A physical signature is a mark on paper made by a signatory (usually a human) that provides evidence for three legal matters [17]: “the identity of the signatory; that the signatory intended the signature to be his signature; and that the signatory approves of and adopts the contents of the document.” Because of their variability, physical signatures are not easily verified, even by human experts [53]. A signatory can therefore easily claim that a physical signature is a forgery. Handwriting analysis may or may not be able to confirm the claim, depending on the consistency of the signature and the quality of the forgery. Physical signatures are not trustable by our standards, nor are they computable. Within the context of human interpretation, though, physical signatures are usually used within metadata that attempt to explicitly identify signatories, provide the ability to designate signature proxies, allow the reader to interpret the role of the signatory within the document, can be used to declare revisions of documents, and can be used to revoke prior documents. None of these are available in any form that a computer can understand.

Electronic signatures are defined in 15 USC §7006(5): “The term ‘electronic signature’ means an electronic sound, symbol, or process, attached to or logically associated with a contract or other record and executed or adopted by a person with the intent to sign the record.” Electronic signatures that are not secured via some cryptographic scheme are especially prone to forgery. Any mark can potentially be made on behalf of another signatory without their knowledge. If the mark is recorded through some sort of authentication framework, the signature is only as secure as the credentials needed to access that identity. If the mark is a scanned image of a physical signature, the forger only has to obtain an example of the signature and add it to the document themselves. Electronic signatures are therefore not trustable by our definitions, and are not computable. The regimes of repudiation minimization from physical signatures apply to electronic ones if the same linguistic structures surround them. Again, none of these are automatically understandable by computers.

3.2 Basic Digital Signatures

Conventional digital signature schemes, such as RSA [16], DSA [45], the Elliptic Curve Digital Signature Algorithm (ECDSA) [47], identity-based signatures [46], and proxy signature schemes [20], all use message digests, mathematical functions that compute a unique identifier for each byte stream [43, p. 321]. Because these digests operate on byte streams, these signatures are opaque to formal semantic interpretation, and cannot be computable. These algorithms are trustable, as there are no cryptographic attacks that can produce existential forgeries. None of these schemes address all of the repudiation challenges, but two of them attempt to solve parts of the problem. In principle, any of these schemes could be made portable through the use of a cryptographic hash algorithm that computes over the document’s formal semantic interpretation, but metadata needs to be recorded identifying the hash used. Because that metadata isn’t recorded in anything like a linkable data structure, these schemes are not computable by themselves. We will use the RSA digital signature scheme as the algorithm to produce the mathematical signatures.

Proxy Signature Schemes [20] provide a means for a third party proxy to sign on behalf of a primary signatory. It is important to note that this scheme has a very interesting concept of transparency. It defines a transparent signature as one where a validator of the signature cannot tell who signed on behalf of the primary signatory: the signature will always be the same, and the proxy signatory remains anonymous. From our perspective, this seems like an opaque signature, since the provenance of that signature has been obscured mathematically. A transparent proxy might be better described as one where the provenance of the signature is clear to the consumer.

Identity-based digital signatures [46] attempt to minimize the ability of a signatory to repudiate their key by computing their public key from a set of well-known facts about the signatory. This is an interesting concept, but can have limited practical use. If someone changes their name, their public key might also change, even though that signatory may not wish to change their identity, merely their name. This scheme also requires deciding which set of invariant attributes of the agent should be used. Some agents with very common names may actually result in key collisions if, for instance, two John Smiths were born on the same day in the same busy hospital.

These contributions are interesting partial solutions, but it is clear that the mathematical attempts at solving repudiation minimization fall short, and an information modeling approach may be better suited to these demands.
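To make the byte-stream dependence concrete, the following sketch signs a small Turtle statement and then fails to verify a whitespace-reflowed serialization of the same content. It assumes the third-party Python cryptography package and RSA with PSS padding; the snippet is illustrative and is not the signing procedure of any scheme reviewed in this chapter.

    # A minimal sketch of byte-stream signing, assuming the third-party
    # Python "cryptography" package; names here are illustrative.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)

    # Two byte streams that a human would read as the same statement.
    document = b":George a :Person ."
    reflowed = b":George  a  :Person ."

    signature = key.sign(document, pss, hashes.SHA256())

    # Verification succeeds only for the exact byte stream that was signed.
    key.public_key().verify(signature, document, pss, hashes.SHA256())
    try:
        key.public_key().verify(signature, reflowed, pss, hashes.SHA256())
    except InvalidSignature:
        print("equivalent content, different bytes: signature fails")

This is exactly the sense in which such schemes are opaque to formal semantic interpretation: the signature binds a serialization, not the content it expresses.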

3.3 XML Digital Signatures

The W3C XML Signature Recommendation [32] provides a framework around conventional digital signatures that lets users sign documents and express those signatures using XML. Users can specify document canonicalization algorithms, a URI for the public key, and other specifics about how the signature was made. The digests of those documents are included in a SignedInfo element, which is canonicalized and encrypted into a SignatureValue element. It is possible to apply canonicalization transformations to the signed document before producing the digest. However, validation of such canonical forms cannot easily be done directly from an abstract representation, as the document needs to be serialized out in order to compute the digest. Further, while the public key is identified by a URI, the identity of the signatory (who can own one or more key pairs) is not addressed. Because of this, identity theft is easier to accomplish, since the identity is tied to the key pair rather than to some other identifier that is separately controlled. Finally, the recommendation only addresses direct signatures of documents without explicitly stating what the semantics of that signature are. As a result, while XML Signatures are trustable, they are not, at the same time, computable. Of the seven opportunities for repudiation, XML Signature does not address signatory identification, proxies, interpretation, revision, identity theft, or revocation. It does support the discovery of forgery.
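The role of canonicalization can be illustrated with a small sketch, assuming Python 3.8 or later, whose standard library implements Canonical XML 1.0. Two documents that differ in attribute order and tag syntax canonicalize to the same bytes and therefore the same digest; this only illustrates the idea, and is not the Recommendation’s normative processing model.

    # A sketch of digesting canonicalized XML, assuming Python 3.8+.
    import hashlib
    from xml.etree.ElementTree import canonicalize

    a = '<doc b="2" a="1"/>'
    b = '<doc   a="1"   b="2" ></doc>'

    # Different byte streams...
    assert a != b

    # ...but Canonical XML sorts attributes and normalizes tag syntax,
    # so both canonical forms, and hence both digests, agree.
    da = hashlib.sha256(canonicalize(a).encode("utf-8")).hexdigest()
    db = hashlib.sha256(canonicalize(b).encode("utf-8")).hexdigest()
    assert da == db
    print(da)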

3.4 RDF Digital Signatures

Digital signature schemes have been proposed for RDF. The Web Payments1 project has developed a vocabulary for digital signatures, including a specific method for producing signatures of RDF graphs.2 It only seeks to replicate current structures of digital signatures, without increasing their expressiveness. Carroll [7] proposed a system by which a canonical serialization of RDF can be used, against which a digital signature could be created. Tummarello and Morbindoni [51] proposed a means by which reified triples could be given a graph signature based on signing a Minimum Self-contained Graph (MSG), which has a scheme similar to RDF Molecules [54]. An MSG is a set of statements that can stand on their own without the need for other statements to provide context. While Tummarello and Morbindoni link the digital signature of each MSG to the URI of a PGP certificate, this is insufficient to determine which agent has actually signed the graph, because a PGP certificate does not necessarily identify an agent. Carroll does not discuss the role of attribution in his work at all. The Web Payments vocabulary, Carroll, and Tummarello and Morbindoni all rely on canonical serializations, which suffer from the need to provide identifiers for blank nodes and also require explicit ordering of statements. Canonical serialization is also an issue when the size of an RDF graph makes it expensive to write out separately from a triple store. These issues are discussed in greater detail in Chapter 6.

1https://web-payments.org
2https://web-payments.org/specs/source/vocabs/security.html#GraphSignature2012

Kasten and Scherp [48, 49] and Cloran and Irwin [50] both rely on the XML Digital Signature standard [32], which does not directly support a means to validate the public keys used in a signature. While it provides a means to identify the keys used in the signature, it does not identify the agents who control those keys, making the actual attribution of the resulting signatures and graphs unclear.

Additionally, each chapter within this dissertation discusses work related to its specific contribution. Please see Sections 4.2, 5.2, and 6.2.
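The blank node and statement ordering problems noted above can be seen directly with a small sketch, assuming the rdflib package (version 6 or later, whose serialize() returns strings); the graph and namespace are invented for illustration.

    # A sketch of the canonicalization problem for RDF, assuming rdflib 6+.
    import hashlib
    from rdflib import Graph
    from rdflib.compare import isomorphic

    data = """
    @prefix ex: <http://example.org/> .
    ex:doc ex:signedBy [ ex:name "Alice" ] .
    """
    g = Graph().parse(data=data, format="turtle")

    # Re-serializing assigns fresh blank node identifiers and an arbitrary
    # statement order, so byte-level message digests of the two forms differ...
    ttl = g.serialize(format="turtle")
    nt = g.serialize(format="nt")
    print(hashlib.sha256(ttl.encode("utf-8")).hexdigest())
    print(hashlib.sha256(nt.encode("utf-8")).hexdigest())

    # ...even though the two serializations denote an isomorphic graph.
    g2 = Graph().parse(data=nt, format="nt")
    assert isomorphic(g, g2)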

3.5 Conclusions

Physical and non-cryptographic electronic signatures, while they are neither computable nor trustable, do provide mechanisms to minimize repudiation in five of the seven challenges, missing only the minimization of forgery and identity theft. On the other hand, digital signatures are trustable but not computable, and generally fare poorly in minimizing the repudiation challenges we set forth. A digital signature scheme, like WebSig, that can provide trustability and computability and minimize repudiation would therefore stand alone in the field of cryptography and may realize the practical use of digital signatures for legally binding documents.

CHAPTER 4 Parallel Identities for Managing Open Government Data

The path to a new digital signature framework initially led through the world of open government data. Many of the same challenges that occur in producing trustable and computable digital signatures arise when managing and transforming large amounts of data that are published as-is, especially when attempting to track and leverage provenance from that process.

The widespread availability of Open Government Data is exposing significant challenges to trusting its unplanned applications. As data are accumulated, transformed, and presented through a chain of independent third parties, there is a growing need for sophisticated models of provenance. Although significant progress has been made in describing data derivation, it has been limited by its inability to distinguish transformations that change content from transformations that merely change representation. We have found that Functional Requirements for Bibliographic Records (FRBR) can, when paired with a derivational provenance model and cryptographic digest algorithms, successfully represent web resource accession, distinguish between transformations of content and format, and facilitate veracity. FRBR provides a means to unify parallel levels of abstraction in data representation that span from abstract to concrete. We show how FRBR concepts, cryptographic digests, and the World Wide Web Consortium’s provenance standard, PROV [11], can be used to provide an automated method to coordinate the many parallel identities of information resources, which data consumers can use to make informed decisions about which data product to use for their application.

4.1 Introduction

Open Government Data (OGD) is a new and rapidly growing phenomenon that has drastically increased the scale of government data publication by changing the equation of data release.

This chapter previously appeared as: J. P. McCusker et al., “Parallel Identities for Managing Open Government Data,” IEEE Intell. Sys., vol. 27, no. 3, p. 55, May 2012.

Rather than deciding to release data on a case-by-case basis, data is now expected to be released unless there are reasons not to do so, such as security or privacy. Catalyzed in 2009 by countries including the United States and the United Kingdom, governments from local to national levels are publishing their data for public use [55]. These data are available for personal or commercial use and offer the potential to increase the quality of life for communities, businesses, and government alike. Such benefits include helping citizens understand pollutants near their home [56], crimes in their neighborhood, public works, natural disasters, and political activities [57]. While individual datasets are interesting on their own, there is a hope and expectation that combining disparate datasets will lead to even more insight and value – the whole should be greater than the sum of its parts.

Unfortunately, combining datasets is more difficult than simply providing each as data files on a web site [57]. A number of social and technical challenges remain. Simply “releasing” data, even with good documentation, does not make it inherently useful. First, consistent or automated ways to discover, access, and obtain new datasets are not ubiquitous. Second, once a dataset is obtained, it is often difficult to quickly and easily merge it with others because it often differs in formatting (e.g., zip, csv) and modeling paradigms (tabular, relational, hierarchical), uses domain-specific terminology, uses shortcuts and abbreviations that are difficult to interpret, and refers to entities in differing ways (e.g., “POTUS” and “Barack Obama”) [58]. These low-level challenges need to be addressed for each dataset before one can begin to explore more interesting high-level questions. Challenges are compounded by the fact that groups around the world are undertaking similar uncoordinated activities to discover, collect, interpret, analyze, publish, and display results derived from the same sources.

Linked Open Government Data (LOGD) [59], the integration of OGD using semantic web and Linked Data principles, has the potential to meet the unmet expectations for a valuable, combined whole of disparate government datasets. According to Linked Data design principles, the Resource Description Framework (RDF) is used to associate data elements within each dataset. When data elements are named using web-accessible Uniform Resource Identifiers (URIs), they not only get a global name but also provide a direct way to request more information about that entity. For example, when interpreting a data element labeled “id”, one may need to seek documentation, contact another person for help, or make an educated guess at its meaning. Instead, by using a URI such as http://logd.tw.rpi.edu/id/us/state/Idaho, the data element leads to documentation and supplemental description3 when its identifier is requested from the web using HTTP. Because relationships are also named with URIs, they offer the same benefits. For example, if an organization or company is based near Idaho, an RDF triple connecting the company’s URI to Idaho’s URI through a “based near” property leads to supplemental information about not only the IDAHO POWER CO and Idaho, but also how they relate – simply by requesting any of the three URIs from the web. By reapplying this Linked Data approach to additional datasets, reused vocabulary and interconnected entities provide an explicit basis for a whole that is greater than the sum of its parts. This ability distinguishes RDF from alternative representations such as CSV, XML, and JSON, which do not provide intrinsic means to achieve such cross-dataset integration.

As part of the LOGD community, the Tetherless World Constellation at Rensselaer Polytechnic Institute has been developing tools and exploring how to apply Linked Data principles to integrate and use government data. The project’s primary tool, csv2rdf4lod,4 [59] embodies a URI design and data transformation methodology tailored to collect, retrieve, convert, enhance, and publish original government data sources as RDF while maintaining provenance concerning where and how the data was obtained and what was done to it. Developed over the past two years as dozens of team members have processed thousands of datasets, the design enables us to accumulate and derive additional value at any stage while ensuring data are backward-compatible and annotated with provenance. These efforts have resulted in over ten billion RDF triples about a multitude of government topics.

3e.g., that it is a state of the United States and was admitted in 1890.
4https://github.com/timrdf/csv2rdf4lod-automation/wiki

Although creating Linked Data from government data reduces integration costs and increases the potential for insights and value, it implicitly raises challenges for those choosing to use Linked Data instead of the original form. One issue is that the Linked Data version is often hosted by third parties instead of being maintained on the original host. Additionally, the third party is providing a transformed version of the reputable data originally provided by the government. What assurances does a consumer have that the data from a third party is just as good as that from the government? Do the benefits of integrated and comprehensible datasets provided by the third party outweigh the risks that they may contain mistakes, or, worse, malicious intent? If the same original government dataset is integrated by two different third parties, which should a consumer use?

However, simply providing provenance to data consumers does not mean they will understand it. We claim that a disparity of abstraction is the principal barrier for data consumers to trust provenance-annotated data. While humans phrase management of data in terms of high-level abstractions, most existing provenance representations exhaustively record low-level details. Instead of forcing a choice between high-level and low-level provenance, we propose to unify them with four parallel levels that span from abstract to concrete. Any of the four parallel identities of an information resource can then be considered at the level appropriate for a particular task. By parallel identities, we mean that a given entity can have more than one identity (id1, id2) such that one cannot say that id1 is mathematically identical to id2. We describe how structuring aspects of an information resource across levels of abstraction minimizes the need for exhaustive provenance capture and how the resulting formal model can be used to increase transparency into how OGD is used.

4.1.1 Use case: Trusting Integrated Data

We describe a simple use case to provide an example of the challenges that Linked OGD consumers face. Although simple, it is prototypical of many situations that we have encountered. The use case also serves as a basis for demonstrating and evaluating our technical approach. Figure 4.1 illustrates the four actors and seven resources5 involved:

A Government: provides a single CSV file at a URL6. Two other URLs (URL 1 and URL 2) point (i.e., redirect) to the CSV file.

Two Data Integrators (W and E): independently retrieve URLs 1 and 2, respectively, and store results locally for processing and re-publishing.

Integrator W: rehosts their CSV copy on their own site.

Integrator E: applies three transformations before hosting the results on their own site. Each transformation produces a different RDF file. The first, aid.raw.ttl, is derived from the CSV using a naive, domain-independent interpretation. The second, aid.raw.rdf, is derived from the first by re-serializing the RDF model from Turtle syntax to RDF/XML syntax. The third, aid.enhanced.ttl, is derived from the CSV using enhancement parameters developed by a human curator familiar with the original content and Linked Data design. This data is no longer tabular, but instead fully takes advantage of representing entities and relationships using RDF graphs.

A Data Consumer: is faced with the decision to use any of the seven data files available: either the two URLs provided by the government, or one of the five offered by the third parties.

The consumer’s challenges center around the under-described, un-coordinated proliferation that is characteristic of the Web. While data mirroring and derivation are commonplace there, understanding the relationships among the choices can lead the consumer to a more informed and confident decision. Without knowing the source and the nature of the transformations that led to each data file, a cautious consumer must assume their results are different until shown otherwise. If a third party offers a result whose content is equivalent to that offered by the original source, the format and transformations leading to it are intrinsically satisfactory. What assurances can the integrator provide to convince the consumer to use their results instead of the reputable form from the government? Third party data integrators need to convince

5Additional use case information including technical details and links to the actual resources is available at http://purl.org/twc/pub/mccusker2012parallel
6http://gbk.eads.usaidallnet.gov/data/files/us_economic_assistance.csv
7http://www.data.gov/download/1554/csv
8http://explore.data.gov/download/5gah-bvex/CSV

Figure 4.1: A simple use case where a data consumer must choose between the government’s original data or one of five data files offered by third parties.

consumers that their results are not only just as good as, but better than, the original; the processed results need to be more discoverable, comprehensible, and integrated – all while preserving the content and reputability of the original.

4.2 Related Work

We cover four kinds of related work: RDF conversion tools, current provenance models, the FRBR model from library science, which provides the four layers of abstraction for information resources, and existing content-based cryptographic digests.

4.2.1 RDF Conversion Tools

LOD2’s recent report [60] surveys two dozen leading tools for knowledge extraction, and [61] lists more. Open Refine [62] is prominent and offers a web interface to modify tabular data, while an extension permits the construction of RDF export templates. Although easy to use for small, individual datasets, Refine does not readily scale to the overwhelming number of OGD datasets that need to be exposed as Linked Data. This is mostly because it is a desktop application that does not attempt to automate conversions of thousands of datasets. The export extension also does not provide reasonable default URI creation, which increases the amount of human effort required to consistently name instances across datasets.

The conversion tool used in this paper, csv2rdf4lod, uses an RDF vocabulary9 to describe how to interpret spreadsheets into well-structured RDF representations that reuse existing vocabularies and explicitly connect entities across datasets. Unlike the R2RML language in development by the W3C [63], the vocabulary that csv2rdf4lod uses does not assume an underlying relational database. Instead, it borrows design from RDFS and addresses a wider variety of tabular encodings, including n-ary and statistical data. Using declarative “enhancement parameters” [59] reduces the need for custom software and thus reduces the likelihood of human error and the time required for a third party to become familiar with the enhancement. Declarative parameters enable others to automatically reproduce conversions using the same uniform terminology.

4.2.2 Current Provenance Models

Current provenance models describe the provenance of derivation and events relatively successfully. Models like the Open Provenance Model (OPM) [64], Proof Markup Language (PML) [65], and the World Wide Web Consortium (W3C) standard for provenance, PROV [11], describe the derivational history of information and other entities. These provenance models tend to describe derivation links as edges between entities, prov:wasDerivedFrom and opm:wasDerivedFrom being examples. OPM and PROV also describe events as additional nodes in the same graph.

9http://purl.org/twc/vocab/conversion/

OPM calls an event a Process, while PROV calls it an Activity. We call these sorts of events and links derivational provenance, since both record what happened and where things came from.
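As an illustration of derivational provenance, the following sketch, assuming rdflib and invented example.org identifiers, records a conversion using the PROV pattern just described: a derivation edge between two entities plus the Activity that relates them.

    # A sketch of derivational provenance in PROV, assuming rdflib;
    # the example.org identifiers are invented for illustration.
    from rdflib import Graph, Namespace, RDF

    PROV = Namespace("http://www.w3.org/ns/prov#")
    EX = Namespace("http://example.org/")

    g = Graph()
    g.bind("prov", PROV)

    # The derivation edge records where the RDF file came from...
    g.add((EX["raw.ttl"], PROV.wasDerivedFrom, EX["data.csv"]))

    # ...while the Activity records what happened.
    g.add((EX["conversion"], RDF.type, PROV.Activity))
    g.add((EX["conversion"], PROV.used, EX["data.csv"]))
    g.add((EX["raw.ttl"], PROV.wasGeneratedBy, EX["conversion"]))

    print(g.serialize(format="turtle"))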

4.2.3 Models from Library Science

Functional Requirements for Bibliographic Records (FRBR) [12] is a model developed by the library science community to describe the world of different bibliographic resources, where an author’s work can assume many forms such as a paperback book, an e-book, or an audiobook. After almost twenty years of development, the Library of Congress, the National Agricultural Library, and the National Library of Medicine have announced their intent to adopt systems based on the FRBR model [66].

FRBR uses four levels of abstraction to distinguish among parallel aspects of an author’s work: Work, Expression, Manifestation, and Item. Two “identical copies” of a book are distinct Items because they occupy different physical space, but they share the same Manifestation because they have the same physical structure. Audio recordings share the same Expression with the book because they both convey the same content. When the same conceptual story (Work) is revised, a new Expression is created and is associated with the same Work as the previous Expression. We call an Item’s connection to its Manifestation, Expression, and Work a “FRBR stack”, which comprises four parallel identities. Figure 4.2 illustrates an example of how FRBR can be applied to organize the different aspects of bibliographic resources. Although core FRBR has no derivational provenance model, its OWL representation10 provides some minimal properties to create derivational links within and across levels of abstraction.

4.2.4 Existing Content-Based Cryptographic Digests

As it becomes easier to shift between data formats, the ability to verify that information is the same has been weakened, because cryptographic message digests work only at the bitstream level. Shifts between data formats are becoming more common, especially with the use of content negotiation in HTTP [67].

10http://purl.org/vocab/frbr/core#

For example, two RDF graphs that assert “:George a :Person.” can be serialized in any number of ways, none of which changes the content of the graph. RDF graph digest algorithms [68] have been developed that are resilient to assertion ordering and other issues. Strategies such as canonical serialization have been used for other non-graph representations, including dataset publication using the Universal Numerical Fingerprint (UNF) [69]. However, canonical serialization introduces a number of issues that we discuss in Chapter 6, which introduces our own RDF graph digest based on Sayers and Karp’s algorithm. Finally, work in creating content-based digests for images and movies can identify image-based content across a large number of mechanical transformations [70].

4.3 Approach

By applying FRBR’s four levels of abstraction to digital information resources and naming their four parallel aspects using cryptographic digests, we can support useful explanations of manipulated data products. FRBR’s bibliographic resources such as books, albums, films, and magazines are, at their core, information resources. FRBR’s four levels of abstraction also naturally apply to electronic information sources. Copies of files (Items) are exemplars of the same Manifestation (byte sequence). Similarly, an Excel file created from a CSV will have a different Manifestation from the CSV, but maintain the same Expression because they both store the same data. Finally, if the data is modified, the original and resulting spreadsheets have different Expressions (visual content) of the same abstract Work.

Figure 4.2 illustrates how this digital FRBR approach can be applied to organize the use case’s data products according to their four parallel identities. Any file (Item) retrieved from the two government URLs or their rehosted locations will result in the same bitstream, and thus share an identical Manifestation. The naive transformation producing raw.ttl does not change the tabular content of the original CSV, so they share identical Expressions while differing in Manifestations (bitstream). The reserialized RDF model in raw.rdf again changes Manifestation, but retains an Expression identical to both raw.ttl and the CSV. The Item enhanced.ttl does not preserve an identical (tabular) Expression as the others because it was restructured during curation and can contain more or less content. The enhancement’s Work is also distinct because its content was created by Integrator E instead of the government.

Table 4.1: Different levels of abstraction in the FRBR stack and how they are identified. A match at any level implies a match at all levels above. If two resources have the same message digest, for instance, they have the same Manifestation, but also have the same Expression and Work.

Level           Digest
Item            Message Digest
Manifestation   Message Digest
Expression      Content Digest
Work            User Defined or Content Digest

Cryptographic digests make it possible to automatically identify electronic information resources in a number of ways (see Chapter 3 for more details about cryptographic digests). First, since the physical structure of a Manifestation corresponds to the sequence in a data stream, it is possible to create a unique, repeatable number – a message digest – to identify that data stream. This is the principal application of cryptographic digests; anyone else who encounters that data stream can compute the same digest and know that they have received the same sequence. Similarly, digest algorithms have been developed for specific content types that apply independently of serialization. Digests for RDF graphs [68] produce the same hash regardless of statement order or serialization format. Digest algorithms can be used to recognize reproduced content – the same content digest identifies the same information. If two message digests (Manifestations) differ, but share the same content digest (Expression), then the content is serialized in alternative representations. The different levels are shown in Table 4.1.

Although message digests apply to any digital file, content digests only apply to specific types of content. Effort must be made to find a format-invariant interpretation of the file contents identified by a single number. Because graph digests are only useful for RDF graphs, we have identified requirements for other content digests that can be used to automatically identify Expressions. While mapping the same content to the same digest is ideal, mapping it to different digests is also acceptable as an approximation. One may conservatively fall back to identifying an Expression using the message digest of the Manifestation that embodies it in situations where no content digest is available. In addition to the content digest types discussed above, we have developed a digest algorithm for raw spreadsheet tables: simply take the graph hash aggregate of every cell, where each cell is a tuple (row, column, value). For spreadsheets with multiple sheets, the tuple would instead be (sheet, row, column, value).
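A minimal sketch of this tabular content digest follows, assuming SHA-256 per cell and addition modulo 2^256 as the order-independent aggregate; the dissertation’s actual aggregate uses the graph digest machinery of Chapter 6, so this is only illustrative.

    # A sketch of the (row, column, value) tabular content digest,
    # assuming SHA-256 per cell and addition modulo 2**256 as an
    # order-independent aggregate.
    import csv
    import hashlib
    import io

    def table_digest(rows):
        total = 0
        for r, row in enumerate(rows):
            for c, value in enumerate(row):
                cell = "%d,%d,%s" % (r, c, value)
                h = hashlib.sha256(cell.encode("utf-8")).digest()
                total = (total + int.from_bytes(h, "big")) % (1 << 256)
        return "%064x" % total

    # The same tabular content digests identically whether it arrives as
    # a CSV byte stream or as an already-parsed list of rows.
    csv_text = "city,temp\nBoston,41\nTroy,38\n"
    parsed = [["city", "temp"], ["Boston", "41"], ["Troy", "38"]]
    assert table_digest(csv.reader(io.StringIO(csv_text))) == table_digest(parsed)
    print(table_digest(parsed))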

4.4 Methods

Message-level and content-level digest algorithms were implemented in two stand-alone Python utilities. The first, fstack.py, produces an RDF description using terms from the Functional Requirements for Information Resources (FRIR) vocabulary.11 FRIR was created to extend Ian Davis’ FRBR-core ontology12 [12], Nepomuk’s File Ontology [71], and W3C’s PROV ontology [11]. The second utility, pcurl.py, produces a similar RDF description for a file retrieved from a URL, but includes information about the URL and its HTTP response.

The csv2rdf4lod-automation data integration toolset was extended to incorporate the results from pcurl.py and fstack.py when retrieving URLs and when converting data files. We created a script that performs retrieval, conversion, reformatting, and enhancement in the use case described in Section 4.1.1. It produces a file for different combinations of the events that took place. We will use these files to evaluate the success in implementing the use case. The first file compares the FRBR stacks created when Integrators W and E retrieve URLs 1 and 2, respectively. The second file compares the FRBR stacks created when Integrator E converts the CSV to Turtle with a naive, domain-independent interpretation. The third file compares the FRBR stacks of the CSV, the Turtle derived from the CSV, and the RDF/XML derived from the Turtle – all created by Integrator E. The fourth file compares the FRBR stacks of the same CSV and Turtle to the FRBR stack of the enhanced Turtle – all created by Integrator E. Each of the four comparison files was inspected for common parallel identities between the FRBR stacks that were independently constructed during the use case. Source code, results, and details about the apparatus are available at our online appendix13.

11http://purl.org/twc/ontology/frir.owl
12http://purl.org/vocab/frbr/core

4.5 Results

Figure 4.3 illustrates the first of the four comparison files created, where Integrators W and E request different URLs and receive distinct files with the same Manifestation (message digest). The result shown is a union of two independently-asserted FRBR stacks. Because URL 1 redirects to URL 2, Integrator W mentions both in the FRBR stack. Meanwhile, Integrator E only mentions URL 2. Both mention URL 3 and identify the same Work, Expression, and Manifestation, which correspond to Work 1, Expression 1, and Manifestation 1 in our objective organization in Figure 4.2.

In the second comparison file, the message digests used to identify the Manifestations of the CSV and Turtle files differ. This tells us that the physical structures of the files differ. Because both files convey tabular content, the content digests used to identify their Expressions are identical. Although we omit the result here, the structure can be seen in Figure 4.2 with http://b.org/csv and http://b.org/raw.ttl. In the third comparison file, where the Turtle derived from the CSV is reserialized to RDF/XML, the message digests used to identify all three Manifestations differ, yet they all share the same Expression because the tabular content digest recognized identical tabular content. Again, this structure can be seen in Figure 4.2 with http://b.org/csv, http://b.org/raw.ttl, and http://b.org/raw.rdf. In the fourth comparison file, the tabular content digest did not apply to http://b.org/enhanced.ttl, because the curated Turtle did not exhibit a tabular structure like http://b.org/csv and http://b.org/raw.ttl. Instead, the RDF graph digest was used to identify the enhancement’s Expression. This structure, too, is omitted for space but can be seen in Figure 4.2.

Figure 4.3 was automatically constructed from the comparison file. Some abbreviations were made for presentation purposes, including shortening the cryptographic digests in the URIs naming the Items, Manifestations, Expressions, and Works.

13http://purl.org/twc/pub/mccusker2012parallel

The consolidation of higher-level endeavors is the principal characteristic to consider when observing FRBR stacks of files and their manipulations; when higher levels are consolidated, more information is known about the more concrete forms and whether or not they can be used for a particular application.

Figure 4.4 illustrates the CSV to raw RDF conversion. The Expression remains the same, even though the Manifestations are different. Figure 4.5 shows the raw RDF to RDF/XML conversion. The Expression remains the same, even though there has been no record of derivation of the files from each other. Figure 4.6 shows how enhanced RDF conversions do not resolve to the same Expression: new content has been added, so the enhanced RDF cannot hash to the same digest as the CSV or raw conversion.

4.6 Discussion

This chapter provides the initial basis for the portability and revisability of WebSig digital signatures. FRIR’s combination of FRBR and PROV provides the means to express changes at multiple levels of abstraction, including revision of information resources. The attribution and creation relations that are part of PROV will also provide a means to express attribution in an effective way.

The ability to tell what kinds of transformations are recorded in provenance makes it simpler to show relevant provenance information to users. Additionally, using cryptographic content digests as identifiers makes it simple to verify the identity of content and prove, out of band, that the same information is used in multiple settings. The uncertainty of not knowing what is contained in each file is managed through the automatic combination of FRBR entities using cryptographic digests. The expansion of these digests will allow for validation of content across representations, making sure that the content is the most important aspect of data transmission, not the format.

For our data conversion use case, consumers can verify that the raw RDF conversion we provide has the same content as the file retrieved from the government. When they use an enhanced RDF conversion, we can tell them that it was derived from something they trust. Additionally, our extension of FRBR to electronic information resources and use of content and message digests to identify these resources should make it much easier for digital libraries to manage resources that are both electronic and physical [72]. Because of the fine-grained identity assertions that can be made, independent third parties can provide assurances to data consumers that they are producing and providing data that is just as good as the original data, and in cases where enhancements have taken place, is better than the original. Two independently generated raw conversions of government datasets can be trusted directly because of matched content digests, and trust of the enhancement can be earned by inspection of the results and conversion parameters.

We believe that these sorts of abstractive relationships among entities are important for accurately expressing the provenance of information resources. Currently, the W3C Provenance Working Group, in its work to develop a provenance standard for the web, is including a property to express abstractive relationships. It is our view that, as FRBR provides significant value in expressing provenance for information resources, PROV should include FRBR relationships and classes in the core ontology to encourage re-use. Barring that, an extension that includes FRBR should be recommended so it can be applied to information resources.

4.6.1 Future Work

We would like to investigate the use of FRBR to handle composite workflows and to provide high-level visualizations of workflow history. While higher granularity workflows work at more concrete levels, lower granularity workflows work at more abstract levels. Similarity of concrete traces may be determinable through analysis of workflows at higher levels using FRBR. We can also provide veracity of enhanced conversions by supplying digests of the original data, enhancement parameters, and the resulting enhanced data. Users can then reproduce the original conversion and verify it via content digests. While the use case presented only uses one dataset, it should be possible to show how datasets can be combined from multiple sources.

Additional work is needed on new types of content hashes for other types of media. We have covered RDF graphs, spreadsheets, databases, and moving and still images. Other media types, including audio, video with audio, and text, need to be explored to determine if they can be given content digests. Content digests for these types would make nearly all information resources identifiable by their content.

4.7 Conclusions

As part of the linked open government data project, we perform aggregation and curation by applying Linked Data principles when we generate linked open data. With OGD, combining datasets with the semantic web gives value to that data. However, for data consumers to trust the data enough to act on it, they need to have an accurate picture of the sources depended upon, what content has been created, and how it has been modified. We developed a use case that expresses these needs, and showed that using FRBR to build multiple levels of abstraction of information resources, when paired with content-based cryptographic digests, allows for easy identification and validation of information resource content. The use of these digests to identify Expressions makes it possible for data consumers to trust third parties with management of data by making that management transparent at a level that is relevant to the consumer. This use of multiple levels of identity, especially content-based identity, makes it possible for data consumers to trust what modifications, if any, have been made to the data they use. As our LOGD system is a form of third-party data curation, our experiences with improving trust and transparency can possibly be applied to that domain as well.

This chapter provides a way to show consumers why they should (or should not) believe that the data they asked for is actually the data they received. It also provides the initial basis for building data structures that can be used to create portable (through FRBR and RDF graph digests), revisable (through FRBR and PROV relations), and attributable (through PROV relations) digital signature schemes. The next chapter will expand and formalize the FRBR/PROV mapping and show how it can be used to create provenance traces of information resource access on the web.

Figure 4.2: The data products from the use case in Figure 4.1 can be organized according to their four parallel identities established by FRBR’s Work, Expression, Manifestation, and Item. In this figure, Items are unlabeled as such but are the innermost boxes, representing individual instances of resources.

Figure 4.3: FRBR provenance when Data Integrators E and W retrieve two different URLs. The relations among the requested URLs become apparent: URL 1 (eventually) redirects to URL 2, which redirects to URL 3. Although retrieved independently, the files share the same Manifestation and Expression because the message digest and content digest were used to name them, respectively. The unlabeled dashed lines are rdf:type triples.

Figure 4.4: FRBR provenance when Data Integrator E converts the CSV to raw RDF. Although the files’ Manifestations differ, the Expression is the same. By this, we know that no new content was created (or lost) in the conversion.

Figure 4.5: FRBR provenance of the CSV, raw RDF, and a conversion of the raw RDF into RDF/XML. Although the RDF/XML is not stated to be derived from the original, their common Expression permits us to view them as content-equivalent.

Figure 4.6: FRBR provenance applying enhancement parameters to the CSV’s conversion to RDF. Although the raw RDF’s Expression was recognized as tabular and mapped to the same Expression as the CSV, the enhanced RDF is a new graph. This new content structure results in a digest derived from the RDF Abstract Model instead of a table. The new Expression is associated with a new derived Work.

CHAPTER 5 Information Resource Provenance on the Web

HTTP transactions have semantics that can be interpreted in many ways. At a low level, a physical stream of bits is transmitted from server to client. Higher up, those bits resolve into a message with a specific bit pattern. More abstractly, information, regardless of the physical representation, has been transferred. While the mechanisms associated with these abstractions, such as content negotiation, are well established, the semantics behind these abstractions are not. We extend the library science resource model Functional Requirements for Bibliographic Resources (FRBR) with cryptographic message and content digests to create a Functional Requirements for Information Resources (FRIR) ontology that is integrated with the W3C Provenance Ontology (PROV-O) to model HTTP transactions in a way that clarifies the many relationships between a given URL and all representations received from its request. Use of this model provides fine-grained provenance explanations that are complementary to existing explanations of web resources. Furthermore, we provide a formal explanation of the relationship between HTTP URLs and their representations that conforms with the existing World Wide Web architecture. This establishes the semiotic relationships between different information abstractions, their symbols, and the things they represent.

5.1 Introduction

The architecture of the World Wide Web [73] defines the relations between URLs, Resources, and Representations, which are illustrated in Figure 5.1. However, these relationships are incomplete, since the content of representations can change over time and content negotiation can result in different data being transferred. For example, the temperature reading in a weather report will change regularly, while different requests for the same weather report can return a variety of formats such as HTML, XML, RDF, and JSON.

This chapter previously appeared as: J. P. McCusker et al., “Functional requirements for information resource provenance on the web,” in Provenance and Annotation of Data and Processes, Santa Barbara, CA, Jun. 2012, vol. 7525, pp. 52-66.

The ability to explain what an HTTP client sees as a result of a transaction and how, exactly, it relates to the URL that it requested is critical to the understanding of both how information resources14 work on the web and how the provenance of web information resource access should be represented. We look to library science and provenance models to help provide these explanations, along with some help from the field of semiotics.

There are many reasons to clarify these semantics. For instance, the content of an image is more important than its format. Validating that a pathologist reviewed a particular image relies on the fact that the pathologist saw a particular image, not the file format it was saved in. In fact, transcoding of that image from a database to the client may happen as part of a web application. If it were possible to identify content regardless of format, our doctor would be able to make verifiable claims that she not only read data from a particular file, but that she saw a particular image. Similarly, web site mirroring mechanisms allow the same content to be available from multiple locations. Content-based identity of information would allow users to discover alternative locations for data, and validate that the information is actually the same regardless of source or format.

5.1.1 A Weather Example

To illustrate some of the issues regarding the relationship between a URL and the variety of representations that its request may return, we use a weather report provided by the National Oceanic and Atmospheric Administration’s (NOAA) National Weather Service. Current weather conditions are provided for locations across the United States and include fundamental measures such as time, temperature, wind direction, and visibility distance. The latest hourly reports for Boston are provided in both RSS15 and XML16 formats. Although the service reports that it updates every hour on the hour, updates occur at unpredictable intervals. In this particular example, the service updated at 3:00 and 4:00, handled RSS requests at 3:05 and 4:05, and handled XML requests at 3:10 and 4:10.

14Because we consider URLs returning status codes other than 200 to be non-information resources, they are out of scope in this paper.
15http://www.weather.gov/xml/current_obs/KBOS.rss
16http://www.weather.gov/xml/current_obs/KBOS.xml

Figure 5.1: The relationships between identifier, resource, and representation from Architecture of the World Wide Web [73].

Given the current Web Architecture, what can we say about these two URLs (the XML and RSS versions) and the four representations retrieved by their request? According to the Architecture of the World Wide Web (AWWW) [73], the two RSS files (which are different, as the weather has changed) represent the referent identified by the URL, while, because the URL is different, the XML files represent another referent. That these are alternative representations of the same referent (the weather in Boston) means that we need a more sophisticated understanding of how the four files relate to one another and whether each relates to its URL differently. How can this be accomplished? We could compare files, but different formats would make it impractical to see their similarities. We could look to the files’ creation dates to learn when each file was received, but we cannot know how content has changed over time or if two transactions returned the same content in different representations. If different clients received the different representations, how can they begin to rationally discuss, compare, and share their individual representations?
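The limitation can be made concrete: comparing retrieved representations by message digest establishes only bit-level equality. The following sketch assumes the third-party requests package and network access; the NOAA URLs are those cited above and may since have moved.

    # A sketch of comparing retrieved representations by message digest,
    # assuming the third-party "requests" package. Byte-level digests can
    # only show that two responses are bit-identical; they cannot show
    # that the RSS and XML carry the same weather content.
    import hashlib
    import requests

    rss = requests.get("http://www.weather.gov/xml/current_obs/KBOS.rss")
    xml = requests.get("http://www.weather.gov/xml/current_obs/KBOS.xml")

    # These digests will differ even when both representations report the
    # same observation, which is exactly the gap that a content-level
    # identity is meant to close.
    print(hashlib.sha256(rss.content).hexdigest())
    print(hashlib.sha256(xml.content).hexdigest())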

5.2 Background: Existing W3C Recommendations

This leads us to wonder whether there are any other existing semantics defined in W3C recommendations relating to how URIs, XML entities, and RDF resources are related. This may appear to be a surprising question after years of success of W3C recommendations. However, the latest recommendations for XML [74] and RDF/XML [22] do not illuminate the issue. The XML recommendation [74] comes no closer to the issue than to state the following:

“Attempts to retrieve the resource identified by a URI may be redirected at the parser level (for example, in an entity resolver) or below (at the protocol level, for example, via an HTTP Location: header). In the absence of additional information outside the scope of this specification within the resource, the base URI of a resource is always the URI of the actual resource returned. In other words, it is the URI of the resource retrieved after all redirection has occurred.”

From this definition, one can infer that more than one representation of a resource may be returned from a URL and that the exact nature of this resource can be unpredictable. This is because an HTTP-based entity resolver implies the ability to return multiple representations of the same content. Similarly, the RDF/XML recommendation [22] states that:

“nodes are RDF URI references, RDF literals or are blank nodes. Blank nodes may be given a document-local, non-RDF URI references identifier called a blank node identifier. Predicates are RDF URI references and can be interpreted as either a relationship between the two nodes or as defining an attribute value (object node) for some subject node.”

but goes no further. An “RDF URI reference” is syntactically described, and the recommendation further discloses that “RDF URI references are compatible with the anyURI datatype as defined by XML schema datatypes [75], constrained to be an absolute rather than a relative URI reference.” None of these definitions illuminate what a URI is supposed to represent, nor the relationship between a URL and the information that can be retrieved from it. Again, this leaves the recommendation reader without an explanation of the meaning of a URI in an RDF graph.

5.3 The semiotics of HTTP URLs

The dereferencing of a URL can be mapped to a semiotic interpretation. For example, it is possible to use Ogden and Richards’ Semiotic Triangle [76], a model of how real world objects are related to symbols and how people think about those objects from a linguistic perspective. In order to consider HTTP operations in these terms, it is important to remember that a URL is not only a symbol but also an address for information about that symbol. For example, http://www.weather.gov/xml/current_obs/KBOS.xml indicates that a web page can be accessed using the HTTP protocol against the server denoted by the name www.weather.gov and requesting the document ’/xml/current_obs/KBOS.xml’. The document obtained is a representation (an XML document) of the thing identified by this URL.

Figure 5.2 illustrates the partial correspondence between the semiotic triangle and the web architecture. While a URL is a Symbol that stands for and identifies a Referent Resource, the correspondence to thoughts (from the Semiotic Triangle) or representations (from the AWWW) isn’t immediately clear. The document retrieved cannot be defined only as a representation of a resource: the document can be described in terms of either its content or the set of bytes used to represent it – or both. So, the document must be described further to achieve our purpose. A potential solution is to refine the representation into its constituent identities that are based on different levels of abstraction. In the next section we will introduce a model that, when paired with a provenance model, can provide the necessary distinctions to fully express both the semiotic relationships inherent within HTTP and the means to provide provenance traces for HTTP transactions at the levels of abstraction that are inherent within the protocol.

Figure 5.2: AWWW’s URL and Resource correspond to the semiotic triangle’s Symbol and Referent, respectively. A representation is itself another referent that is not identified here, but will be elaborated on in Section 5.4.

5.4 FRBR and FRIR

Functional Requirements for Bibliographic Resources (FRBR) [77] is a mature model from the library science community that distinguishes four aspects of an author’s literary work, ranging from purely concrete to completely abstract. For instance, FRBR can describe how different copies of the same book, or different editions of the book, relate to each other. The most concrete aspect is the Item – the physical book that exists in the world. Items are singular entities; making a copy of an Item results in a new Item. Items are exemplars of Manifestations, which represent similar physical structure. For instance, an exact copy of an Item preserves the original Manifestation. If the copy is inexact, or if the book is turned into an audio book, then the Manifestation changes. However, the Expression of the paperback and audio book remains the same, because the Expression reflects particular content regardless of physical configuration. An Expression in turn realizes a Work, which is “a distinct intellectual or artistic creation” [78]. A Work remains the same through different realized Expressions that result from translation, revision, or any other change. To facilitate discussion, we use the term FRBR stack to refer to a tuple (frbr:Work, frbr:Expression, frbr:Manifestation, frbr:Item) that represents these four distinct aspects of a resource.

Introduced in Chapter 4, Functional Requirements for Information Resources (FRIR) [79] extends the use of frbr:Work, frbr:Expression, frbr:Manifestation, and frbr:Item to electronic resources, and therefore to any information resource. FRIR integrates FRBR with the W3C Provenance Ontology (PROV-O) by declaring frbr:Endeavour (the superclass of frbr:Work, frbr:Expression, frbr:Manifestation, and frbr:Item) to be a subclass of prov:Entity and mapping 14 of 18 frbr:relatedEndeavour subproperties as subproperties of one or more of prov:wasDerivedFrom, prov:alternateOf, and prov:specializationOf, as shown in Table 5.1.

Within electronic resources, a frbr:Work remains a distinct intellectual or artistic creation. A frbr:Work corresponds to the Resource or Referent in the semiotic framework discussed above, and can be identified by a URL, as was shown in Figure 5.2. URLs that identify information resources cannot be assumed to be content-invariant, since what is returned can change over time. Nor are they format-invariant, as content negotiation allows other content types to be returned based on the HTTP Accept header. Since the content from a given URL can vary in both format and content, the appropriate level of representation is frbr:Work. Taken together, frbr:Expression, frbr:Manifestation, and frbr:Item are all aspects of the Representation, and are each Referents in their own right. Inasmuch as they can be identified or symbolized, they have symbols that identify them. frbr:Expression corresponds to a specific set of content regardless of its serialization. For instance, two files would have the same frbr:Expression but different frbr:Manifestations if they are the same picture stored in two different formats (e.g., JPG and PNG). Similarly, a spreadsheet stored in both CSV and Excel would still have the same frbr:Expression but different frbr:Manifestations. A frbr:Manifestation corresponds to a specific bit pattern. If a file is an exact copy of another file, they have the same frbr:Manifestation. A frbr:Item is a specific copy of information stored somewhere or transmitted through a communication link. If a copy of the frbr:Item is made, it results in a new frbr:Item with the same frbr:Work, frbr:Expression, and frbr:Manifestation.

Table 5.1: The first table contains class mappings between FRBR and PROV-O. The second table contains property mappings between FRBR, FRIR, and PROV. PROV superproperties are columns and FRBR and FRIR subproperties are rows. Prefix mappings for the first two tables are in the third.

Subclass                     Superclass
frbr:Event                   prov:Activity
frbr:ResponsibleEntity       prov:Agent
frbr:Endeavour               prov:Entity
nie:DataObject               prov:Entity

Subproperty                  wasDerivedFrom   alternateOf   specializationOf
frbr:adaptionOf                    X
frbr:imitationOf                   X
frbr:reconfigurationOf             X
frbr:transformationOf              X
frbr:abridgementOf                 X               X
frbr:arrangementOf                 X               X
frbr:reproductionOf                X               X
frbr:summarizationOf               X               X
frbr:translationOf                 X               X
frbr:alternateOf                                   X
frbr:revisionOf                                    X
frir:redirectsToTransitive                         X
frbr:embodimentOf                                                  X
frbr:exemplarOf                                                    X
frbr:realizationOf                                                 X

Prefix   URI
frbr:    http://purl.org/vocab/frbr/core#
frir:    http://purl.org/twc/ontology/frir.owl#
prov:    http://www.w3.org/ns/prov#
nie:     http://www.semanticdesktop.org/ontologies/2007/01/19/nie#

As introduced in Chapter 4, we have identified two levels of cryptographically computable identity: content and message. A number of content digests have been developed for RDF graphs, spreadsheets, images, and XML documents that provide the same digest hash regardless of any particular serialization. We use these to computationally identify frbr:Expressions. Further work on creating content digests will incrementally improve the ability to identify common frbr:Expressions.

These identifiers fill out the means by which to identify the representation referents, as shown in Figure 5.3.
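
To make the two levels of computable identity concrete, the following toy sketch (in Python) computes a message digest over a file's exact bytes, identifying its frbr:Manifestation, and a content digest over a canonical rendering of tabular cell values, so that a CSV file and an equivalent Excel export would share an Expression-level identifier. The helper names and the canonicalization choices here are illustrative assumptions, not the digest algorithms used by FRIR.

import csv
import hashlib

def message_digest(path):
    # Bit-pattern identity: any change to the bytes changes the digest.
    with open(path, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()

def table_content_digest(rows):
    # Content identity for tabular data: hash a canonical rendering of
    # the cell values (tab-delimited cells, newline-delimited rows), so
    # the digest is independent of the file format the table came from.
    canonical = '\n'.join('\t'.join(cell.strip() for cell in row)
                          for row in rows)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()

with open('data.csv', newline='') as f:  # hypothetical input file
    print(table_content_digest(csv.reader(f)))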

5.5 Explaining HTTP with FRBR, FRIR, and PROV-O

A URL denotes a single frbr:Work. Unless otherwise specified, two identifiers can, in RDF, potentially denote the same thing, and URLs are perfect examples of this. If a web site is mirrored, a page on the mirror corresponds to a page on the original; those two pages can be thought of as the same frbr:Work within the FRBR/FRIR perspective. Content retrieved from a URL can change over time, but is expected to have the sort of coherence defined by frbr:Work as "a distinct intellectual or artistic creation" [78]. HTTP 1.1 [67] introduced content negotiation, which makes it possible to abstract a URL away from any one particular file format. When a client asks an HTTP server for a MIME type at a URL, the server can respond with many different possible files depending on how the content is negotiated. If the client asks for plain text, the server will try to find the best way of representing the content of the URL in plain text. This idea of "same content regardless of format" is built into frbr:Expression. As previously discussed, the bit sequence of a file aligns very closely with frbr:Manifestation, so we use message digests to express this. frbr:Items can be files on disk, but they can also be data as streamed over a network connection. We uniquely identify the data streamed over a particular HTTP transaction using the combined message digest of the HTTP header and content. Since the header includes the exact time that the transaction occurred, the likelihood of a frbr:Item collision is very low. This enables provenance trace assertions to be applied to individual HTTP transactions without having to store the entire transaction.

An HTTP GET can be a very simple transaction. A client makes a request to a server for a particular URL; the server looks up which file corresponds to that URL and copies it to the network channel in response. The client then copies the data it sees from the network channel and either saves it to disk or displays it on screen. Things can become much more complicated on both ends, but these complications can be explained using current provenance representations, including the W3C PROV-O standard [11].

Figure 5.3: Relating URIs, Resources, and Representations using FRIR, FRBR, and the semiotic triangle. URLs are symbols that identify resources, which in semiotics are referents and considered frbr:Works in FRBR. The representation of that resource is the content that comes from dereferencing the URL, and is composed of a frbr:Expression, frbr:Manifestation, and frbr:Item. The proposed content identities create implicit symbols (URIs) for each level of representation. Users can then use the level of abstraction that suits their task.

This simple case, however, belies the subtleties that we discuss above. The following is a formalization of an HTTP GET request and response composed of common provenance constructs (events, generated by, used, etc.) from PROV-O:

HTTP GET: The server and client both share an event, the HTTP connection, which is composed of a request and response. The request is generated by the client and transmitted to the server. It is itself an Item with a singular FRBR stack. The request is for a specific frbr:Work, and if there are Accept headers sent, then the request is for a frbr:Manifestation with specific properties (the file format). The server then uses the request to generate a response, which is an Item of the URL’s frbr:Work. This frbr:Item only exists on the network channel, and if the client saves the frbr:Manifestation to disk, it produces another frbr:Item. The response Item is derived from the server’s file frbr:Item, and the client’s file frbr:Item is derived from the response Item. All three items share the same Manifestation, Expression and Work (the URL).
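
As a hedged illustration of this formalization, the following Turtle sketch (parsed here with RDFlib) describes a GET response Item that is derived from the server's file Item and exemplifies the Manifestation, Expression, and Work of the URL. The ex: URIs are placeholders, and the triples shown are a simplification of what pcurl.py actually records.

from rdflib import Graph

example = """
@prefix frbr: <http://purl.org/vocab/frbr/core#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix ex:   <http://example.org/> .

# The HTTP connection is a shared activity that used the request
# and generated the response Item.
ex:httpGet a prov:Activity ;
    prov:used ex:requestItem .

ex:responseItem a frbr:Item ;
    prov:wasGeneratedBy ex:httpGet ;
    prov:wasDerivedFrom ex:serverFileItem ;   # copied from the server's file
    frbr:exemplarOf ex:manifestation .

ex:manifestation a frbr:Manifestation ;
    frbr:embodimentOf ex:expression .

ex:expression a frbr:Expression ;
    frbr:realizationOf <http://example.org/page> .

<http://example.org/page> a frbr:Work .       # the URL itself
"""
g = Graph()
g.parse(data=example, format='turtle')
print(len(g), 'triples')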

HTTP POST: A similar explanation can be made for HTTP POST requests, which send a document as input content. In this case, both request and response content can be represented as FRBR stacks with no explicitly identified frbr:Work. Because web servers that handle POST requests derive their responses from the request, their handling can be formalized as a derivation edge in a provenance graph using the POST URL as an agent controlling the transaction process. Two HTTP request methods, PUT and DELETE, are used specifically to change the value of the frbr:Work by creating a new frbr:Expression (PUT) or invalidating existing frbr:Expressions (DELETE).

HTTP also provides other request methods to ask for services and information about a particular resource. These metadata request methods, HEAD and OPTIONS, do not provide information resources as discussed, and so are not in the scope of this paper. Similarly, the HTTP request methods TRACE and CONNECT are more functional in nature, dealing more with the actual server than its content, and are also outside the scope of this paper.

5.6 Implementation

We provide an implementation of curl called pcurl.py17 that records the provenance of an HTTP GET transaction using FRBR18, FRIR19, the Nepomuk File Ontology (NFO)20, PROV-O21, and HTTP-in-RDF22. We show a retrieval of the HTTP-in-RDF core classes as an example in Figure 5.4. We use message and content hashes to generate URIs for frbr:Expressions, frbr:Manifestations, and frbr:Items to allow for automatic aggregation of endeavours that share the same hash. Future use of OWL keys and multiple digest algorithms is enabled through creation of frir:ContentDigest and nfo:FileHash instances. In Figures 5.5 and 5.6 we also show how transcoding and mirroring are represented in the FRIR model.
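
A minimal sketch of the naming idea behind pcurl.py follows (this is not its actual code or command-line interface): the digest of the response body names the frbr:Manifestation, while the combined digest of the HTTP headers and body names the transaction-specific frbr:Item.

import hashlib
import urllib.request

def fetch_with_hashes(url):
    with urllib.request.urlopen(url) as resp:
        content = resp.read()
        headers = str(resp.headers).encode('utf-8')
    # Message digest of the bytes received: identifies the frbr:Manifestation.
    manifestation = hashlib.sha256(content).hexdigest()
    # The headers include the Date of the transaction, so this digest is
    # effectively unique to this transfer: identifies the frbr:Item.
    item = hashlib.sha256(headers + content).hexdigest()
    return content, manifestation, item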

5.7 Discussion

Using the four-part FRIR stack to identify web resources makes it possible to do a number of useful things. For instance, in cases such as the weather report, RSS feeds, and XML files, the same information is conveyed in multiple formats at different URLs. FRIR naturally expresses this by asserting that the frbr:Works (the page URLs) are owl:sameAs each other. More concrete levels of the FRBR stack, such as frbr:Manifestation and frbr:Item, however, will be distinct because of the differing formats and different physical locations of the representations. The identity between the frbr:Works and frbr:Expressions of these two URLs can be expressed in semantic site maps, so that tools that prefer one format of data over another can discover which URLs can be accessed without concern for missing content from varying formats. In fact, we have previously argued that owl:sameAs has been overextended in Linked Data assertions [81]. Additionally, we have shown how FRIR can be used to provide clarity to the management of Open Government Data (OGD) [79] and have argued that FRBR constructs can be used to provide a clear description of sources of information on the web [82].

17http://purl.org/twc/software/pcurl.py
18http://purl.org/vocab/frbr/core
19http://purl.org/twc/ontology/frir.owl
20http://www.semanticdesktop.org/ontologies/2007/03/22/nfo
21http://purl.org/twc/page/prov-o
22http://www.w3.org/TR/HTTP-in-RDF/

Figure 5.4: Results of applying pcurl.py to retrieve the weather result example. The HTTP-in-RDF, FRBR, FRIR, PROV-O, and NFO vocabularies are used to create RDF descriptions of the representation received when the URL is requested. Entities are named using message and content digests; the HTTP transaction Item is associated with the file Item, which in turn has a FRBR stack representing all four aspects, from the concrete file to the abstract URL/Referent/Work.

By formally modeling the description of web retrieval, we can compare the content received at different levels. In our weather report example, as the weather changes, so does the data. Two clients may see different data if they access it at different times. frbr:Works and frbr:Expressions can be used to show that content has changed, even when accounting for potentially different data formats, as is the case with the weather (XML vs. RSS). Similarly, using content digests and cryptographic signatures, clients can assert that they have seen specific content, regardless of format, on a particular web page. This is critical for access to scientific databases: available information changes daily and is released in different representations, which is convenient but can hinder experimental repeatability. Being able to assert which data was used in an experiment improves the transparency and veracity of the sciences that take advantage of that data. Finally, this theory of web information and access provides a way to create consistent provenance assertions about access of web information resources, which can improve the interoperability of provenance statements about access of information resources.

5.8 Conclusion

We have shown how the use of FRBR and FRIR can help to describe the relation between a URL and the representation obtained using HTTP. We have also shown how this new representation describes a richer set of entities that can be identified by different elements from FRIR. Thus it is possible to use Content, Message, and Transaction digests to identify the Expression, Manifestation, and Item aspects of the representation. This can lead to a semantically richer description of an HTTP GET operation that includes provenance about the information published and transmitted on the web at each level of abstraction. As future work, there are several paths we may take. In this paper, our focus has been on URLs that identify information resources; there is also the question of what (and how) non-information resources can be described in terms of FRBR and FRIR. This is particularly interesting when considering the relationships between URL frbr:Works that are associated by HTTP 303 redirections. Additionally, the solutions we present here are in principle compatible with the proposed changes to what has been called the HttpRange-14 issue [83]. Applying FRBR and FRIR to address the relationships between a resource, its representations, and its identifiers in a clear manner can serve as a standard pattern for the provenance of web resource access, comparison, and integration.


Figure 5.5: An example of transcoding a histology image from a large JPEG to a small thumbnail PNG. The frbr:Expression and frbr:Work are the same across the transcoding, but the frbr:Manifestations and frbr:Items are all distinct. This allows, for instance, a patient to verify that the low-resolution image shown to them has the same content as the higher-resolution image used to actually perform the analysis, even though the formats and sizes differ. This graph was produced using pcurl.py.




Figure 5.6: An example of mirroring content between web sites. Here the Yale Image Finder [80] provides a mirror of an image published at PubMed Central. Since the file is an exact copy, the frbr:Work, frbr:Expression, and frbr:Manifestation align, while the individual copies (frbr:Items) are different. This graph was produced using pcurl.py.

CHAPTER 6
RDF Graph Digest Algorithm 1

We present a new digest algorithm for Resource Description Framework (RDF) graphs called RGDA1 (RDF Graph Digest Algorithm 1), based on the digest method from Sayers and Karp and the graph canonicalization method traces from McKay and Piperno. RGDA1 provides average-case polynomial-time computation of reproducible, unique identifiers for all RDF graphs. Because it is based on RDF, RGDA1 also supports directed graphs with labeled edges, and it is optimized for use where most nodes in the graph are already labeled. RGDA1 has been incorporated into RDFlib as its default graph canonicalization algorithm. We evaluate the advantages and limits of its use in identifying RDF data by creating the BioDomain benchmark, which uses 229 OWL ontologies from BioPortal and the Open Biomedical Ontologies (OBO) Foundry. The benchmark experimentally shows that RGDA1's performance is weakly polynomial in the RDF statement count of those ontologies.

6.1 Introduction

Graph canonicalization, a critical first step in computing an RDF graph digest, has no known worst-case polynomial-time algorithm, but a number of polynomial algorithms are available for the average case [84]. We will discuss a variation on the traces [84] technique that produces reproducible cryptographic digests for all RDF graphs. We hold the contents of the graph to be static, as different inference rules (or subsets of rules) produce different graphs. Further, different reasoners can produce different results even using the same rules, based on their evaluation strategies. Computing a graph identifier for a graph's inference closure therefore requires knowing the exact inference rules, which is knowledge external to the graph.

To understand the problem of identifying graphs is to understand the role of blank nodes in RDF. Any RDF statement that has an identified resource or literal in every slot of its triple is identifiable by itself, as there is no additional context to consider. Further, RDF states that its graphs are sets of statements: any given statement can only exist in a graph once. However, the use of blank nodes provides an opportunity to have statements that could be indistinguishable, depending on the scheme used to identify blank nodes. Blank nodes have been subject to significant discussion, and their practical use in most RDF has varied based on the implementer's understanding of them [85]. They do have practical uses, however, especially since they are required to express property restrictions and complement, union, and intersection classes.
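
A small example of the problem, under illustrative URIs: the two blank nodes below are indistinguishable from the graph's structure, so any per-statement hashing scheme must first assign them canonical labels before the two statements can be told apart.

from rdflib import Graph

data = """
@prefix ex: <http://example.org/> .
ex:subject ex:hasRelatedItem [] , [] .
"""
g = Graph()
g.parse(data=data, format='turtle')
# Two triples that differ only in which blank node they mention; their
# serializations vary from run to run unless blank nodes are canonicalized.
for s, p, o in g:
    print(s.n3(), p.n3(), o.n3())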

6.2 Related Work

The idea of computing graph digests for identifying RDF graphs is not new. Carroll [7] proposed a system based on canonical serialization. Carroll's algorithm only attempts one-step deterministic labeling based on the immediate neighbors of each blank node. In fact, Carroll gives examples of blank nodes that are not successfully labeled, so the algorithm does not produce stable identifiers for all RDF graphs. Canonical serialization poses a significant number of problems, the first of which is that one must establish an order over indistinguishable statements in a graph, which is difficult because of the ambiguity introduced by automorphisms. An additional problem is that canonical serialization requires a significant number of decisions to be made as to how the graph will be canonicalized. For instance, what will the canonical serialization format be: XML, Turtle, N-Triples, or something else? Should all URIs be fully expanded with no prefix namespaces? Further, an ordering of triples must be determined: should they be sorted using code point ordering? Finally, canonical serialization requires that the document be serialized before the identity is computed. Algorithms that work directly on graphs can be applied directly to the data structure, which reduces computational and memory overhead for larger documents. Additionally, the algorithm Carroll suggests for canonicalizing blank nodes leaves many nodes unlabeled that could be canonically labeled using other techniques.

Sayers and Karp proposed a serialization-independent method for producing graph digests [6]. This method has significant benefits: if two graphs are disjoint (contain no common statements), then the digest of the combined graph can be computed from the digests of each constituent graph. The algorithm produces a serialization for each statement, hashes the serialization, and applies an order-independent aggregation function (such as sum or product) to those hashes. However, their means of skolemizing (locally naming) blank nodes requires coordination between the generator and consumer. In our use cases, no such coordination is possible, since the consumer can be anyone. The skolemization method must therefore be integrated into the digest algorithm, and we must locally identify blank nodes to the best of our ability. We extend their work with a more appropriate treatment of blank node identities based on graph canonicalization. Kuhn and Dumontier [86] provide a canonical serialization-based algorithm as part of their "Trusty URI" scheme, but they are able to require users to skolemize the blank nodes in their graphs. In most cases, however, graph canonicalization algorithms must take RDF as it comes and accept all legal RDF graphs.

6.2.1 Graph Canonicalization

We cannot review the entire field of graph canonicalization, as the literature is vast. Read and Corneil [87] and Gati [88] document the field up to 1979, while [84] covers the best-performing algorithms to date (2014). Parris and Read [89] developed the current dominant strategy of vertex group refinement and individuation. Most articles about graph canonicalization provide very formal mathematical descriptions of the algorithms that can be quite dense; we will attempt to provide informal descriptions that give a foothold for understanding the algorithms. Refinement starts by recursively splitting nodes into separate colors based on adjacency information. As nodes are given a particular color, that difference in adjacency is passed on to their neighbors. Once all possible distinctions are made between nodes, algorithms then attempt to individuate vertices with shared colors. An individuation operation takes a vertex that shares a color with others and produces a unique color for it in some consistent way; the method of doing this is generally specific to the algorithm in question. Individuation algorithms attempt to provide a consistent order to individuations so that the resulting colorings are always reproducible. This usually requires exploring a tree of individuation options and choosing a "best" one based on some consistent criteria. Each individuation is followed by a refinement operation that uses that individuation to provide distinctions between as many vertices as possible.

McKay [90] introduced the first canonicalization algorithm that could handle hundreds of vertices and large automorphism groups. Now called nauty, it uses depth-first search to choose individuations and prunes subtrees that are either automorphic with the best node so far, or are lexicographically larger in the sort. Bliss [91] and saucy [92] follow depth-first strategies similar to nauty's, but are optimized for sparse graphs. Traces [84] proposes a method of breadth-first search with experimental incursions to leaves of the individuation tree to discover automorphisms.

As discussed above, most RDF graph canonicalization algorithms do not attempt to address blank node canonicalization, or do so only in a limited manner. The only prior implementation that attempted to canonicalize blank nodes is bundled with RDFlib [93]. The RDFlib implementation as written has three minor flaws and one major flaw. The first is that it directly hashes Python tuples containing RDFlib objects, which makes it difficult to implement a matching function in another language. Second, it relies on the MD5 hash algorithm [94] to provide a hash of each tuple; MD5 was broken in 2005 [95] and is deprecated in favor of SHA-256 and its siblings. The third flaw is that, after computing a hash for each component of a blank node's neighborhood, it sorts the hashes and computes a new hash from the resulting tuple. The major limitation of the RDFlib algorithm is that it does not attempt to break automorphisms of any kind. When used with a graph digest algorithm, nodes that use the same identifier produce duplicate statements within the graph, resulting in identical labelings for graphs that are actually different. Our new implementation of RGDA1 in RDFlib fixes these issues by combining Sayers and Karp's algorithm with an implementation of traces.

6.3 Implementation

One major benefit of RGDA1 is that it attempts to minimize the number of decisions made in its implementation, and then uses methods for the remaining decisions that are easily reproducible in a variety of implementations. We contributed a reference implementation of RGDA1 to the RDFlib project on GitHub23. The RGDA1 algorithm was released as part of RDFlib 4.2.0.24 As of April 20th, 2015, RDFlib 4.2.0 has been installed over 16,000 times in the past month via the Python Package Index (PyPI).25

We use Sayers and Karp's statement digest and aggregation technique [6], where a message digest is computed for each statement in a graph and the digests are then combined using an aggregation function. We chose the sum function to aggregate statement digests and SHA-256 to compute a message digest of each statement. Sayers and Karp suggest the use of either product (∏) or sum (∑). Product has the disadvantage of quickly requiring an integer overflow size on the result, which means that the size needs to be communicated with the hash; sum does not, and therefore requires less pre-coordination. While many graph algorithms have relied in the past on MD5 [94], it is now possible to produce MD5 digest collisions at will to produce forgeries, and MD5 has been deprecated in favor of the SHA family of message digests [95]. SHA-1 has also been found to be vulnerable to similar attacks, so it too is deprecated [96]. SHA-256 is an improved version of SHA [97] that is generally considered secure against forgery attacks [98]. Further, we use the N3-style canonical serialization of s, p, and o delimited by spaces, as seen in Equation 6.1. N3 serialization is stable, self-contained, and consistent across implementations, and it does not vary unless the actual content varies.

\sum_{i=0}^{n} \text{SHA-256}\big(\text{N3}(s_i) + \text{' '} + \text{N3}(p_i) + \text{' '} + \text{N3}(o_i)\big) \qquad (6.1)
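
A short sketch of Equation 6.1 using RDFlib's N3 term serialization and Python's arbitrary-precision integers follows. It illustrates the statement digest and sum aggregation, not the RGDA1 reference code, and it assumes blank nodes have already been given canonical labels (otherwise their N3 form is not stable).

import hashlib
from rdflib import Graph

def statement_digest(graph):
    # Sum (the chosen aggregation function) of the SHA-256 digests of the
    # space-delimited N3 serializations of each triple.
    total = 0
    for s, p, o in graph:
        line = ' '.join(term.n3() for term in (s, p, o))
        digest = hashlib.sha256(line.encode('utf-8')).digest()
        total += int.from_bytes(digest, 'big')
    return total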

23http://github.com/jimmccusker/rdflib
24https://github.com/RDFLib/rdflib/releases/tag/4.2.0
25https://pypi.python.org/pypi/rdflib

In order to implement traces for RDF, we make some adjustments to parts of the algorithm. RDF is a directed graph where some of the nodes are already labeled by URIs and literals, and all edges are labeled by URIs. A successful graph canonicalization algorithm needs to account for this, and we adjust traces accordingly.

First, we only need to consider blank nodes and the nodes adjacent to them. Because literals and URI-identified resources are already labeled, we do not need to find labels for them; instead, we use the literal and URI identifiers to provide colors for non-blank nodes. Rather than using the degree of a node to distinguish between nodes, we distinguish based on the color of the neighbor and the label of the edge. Node colors are expressed as tuples of tuples: as a node is distinguished, a new tuple is added to its color. Each distinguishing tuple is of the form (s, p, o), where s is replaced by 1, or o by 3, when it is the blank node being distinguished. When a blank node is used to distinguish another, its current color is converted into a single integer using Sayers and Karp's algorithm on the tuple of tuples. This hash is used in the coloring of other nodes. It therefore allows for the aggregation of color from broader and broader contexts, but only incorporates enough to establish distinct colors for nodes. This color hash is used to produce the final identifier for the blank node once the algorithm is complete.

For individuation, the algorithm proceeds as in traces. The maximal individuation is produced by sorting the nodes of the tree at a given level using the tuple-based color of the graph at that point of individuation. This has the benefit of sorting the most distinct colorings last, which minimizes the depth of the search tree. See Listing 6.1 for pseudocode of the combined algorithm:

Listing 6.1: Pseudocode of RGDA1, including the reference implementation of traces.

def rgda1(G):
    V, E = nodes and edges of G
    C = initial coloring of all BNodes in V
    C = refine(c in C where not discrete(c))
    while any color in C is not discrete:
        C = individuate(c in C where not discrete(c))
        C = traces(c in C where not discrete(c))
    for c in C:
        c.nodes[0].id = hash(c.tuple)
    return hash(E)

def hash(triples):
    result = 0
    for t in triples:
        result += sha256(' '.join([N3(x) for x in t]))
    return result

def refine(colors):
    colors = sort colors by length, then by hash(color.tuple)
    sequence = shallow copy of colors
    while |sequence| > 0 and any color in colors is not discrete:
        W = sequence.pop()
        for c in coloring where not discrete(c):
            new_colors = sort(distinguish(c, W) by hash(x.tuple))
            replace c in coloring and sequence with new_colors
    return colors

def discrete(color):
    return color has exactly one node

def distinguish(color, W):
    for every node in color:
        extend its existing color tuple with:
            triples in V containing (node, ?p, w) for w in W,
            using pattern (1, ?p, hash(W.tuple))
        extend its existing color tuple with:
            triples in V containing (w, ?p, node) for w in W,
            using pattern (hash(W.tuple), ?p, 3)
    return coloring with new colors

def individuate(node, color, coloring):
    color_copy = copy(color).remove(node)
    individuated = copy(coloring)
    individuated.remove(color).append(color_copy)
    individuated.append({
        tuple: color.tuple + tuple(|nodes in color|),
        nodes: [node]
    })
    return individuated

def traces(coloring):
    for color, candidate in non-discrete colors:
        individuated = individuate(candidate)
        refined = refine(individuated)
        key = sum([hash(C.tuple) for C in refined])
        exp = experimental path of individuations/refinements
        exp_key = sum([hash(C.tuple) for C in exp])
    best = refined where min(key) and unique(exp_key)
    discrete = [x for x in best if discrete(x)]
    if |discrete| == 0:
        for coloring in best:
            new_coloring = traces(coloring)
            key = sum([hash(C.tuple) for C in new_coloring])
        discrete = new_coloring where min(key)
    return discrete[0]
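
Since RGDA1 ships with RDFlib 4.2.0, a graph digest can be obtained through the rdflib.compare module; the method names below reflect that release's API and should be treated as a usage sketch rather than normative documentation.

from rdflib import Graph
from rdflib.compare import to_isomorphic

g = Graph()
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:a ex:knows [ ex:name "someone" ] .
""", format='turtle')

ig = to_isomorphic(g)        # canonicalizes blank nodes
print(ig.graph_digest())     # reproducible integer digest of the graph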

6.4 Evaluation Methods

Our run of the BioDomain benchmark records the following metrics for 181 ontologies in the biomedical domain from BioPortal [99] and 48 from the Open Biomedical Ontologies (OBO) Foundry [100], totaling 229 ontologies:

Th the total time (in seconds) it takes to fully compute a digest for a given RDF graph.

Tc the time (in seconds) it takes to find repeatable, canonical identifiers for each blank node in a given RDF graph.

S the number of statements (triples) contained in a given RDF graph.

B the number of blank nodes in a given RDF graph.

From these metrics we can infer the following additional metric:

Ti = Th − Tc the time (in seconds) it takes to compute a graph digest given a graph with fully canonicalized blank nodes. This is the time it takes to run the Sayers and Karp algorithm.

All benchmarks were run on an Amazon Web Services (AWS) Elastic Compute Cloud (EC2) c3.4xlarge node using Amazon Machine Image ami-127def7a. The image is a stock Ubuntu 14.04 installation with minimal installed software, but has been configured to easily run the BioDomain benchmark. The benchmark can be configured for multithreaded execution, so we were able to process graphs concurrently using the EC2 c3.4xlarge node configuration, which provides 16 compute cores. Approximately half of the ontologies (177) downloaded from BioPortal could not be loaded by RDFlib; a summary of load errors is included in Table 6.1. We include the first 229 ontologies finished by the benchmark tool, 181 from BioPortal and 48 from OBO. They are available as supplemental material on request. Many of the ontologies that are not parseable by RDFlib are from the OBO Foundry, which publishes using its own non-RDF file format. We crawled the OBO Foundry web site for OWL representations of these ontologies and added them to the benchmark.

6.5 Evaluation

We provide three methods of evaluation: an analysis of algorithm complexity, an assessment of algorithm portability, and a performance benchmark of the algorithm on the OWL ontologies published at BioPortal [99] and by the OBO Foundry. The BioDomain benchmark shows how RGDA1 performs in the average case against diverse, semantically rich, real-life ontologies.

Table 6.1: 178 ontologies could not be loaded from BioPortal using RDFlib. Most of the problem ontologies were in OBO file format, leading us to access the OBO OWL ontologies directly from their web site.

Problem                          Count
OBO file format                  91
Not Well Formed                  33
Repeat Node Elements             23
Not Found                        17
Forbidden                        11
Invalid Token                    2
Invalid Property Attribute URI   1

6.5.1 Complexity

The complexity of our algorithm is taken in two parts: the graph digest algorithm itself, and the blank node canonicalization. Sayers and Karp show that the graph digest algorithm is O(N) in time complexity [6], but it may have a large constant time per statement when using highly secure cryptographic hash algorithms. The traces algorithm provides average-case polynomial time and worst-case exponential time when dealing with large automorphism sets. Since the number of blank nodes (and automorphisms) is generally a small fraction of most RDF graphs, the cost of this algorithm remains low as a function of triple count.

6.5.2 Algorithm Portability

RGDA1 is platform independent. It relies only on publicly available algorithms, such as SHA-256 and the N3 serialization format, or simple derivations of the same. Any implementing platform or programming language must support bigints (arbitrarily-sized integers), SHA-256, and N3-based serialization. These features are supported or implementable in most major programming languages. RGDA1 is also non-forgeable based on current cryptographic research: because RGDA1 graph digests are composed of SHA-256 message digests, they have the same resilience against forgery and collision attacks.

Figure 6.1: Th runtime of graphs without blank nodes. Here, Th ≈ 0.0012S (r > 0.98, p < 8.72 · 10^-39).

6.5.3 Benchmark Results

Overall performance on graph canonicalization is mostly dependent on the size of the graph (r = 0.74, p < 1.5 · 10^-41, where r is the correlation coefficient and p is the two-sided probability of the null hypothesis that the slope of the line is actually zero), but diverges significantly based on whether or not a graph contains blank nodes. Graphs without blank nodes (see Figure 6.1) display linear performance overall at 763 triples/sec. Graphs with blank nodes incur a polynomial time penalty for computing node color, where Th ≈ 0.000051S^1.59 (see Figure 6.2).

Figure 6.2: Th runtime of graphs with blank nodes. Here, Th ≈ 0.000051S^1.59 (r > 0.87, p < 2.0 · 10^-57). Graphs with blank nodes that required individuation are colored yellow, and other graphs with blank nodes are colored blue.

Graphs with automorphisms do not perform significantly differently overall, although they tend to take as long as the worst-performing non-automorphic graphs. This suggests that individuating automorphisms does not significantly impact the average performance of graph canonicalization in real-world graphs. Rather, most per-statement time is spent computing distinctions between blank nodes that do not need to be individuated. Out of 229 ontologies tested, only 13 exhibited blank nodes that needed to be individuated. All of these ontologies have redundancies in assertions. For instance, VIVO Core (http://vivoweb.org/ontology/core#) has a duplicate restriction on vivo:ConferencePaper.

Figure 6.3: Ti performance for non-blank node and blank node graphs, where Ti ≈ 0.00145S (R > 0.98, p < 2.22 · 10^-77). Graphs with no blank nodes are colored red, graphs with blank nodes that required individuation are colored yellow, and other graphs with blank nodes are colored blue.

The Clusters of Orthologous Groups (COG) Analysis Ontology (http://purl.obolibrary.org/obo/cao.owl#) declares that the class Brucella melitensis (NCBITaxon:NCBITaxon_29459) obo:hasRelatedItem 20 empty blank nodes with no other assertions about them. It is unclear what the provenance of these statements is, but these kinds of repetitions can generally be safely removed without changing the content of the ontology. A duplicate statement checker may be a useful step before computing the graph digest, but this may need to be presented to the user as they author the ontology, rather than during an automated process like graph digest computation.

Figure 6.4: Tc performance for blank node graphs, where Tc ≈ 0.00032B^1.88 (R > 0.95, p < 3.0 · 10^-94). Graphs with no blank nodes are colored red, graphs with blank nodes that required individuation are colored yellow, and other graphs with blank nodes are colored blue.

Once canonical identifiers for all blank nodes are determined, the data show that computing a digest for the graph is a linear operation, where Ti ≈ 0.00145S. Figure 6.3 shows the results of that operation. This translates to an average performance of 811 triples/sec on this step across all processed graphs. When considering the graph canonicalization step only, graphs without blank nodes suffer a small linear penalty of 9.39 · 10^-6 s per triple to determine whether the graph contains blank nodes. If there are blank nodes, the predicted time to canonicalize them is Tc ≈ 0.00032B^1.88 (r > 0.95, p < 3.0 · 10^-94). See Figure 6.4 for the performance curve.

6.6 Discussion

RDF graphs that do not contain blank nodes are strongly linear in time complexity. Since Linked Data publishing principles recommend avoiding blank nodes when possible, Linked Data-oriented RDF graphs should perform well using RGDA1. In fact, because the Sayers and Karp algorithm is associative across triples, it is easily parallelized for large graphs using technologies like Graphics Processing Units (GPUs) [101]. OWL ontologies, while they do exhibit nonlinear performance, seem to avoid the computational pitfalls of hard-to-identify graphs. It is especially notable that for real-world RDF graphs, automorphisms are relatively rare, and when they do exist, they do not significantly add to the time complexity of canonicalizing the graphs that contain them.

Incorporation into the RDFlib project has provided a number of benefits. Since RDFlib itself provides a framework for graph representation, only the algorithm itself needed to be implemented; because of this and the terse nature of Python code, the algorithm was implemented in less than 300 lines of code, including comments and performance instrumentation. Further, RDFlib uses graph isomorphism tests to validate round-trip conversions between different file formats, so this algorithm is used regularly to test RDFlib itself. Basing this implementation in RDF has the additional benefit of letting us test against real-world, graph-oriented data. While there are many graphs that might cause RGDA1 to take weakly exponential time (because of the worst-case performance of traces), the vast majority of RDF data does not exhibit these characteristics.

The traces algorithm belongs to an interesting family of algorithms: when the same color ordering and generation functions are used, any algorithm within that family (including nauty), or any improvement to such an algorithm, will still produce the same canonicalization. Always having the same canonicalization even after performance improvements means that graph digests produced today by RGDA1 using traces can be produced tomorrow by some successor algorithm that may use a descendant of traces or nauty. Graph digests can therefore be long-lived, even after the algorithms that produce them are supplanted.

Because of its average-case performance and its ability to provide cryptographically secure (based on Sayers and Karp), reproducible identifiers (because traces produces reliable canonicalization), RGDA1 is well suited for use in digital signature schemes and in graph authentication schemes in general. RGDA1 is also parameterless, so any developer can easily produce implementations that provide the same results; RGDA1 can therefore be used to identify graphs with no additional pre-coordination. Finally, RGDA1 can be adjusted in the future to use different message digest algorithms if SHA-256 becomes insufficiently secure. Since the message digest algorithm is a drop-in component, RGDA2, for instance, could simply use a different message digest.

6.6.1 Future Work

Some major opportunities are available for future work: online computation of graph digests in RDF stores, use of RGDA1 in data provenance, and parallelized computation of graph digests. RDF databases could keep a running tally of the hash for each graph: when statements are removed from a graph, the hash for each statement can be subtracted from the tally, and when statements are added, the hash for each statement can be added. The principal challenge is how to update the canonical identifiers of blank nodes that have statements added to or removed from them, which could cause cascading updates in blank node canonicalization. More generally, data provenance for RDF could use RGDA1 to identify datasets in frameworks like the Comprehensive Knowledge Archive Network (CKAN) [102] and in knowledge frameworks like nanopublications, similar to the use of trusty URIs [86]. Parallel computation of graph digests could take place as long as all statements for a given blank node are computed together so that the canonical identifier is preserved. A first pass could set aside any statements containing blank nodes and process the rest in parallel, leaving the blank node statements to be processed separately.

Finally, the initial refinement process might be used in the future to identify indistinguishable blank nodes in RDF data in editing tools, in the hope that users would be willing to pare down the redundancies, making future identification less costly.
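
The running-tally idea above follows directly from using sum as the aggregation function. A hypothetical store-side helper might look like the following sketch, with the per-statement hash computed as in Equation 6.1 (and with the caveat, noted above, that statements touching blank nodes need separate handling).

import hashlib

def triple_hash(s_n3, p_n3, o_n3):
    line = ' '.join((s_n3, p_n3, o_n3))
    return int.from_bytes(hashlib.sha256(line.encode('utf-8')).digest(), 'big')

class RunningGraphDigest:
    """Incrementally maintained digest for statements without blank nodes."""
    def __init__(self):
        self.total = 0

    def add(self, s_n3, p_n3, o_n3):
        self.total += triple_hash(s_n3, p_n3, o_n3)

    def remove(self, s_n3, p_n3, o_n3):
        self.total -= triple_hash(s_n3, p_n3, o_n3)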

6.7 Conclusion

The RDF Graph Digest Algorithm 1 (RGDA1) produces cryptographically secure hashes for all RDF graphs in an average-case linear runtime for graphs with no blank nodes, and an average-case polynomial runtime (relative to the number of blank nodes) for graphs with blank nodes. It uses the Sayers and Karp graph digest algorithm to produce consistent identifiers for graphs, but uses the traces algorithm to canonicalize blank nodes. RGDA1 uses no programming language-specific constructs, and is therefore portable across platforms and programming languages. We outlined the BioDomain benchmark, which shows that among 229 ontologies from BioPortal and the OBO Foundry there are very few complex blank node constructs that require use of the worst-case exponential algorithm, and that none of them perform worse than any of the other blank node graphs. This suggests that for most RDF data, RGDA1 is scalable to large graphs with significant complexity, without triggering worst-case behavior over large parts of such graphs.

CHAPTER 7
WebSig: A Digital Signature Framework for the Web

WebSig is a digital signature scheme for RDF graphs that is trustable, computable, and minimally repudiable. This is because WebSig is an attributable, linkable, portable, revisable, and verifiable digital signature scheme. In Chapter 2, we discussed these qualities and their sufficiency to produce trustable, minimally repudiable digital signatures on computable documents. This chapter will show that these qualities can be realized when signing RDF documents using a suite of technologies and standards. Through WebSig, we will show that these qualities can co-exist in a single digital signature scheme. WebSig's use of Functional Requirements for Information Resources (FRIR) representations of signed documents in RDF allows WebSig to be revisable. Further, WebSig's use of the Nanopublication framework allows it to be attributable and linkable. Finally, the combination of FRIR, RDF Graph Digest Algorithm 1 (RGDA1), and the RSA digital signature algorithm allows WebSig to be portable and verifiable. We will demonstrate that WebSig fulfills all five qualities, making it trustable, computable, and minimally repudiable.

7.1 Introduction

Conventional digital signatures are designed to sign specific byte sequences. Further, the standard infrastructure of public key cryptography does not include a mechanism for independently asserting who controls the key in question. This means that any assertions about the signing agent within the signed document are not clearly defined: the link between the signer and the content must be interpreted either relative to a fixed database or via some knowledge external to the document signature. Consequently, any database that collects digital signatures and signed documents must either itself be trusted to faithfully record the identity of the signer (and how they relate to the document itself) or defer to some trusted third party that can produce interpretations of the document that relate its content to the signing agent. We put forth a web-oriented digital signature scheme called Web Signature (WebSig)

that uses several existing standards and technologies to produce computable and verifiable digital signatures, where it is possible not only to identify the signer of the document using the emerging WebID standard, but also to store the resulting signatures and document contents in untrusted databases in a way that allows the content itself to be trusted, not simply one particular serialization of it. Finally, we evaluate WebSig for five qualities that we have shown to be necessary and sufficient for digital signature schemes to be both trustable and computable.

7.2 The WebSig Signature Scheme

WebSig is based on several standards and technologies, and endeavors to blend diverse technologies and standards to work together naturally. The WebSig signature scheme is based on the Nanopublication framework [13], which, for each graph of assertional knowledge, also requires a graph providing publication information (such as attribution) and a graph providing provenance of the assertion, often in the form of evidence of the truth of the assertion. We leverage this framework to create a graph of the signed document's assertions as the Nanopublication assertion graph, and then compute an identifier (URI) for that graph using Functional Requirements for Information Resources. That URI is then used in the publication information graph to declare that the document is attributed to the signer, via their WebID identifier. A FRIR identifier is in turn computed for the publication information graph, and that identifier is signed using one of the signing agent's private keys. The corresponding public key is published in the agent's WebID profile, allowing for independent validation of the signature as well as the document it signed. Figure 7.1 provides a detailed view of the relationships between the nanopublication components, the signing agent, and the original document. The WebSig ontology26 is the core ontology used, with additional predicates from PROV-O27, WebID28, and the ontologies used by FRIR (see Chapter 5 for details). Validation of WebSig web signatures requires the validation not just of the digital signature itself against the stated public key, but also of the attribution statement against the agent to whom the assertion is attributed.
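
The signing step itself can be condensed to "RSA-sign the RGDA1 digest of the PublicationInfo graph." The sketch below does that with RDFlib and the Python cryptography library; graph assembly, WebID publication of the public key, and the exact byte encoding of the digest are assumptions made for illustration, not the normative WebSig procedure.

from rdflib.compare import to_isomorphic
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

def sign_publication_info(pubinfo, private_key):
    # RGDA1 graph digest of the PublicationInfo graph (an integer),
    # encoded here as lowercase hex; the real scheme's encoding may differ.
    digest = to_isomorphic(pubinfo).graph_digest()
    message = ('%x' % digest).encode('ascii')
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA256())

# Illustrative key generation; in WebSig the public key would be
# published in the signing agent's WebID profile.
key = rsa.generate_private_key(public_exponent=65537, key_size=2048)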

26sig: http://purl.org/5am/websig
27prov: http://www.w3.org/ns/prov-o#
28cert: http://www.w3.org/ns/auth/cert#

Figure 7.1: A web signature is a special kind of nanopublication, where the Assertion and PublicationInfo are identified by their graph digests. The PublicationInfo graph states that the Assertion graph is attributed to an agent, who has in turn signed the identifier of the PublicationInfo using a private key whose public key they publish in their WebID page. The signature and its provenance are stored in the Provenance graph.

In this scheme, if the signing agent does not match the attributed agent, then the signature is in question, unless other signed assertions are known, such as the signing agent being able to act on behalf of the attributed agent. Further, the assertion and publication information graphs are validated against their graph digests, so any change in a graph would result in a mismatch with its graph digest. Step-by-step details of validation are shown in Figure 7.2.

The WebSig signature request workflow is inspired by the OAuth 2.0 authorization framework [103] and uses the Semantic Automated Discovery and Integration (SADI) web services framework [104] to provide a web service backbone to the process, as well as the WebID (formerly FOAF+SSL [105]) draft specification to provide authentication and a means of associating public keys with identified agents on the web.
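
The verification counterpart (cf. Figure 7.2 below), under the same illustrative assumptions as the signing sketch above: recompute the PublicationInfo graph digest, then verify the RSA signature against the public key retrieved from the signer's WebID profile.

from rdflib.compare import to_isomorphic
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def verify_web_signature(pubinfo, signature, public_key):
    # Recompute the graph digest and check the RSA signature (step 2 of
    # Figure 7.2). Key ownership (step 1) and the attribution/assertion
    # checks (steps 3 and 4) are separate graph-level validations.
    message = ('%x' % to_isomorphic(pubinfo).graph_digest()).encode('ascii')
    try:
        public_key.verify(signature, message,
                          padding.PKCS1v15(), hashes.SHA256())
        return True
    except InvalidSignature:
        return False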

Figure 7.2: Verification of a web signature is straightforward: 1) Confirm that the public key actually belongs to the signing agent. 2) Verify the signature against the key. 3) Validate that the referenced PublicationInfo is actually identified by the signed identifier by computing the graph's graph digest, and check that the attributed agent is the same as the signing agent. 4) Load the assertion graph from the identifier mentioned in the PublicationInfo graph and compute its graph digest, which should match.

The SADI web service framework provides a very simple pattern for semantic web services. Clients can send an HTTP GET to the service to receive a description of the service, including its input and output classes. To submit to the service, a client sends an HTTP POST with an RDF document to the service URL. The service looks for all instances of its input class and processes those instances into an output graph that provides those same instance URIs with the statements needed to make them valid instances of the service's output class. In the case of synchronous services, the service blocks until all matching instances are processed, and the resulting output graph is returned to the client. In the case of asynchronous services, the service continues to process the inputs while providing a set of URLs at which the results will be found when complete. The property rdfs:isDefinedBy is used to link the instance with its output location. The output location redirects to itself with a delay PRAGMA header until the result is available, at which time the actual result is given. The core of the WebSig workflow relies on SADI's asynchronous service pattern,

7.3 Implementing WebSig WebSig is implemented as a Python Flask application [106] using the Python SADI framework [107]. This SADI implementation supports synchronous and asyn- chronous services as well as the use of attachments – signature submitters can either make their requested documents available as linked data or can upload them with the signature request using the multipart/related mime type. WebSig supports au- thentication using either WebID or through conventional sign-in, and currently lets signing agents review the document as rendered HTML (in the case of RDFa), as the original source of the document, or as a parsed RDF graph (see Figure 7.5). Additional metadata of the request can be viewed as well (Figure 7.10). Further, once the document has been signed, users can see the publication info (Figure 7.6) and signature provenance (Figure 7.7) graphs. Currently, only agents who are party to the document (the signing agent or the requester) can view the results. 88

Figure 7.3: The sequence used by the signer, signature requester, and signature service agents when a signer is actively interact- ing with the signature requester. A signature requester, af- ter verifying the signer’s identity, discovers the signature re- quest service, where the requested document is posted. The requester receives a URL that the signer can visit to make their decision and redirects the signer to that URL. Once there, the signer determines if they want to sign it or not, and performs that action by submitting to the signing ser- vice. The signer is optionally redirected to a followup URL where the requester is notified of the signature event, where they are then able to proceed with whatever action required the signature in the first place.

7.4 Evaluation We evaluate this framework in the foliowing ways: we demonstrate the use case from Section 1.3; prove cryptographic integrity of the signature method; confirm that the resulting signatures are legally binding; and show how WebSig signatures are linkable, revisable, verifiable, attributable, and portable. 89

Figure 7.4: Similar to Figure 7.3, a signature requester can directly request a signature by discovering the Signing Agent's URI via a third-party registry such as a biobank. After they submit the request, the workflow is the same.

7.4.1 Use Case

We repeat below the use case from Section 1.3 and show how this use case can be satisfied using WebSig. This example will then be used to better illustrate the discussion of WebSig's five qualities.

Alice has been keeping track of her weight, sleep quality, and resting pulse for a year by pushing data from various instruments and apps into an RDF store. When she enrolls in a medical study, the investigator, Bob, sends her a document that has a privacy preference written in the Privacy Preferences Ontology (PPO) [8] to let his organization, CureLab, access part of her data. It also includes a prov:Activity [11] that represents the study and describes (in human-readable form) how the data will be used in it and the organization that is conducting it. Before sending it, Bob signs the document on behalf of CureLab, assuring that Alice's data will be used in a particular way.

Figure 7.5: An example web signature assertion that has been signed via the Provenance and PublicationInfo graphs.

Bob then sends the document to Alice, who also signs it. Alice then loads the signed document into her RDF store, which is able to verify that the document was signed by all parties to the document. Later on, Bob realizes that the study needs additional information in order to proceed. He sends a revision of the original document to Alice, who signs it. A year later, Alice has a change of heart and decides to revoke the contract, which suspends CureLab's access to her data.

Figure 7.6: An example web signature PublicationInfo graph, which includes an attribution of the Assertion graph to the Signing Agent.

This use case involves multiple signatures by several agents. First, Bob must somehow gain authorization from CureLab to sign on its behalf. This can be formalized by the controller of CureLab's private key (which can be the key used in their web site's SSL certificate) signing a document saying that Bob can be associated with Signing events acting as a delegate of CureLab. The class BobSmithSigningForCureLab in Listing 7.1 expresses the needed authorization using PROV-O.

Listing 7.1: Expressed using a subclass of prov:Activity, Bob Smith is authorized by CureLab to be associated with Signing activities as a delegate of CureLab.

Class: BobSmithSigningForCureLab
    EquivalentTo:
        websig:Signing
        and (prov:wasActivityOfInfluence some
            (prov:Delegation
             and (prov:agent value BobSmith)
             and (prov:qualifiedDelegationOf value CureLab)))
    SubClassOf:
        websig:Signing, prov:Activity

Figure 7.7: An example web signature Provenance graph, with the signature itself, the public key and PublicationInfo graph it was derived from, and the Signing Agent it was attributed to.


Bob and Alice can now each independently sign the document shown in Figure 7.5. A screen shot of the full signature page is shown in Figure 7.10. Since Bob is signing on behalf of CureLab, the two signatures are all that is needed to put the document into effect.

Figure 7.8: The new agreement that allows Bob to access Alice's date of birth, because it was removed from the list of restricted fields.

When Bob decides that he needs a revision of the agreement because he would like to know the subjects' dates of birth, to remove any age-related effects on sleep quality, he posts a new request to Alice. This request includes the new agreement (see Figure 7.8), which has an updated description explaining the changes and lacks the restriction on date of birth that was in the prior agreement. The request also includes additional metadata in its signature request (see Figure 7.9), stating that this document is the same work as a previous one and that it is a new revision of the old one. It also states that the old version was invalidated on 8/3/2014. Once this is signed by both Alice and Bob, the new metadata will also be adopted, rendering the old privacy preference invalid. When Alice decides that she does not want to participate in the study any longer, she submits a new signature request that simply contains new metadata about the work and latest assertions, stating that they were invalidated on the date she specifies. She can then sign the metadata, which will invalidate any and all revisions, future and past, relating to this agreement. A PPO-based application can then determine that those preferences have been invalidated, and will remove them from its working privacy preference graph.

Figure 7.9: The submission request for the new agreement includes metadata stating that the new assertion is a revision of the old one, and that the old assertion was invalidated on 8/3/2014.

7.4.2 WebSig is Linkable

Being expressed entirely in RDF, WebSig documents, signatures, public keys, publication information, and signatories are all accessible as Linked Data. While this information can be restricted by access controls, an authorized requestor can retrieve it in machine-readable form as RDF.
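As a minimal sketch of what this linkability affords, assuming a signature resource published at a hypothetical dereferenceable URL, a client can pull the signature graph directly into a local store with RDFlib:

from rdflib import Graph

g = Graph()
# rdflib performs HTTP content negotiation and parses the returned RDF.
g.parse("https://example.org/websig/signatures/42")

for s, p, o in g:
    print(s, p, o)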

7.4.3 WebSig is Attributable

WebSig requires that the signer sign not the document graph directly, but a secondary PublicationInfo graph. This PublicationInfo graph requires triples of the form document prov:wasAttributedTo agent, which explicitly assert the adoption of the document by the signatory. An example of this metadata is shown in Figure 7.6. Additionally, since the signed documents are also expressed in RDF, signatories can be mentioned using the same URIs that identify them in the PublicationInfo graph. The verification algorithm (Figure 7.11) also validates that the attributed signatory has signed the graph.
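A minimal sketch of the required attribution triple, built with RDFlib; the document and agent URIs are hypothetical placeholders, and a real PublicationInfo graph carries the additional FRBR and nanopublication structure described in Section 7.4.7.

from rdflib import Graph, Namespace, URIRef

PROV = Namespace("http://www.w3.org/ns/prov#")

pubinfo = Graph()
document = URIRef("https://example.org/agreement/expression/1")
signer = URIRef("https://example.org/agents/alice#me")

# The adoption assertion: the signed document is attributed to the signatory.
pubinfo.add((document, PROV.wasAttributedTo, signer))

print(pubinfo.serialize(format="turtle"))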

Figure 7.10: A screen shot of the signed PPO document with a description of the activities it will be used in.

7.4.4 WebSig is Portable

WebSig requires that documents are identified as a frbr:Work, frbr:Expression, frbr:Manifestation, and frbr:Item. The attribution of a signatory to a document happens at the frbr:Expression level, which means that those documents are identified by their RDF graph digest. Untrusted databases can therefore store and act on the data directly while being able to verify the digest of the document in place. Because of WebSig's use of the FRIR RDF Graph Digest Algorithm (see Chapter 6), signed RDF graphs, as well as their attribution graphs, can be validated within any RDF-compliant triple store. This means that agents can compute using these signed graphs in the knowledge that they are the same graphs that were signed by the agent. Finally, because revisions of documents are expressed and detected at the frbr:Expression level, it is possible to detect those changes live within an RDF graph database as well.
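The in-place verification described above can be sketched with RDFlib's compare module, which hosts the canonicalization used by RGDA1; the convention of carrying the digest in the Expression identifier, the file name, and the placeholder digest value are assumptions for illustration.

from rdflib import Graph
from rdflib.compare import to_isomorphic

def expression_digest(graph: Graph) -> str:
    """Canonicalize the graph and return its digest as a hex string."""
    return "%x" % to_isomorphic(graph).graph_digest()

stored = Graph()
stored.parse("assertion.ttl", format="turtle")  # local copy of the signed graph

claimed_digest = "deadbeef"  # placeholder: digest taken from the Expression URI
print("verified:", expression_digest(stored) == claimed_digest)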

7.4.5 WebSig is Revisable

Because WebSig document assertions are identified by content digest, it is possible to use FRBR relations to express their revision as part of a new publication info graph. This can be asserted in the signature request by saying that a new frbr:Expression, which realizes the same frbr:Work as the prior frbr:Expression, is a frbr:revision of the previous one. This invalidates the old frbr:Expression, which is modeled by asserting that the frbr:Expression has a prov:invalidatedAtTime of some timestamp. See Figures 7.8 and 7.9 for an example based on the above use case. Revocation of documents happens the same way, simply by providing an invalidation timestamp for the frbr:Expression and original frbr:Work. By including them in the publication information and then signing it, the signatory is stating new provenance about the document. Revision timestamps might be a possible point of forgery, however. Documents can potentially be backdated or postdated if the signer controls the clock of the computer. A solution to this would be an automated witnessing service that signs the signature with a statement about when the signature was witnessed. If the witnessing signature is generated within an acceptable window, then the revision timestamps can be believed.
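A minimal sketch of this revision metadata in RDFlib; the FRBR namespace, resource URIs, and property names follow a common FRBR vocabulary but are assumptions here rather than WebSig's normative terms.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

PROV = Namespace("http://www.w3.org/ns/prov#")
FRBR = Namespace("http://purl.org/vocab/frbr/core#")  # assumed FRBR vocabulary

g = Graph()
work = URIRef("https://example.org/agreement/work")
old_expr = URIRef("https://example.org/agreement/expression/v1")
new_expr = URIRef("https://example.org/agreement/expression/v2")

g.add((new_expr, FRBR.realizationOf, work))   # the new Expression realizes the same Work
g.add((new_expr, FRBR.revisionOf, old_expr))  # ...and is a revision of the old one
g.add((old_expr, PROV.invalidatedAtTime,      # the old Expression is invalidated
       Literal("2014-08-03T00:00:00Z", datatype=XSD.dateTime)))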

7.4.6 WebSig is Verifiable

We have already shown how to validate WebSig signatures. Digital signatures are subject to four levels of attack, the weakest of which is the existential forgery, where a public key can be used to produce a valid signature and content pair in which the content may be any possible random bit stream. The current RSA digital signature system is resistant to this scheme and all stronger attacks [16]. By extending the digital signature scheme deeper into the semantics of the message, we are able to strengthen the validation component of our scheme even beyond standard RSA. A valid web signature requires more than just a valid signature: the signed hash must correspond to a singular, closed attribution graph that must in turn attribute the signer to a singular assertion graph.

Figure 7.11: Web signatures are legal signatures because they follow a chain of proof: a Signing Agent has exclusive control over their private keys. When they sign the PublicationInfo graph, its identifier exclusively identifies that graph and no other. The PublicationInfo graph in turn states that an Assertion graph is attributed to the signing agent. Since the Assertion graph is in turn identified by both its graph digest and the digest of its original serialization, its digest cannot be forged either. Further, since the PublicationInfo graph digest identifies a graph that explicitly identifies another graph using a graph digest, there is no way to forge a new Assertion graph into the same PublicationInfo graph, because that would change the PublicationInfo graph's identifier. Finally, because it is provable that the Signing Agent has stated that a very specific Assertion is attributable to themselves, their declaration of adopting the content is explicit.

The attacker must therefore also be able to forge valid content digests and their corresponding graphs as well as the signature for them. RDF graph digests are as resilient to forgery as their underlying message digest algorithms [6, 68], so in order to produce an existential forgery, the attacker would need to be able to create some random valid RDF graph for which they can produce a signature. See Figure 7.11 for details of the chain of verifiability from signature to signed assertion.
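A minimal sketch of the cryptographic end of this chain, assuming an RSA signature over the PublicationInfo graph digest with SHA-256 and PKCS#1 v1.5 padding; the file names are hypothetical, and a full WebSig verifier also checks the attribution and FRBR structure described above.

from rdflib import Graph
from rdflib.compare import to_isomorphic
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

pubinfo = Graph()
pubinfo.parse("pubinfo.ttl", format="turtle")  # the signed PublicationInfo graph
digest = ("%x" % to_isomorphic(pubinfo).graph_digest()).encode()

with open("signer_public_key.pem", "rb") as f:
    public_key = serialization.load_pem_public_key(f.read())

with open("signature.bin", "rb") as f:
    signature = f.read()

# Raises cryptography.exceptions.InvalidSignature if verification fails.
public_key.verify(signature, digest, padding.PKCS1v15(), hashes.SHA256())
print("signature verified")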

7.4.7 Performance

WebSig uses RGDA1 to compute graph digests of documents and publication metadata. As shown in Chapter 6, RGDA1 has an average case runtime of T = 0.000051 · S^1.59, where S is the number of statements in the graph. The WebSig algorithm, after computing the graph digest of the document, executes in constant time because the output of RGDA1 is constant size (simply an integer). All other inputs are limited in size due to the fixed Nanopublication-based template. The overhead for producing WebSig signatures is very small at 108 triples across three graphs (not including the signed document) and is generally fixed in size, although users can supply extra triples in the signature request and PublicationInfo graphs. The originating signature request only needs to contain 3 triples. When the request is recorded, the FRBR stack and nanopublication structure are created, which adds 35 triples to the graph. This is a minimum of 38 triples for the signature request graph. The minimal publication information graph only contains 34 triples, most of which are the FRBR stack for the assertion. The provenance graph, which contains the signature itself, is 16 triples, and contains the FRBR stack for the PublicationInfo graph and the signature value itself.
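As a back-of-envelope illustration of the model above (taking the fitted constant at face value and assuming the runtime is in seconds):

def estimated_rgda1_runtime(statements: int) -> float:
    """Average-case RGDA1 runtime model from Chapter 6: T = 0.000051 * S**1.59."""
    return 0.000051 * statements ** 1.59

# A 1,000-statement graph is predicted to take roughly 3 seconds to digest.
print(estimated_rgda1_runtime(1000))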

7.5 Conclusion

We defined a digital signature scheme that allows for portable expression of verifiable signed documents across information systems, without requiring an honest broker. Further, this scheme supports the ability to modify signed documents and revoke them at a later date using additional digital signatures. It provides the ability to trace digital signatures back to the signing agents that control the keys used to sign, and makes it possible to understand how the signing agent is party to the signed document. We evaluated the scheme against three of the dynamic consent management requirements with positive results, and were able to create and sign an authorization document using provenance primitives from PROV-O. In the next chapter we will further evaluate WebSig by incorporating it into a full prototype of a dynamic consent management system.

CHAPTER 8
Discussion

WebSig brings a number of advantages to both digital signatures and the semantic web. First, the semantic web has provided a means for machine-readable data on the web. This has resulted in a new class of semantically-enabled agents that can act on data published to the web. However, the ability to act on information on the web is limited by those agents' ability to trust what is said on the web. There are a number of ways that trust can be built, and one of the key components of WebSig, nanopublications, provides a means of attributing statements to agents as well as of explaining the provenance of those statements. However, a large class of data, especially policies, has a complicated nature: such statements are true because they have been said or adopted by a particular agent. This kind of data requires a digital signature framework that lets the signatures move with the data so that they can be validated in place, regardless of their source. WebSig provides the ability to trust that a particular agent has, to the best of anyone's knowledge, said a particular thing and claims it to be true.

The potential uses for this are widespread. Medical records are signed either by hand (in the case of paper records) or digitally, when dealing with electronic medical records [108]. Aggregation of medical records from multiple sources means potentially losing the information about who signed the document. This information can be important for determining the quality and source of the record, as information collected by doctors, nurses, pharmacists, and patients themselves carries different priorities and perspectives.

Traditional digital signatures in turn benefit from the ability to sign information instead of byte streams, making it possible to validate signatures in a wide variety of environments. Further, it now becomes possible to express complex attribution and provenance assertions within the signature. This makes it much easier to provide complex signature types, such as proxies and document revisions, without having to resort to new, purpose-built signature schemes.


The qualities of WebSig that make it trustable, computable, and minimally repudiable also make it require very little established trust between agents. If a document signature and document are addressable on the web and portable for local verification, a signature can come from any untrusted (or even anonymous) source and can be accepted as valid independently of how it was delivered or accessed. This anonymous delivery and/or reference is the ideal for a decentralized computable contract framework, as it has minimal trust dependency: the presenter of the signature does not have to be trusted by the acceptor, nor vice versa. The signatories do not need to trust either the presenter or the acceptor, as they cannot prove any additional information about the signatories beyond what was signed. The presenter and acceptor can, however, trust the signatory, because the signatory would be able to show that they have 1) a role in the document (attributability) and 2) control of the private key (verifiability). The signatories can also change the document and provide an updated signature (revisability). The acceptor can then see that there is a change and validate that change instead of the original. This scenario, with minimal trust and maximal decentralization of both transport and storage mechanisms, assumes no pre-coordination between agents and therefore satisfies the principle that "anyone can say anything about anything" [109], while still providing trust about who actually said it and in what capacity.

Our information abstraction model, Functional Requirements for Information Resources (FRIR), provides a framework for portability and revisability within WebSig. Since it accomplishes this by integrating PROV-O and FRBR, it provides additional benefits to both models. FRIR provides PROV-O with a means to talk about information resources at multiple levels of abstraction as well as additional publication-specific concepts. It also provides FRBR with the means to integrate with provenance on the web in general, making it easier for libraries to describe their resources on the web more generally. Along the way, it has demonstrated that it can be used to manage information resource change and multiple representations in digital signature contexts, as well as to manage the publication and provenance of open government data. Since publication of Chapters 4 and 5, the W3C has accepted PROV as a recommendation, including prov:specializationOf and prov:alternateOf as we presented.

Our graph digest algorithm, RDF Graph Digest Algorithm 1 (RGDA1), provides the means to reliably compute portable identifiers for RDF documents. RGDA1 is suitable for use in web-based digital signature schemes like WebSig and provides average case polynomial performance for the class of all RDF graphs. RGDA1 has been implemented as part of RDFlib, and has been released in RDFlib 4.2.0. RDFlib uses RGDA1 regularly in its unit tests to check graph operations for correctness and to validate round-trip conversions between different file formats. As part of the RDFlib project, RGDA1 will be available to all Python software projects that use RDFlib for managing RDF data. Recently, a member of the RDFlib team used RGDA1 to create a cache of more than 1 million SPARQL queries as part of a machine learning algorithm [110]. He used RGDA1 to create canonical identifiers for SPARQL queries so that results would not need to be requeried.
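The digest-as-cache-key pattern can be sketched in a few lines using RDFlib's compare module (which implements the RGDA1 canonicalization); the cache and evaluation function here are simplified placeholders, and the SPIN conversion of the query to RDF is omitted.

from rdflib import Graph
from rdflib.compare import to_isomorphic

cache = {}

def cached_result(pattern_graph: Graph, evaluate):
    """Key the cache on the canonical graph digest, so graphs that differ
    only in blank node labels or serialization share a single entry."""
    key = to_isomorphic(pattern_graph).graph_digest()
    if key not in cache:
        cache[key] = evaluate(pattern_graph)
    return cache[key]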
The traces algorithm used by RGDA1 belongs to a family of blank node coloring algorithms that always concur in their resulting coloring, because they all follow the same strategy of individuation tree pruning. Any future improvements in the performance of this family of algorithms will not result in changes to the resulting coloring, but merely improvements in the runtime of the algorithm. RGDA1 digests are therefore long-lived, because improved or alternate implementations of the coloring algorithm will still produce the same colorings, and therefore the same graph digests. Finally, because RGDA1 has no parameters, it is possible to exchange and validate RGDA1 digests with no pre-coordination. Signed documents can therefore be validated automatically without having to determine which variant of the graph digest algorithm was used.

WebSig has a number of limitations, however. While it can be used to sign non-RDF documents, message digests are used to identify the document. When a conventional message digest is used, the signature cannot be trusted and computable at the same time, because the signature ceases to be portable across representations. Another limitation is that RDF graphs can be written that provide poor performance with RGDA1. However, as we showed in the RGDA1 evaluation, these kinds of graphs are rare in practice, even when working with ontologies with large numbers of blank nodes.

8.1 Future Work

Adoption of digital signatures, however, has significant hurdles. We do not address issues of user acceptance and trust. There may be significant resistance to adoption of semantic technologies for such a task, and the WebID standard is still only in the early adopter stage. Current implementations of WebSig simply provide a way to sign documents and publish them, but future work can include capabilities like SPARQL endpoints that, by default, only query over validated signed graphs, or graphs that have been signed by particular agents. Further integration with the Privacy Preferences Manager, an implementation of the Privacy Preferences Ontology, would also be helpful. It would allow users to request access to PPO-managed resources without needing to rely on the controller of the resources to author the access preferences. WebSig will also be useful in managing informed consent agreements in biomedical studies. While there is as yet no clear winner in representing informed consent, WebSig can handle managing the resulting agreements, including changing and revoking that consent.

WebSig can also benefit by taking advantage of technologies such as RDF-enabled implementations of PubSubHubbub, like sparqlPuSH [111], to push notifications of changes to interested and authorized agents. Another next step might be to also publish RDF change sets for revisions. Automatically computing optimal differences poses similar issues as RDF graph identification, but is even more challenging [112]. However, Tzitzikas et al. propose that sub-optimal (but useful) graph differences are more tractable. Other document types can be trustable and computable if they have formal semantic interpretations. As we showed in Chapters 4 and 5, content digests can be created for many other kinds of data, including spreadsheets, images, video, and audio. These file types can benefit from using WebSig to sign them, but they lack formal semantic interpretations. Future work might determine what the semantics would be to sign such content. Additionally, content digests might be computable for eXtensible Markup Language (XML) and JavaScript Object Notation (JSON) documents, possibly with an eye towards developing a common formal semantic interpretation for them.

FRIR and RGDA1 can potentially be used to manage the provenance of distributed computing, where each step only has to examine its own inputs and outputs without consulting a centralized registry. This can potentially decrease the number of steps needed to explain data transformations, as content-preserving transformations can be excluded simply by presenting the provenance at the Expression level.

There are many opportunities for future work on the RDF Graph Digest Algorithm 1 (RGDA1). RGDA1 can be incorporated into RDF databases as a means of creating cryptographically secure audit trails for each RDF graph they store. RGDA1 can also be used by data catalogs like the Comprehensive Knowledge Archive Network (CKAN) [102] and in nanopublications to provide verifiability of the data that is published [86]. A parallel implementation of RGDA1 could be used to quickly compute the digests of large RDF graphs, if blank node canonicalization is handled correctly. Additionally, RGDA1 can be used by RDF databases to canonically identify queries that are run multiple times to improve cache performance, using the SPIN RDF representation of SPARQL queries [113].
Components of RGDA1 can also be used to warn users that their RDF graphs may be hard to identify. We found that all cases of indistinguishable blank nodes in the RDF that we examined were redundant or empty blank nodes that provided no additional information. Minimizing the occurrence of these kinds of nodes could be incorporated into tools like Protege.

CHAPTER 9
Conclusion

WebSig is a new digital signature scheme that is able to provide trustable signatures on RDF graphs in a way that minimizes the opportunities for legal repudiation. We showed that this is possible by creating a digital signature scheme that is attributable, linkable, portable, revisable, and verifiable. It provides a generalized mechanism for signature metadata using the PROV Ontology and the nanopublication framework that allows for complex signature types, including proxies and witnesses. In Chapter 3, our review of the literature showed that no existing digital or physical signature can assure the authenticity of RDF graphs across all possible representations, nor can it do so for any other kind of computable document.

In support of WebSig, we have formalized the representation of information resources in the framework Functional Requirements for Information Resources (FRIR). We show that digital information resources can be represented in the same manner as physical information resources, using Functional Requirements for Bibliographic Records (FRBR). Along the way, we were able to successfully formalize the provenance of information resource access on the web across multiple revisions, representations, and copies. We were able to use FRIR to make WebSig revisable.

Additionally, we were able to create RDF Graph Digest Algorithm 1 (RGDA1). RGDA1, along with FRIR, allows WebSig to be portable. Its use in the digital signature algorithm with FRIR and the RSA digital signature algorithm provides verifiability of WebSig signatures. RGDA1 combines the graph digest algorithm of Sayers and Karp with the graph canonicalization algorithm traces. While the worst-case runtime of traces is exponential, our implementation, when tested on real world RDF graphs, showed a highly predictable linear (O(n)) performance on RDF graphs with no blank nodes, and a strongly polynomial (O(b^1.6)) average performance on graphs with blank nodes, where b is the number of blank nodes in the graph. Complex graphs that contain automorphisms, or colors with more than one blank node, do not perform significantly worse than non-automorphic graphs. Out of 229 ontologies

from BioPortal and the Open Biomedical Ontologies (OBO) Foundry, only 13 require individuation of blank nodes, and those do not perform significantly worse than graphs that require no individuation, especially when the runtime is compared against the overall number of blank nodes in the ontology. This suggests that most RDF data is computable in a polynomial runtime using RGDA1, and that it is suitable as a standardized graph digest algorithm. RGDA1 has been released as part of RDFlib 4.2.0, Python's dominant RDF management library.

Finally, we showed how WebSig's use of FRIR and the Nanopublication framework allows it to be attributable and linkable. Since we proved that attributable, linkable, portable, revisable, and verifiable digital signature schemes are trustable, computable, and minimally repudiable, we can conclude that WebSig is trustable, computable, and minimally repudiable. This means that WebSig can provide trustable digital signatures on RDF graphs regardless of their representation and minimizes the opportunities for challenges to those signatures.

References

[1] D. Beckett and B. McBride, RDF/XML syntax specification (revised), W3C Recommendation REC-rdf-syntax, Feb. 2004. [Online]. Available: http://www.w3.org/TR/REC-rdf-syntax/ Accessed: 7/7/2015.

[2] B. Adida et al., RDFa Core 1.1 - Second Edition: Syntax and processing rules for embedding RDF through attributes, W3C Recommendation rdfa-syntax, Aug. 2013. [Online]. Available: http://www.w3.org/TR/rdfa-syntax/ Accessed: 8/5/2014.

[3] M. Sporny et al., JSON-LD 1.0: A JSON-based Serialization for Linked Data, W3C Recommendation json-ld, Jan. 2014. [Online]. Available: http://www.w3.org/TR/json-ld/ Accessed: 8/5/2014.

[4] T. Heath and C. Bizer, Linked Data: Evolving the Web into a Global Data Space. San Rafael, CA: Morgan & Claypool Publishers, Feb. 2011.

[5] S. Bratt. (2007, Jan.) Semantic Web, and Other Technologies to Watch. [Online]. Available: http://www.w3.org/2007/Talks/0130-sb-W3CTechSemWeb/#(24) Accessed: 7/7/2015.

[6] C. Sayers and A. H. Karp, “RDF Graph Digest Techniques and Potential Applications,” Mobile and Media Systems Laboratory, HP Laboratories, Palo Alto, CA, Tech. Rep. HPL-2004-95, 2004. [Online]. Available: http://www.hpl.hp.com/techreports/2004/HPL-2004-95.html Accessed: 7/7/2015.

[7] J. J. Carroll, “Signing RDF graphs,” in The Semantic Web - ISWC 2003, Sanibel Island, FL, Oct. 2003, pp. 369–384.

[8] O. Sacco et al., “Fine-Grained Trust Assertions for Privacy Management in the Social Semantic Web,” in Proc. 12th IEEE Int. Conf. Trust, Security and Privacy in Computing and Communication, Melbourne, Australia, Jul. 2013, pp. 218–225.

[9] J. Hollenbach et al., “Using RDF Metadata to Enable Access Control on the Social Semantic Web,” in Proc. Workshop Collaborative Construction, Management, and Linking of Structured Knowledge (CK2009), Washington, DC, Oct. 2009. [Online]. Available: http://ceur-ws.org/Vol-514/paper3.pdf Accessed: 7/7/2015.


[10] I. Jacobi et al., “Transitioning Linked Data accountable systems to the real world with identity, credential, and access management (ICAM) architectures,” in Proc. 2013 IEEE Int. Conf. Technologies for Homeland Security (HST), Waltham, MA, Nov. 2013, pp. 564–569.

[11] T. Lebo et al., PROV-O: The PROV Ontology, W3C Recommendation prov-o, Apr. 2013. [Online]. Available: http://www.w3.org/TR/prov-o/ Accessed: 7/7/2015.

[12] E. O’Neill, “FRBR: Functional Requirements for Bibliographic Records,” Library Resources & Tech. Services, vol. 46, no. 4, pp. 150–159, Sep. 2002.

[13] P. Groth et al., “The anatomy of a nanopublication,” Inform. Services and Use, vol. 30, no. 1, pp. 51–56, Jan. 2010.

[14] W. Diffie and M. E. Hellman, “New directions in cryptography,” IEEE Trans. Inform. Theory, vol. 22, no. 6, pp. 644–654, Nov. 1976.

[15] R. L. Rivest et al., “A method for obtaining digital signatures and public-key cryptosystems,” Commun. ACM, vol. 21, no. 2, pp. 120–126, Jan. 1978.

[16] S. Goldwasser et al., “A Digital Signature Scheme Secure Against Adaptive Chosen-Message Attacks,” SIAM J. Computing, vol. 17, no. 2, pp. 281–308, Apr. 1988.

[17] C. Reed, “Legally binding electronic documents: digital signatures and authentication,” The Int. Lawyer, pp. 89–106, Apr. 2001. [Online]. Available: http://www.jstor.org/stable/40707597 Accessed: 7/7/2015.

[18] X.509 Public Key Infrastructure – HTTP Transfer for the Certificate Management Protocol (CMP), IETF RFC 6712, Sep. 2012.

[19] R. Perlman, “An overview of PKI trust models,” Network, IEEE, vol. 13, no. 6, pp. 38–43, Nov. 1999.

[20] A. Boldyreva et al., “Secure proxy signature schemes for delegation of signing rights,” J. Cryptology, vol. 25, no. 1, pp. 57–115, Jan. 2012.

[21] M. A. Nia et al., “An introduction to digital signature schemes,” arXiv preprint arXiv:1404.2820, Apr. 2014. [Online]. Available: http://arxiv.org/abs/1404.2820 Accessed: 7/7/2015.

[22] R. Cyganiak et al., Resource Description Framework (RDF): Concepts and Abstract Syntax, W3C Recommendation rdf11-concepts, Feb. 2014. [Online]. Available: http://www.w3.org/TR/rdf11-concepts/ Accessed: 8/5/2014.

[23] “15 USC §§7001-7006 - ELECTRONIC RECORDS AND SIGNATURES IN COMMERCE.”

[24] Specht v. Netscape Communications Corp., Court of Appeals, 2nd Circuit Docket No. 01-7870, Oct. 2002.

[25] Cloud Corp. v. Hasbro, Inc., Court of Appeals, 7th Circuit Docket No. 02-1486, Dec. 2002.

[26] People v. McFarlan, NY: Supreme Court Decision 191 Misc. 2d 531, Apr. 2002.

[27] Campbell v. General Dynamics Government Systems, Court of Appeals, 1st Circuit Docket No. 04-1828, May 2005.

[28] Naldi v. Grunberg, Docket No. 600 707/08, Oct. 2010.

[29] Martin v. Portexit Corp., NY: Appellate Div., 1st Dept. Docket No. 6880, Jun. 2012.

[30] S. E. Blythe, “Digital signature law of the United Nations, European Union, United Kingdom and United States: Promotion of growth in e-commerce with enhanced security,” Richmond J. of Law & Technol., vol. 11, pp. 6–8, Feb. 2005.

[31] A. McCullagh and W. Caelli, “Non-repudiation in the digital environment,” First Monday, vol. 5, no. 8, Aug. 2000.

[32] D. Eastlake et al., XML Signature Syntax and Processing (Second Edition), W3C Recommendation xmldsig-core, Jun. 2008. [Online]. Available: http://www.w3.org/TR/xmldsig-core/ Accessed: 7/7/2015.

[33] G. R. Andrews, Concurrent Programming: Principles and Practice. San Francisco, CA: Benjamin/Cummings Publishing Company, Jul. 1991.

[34] D. Zeginis et al., “On the foundations of computing deltas between RDF models,” in Proc. 6th Int. Semantic Web Conf., Nov. 2007.

[35] J. P. McCusker et al., “Functional requirements for information resource provenance on the web,” in Provenance and Annotation of Data and Processes, Santa Barbara, CA, Jun. 2012, vol. 7525, pp. 52–66.

[36] J. K. Winn, “The emperor’s new clothes: The shocking truth about digital signatures and internet commerce,” Idaho Law Rev., vol. 37, p. 353, Jan. 2001.

[37] “Restatement (Second) of Contracts 134,” 1981.

[38] “Uniform Commercial Code §2-610,” The Amer. Law Institute and the Nat. Conf. of Commissioners on Uniform State Laws, 2012.

[39] “L’Estrange v. Graucob,” 1934, 2 KB 394, 403 (Scrutton LJ) (U.K.).

[40] “Saunders v Anglia Building Society,” 1971, AC 1004 (U.K.).

[41] A. Tarski, “The semantic conception of truth: and the foundations of semantics,” Philosophy and Phenomenological Research, vol. 4, no. 3, pp. 341–376, Mar. 1944.

[42] P. J. Hayes and P. F. Patel-Schneider, RDF 1.1 Semantics, W3C Recommendation rdf11-mt, Feb. 2014. [Online]. Available: http://www.w3.org/TR/rdf11-mt Accessed: 7/7/2015.

[43] A. J. Menezes et al., Handbook of Applied Cryptography. Boca Raton, FL: CRC press, Oct. 1996.

[44] S. Cantor et al., Assertions and Protocols for the OASIS Security Assertion Markup Language(SAML) V2.0, OASIS Std. saml-core-2.0-os, Mar. 2005.

[45] Digital Signature Standard (DSS), NIST FIPS Publication 186-2, Jan. 2000.

[46] A. Shamir, “Identity-based cryptosystems and signature schemes,” in Advances in Cryptology, Winnipeg, Canada, Aug. 1985, pp. 47–53.

[47] D. Johnson et al., “The elliptic curve digital signature algorithm (ECDSA),” Int. J. of Inform. Security, vol. 1, no. 1, pp. 36–63, Aug. 2001.

[48] A. Kasten and A. Scherp, “Iterative Signing of RDF(S) Graphs, Named Graphs, and OWL Graphs: Formalization and Application,” Department of Computer Science, Universität Koblenz-Landau, Koblenz, Germany, Working Paper, Jun. 2013. [Online]. Available: http://userpages.uni-koblenz.de/~fb4reports/2013/2013_03_Arbeitsberichte. Accessed: 7/7/2015.

[49] A. Kasten and A. Scherp, “Towards a framework for iteratively signing graph data,” in Proc. 7th Int. Conf. Knowledge Capture, Banff, Canada, Jun. 2013, pp. 141–142.

[50] R. Cloran and B. Irwin, “XML Digital Signature and RDF,” in Proc. Inform. Soc. South Africa (ISSA) 2005, Jun. 2005. [Online]. Available: http://icsa.cs.up.ac.za/issa/2005/Proceedings/Poster/026_Article.pdf Accessed: 7/7/2015.

[51] G. Tummarello et al., “Signing individual fragments of an RDF graph,” in Special Interest Tracks and Posters, 14th Int. Conf. World Wide Web, Chiba, Japan, May 2005, pp. 1020–1021. [Online]. Available: http://doi.acm.org/10.1145/1062745.1062848 Accessed: 7/7/2015.

[52] Digital Bazaar, Inc., “The Security Vocabulary.” [Online]. Available: https://web-payments.org/specs/source/vocabs/security.html#GraphSignature2012 Accessed: 11/2/2014.

[53] S. Impedovo and G. Pirlo, “Verification of handwritten signatures: an overview,” in Int. Conf. Image Analysis and Processing, Los Alamitos, CA, Sep. 2007, pp. 191–196.

[54] L. Ding et al., “Tracking RDF Graph Provenance Using RDF Molecules,” in Proc. 4th Int. Semantic Web Conf. (Poster), Nov. 2005, p. 42.

[55] D. Robinson et al., “Government data and the invisible hand,” Yale J. Law & Technol., vol. 11, p. 159, 2008.

[56] E. W. Patton et al., “SemantEco: A semantically powered modular architecture for integrating distributed environmental and ecological data,” Future Generation Comput. Sys., vol. 36, pp. 430–440, Jul. 2014.

[57] L. Ding et al., “TWC LOGD: A Portal for Linked Open Government Data Ecosystems,” J. Web Semantics, vol. 9, no. 3, pp. 325–333, Sep. 2011.

[58] M. Janssen et al., “Benefits, adoption barriers and myths of open data and open government,” Inform. Syst. Manage., vol. 29, no. 4, pp. 258–268, Nov. 2012.

[59] T. Lebo et al., “Producing and Using Linked Open Government Data in the TWC LOGD Portal,” in Linking Government Data. New York, NY: Springer, Oct. 2011, pp. 51–72.

[60] J. Unbehauen et al., “Knowledge extraction from structured sources,” in Search Computing. Berlin, Germany: Springer, Oct. 2012, vol. 7538, pp. 34–52.

[61] T. Lebo. Alternative Tabular to RDF converters. [Online]. Available: http://purl.org/twc/page/tabular-rdf-converters Accessed: 7/7/2015.

[62] OpenRefine. [Online]. Available: https://github.com/OpenRefine/OpenRefine Accessed: 7/7/2015.

[63] S. Das et al., R2RML: RDB to RDF Mapping Language, W3C Recommendation r2rml, Sep. 2012. [Online]. Available: http://www.w3.org/TR/r2rml/ Accessed: 8/5/2014.

[64] L. Moreau et al., “The Open Provenance Model Core Specification (v1. 1),” Future Generation Comput. Syst., vol. 27, no. 6, Jun. 2011.

[65] D. L. McGuinness et al., “PML 2: A Modular Explanation Interlingua,” in Proc. Association Advancement Artificial Intelligence, Vancouver, Canada, May 2007. [Online]. Available: http://www.aaai.org/Papers/Workshops/2007/WS-07-06/WS07-06-008.pdf Accessed: 7/7/2015.

[66] Resource Description and Access (RDA): Information and Resources in Preparation for RDA, Acquisitions and Bibliographic Control, Library of Congress Std. rda. [Online]. Available: http://www.loc.gov/aba/rda/ Accessed: 8/5/2014.

[67] R. T. Fielding et al., Hypertext Transfer Protocol – HTTP/1.1, IETF RFC 2616, Jun. 1999.

[68] C. Sayers and A. H. Karp, “Computing the digest of an RDF graph,” Mobile and Media Systems Laboratory, HP Laboratories, Palo Alto, CA, Tech. Rep. HPL-2003-235, Mar. 2004.

[69] M. Altman, “A fingerprint method for scientific data verification,” in Advances in Computer and Information Sciences and Engineering. New York, NY: Springer, Aug. 2008, pp. 311–316.

[70] F. Lefèbvre et al., “A robust soft hash algorithm for digital image signature,” in Proc. 2003 Int. Conf. Image Processing, vol. 3, Barcelona, Spain, Sep. 2003, pp. 495–498.

[71] T. Groza et al., “The NEPOMUK project - on the way to the social semantic desktop,” in Proc. I-Semantics 2007, vol. 7, Graz, Austria, Sep. 2007, pp. 201–211.

[72] D. Mimno et al., “Hierarchical Catalog Records: Implementing a FRBR Catalog,” D-Lib Magazine, vol. 11, no. 10, Oct. 2005. [Online]. Available: http://www.dlib.org/dlib/october05/crane/10crane.html Accessed: 7/7/2015.

[73] I. Jacobs and N. Walsh, Architecture of the World Wide Web, Volume One, W3C Recommendation webarch, Dec. 2004. [Online]. Available: http://www.w3.org/TR/webarch/ Accessed: 7/7/2015.

[74] T. Bray et al., Extensible Markup Language (XML) 1.0 (Fifth Edition), W3C Recommendation xml, Nov. 2008. [Online]. Available: http://www.w3.org/TR/xml/ Accessed: 7/7/2015.

[75] P. V. Biron and A. Malhotra, XML Schema Part 2: Datatypes, W3C Recommendation xmlschema-2, May 2001. [Online]. Available: http://www.w3.org/TR/xmlschema-2/ Accessed: 8/5/2014.

[76] C. K. Ogden and I. Richards, The Meaning of Meaning. London, UK: Trubner & Co, 1923.

[77] O. Madison et al., “Functional requirements for bibliographic records final report,” International Federation of Library Associations and Institutions, The Hague, Netherlands, Tech. Rep., Feb. 2009. [Online]. Available: http://www.ifla.org/VII/s13/frbr/ Accessed: 7/7/2015.

[78] O. Madison et al., Functional Requirements for Bibliographic Records, International Federation of Library Associations and Institutions Std. FRBR, Feb. 2009. [Online]. Available: http://www.ifla.org/en/publications/ functional-requirements-for-bibliographic-records Accessed: 7/7/2015.

[79] J. P. McCusker et al., “Parallel Identities for Managing Open Government Data,” IEEE Intell. Sys., vol. 27, no. 3, p. 55, May 2012.

[80] S. Xu et al., “Yale Image Finder (YIF): a new search engine for retrieving biomedical images,” Bioinformatics, vol. 24, no. 17, pp. 1968–1970, 2008.

[81] J. P. McCusker and D. L. McGuinness, “Towards identity in linked data,” in Proc. OWL: Experience and Directions, San Francisco, CA, Jun. 2010. [Online]. Available: http://ceur-ws.org/Vol-614/owled2010 submission 12.pdf Accessed: 7/7/2015.

[82] J. P. McCusker et al., “Where did you hear that? Information and the Sources They Come From,” in Proc. Linked Science 2011, Bonn, Germany, Oct. 2011.

[83] “Change Proposal for HttpRange-14,” Mar. 2012. [Online]. Available: http://lists.w3.org/Archives/Public/public-lod/2012Mar/0115.html Accessed: 7/7/2015.

[84] B. D. McKay and A. Piperno, “Practical graph isomorphism, II,” J. Symbolic Computation, vol. 60, pp. 94 – 112, Sep. 2014.

[85] A. Mallea et al., “On Blank Nodes,” in The Semantic Web ISWC 2011, Bonn, Germany, Oct. 2011, pp. 421–437.

[86] T. Kuhn and M. Dumontier, “Trusty URIs: Verifiable, Immutable, and Permanent Digital Artifacts for Linked Data,” in The Semantic Web: Trends and Challenges. Heraklion, Greece: Springer, May 2014, pp. 395–410.

[87] R. C. Read and D. G. Corneil, “The graph isomorphism disease,” J. Graph Theory, vol. 1, no. 4, pp. 339–363, Winter 1977.

[88] G. Gati, “Further annotated bibliography on the isomorphism disease,” J. Graph Theory, vol. 3, no. 2, pp. 95–109, Summer 1979. [Online]. Available: http://dx.doi.org/10.1002/jgt.3190030202 Accessed: 7/7/2015.

[89] R. Parris and R. Read, “A coding procedure for graphs,” University of the West Indies, Computing Center, Mona, Jamaica, Scientific Report 10, 1969.

[90] B. D. McKay, “Practical graph isomorphism,” Congressus Numerantium, vol. 30, pp. 45–87, 1981.

[91] T. A. Junttila and P. Kaski, “Engineering an efficient canonical labeling tool for large and sparse graphs,” in ALENEX, vol. 7, Apr. 2007, pp. 135–149.

[92] P. T. Darga et al., “Exploiting structure in symmetry detection for CNF,” in Proc. 41st Annu. Design Automation Conf., San Diego, CA, Jun. 2004, pp. 530–534.

[93] G. Grimnes et al., “compare.py.” [Online]. Available: https://github.com/RDFLib/rdflib/blob/8fad4ed91a29a118560e0e6d8b0e1654a044a307/rdflib/compare.py Accessed: 11/30/2014.

[94] R. L. Rivest, The MD5 Message-Digest Algorithm, IETF RFC 1321, Apr. 1992. [Online]. Available: http://tools.ietf.org/html/rfc1321 Accessed: 8/5/2014.

[95] X. Wang and H. Yu, “How to break MD5 and other hash functions,” in Advances in Cryptology–EUROCRYPT 2005, Aarhus, Denmark, May 2005, pp. 19–35.

[96] X. Wang et al., “Finding collisions in the full SHA-1,” in Advances in Cryptology–CRYPTO 2005, Santa Barbara, CA, Aug. 2005, pp. 17–36.

[97] D. Eastlake and P. Jones, US Secure Hash Algorithm 1 (SHA1), IETF RFC 3174, Sep. 2001. [Online]. Available: http://tools.ietf.org/html/rfc3174 Accessed: 7/7/2015.

[98] H. Gilbert and H. Handschuh, “Security analysis of SHA-256 and sisters,” in Proc. Conf. Selected Areas in Cryptography, Waterloo, Canada, Aug. 2004, pp. 175–193.

[99] N. F. Noy et al., “Bioportal: ontologies and integrated data resources at the click of a mouse,” Nucleic Acids Res., vol. 37, no. suppl 2, pp. W170–W173, May 2009.

[100] B. Smith et al., “The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration,” Nature Biotechnology, vol. 25, no. 11, pp. 1251–1255, Nov. 2007.

[101] P. Harish and P. Narayanan, “Accelerating Large Graph Algorithms on the GPU Using CUDA,” in Proc. of High Performance Computing - HiPC 2007, vol. 4873, Goa, India, Dec. 2007, pp. 197–208.

[102] D. Dietrich and R. Pollock, “CKAN: apt-get for the debian of data,” in Proc. 26th Chaos Communication Congr., Berlin, Germany, Dec. 2009, p. 36.

[103] D. Hardt, The OAuth 2.0 authorization framework, IETF RFC 6749, Oct. 2012. [Online]. Available: http://tools.ietf.org/html/rfc6749.html Accessed: 8/5/2014.

[104] M. D. Wilkinson et al., “The Semantic Automated Discovery and Integration (SADI) Web service Design-Pattern, API and Reference Implementation,” J. Biomedical Semantics, vol. 2, no. 8, Oct. 2011. [Online]. Available: http://www.jbiomedsem.com/content/2/1/8 Accessed: 7/7/2015.

[105] H. Story et al., “FOAF+SSL: RESTful Authentication for the Social Web,” in Proc. 2009 European Semantic Web Conf., Heraklion, Greece, Jun. 2009, pp. 1–12. [Online]. Available: http://ceur-ws.org/Vol-447/paper5.pdf Accessed: 7/7/2015.

[106] Flask (A Python Microframework). [Online]. Available: http://flask.pocoo.org Accessed: 7/7/2015.

[107] J. P. McCusker. Tutorial: building a SADI service in Python. [Online]. Available: https://code.google.com/p/sadi/wiki/BuildingServicesInPython Accessed: 7/7/2015.

[108] B. Blobel et al., “Standard Guide for EDI (HL7) Communication Security,” in Studies in Health Technology and Informatics. IOS Press; 1999, 2002, pp. 153–182.

[109] T. Berners-Lee, “Metadata Architecture,” Jan. 1997. [Online]. Available: http://www.w3.org/DesignIssues/Metadata.html Accessed: 5/11/2015.

[110] J. Hees et al., “Canonical form of SPARQL Patterns,” May 2015. [Online]. Available: https://github.com/RDFLib/rdflib/issues/483 Accessed: 5/11/2015.

[111] A. Passant and P. N. Mendes, “sparqlPuSH: Proactive Notification of Data Updates in RDF Stores Using PubSubHubbub,” in Proc. 6th Scripting Semantic Web Workshop, Heraklion, Greece, May 2010.

[112] Y. Tzitzikas et al., “Blank Node Matching and RDF/S Comparison Functions,” in Proc. Int. Semantic Web Conf. 2012, vol. 7649, Nov. 2012, pp. 591–607. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-35176-1 37 Accessed: 7/7/2015.

[113] H. Knublauch et al., SPIN - Overview and Motivation, W3C Member Submission spin-overview, Feb. 2011. [Online]. Available: http://www.w3.org/Submission/spin-overview/ Accessed: 5/11/2015.