Inscribing the Blockchain with Digital Signatures of Signed RDF Graphs. A Worked Example of a Thought Experiment
The thing is, ad hoc hash signatures on the blockchain carry no information linking the hash to the resource for which the hash is acting as a digital signature. The onus is on the inscriber to associate the hash with the resource via a separate communications channel, either by publishing the association directly or by making use of one of the third-party inscription services that also offer a resource persistence facility: you get to describe what is signed by the hash and they store the description in a database, usually for a fee, or you can upload the resource for storage, definitely for a fee.

Riccardo Casatta has a couple of technical articles that expose more of the practical details:

https://blog.eternitywall.it/2016/02/16/how-to-verify-notarization/

So what's published is a hexadecimal string, indistinguishable from all the others, which purportedly matches the hash of Riccardo's mugshot. But which?

https://tineye.com/search/ed9f8022c9af413a350ec5758cda520937feab21

What's needed is a means of creating an inscription that is not only a signature but also a resolvable reference to the resource for which it acts as a digital signature.
It would be even better if it were possible to create a signature of structured information, such as that describing a social media post, especially if we could build on the W3C's recent Activity Streams standard:

https://www.w3.org/TR/activitystreams-core/

An example taken from the above:

    {
      "@context": "https://www.w3.org/ns/activitystreams",
      "summary": "Martin created an image",
      "type": "Create",
      "actor": "http://www.test.example/martin",
      "object": "http://example.org/foo.jpg"
    }

The format is JSON-LD (https://www.w3.org/TR/json-ld/), in which the LD stands for “Linked Data”, broadly RDF, as can be quickly demonstrated in Python by using the RDF processing library RDFLib:

    src = """{
      "@context": "https://www.w3.org/ns/activitystreams",
      "summary": "Martin created an image",
      "type": "Create",
      "actor": "http://www.test.example/martin",
      "object": "http://example.org/foo.jpg"
    }"""

    def read_jsonld_write_n3(self):
        from rdflib import Dataset, URIRef
        ds = Dataset()
        # Parse the JSON-LD into a named graph within the dataset
        g = ds.graph(URIRef('urn:uuid:310cd110-ba9b-4f45-910a-8f84b50b1315'))
        g.parse(data=src, format="json-ld")
        out = g.serialize(format="n3")
        # Older RDFLib releases return bytes here, RDFLib 6 and later return str
        print(out.decode('utf-8') if isinstance(out, bytes) else out)

which gives:

    @prefix : <_:> .
    @prefix as: <https://www.w3.org/ns/activitystreams#> .

    [] a as:Create ;
        as:actor <http://www.test.example/martin> ;
        as:object <http://example.org/foo.jpg> ;
        as:summary "Martin created an image" .

So far, so good. Now, the thing about an RDF graph is that one of its properties is that of being a fully normalised database (https://en.wikipedia.org/wiki/Database_normalization), which is to say that each RDF statement (composed of a “subject”, a “predicate” and an “object”) is a self-contained single cell in the database. The subject identifies the row index, the predicate identifies the column and the object identifies the content of the cell (either a Literal value such as a decimal number or the subject of another statement).
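The row/column/cell reading of a statement can be sketched in plain Python, without RDFLib. This is only an illustration of the analogy: the blank node label "_:b0" and the helper name triples_to_table are made up for the example, and each cell holds a list because a subject may carry several values for the same predicate.

```python
from collections import defaultdict

AS = "https://www.w3.org/ns/activitystreams#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# The four statements of the Activity Streams example as
# (subject, predicate, object) tuples. "_:b0" is a made-up label.
triples = [
    ("_:b0", RDF_TYPE, AS + "Create"),
    ("_:b0", AS + "actor", "http://www.test.example/martin"),
    ("_:b0", AS + "object", "http://example.org/foo.jpg"),
    ("_:b0", AS + "summary", "Martin created an image"),
]


def triples_to_table(statements):
    """Pivot statements into {row (subject): {column (predicate): [cells]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for subject, predicate, obj in statements:
        table[subject][predicate].append(obj)
    # Convert the nested defaultdicts into ordinary dicts for readability
    return {row: dict(columns) for row, columns in table.items()}


table = triples_to_table(triples)
for row, columns in table.items():
    for column, cells in columns.items():
        print(row, column, cells)
```

Here the single blank node supplies one row with four columns, one per predicate.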
Serializing the above example using RDF's “ntriples” format (https://en.wikipedia.org/wiki/N-Triples) shows the gory details (shrunk to fit):

    _:N25a8c132302d49ff800166b9f01d05af <https://www.w3.org/ns/activitystreams#actor> <http://www.test.example/martin> .
    _:N25a8c132302d49ff800166b9f01d05af <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://www.w3.org/ns/activitystreams#Create> .
    _:N25a8c132302d49ff800166b9f01d05af <https://www.w3.org/ns/activitystreams#summary> "Martin created an image" .
    _:N25a8c132302d49ff800166b9f01d05af <https://www.w3.org/ns/activitystreams#object> <http://example.org/foo.jpg> .

The subject-as-index is evident, the way that column metadata is represented is also evident (the URLs resolve to formal OWL definitions of the column) and the values represented by the object are also evident (three URIs and one string literal). Here's the full spec from Wikipedia:

Each line of the file has either the form of a comment or of a statement: A statement consists of three parts, separated by whitespace:

• the subject,
• the predicate and
• the object,

and is terminated with a full stop. Subjects may take the form of a URI or a Blank node; predicates must be a URI; objects may be a URI, blank node or a literal. URIs are delimited with less-than and greater-than signs used as angle brackets. Blank nodes are represented by an alphanumeric string, prefixed with an underscore and colon ( _: ). Literals are represented as printable ASCII strings (with backslash escapes), delimited with double-quote characters, and optionally suffixed with a language or datatype indicator. Language indicators are an at sign followed by an RFC 3066 language tag; datatype indicators are a double-caret followed by a URI. Comments consist of a line beginning with a hash sign.
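The line format quoted above is simple enough to sketch as a regular expression. The patterns below are a deliberately simplified reading of that description, not the full N-Triples grammar (character classes and escape handling are looser than the real specification), and the names STATEMENT and parse_line are made up for the illustration.

```python
import re

# Simplified term patterns following the description above (assumptions,
# not the official grammar).
URI = r"<[^<>\s]*>"                      # <...> delimited URI
BNODE = r"_:[A-Za-z0-9]+"                # _: followed by alphanumerics
LITERAL = r'"(?:[^"\\]|\\.)*"(?:@[A-Za-z0-9-]+|\^\^<[^<>\s]*>)?'

STATEMENT = re.compile(
    rf"^\s*(?P<s>{URI}|{BNODE})\s+"      # subject: URI or blank node
    rf"(?P<p>{URI})\s+"                  # predicate: URI only
    rf"(?P<o>{URI}|{BNODE}|{LITERAL})"   # object: URI, blank node or literal
    r"\s*\.\s*$"                         # terminated with a full stop
)


def parse_line(line):
    """Return 'comment', a (subject, predicate, object) tuple, or None."""
    if line.lstrip().startswith("#"):
        return "comment"
    match = STATEMENT.match(line)
    if match is None:
        return None
    return (match.group("s"), match.group("p"), match.group("o"))


print(parse_line(
    '_:N25a8c132302d49ff800166b9f01d05af '
    '<https://www.w3.org/ns/activitystreams#summary> '
    '"Martin created an image" .'
))
```

A line whose predicate is not a URI, such as one with a literal in second position, is rejected by returning None.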
Returning to the RDFLib snippet (above), it is possible to iterate over all of the statements, printing out the Notation3 serialisation for each of subject, predicate and object:

    # g is the graph parsed from the JSON-LD source in the earlier snippet
    for (s, p, o) in list(g.triples((None, None, None))):
        print("{} {} {} .".format(s.n3(), p.n3(), o.n3()))

    _:N4244c9f83b7642d68d0de95b5318261f <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://www.w3.org/ns/activitystreams#Create> .
    _:N4244c9f83b7642d68d0de95b5318261f <https://www.w3.org/ns/activitystreams#object> <http://example.org/foo.jpg> .
    _:N4244c9f83b7642d68d0de95b5318261f <https://www.w3.org/ns/activitystreams#summary> "Martin created an image" .
    _:N4244c9f83b7642d68d0de95b5318261f <https://www.w3.org/ns/activitystreams#actor> <http://www.test.example/martin> .

Essentially the same output, different order (immaterial to reading the data back in).

There's a slight wrinkle with the n3 and the ntriples serialisations shown above: the index identifiers (the _:N... subjects) obviously change with each rendering. That's okay, it's supposed to be that way because the W3C's example doesn't assign an identifier to the collection of statements that form the example, so the subject of the statements becomes a “blank node” in the graph and the system creates a “stand-in” identifier that fills that role for the (in essence, dissociated) statements. From https://www.w3.org/TR/rdf11-concepts/#section-blank-nodes:

Blank node identifiers are local identifiers that are used in some concrete RDF syntaxes or RDF store implementations. They are always locally scoped to the file or RDF store, and are not persistent or portable identifiers for blank nodes. Blank node identifiers are not part of the RDF abstract syntax, but are entirely dependent on the concrete syntax or implementation. The syntactic restrictions on blank node identifiers, if any, therefore also depend on the concrete RDF syntax or implementation.
Implementations that handle blank node identifiers in concrete syntaxes need to be careful not to create the same blank node from multiple occurrences of the same blank node identifier except in situations where this is supported by the syntax.

Blank nodes caused some problems when introduced, prompting Pat Hayes to author this rather fine description of the graph abstract syntax:

https://www.ihmc.us/users/phayes/RDFGraphSyntax.html

in which he describes the role that blank nodes play in the mathematics underpinning the abstract syntax (again, included for completeness).

Where I'm headed is this: render the graph as ntriples, i.e. a sequence of strings each separated by a newline, e.g.:

    "_:N4244c9f83b7642d68d0de95b5318261f <https://www.w3.org/ns/activitystreams#actor> <http://www.test.example/martin> ."

and compute the sha256 hash of each triple-rendered-as-a-line-of-chars, the resulting hash being a chunk of binary that is usually rendered as a hexadecimal string (i.e. the bits rendered as base 16) but is equally valid when read as a base 10 integer. Simply adding that integer (written in base 10 or base 16, whatever suits) to a running total will create a sum of hashes that is independent of the order of serialization (addition being a commutative operation):

    def sign_statements(self):
        import sys
        from hashlib import sha256
        from rdflib import Dataset, URIRef
        ds = Dataset()
        g = ds.graph(URIRef('urn:uuid:310cd110-ba9b-4f45-910a-8f84b50b1315'))
        g.parse(data=src, format="json-ld")
        nt = g.serialize(format='nt')
        if isinstance(nt, str):
            # RDFLib 6 and later return str; sha256 needs bytes
            nt = nt.encode('utf-8')
        hashtotal = 0
        # strip() drops the trailing newlines so no empty line is hashed
        for triple in nt.strip().split(b'\n'):
            c = sha256(triple)
            # sys.byteorder is platform-dependent; a fixed order such as 'big'
            # would be needed for totals to be comparable across machines
            s = int.from_bytes(c.digest(), sys.byteorder)
            hashtotal += s
        print(hex(hashtotal))
        # 0xcb5390d579ad1f978bdb9ffdadeab9adaf53b7e111329552562b7b16cb6eb6454

The result is a unique digital signature of the graph/collection of statements.
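The order-independence claim can be checked without RDFLib by hashing a fixed list of N-Triples lines in two different orders. This is a minimal sketch under stated assumptions: the blank node label "_:b0" and the function name sum_of_hashes are made up, and the digest is read with a fixed "big" byte order rather than the platform's, so the printed total will not match the value shown in the snippet above.

```python
import random
from hashlib import sha256

# The four statements of the example as N-Triples lines.
# "_:b0" is a made-up blank node label.
lines = [
    '_:b0 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> '
    '<https://www.w3.org/ns/activitystreams#Create> .',
    '_:b0 <https://www.w3.org/ns/activitystreams#actor> '
    '<http://www.test.example/martin> .',
    '_:b0 <https://www.w3.org/ns/activitystreams#object> '
    '<http://example.org/foo.jpg> .',
    '_:b0 <https://www.w3.org/ns/activitystreams#summary> '
    '"Martin created an image" .',
]


def sum_of_hashes(nt_lines):
    """Add up the sha256 digests of each line, read as unsigned integers."""
    total = 0
    for line in nt_lines:
        digest = sha256(line.encode("utf-8")).digest()
        total += int.from_bytes(digest, "big")  # fixed byte order (assumption)
    return total


shuffled = lines[:]
random.shuffle(shuffled)

print(hex(sum_of_hashes(lines)))
print(sum_of_hashes(lines) == sum_of_hashes(shuffled))
```

Because integer addition is commutative and associative, the comparison holds for any permutation of the lines.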
This signature can be inscribed on the blockchain and, if the graph is posted for inclusion in a publicly-accessible store, its digital signature can be attached to the posting, thereby enabling the graph to be retrieved by using the digital signature as an index (i.e. get me the graph with this signature).

Unfortunately, repeated signings of the same RDF graph will differ because fresh bnode identifiers are generated each time the graph is parsed and serialized, and this of course renders digital signatures pointless because all signature checks will fail except against the original serialisation.

What's been somewhat lost in the mists of time is the original work of ex-Labs colleague Jeremy Carroll on the maths and logic for digital signatures of RDF graphs, published at ISWC 2003:

https://link.springer.com/chapter/10.1007/978-3-540-39718-2_24

The author's copy is still available:

http://www.hpl.hp.com/techreports/2003/HPL-2003-142.pdf

This is top-drawer stuff (part of the Lecture Notes in Computer Science book series, LNCS volume 2870) and the references offer some exalted company. Snippets:

In this paper, we concentrate on creating canonical representations of RDF graphs in N-Triples

(which is what the Python above is doing)

To create a canonical N-Triples file for an RDF graph without any blank nodes we can simply reorder the lines in the N-Triples document to be in lexicographic order.
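For the blank-node-free case described in that last snippet, the reordering can be sketched in a few lines of Python. This is an illustration of the sorting idea only, not the algorithm from the paper: it assumes the input contains no blank nodes, it uses Python's default string ordering as the lexicographic order, and the function names and the subject URI http://example.org/post/1 are hypothetical.

```python
from hashlib import sha256


def canonical_ntriples(nt_document):
    """Sort the statement lines of a blank-node-free N-Triples document."""
    statements = [
        line.strip()
        for line in nt_document.splitlines()
        # Skip empty lines and comments, keep only statements
        if line.strip() and not line.lstrip().startswith("#")
    ]
    return "\n".join(sorted(statements)) + "\n"


def graph_digest(nt_document):
    """sha256 of the canonical form, as a hexadecimal string."""
    canonical = canonical_ntriples(nt_document)
    return sha256(canonical.encode("utf-8")).hexdigest()


# Two serialisations of the same two statements in different order.
# The subject URI is a hypothetical stand-in for the blank node.
doc_a = (
    "<http://example.org/post/1> "
    "<https://www.w3.org/ns/activitystreams#actor> "
    "<http://www.test.example/martin> .\n"
    "<http://example.org/post/1> "
    "<https://www.w3.org/ns/activitystreams#object> "
    "<http://example.org/foo.jpg> .\n"
)
doc_b = "\n".join(reversed(doc_a.splitlines())) + "\n"

print(graph_digest(doc_a) == graph_digest(doc_b))
```

Once the subject is a URI rather than a blank node, both orderings reduce to the same canonical text and therefore the same digest.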