Provenance-Aware Knowledge Representation: a Survey of Data Models and Contextualized Knowledge Graphs
Total Page:16
File Type:pdf, Size:1020Kb
Edith Cowan University Research Online ECU Publications Post 2013 1-1-2020 Provenance-aware knowledge representation: A survey of data models and contextualized knowledge graphs Leslie F. Sikos Edith Cowan University Dean Philp Follow this and additional works at: https://ro.ecu.edu.au/ecuworkspost2013 Part of the Computer Sciences Commons 10.1007/s41019-020-00118-0 Sikos, L. F., & Philp, D. (2020). Provenance-aware knowledge representation: A survey of data models and contextualized knowledge graphs. Data Science and Engineering. https://doi.org/10.1007/s41019-020-00118-0 This Journal Article is posted at Research Online. https://ro.ecu.edu.au/ecuworkspost2013/8373 Data Science and Engineering https://doi.org/10.1007/s41019-020-00118-0 Provenance-Aware Knowledge Representation: A Survey of Data Models and Contextualized Knowledge Graphs Leslie F. Sikos1 · Dean Philp2 Received: 1 April 2019 / Revised: 13 February 2020 / Accepted: 3 March 2020 © The Author(s) 2020 Abstract Expressing machine-interpretable statements in the form of subject-predicate-object triples is a well-established practice for capturing semantics of structured data. However, the standard used for representing these triples, RDF, inherently lacks the mechanism to attach provenance data, which would be crucial to make automatically generated and/or processed data author- itative. This paper is a critical review of data models, annotation frameworks, knowledge organization systems, serialization syntaxes, and algebras that enable provenance-aware RDF statements. The various approaches are assessed in terms of stan- dard compliance, formal semantics, tuple type, vocabulary term usage, blank nodes, provenance granularity, and scalability. This can be used to advance existing solutions and help implementers to select the most suitable approach (or a combination of approaches) for their applications. Moreover, the analysis of the mechanisms and their limitations highlighted in this paper can serve as the basis for novel approaches in RDF-powered applications with increasing provenance needs. Keywords RDF provenance · Contextual knowledge graph · RDF reification alternatives · RDF data model 1 Introduction to RDF Provenance path[?query] [#fragment] used to identify a resource,2 The Resource Description Framework (RDF)1 is a Seman- 2. RDF literals (L), which can be a) self-denoting plain tic Web standard for formal knowledge representation, literals LP in the form "<string>"(@<lang>)?, which can be used to efficiently manipulate and interchange where <string> is a string and <lang> is an optional machine-interpretable, structured data. Its data model is par- language tag, or b) typed literals LT of the form ticularly powerful due to its syntax and semantics; RDF "<string>"^^<datatype>, where <datatype> allows statements to be made in the form of subject-predicate- is an IRI denoting a datatype according to a schema (e.g., object triples, resulting in fixed-length dataset fields that are XML Schema), and <string> is an element of the lex- much easier to process than variable-length fields. Formally ical space corresponding to the datatype, and speaking, assume pairwise disjoint infinite sets of 3. blank nodes (B), i.e., unique anonymous resources that do not belong to either of the above sets. 1. Internationalized Resource Identifiers (IRIs, I), i.e., sets of strings of Unicode characters of the form scheme: A triple of the form (s, p, o) ∈ (I ∪ B) × I × (I ∪ L ∪ B) [//[user:password@]host[:port]] [/] is called an RDF triple, also known as an RDF statement, where s is the subject, p is the predicate, and o is the object. 1 https://www.w3.org/RDF/ The RDF data model is the canonicalization of a directed B Leslie F. Sikos graph, offering compatibility with graph algorithms, such [email protected] as graph traversal algorithms [1]. In addition, the RDF data Dean Philp model inherently supports basic inferences. Being modular, [email protected] it allows fully parallelized data processing and can represent partial information. RDF is one of the primary graph-based 1 Edith Cowan University, 270 Joondalup Drive, Joondalup, data models that are well-utilized in fields that require data WA 6027, Australia 2 Defence Science and Technology Group, Third Ave, Edinburgh, SA 5111, Australia 2 https://tools.ietf.org/html/rfc3987 123 L. F. Sikos, D. Philp fusion and/or aggregation from diverse data sources, such as of the W3C have been kept to a minimum with the shutdown cybersecurity [2]. of the Provenance Working Group in 2013 [22]. RDF has a variety of serialization formats and syntaxes, However, in Semantic Web applications, provenance can including RDF/XML,3 Turtle,4 Notation3 (N3),5 N-triples,6 be seen as a means to develop trust [23], as witnessed N-quads,7 JSON-LD,8 RDF/JSON,9 RDFa,10 and HTML5 by implementations of augmented provenance [24, 25] and Microdata.11 These make it possible to express RDF state- semantic provenance [26, 27] in application areas such as ments differently for applications that require compatibility eScience [28–32] and scientific data processing [33], work- with XML, the most compact representation possible, define flow management, bioinformatics [34], laboratory informa- property values in website markup attributes, and so on [3]. tion management [35], digital media archives [36], rec- RDF can be used to provide the uniform representation ommender systems [37], query search result ranking [38], of knowledge for processing data from diverse web sources and decision support for cybersecurity and cyber-situational via syntactic and semantic interoperability. Moreover, RDF awareness [39, 40]. facilitates the partial or full automation of tasks that otherwise This paper is a critical review of alternate approaches to would have to be processed manually [4, 5]. capture provenance for RDF, demonstrates their use in, and These benefits make RDF appealing for a wide range provides their quantitative comparison for, the cybersecu- of applications; however, RDF has shortcomings when it rity domain (which is known to be reliant on provenance) comes to encapsulating metadata to statements. With the from various perspectives. Section 2 covers extensions of the proliferation of heterogeneous structured data sources, such standard RDF data model, state-of-the-art annotation frame- as triplestores and LOD datasets, capturing data prove- works, and purpose-built knowledge organization systems nance,12 i.e., the origin or source of data [7], and the (KOS), including controlled vocabularies15 and ontolo- technique used to extract it, is becoming more and more gies,16 typically written in RDFS17 and OWL,18 respectively. important, because it enables the verification of data, the Section 3 discusses the requirements of RDF triplestores, assessment of reliability [8], the analysis of the processes quadstores, and graph databases to be suitable for stor- that generated the data [9], decision support for areas such ing provenance-aware RDF data, and Sect. 4 details how as cybersecurity [10, 11], cyberthreat intelligence [12], and RDF provenance can be queried. Section 5 reviews the most cyber-situational awareness [13], and helps express trustwor- prominent software tools for manipulating RDF provenance. thiness [14, 15], uncertainty [16], and data quality [17]. Yet, Section 6 describes common application domains. Finally, the RDF data model does not have a built-in mechanism to Sect. 7 demonstrates how application-specific implemen- attach provenance to triples or elements of triples.13 Con- tations can outperform general-purpose RDF provenance sequently, representing provenance data with RDF triples is techniques. a long-standing, non-trivial problem [19]. While the World Wide Web Consortium (W3C) suggested RDF extensions in 2010 to support provenance in the upcoming version of the 2 Formal Representation of RDF Data standard [20, 21], none of these have been implemented in the Provenance next release in 2014, namely in RDF 1.1.14 Related efforts RDF reification19 refers to making an RDF statement about another RDF statement by instantiating the rdf: 3 https://www.w3.org/TR/rdf-syntax-grammar/ 15 4 https://www.w3.org/TR/turtle/ A controlled vocabulary is a finite set of IRI symbols denoting concept names or classes (atomic concepts), role names, properties, 5 https://www.w3.org/TeamSubmission/n3/ and relationships (atomic roles), and individual names (entities), where 6 https://www.w3.org/TR/n-triples/ these three sets are pairwise disjoint. 7 https://www.w3.org/TR/n-quads/ 16 Ontologies are formal conceptualizations of a knowledge domain 8 https://www.w3.org/TR/json-ld/ with complex relationships, and optionally complex rules, suitable for 9 https://www.w3.org/TR/rdf-json/ inferring new statements, thereby making implicit knowledge explicit. 17 10 https://www.w3.org/TR/rdfa-primer/ RDF Schema, an extension of RDF’s vocabulary for creating vocab- ularies, taxonomies, and thesauri; see https://www.w3.org/TR/rdf- 11 https://www.w3.org/TR/microdata/ schema/ for reference. 12 Note that provenance is a complex term, and its usage varies greatly 18 Web Ontology Language (intentionally abbreviated with the W and O in different contexts [6]. swapped as OWL), a fully featured knowledge representation language 13 This can be circumvented by using alternate knowledge representa- for the conceptualization of knowledge domains with complex property tions, such as considering entities as embeddings in a vector space, for restrictions and concept relationships; see https://www.w3.org/OWL/ example [18], however, none of the alternate representations share all for reference. the strengths of RDF. 19 https://www.w3.org/TR/2004/REC-rdf-primer-20040210/# 14 https://www.w3.org/TR/rdf11-new/ reification 123 Provenance-Aware Knowledge Representation: A Survey of Data Models... Statement class and using the rdf:subject, rdf: Listing 2 Shorthand notation for RDF reification predicate, and rdf:object properties of the stan- <?xml version="1.0"?> dard RDF vocabulary20 to identify the elements of the <rdf:RDF xmlns:rdf="http://www.w3. triple.