Everything You Always Wanted to Know About Blank Nodes ∗ Aidan Hogan A, Marcelo Arenas B, Alejandro Mallea B, Axel Polleres C

Everything You Always Wanted to Know About Blank Nodes ∗ Aidan Hogan a, Marcelo Arenas b, Alejandro Mallea b, Axel Polleres c, aDepartment of Computer Science, Universidad de Chile bDepartment of Computer Science, Pontificia Universidad Católica de Chile cVienna University of Economics and Business (WU), Welthandelsplatz 1, 1020 Vienna, Austria Abstract In this paper we thoroughly cover the issue of blank nodes, which have been defined in RDF as ‘existential variables’. We first introduce the theoretical precedent for existential blank nodes from first order logic and incomplete information in database theory. We then cover the different (and sometimes incompatible) treatment of blank nodes across the W3C stack of RDF-related standards. We present an empirical survey of the blank nodes present in a large sample of RDF data published on the Web (the BTC–2012 dataset), where we find that 25.7% of unique RDF terms are blank nodes, that 44.9% of documents and 66.2% of domains featured use of at least one blank node, and that aside from one Linked Data domain whose RDF data contains many “blank node cycles”, the vast majority of blank nodes form tree structures that are efficient to compute simple entailment over. With respect to the RDF-merge of the full data, we show that 6.1% of blank-nodes are redundant under simple entailment. The vast majority of non-lean cases are isomorphisms resulting from multiple blank nodes with no discriminating information being given within an RDF document or documents being duplicated in multiple Web locations. Although simple entailment is NP-complete and leanness-checking is coNP-complete, in computing this latter result, we demonstrate that in practice, real-world RDF graphs are sufficiently “rich” in ground information for problematic cases to be avoided by non-naive algorithms. Key words: blank nodes, rdf, simple entailment, leanness, skolemisation, semantic web, linked data Introduction rent definition of blank nodes appropriate for the needs of the Web community? Although the adoption of RDF [46] has broad- The standard semantics for blank nodes in- ened on the Web in recent years [11], one of its core terprets them as existential variables [36], de- features—blank nodes—has been sometimes misun- noting the existence of some unnamed resource. derstood, sometimes misinterpreted, and sometimes These semantics make even “simple” entailment ignored by implementers, other standards, and the checking—entailment without further well-defined broader Semantic Web community. This lack of con- vocabularies—intractable [36]. RDF and RDFS en- sistency between the standard and its actual use tailment are based on simple entailment, and are calls for investigation: are the semantics and the cur- thus also intractable due to blank nodes [36]. However, in the documentation for the RDF standard (e.g., RDF/XML [9], RDF Primer [49]), the ? But Were Afraid to Ask Email addresses: [email protected] (Aidan Hogan), existentiality of blank nodes is not directly treated; [email protected] (Marcelo Arenas), ambiguous phrasing such as “blank node identifiers” [email protected] (Alejandro Mallea), is used, and examples for blank nodes focus on rep- [email protected] (Axel Polleres). Preprint submitted to Elsevier 15 December 2016 resenting resources that do not have a natural URI. remark on trends of adoption. Furthermore, the standards built upon RDF some- §6 We look at the role of blank nodes in pub- times have different treatment and requirements for lishing, their prevalence of use in real-world blank nodes. As we will see, standards and tools are Linked Data, and what blank node morpholo- often, to varying degrees, ambivalent to the existen- gies exist in the wild. tial semantics of blank nodes, where, e.g., the stan- §7 We give a detailed analysis of the prevalence dard query language SPARQL can return different of blank nodes that are made redundant by results for two graphs considered equivalent by the simple entailment, designing methods to effi- RDF semantics [4] and takes seemingly contradic- ciently identify non-leanness in RDF graphs tory positions on whether or not (named) graphs and discussing the results for a large sample can share blank nodes. of real-world Linked Data. Being part of the RDF specification, blank nodes §8 Finally, in light of the needs of the various are now a core aspect of Semantic Web technol- stakeholders already introduced, we discuss ogy: they are featured in several W3C standards, a some alternatives for handling blank nodes. wide range of tools, and hundreds of datasets across This paper extends a previously published confer- the Web, but not always with the same meaning ence paper [48]. Herein, we provide extended discus- (or at least, with the same intent). Dealing with sion throughout, we update our empirical analysis for the issue of blank nodes is thus not only impor- a more recent dataset, and we provide detailed tech- tant and timely, but also inherently complex and niques and results relating to classifying blank nodes potentially costly: before weighing up alternatives as lean or non-lean. for blank-nodes, their interpretation and adoption— across legacy standards, tool, and published data— must be considered. Running Example. Throughout this paper, we Given the complexity of the situation currently use the RDF graph given in Figure 1 to illustrate surrounding blank nodes—which spans several stan- our discussion. This graph states that the tennis dards, several perspectives, a plethora of tools and player :Federer won the :FrenchOpen in 2009; it hundreds of RDF publishers—our goal in this paper also states that he won :Wimbledon where one such is to bring clarity to this thorny issue by covering win was in 2003. 1 all aspects in one article. We provide a comprehen- sive overview of the current situation surrounding Preliminaries blank nodes, from theoretical background to stan- dardisation. We also provide novel techniques and results for analysing the blank nodes published in We begin by introducing the abstract representa- real-world Linked Data. Our outline is as follows: tion [32, 54] of the formal RDF model [36, 46] used in our theoretical discussion. We also introduce the §2 We provide some formal preliminaries relating semantics of RDF graphs containing blank nodes. to RDF and blank nodes. §3 We discuss blank nodes from a theoretical per- spective, relating them to background con- The RDF data model cepts from logic and database theory and, in so doing, we recapitulate some core worst- We assume the existence of pairwise disjoint infi- case complexity results. We further discuss nite sets U (URIs), L (literals) and B (blank nodes), Skolemisation in the context of RDF. where we write UB for the union of U and B, and §4 We look at how tasks such as simple entail- similarly for other combinations. 2 An RDF triple is ment and leanness checking can be performed in practice, where we discuss why worst-case 1 Throughout, we will omit the prefix declarations in complexity results are rarely encountered and SPARQL queries for brevity. The default prefix is used for remark on how SPARQL (the standard RDF example URIs. Other standard prefixes can be checked at query language) can be used for such tasks. the service http://prefix.cc. All URLs mentioned in this paper were retrieved at the time of writing: 2013/11/10. §5 We then survey how blank-nodes are treated 2 We generally stick to the term “URI” instead of “IRI” in the Semantic Web standards, how they are (supported by RDF 1.1) to follow conventions familiar from interpreted, what features rely on them, and the literature. For the purposes of this paper, URIs and IRIs can be considered interchangeable. 2 :TennisPlayer _:b1 year event "2003" type wins precededBy :Federer wins _:b2 event :FrenchOpen type :GrandSlamTournament type :Wimbledon year name wins precededBy "2009" "Roger Federer" _:b3 event Fig. 1. An RDF graph for our running example. In this graph, URIs are preceded by ‘:’, blank nodes by ‘_:’ and literals are enclosed in quotation marks. a tuple (s, p, o) ∈ UB × U × UBL where s is called instance µ(G) of G is proper if µ(G) has fewer blank the subject, p the predicate and o the object. nodes than G. This occurs when µ maps blank nodes to URIs or literals and/or maps two or more blank Definition 2.1 An RDF graph (or simply “graph”, nodes to the same blank node. where unambiguous) is a finite set of RDF triples. We now define two important operations on The set of terms of a graph G, denoted by graphs. The union of G1 and G2, denoted G1 ∪ G2, terms(G), is the set of elements of UBL that oc- is the set theoretical union of their sets of triples. A 4 cur in the triples of G.A vocabulary is a subset of merge of G1 and G2, denoted G1 +G2, is the union 0 0 0 0 UL. The vocabulary of G, denoted by voc(G), is de- G1 ∪ G2, where G1 and G2 are isomorphic copies of fined as terms(G) ∩ UL. Given a vocabulary V and G1 and G2, respectively, and where the sets of blank 0 0 a graph G, we say that G is a graph over V when- nodes in G1 and G2 are disjoint from each other. All ever voc(G) ⊆ V . A graph G is ground if it does not possible merges of G1 and G2 are pair-wise isomor- contain blank nodes (i.e., terms(G) ∩ B = ∅). phic such that G1 +G2 is unique up to isomorphism, A map is a partial function µ : UBL → UBL and we thus call it the merge of G1 and G2.

Load more