IOS Press

Learning SHACL Shapes from Knowledge Graphs

Pouya Ghiasnezhad Omran a,*, Kerry Taylor a, Sergio Rodriguez Mendez a and Armin Haller a
a School of Computing, The Australian National University, ACT, Australia
E-mails: [email protected], [email protected], [email protected], [email protected]

Abstract. Knowledge Graphs (KGs) have proliferated on the Web since the introduction of knowledge panels to Google search in 2012. KGs are large, data-first graphs with weak inference rules and weakly-constraining data schemes. SHACL, the Shapes Constraint Language, is a W3C recommendation for expressing constraints on graph data as shapes. SHACL shapes serve to validate a KG, to underpin manual KG editing tasks, and to offer insight into KG structure.
We introduce Inverse Open Path (IOP) rules, a predicate logic formalism which presents specific shapes in the form of paths over connected entities. Although IOP rules express simple shape patterns, they can be augmented with minimum cardinality constraints and also used as a building block for more complex shapes, such as trees and other rule patterns. We define quality measures for IOP rules and propose a novel method to learn high-quality rules from KGs. We show how to build high-quality tree shapes from the IOP rules. Our learning method, SHACLearner, is adapted from a state-of-the-art embedding-based open path rule learner (OPRL).
We evaluate SHACLearner on some real-world massive KGs, including YAGO2s (4M facts), DBpedia 3.8 (11M facts), and Wikidata (8M facts). The experiments show SHACLearner can learn informative and intuitive shapes from massive KGs effectively. Our experiments show the learned shapes are diverse in both structural features, such as depth and width, and in quality measures.
Keywords: SHACL Shape Learning, Shapes Constraint Language, Knowledge Graph, Inverse Open Path Rule

1. Introduction

While public knowledge graphs (KGs) became popular with the development of DBpedia [1] and [2] more than a decade ago, interest in enterprise knowledge graphs [3] has only taken off since the inclusion of knowledge panels on the Google Search engine, driven by its internal knowledge graph, in 2012. Although these KGs are massive and diverse, they are typically incomplete. Regardless of the method that is used to build a KG (e.g. collaboratively vs individually, manually vs automatically), it will be incomplete because of the evolving nature of human knowledge, cultural bias [4] and resource constraints. Consider Wikidata [5], for example, where there is more complete information for some types of entities (e.g. pop stars), while less for others (e.g. opera singers). Even for the same type of entity, for example, computer scientists, there are different depths of detail depending on the country of origin of the scientist.

However, the power of KGs comes from their data-first approach, enabling contributors to extend a KG in a relatively arbitrary manner. By contrast, a relational database typically employs not-null and other constraints that require some attributes to be instantiated at all times. Large KGs are typically populated by automatic and semi-automatic methods using non-structured sources (e.g. Wikipedia) that are prone to errors of omission and commission.

SHACL [6] was formally recommended by the W3C in 2017 to express constraints on a KG as shapes. For example, SHACL can be used to express that a person needs to have a name, birth date, and place of birth, and that these attributes have particular types: a string;

*Corresponding author. E-mail: [email protected].

1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved

Ghiasnezhad Omran et al. / SHACLearner

a date; a location. The shapes are used to guide the population of a KG, although they are not necessarily enforced. Typically, SHACL shapes are manually specified and the methods to do this are well-studied. However, as for multidimensional relational database schemes [7], shapes could, in principle, be inferred from KG data. As frequent patterns, the shapes characterise a KG and can be used for subsequent data cleaning or ongoing data entry. There is scant previous research on this topic [8].

While basic SHACL [6] and its advanced features [9] allow the modelling of diverse shapes including rules and constraints, most of these shapes are previously well known as expressed by alternative formalisms, including closed rules [10], trees, existential rules [11], and graph functional dependencies [12]. We claim that the common underlying form of all these shapes is the path, over which additional constraints induce the various versions. For example, in DBpedia we discover the following path, <dbo:type>_<db:Song>(x) → <dbo:album>(x, y) ∧ <dbo:recordLabel>(y, z), which expresses that if an entity x is a song, then x is in an album y which has a record label z.

Since the satisfaction of a less-constrained shape is a necessary condition for satisfaction of a more complex shape (but not a sufficient condition), in this paper we focus on learning paths, the least constrained shape for our purposes. Paths can serve as the basis for more complex shapes. We also investigate the process of constructing one kind of more complex shape, that is a tree, out of paths. For example, we discover a tree about an entity which has song as its type, as we show in Fig. 1. In a KG context, the tree suggests that if we have an entity of type song in the KG, then we would expect to have the associated facts as well.

Fig. 1. A tree shape for the Song concept from DBpedia.

In this paper, we present a system, SHACLearner, that mines shapes from KG data. For this purpose we propose a predicate calculus formalism in which rules have a single body atom and a chain of conjunctive atoms in the head with a specific variable binding pattern. Since these rules are an inverse version of open path rules [13], we call them inverse open path (IOP) rules. To learn IOP rules we adapt an embedding-based open path rule learner, OPRL [13]. We define quality measures to express the validity of IOP rules in a KG. SHACLearner uses the mined IOP rules to subsequently discover more complex tree shapes. Each IOP rule or tree is a SHACL shape, in the sense that it can be syntactically rewritten in SHACL. Our mined shapes are augmented with a novel numerical confidence measure to express the strength of evidence in the KG for each shape.

2. Preliminaries

An entity e is an identifier for an object such as a place or a person. A fact (also known as a link) is an RDF triple (e, P, e′), written here as P(e, e′), meaning that the subject entity e is related to an object entity e′ via the binary predicate (also known as a property) P. In addition, we admit unary predicates of the form P(e), also written as the fact P(e, e). We model unary predicates as self-loops to make the unary predicate act as the label of a link in the graph, just as for binary predicates. Unary predicates may, but are not limited to, represent class assertions expressed in an RDF triple as (e, rdf:type, P) where P is a class or a datatype. A knowledge graph (KG) is a pair K = (E, F), where E is a set of entities, F is a set of facts, and all the entities occurring in F also occur in E.

2.1. Closed-Path Rules

KG rule learning systems employ various rule languages to express rules. RLvLR [14] and ScaLeKB [15] use so-called closed path (CP) rules that are a kind of closed rule as they have no free variables. Each consists of a head at the front of the implication arrow and a body at the tail. We say the rule is about the predicate of the head. The rule forms a closed path, or single unbroken loop of links between the variables. It has the following general form:

Pt(x, y) ← P1(x, z1) ∧ P2(z1, z2) ∧ ... ∧ Pn(zn−1, y).  (1)
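As a concrete illustration of how a CP rule of this form can be applied and scored, the following is a minimal set-based sketch. It is our own toy code, not the authors' implementation: the (subject, predicate, object) triple-set representation and helper names are assumptions, and the facts extend the paper's citizenOf example with an invented entity "Bob". The supp, SC and HC values follow the measures formalised below in Definitions 1 and 2.

```python
from collections import defaultdict

def body_pairs(facts, body):
    """All entity pairs (x, y) connected by the chain of body predicates.
    `facts` is a set of (subject, predicate, object) triples."""
    pairs = {(s, o) for (s, p, o) in facts if p == body[0]}
    for pred in body[1:]:
        step = defaultdict(set)          # subject -> reachable objects for `pred`
        for (s, p, o) in facts:
            if p == pred:
                step[s].add(o)
        pairs = {(x, z) for (x, y) in pairs for z in step[y]}
    return pairs

def cp_rule(facts, head, body):
    """Apply a CP rule head(x, y) <- body chain: infer new head facts from
    body paths, and report supp, SC and HC."""
    b = body_pairs(facts, body)
    h = {(s, o) for (s, p, o) in facts if p == head}
    inferred = {(x, head, y) for (x, y) in b - h}    # new facts the rule asserts
    supp = len(b & h)
    return inferred, supp, supp / len(b), supp / len(h)

# Toy KG built around the paper's citizenOf example (Bob/Perth are invented).
facts = {("Mary", "livesIn", "Canberra"),
         ("Canberra", "locatedIn", "Australia"),
         ("Mary", "citizenOf", "Australia"),
         ("Bob", "livesIn", "Perth"),
         ("Perth", "locatedIn", "Australia")}
inferred, supp, sc, hc = cp_rule(facts, "citizenOf", ["livesIn", "locatedIn"])
```

Here the body is satisfied by (Mary, Australia) and (Bob, Australia) but only Mary's citizenship is already a fact, so the rule infers citizenOf(Bob, Australia) with supp = 1, SC = 1/2 and HC = 1/1.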

We interpret these kinds of rules with quantification of all variables at the outside, and so we can infer a fact that instantiates the head of the rule by finding an instantiation of the body of the rule in the KG. For example, from the rule citizenOf(x, y) ← livesIn(x, z) ∧ locatedIn(z, y), if we have the facts livesIn(Mary, Canberra) and locatedIn(Canberra, Australia) in the KG, then we can infer and assert the following new fact: citizenOf(Mary, Australia).

Rules are considered of more use if they generalise well, that is, they explain many facts. To quantify this idea we recall the measures support, head coverage and standard confidence that are used in some major approaches to rule learning, including [15] and [10].

Definition 1 (satisfies, support). Let r be a CP rule of the form (1). A pair of entities (e, e′) satisfies the body of r, denoted bodyr(e, e′), if there exist entities e1, ..., en−1 in the KG such that all of {P1(e, e1), P2(e1, e2), ..., Pn(en−1, e′)} are facts in the KG. Further, (e, e′) satisfies the head of r, denoted Pt(e, e′), if Pt(e, e′) is a fact in the KG. Then the support of r counts the rule instances for which the body and the head are both satisfied in the KG:

supp(r) = |{(e, e′) : bodyr(e, e′) and Pt(e, e′)}|

Standard confidence (SC) and head coverage (HC) offer standardised measures for comparing rule quality. SC describes how frequently the rule is true, i.e. of the number of entity pairs that satisfy the body in the KG, what proportion of the inferred head instances are satisfied? It is closely related to confidence, widely used in association rule mining [16]. HC measures the explanatory power of the rule, i.e. what proportion of the facts satisfying the head of the rule could be inferred by satisfying the rule body? It is closely related to cover, which is widely used for rule learning in inductive logic programming [17]. A non-recursive rule that has both 100% SC and HC is redundant with respect to the KG, and every KG fact that is an instance of the rule head is redundant with respect to the rule.

Definition 2 (standard confidence, head coverage). Let r, e, e′, bodyr be as given in Definition 1. Then standard confidence,

SC(r) = supp(r) / |{(e, e′) : bodyr(e, e′)}|

and head coverage,

HC(r) = supp(r) / |{(e, e′) : Pt(e, e′)}|

2.2. Open-Path Rules: Rules with Free Variables

Unlike all earlier work in rule mining for KG completion, active knowledge graph completion [13] defines open path (OP) rules of the form:

Pt(x, z0) ← P1(z0, z1) ∧ P2(z1, z2) ∧ ... ∧ Pn(zn−1, y).  (2)

Here, Pi is a predicate in the KG and each of {x, zi, y} are entity variables. Unlike CP rules, OP rules do not necessarily form a loop, but a straightforward variable unification transforms an OP rule to a CP rule, and every entity instantiation of a CP rule is also an entity instantiation of the related OP rule (but not vice-versa). To assess the quality of open path rules, open path standard confidence (OPSC) and open path head coverage (OPHC) are derived in [13] from the closed path forms (Definition 2).

Definition 3 (open path: OPsupp, OPSC, OPHC). Let r be an OP rule of the form (2). Then a pair of entities (e, e′) satisfies the body of r, denoted bodyr(e, e′), if there exist entities e1, ..., en−1 in the KG such that P1(e, e1), P2(e1, e2), ..., Pn(en−1, e′) are facts in the KG. Also (e′′, e) satisfies the head of r, denoted Pt(e′′, e), if Pt(e′′, e) is a fact in the KG. The open path support, open path standard confidence, and open path head coverage of r are given respectively by

OPsupp(r) = |{e : ∃e′, e′′ s.t. bodyr(e, e′) and Pt(e′′, e)}|

OPSC(r) = OPsupp(r) / |{e : ∃e′ s.t. bodyr(e, e′)}|

OPHC(r) = OPsupp(r) / |{e : ∃e′′ s.t. Pt(e′′, e)}|

2.3. SHACL Shapes

A KG is a schema-free database and does not need to be augmented with schema information natively. However, many KGs are augmented with type information that can be used to understand and validate data, and can also be very helpful for inference processes on the KG. In 2017 the Shapes Constraint Language (SHACL) [6] was introduced as a W3C recommendation to define schema information for KGs stored as RDF datasets. SHACL defines constraints for graphs

as shapes. KGs can then be validated against a set of shapes.

Shapes serve two main purposes: validating the quality of a KG and characterising the frequent patterns in a KG. In Fig. 2, we illustrate an example of a shape from Wikidata¹, where x and the zi are variables that are instantiated by entities. Although the shape is originally expressed in ShEx [18], we translate it to SHACL here.

¹ https://www.wikidata.org/wiki/EntitySchema:E10

    Shape:human
        a sh:NodeShape ;
        sh:targetclass class:human ;
        sh:property [
            sh:path: citizenOf ;
            sh:path [ citizenOf
                sh:nodekind country ; ]
        ] ;
        sh:property [
            sh:path: father ;
            sh:path [ father
                sh:nodekind human ; ]
        ] ;
        sh:property [
            sh:path: nativeLanguage ;
            sh:path [ nativeLanguage
                sh:nodekind language ; ]
        ] .

Fig. 2. An example of a SHACL shape from Wikidata.

SHACL, together with SHACL advanced features [9], is extensive. Here we focus on the core of SHACL, in which node shapes constrain a target predicate (e.g. the unary predicate human in Fig. 2), with property shapes expressing constraints over facts related to the target predicate. We particularly focus on property shapes which act to constrain an argument of the target predicate. Continuing the example of Fig. 2, the shape expresses that each entity x which satisfies human(x) should satisfy the following paths: (1) citizenOf(x, z1) ∧ country(z1), (2) father(x, z2) ∧ human(z2), and (3) nativeLanguage(x, z3) ∧ language(z3).

Many kinds of shapes, including tree and closed rule, can be expressed by SHACL. We focus on path shapes for our shape learning system. A path is a sequence of predicates with chained intermediate variables but permitting unbound variables at both ends. Although shapes in the form of a path are not as constrained as more complex shapes like closed rules, they are a necessary condition for the more complex shapes like closed rule or tree, which are also paths (with further restrictions). We will define Inverse Open Path rules induced from paths that have a straightforward interpretation as shapes, and also propose a method to mine such rules from a KG. To demonstrate the potential for these kinds of shapes to serve as building blocks for more complex shapes, we then propose a method that builds trees out of mined rules, and discuss the application of such trees to KG completion.

3. SHACL learning

3.1. Rules with Free Variables or Uncertain Shapes

We observe that the converse of OP rules, which we call inverse open path (IOP) rules, correspond to SHACL shapes. For example, the shape of Fig. 2 has the following three IOP rules as building blocks:

human(x, x) → citizenOf(x, z1) ∧ country(z1, z1).
human(x, x) → father(x, z2) ∧ human(z2, z2).
human(x, x) → nativeLanguage(x, z3) ∧ language(z3, z3).

The general form of an IOP rule is given by

Pt(x, z0) → ∃(z1, ..., zn−1, y) P′1(z0, z1) ∧ P′2(z1, z2) ∧ ... ∧ P′n(zn−1, y).  (3)

where each P′i is either a predicate in the KG or its reverse with the subject and object bindings swapped. These are not Horn rules. In an IOP rule the body of the rule is Pt and its head is the sequence of predicates P′1 ∧ P′2 ∧ ... ∧ P′n. Hence we instantiate the atomic body to predict an instance of the head. This pattern of existential quantification in the head and free variables in the body of a rule has been investigated in the literature as existential rules [11].
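The claim that an IOP rule can be syntactically rewritten in SHACL can be illustrated with a small serializer. This is only one possible rewriting, sketched by us under assumptions the paper does not fix: it uses a SHACL sequence path and sh:minCount for the cardinality annotation, and the ex: namespace, helper name and parameters are our own placeholders.

```python
def iop_to_shacl(target_class, head_path, min_count=1):
    """Render an IOP rule  C(x, x) -> P1(x, z1) ∧ ... ∧ Pn(zn-1, zn)
    as a SHACL node shape whose property shape uses a sequence path.
    The ex: namespace is a placeholder prefix; sh:minCount carries the
    rule's cardinality annotation Car."""
    path = " ".join("ex:" + p for p in head_path)
    return ("ex:{c}Shape a sh:NodeShape ;\n"
            "    sh:targetClass ex:{c} ;\n"
            "    sh:property [\n"
            "        sh:path ( {path} ) ;\n"
            "        sh:minCount {mc} ;\n"
            "    ] .").format(c=target_class, path=path, mc=min_count)

# The Song path from the introduction, rewritten as a node shape.
print(iop_to_shacl("Song", ["album", "recordLabel"]))
```

For the Song example this emits a node shape targeting ex:Song whose property shape has the sequence path ( ex:album ex:recordLabel ), i.e. every song should reach a record label through an album.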

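Definition 4 below formalises the quality measures IOPsupp, IOPSC and IOPHC, and Definition 5 the cardinality annotation Car. As a preview, they can be prototyped set-wise; this is our own sketch over a triple-set KG, not the authors' matrix-based implementation, and the toy facts reproduce the worked example of Section 3.2.1.

```python
def head_ends(facts, head_path, e):
    """Entities e' reachable from e along the chain of head predicates."""
    frontier = {e}
    for pred in head_path:
        frontier = {o for (s, p, o) in facts if p == pred and s in frontier}
    return frontier

def iop_measures(facts, body_pred, head_path, car=1):
    """IOPsupp, IOPSC and IOPHC of an annotated IOP rule
    Pt(x, z0) -> P1(z0, z1) ∧ ... ∧ Pn(zn-1, y): an entity counts toward
    the head only if the chain from it reaches at least `car` distinct
    end entities (the cardinality annotation)."""
    body_es = {o for (s, p, o) in facts if p == body_pred}   # {e : ∃e'' Pt(e'', e)}
    entities = {s for (s, p, o) in facts} | {o for (s, p, o) in facts}
    head_es = {e for e in entities
               if len(head_ends(facts, head_path, e)) >= car}
    supp = len(body_es & head_es)
    return supp, supp / len(body_es), supp / len(head_es)

# The example KG of Section 3.2.1, for the rule Pt(x, z0) -> P1(z0, z1) ∧ P2(z1, y).
facts = {("e1", "P1", "e2"), ("e2", "P1", "e1"), ("e2", "P1", "e3"), ("e3", "P1", "e1"),
         ("e1", "P2", "e2"), ("e3", "P2", "e2"), ("e3", "P2", "e3"),
         ("e1", "Pt", "e3"), ("e3", "Pt", "e2"), ("e3", "Pt", "e3")}
```

Calling iop_measures(facts, "Pt", ["P1", "P2"], car=1) gives (2, 1.0, 1.0) and car=2 gives (1, 0.5, 1.0), matching the IOPsupp, IOPSC and IOPHC values derived for r1 and r2 in the worked example of Section 3.2.1.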
To assess the quality of IOP rules we follow the quality measures for OP rules [13].

Definition 4 (inverse open path: IOPsupp, IOPSC, IOPHC). Let r be an IOP rule of the form (3). Then a pair of entities (e, e′) satisfies the head of r, denoted headr(e, e′), if there exist entities e1, ..., en−1 in the KG such that P′1(e, e1), P′2(e1, e2), ..., P′n(en−1, e′) are facts in the KG. A pair of entities (e′′, e) satisfies the body of r, denoted Pt(e′′, e), if Pt(e′′, e) is a fact in the KG. The inverse open path support, inverse open path standard confidence, and inverse open path head coverage of r are given respectively by

IOPsupp(r) = |{e : ∃e′, e′′ s.t. headr(e, e′) and Pt(e′′, e)}|

IOPSC(r) = IOPsupp(r) / |{e : ∃e′′ s.t. Pt(e′′, e)}|

IOPHC(r) = IOPsupp(r) / |{e : ∃e′ s.t. headr(e, e′)}|

Notably, because a single open path induces both an OP and an IOP rule, the support for an IOP rule is the same as the support for its corresponding OP form, IOPSC is the same as the corresponding OPHC, and IOPHC is the same as the corresponding OPSC. This close relationship between OP and IOP rules helps us to mine both OP and IOP rules in the one process. We show the relationship between an OP rule and its converse IOP version in the following example. Consider the OP rule P1(x, z0) ← P2(z0, z1) ∧ P3(z1, z2). Assume we have three entities ({e3, e4, e5}) which can instantiate z0 and satisfy both P1(x, z0) and P2(z0, z1) ∧ P3(z1, z2). Assume the number of entities that can instantiate z0 to satisfy the head part is 5 ({e1, e2, e3, e4, e5}) and the number of entities that can instantiate z0 to satisfy the body part is 7 ({e3, e4, e5, e6, e7, e8, e9}). Hence, we have the following for this rule: OPsupp = 3, OPSC = 3/7 and OPHC = 3/5. For the IOP version of the rule over the same KG, P1(x, z0) → P2(z0, z1) ∧ P3(z1, z2), we have the same entities to instantiate z0, while the predicate terms corresponding to the bodies and heads are swapped. Hence, we have the following for the IOP version of the rule: IOPsupp = 3, IOPSC = 3/5 and IOPHC = 3/7.

In many cases we need the rules to express not only the necessity of a chain of facts (the facts in the head of the IOP rule) but also the number of different chains which should exist. For example, we may need a rule to express that each human has at least two parents. Hence, we introduce IOP rules annotated with a cardinality, Car. This gives the following annotated form for each IOP rule:

IOPSC, IOPHC, Car : Pt(x, z0) → P′1(z0, z1) ∧ P′2(z1, z2) ∧ ... ∧ P′n(zn−1, y)  (4)

where IOPSC and IOPHC belong to [0, 1] and denote those qualities of the rule, and Car is an integer ≥ 1.

Definition 5 (Cardinality of an IOP rule, Car). Let r be an annotated IOP rule of the form (4) and let Car(r) be the cardinality annotation for r. Then r satisfies Car(r) iff for each entity e ∈ {e | ∃e′′ s.t. Pt(e′′, e)}, Car(r) ≤ |{e′ | headr(e, e′)}|.

The cardinality expresses a lower bound on the number of head paths which are satisfied in the KG for every instantiation of the variable that joins the body to the head. For example, if we have 0.8, 0.1, 2 : Pt(x, y) → P1(y, z) ∧ P2(z, t) and have Pt(e1, e2) satisfied, we should have at least two paths starting from e2 that instantiate two distinct entities for variable t, of the form P1(e2, z) ∧ P2(z, e3) and P1(e2, z) ∧ P2(z, e4). There is no limitation on the instantiations for variable z, which is scoped inside the head, so these two pairs of instantiations both satisfy a cardinality of 2: (1) P1(e2, e5) ∧ P2(e5, e3) and P1(e2, e5) ∧ P2(e5, e4), and (2) P1(e2, e5) ∧ P2(e5, e3) and P1(e2, e6) ∧ P2(e6, e4).

Rules with the same head and the same body may have different cardinalities. In the given example we might have the following rule as well: 0.9, 0.1, 1 : Pt(x, y) → P1(y, z) ∧ P2(z, t). While we have a rule with a cardinality c1, we also have rules with lower cardinalities 1, ..., (c1 − 1), but their IOPSCs should be as good as or better than the rule with cardinality c1. This statement is quite intuitive, since cardinality expresses the lower bound on the number of occurrences of the head.

Lemma 1 (IOPSC is non-increasing with length). Let r be an IOP rule of the form (3) with n ≥ 2 and let r′ be the IOP rule Pt(x, z0) → P′1(z0, z1) ∧ P′2(z1, z2) ∧ ... ∧ P′n−1(zn−2, y), being r shortened by the removal of the last head predicate. Then IOPSC(r) ≤ IOPSC(r′).

Proof. Observe that by Definition 4, the denominator of IOPSC() is not affected by the head of the rule, and so has the same value for both. Now, looking at the

numerators, we have that ∀e (∃e1, ..., en (P′1(e, e1) ∧ P′2(e1, e2) ∧ ... ∧ P′n−1(en−2, en−1) ∧ P′n(en−1, en)) ⟹ ∃e1, ..., en−1 (P′1(e, e1) ∧ P′2(e1, e2) ∧ ... ∧ P′n−1(en−2, en−1))). Therefore {e : ∃e′ s.t. headr(e, e′)} ⊆ {e : ∃e′ s.t. headr′(e, e′)}, and therefore {e : ∃e′, e′′ s.t. headr(e, e′) and Pt(e′′, e)} ⊆ {e : ∃e′, e′′ s.t. headr′(e, e′) and Pt(e′′, e)}. So IOPsupp(r) ≤ IOPsupp(r′) as required. □

3.2. IOP Learning through Representation Learning

We start with the open path rule learner OPRL [13] and adapt its embedding-based OP rule learning to learn annotated IOP rules. We call this new learner SHACLearner, shown in Algorithm 1.

Algorithm 1 SHACLearner
Input: a KG K, a target predicate Pt
Parameters: max rule length l, max rule cardinality MCar, MinIOPSC, MinIOPHC, and MinTreeSC
Output: a set of IOP rules R and Tree
1: K′ := Sampling(K, Pt)
2: (P, A) := Embeddings(K′)
3: R′ := ∅
4: for 2 ≤ k ≤ l do
5:     Add PathFinding(K′, Pt, P, A, k) to R′
6: end for
7: R := Evaluation(R′, K, MCar, MinIOPSC, MinIOPHC)
8: Tree := GreedySearch(R, MinTreeSC)
9: return Tree and R

SHACLearner uses a sampling method, Sampling(), which prunes the entities and predicates that are less relevant to the target predicate to obtain a sampled KG. The sample is fed to embedding learners such as RESCAL [19, 20] in Embeddings(). Then, in PathFinding(), SHACLearner uses the computed embedding representations of predicates and entities in heuristic functions that inform the generation of IOP rules bounded by a maximum length. Then potential IOP rules are evaluated, annotated, and filtered in Evaluation() to produce annotated IOP rules. Eventually, a tree is discovered for each argument of each target predicate by aggregating mined IOP rules.

In more detail, in line 1 the Sampling() method computes a fragment of the KG (K′) consisting of a bounded number of entities that are related to the goal predicate (i.e., Pt). This sampling is essential since embedding learners (e.g. HolE [19] and RESCAL) cannot handle massive KGs with millions of entities (e.g. YAGO2).

After sampling, in line 2, Embeddings() computes predicate embeddings as well as subject and object argument embeddings for all predicates which exist in K′ after sampling, as is done in RLvLR [14]. In a nutshell, we use RESCAL [19] to embed each entity ei into a vector Ei ∈ ℝ^d and each predicate Pk into a matrix Pk ∈ ℝ^(d×d), where ℝ is the set of real numbers and d is an integer (a parameter of RESCAL). For each given fact P0(e1, e2), the following scoring function is computed:

f(e1, P0, e2) = E1ᵀ · P0 · E2  (5)

The scoring function indicates the plausibility of the fact that e1 has relation P0 with e2. The two sets of embeddings, {Ei} and {Pk}, are learned by minimizing a loss function.

The argument embeddings are also computed in Embeddings(), according to the method proposed in [14]. To compute the subject (respectively object) argument embedding of a predicate Pk, we aggregate the embeddings of entities that occur as the subject (respectively object) of Pk in the KG. Hence for each predicate Pk we have two vectors, Pk¹ and Pk², that represent the subject argument and object argument of Pk respectively.

After that, in lines 3 to 7, PathFinding() produces potential IOP rules based on the embedding representation of the predicates which are involved in each rule. The potential rules are pruned by the scoring function heuristic proposed in [13] for OP rules. A high-scoring potential path suggests the existence of both OP and IOP rules.

An IOP rule Pt(x, y) → P1(y, z) ∧ P2(z, t) acts to connect entities satisfying the subject argument of the body predicate, Pt, to entities forming the object argument of the last predicate, P2, along a path of entities that satisfy a chain of predicates in the rule. There is a relationship between the logical statement of the rule and the following properties in embedding space [13]:

1. The predicate arguments that share a variable in the rule should have similar argument embeddings. For example, we should have the similarities Pt² ∼ P1¹ and P1² ∼ P2¹.
2. The whole path forms a composite predicate, like P∗(x, t) = Pt(x, y) ∧ P1(y, z) ∧ P2(z, t).

We compute the embedding representation of the composite predicate based on its components, P∗ = Pt · P1 · P2. Now we could check the plausibility of P∗(x, t) for any pair of entities by Equation 5. However, since we are interested in the existence of an entity-free rule, the following similarity will hold: 1 ≈ Pt¹ · Pt · P1 · P2 · P2².

Based on the above two properties, [13] defines two scoring functions that help us to heuristically mine the space of all possible IOP rules to produce a reduced set of candidate IOP rules. The ultimate evaluation of an IOP rule will be done in the next step, as we will describe.

In line 8, we use a greedy search to aggregate the discovered IOP rules for each argument of each target predicate and form a tree (as illustrated in Figure 3). The detail of this process will be described in Section 3.3.

3.2.1. Efficient Computation of Quality Measures
The structure of SHACLearner follows OPRL [13] very closely, so now we focus our attention on the efficient matrix computation of the quality measures that are novel for SHACLearner. Evaluation() in Algorithm 1 first evaluates candidate rules on the small sampled KG and selects only the rules with IOPsupp(r) ≥ 1. These may still include a large number of redundant and low-quality rules, and so are further downselected based on their IOPSC and IOPHC calculated over the full KG. We show in the following how to efficiently compute the measures over massive KGs using an adjacency matrix representation of the KG.

Let K = (E, F) with E = {e1, ..., en} be the set of all entities and P = {P1, ..., Pm} be the set of all predicates in F. Like RESCAL [19, 20], we represent K as a set of square n × n adjacency matrices by defining the function A. Specifically, the [i, j]th element A(Pk)[i, j] is 1 if the fact Pk(ei, ej) is in F, and 0 otherwise. Thus, A(Pk) is a matrix of binary values and the set {A(Pk) | k ∈ {1, ..., m}} represents K.

We illustrate the computation of IOPSC and IOPHC through an example. Consider the IOP rule r : Pt(x, z0) → P1(z0, z1) ∧ P2(z1, y). Let E = {e1, e2, e3} and

F = {P1(e1, e2), P1(e2, e1), P1(e2, e3), P1(e3, e1), P2(e1, e2), P2(e3, e2), P2(e3, e3), Pt(e1, e3), Pt(e3, e2), Pt(e3, e3)}.

The adjacency matrices for the predicates P1, P2 and Pt are (rows indexed by subject, columns by object):

A(P1) = [[0,1,0], [1,0,1], [1,0,0]],  A(P2) = [[0,1,0], [0,0,0], [0,1,1]],  A(Pt) = [[0,0,1], [0,0,0], [0,1,1]].

For IOPSC and IOPHC (Definition 4) we need to calculate: (1) the number of entities that satisfy the body of the rule, i.e. #e : ∃e′′ s.t. Pt(e′′, e); (2) the number of entities that satisfy the head of the rule, i.e. #e : ∃e′ s.t. headr(e, e′); and (3) the number of entities that join the head of the rule to its body, i.e. #e : ∃e′, e′′ s.t. headr(e, e′) and Pt(e′′, e).

For (1) we can read the pairs (e′′, e) directly from the matrix A(Pt). To find the distinct entities e we sum each column (corresponding to each value of the second argument) and transpose to obtain the vector V²(Pt). Each non-zero element of this vector indicates a satisfying e, and the number of distinct entities is given by simply counting the non-zero elements. In the example, summing the columns of A(Pt) and transposing gives V²(Pt) = (0, 1, 2)ᵀ, whose non-zero elements V²(Pt)[2] and V²(Pt)[3] show that {e2, e3} satisfy the body, with count 2.

For (2), the pairs (e, e′) satisfying the head are connected by the path P1, P2, ..., Pn and can be obtained, as for (1), from the matrix product A(P1)A(P2)...A(Pn), being the elements with a value ≥ Car (for rules with cardinality Car). To find the distinct entities e we sum each row (corresponding to each value of the first argument) to obtain the vector V¹(A(P1)A(P2)...A(Pn)). Each element of this vector with a value ≥ Car indicates a satisfying e, and the number of distinct entities is given by counting such elements. In the example we have

A(P1)A(P2) = [[0,0,0], [0,2,1], [0,1,0]] and V¹(A(P1)A(P2)) = (0, 3, 1)ᵀ,

01 1 01 1 1 01 1 1 1 with satisfying entities e2 and e3 and count of 2 for P1 (z0, z1) ∧ P2 (z1, z2) ∧ ... ∧ Pn (zn−1, y ) 1 2 Car = 1. For Car = 2, e satisfies the rule so the count 2 2 ∧P02(z , z2) ∧ P02(z2, z2) ∧ ... ∧ P02(z2 , y2) 3 is 1. 1 0 1 2 1 2 m m−1 3 4 Computing (3) is now straightforward. We have ... 4 5 that the row index of non-zero elements of V2(P ) 5 t 0q q 0q q q 0q q q 6 indicates entities that satisfy the second argument ∧P1 (z0, z1) ∧ P2 (z1, z2) ∧ ... ∧ Pt (zt−1, y ) 6 7 of the body and that row index of elements with a (6) 7 8 1 8 value > Car of V (A(P1)A(P2) ... A(Pm)) indi- 9 0 j 9 cates entities that satisfy the first argument of the where each Pi is either a predicate in the KG or its 10 head. Therefore we can find the entities that satisfy reverse with the subject and object bindings swapped. 10 11 11 both of these conditions by pairwise multiplication. In a tree we say the body of the shape is Pt and its head 12 2 1 2 12 That is, the entities we need are {ei | (V (Pt)[i] > is the sequence of paths or branches, Path ∧ Path ∧ 13 1 q 13 0 ∧ V (A(P1)A(P2) ... A(Pm))[i]) > Car and 1 6 ... ∧ Path . Hence we instantiate the atomic body to 14 14 i 6 n}, and the count of entities is the size of this set. predict an instance of the head. All head branches and 15 15 For the example, for Car = 1, we have {e2, e3} in the the body joint in one variable, z0. 16 16 set with count 2 and for Car = 2, we have {e2} in the To assess the quality of a path we follow the quality 17 17 set with count 1. measures for IOP rules. 18 Hence, we could have three versions of r, i.e., r1, r2 18 19 and r3 with three different Car of 1, 2, and 3 respec- Definition 6 (Tree: Treesupp, TreeSC). Let r be a 19 20 1 tree of the above form. Then a set of pairs of enti- 20 tively. For Car = 1, IOPsupp(r ) = |{e2, e3}| = 2. 
1 q 21 ties (e, e ), ..., (e, e ) satisfies the head of r, denoted 21 From Definition 4 we can obtain IOPHC(r1) = 2/2 22 1 2 headr(e), if there exist the following sequences of en- 22 and IOPS C(r ) = 2/2. For Car = 2, IOPsupp(r ) = 1 1 2 2 q q 23 2 tities e1, ..., en−1, e1, ..., em−1 and e1, ..., et−1 in the 23 |{e2}| = 1. We can obtain IOPHC(r ) = 1/1 and 1 1 1 1 1 1 1 1 24 2 KG such that P1(e, e1), P2(e1, e2), ..., Pn (en−1, e ), 24 IOPS C(r ) = 1/2. In this case, we have the same 2 2 2 2 2 2 2 2 q q 25 P1(e, e1), P2(e1, e2), ..., Pm(em−1, e ) and P1(e, e1), 25 qualities for Car = 3 as we have for Car = 2. q q q q q q 26 P2(e1, e2) , ..., Pt (et−1, e ) are facts in the KG. A 26 00 27 pair of entities (e , e) satisfies the body of r, denoted 27 3.3. From IOP Rules to Tree Shapes 00 00 28 Pt(e , e), if Pt(e , e) is a fact in the KG. The tree sup- 28 29 port and tree standard confidence of r are given re- 29 Now we turn to deriving SHACL trees from anno- spectively by 30 30 tated IOP rules. For each target predicate we learn two 31 31 00 00 32 sets of IOP rules, one binding the subject argument and Treesupp(r) =| {e : ∃e s.t. headr(e) and Pt(e , e)} | 32 33 the other binding the object argument, i.e. rules about 33 34 the predicate and rules about its reverse. We aggregate 34 35 the rules about a target predicate as a tree rooted at the 35 Treesupp(r) 36 predicate as illustrated in Fig. 3. 36 TreeS C(r) = 00 00 | {e : ∃e s.t. Pt(e , e)} | 37 For example, the shape of Fig. 2 has the following 37 38 tree: 38 To learn each tree we deploy a greedy search 39 39 (GreadySearch). To do so, we sort all rules that bind 40 human(x, x) → citizenO f (x, z1) ∧ country(z1, z1) 40 41 the subject argument (for the left hand tree in Fig. 41 ∧ f ather(x, z3) ∧ human(z3, z3) 42 3) in a non-increasing order with respect to IOPSC. 42 43 ∧ nativeLanguage(x, z4)∧ Then we iteratively try to add the first rule in the list 43 44 to the tree and compute the TreeSC. 
If the TreeSC 44 language(z4, z4). 45 drops below the defined threshold (i.e. TreeS CMIN) 45 46 we dismiss the rule otherwise we add it to the tree. 46 47 For the right hand tree we do the same with the rules 47 48 The general form of a tree is given by that bind the object argument of the target predicate. 48 49 Since a conjunction of IOP rules form a tree, TreeSC 49 50 is bounded above by the minimum IOPSC of its con- 50 0 ∗ ∗ 51 Pt(x, z0) → ∃(z∗, y ) stituent IOP rules. 51 Ghiasnezhad Omran et al. / SHACLearner 9
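The matrix-based rule evaluation illustrated by the worked example above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; entities e1, e2, e3 are indexed 0, 1, 2, and the helper names are ours.

```python
# Adjacency-matrix evaluation of the IOP rule
# r: Pt(x, z0) -> P1(z0, z1) ^ P2(z1, y) over E = {e1, e2, e3}.
n = 3  # entities e1, e2, e3 are indexed 0, 1, 2

def adjacency(facts):
    """A(P)[s][o] = 1 iff P(e_{s+1}, e_{o+1}) is a fact in the KG."""
    A = [[0] * n for _ in range(n)]
    for s, o in facts:
        A[s][o] = 1
    return A

A_P1 = adjacency([(0, 1), (1, 0), (1, 2), (2, 0)])  # P1 facts
A_P2 = adjacency([(0, 1), (2, 1), (2, 2)])          # P2 facts
A_Pt = adjacency([(0, 2), (2, 1), (2, 2)])          # Pt facts

# Chain product A(P1)A(P2); its row-sum vector V1 counts, for each
# entity, the number of head paths P1;P2 starting at that entity.
prod = [[sum(A_P1[i][k] * A_P2[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)]
V1 = [sum(row) for row in prod]                     # [0, 3, 1]

# V2[i] > 0 iff e_{i+1} occurs as the object (second argument) of Pt.
V2 = [sum(A_Pt[s][i] for s in range(n)) for i in range(n)]

def iop_support(car):
    # Entities instantiating the body's object that start >= car head paths.
    return sum(1 for i in range(n) if V2[i] > 0 and V1[i] >= car)

print(V1, iop_support(1), iop_support(2))  # [0, 3, 1] 2 1
```

Running the sketch reproduces the counts of the example: support 2 for Car = 1 (entities e2 and e3) and support 1 for Car = 2 and Car = 3 (entity e2 only).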

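The GreedySearch aggregation just described can be sketched as follows. This is a hedged illustration: Rule, greedy_tree and toy_tree_sc are our own names, and the toy TreeSC stand-in simply takes the minimum constituent IOPSC, which the text notes is only an upper bound on the true TreeSC computed over the KG.

```python
from collections import namedtuple

# A candidate branch: an IOP rule annotated with its IOPSC quality.
Rule = namedtuple("Rule", ["name", "iopsc"])

def greedy_tree(rules, tree_sc, tree_sc_min=0.1):
    """Aggregate IOP rules binding one argument of the target predicate
    into a tree, keeping a branch only if TreeSC stays above threshold."""
    tree = []
    # Consider candidate branches in non-increasing IOPSC order.
    for rule in sorted(rules, key=lambda r: r.iopsc, reverse=True):
        candidate = tree + [rule]
        # Dismiss the rule if adding it drops TreeSC below TreeSC_MIN.
        if tree_sc(candidate) >= tree_sc_min:
            tree = candidate
    return tree

# Toy stand-in for TreeSC: the true value is bounded above by the
# minimum constituent IOPSC, so we use that bound here for illustration.
def toy_tree_sc(branches):
    return min(r.iopsc for r in branches)

rules = [Rule("r1", 0.64), Rule("r2", 0.63), Rule("r3", 0.05)]
print([r.name for r in greedy_tree(rules, toy_tree_sc)])  # ['r1', 'r2']
```

In a real run, tree_sc would evaluate Definition 6 over the KG, and the procedure would be applied twice per target predicate: once for the subject-binding rules and once for the object-binding rules.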
Fig. 3. Trees for a target predicate Pt. Each vi indicates a TreeSC.

3.3.1. Tree Shapes are Useful for Human Interaction

Shapes offer KG documentation as readable patterns and also provide a way to validate a KG. Our novel tree shapes can additionally be used for KG-completion. While several methods have been proposed to complete KGs automatically (e.g. [14, 19, 20]) by predicting missing facts (also called links), all of these methods traverse the KG in a breadth-first manner. In other words, they sequentially answer a set of independent questions such as birthPlace(Trump, ?) or playsIn(Naomi_Watts, ?). Our proposed tree shapes instead provide an opportunity to work sequentially along a path of dependent questions, such as birthPlace(Trump, ?1) followed by capitalOf(?1, ?). The latter question cannot even be asked until we have an answer to the former, and the existence of an answer to the former gives us the confidence to proceed to the next question along the path. This completion strategy is depth-first, as it works through a shape tree. Importantly, when we want to ask such completion questions of a human, this depth-first questioning strategy reduces the cognitive load because of the contextual connections between successive questions. This strategy for human KG-completion is applied in the smart KG editor Schímatos [21].

Tree shapes can also help a human expert to extract a more intuitive, concise sub-tree from a longer, more complex tree when desired for explainability. If a tree with confidence TreeSC_orig is pruned, either by removing branches (width-wise) or by reducing the length of branches (depth-wise), it remains a valid tree, with confidence TreeSC_new satisfying TreeSC_new ⩾ TreeSC_orig. Hence, by pruning a tree we obtain a tree with a higher degree of confidence based on the data.

4. Related Work

There are some exploratory attempts to address learning SHACL shapes from KGs [8, 22–25]. They are procedural methods without logical foundations, and they have not been shown to scale to real-world KGs. They work with small amounts of data, and the representation formalism they use for their output is difficult to compare with the well-defined IOP rules we use in this paper. [25] carries out the task in a semi-automatic manner: it provides a sample of data to an off-the-shelf graph structure learner and presents the output in an interactive interface for a human user to create SHACL shapes.

Some works [26, 27] use existing ontologies for KGs to generate SHACL shapes. [26] uses two different kinds of knowledge to automatically generate SHACL shapes: ontology constraint patterns as well as input ontologies. In our work we use the KG itself to discover the shapes, without relying on external modelling artefacts.

From an application point of view, some papers investigate the application of SHACL shapes to the validation of RDF databases, including [28, 29], but these do not contribute to the discovery of shapes. [30] proposes an extended validation framework for the interaction between rules and SHACL shapes in KGs: given a set of inference rules and SHACL shapes, a method is proposed to detect which shapes could be violated by applying a rule.

There are some partial attempts to provide logical foundations for the semantics of the SHACL language, including [31], which presents the semantics of recursive SHACL shapes. By contrast, in our work we approach SHACL semantics in the reverse direction: we start with logical formalisms, with both well-defined semantics and motivating use cases, to derive shapes that can be trivially expressed in a fragment of SHACL.

5. Experiments

We have implemented SHACLEARNER² based on Algorithm 1, and conducted experiments to assess it. Our experiments are designed to demonstrate the effectiveness of SHACLEARNER at capturing shapes of varying confidence, length and cardinality from various real-world massive KGs. Since our proposed system is the first method to learn shapes from massive KGs automatically, we have no benchmark with which to compare. However, the performance of our system shows that it can handle the task satisfactorily and can be applied in practice. We demonstrate that:

1. SHACLEARNER is scalable, so it can handle real-world massive KGs, including DBpedia with over 11M facts.
2. SHACLEARNER can learn several shapes for each of various target predicates.
3. SHACLEARNER can discover shapes that are diverse with respect to the IOPSC and IOPHC quality measures.
4. SHACLEARNER discovers shapes of varying complexity, diverse with respect to length and cardinality.
5. SHACLEARNER discovers more complex shapes (trees) by aggregating learnt IOP rules efficiently.

²Detailed experimental results can be found (anonymously) at https://www.dropbox.com/sh/dn1kujeg0b6o9bw/AAAbKG9KfZ7e0eaNONz2zE93a?dl=0

Our three benchmark KGs are described in Table 1. All three are common KGs and have been used in rule-learning experiments previously [10, 14]. All experiments were conducted on an Intel Xeon CPU E5-2650 v4 @2.20GHz with 66GB RAM, running CentOS 8.

Table 1
Benchmark specifications

KG               #Facts      #Entities  #Predicates
YAGO2s           4,122,438   2,260,672  37
Wikidata         8,397,936   3,085,248  430
Wikidata +UP     8,780,716   3,085,248  587
DBpedia 3.8      11,024,066  3,102,999  650
DBpedia 3.8 +UP  11,498,575  3,102,999  1,005

5.0.1. Transforming KGs with type predicates for experiments

In real-world KGs, concept or class membership is modelled with binary facts whose second arguments are types or classes. Instead, we choose to model types and classes with unary predicates. To do so, from facts of the form <type>(x, <Album>), where x is the name of an album, we make a new predicate <type>_<Album>(x). Then we produce new unary facts based on the new predicate and the related facts; for example, from the YAGO2s fact <type>(Revolver, <Album>) we produce the new fact <type>_<Album>(Revolver). We manually choose two predicates from DBpedia 3.8 and one from Wikidata to generate these unary predicates and facts; each of these predicates has a class as its second argument. To prune classes with few instances, we keep only the new unary predicates that have at least 100 facts. We do not remove the original predicates and facts from the KG, but extend the KG with the new ones. In Table 1 we report the specifications of the two benchmarks to which we have added the unary predicates and facts (denoted +UP).

5.1. Learning IOP Rules

We follow the established approach for evaluating KG rule-learning methods, that is, measuring the quantity and quality of the distinct rules learnt. Rule quality is measured by inverse open path standard confidence (IOPSC) and inverse open path head coverage (IOPHC). We randomly selected 50 target predicates each from the Wikidata and DBpedia unary predicates (157 and 355 predicates respectively). We used all 37 binary predicates of YAGO2s as target predicates. Each binary target predicate serves as two target predicates, once in its straight form and once in its reverse form. In the straight form the object argument of the predicate is the joining variable connecting the body and head, while in the reverse form the subject argument serves to join. In this manner, we ensure that the results of SHACLEARNER on YAGO2s, with its binary predicates as targets, are comparable with the results for Wikidata and DBpedia, which have unary predicates as targets. Hence for YAGO2s we have 74 target predicates. A 10-hour limit was set for learning each target predicate.

Table 2 shows #Rules, the average number of quality IOP rules found; %SucTarget, the proportion of target predicates for which at least one IOP rule was found; and the running times in hours, averaged over the targets. Only high-quality rules meeting minimum quality thresholds are counted, that is, rules with IOPSC > 0.1 and IOPHC > 0.01, thresholds established in comparative work [10].

Table 2
Performance of SHACLEARNER on benchmark KGs

Benchmark    %SucTarget  #Rules  Time (hours)
YAGO2s       80          42      0.6
Wikidata+UP  82          67      1.8
DBpedia+UP   98          157     2.4
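The type-to-unary transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the triple representation, the <type>/<Album> names and the helper name are ours, not the paper's implementation.

```python
from collections import Counter

def add_unary_predicates(facts, type_predicates, min_facts=100):
    """facts: (predicate, subject, object) triples. Extends the KG with
    unary facts named predicate_class (object slot None) for every class
    having at least min_facts instances; original facts are kept."""
    counts = Counter((p, o) for p, s, o in facts if p in type_predicates)
    unary = [(f"{p}_{o}", s, None)
             for p, s, o in facts
             if p in type_predicates and counts[(p, o)] >= min_facts]
    return facts + unary

kg = [("<type>", "Revolver", "<Album>"),
      ("<type>", "Help!", "<Album>"),
      ("<artist>", "Revolver", "The_Beatles")]
extended = add_unary_predicates(kg, {"<type>"}, min_facts=2)
print(extended[-1])  # ('<type>_<Album>', 'Help!', None)
```

With the paper's threshold of min_facts=100, any class with fewer than 100 instances would produce no unary predicate at all.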

SHACLEARNER shows satisfactory performance in terms of both the runtime and the number of quality rules mined. Note that the rules found have a variety of lengths and cardinalities. To better present the quality of the rules, we illustrate the distribution of rules with respect to the features IOPSC, IOPHC, cardinality and length, and also IOPSC vs length. In the following, the proportion of mined rules having the various feature values is presented, to more evenly demonstrate the quality of performance over the three very different KGs.

The distribution of mined rules with respect to their IOPSC and IOPHC is shown in Fig. 4. In the left-hand chart we observe a consistent decrease in the proportion of quality rules as the IOPSC increases. In the right-hand chart we see a similar pattern for increasing IOPHC, but the decrease is not as consistent, because there must be frequent relevant rules with many covered head instances.

The distribution of mined rules with respect to their cardinalities is shown in Fig. 5. We observe that the largest proportion of rules has cardinality 1, as expected, since these rules have the least stringent requirements to be met in the KG. We observe the expected decrease at greater cardinalities, as they demand tighter restrictions to be satisfied.

The distribution of mined rules with respect to their lengths is shown in Fig. 6. One might expect that as the length increases, the number of rules would increase, since the space of possible rules grows, and this is what we see.

Fig. 4. Distribution of mined rules with respect to IOPSC and IOPHC.
Fig. 5. Distribution of mined rules with respect to their cardinalities.
Fig. 6. Distribution of mined rules with respect to their lengths.

For a concrete example of SHACL learning, we show the following three IOP rules mined from DBpedia in the experiments.

0.64, 0.01, 1 : <dbo:type>_<db:Song>(x) → <dbo:album>(x, z1) ∧ <dbo:recordLabel>(z1, y).
0.63, 0.01, 1 : <dbo:type>_<db:Song>(x) → <dbo:producer>(x, y).
0.41, 0.01, 2 : <dbo:type>_<db:Song>(x) → <dbo:producer>(x, y).

The numbers prefixing each IOP rule are the corresponding IOPSC, IOPHC and Cardinality annotations respectively. <dbo:type>_<db:Song>(x) expresses that x is a song. The first rule indicates that x should belong to an album (z1) that has y as its record label. The second rule requires a song (x) to have at least one producer, while the third rule requires a song to have at least two producers; these two rules are distinguished by the cardinality annotation. As we discussed in Section 3.1, the third rule is more constraining than the second, so the confidence of the third rule is lower than that of the second, based on the KG data.

The rules can be trivially rewritten (through a script that we developed) into SHACL syntax as follows, where the head of the first rule becomes a SHACL sequence path.

:s1
    a sh:NodeShape ;
    sh:targetClass class:<dbo:type>_<db:Song> ;
    sh:property [
        sh:minCount 1 ;
        sh:path ( <dbo:album> <dbo:recordLabel> ) ;
    ] .

:s2
    a sh:NodeShape ;
    sh:targetClass class:<dbo:type>_<db:Song> ;
    sh:property [
        sh:minCount 1 ;
        sh:path <dbo:producer> ;
    ] .

:s3
    a sh:NodeShape ;
    sh:targetClass class:<dbo:type>_<db:Song> ;
    sh:property [
        sh:minCount 2 ;
        sh:path <dbo:producer> ;
    ] .

5.1.1. IOPSC vs IOPHC

Using rules found in the experiments, we further illustrate the practical meaning of the IOPSC and IOPHC qualities. While IOPSC determines the confidence of a rule by counting the proportion of target-predicate instances for which the rule holds true in the KG, IOPHC indicates the proportion of rule-consequent instances that are justified by target-predicate instances in the KG, thereby indicating the relevance of the rule to the target. In Wikidata, all unary predicates are occupations, such as singer or entrepreneur, so all the entities that have these types turn out to be persons, even though there is no explicit person type in our KG. Thus, the occupations all have very similar IOP rules, with high IOPSC and low IOPHC, like the following one.

0.47, 0.01, 1 : <occupation>_<singer>(x) → <country_of_citizenship>(x, y).

On the other hand, for these unary occupation predicates there are also some IOP rules with high IOPHC that apply only to one specific unary predicate, like the following one.

0.15, 0.16, 5 : <occupation>_<singer>(x) → <performer_musical_artist>(x, y).

5.2. Learning Trees from IOP Rules

Now we turn to the results for the trees built from the IOP rules discovered in the experiments. We report the characteristics of the discovered trees in Table 3. We use a value of 0.1 for the TreeSC_MIN parameter, and show the average TreeSC for each KG, along with the average number of branches in the trees and the average runtime to build each tree. The number of trees for each KG is given by the number of target predicates for which we have at least one IOP rule (see %SucTarget in Table 2).

Table 3
Tree-learning performance of SHACLEARNER on benchmark KGs

Benchmark    TreeSC  #branches  Time (hours)
YAGO2s       0.13    21         0.06
Wikidata+UP  0.19    21         0.1
DBpedia+UP   0.2     88         0.14

The results show that the running time for aggregating IOP rules into trees is lower than the initial IOP mining time by a factor greater than 10. If, on the other hand, we wanted to discover such complex shapes from scratch, it would be prohibitively time-consuming, owing to the sensitivity of rule learners to the maximum length of rules: the number of potential rules in the search space grows exponentially with the maximum number of predicates in a rule. The average numbers of branches in the mined trees are 50%, 31% and 56% of the corresponding numbers of mined rules in Table 2. Hence, by imposing the additional tree-shaped constraint over the basic IOP-shaped constraint, at least 44% of IOP rules are pruned.

For an example of tree-shape learning, in the following we show a fragment of a 39-branched tree mined

from DBpedia by aggregating IOP rules in the experiments.

0.13 : <dbo:type>_<db:Song>(x) →
    1 : <dbo:album>(x, z1) ∧ <dbo:recordLabel>(z1, y1) ∧
    2 : <dbo:album>(x, z2) ∧ <dbo:producer>(z2, y2) ∧
    1 : <dbo:album>(x, z3) ∧ <dbo:genre>(z3, y3) ∧
    3 : <dbo:artist>(x, z4) ∧ <dbo:musicalArtist>(z4, y4).

Here, the first annotation value (0.13) is the SC of the tree, and the subsequent values at the beginning of each branch indicate the branch cardinality. This tree can be read as saying that a song has an album with a record label, an album with two producers, an album with a genre, and an artist who is a musical artist.

As can be seen here, there remains an opportunity for further simplification and explanatory value by unifying additional variables occurring in predicates shared over multiple branches. We plan to investigate this potential post-processing step in future work.

6. Conclusion

In this paper we propose a method to learn SHACL shapes from KGs as a way to describe KG patterns, to validate KGs, and also to support new data entry. For entities that satisfy target predicates, our shapes describe conjunctive paths of constraints over properties, enhanced with minimum cardinality constraints.

We reduce the SHACL learning problem to learning a novel kind of rule, the Inverse Open Path (IOP) rule. We introduce the rule quality measures IOPSC, IOPHC and Car, which augment the rules. IOPSC effectively extends SHACL shapes by representing the quantified uncertainty of a candidate shape, to be used when selecting shapes for interestingness or for KG verification. We also propose a method to aggregate learnt IOP rules in order to discover more complex shapes, trees.

The shapes support efficient and interpretable human validation in a depth-first manner, and are employed in the editor Schímatos [21] for manual knowledge-graph completion. The shapes can be used to complete information, triggered by entities with only a type or class declaration, by automatically generating dynamic data-entry forms. In this manual mode, they can also be used more traditionally to complete missing facts for a target predicate, as well as other predicates related to the target, while enabling the acquisition of facts about entities that are entirely missing from the KG.

To learn such rules we adapt an embedding-based open path rule learner (OPRL) by introducing the following novel components: (1) we propose IOP rules, which allow us to mine rules with free variables, with one predicate forming the body and a chain of predicates as the head, while keeping the complexity of the learning phase manageable; (2) we introduce tree shapes, built from the IOP rules, for more expressive patterns; and (3) we propose an efficient method to evaluate IOP rules and trees by exactly computing the quality measures of each rule using fast matrix and vector operations.

Our experiments show that SHACLEARNER can mine IOP rules of various lengths, cardinalities and qualities from three massive real-world benchmark KGs: YAGO2s, Wikidata and DBpedia.

In future work we will validate the shapes we learn with SHACLEARNER via formal human-expert evaluation, and we will further extend the expressivity of the shapes we can discover. Another direction for future work is to extend the SHACLEARNER algorithm with the MapReduce model, to handle extremely massive KGs with tens of billions of facts.

Acknowledgments

The authors acknowledge the support of the Australian Government Department of Finance and the Australian National University for this work.

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z.G. Ives, DBpedia: A Nucleus for a Web of Open Data, in: ISWC, Lecture Notes in Computer Science, Vol. 4825, Springer, 2007, pp. 722–735.
[2] F.M. Suchanek, G. Kasneci and G. Weikum, Yago: a core of semantic knowledge, in: WWW, ACM, 2007, pp. 697–706.

[3] N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson and J. Taylor, Industry-Scale Knowledge Graphs: Lessons and Challenges, Commun. ACM 62(8) (2019), 36–43.
[4] E.S. Callahan and S.C. Herring, Cultural bias in Wikipedia content on famous persons, Journal of the American Society for Information Science and Technology 62(10) (2011), 1899–1915.
[5] D. Vrandecic and M. Krötzsch, Wikidata: a free collaborative knowledgebase, Commun. ACM 57(10) (2014), 78–85.
[6] H. Knublauch and D. Kontokostas, Shapes Constraint Language (SHACL), 2017. https://www.w3.org/TR/shacl/.
[7] Z. Huo, K. Taylor, X. Zhang, S. Wang and C. Pang, Generating multidimensional schemata from relational aggregation queries, 23 (2020).
[8] P. Ghiasnezhad Omran, K. Taylor, S. Rodríguez Méndez and A. Haller, Towards SHACL Learning from Knowledge Graphs, in: ISWC 2020 Posters & Demonstrations, CEUR Proceedings, 2020, to appear.
[9] H. Knublauch, D. Allemang and S. Steyskal, SHACL Advanced Features, 2017. https://www.w3.org/TR/shacl-af/.
[10] L. Galárraga, C. Teflioudi, K. Hose and F.M. Suchanek, Fast rule mining in ontological knowledge bases with AMIE+, The International Journal on Very Large Data Bases (2015), 707–730.
[11] L. Bellomarini, E. Sallinger and G. Gottlob, The Vadalog system: Datalog-based reasoning for knowledge graphs, in: VLDB, Vol. 11, 2018, pp. 975–987.
[12] W. Fan, C. Hu, X. Liu and P. Lu, Discovering graph functional dependencies, in: SIGMOD, ACM, 2018, pp. 427–439.
[13] P. Ghiasnezhad Omran, K. Taylor, S. Rodríguez Méndez and A. Haller, Active Knowledge Graph Completion, Technical Report, ANU Research Publication, 2020.
[14] P.G. Omran, K. Wang and Z. Wang, Scalable Rule Learning via Learning Representation, in: IJCAI, 2018, pp. 2149–2155.
[15] Y. Chen, D.Z. Wang and S. Goldberg, ScaLeKB: scalable learning and inference over large knowledge bases, The International Journal on Very Large Data Bases (2016), 893–918.
[16] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in: VLDB, Vol. 1215, 1994, pp. 487–499.
[17] K. Taylor, Generalization by absorption of definite clauses, The Journal of Logic Programming 40(2–3) (1999), 127–157.
[18] S. Staworko, I. Boneva, J.E. Labra Gayo, S. Hym, E.G. Prud'hommeaux and H. Solbrig, Complexity and Expressiveness of ShEx for RDF, in: ICDT, 2015, pp. 195–211.
[19] M. Nickel, L. Rosasco and T. Poggio, Holographic Embeddings of Knowledge Graphs, in: AAAI, 2016, pp. 1955–1961.
[20] M. Nickel, V. Tresp and H.-P. Kriegel, A three-way model for collective learning on multi-relational data, in: ICML, 2011, pp. 809–816.
[21] J. Wright, S. Rodríguez Méndez, A. Haller, K. Taylor and P. Omran, Schímatos: a SHACL-based Web-Form Generator for Knowledge Graph Editing, in: ISWC, 2020.
[22] N. Mihindukulasooriya, M.R.A. Rashid, G. Rizzo, R. García-Castro, O. Corcho and M. Torchiano, RDF Shape Induction using Knowledge Base Profiling, in: Annual ACM Symposium on Applied Computing (SAC), 2018.
[23] D. Fernández-Álvarez, H. García-González, J. Frey, S. Hellmann and J.E.L. Gayo, Inference of latent shape expressions associated to DBpedia ontology, in: ISWC Posters, Vol. 2180, 2018.
[24] B. Spahiu, A. Maurino and M. Palmonari, Towards improving the quality of knowledge graphs with data-driven ontology patterns and SHACL, in: Workshop on Ontology Design and Patterns, Vol. 2195, 2018, pp. 52–66.
[25] I. Boneva, J. Dusart, D.F. Álvarez and J.E. Labra Gayo, Shape designer for ShEx and SHACL constraints, in: ISWC Posters, Vol. 2456, 2019, pp. 269–272.
[26] A. Cimmino, A. Fernández-Izquierdo and R. García-Castro, Astrea: Automatic Generation of SHACL Shapes from Ontologies, in: ESWC, Vol. 12123 LNCS, 2020, pp. 497–513.
[27] H.J. Pandit, D. O'Sullivan and D. Lewis, Using ontology design patterns to define SHACL shapes, in: CEUR Workshop Proceedings, Vol. 2195, 2018, pp. 67–71.
[28] J.-E. Labra-Gayo, E. Prud'hommeaux, H. Solbrig and I. Boneva, Validating and describing linked data portals using RDF shapes, CoRR (2017).
[29] I. Boneva, J.E. Labra Gayo and E.G. Prud'hommeaux, Semantics and validation of shapes schemas for RDF, in: ISWC, Vol. 10587 LNCS, 2017, pp. 104–120.
[30] P. Pareti, G. Konstantinidis, T.J. Norman and M. Şensoy, SHACL Constraints with Inference Rules, in: ISWC, 2019.
[31] J. Corman, J.L. Reutter and O. Savković, Semantics and validation of recursive SHACL, in: ISWC, Vol. 11136 LNCS, 2018, pp. 318–336.
[32] Anonymised, Active Knowledge Graph Completion, in: IJCAI (submitted/under review), 2020.
[33] Authors suppressed, Schímatos: A SHACL-based Web-Form Generator for Knowledge Graph Editing, 2020, submitted to ISWC 2020.