IOS Press

Learning SHACL Shapes from Knowledge Graphs

Pouya Ghiasnezhad Omran a,*, Kerry Taylor a, Sergio Rodriguez Mendez a and Armin Haller a
a School of Computing, The Australian National University, ACT, Australia
E-mails: [email protected], [email protected], [email protected], [email protected]

Abstract. Knowledge Graphs (KGs) have proliferated on the Web since the introduction of knowledge panels to Google search in 2012. KGs are large, data-first graphs with weak inference rules and weakly-constraining data schemes. SHACL, the Shapes Constraint Language, is a W3C recommendation for expressing constraints on graph data as shapes. SHACL shapes serve to validate a KG, to underpin manual KG editing tasks, and to offer insight into KG structure.
We introduce Inverse Open Path (IOP) rules, a predicate logic formalism which presents specific shapes in the form of paths over connected entities. Although IOP rules express simple shape patterns, they can be augmented with minimum cardinality constraints and also used as a building block for more complex shapes, such as trees and other rule patterns. We define quality measures for IOP rules and propose a novel method to learn high-quality rules from KGs. We show how to build high-quality tree shapes from the IOP rules. Our learning method, SHACLearner, is adapted from a state-of-the-art embedding-based open path rule learner (OPRL).
We evaluate SHACLearner on some real-world massive KGs, including YAGO2s (4M facts), DBpedia 3.8 (11M facts), and Wikidata (8M facts). The experiments show SHACLearner can learn informative and intuitive shapes from massive KGs effectively. Our experiments show the learned shapes are diverse in both structural features, such as depth and width, and in quality measures.
Keywords: SHACL Shape Learning, Shapes Constraint Language, Knowledge Graph, Inverse Open Path Rule

1. Introduction

While public knowledge graphs (KGs) became popular with the development of DBpedia [1] and [2] more than a decade ago, interest in enterprise knowledge graphs [3] has only taken off since the inclusion of knowledge panels on the Google Search engine, driven by its internal knowledge graph, in 2012. Although these KGs are massive and diverse, they are typically incomplete. Regardless of the method that is used to build a KG (e.g. collaboratively vs individually, manually vs automatically), it will be incomplete because of the evolving nature of human knowledge, cultural bias [4] and resource constraints. Consider Wikidata [5], for example, where there is more complete information for some types of entities (e.g. pop stars), while less for others (e.g. opera singers). Even for the same type of entity, for example, computer scientists, there are different depths of detail depending on the country of origin of the scientist.

However, the power of KGs comes from their data-first approach, enabling contributors to extend a KG in a relatively arbitrary manner. By contrast, a relational database typically employs not-null and other constraints that require some attributes to be instantiated at all times. Large KGs are typically populated by automatic and semi-automatic methods using non-structured sources (e.g. Wikipedia) that are prone to errors of omission and commission.

SHACL [6] was formally recommended by the W3C in 2017 to express constraints on a KG as shapes. For example, SHACL can be used to express that a person needs to have a name, birth date, and place of birth, and that these attributes have particular types: a string;

*Corresponding author. E-mail: [email protected].

1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved

Ghiasnezhad Omran et al. / SHACLearner

a date; a location. The shapes are used to guide the population of a KG, although they are not necessarily enforced. Typically, SHACL shapes are manually specified and the methods to do this are well-studied. However, as for multidimensional relational database schemes [7], shapes could, in principle, be inferred from KG data. As frequent patterns, the shapes characterise a KG and can be used for subsequent data cleaning or ongoing data entry. There is scant previous research on this topic [8].

While basic SHACL [6] and its advanced features [9] allow the modelling of diverse shapes including rules and constraints, most of these shapes are previously well known as expressed by alternative formalisms, including closed rules [10], trees, existential rules [11], and graph functional dependencies [12]. We claim that the common underlying form of all these shapes is the path, over which additional constraints induce the various versions. For example, in DBpedia we discover the following path, <dbo:type>_<db:Song>(x) → <dbo:album>(x, y) ∧ <dbo:recordLabel>(y, z), which expresses that if an entity x is a song, then x is in an album y which has a record label z.

Since the satisfaction of a less-constrained shape is a necessary condition for satisfaction of a more complex shape (but not a sufficient condition), in this paper we focus on learning paths, the least constrained shape for our purposes. Paths can serve as the basis for more complex shapes. We also investigate the process of constructing one kind of more complex shape, that is a tree, out of paths. For example, we discover a tree about an entity which has song as its type, as we show in Fig. 1. In a KG context, the tree suggests that if we have an entity of type song in the KG, then we would expect to have the associated facts as well.

Fig. 1. A tree shape for the Song concept from DBpedia.

In this paper, we present a system, SHACLearner, that mines shapes from KG data. For this purpose we propose a predicate calculus formalism in which rules have a single body atom and a chain of conjunctive atoms in the head with a specific variable binding pattern. Since these rules are an inverse version of open path rules [13], we call them inverse open path (IOP) rules. To learn IOP rules we adapt an embedding-based open path rule learner, OPRL [13]. We define quality measures to express the validity of IOP rules in a KG. SHACLearner uses the mined IOP rules to subsequently discover more complex tree shapes. Each IOP rule or tree is a SHACL shape, in the sense that it can be syntactically rewritten in SHACL. Our mined shapes are augmented with a novel numerical confidence measure to express the strength of evidence in the KG for each shape.

2. Preliminaries

An entity e is an identifier for an object such as a place or a person. A fact (also known as a link) is an RDF triple (e, P, e′), written here as P(e, e′), meaning that the subject entity e is related to an object entity e′ via the binary predicate (also known as a property) P. In addition, we admit unary predicates of the form P(e), also written as the fact P(e, e). We model unary predicates as self-loops to make the unary predicate act as the label of a link in the graph, just as for binary predicates. Unary predicates may, but are not limited to, represent class assertions expressed in an RDF triple as (e, rdf:type, P) where P is a class or a datatype. A knowledge graph (KG) is a pair K = (E, F), where E is a set of entities, F is a set of facts, and all the entities occurring in F also occur in E.

2.1. Closed-Path Rules

KG rule learning systems employ various rule languages to express rules. RLvLR [14] and ScaLeKB [15] use so-called closed path (CP) rules that are a kind of closed rule as they have no free variables. Each consists of a head at the front of the implication arrow and a body at the tail. We say the rule is about the predicate of the head. The rule forms a closed path, or single unbroken loop of links between the variables. It has the following general form:

Pt(x, y) ← P1(x, z1) ∧ P2(z1, z2) ∧ ... ∧ Pn(zn−1, y).  (1)
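As a concrete illustration of how a CP rule of this form can be applied and scored, the following is a minimal set-based sketch. It is our own toy code, not the authors' implementation: the (subject, predicate, object) triple-set representation and helper names are assumptions, and the facts extend the paper's citizenOf example with an invented entity "Bob". The supp, SC and HC values follow the measures formalised below in Definitions 1 and 2.

```python
from collections import defaultdict

def body_pairs(facts, body):
    """All entity pairs (x, y) connected by the chain of body predicates.
    `facts` is a set of (subject, predicate, object) triples."""
    pairs = {(s, o) for (s, p, o) in facts if p == body[0]}
    for pred in body[1:]:
        step = defaultdict(set)          # subject -> reachable objects for `pred`
        for (s, p, o) in facts:
            if p == pred:
                step[s].add(o)
        pairs = {(x, z) for (x, y) in pairs for z in step[y]}
    return pairs

def cp_rule(facts, head, body):
    """Apply a CP rule head(x, y) <- body chain: infer new head facts from
    body paths, and report supp, SC and HC."""
    b = body_pairs(facts, body)
    h = {(s, o) for (s, p, o) in facts if p == head}
    inferred = {(x, head, y) for (x, y) in b - h}    # new facts the rule asserts
    supp = len(b & h)
    return inferred, supp, supp / len(b), supp / len(h)

# Toy KG built around the paper's citizenOf example (Bob/Perth are invented).
facts = {("Mary", "livesIn", "Canberra"),
         ("Canberra", "locatedIn", "Australia"),
         ("Mary", "citizenOf", "Australia"),
         ("Bob", "livesIn", "Perth"),
         ("Perth", "locatedIn", "Australia")}
inferred, supp, sc, hc = cp_rule(facts, "citizenOf", ["livesIn", "locatedIn"])
```

Here the body is satisfied by (Mary, Australia) and (Bob, Australia) but only Mary's citizenship is already a fact, so the rule infers citizenOf(Bob, Australia) with supp = 1, SC = 1/2 and HC = 1/1.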

We interpret these kinds of rules with quantification of all variables at the outside, and so we can infer a fact that instantiates the head of the rule by finding an instantiation of the body of the rule in the KG. For example, from the rule citizenOf(x, y) ← livesIn(x, z) ∧ locatedIn(z, y), if we have the facts livesIn(Mary, Canberra) and locatedIn(Canberra, Australia) in the KG, then we can infer and assert the following new fact: citizenOf(Mary, Australia).

Rules are considered of more use if they generalise well, that is, they explain many facts. To quantify this idea we recall the measures support, head coverage and standard confidence that are used in some major approaches to rule learning, including [15] and [10].

Definition 1 (satisfies, support). Let r be a CP rule of the form (1). A pair of entities (e, e′) satisfies the body of r, denoted bodyr(e, e′), if there exist entities e1, ..., en−1 in the KG such that all of {P1(e, e1), P2(e1, e2), ..., Pn(en−1, e′)} are facts in the KG. Further, (e, e′) satisfies the head of r, denoted Pt(e, e′), if Pt(e, e′) is a fact in the KG. Then the support of r counts the rule instances for which the body and the head are both satisfied in the KG:

supp(r) = |{(e, e′) : bodyr(e, e′) and Pt(e, e′)}|

Standard confidence (SC) and head coverage (HC) offer standardised measures for comparing rule quality. SC describes how frequently the rule is true, i.e. of the number of entity pairs that satisfy the body in the KG, what proportion of the inferred head instances are satisfied? It is closely related to confidence, widely used in association rule mining [16]. HC measures the explanatory power of the rule, i.e. what proportion of the facts satisfying the head of the rule could be inferred by satisfying the rule body? It is closely related to cover, which is widely used for rule learning in inductive logic programming [17]. A non-recursive rule that has both 100% SC and HC is redundant with respect to the KG, and every KG fact that is an instance of the rule head is redundant with respect to the rule.

Definition 2 (standard confidence, head coverage). Let r, e, e′, bodyr be as given in Definition 1. Then standard confidence,

SC(r) = supp(r) / |{(e, e′) : bodyr(e, e′)}|

and head coverage,

HC(r) = supp(r) / |{(e, e′) : Pt(e, e′)}|

2.2. Open-Path Rules: Rules with Free Variables

Unlike all earlier work in rule mining for KG completion, active knowledge graph completion [13] defines open path (OP) rules of the form:

Pt(x, z0) ← P1(z0, z1) ∧ P2(z1, z2) ∧ ... ∧ Pn(zn−1, y).  (2)

Here, Pi is a predicate in the KG and each of {x, zi, y} are entity variables. Unlike CP rules, OP rules do not necessarily form a loop, but a straightforward variable unification transforms an OP rule to a CP rule, and every entity instantiation of a CP rule is also an entity instantiation of the related OP rule (but not vice-versa). To assess the quality of open path rules, open path standard confidence (OPSC) and open path head coverage (OPHC) are derived in [13] from the closed path forms (Definition 2).

Definition 3 (open path: OPsupp, OPSC, OPHC). Let r be an OP rule of the form (2). Then a pair of entities (e, e′) satisfies the body of r, denoted bodyr(e, e′), if there exist entities e1, ..., en−1 in the KG such that P1(e, e1), P2(e1, e2), ..., Pn(en−1, e′) are facts in the KG. Also (e′′, e) satisfies the head of r, denoted Pt(e′′, e), if Pt(e′′, e) is a fact in the KG. The open path support, open path standard confidence, and open path head coverage of r are given respectively by

OPsupp(r) = |{e : ∃e′, e′′ s.t. bodyr(e, e′) and Pt(e′′, e)}|

OPSC(r) = OPsupp(r) / |{e : ∃e′ s.t. bodyr(e, e′)}|

OPHC(r) = OPsupp(r) / |{e : ∃e′′ s.t. Pt(e′′, e)}|

2.3. SHACL Shapes

A KG is a schema-free database and does not need to be augmented with schema information natively. However, many KGs are augmented with type information that can be used to understand and validate data, and can also be very helpful for inference processes on the KG. In 2017 the Shapes Constraint Language (SHACL) [6] was introduced as a W3C recommendation to define schema information for KGs stored as RDF datasets. SHACL defines constraints for graphs

as shapes. KGs can then be validated against a set of shapes.

Shapes serve two main purposes: validating the quality of a KG and characterising the frequent patterns in a KG. In Fig. 2, we illustrate an example of a shape from Wikidata¹, where x and the zi are variables that are instantiated by entities. Although the shape is originally expressed in ShEx [18], we translate it to SHACL here.

¹ https://www.wikidata.org/wiki/EntitySchema:E10

    Shape:human
        a sh:NodeShape ;
        sh:targetclass class:human ;
        sh:property [
            sh:path: citizenOf ;
            sh:path [ citizenOf
                sh:nodekind country ; ]
        ] ;
        sh:property [
            sh:path: father ;
            sh:path [ father
                sh:nodekind human ; ]
        ] ;
        sh:property [
            sh:path: nativeLanguage ;
            sh:path [ nativeLanguage
                sh:nodekind language ; ]
        ] .

Fig. 2. An example of a SHACL shape from Wikidata.

SHACL, together with SHACL advanced features [9], is extensive. Here we focus on the core of SHACL, in which node shapes constrain a target predicate (e.g. the unary predicate human in Fig. 2), with property shapes expressing constraints over facts related to the target predicate. We particularly focus on property shapes which act to constrain an argument of the target predicate. Continuing the example of Fig. 2, the shape expresses that each entity x which satisfies human(x) should satisfy the following paths: (1) citizenOf(x, z1) ∧ country(z1), (2) father(x, z2) ∧ human(z2), and (3) nativeLanguage(x, z3) ∧ language(z3).

Many kinds of shapes, including tree and closed rule, can be expressed by SHACL. We focus on path shapes for our shape learning system. A path is a sequence of predicates with chained intermediate variables but permitting unbound variables at both ends. Although shapes in the form of a path are not as constrained as more complex shapes like closed rules, they are a necessary condition for the more complex shapes like closed rule or tree, which are also paths (with further restrictions). We will define Inverse Open Path rules induced from paths that have a straightforward interpretation as shapes, and also propose a method to mine such rules from a KG. To demonstrate the potential for these kinds of shapes to serve as building blocks for more complex shapes, we then propose a method that builds trees out of mined rules, and discuss the application of such trees to KG completion.

3. SHACL learning

3.1. Rules with Free Variables or Uncertain Shapes

We observe that the converse of OP rules, which we call inverse open path (IOP) rules, correspond to SHACL shapes. For example, the shape of Fig. 2 has the following three IOP rules as building blocks:

human(x, x) → citizenOf(x, z1) ∧ country(z1, z1).
human(x, x) → father(x, z2) ∧ human(z2, z2).
human(x, x) → nativeLanguage(x, z3) ∧ language(z3, z3).

The general form of an IOP rule is given by

Pt(x, z0) → ∃(z1, ..., zn−1, y) P′1(z0, z1) ∧ P′2(z1, z2) ∧ ... ∧ P′n(zn−1, y).  (3)

where each P′i is either a predicate in the KG or its reverse with the subject and object bindings swapped. These are not Horn rules. In an IOP rule the body of the rule is Pt and its head is the sequence of predicates P′1 ∧ P′2 ∧ ... ∧ P′n. Hence we instantiate the atomic body to predict an instance of the head. This pattern of existential quantification in the head and free variables in the body of a rule has been investigated in the literature as existential rules [11].
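The claim that an IOP rule can be syntactically rewritten in SHACL can be illustrated with a small serializer. This is only one possible rewriting, sketched by us under assumptions the paper does not fix: it uses a SHACL sequence path and sh:minCount for the cardinality annotation, and the ex: namespace, helper name and parameters are our own placeholders.

```python
def iop_to_shacl(target_class, head_path, min_count=1):
    """Render an IOP rule  C(x, x) -> P1(x, z1) ∧ ... ∧ Pn(zn-1, zn)
    as a SHACL node shape whose property shape uses a sequence path.
    The ex: namespace is a placeholder prefix; sh:minCount carries the
    rule's cardinality annotation Car."""
    path = " ".join("ex:" + p for p in head_path)
    return ("ex:{c}Shape a sh:NodeShape ;\n"
            "    sh:targetClass ex:{c} ;\n"
            "    sh:property [\n"
            "        sh:path ( {path} ) ;\n"
            "        sh:minCount {mc} ;\n"
            "    ] .").format(c=target_class, path=path, mc=min_count)

# The Song path from the introduction, rewritten as a node shape.
print(iop_to_shacl("Song", ["album", "recordLabel"]))
```

For the Song example this emits a node shape targeting ex:Song whose property shape has the sequence path ( ex:album ex:recordLabel ), i.e. every song should reach a record label through an album.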

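Definition 4 below formalises the quality measures IOPsupp, IOPSC and IOPHC, and Definition 5 the cardinality annotation Car. As a preview, they can be prototyped set-wise; this is our own sketch over a triple-set KG, not the authors' matrix-based implementation, and the toy facts reproduce the worked example of Section 3.2.1.

```python
def head_ends(facts, head_path, e):
    """Entities e' reachable from e along the chain of head predicates."""
    frontier = {e}
    for pred in head_path:
        frontier = {o for (s, p, o) in facts if p == pred and s in frontier}
    return frontier

def iop_measures(facts, body_pred, head_path, car=1):
    """IOPsupp, IOPSC and IOPHC of an annotated IOP rule
    Pt(x, z0) -> P1(z0, z1) ∧ ... ∧ Pn(zn-1, y): an entity counts toward
    the head only if the chain from it reaches at least `car` distinct
    end entities (the cardinality annotation)."""
    body_es = {o for (s, p, o) in facts if p == body_pred}   # {e : ∃e'' Pt(e'', e)}
    entities = {s for (s, p, o) in facts} | {o for (s, p, o) in facts}
    head_es = {e for e in entities
               if len(head_ends(facts, head_path, e)) >= car}
    supp = len(body_es & head_es)
    return supp, supp / len(body_es), supp / len(head_es)

# The example KG of Section 3.2.1, for the rule Pt(x, z0) -> P1(z0, z1) ∧ P2(z1, y).
facts = {("e1", "P1", "e2"), ("e2", "P1", "e1"), ("e2", "P1", "e3"), ("e3", "P1", "e1"),
         ("e1", "P2", "e2"), ("e3", "P2", "e2"), ("e3", "P2", "e3"),
         ("e1", "Pt", "e3"), ("e3", "Pt", "e2"), ("e3", "Pt", "e3")}
```

Calling iop_measures(facts, "Pt", ["P1", "P2"], car=1) gives (2, 1.0, 1.0) and car=2 gives (1, 0.5, 1.0), matching the IOPsupp, IOPSC and IOPHC values derived for r1 and r2 in the worked example of Section 3.2.1.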
To assess the quality of IOP rules we follow the quality measures for OP rules [13].

Definition 4 (inverse open path: IOPsupp, IOPSC, IOPHC). Let r be an IOP rule of the form (3). Then a pair of entities (e, e′) satisfies the head of r, denoted headr(e, e′), if there exist entities e1, ..., en−1 in the KG such that P′1(e, e1), P′2(e1, e2), ..., P′n(en−1, e′) are facts in the KG. A pair of entities (e′′, e) satisfies the body of r, denoted Pt(e′′, e), if Pt(e′′, e) is a fact in the KG. The inverse open path support, inverse open path standard confidence, and inverse open path head coverage of r are given respectively by

IOPsupp(r) = |{e : ∃e′, e′′ s.t. headr(e, e′) and Pt(e′′, e)}|

IOPSC(r) = IOPsupp(r) / |{e : ∃e′′ s.t. Pt(e′′, e)}|

IOPHC(r) = IOPsupp(r) / |{e : ∃e′ s.t. headr(e, e′)}|

Notably, because a single open path induces both an OP and an IOP rule, the support for an IOP rule is the same as the support for its corresponding OP form, IOPSC is the same as the corresponding OPHC, and IOPHC is the same as the corresponding OPSC. This close relationship between OP and IOP rules helps us to mine both OP and IOP rules in the one process. We show the relationship between an OP rule and its converse IOP version in the following example. Consider the OP rule P1(x, z0) ← P2(z0, z1) ∧ P3(z1, z2). Assume we have three entities ({e3, e4, e5}) which can instantiate z0 and satisfy both P1(x, z0) and P2(z0, z1) ∧ P3(z1, z2). Assume the number of entities that can instantiate z0 to satisfy the head part is 5 ({e1, e2, e3, e4, e5}) and the number of entities that can instantiate z0 to satisfy the body part is 7 ({e3, e4, e5, e6, e7, e8, e9}). Hence, we have the following for this rule: OPsupp = 3, OPSC = 3/7 and OPHC = 3/5. For the IOP version of the rule over the same KG, P1(x, z0) → P2(z0, z1) ∧ P3(z1, z2), we have the same entities to instantiate z0, while the predicate terms corresponding to the bodies and heads are swapped. Hence, we have the following for the IOP version of the rule: IOPsupp = 3, IOPSC = 3/5 and IOPHC = 3/7.

In many cases we need the rules to express not only the necessity of a chain of facts (the facts in the head of the IOP rule) but also the number of different chains which should exist. For example, we may need a rule to express that each human has at least two parents. Hence, we introduce IOP rules annotated with a cardinality, Car. This gives the following annotated form for each IOP rule:

IOPSC, IOPHC, Car : Pt(x, z0) → P′1(z0, z1) ∧ P′2(z1, z2) ∧ ... ∧ P′n(zn−1, y)  (4)

where IOPSC and IOPHC belong to [0, 1] and denote those qualities of the rule, and Car is an integer ≥ 1.

Definition 5 (Cardinality of an IOP rule, Car). Let r be an annotated IOP rule of the form (4) and let Car(r) be the cardinality annotation for r. Then r satisfies Car(r) iff for each entity e ∈ {e | ∃e′′ s.t. Pt(e′′, e)}, Car(r) ≤ |{e′ | headr(e, e′)}|.

The cardinality expresses a lower bound on the number of head paths which are satisfied in the KG for every instantiation of the variable that joins the body to the head. For example, if we have 0.8, 0.1, 2 : Pt(x, y) → P1(y, z) ∧ P2(z, t) and have Pt(e1, e2) satisfied, we should have at least two paths starting from e2 that instantiate two distinct entities for variable t, of the form P1(e2, z) ∧ P2(z, e3) and P1(e2, z) ∧ P2(z, e4). There is no limitation on the instantiations for variable z, which is scoped inside the head, so these two pairs of instantiations both satisfy a cardinality of 2: (1) P1(e2, e5) ∧ P2(e5, e3) and P1(e2, e5) ∧ P2(e5, e4), and (2) P1(e2, e5) ∧ P2(e5, e3) and P1(e2, e6) ∧ P2(e6, e4).

Rules with the same head and the same body may have different cardinalities. In the given example we might have the following rule as well: 0.9, 0.1, 1 : Pt(x, y) → P1(y, z) ∧ P2(z, t). While we have a rule with a cardinality c1, we also have rules with lower cardinalities 1, ..., (c1 − 1), but their IOPSCs should be as good as or better than the rule with cardinality c1. This statement is quite intuitive, since cardinality expresses the lower bound on the number of occurrences of the head.

Lemma 1 (IOPSC is non-increasing with length). Let r be an IOP rule of the form (3) with n ≥ 2 and let r′ be the IOP rule Pt(x, z0) → P′1(z0, z1) ∧ P′2(z1, z2) ∧ ... ∧ P′n−1(zn−2, y), being r shortened by the removal of the last head predicate. Then IOPSC(r) ≤ IOPSC(r′).

Proof. Observe that by Definition 4, the denominator of IOPSC() is not affected by the head of the rule, and so has the same value for both. Now, looking at the

numerators, we have that ∀e (∃e1, ..., en (P′1(e, e1) ∧ P′2(e1, e2) ∧ ... ∧ P′n−1(en−2, en−1) ∧ P′n(en−1, en)) ⟹ ∃e1, ..., en−1 (P′1(e, e1) ∧ P′2(e1, e2) ∧ ... ∧ P′n−1(en−2, en−1))). Therefore {e : ∃e′ s.t. headr(e, e′)} ⊆ {e : ∃e′ s.t. headr′(e, e′)}, and therefore {e : ∃e′, e′′ s.t. headr(e, e′) and Pt(e′′, e)} ⊆ {e : ∃e′, e′′ s.t. headr′(e, e′) and Pt(e′′, e)}. So IOPsupp(r) ≤ IOPsupp(r′) as required. □

3.2. IOP Learning through Representation Learning

We start with the open path rule learner OPRL [13] and adapt its embedding-based OP rule learning to learn annotated IOP rules. We call this new learner SHACLearner, shown in Algorithm 1.

Algorithm 1 SHACLearner
Input: a KG K, a target predicate Pt
Parameters: max rule length l, max rule cardinality MCar, MinIOPSC, MinIOPHC, and MinTreeSC
Output: a set of IOP rules R and Tree
1: K′ := Sampling(K, Pt)
2: (P, A) := Embeddings(K′)
3: R′ := ∅
4: for 2 ≤ k ≤ l do
5:     Add PathFinding(K′, Pt, P, A, k) to R′
6: end for
7: R := Evaluation(R′, K, MCar, MinIOPSC, MinIOPHC)
8: Tree := GreedySearch(R, MinTreeSC)
9: return Tree and R

SHACLearner uses a sampling method, Sampling(), which prunes the entities and predicates that are less relevant to the target predicate to obtain a sampled KG. The sample is fed to embedding learners such as RESCAL [19, 20] in Embeddings(). Then, in PathFinding(), SHACLearner uses the computed embedding representations of predicates and entities in heuristic functions that inform the generation of IOP rules bounded by a maximum length. Then potential IOP rules are evaluated, annotated, and filtered in Evaluation() to produce annotated IOP rules. Eventually, a tree is discovered for each argument of each target predicate by aggregating mined IOP rules.

In more detail, in line 1 the Sampling() method computes a fragment of the KG (K′) consisting of a bounded number of entities that are related to the goal predicate (i.e., Pt). This sampling is essential since embedding learners (e.g. HolE [19] and RESCAL) cannot handle massive KGs with millions of entities (e.g. YAGO2).

After sampling, in line 2, Embeddings() computes predicate embeddings as well as subject and object argument embeddings for all predicates which exist in K′ after sampling, as is done in RLvLR [14]. In a nutshell, we use RESCAL [19] to embed each entity ei into a vector Ei ∈ ℝ^d and each predicate Pk into a matrix Pk ∈ ℝ^(d×d), where ℝ is the set of real numbers and d is an integer (a parameter of RESCAL). For each given fact P0(e1, e2), the following scoring function is computed:

f(e1, P0, e2) = E1ᵀ · P0 · E2  (5)

The scoring function indicates the plausibility of the fact that e1 has relation P0 with e2. The two sets of embeddings, {Ei} and {Pk}, are learned by minimizing a loss function.

The argument embeddings are also computed in Embeddings(), according to the method proposed in [14]. To compute the subject (respectively object) argument embedding of a predicate Pk, we aggregate the embeddings of entities that occur as the subject (respectively object) of Pk in the KG. Hence for each predicate Pk we have two vectors, Pk¹ and Pk², that represent the subject argument and object argument of Pk respectively.

After that, in lines 3 to 7, PathFinding() produces potential IOP rules based on the embedding representation of the predicates which are involved in each rule. The potential rules are pruned by the scoring function heuristic proposed in [13] for OP rules. A high-scoring potential path suggests the existence of both OP and IOP rules.

An IOP rule Pt(x, y) → P1(y, z) ∧ P2(z, t) acts to connect entities satisfying the subject argument of the body predicate, Pt, to entities forming the object argument of the last predicate, P2, along a path of entities that satisfy a chain of predicates in the rule. There is a relationship between the logical statement of the rule and the following properties in embedding space [13]:

1. The predicate arguments that share a variable in the rule should have similar argument embeddings. For example, we should have the similarities Pt² ∼ P1¹ and P1² ∼ P2¹.
2. The whole path forms a composite predicate, like P∗(x, t) = Pt(x, y) ∧ P1(y, z) ∧ P2(z, t).

We compute the embedding representation of the composite predicate based on its components, P∗ = Pt · P1 · P2. Now we could check the plausibility of P∗(x, t) for any pair of entities by Equation 5. However, since we are interested in the existence of an entity-free rule, the following similarity will hold: 1 ≈ Pt¹ · Pt · P1 · P2 · P2².

Based on the above two properties, [13] defines two scoring functions that help us to heuristically mine the space of all possible IOP rules to produce a reduced set of candidate IOP rules. The ultimate evaluation of an IOP rule will be done in the next step, as we will describe.

In line 8, we use a greedy search to aggregate the discovered IOP rules for each argument of each target predicate and form a tree (as illustrated in Figure 3). The detail of this process will be described in Section 3.3.

3.2.1. Efficient Computation of Quality Measures
The structure of SHACLearner follows OPRL [13] very closely, so now we focus our attention on the efficient matrix computation of the quality measures that are novel for SHACLearner. Evaluation() in Algorithm 1 first evaluates candidate rules on the small sampled KG and selects only the rules with IOPsupp(r) ≥ 1. These may still include a large number of redundant and low-quality rules, and so are further downselected based on their IOPSC and IOPHC calculated over the full KG. We show in the following how to efficiently compute the measures over massive KGs using an adjacency matrix representation of the KG.

Let K = (E, F) with E = {e1, ..., en} be the set of all entities and P = {P1, ..., Pm} be the set of all predicates in F. Like RESCAL [19, 20], we represent K as a set of square n × n adjacency matrices by defining the function A. Specifically, the [i, j]th element A(Pk)[i, j] is 1 if the fact Pk(ei, ej) is in F, and 0 otherwise. Thus, A(Pk) is a matrix of binary values and the set {A(Pk) | k ∈ {1, ..., m}} represents K.

We illustrate the computation of IOPSC and IOPHC through an example. Consider the IOP rule r : Pt(x, z0) → P1(z0, z1) ∧ P2(z1, y). Let E = {e1, e2, e3} and

F = {P1(e1, e2), P1(e2, e1), P1(e2, e3), P1(e3, e1), P2(e1, e2), P2(e3, e2), P2(e3, e3), Pt(e1, e3), Pt(e3, e2), Pt(e3, e3)}.

The adjacency matrices for the predicates P1, P2 and Pt are (rows indexed by subject, columns by object):

A(P1) = [[0,1,0], [1,0,1], [1,0,0]],  A(P2) = [[0,1,0], [0,0,0], [0,1,1]],  A(Pt) = [[0,0,1], [0,0,0], [0,1,1]].

For IOPSC and IOPHC (Definition 4) we need to calculate: (1) the number of entities that satisfy the body of the rule, i.e. #e : ∃e′′ s.t. Pt(e′′, e); (2) the number of entities that satisfy the head of the rule, i.e. #e : ∃e′ s.t. headr(e, e′); and (3) the number of entities that join the head of the rule to its body, i.e. #e : ∃e′, e′′ s.t. headr(e, e′) and Pt(e′′, e).

For (1) we can read the pairs (e′′, e) directly from the matrix A(Pt). To find the distinct entities e we sum each column (corresponding to each value of the second argument) and transpose to obtain the vector V²(Pt). Each non-zero element of this vector indicates a satisfying e, and the number of distinct entities is given by simply counting the non-zero elements. In the example, summing the columns of A(Pt) and transposing gives V²(Pt) = (0, 1, 2)ᵀ, whose non-zero elements V²(Pt)[2] and V²(Pt)[3] show that {e2, e3} satisfy the body, with count 2.

For (2), the pairs (e, e′) satisfying the head are connected by the path P1, P2, ..., Pn and can be obtained, as for (1), from the matrix product A(P1)A(P2)...A(Pn), being the elements with a value ≥ Car (for rules with cardinality Car). To find the distinct entities e we sum each row (corresponding to each value of the first argument) to obtain the vector V¹(A(P1)A(P2)...A(Pn)). Each element of this vector with a value ≥ Car indicates a satisfying e, and the number of distinct entities is given by counting such elements. In the example we have

A(P1)A(P2) = [[0,0,0], [0,2,1], [0,1,0]] and V¹(A(P1)A(P2)) = (0, 3, 1)ᵀ,

01 1 01 1 1 01 1 1 1 with satisfying entities e2 and e3 and count of 2 for P1 (z0, z1) ∧ P2 (z1, z2) ∧ ... ∧ Pn (zn−1, y ) 1 2 Car = 1. For Car = 2, e satisfies the rule so the count 2 2 ∧P02(z , z2) ∧ P02(z2, z2) ∧ ... ∧ P02(z2 , y2) 3 is 1. 1 0 1 2 1 2 m m−1 3 4 Computing (3) is now straightforward. We have ... 4 5 that the row index of non-zero elements of V2(P ) 5 t 0q q 0q q q 0q q q 6 indicates entities that satisfy the second argument ∧P1 (z0, z1) ∧ P2 (z1, z2) ∧ ... ∧ Pt (zt−1, y ) 6 7 of the body and that row index of elements with a (6) 7 8 1 8 value > Car of V (A(P1)A(P2) ... A(Pm)) indi- 9 0 j 9 cates entities that satisfy the first argument of the where each Pi is either a predicate in the KG or its 10 head. Therefore we can find the entities that satisfy reverse with the subject and object bindings swapped. 10 11 11 both of these conditions by pairwise multiplication. In a tree we say the body of the shape is Pt and its head 12 2 1 2 12 That is, the entities we need are {ei | (V (Pt)[i] > is the sequence of paths or branches, Path ∧ Path ∧ 13 1 q 13 0 ∧ V (A(P1)A(P2) ... A(Pm))[i]) > Car and 1 6 ... ∧ Path . Hence we instantiate the atomic body to 14 14 i 6 n}, and the count of entities is the size of this set. predict an instance of the head. All head branches and 15 15 For the example, for Car = 1, we have {e2, e3} in the the body joint in one variable, z0. 16 16 set with count 2 and for Car = 2, we have {e2} in the To assess the quality of a path we follow the quality 17 17 set with count 1. measures for IOP rules. 18 Hence, we could have three versions of r, i.e., r1, r2 18 19 and r3 with three different Car of 1, 2, and 3 respec- Definition 6 (Tree: Treesupp, TreeSC). Let r be a 19 20 1 tree of the above form. Then a set of pairs of enti- 20 tively. For Car = 1, IOPsupp(r ) = |{e2, e3}| = 2. 
1 q 21 ties (e, e ), ..., (e, e ) satisfies the head of r, denoted 21 From Definition 4 we can obtain IOPHC(r1) = 2/2 22 1 2 headr(e), if there exist the following sequences of en- 22 and IOPS C(r ) = 2/2. For Car = 2, IOPsupp(r ) = 1 1 2 2 q q 23 2 tities e1, ..., en−1, e1, ..., em−1 and e1, ..., et−1 in the 23 |{e2}| = 1. We can obtain IOPHC(r ) = 1/1 and 1 1 1 1 1 1 1 1 24 2 KG such that P1(e, e1), P2(e1, e2), ..., Pn (en−1, e ), 24 IOPS C(r ) = 1/2. In this case, we have the same 2 2 2 2 2 2 2 2 q q 25 P1(e, e1), P2(e1, e2), ..., Pm(em−1, e ) and P1(e, e1), 25 qualities for Car = 3 as we have for Car = 2. q q q q q q 26 P2(e1, e2) , ..., Pt (et−1, e ) are facts in the KG. A 26 00 27 pair of entities (e , e) satisfies the body of r, denoted 27 3.3. From IOP Rules to Tree Shapes 00 00 28 Pt(e , e), if Pt(e , e) is a fact in the KG. The tree sup- 28 29 port and tree standard confidence of r are given re- 29 Now we turn to deriving SHACL trees from anno- spectively by 30 30 tated IOP rules. For each target predicate we learn two 31 31 00 00 32 sets of IOP rules, one binding the subject argument and Treesupp(r) =| {e : ∃e s.t. headr(e) and Pt(e , e)} | 32 33 the other binding the object argument, i.e. rules about 33 34 the predicate and rules about its reverse. We aggregate 34 35 the rules about a target predicate as a tree rooted at the 35 Treesupp(r) 36 predicate as illustrated in Fig. 3. 36 TreeS C(r) = 00 00 | {e : ∃e s.t. Pt(e , e)} | 37 For example, the shape of Fig. 2 has the following 37 38 tree: 38 To learn each tree we deploy a greedy search 39 39 (GreadySearch). To do so, we sort all rules that bind 40 human(x, x) → citizenO f (x, z1) ∧ country(z1, z1) 40 41 the subject argument (for the left hand tree in Fig. 41 ∧ f ather(x, z3) ∧ human(z3, z3) 42 3) in a non-increasing order with respect to IOPSC. 42 43 ∧ nativeLanguage(x, z4)∧ Then we iteratively try to add the first rule in the list 43 44 to the tree and compute the TreeSC. 
If the TreeSC 44 language(z4, z4). 45 drops below the defined threshold (i.e. TreeS CMIN) 45 46 we dismiss the rule otherwise we add it to the tree. 46 47 For the right hand tree we do the same with the rules 47 48 The general form of a tree is given by that bind the object argument of the target predicate. 48 49 Since a conjunction of IOP rules form a tree, TreeSC 49 50 is bounded above by the minimum IOPSC of its con- 50 0 ∗ ∗ 51 Pt(x, z0) → ∃(z∗, y ) stituent IOP rules. 51 Ghiasnezhad Omran et al. / SHACLearner 9
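The matrix-based rule evaluation illustrated by the worked example above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation; entities e1, e2, e3 are indexed 0, 1, 2, and the helper names are ours.

```python
# Adjacency-matrix evaluation of the IOP rule
# r: Pt(x, z0) -> P1(z0, z1) ^ P2(z1, y) over E = {e1, e2, e3}.
n = 3  # entities e1, e2, e3 are indexed 0, 1, 2

def adjacency(facts):
    """A(P)[s][o] = 1 iff P(e_{s+1}, e_{o+1}) is a fact in the KG."""
    A = [[0] * n for _ in range(n)]
    for s, o in facts:
        A[s][o] = 1
    return A

A_P1 = adjacency([(0, 1), (1, 0), (1, 2), (2, 0)])  # P1 facts
A_P2 = adjacency([(0, 1), (2, 1), (2, 2)])          # P2 facts
A_Pt = adjacency([(0, 2), (2, 1), (2, 2)])          # Pt facts

# Chain product A(P1)A(P2); its row-sum vector V1 counts, for each
# entity, the number of head paths P1;P2 starting at that entity.
prod = [[sum(A_P1[i][k] * A_P2[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)]
V1 = [sum(row) for row in prod]                     # [0, 3, 1]

# V2[i] > 0 iff e_{i+1} occurs as the object (second argument) of Pt.
V2 = [sum(A_Pt[s][i] for s in range(n)) for i in range(n)]

def iop_support(car):
    # Entities instantiating the body's object that start >= car head paths.
    return sum(1 for i in range(n) if V2[i] > 0 and V1[i] >= car)

print(V1, iop_support(1), iop_support(2))  # [0, 3, 1] 2 1
```

Running the sketch reproduces the counts of the example: support 2 for Car = 1 (entities e2 and e3) and support 1 for Car = 2 and Car = 3 (entity e2 only).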

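The GreedySearch aggregation just described can be sketched as follows. This is a hedged illustration: Rule, greedy_tree and toy_tree_sc are our own names, and the toy TreeSC stand-in simply takes the minimum constituent IOPSC, which the text notes is only an upper bound on the true TreeSC computed over the KG.

```python
from collections import namedtuple

# A candidate branch: an IOP rule annotated with its IOPSC quality.
Rule = namedtuple("Rule", ["name", "iopsc"])

def greedy_tree(rules, tree_sc, tree_sc_min=0.1):
    """Aggregate IOP rules binding one argument of the target predicate
    into a tree, keeping a branch only if TreeSC stays above threshold."""
    tree = []
    # Consider candidate branches in non-increasing IOPSC order.
    for rule in sorted(rules, key=lambda r: r.iopsc, reverse=True):
        candidate = tree + [rule]
        # Dismiss the rule if adding it drops TreeSC below TreeSC_MIN.
        if tree_sc(candidate) >= tree_sc_min:
            tree = candidate
    return tree

# Toy stand-in for TreeSC: the true value is bounded above by the
# minimum constituent IOPSC, so we use that bound here for illustration.
def toy_tree_sc(branches):
    return min(r.iopsc for r in branches)

rules = [Rule("r1", 0.64), Rule("r2", 0.63), Rule("r3", 0.05)]
print([r.name for r in greedy_tree(rules, toy_tree_sc)])  # ['r1', 'r2']
```

In a real run, tree_sc would evaluate Definition 6 over the KG, and the procedure would be applied twice per target predicate: once for the subject-binding rules and once for the object-binding rules.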
Fig. 3. Trees for a target predicate Pt. Each vi indicates a TreeSC.

3.3.1. Tree Shapes are Useful for Human Interaction

Shapes offer KG documentation as readable patterns and also provide a way to validate a KG. Our novel tree shapes can additionally be used for KG-completion. While several methods have been proposed to complete KGs automatically (e.g. [14, 19, 20]) by predicting missing facts (also called links), all of these methods traverse the KG in a breadth-first manner. In other words, they sequentially answer a set of independent questions such as birthPlace(Trump, ?) or playsIn(Naomi_Watts, ?). Our proposed tree shapes instead provide an opportunity to work sequentially along a path of dependent questions, such as birthPlace(Trump, ?1) followed by capitalOf(?1, ?). The latter question cannot even be asked until we have an answer to the former, and the existence of an answer to the former gives us the confidence to proceed to the next question along the path. This completion strategy is depth-first, as it works through a shape tree. Importantly, when we want to ask such completion questions of a human, this depth-first questioning strategy reduces the cognitive load because of the contextual connections between successive questions. This strategy for human KG-completion is applied in the smart KG editor Schímatos [21].

Tree shapes can also help a human expert to extract a more intuitive, concise sub-tree from a longer, more complex tree when desired for explainability. If a tree with confidence TreeSC_orig is pruned, either by removing branches (width-wise) or by reducing the length of branches (depth-wise), it remains a valid tree, with confidence TreeSC_new satisfying TreeSC_new ⩾ TreeSC_orig. Hence, by pruning a tree we obtain a tree with a higher degree of confidence based on the data.

4. Related Work

There are some exploratory attempts to address learning SHACL shapes from KGs [8, 22–25]. They are procedural methods without logical foundations, and they have not been shown to scale to real-world KGs. They work with small amounts of data, and the representation formalism they use for their output is difficult to compare with the well-defined IOP rules we use in this paper. [25] carries out the task in a semi-automatic manner: it provides a sample of data to an off-the-shelf graph structure learner and presents the output in an interactive interface for a human user to create SHACL shapes.

Some works [26, 27] use existing ontologies for KGs to generate SHACL shapes. [26] uses two different kinds of knowledge to automatically generate SHACL shapes: ontology constraint patterns as well as input ontologies. In our work we use the KG itself to discover the shapes, without relying on external modelling artefacts.

From an application point of view, some papers investigate the application of SHACL shapes to the validation of RDF databases, including [28, 29], but these do not contribute to the discovery of shapes. [30] proposes an extended validation framework for the interaction between rules and SHACL shapes in KGs: given a set of inference rules and SHACL shapes, a method is proposed to detect which shapes could be violated by applying a rule.

There are some partial attempts to provide logical foundations for the semantics of the SHACL language, including [31], which presents the semantics of recursive SHACL shapes. By contrast, in our work we approach SHACL semantics in the reverse direction: we start with logical formalisms, with both well-defined semantics and motivating use cases, to derive shapes that can be trivially expressed in a fragment of SHACL.

5. Experiments

We have implemented SHACLEARNER² based on Algorithm 1, and conducted experiments to assess it. Our experiments are designed to demonstrate the effectiveness of SHACLEARNER at capturing shapes of varying confidence, length and cardinality from various real-world massive KGs. Since our proposed system is the first method to learn shapes from massive KGs automatically, we have no benchmark with which to compare. However, the performance of our system shows that it can handle the task satisfactorily and can be applied in practice. We demonstrate that:

1. SHACLEARNER is scalable, so it can handle real-world massive KGs, including DBpedia with over 11M facts.
2. SHACLEARNER can learn several shapes for each of various target predicates.
3. SHACLEARNER can discover shapes that are diverse with respect to the IOPSC and IOPHC quality measures.
4. SHACLEARNER discovers shapes of varying complexity, diverse with respect to length and cardinality.
5. SHACLEARNER discovers more complex shapes (trees) by aggregating learnt IOP rules efficiently.

²Detailed experimental results can be found (anonymously) at https://www.dropbox.com/sh/dn1kujeg0b6o9bw/AAAbKG9KfZ7e0eaNONz2zE93a?dl=0

Our three benchmark KGs are described in Table 1. All three are common KGs and have been used in rule-learning experiments previously [10, 14]. All experiments were conducted on an Intel Xeon CPU E5-2650 v4 @2.20GHz with 66GB RAM, running CentOS 8.

Table 1
Benchmark specifications

KG               #Facts      #Entities  #Predicates
YAGO2s           4,122,438   2,260,672  37
Wikidata         8,397,936   3,085,248  430
Wikidata +UP     8,780,716   3,085,248  587
DBpedia 3.8      11,024,066  3,102,999  650
DBpedia 3.8 +UP  11,498,575  3,102,999  1,005

5.0.1. Transforming KGs with type predicates for experiments

In real-world KGs, concept or class membership is modelled with binary facts whose second arguments are types or classes. Instead, we choose to model types and classes with unary predicates. To do so, from facts of the form <type>(x, <Album>), where x is the name of an album, we make a new predicate <type>_<Album>(x). Then we produce new unary facts based on the new predicate and the related facts; for example, from the YAGO2s fact <type>(Revolver, <Album>) we produce the new fact <type>_<Album>(Revolver). We manually choose two predicates from DBpedia 3.8 and one from Wikidata to generate these unary predicates and facts; each of these predicates has a class as its second argument. To prune classes with few instances, we keep only the new unary predicates that have at least 100 facts. We do not remove the original predicates and facts from the KG, but extend the KG with the new ones. In Table 1 we report the specifications of the two benchmarks to which we have added the unary predicates and facts (denoted +UP).

5.1. Learning IOP Rules

We follow the established approach for evaluating KG rule-learning methods, that is, measuring the quantity and quality of the distinct rules learnt. Rule quality is measured by inverse open path standard confidence (IOPSC) and inverse open path head coverage (IOPHC). We randomly selected 50 target predicates each from the Wikidata and DBpedia unary predicates (157 and 355 predicates respectively). We used all 37 binary predicates of YAGO2s as target predicates. Each binary target predicate serves as two target predicates, once in its straight form and once in its reverse form. In the straight form the object argument of the predicate is the joining variable connecting the body and head, while in the reverse form the subject argument serves to join. In this manner, we ensure that the results of SHACLEARNER on YAGO2s, with its binary predicates as targets, are comparable with the results for Wikidata and DBpedia, which have unary predicates as targets. Hence for YAGO2s we have 74 target predicates. A 10-hour limit was set for learning each target predicate.

Table 2 shows #Rules, the average number of quality IOP rules found; %SucTarget, the proportion of target predicates for which at least one IOP rule was found; and the running times in hours, averaged over the targets. Only high-quality rules meeting minimum quality thresholds are counted, that is, rules with IOPSC > 0.1 and IOPHC > 0.01, thresholds established in comparative work [10].

Table 2
Performance of SHACLEARNER on benchmark KGs

Benchmark    %SucTarget  #Rules  Time (hours)
YAGO2s       80          42      0.6
Wikidata+UP  82          67      1.8
DBpedia+UP   98          157     2.4
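The type-to-unary transformation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the triple representation, the <type>/<Album> names and the helper name are ours, not the paper's implementation.

```python
from collections import Counter

def add_unary_predicates(facts, type_predicates, min_facts=100):
    """facts: (predicate, subject, object) triples. Extends the KG with
    unary facts named predicate_class (object slot None) for every class
    having at least min_facts instances; original facts are kept."""
    counts = Counter((p, o) for p, s, o in facts if p in type_predicates)
    unary = [(f"{p}_{o}", s, None)
             for p, s, o in facts
             if p in type_predicates and counts[(p, o)] >= min_facts]
    return facts + unary

kg = [("<type>", "Revolver", "<Album>"),
      ("<type>", "Help!", "<Album>"),
      ("<artist>", "Revolver", "The_Beatles")]
extended = add_unary_predicates(kg, {"<type>"}, min_facts=2)
print(extended[-1])  # ('<type>_<Album>', 'Help!', None)
```

With the paper's threshold of min_facts=100, any class with fewer than 100 instances would produce no unary predicate at all.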

SHACLEARNER shows satisfactory performance in terms of both the runtime and the number of quality rules mined. Note that the rules found have a variety of lengths and cardinalities. To better present the quality of the rules, we illustrate the distribution of rules with respect to the features IOPSC, IOPHC, cardinality and length, and also IOPSC vs length. In the following, the proportion of mined rules having the various feature values is presented, to more evenly demonstrate the quality of performance over the three very different KGs.

The distribution of mined rules with respect to their IOPSC and IOPHC is shown in Fig. 4. In the left-hand chart we observe a consistent decrease in the proportion of quality rules as the IOPSC increases. In the right-hand chart we see a similar pattern for increasing IOPHC, but the decrease is not as consistent, because there must be frequent relevant rules with many covered head instances.

The distribution of mined rules with respect to their cardinalities is shown in Fig. 5. We observe that the largest proportion of rules has cardinality 1, as expected, since these rules have the least stringent requirements to be met in the KG. We observe the expected decrease at greater cardinalities, as they demand tighter restrictions to be satisfied.

The distribution of mined rules with respect to their lengths is shown in Fig. 6. One might expect that as the length increases, the number of rules would increase, since the space of possible rules grows, and this is what we see.

Fig. 4. Distribution of mined rules with respect to IOPSC and IOPHC.
Fig. 5. Distribution of mined rules with respect to their cardinalities.
Fig. 6. Distribution of mined rules with respect to their lengths.

For a concrete example of SHACL learning, we show the following three IOP rules mined from DBpedia in the experiments.

0.64, 0.01, 1 : <dbo:type>_<db:Song>(x) → <dbo:album>(x, z1) ∧ <dbo:recordLabel>(z1, y).
0.63, 0.01, 1 : <dbo:type>_<db:Song>(x) → <dbo:producer>(x, y).
0.41, 0.01, 2 : <dbo:type>_<db:Song>(x) → <dbo:producer>(x, y).

The numbers prefixing each IOP rule are the corresponding IOPSC, IOPHC and Cardinality annotations respectively. <dbo:type>_<db:Song>(x) expresses that x is a song. The first rule indicates that x should belong to an album (z1) that has y as its record label. The second rule requires a song (x) to have at least one producer, while the third rule requires a song to have at least two producers; these two rules are distinguished by the cardinality annotation. As we discussed in Section 3.1, the third rule is more constraining than the second, so the confidence of the third rule is lower than that of the second, based on the KG data.

The rules can be trivially rewritten (through a script that we developed) into SHACL syntax as follows, where the head of the first rule becomes a SHACL sequence path.

:s1
    a sh:NodeShape ;
    sh:targetClass class:<dbo:type>_<db:Song> ;
    sh:property [
        sh:minCount 1 ;
        sh:path ( <dbo:album> <dbo:recordLabel> ) ;
    ] .

:s2
    a sh:NodeShape ;
    sh:targetClass class:<dbo:type>_<db:Song> ;
    sh:property [
        sh:minCount 1 ;
        sh:path <dbo:producer> ;
    ] .

:s3
    a sh:NodeShape ;
    sh:targetClass class:<dbo:type>_<db:Song> ;
    sh:property [
        sh:minCount 2 ;
        sh:path <dbo:producer> ;
    ] .

5.1.1. IOPSC vs IOPHC

Using rules found in the experiments, we further illustrate the practical meaning of the IOPSC and IOPHC qualities. While IOPSC determines the confidence of a rule by counting the proportion of target-predicate instances for which the rule holds true in the KG, IOPHC indicates the proportion of rule-consequent instances that are justified by target-predicate instances in the KG, thereby indicating the relevance of the rule to the target. In Wikidata, all unary predicates are occupations, such as singer or entrepreneur, so all the entities that have these types turn out to be persons, even though there is no explicit person type in our KG. Thus, the occupations all have very similar IOP rules, with high IOPSC and low IOPHC, like the following one.

0.47, 0.01, 1 : <occupation>_<singer>(x) → <country_of_citizenship>(x, y).

On the other hand, for these unary occupation predicates there are also some IOP rules with high IOPHC that apply only to one specific unary predicate, like the following one.

0.15, 0.16, 5 : <occupation>_<singer>(x) → <performer_musical_artist>(x, y).

5.2. Learning Trees from IOP Rules

Now we turn to the results for the trees built from the IOP rules discovered in the experiments. We report the characteristics of the discovered trees in Table 3. We use a value of 0.1 for the TreeSC_MIN parameter, and show the average TreeSC for each KG, along with the average number of branches in the trees and the average runtime to build each tree. The number of trees for each KG is given by the number of target predicates for which we have at least one IOP rule (see %SucTarget in Table 2).

Table 3
Tree-learning performance of SHACLEARNER on benchmark KGs

Benchmark    TreeSC  #branches  Time (hours)
YAGO2s       0.13    21         0.06
Wikidata+UP  0.19    21         0.1
DBpedia+UP   0.2     88         0.14

The results show that the running time for aggregating IOP rules into trees is lower than the initial IOP mining time by a factor greater than 10. If, on the other hand, we wanted to discover such complex shapes from scratch, it would be prohibitively time-consuming, owing to the sensitivity of rule learners to the maximum length of rules: the number of potential rules in the search space grows exponentially with the maximum number of predicates in a rule. The average numbers of branches in the mined trees are 50%, 31% and 56% of the corresponding numbers of mined rules in Table 2. Hence, by imposing the additional tree-shaped constraint over the basic IOP-shaped constraint, at least 44% of IOP rules are pruned.

For an example of tree-shape learning, in the following we show a fragment of a 39-branched tree mined

from DBpedia by aggregating IOP rules in the experiments.

0.13 : <dbo:type>_<db:Song>(x) →
    1 : <dbo:album>(x, z1) ∧ <dbo:recordLabel>(z1, y1) ∧
    2 : <dbo:album>(x, z2) ∧ <dbo:producer>(z2, y2) ∧
    1 : <dbo:album>(x, z3) ∧ <dbo:genre>(z3, y3) ∧
    3 : <dbo:artist>(x, z4) ∧ <dbo:musicalArtist>(z4, y4).

Here, the first annotation value (0.13) is the SC of the tree, and the subsequent values at the beginning of each branch indicate the branch cardinality. This tree can be read as saying that a song has an album with a record label, an album with two producers, an album with a genre, and an artist who is a musical artist.

As can be seen here, there remains an opportunity for further simplification and explanatory value by unifying additional variables occurring in predicates shared over multiple branches. We plan to investigate this potential post-processing step in future work.

6. Conclusion

In this paper we propose a method to learn SHACL shapes from KGs as a way to describe KG patterns, to validate KGs, and also to support new data entry. For entities that satisfy target predicates, our shapes describe conjunctive paths of constraints over properties, enhanced with minimum cardinality constraints.

We reduce the SHACL learning problem to learning a novel kind of rule, the Inverse Open Path (IOP) rule. We introduce the rule quality measures IOPSC, IOPHC and Car, which augment the rules. IOPSC effectively extends SHACL shapes by representing the quantified uncertainty of a candidate shape, to be used when selecting shapes for interestingness or for KG verification. We also propose a method to aggregate learnt IOP rules in order to discover more complex shapes, trees.

The shapes support efficient and interpretable human validation in a depth-first manner, and are employed in the editor Schímatos [21] for manual knowledge-graph completion. The shapes can be used to complete information, triggered by entities with only a type or class declaration, by automatically generating dynamic data-entry forms. In this manual mode, they can also be used more traditionally to complete missing facts for a target predicate, as well as other predicates related to the target, while enabling the acquisition of facts about entities that are entirely missing from the KG.

To learn such rules we adapt an embedding-based open path rule learner (OPRL) by introducing the following novel components: (1) we propose IOP rules, which allow us to mine rules with free variables, with one predicate forming the body and a chain of predicates as the head, while keeping the complexity of the learning phase manageable; (2) we introduce tree shapes, built from the IOP rules, for more expressive patterns; and (3) we propose an efficient method to evaluate IOP rules and trees by exactly computing the quality measures of each rule using fast matrix and vector operations.

Our experiments show that SHACLEARNER can mine IOP rules of various lengths, cardinalities and qualities from three massive real-world benchmark KGs: YAGO2s, Wikidata and DBpedia.

In future work we will validate the shapes we learn with SHACLEARNER via formal human-expert evaluation, and we will further extend the expressivity of the shapes we can discover. Another direction for future work is to extend the SHACLEARNER algorithm with the MapReduce model, to handle extremely massive KGs with tens of billions of facts.

Acknowledgments

The authors acknowledge the support of the Australian Government Department of Finance and the Australian National University for this work.

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak and Z.G. Ives, DBpedia: A Nucleus for a Web of Open Data, in: ISWC, Lecture Notes in Computer Science, Vol. 4825, Springer, 2007, pp. 722–735.
[2] F.M. Suchanek, G. Kasneci and G. Weikum, Yago: a core of semantic knowledge, in: WWW, ACM, 2007, pp. 697–706.

[3] N. Noy, Y. Gao, A. Jain, A. Narayanan, A. Patterson and J. Taylor, Industry-Scale Knowledge Graphs: Lessons and Challenges, Commun. ACM 62(8) (2019), 36–43.
[4] E.S. Callahan and S.C. Herring, Cultural bias in Wikipedia content on famous persons, Journal of the American Society for Information Science and Technology 62(10) (2011), 1899–1915.
[5] D. Vrandecic and M. Krötzsch, Wikidata: a free collaborative knowledgebase, Commun. ACM 57(10) (2014), 78–85.
[6] H. Knublauch and D. Kontokostas, Shapes Constraint Language (SHACL), 2017. https://www.w3.org/TR/shacl/.
[7] Z. Huo, K. Taylor, X. Zhang, S. Wang and C. Pang, Generating multidimensional schemata from relational aggregation queries, 23 (2020).
[8] P. Ghiasnezhad Omran, K. Taylor, S. Rodríguez Méndez and A. Haller, Towards SHACL Learning from Knowledge Graphs, in: ISWC 2020 Posters & Demonstrations, CEUR Proceedings, 2020, to appear.
[9] H. Knublauch, D. Allemang and S. Steyskal, SHACL Advanced Features, 2017. https://www.w3.org/TR/shacl-af/.
[10] L. Galárraga, C. Teflioudi, K. Hose and F.M. Suchanek, Fast rule mining in ontological knowledge bases with AMIE+, The International Journal on Very Large Data Bases (2015), 707–730.
[11] L. Bellomarini, E. Sallinger and G. Gottlob, The Vadalog system: Datalog-based reasoning for knowledge graphs, in: VLDB, Vol. 11, 2018, pp. 975–987.
[12] W. Fan, C. Hu, X. Liu and P. Lu, Discovering graph functional dependencies, in: SIGMOD, ACM, 2018, pp. 427–439.
[13] P. Ghiasnezhad Omran, K. Taylor, S. Rodríguez Méndez and A. Haller, Active Knowledge Graph Completion, Technical Report, ANU Research Publication, 2020.
[14] P.G. Omran, K. Wang and Z. Wang, Scalable Rule Learning via Learning Representation, in: IJCAI, 2018, pp. 2149–2155.
[15] Y. Chen, D.Z. Wang and S. Goldberg, ScaLeKB: scalable learning and inference over large knowledge bases, The International Journal on Very Large Data Bases (2016), 893–918.
[16] R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in: VLDB, Vol. 1215, 1994, pp. 487–499.
[17] K. Taylor, Generalization by absorption of definite clauses, The Journal of Logic Programming 40(2–3) (1999), 127–157.
[18] S. Staworko, I. Boneva, J.E. Labra Gayo, S. Hym, E.G. Prud'hommeaux and H. Solbrig, Complexity and Expressiveness of ShEx for RDF, in: ICDT, 2015, pp. 195–211.
[19] M. Nickel, L. Rosasco and T. Poggio, Holographic Embeddings of Knowledge Graphs, in: AAAI, 2016, pp. 1955–1961.
[20] M. Nickel, V. Tresp and H.-P. Kriegel, A three-way model for collective learning on multi-relational data, in: ICML, 2011, pp. 809–816.
[21] J. Wright, S. Rodríguez Méndez, A. Haller, K. Taylor and P. Omran, Schímatos: a SHACL-based Web-Form Generator for Knowledge Graph Editing, in: ISWC, 2020.
[22] N. Mihindukulasooriya, M.R.A. Rashid, G. Rizzo, R. García-Castro, O. Corcho and M. Torchiano, RDF Shape Induction using Knowledge Base Profiling, in: Annual ACM Symposium on Applied Computing (SAC), 2018.
[23] D. Fernández-Álvarez, H. García-González, J. Frey, S. Hellmann and J.E.L. Gayo, Inference of latent shape expressions associated to DBpedia ontology, in: ISWC Posters, Vol. 2180, 2018.
[24] B. Spahiu, A. Maurino and M. Palmonari, Towards improving the quality of knowledge graphs with data-driven ontology patterns and SHACL, in: Workshop on Ontology Design and Patterns, Vol. 2195, 2018, pp. 52–66.
[25] I. Boneva, J. Dusart, D.F. Álvarez and J.E. Labra Gayo, Shape designer for ShEx and SHACL constraints, in: ISWC Posters, Vol. 2456, 2019, pp. 269–272.
[26] A. Cimmino, A. Fernández-Izquierdo and R. García-Castro, Astrea: Automatic Generation of SHACL Shapes from Ontologies, in: ESWC, Vol. 12123 LNCS, 2020, pp. 497–513.
[27] H.J. Pandit, D. O'Sullivan and D. Lewis, Using ontology design patterns to define SHACL shapes, in: CEUR Workshop Proceedings, Vol. 2195, 2018, pp. 67–71.
[28] J.-E. Labra-Gayo, E. Prud'hommeaux, H. Solbrig and I. Boneva, Validating and describing linked data portals using RDF shapes, CoRR (2017).
[29] I. Boneva, J.E. Labra Gayo and E.G. Prud'hommeaux, Semantics and validation of shapes schemas for RDF, in: ISWC, Vol. 10587 LNCS, 2017, pp. 104–120.
[30] P. Pareti, G. Konstantinidis, T.J. Norman and M. Şensoy, SHACL Constraints with Inference Rules, in: ISWC, 2019.
[31] J. Corman, J.L. Reutter and O. Savković, Semantics and validation of recursive SHACL, in: ISWC, Vol. 11136 LNCS, 2018, pp. 318–336.
[32] Anonymised, Active Knowledge Graph Completion, in: IJCAI (submitted/under review), 2020.
[33] Authors suppressed, Schímatos: A SHACL-based Web-Form Generator for Knowledge Graph Editing, 2020, submitted to ISWC 2020.