<<

Completing Knowledge Bases using Formal Analysis∗

Franz Baader,1 Bernhard Ganter,1 Barıs¸ Sertkaya,1 and Ulrike Sattler2 1TU Dresden, Germany and 2The University of Manchester, UK

Abstract concerned with a different quality dimension: completeness. We provide a basis for formally well-founded techniques and We propose an approach for extending both the ter- tools that support the ontology engineer in checking whether minological and the assertional part of a Descrip- an ontology contains all the relevant information about the tion Logic knowledge base by using information application domain, and to extend the ontology appropriately provided by the knowledge base and by a domain if this is not the case. expert. The use of techniques from Formal Con- A DL knowledge base (nowadays often called ontology) cept Analysis ensures that, on the one hand, the usually consists of two parts, the terminological part (TBox), interaction with the expert is kept to a minimum, which defines and also states additional constraints and, on the other hand, we can show that the ex- (so-called general concept inclusions, GCIs) on the interpre- tended knowledge base is complete in a certain, tation of these concepts, and the assertional part (ABox), well-defined sense. which describes individuals and their relationship to each other and to concepts. Given an application domain and a DL 1 Introduction knowledge base (KB) describing it, we can ask whether the KB contains all the relevant information1 about the domain: Description Logics (DLs) [Baader et al., 2003] are a suc- • cessful family of logic-based knowledge representation for- Are all the relevant constraints that hold between con- malisms, which can be used to represent the conceptual cepts in the domain captured by the TBox? knowledge of an application domain in a structured and for- • Are all the relevant individuals existing in the domain mally well-understood way. They are employed in vari- represented in the ABox? ous application domains, such as natural language process- ing, configuration, databases, and bio-medical ontologies, but As an example, consider the OWL ontology for human pro- their most notable success so far is due to the fact that DLs tein phosphatases that has been described and used in [Wol- provide the logical underpinning of OWL, the standard ontol- stencroft et al., 2005]. This ontology was developed based ogy language for the [Horrocks et al., 2003]. on information from peer-reviewed publications. The human As a consequence of this standardization, several ontology protein phosphatase family has been well characterised exper- editors support OWL [Knublauch et al., 2004; Oberle et al., imentally, and detailed knowledge about different classes of 2004; Kalyanpur et al., 2006a], and ontologies written in such proteins is available. This knowledge is represented in OWL are employed in more and more applications. As the the terminological part of the ontology. Moreover, a large set size of these ontologies grows, tools that support improving of human phosphatases has been identified and documented their quality become more important. The tools available un- by expert biologists. These are described as individuals in the til now use DL reasoning to detect inconsistencies and to infer assertional part of the ontology. One can now ask whether the consequences, i.e., implicit knowledge that can be deduced information about protein phosphatases contained in this on- from the explicitly represented knowledge. There are also tology is complete. Are all the relationships that hold among promising approaches that allow to pinpoint the reasons for the introduced classes of phosphatases captured by the con- inconsistencies and for certain consequences, and that help straints in the TBox, or are there relationships that hold in the the ontology engineer to resolve inconsistencies and to re- domain, but do not follow from the TBox? Are all possible move unwanted consequences [Schlobach and Cornet, 2003; kinds of human protein phosphatases represented by individ- Kalyanpur et al., 2006b]. These approaches address the qual- uals in the ABox, or are there phosphatases that have not yet ity dimension of soundness of an ontology, both within it- been included in the ontology or even not yet been identified? self (consistency) and w.r.t. the intended application domain Such questions cannot be answered by an automated tool (no unwanted consequences). In the present paper, we are alone. Clearly, to check whether a given relationship between

∗Supported by DFG (GRK 334/3) and the EU (IST-2005-7603 1The notion of “relevant information” must, of course, be formal- FET project TONES and NoE 507505 Semantic Mining). ized appropriately for this problem to be addressed algorithmically.

IJCAI-07 230 concepts—which does not follow from the TBox—holds in ing the exploration process. the domain, one needs to ask a domain expert, and the same In the next section, we introduce our variant of FCA that is true for questions regarding the existence of individuals not can deal with partial contexts, and describe an attribute ex- described in the ABox. The rˆole of the automated tool is to ploration procedure that works with partial contexts. In Sec- ensure that the expert is asked as few questions as possible; tion 3, we show how a DL knowledge base gives rise to a in particular, she should not be asked trivial questions, i.e., partial context, define the notion of a completion of a knowl- questions that could actually be answered based on the rep- edge base w.r.t. a fixed model, and show that the attribute resented knowledge. In the above example, answering a non- exploration algorithm developed in the previous section can trivial question regarding human protein phosphatases may be used to complete a knowledge base. In Section 4 we de- require the biologist to study the relevant literature, query scribe ongoing and future work. This paper is accompanied existing protein databases, or even to carry out new exper- by a technical report [Baader et al., 2006] containing more iments. Thus, the expert may be prompted to acquire new details and in particular full proofs of our results. biological knowledge. Attribute exploration is an approach developed in Formal 2 Exploring Partial Contexts [ ] Concept Analysis (FCA) Ganter and Wille, 1999 that can In this section, we extend the classical approach to Formal be used to acquire knowledge about an application domain Concept Analysis (FCA), as described in detail in [Ganter by querying an expert. One of the earliest applications of this and Wille, 1999], to the case of individuals (called objects [ ] approach is described in Wille, 1982 , where the domain is in FCA) that have only a partial description in the sense that, theory, and the goal of the exploration process is to for some properties (called attributes in FCA), it is not known find, on the one hand, all valid relationships between prop- whether they are satisfied by the individual or not. The con- erties of lattices (like being distributive), and, on the other nection between this approach and the classical approach is hand, to find counterexamples to all the relationships that do explained in [Baader et al., 2006]. In the following, we as- not hold. To answer a query whether a certain relationship sume that we have a finite set M of attributes and a (possibly holds, the lattice theory expert must either confirm the rela- infinite) set of objects. tionship (by using results from the literature or carrying out a new proof for this fact), or give a counterexample (again, by Definition 2.1 A partial object description (pod) is a tuple either finding one in the literature or constructing a new one). (A, S) where A, S ⊆ M are such that A ∩ S = ∅.Wecall A ∪ S = M Although this sounds very similar to what is needed in our such a pod a full object description (fod) if .A context, we cannot directly use this approach. The main rea- set of pods is called a partial context and a set of fods a full son is the open-world of description logic knowl- context. edge bases. Consider an individual i from an ABox A and Intuitively, the pod (A, S) says that the object it describes a concept C occurring in a TBox T . If we cannot deduce satisfies all attributes from A and does not satisfy any attribute from the TBox T and A that i is an instance of C,thenwe from S. For the attributes not contained in A ∪ S, nothing is do not assume that i does not belong to C. Instead, we only known w.r.t. this object. A partial context can be extended by accept this as a consequence if T and A imply that i is an either adding new pods or by extending existing pods. ¬C instance of . Thus, our knowledge about the relationships (A,S) between individuals and concepts is incomplete: if T and A Definition 2.2 We say that the pod extends the pod (A, S), and write this as (A, S) ≤ (A,S),ifA ⊆ A and imply neither C(i) nor ¬C(i), then we do not know the re- i C S ⊆ S . Similarly, we say that the partial context K extends lationship between and . In contrast, classical FCA and K K≤K attribute exploration assume that the knowledge about indi- the partial context , and write this as , if every pod in K is extended by some pod in K.IfK is a full context and viduals is complete: the basic datastructure is that of a formal K≤K K K (A, S) context, i.e., a crosstable between individuals and properties. ,then is called a realizer of .If is a fod (A, S) ≤ (A, S) (A, S) A cross says that the holds, and the absence of a and , then we also say that realizes (A, S) cross is interpreted as saying that the property does not hold. . There has been some work on how to extend FCA and at- Next, we introduce the notion of an implication between tribute exploration from complete knowledge to the case of attributes which formalizes the informal notion “relationship partial knowledge [Obiedkov, 2002; Burmeister and Holzer, between properties” used in the introduction. 2005]. However, this work is based on assumptions that are Definition 2.3 An implication is of the form L → R where different from ours. In particular, it assumes that the expert L, R ⊆ M. This implication is refuted by the pod (A, S) if cannot answer all queries and, as a consequence, the knowl- L ⊆ A and R ∩ S = ∅.Itisrefuted by the partial context edge obtained after the exploration process may still be in- K if it is refuted by at least one element of K.Thesetof complete and the relationships between concepts that are pro- implications that are not refuted by a given partial context K duced in the end fall into two categories: relationships that are is denoted by Imp(K). The set of all fods that do not refute a valid no matter how the incomplete part of the knowledge is given set of implications L is denoted by Mod(L). completed, and relationships that are valid only in some com- L P ⊆ M pletions of the incomplete part of the knowledge. In contrast, Forasetofimplications and a set ,theimplica- P L L(P ) our intention is to complete the KB, i.e., in the end we want tional closure of with respect to , denoted by ,isthe Q M to have complete knowledge about these relationships. What smallest of such that may be incomplete is the description of individuals used dur- • P ⊆ Q,and

IJCAI-07 231 • L → R ∈Land L ⊆ Q imply R ⊆ Q. How can we find—and let the expert decide—all unde- AsetP ⊆ M is called L-closed if L(P )=P . cided implications without considering all implications? The following proposition motivates why it is sufficient to con- Definition 2.4 The implication L → R is said to follow from sider implications whose left-hand sides are L-closed. asetJ of implications if R ⊆J(L). The set of implica- tions J is called complete for a set of implications L if every Proposition 2.8 Let L be a set of implications and L → R implication in L follows from J . It is called sound for L if an implication. Then, L → R follows from L iff L(L) → R every implication that follows from J is contained in L.Aset follows from L. J L of implications is called a base for a set of implications Concerning right-hand sides, Proposition 2.5 says that the L if it is both sound and complete for , and no strict subset of largest right-hand side R such that L → R is not refuted J satisfies this property. by K is R = K(L). Putting these two observations together, The following is a trivial fact regarding the connection be- we only need to consider implications of the form L →K(L) tween partial contexts and the implications they do not refute, where L is L-closed. In order to enumerate all left-hand sides, but it will turn out to be crucial for our attribute exploration we can thus use the well-known approach from FCA for enu- algorithm. merating closed sets in the lectic order [Ganter and Wille, Proposition 2.5 For a given set P ⊆ M and a partial con- 1999]. K K(P ):=M \ {S | (A, S) ∈K,P ⊆ A} text , is the Definition 2.9 Assume that M = {m1,...,mn} and fix largest subset of M such that P →K(P ) is not refuted by K. some linear order m1

IJCAI-07 232 Algorithm 1 Attribute exploration for partial contexts Name of constructor Syntax Semantics I I 1: Input: M = {m1,...,mn}, K0 {attribute set and negation ¬C Δ \ C I I partial context, realized by full context K} conjunction C D C ∩ D I I 2: K := K0 {initialize partial context} general concept inclusion C D C ⊆ D L := ∅ { } 3: initial empty set of implications concept assertion C(a) aI ∈ CI 4: P := ∅ {lectically smallest L-closed set} I I I role assertion r(a, b) (a ,b ) ∈ r 5: while P = M do 6: Compute K(P ) Table 1: Conjunction, negation, GCIs, and ABox assertions. 7: if P = K(P ) then {P →K(P ) is undecided} 8: Ask the expert if P →K(P ) is refuted by K 9: if no then {P →K(P ) not refuted} ABox. The semantics of concept descriptions, TBoxes, and I I 10: K := K where K is a partial context such that ABoxes is given in terms of an interpretation I =(Δ , · ), I I K≤K ≤ K{optionally extend K} where Δ (the domain) is a non-empty set, and · (the in- 11: L := L∪{P →K(P ) \ P } terpretation function) maps each concept name A ∈ NC to a AI ⊆ ΔI r ∈ N 12: Pnew := L((P ∩{m1,...,mj−1}) ∪{mj}) set , each role name R to a I I I for the max. j that satisfies r ⊆ Δ × Δ , and each individual name a ∈ NI to an el- I I P

IJCAI-07 233 and the whole interpretation induces the full context way. If the answer is “yes,” then the expert is asked to extend the current ABox (by adding appropriate assertions on either K (M):={fod (d, M) | d ∈ ΔI }. I I old or new individual names) such that the extended ABox L → R I Proposition 3.1 Let (T , A), (T , A) be DL KBs such that refutes and is still a model of this ABox. Because T⊆T and A⊆A, M a set of concept descriptions, and of Proposition 3.3, before actually asking the expert whether the implication L → R is refuted by I, we can first check I a model of (T , A ).ThenKT ,A(M) ≤KT ,A (M) ≤ whether L R already follows from the current TBox. If KI (M). this is the case, then we know that L → R cannot be refuted Next, we straightforwardly transfer the notion of refutation of by I. This completion algorithm for DL knowledge bases is an implication from partial (full) contexts to knowledge bases described in more detail in Algorithm 2. (interpretations). Definition 3.2 The implication L → R over the attributes Algorithm 2 Completion of DL knowledge bases M (T , A) is refuted by the knowledge base if it is refuted by 1: Input: M = {m1,...,mn}, (T0, A0) {attribute set and KT ,A(M), and it is refuted by the interpretation I if it is re- knowledge base, with model I} K (M) I futed by I . If an implication is not refuted by ,then 2: T := T0, A := A0 I L → R we say that it holds in . In addition, we say that fol- 3: L := ∅ {initial empty set of implications} T L R L R lows from if T ,where and respectively 4: P := ∅ {lectically smallest L-closed subset of M} C D stand for the conjunctions C∈L and D∈R . 5: while P = M do K (P ) Obviously, the implication L → R holds in I iff ( L)I ⊆ 6: Compute T ,A P = K (P ) { ( R)I . As an immediate consequence of this fact, we obtain: 7: if T ,A then check whether the implication follows from T} T I T Proposition 3.3 Let be a TBox and be a model of .If 8: if P T KT ,A(P ) then L → R T I follows from , then it holds in . 9: L := L∪{P →KT ,A(P ) \ P } P := L((P ∩{m ,...,m }) ∪{m }) Completion of DL KBs: formal definition and algorithm 10: new 1 j−1 j for the max. j that satisfies We are now ready to define what we mean by a completion of P< L((P ∩{m ,...,m }) ∪{m }) a DL knowledge base. Intuitively, the knowledge base is sup- j 1 j−1 j M 11: else posed to describe an intended model. For a fixed set of “in- P →K (P ) I teresting” concepts, the knowledge base is complete if it con- 12: Ask expert if T ,A is refuted by . 13: if no then { P KT ,A(P ) is satisfied in I} tains all the relevant knowledge about implications between L := L∪{P →K (P ) \ P } these concepts. To be more precise, if an implication holds 14: T ,A 15: Pnew := L((P ∩{m1,...,mj−1}) ∪{mj}) in the intended interpretation, then it should follow from the j TBox, and if it does not hold in the intended interpretation, for the max. that satisfies P

IJCAI-07 234 Algorithm 2. Then (T , A) is a completion of (T0, A0). scription Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003. 4Conclusion [Baader et al., 2006] F. Baader, B. Ganter, U. Sattler, and We have extended the attribute exploration algorithm from B. Sertkaya. Completing description logic knowledge FCA to partial contexts, and have shown how the extended bases using formal concept analysis. LTCS Report 06-02, algorithm can be used to complete DL knowledge bases, us- Theoretical Computer Science, TU Dresden, 2006. Avail- ing both DL reasoning and answers given by a domain ex- able at http://lat.inf.tu-dresden.de/research/reports.html pert. This algorithm inherits its complexity from “classical” [Burmeister and Holzer, 2005] P. Burmeister and R. Holzer. attribute exploration [Ganter and Wille, 1999]:intheworst Treating incomplete knowledge in formal concept analy- case, which occurs if there are few or many relationships be- sis. In Formal Concept Analysis, volume 3626 of LNCS. tween attributes, it is exponential in the number of attributes. Springer, 2005. Regarding the number of questions asked to the expert, it eas- [Ganter and Wille, 1999] B. Ganter and R. Wille. Formal ily follows from results shown in the technical report that our Concept Analysis: Mathematical Foundations. Springer, method asks the minimum number of questions with posi- 1999. tive answers. For the questions with negative answers, the behaviour depends on the answers given by the expert: FCA- [Horrocks et al., 2003] I. Horrocks, P. F. Patel-Schneider, theory implies that there always exist counterexamples that, and F. van Harmelen. From SHIQ and RDF to OWL: The if taken in each step, ensure a minimal number of questions making of a web ontology language. J. of Web Semantics, with negative answers. In general, however, one cannot as- 1(1), 2003. sume that the expert provides these “best” counterexamples. [Kalyanpur et al., 2006a] A. Kalyanpur, B. Parsia, E. Sirin, Based on the results presented in the previous two sec- B. C. Grau, and J. Hendler. Swoop: A Web ontology edit- tions, we have implemented a first experimential version of ing browser. J. of Web Semantics, 4(2), 2006. a tool for completing DL knowledge bases as an extension of [ ] the ontology editor Swoop [Kalyanpur et al., 2006a],which Kalyanpur et al., 2006b A. Kalyanpur, B. Parsia, E. Sirin, uses the system Pellet as underlying reasoner [Sirin and Par- and B. C. Grau. Repairing unsatisfiable concepts in OWL sia, 2004]. We have just started to evaluate our tool on the ontologies. In Proc. of ESWC’06, volume 4011 of LNCS. OWL ontology for human protein phosphatases mentioned in Springer, 2006. the introduction, with biologists as experts, and hope to get [Knublauch et al., 2004] H. Knublauch, R. W. Fergerson, first significant results on its usefulness and performance in N. F. Noy, and M. A. Musen. The Prot´eg´e OWL plugin: the near future. Unsurprisingly, we have observed that the An open development environment for semantic web ap- experts sometimes make errors when answering implication plications. In Proc. of ISWC 2004, volume 3298 of LNCS. questions. Hence we will extend the completion tool such that Springer, 2004. it supports detecting such errors and also allows to correct er- [Oberle et al., 2004] D. Oberle, R. Volz, B. Motik, and rors without having to restart the exploration from scratch. S. Staab. An extensible ontology software environment. From a theoretical point of view, we will also look at exten- Handbook on Ontologies, International Handbooks on In- sions of our definition of a complete KB. As a formalization formation Systems. Springer, 2004. of what “all relationships between interesting concepts” re- ally means, we have used subsumption relationships between [Obiedkov, 2002] S. A. Obiedkov. Modal logic for evalu- conjunctions of elements of the set of interesting concepts ating formulas in incomplete contexts. In Proc. of ICCS M. One could also consider more complex relationships by 2002, volume 2393 of LNCS. Springer, 2002. fixing a specific DL D, and then taking, as attributes, all D- [Rudolph, 2004] S. Rudolph. Exploring relational structures concept descriptions over the concept “names” from M.The via FLE.InProc. of ICCS’04, volume 3127 of LNCS. immediate disadvantage of this extension is that, in general, Springer, 2004. the set of attributes becomes infinite, and thus termination of [Schlobach and Cornet, 2003] S. Schlobach and R. Cornet. the exploration process is no longer guaranteed. An exten- Non-standard reasoning services for the debugging of de- sion of classical attribute exploration (i.e., for full contexts) scription logic terminologies. In Proc. of IJCAI 2003. [ ] in this direction is described in Rudolph, 2004 .Themain Morgan Kaufmann, 2003. idea to deal with the problem of an infinite attribute set used there is to restrict the attention to concept descriptions with a [Sirin and Parsia, 2004] E. Sirin and B. Parsia. Pellet: An bounded role depth. But even though this makes the attribute OWL DL reasoner. In Proc. of DL’04, 2004. set finite, its size is usually too large for practical purposes. [Wille, 1982] R. Wille. Restructuring lattice theory: An ap- Thus, an adaptation of Rudolph’s method to our purposes not proach based on hierarchies of concepts. In Ordered Sets. only requires extending this method to partial contexts, but Reidel, Dordrecht-Boston, 1982. also some new approach to reduce the number of attributes. [Wolstencroft et al., 2005] K. Wolstencroft, A. Brass, I. Hor- rocks, P. W. Lord, U. Sattler, D. Turi, and R. Stevens. A References little semantic web goes a long way in . In Proc. of [Baader et al., 2003] F. Baader, D. Calvanese, D. McGuin- ISWC’05, volume 3729 of LNCS. Springer, 2005. ness, D. Nardi, and P. F. Patel-Schneider, editors. The De-

IJCAI-07 235