
IMPACT ANALYSIS IN DESCRIPTION LOGIC ONTOLOGIES

A thesis submitted to The University of Manchester for the degree of Doctor of Philosophy in the Faculty of Engineering and Physical Sciences

2014

By João Rafael Landeiro de Sousa Gonçalves
School of Computer Science

Contents

Abstract
Declaration
Copyright
Acknowledgements

1 Introduction
  1.1 Desiderata for Impact Analysis
  1.2 State of the Art
  1.3 Goals and Contributions
  1.4 Published Work

2 Preliminaries
  2.1 Description Logics
    2.1.1 Syntax and Semantics
    2.1.2 Standard Reasoning Services
    2.1.3 Non-Standard Reasoning Services
    2.1.4 Structural Notions
  2.2 OWL: The Web Ontology Language
  2.3 Experimental Setup
    2.3.1 Infrastructure
    2.3.2 Ontology Corpora

3 Impact Analysis
  3.1 Impact in OWL
  3.2 NCIt Case Study
    3.2.1 Methods and Materials
    3.2.2 Results
      3.2.2.1 Parsing Times
      3.2.2.2 Asserted Axioms
      3.2.2.3 Inferred Axioms
      3.2.2.4 Reasoner Performance
  3.3 Conclusions

4 Related Work
  4.1 Diffing Techniques
    4.1.1 Syntactic
    4.1.2 Semantic
  4.2 Reasoner Performance
  4.3 Discussion

5 Axiom-Centric Impact Analysis
  5.1 Motivation
  5.2 Overview
  5.3 Specification
    5.3.1 Axiom Categorisation
      5.3.1.1 Ineffectual Change Categorisation
      5.3.1.2 Effectual Change Categorisation
    5.3.2 Example Walkthrough
  5.4 Implementation
    5.4.1 Algorithms
    5.4.2 Preliminary Evaluation
  5.5 Empirical Evaluation
    5.5.1 Case Study
      5.5.1.1 Coarse-Grained Change Analysis
      5.5.1.2 Ineffectual Changes
      5.5.1.3 Effectual Changes
      5.5.1.4 Discussion
    5.5.2 NCIt Axiom Diff Walkthrough
  5.6 Conclusions

6 Term-Centric Impact Analysis
  6.1 Motivation
  6.2 Specification
    6.2.1 Characterising Impact
    6.2.2 Diff Functions
      6.2.2.1 CEX-Based Functions
    6.2.3 Example Walkthrough
  6.3 Implementation
    6.3.1 Algorithms
    6.3.2 Preliminary Evaluation
  6.4 Empirical Evaluation
    6.4.1 Case Study
      6.4.1.1 Diff Function Comparison
      6.4.1.2 Splitting Direct and Indirect Changes
      6.4.1.3 Change Log Analysis
    6.4.2 NCIt Concept Diff Walkthrough
  6.5 Conclusions

7 Term and Axiom Change Alignment
  7.1 Motivation
  7.2 Specification
    7.2.1 Aligning Changes
    7.2.2 Example Walkthrough
  7.3 Implementation
    7.3.1 Algorithms
    7.3.2 ecco: A Diff Tool for OWL 2 Ontologies
  7.4 Tool Walkthrough
  7.5 Conclusions

8 Performance Heterogeneity and Homogeneity
  8.1 Motivation
  8.2 Specification
  8.3 Implementation
  8.4 Empirical Evaluation
    8.4.1 Materials and Methods
    8.4.2 Results
  8.5 Conclusions

9 Performance Hot Spots
  9.1 Motivation
  9.2 Specification
    9.2.1 Finding Hot Spots
    9.2.2 Reasoning with Hot Spots
      9.2.2.1 Approximation Techniques
      9.2.2.2 Compilation Techniques
  9.3 Implementation
  9.4 Empirical Evaluation
    9.4.1 Finding Hot Spots
      9.4.1.1 Hot Spot Analysis
      9.4.1.2 Comparison with Pellint
    9.4.2 Hot Spot Based Reasoning
      9.4.2.1 Approximations
      9.4.2.2 Compilations
  9.5 Conclusions

10 Conclusions
  10.1 Contributions and Significance
  10.2 Research Impact and Future Directions

Bibliography

Word Count: 39,304

List of Tables

2.1 Concept constructors for the ALC DL
5.1 Compound change categories
5.2 Effectual change categories
5.3 Example ontologies O1 and O2
5.4 Categorisation of removals in diff(O1, O2)
5.5 Categorisation of additions in diff(O1, O2)
5.6 Operation times per comparison throughout the NCIt (in seconds)
5.7 Coarse-grained changes throughout the NCIt
5.8 Ineffectual removals of diff(Oi, Oi+1), for 1 ≤ i ≤ 112
5.9 Ineffectual additions of diff(Oi, Oi+1), for 1 ≤ i ≤ 112
5.10 Effectual removals of diff(Oi, Oi+1), for 1 ≤ i ≤ 112
5.11 Effectual additions of diff(Oi, Oi+1), for 1 ≤ i ≤ 112
5.12 Effectual additions in diff(O32, O33)
5.13 Ineffectual additions in diff(O32, O33)
5.14 Effectual removals in diff(O32, O33)
5.15 Ineffectual removals in diff(O32, O33)
5.16 Mapping of term names in the NCIt to abbreviations
6.1 Example ontologies O1 and O2
6.2 Affected concepts (specialised, generalised and total) between O1 and O2 according to the mentioned diff notions
6.3 Breakdown of concept impact in At-AT(O1, O2)Σ
6.4 Breakdown of concept impact in Sub-AT(O1, O2)Σ
6.5 Breakdown of concept impact in Gr-AT(O1, O2)Σ
6.6 Number of concepts processed per minute by each diff function Φ
6.7 Number of affected atomic concepts found by each diff function for Σ := Σu, and their respective coverage w.r.t. Gr-AT(Oi, Oi+1)Σ
6.8 Number of directly affected concepts (1) in AT^L(O1, O2)Σ (denoted "L"), (2) in AT^R(O1, O2)Σ (denoted "R"), (3) in the union of those two sets (denoted "Total"), and (4) that do not appear in the NCIt change logs (denoted "Missed"), found by AtDiff(O1, O2)Σ and SubDiff(O1, O2)Σ for Σ := Σu
6.9 Number of affected atomic concepts, AT(Oi, Oi+1)Σ, found by each diff function (in addition to Un-AT := {Cex1-AT ∪ Cex2-AT ∪ Sub-AT}) for Σ := Σu within the NCIt change logs
6.10 Affected concepts (specialised, generalised and total) between O1 and O2 according to the mentioned diff notions
6.11 Number of changes in At-AT(O1, O2)Σ according to impact
6.12 Number of changes in Sub-AT(O1, O2)Σ according to impact
7.1 Example ontologies O1 and O2
7.2 Affected concepts in At-AT(O1, O2)Σ with corresponding witness axioms and justifications
7.3 Additional affected concept in Sub-AT(O1, O2)Σ with corresponding witness axiom and justification
8.1 Basic metrics and classification times (in seconds) of selected BioPortal ontologies. Ontologies marked with ∗ are in the OWL 2 EL profile
9.1 Comparison of hot spots found via SAT-guided (white rows) and random (grey rows) concept selection approaches. CPU times in seconds
9.2 Expressivity of each original ontology (O), its various hot spots (Mi, for 1 ≤ i ≤ 3) and corresponding remainders (O\Mi)
9.3 Number of GCIs contained in each ontology, its hot spots, and their corresponding remainders. The "average reduction" represents the percentage of GCIs removed from O into O\Mi, for 1 ≤ i ≤ 3
9.4 Ontology/reasoner combinations for which Pellint found lints
9.5 Reasoning times and degree of completeness of the devised approximations, and tr-Ap(O). The degree of completeness is denoted "compl."
9.6 Compilation results for the devised compilation techniques (time in seconds, where the unit is not shown)

List of Figures

3.1 Axiom growth of the NCIt, where annotation axioms dominate (x-axis: NCIt version, y-axis: number of axioms)
3.2 Breakdown of logical axioms occurring in the NCIt (x-axis: NCIt version, y-axis: number of axioms). Note that in Figure 3.2a role axioms are grouped together, and then broken down in Figure 3.2b
3.3 Entailment growth of the NCIt: asserted and inferred entailment counts (x-axis: NCIt version, y-axis: number of axioms)
3.4 Reasoner performance across the NCIt (x-axis: NCIt version, y-axis: time in seconds). Versions that could not be classified are not plotted
5.1 Categorisation hierarchy of additions
5.2 Categorisation hierarchy of removals
5.3 Diff operation times per NCIt comparison
5.4 Diff operation times including laconic justification finding
5.5 Breakdown of effectual vs ineffectual additions (x-axis: NCIt version, y-axis: number of axioms on a logarithmic scale)
5.6 Breakdown of effectual vs ineffectual removals (x-axis: NCIt version, y-axis: number of axioms on a logarithmic scale)
6.1 Comparison of the number of specialised concepts found by CvsDiff(O1, O2)Σ and GrDiff(O1, O2)Σ within the signature samples of the NCIt (y-axis: number of atomic concepts, x-axis: comparison identifier)
6.2 Comparison of purely directly ("P.D."), purely indirectly ("P.I."), and both directly and indirectly (denoted "Mix") affected concepts found within At-AT(O1, O2)Σ (denoted "At") and Sub-AT(O1, O2)Σ (denoted "Sub") in NCIt versions (y-axis: number of atomic concepts, x-axis: comparison identifier)
7.1 Entry point to ecco on the Web
7.2 Summary of changes between O1 and O2
7.3 Summary of changes between O1 and O2 with categories expanded
7.4 Strengthenings between O1 and O2
7.5 Strengthenings between O1 and O2, where witness axioms for concept changes are in focus
7.6 Ineffectual additions between O1 and O2
8.1 ChEBI: Chemical Entities of Biological Interest (times in seconds)
8.2 NCIt: National Cancer Institute Thesaurus (times in seconds)
8.3 IMGT Ontology (times in seconds)
8.4 CCLO: Coriell Cell Line Ontology (times in seconds)
8.5 Gazetteer ontology (times in seconds)
8.6 ICF: International Classification of Functioning, Disability and Health (times in seconds)
8.7 PRPPO: Patient Research Participant Permissions Ontology (times in seconds)
8.8 EFO: Experimental Factor Ontology (times in seconds)
8.9 NEMO: Neural Electromagnetic Ontologies (times in seconds)
8.10 VO: Vaccine Ontology (times in seconds)
8.11 OBI: Ontology for Biomedical Investigations (times in seconds)
8.12 Classification times (in seconds) of the GO-Ext ontology with Pellet

Abstract

With the growing popularity of the Web Ontology Language (OWL) as a logic-based ontology language, as well as advancements in the language itself, the need for more sophisticated and up-to-date services increases as well. While, for instance, there is active focus on new reasoners and optimisations, other services fall short of advancing at the same rate (it suffices to compare the number of freely-available reasoners with ontology editors). In particular, very little is understood about how ontologies evolve over time, and how reasoners' performance varies as the input changes. Given the evolving nature of ontologies, detecting and presenting changes (via a so-called diff) between them is an essential engineering service, especially for version control systems or to support change analysis. In this thesis we address the diff problem for description logic (DL) based ontologies, specifically OWL 2 DL ontologies based on the SROIQ DL. The outcomes are novel algorithms employing both syntactic and semantic techniques to, firstly, detect axiom changes, and what terms had their meaning affected between ontologies, secondly, categorise their impact (for example, determining that an axiom is a stronger version of another), and finally, align changes appropriately, i.e., align source and target of axiom changes (so the stronger axiom with the weaker one, from our example), and axioms with the terms they affect. Subsequently, we present a theory of reasoner performance heterogeneity, based on field observations related to reasoner performance variability phenomena. Our hypothesis is that there exist two kinds of performance behaviour: an ontology/reasoner combination can be performance-homogeneous or performance-heterogeneous. Finally, we verify that performance-heterogeneous reasoner/ontology combinations contain small, performance-degrading sets of axioms, which we call hot spots. We devise a performance hot spot finding technique, and show that hot spots provide a promising basis for engineering efficient reasoners.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this thesis (including any appendices and/or schedules to this thesis) owns certain copyright or related rights in it (the "Copyright") and s/he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this thesis, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has from time to time. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the "Intellectual Property") and any reproductions of copyright works in the thesis, for example graphs and tables ("Reproductions"), which may be described in this thesis, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this thesis, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/DocuInfo.aspx?DocID=487), in any relevant Thesis restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's policy on presentation of Theses.

Acknowledgements

To my parents José António Viegas and Maria da Luz, my brother Ricardo, and my sister Rute, a huge thank you for the constant support and countless advice. For the inspiration to join a doctoral programme, the incessant patience and guidance, as well as the brilliance of their supervision throughout my time in Manchester, I am remarkably grateful to Bijan and Uli. It was not before a thorough examination, courtesy of and Suzanne Embury, that this became an official thesis. So thank you both for examining this work, and for all your helpful feedback. By virtue of doctoral grants awarded by the Engineering and Physical Sciences Research Council (EPSRC) from 2009 to 2012, and Fundação para a Ciência e a Tecnologia (FCT) from 2013 to 2014 (under grant SFRH/BD/87031/2012, funded by Portugal's Ministry of Education and Science (MEC), Programa Operacional Potencial Humano (POPH), Quadro de Referência Estratégico Nacional (QREN), and the European Social Fund (ESF)), I benefited from financial support that was crucial to completing this degree; needless to say this was highly appreciated. When the going got (financially) tough, it was owing to Robert Stevens that I had a research associate position and a fun project to get done. Thanks Robert! By and large it was Catarina's encouragement, in times good or bad, that kept me soundly going; thank you for that and for all the great moments we shared. Finally, many thanks to all friends and colleagues I gathered together on this thin raft of life, your company made Mancunian weather (almost) a delight.

Chapter 1

Introduction

An ontology is a machine-readable knowledge representation composed of axioms: logical sentences that precisely describe the meaning of terms within a domain. The Web Ontology Language (OWL) [19, 101] is a well-known logic-based ontology language, widely used in the Semantic Web. OWL was standardised by the W3C in 2004, and subsequently in 2009 its most recent revision, OWL 2, became a W3C recommendation. This latest version of OWL is based on the SROIQ description logic (DL) [4, 48]. Having a logic as its underpinning allows ontology engineers to encode knowledge about terms where these have a well-defined meaning. Unlike the more expressive first order logic (FOL), the description logics family benefits from the availability of decision procedures for key problems, such as satisfiability, and thus classification. Consequently, OWL ontology engineers can employ automated deduction systems, so-called reasoners, to infer new knowledge from their ontologies; more precisely, a reasoner makes implicitly asserted knowledge explicit, thus allowing engineers to verify (certain forms of) consequences of their asserted axioms. The most notable use of OWL ontologies is in the clinical sciences domain, with well-known examples such as the Foundational Model of Anatomy (FMA) [78], the National Cancer Institute (NCI) Thesaurus (NCIt) [44], the Systematized Nomenclature of Medicine (SNOMED) Clinical Terms (SNOMED CT) [96], the Gene Ontology (GO) [2], and GALEN [85]. These ontologies are collaboratively-crafted documents, typically with teams of developers involved, and principled design and deployment methodologies. Additionally, they tend to be rather large in size; for instance, the January 2012 release of SNOMED CT contains around 340,000 terms and over 320,000 logical axioms. In spite of the magnitude of these ontologies, akin to large software engineering projects, the state of the art in engineering (particularly maintenance) services available to ontology engineers is hardly comparable to that available to software engineers. As ontologies grow bigger and more complex, the associated services that support their engineering must also progress. For instance, only recently have there been significant advances in the areas of ontology versioning [58] and diffing [68]. Even so, developers still rely on ad hoc mechanisms to carry out such essential tasks. For example, when faced with a poorly performing ontology version following minor (seemingly innocuous) changes, users are inclined to remove or weaken their changes, or to roll back such changes altogether and redo them one by one in order to identify the performance-degrading change(s). Similarly, as an ontology evolves over time, users find that answers to their queries might vary (sometimes significantly), naturally becoming confounded as to why such changes in query answers occurred; without the means to pinpoint which changes caused this phenomenon, users may find themselves redoing work. Understandably, there is a need for more sophisticated methods to detect and facilitate the understanding of ontology changes and the effect that these have, that is, change impact analysis services for ontologies. In particular, we address the impact of axiom changes on the semantics of an ontology: its asserted and inferred axioms, and terms, as well as the impact on (reasoner) performance.
A diachronic analysis of NCIt versions allowed us to identify certain key impact analysis services that would be desirable for ontology engineers, and that will be addressed in this thesis.

1.1 Desiderata for Impact Analysis

The study of impact analysis in software systems came into focus in the 1990s, being defined as "identifying the potential consequences of a change, or estimating what needs to be modified to accomplish a change" [1]. Impact analysis is fundamental for evolving documents, such as software code or knowledge bases, where "small changes can ripple throughout the system to cause major unintended impacts elsewhere" [71]. A key service for impact analysis, typically used as a component in versioning systems, is a diff engine; it enables users to compare documents and analyse changes. This service is valuable not only in the versioning context, but also for the purpose of understanding changes and their impact on the overall ontology. Different diff methods vary in their sensitivity to changes: for example, a diff method based on character differences will find that two different serialisations of the same ontology are radically distinct ontologies. Consequently, if a diff method is sensitive to irrelevant differences, then the user will have to determine which reported changes are actually significant. On the other hand, a hard requirement is that the diff method does not miss any change of significance. Aside from selecting which changes are relevant, a common ability of a diff is the alignment of changes between source and target documents. However, since OWL does not enforce a systematised ordering of axioms, such an alignment becomes problematic when comparing OWL documents. In order to align changes precisely, one solution would be a log-based diff, which requires the existence of a change log between two ontologies. The obvious problem with this is lack of flexibility; such a diff only works when one has a change log. For that to happen, ontology editors must provide the ability to track changes either out of the box or via plugins, for example, the edit-based diff built into the Swoop ontology editor [61]. Of course, there is the added issue of agreement on a change log format between ontology editors and diffs. Also observe that, in a collaborative development scenario where ontology engineers can 'fork' the 'base' ontology (i.e., a copy of the base ontology is made and developed independently, while the base ontology can keep changing as well), one cannot directly compare the resulting versions (i.e., base and modified) with a log-based diff if both of them have been altered since the fork; it would require backtracking changes across both versions' chains of changes. So the ideal OWL diff should not rely on the existence of a change log, but rather only require two input ontologies. It should produce an alignment similar to that employed within textual diffs, though obviously adapted to the idiosyncrasies of OWL. In particular, in OWL we have not only changes to axioms, but also changes to the meaning of terms used in those axioms. Ideally, a diff would detect both axiom and term differences, and pinpoint which axioms altered the meaning of which terms, and in what way. Another aspect of impact analysis concerns reasoner performance on ontologies. As the latter change over time, so does the performance of a reasoner in the face of the changes carried out; seemingly harmless changes can result in reasoning time shifting from seconds to hours, or even days. Currently though, there is a significant lack of services to support performance impact analysis.
If a reasoner performs poorly on some input, one solution is to switch to another reasoner [33], or to trade off soundness or completeness of reasoning results for better performance. Another typical solution is to alter the ontology until the reasoner performs acceptably well on the altered ontology. This is usually done based on "folk wisdom" about what constitutes a performance-degrading change (for a specific reasoner, or in general), for instance, "disjoints are hard"; though in fact disjoints can greatly improve performance. So ideally, one would have the tools to identify whether there exist such performance-degrading axioms in an ontology, and, if so, identify and return them to the user. Of course, merely pointing out these axioms helps very little if the aim is purely to speed up reasoning, so one would have to exploit ways to improve reasoner performance based on those performance-degrading axioms.

1.2 State of the Art

The OWL specification defines a high-level notion of syntactic equality, so-called structural equivalence, based on which the structural difference between two ontologies can be computed. Several (syntactic) diff services rely on this method to detect and present asserted axiom changes (i.e., additions and removals), aligned with the term on the left-hand side of the change [77, 69]. Other (semantic) diffs compute entailment differences, and perform a similar term-based alignment [58, 68]. However, axiom changes potentially alter the meaning of more than one term, and can do so in different ways; for example, an axiom may cause a term to be indirectly specialised, i.e., specialised because of a change to another term. It could also occur that an axiom change causes no changes to the meaning of any terms, possibly because the axiom has no logical impact in the first place. Such a distinction, by itself, is already useful since it gives the engineer an immediate, sensible division of the change set into changes that do and do not cause any logical effect (on the ontology or its terms). The diff methods available to date do not analyse the impact of changes at such a fine-grained level. Particularly with regard to (axiom) change alignment, current services perform no categorisation of changes according to their impact, for example, whether an axiom has been rewritten, or is a strengthening of another axiom. Instead, this kind of analysis is left up to the user. In terms of reasoner performance impact analysis, some work has been conducted on finding performance-degrading constructs (i.e., terms or axioms). Among the key services for that purpose, Pellint [72] detects possibly performance-degrading axioms for Pellet [95], and repairs them (by removal or weakening). However, it does not verify whether the reasoner performs any better on the repaired ontology than it did on the original one. Furthermore, both Pellint and the approach described in [16] will return any number of performance-degrading constructs, possibly even ones with no relation to each other, which might be problematic for subsequent maintenance of the ontology; if the set of performance-degrading constructs returned includes, for instance, terms or axioms across all sub-domains in the ontology, then removing all of them might not be a viable option. Searching for and verifying such problematic constructs can be computationally expensive (for the verification, the reasoner must at least perform the required task on one, presumably modified, ontology), and even futile if none exist. That said, there is currently no solution for predicting whether an ontology/reasoner pair might have performance-degrading constructs.

1.3 Goals and Contributions

In summary, the goals of this thesis are to advance the state of the art in impact analysis (services) for OWL ontologies by, first of all, advancing techniques that address fundamental diff features. Second, this thesis aims at investigating reasoner performance variability phenomena, and means to cope with otherwise performance-unmanageable ontologies (for some reasoner). The contributions of the research carried out are summarised below.

Chapter 3 In designing and carrying out a diachronic case study of the NCIt, we discover a series of logical (i.e., axiom change) and reasoner performance phenomena, which help identify desirable methods to better detect and understand them.

Chapter 5 An axiom-based diff notion is presented (i.e., designed, implemented, and evaluated); it detects structural changes between ontologies, verifies whether those have any logical effect, and subsequently aligns changes in a way similar to textual diffs, where an alignment can be found. Based on a case study where 100 ontology versions are compared with respect to axiom changes, we find that the axiom-based diff can aid not only change analysis between ontologies, but also the division of labour (among developers) with respect to amending potential errors.

Chapter 6 A term-based diff notion is presented; it detects which terms in an ontology had their meaning affected by changes, by comparing (finite) entailment sets between ontologies and extrapolating affected terms from those differences. The diff then verifies whether a term has been directly or indirectly affected (or both), depending on whether it has changed via some change to one of its sub- or superconcepts. In a case study where 14 ontology versions are compared with respect to term differences, we find that by separating directly and indirectly affected terms we can greatly reduce the amount of potentially uninteresting changes shown to users.

Chapter 7 A mechanism to align axiom- and term-based changes is presented; by attending to the justifications for entailments in the difference, and the term changes that these entailments "witness", an alignment between axiom changes and affected terms can be computed by our method. Based on examples and a tool walkthrough, we show that this alignment is helpful when inspecting changes between ontologies, as well as providing a means to sort the change set according to affected terms.

Chapter 8 A theory of performance heterogeneity for ontology/reasoner combinations is formulated, and a means to verify whether a given ontology/reasoner combination is performance-homogeneous or heterogeneous is designed and evaluated. Surprisingly, we discover that ontology/reasoner combinations can have three different performance (growth) patterns: monotonic and linear, monotonic but non-linear, and non-monotonic.

Chapter 9 A method to identify small subsets of an ontology that are performance-degrading when combined with the remainder (so-called performance hot spots) is presented, and shown to be feasible. A series of approximation and knowledge compilation techniques based on hot spots are designed and evaluated, some of which are shown to significantly improve reasoning time compared to the original ontology.

1.4 Published Work

The contributions of this thesis are supported by the following workshop and conference publications:

• [34] R. S. Gonçalves, B. Parsia, and U. Sattler. Analysing multiple versions of an ontology: A study of the NCI Thesaurus. In Proceedings of the 24th International Workshop on Description Logics (DL), 2011.

• [35] R. S. Gonçalves, B. Parsia, and U. Sattler. Analysing the evolution of the NCI Thesaurus. In Proceedings of the 24th IEEE International Symposium on Computer-Based Medical Systems (CBMS), 2011.

• [36] R. S. Gonçalves, B. Parsia, and U. Sattler. Categorising logical differences between OWL ontologies. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), 2011.

• [37] R. S. Gonçalves, B. Parsia, and U. Sattler. Facilitating the analysis of ontology differences. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn), 2011.

• [38] R. S. Gonçalves, B. Parsia, and U. Sattler. Concept-based semantic difference in expressive description logics. In Proceedings of the 11th International Semantic Web Conference (ISWC), 2012.

• [39] R. S. Gonçalves, B. Parsia, and U. Sattler. Concept-based semantic difference in expressive description logics. In Proceedings of the 25th International Workshop on Description Logics (DL), 2012.

• [40] R. S. Gonçalves, B. Parsia, and U. Sattler. Ecco: A hybrid diff tool for OWL 2 ontologies. In Proceedings of the 9th International Workshop on OWL: Experiences and Directions (OWLED), 2012.

• [41] R. S. Gonçalves, B. Parsia, and U. Sattler. Performance heterogeneity and approximate reasoning in description logic ontologies. In Proceedings of the 11th International Semantic Web Conference (ISWC), 2012.

Chapter 2

Preliminaries

This chapter lays the groundwork in terms of the terminology and core background notions used throughout the thesis. In Section 2.1 we present fundamental aspects of description logics and associated (reasoning) services for description logic ontologies. The term "ontology", in the field of philosophy, denotes the study of the nature of being: what kinds of entities exist in the universe, and how they relate. In computer science, this same term "ontology" is used to refer to a representation of knowledge as a set of terms, and relations between these terms. This thesis is focused on description logic based ontologies written in the Web Ontology Language (OWL), which is discussed in Section 2.2 along with its applications and profiles. OWL profiles are fragments of OWL that benefit from more robust computational properties, such as polynomial time worst case complexity for core reasoning tasks. Finally, in Section 2.3 we present our experimental setup, in particular: the ontology corpora, hardware, and software libraries used in all our experiments.

2.1 Description Logics

Description logics (DLs) [4] are fragments of first order logic (FOL) that benefit from decision procedures for key inference problems, such as satisfiability. Syntax-wise, the main difference between DLs and FOL is that DLs use a variable-free syntax, making DL formulae more succinct than FOL ones.

2.1.1 Syntax and Semantics

A DL ontology O is composed of a set of asserted axioms, analogous to FOL formulae, describing relations between terms, i.e., concept, role and instance names [4]. Concepts in DLs correspond to unary predicates in FOL, while roles correspond to binary predicates, and finally instances to constants.

Let NC, NR, and NI be the sets of concept, role, and instance names, respectively, of an ontology O. The union of these three sets, that is, the set of all terms mentioned in O, is called the signature of O, and denoted Õ. We use the term "signature" and associated notation for any set of terms mentioned in an ontology or an element thereof; for example, the signature of an axiom α is denoted α̃, and that of a concept C is denoted C̃. The term "concept" is used for any (possibly complex) concept, while "atomic concept" is used to refer to a concept name, and similarly "atomic role" stands for a role name. The set of subconcepts of an ontology O is recursively defined as all (possibly complex) concepts asserted in each axiom of O, plus {⊤, ⊥}. For example, the axiom α : A ⊑ B ⊓ ∃r.(C ⊔ D) contains the subconcepts SC := {A, B, C, D, B ⊓ ∃r.(C ⊔ D), ∃r.(C ⊔ D), C ⊔ D}. An interpretation I is a tuple I = (∆I, ·I), where ∆I, the interpretation domain, is a non-empty set of instances, and ·I, the interpretation function, maps each concept A ∈ NC to a subset AI of ∆I, each role r ∈ NR to a subset rI of ∆I × ∆I, and finally each instance a ∈ NI to an element aI of ∆I. For the ALC DL, the interpretation function ·I is described in Table 2.1.

Table 2.1: Concept constructors for the ALC DL.

Constructor                   Syntax   Semantics
Top                           ⊤        ∆I
Bottom                        ⊥        ∅
Negation (complement)         ¬C       ∆I \ CI
Conjunction (intersection)    C ⊓ D    CI ∩ DI
Disjunction (union)           C ⊔ D    CI ∪ DI
Existential restriction       ∃r.C     {x ∈ ∆I | ∃y : (x, y) ∈ rI and y ∈ CI}
Universal restriction         ∀r.C     {x ∈ ∆I | ∀y : (x, y) ∈ rI implies y ∈ CI}
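These structural notions map directly onto the OWL API, the library used throughout this thesis (cf. Section 2.3.1). The following is a minimal sketch, using a hypothetical namespace http://example.org#, of computing the signature α̃ and the subconcepts of the example axiom α : A ⊑ B ⊓ ∃r.(C ⊔ D); getSignature() and getNestedClassExpressions() are the relevant OWL API calls.

```java
import java.util.Set;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.util.DefaultPrefixManager;

public class SignatureDemo {
    public static void main(String[] args) {
        OWLDataFactory df = OWLManager.createOWLOntologyManager().getOWLDataFactory();
        PrefixManager pm = new DefaultPrefixManager("http://example.org#"); // hypothetical namespace
        OWLClass a = df.getOWLClass(":A", pm), b = df.getOWLClass(":B", pm);
        OWLClass c = df.getOWLClass(":C", pm), d = df.getOWLClass(":D", pm);
        OWLObjectProperty r = df.getOWLObjectProperty(":r", pm);

        // alpha : A ⊑ B ⊓ ∃r.(C ⊔ D)
        OWLAxiom alpha = df.getOWLSubClassOfAxiom(a, df.getOWLObjectIntersectionOf(b,
                df.getOWLObjectSomeValuesFrom(r, df.getOWLObjectUnionOf(c, d))));

        Set<OWLEntity> signature = alpha.getSignature();                         // {A, B, C, D, r}
        Set<OWLClassExpression> subconcepts = alpha.getNestedClassExpressions(); // asserted subconcepts
        System.out.println("Signature: " + signature);
        System.out.println("Subconcepts: " + subconcepts);
    }
}
```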

The ALC DL extended with transitive roles becomes S, as an abbreviation for ALC_R+ [88]. The letters in a DL name roughly indicate the constructors that can be used; for instance, SH extends S with role hierarchies (H). The SROIQ DL is an extension of SH with: complex role inclusions (R), nominals (O), inverse roles (I), and qualified number restrictions (Q). The axioms one can use in DL ontologies are categorised into three major sets: (1) TBox or terminological axioms, denoted T, (2) RBox or role axioms, denoted R, and (3) ABox or assertional axioms, denoted A. This partitioning of an ontology O is denoted O = ⟨T, R, A⟩. The forms of TBox, RBox and ABox axioms used in DLs are, respectively: (1) Concept Inclusion (CI) axioms of the form C ⊑ D,1 where C, D are (possibly complex) concepts; if C and D are complex concepts, the axiom is referred to as a General Concept Inclusion (GCI). (2) Role inclusion axioms of the form r ⊑ s, or role chains of the form r ∘ s ⊑ t, where r, s, t are roles. (3) Concept assertions of the form a : C, or role assertions of the form (a, b) : r, where a, b are instance names, C is a concept, and r is a role. The set of axioms expressible in a DL L over a signature Σ is denoted L(Σ). An interpretation I satisfies a SROIQ axiom α, denoted I |= α, of the axiom types mentioned above as follows:

I |= C ⊑ D      if CI ⊆ DI
I |= r ⊑ s      if rI ⊆ sI
I |= r ∘ s ⊑ t  if rI ∘ sI ⊆ tI
I |= C(a)       if aI ∈ CI
I |= r(a, b)    if ⟨aI, bI⟩ ∈ rI

If an interpretation I satisfies every axiom in an ontology O, I is called a model of O, denoted I |= O.

The restriction of an interpretation I to a signature Σ is denoted I|Σ. Two interpretations I and J coincide on a signature Σ (denoted I|Σ = J|Σ) if ∆I = ∆J and tI = tJ for each t ∈ Σ.

2.1.2 Standard Reasoning Services

A central aspect of logic-based knowledge representation formalisms is the ability to employ (efficient) automated deduction systems. In DLs these systems, so-called reasoners, implement decision procedures (for instance, tableau [7, 8], hypertableau [80], consequence-based [93], or even resolution-based [64]) for the following reasoning problems:

1 Also referred to as subsumptions.

Satisfiability A concept C is satisfiable with respect to O if there exists at least one model I of O where CI ≠ ∅. Otherwise, i.e., if CI = ∅ in all models of O, C is unsatisfiable (thus O |= C ⊑ ⊥). The computational complexity of deciding satisfiability is 2NExpTime for the SROIQ DL [48], and PTime for EL [3]. If O contains one or more unsatisfiable atomic concepts, it is said to be incoherent; otherwise O is coherent.

Consistency An ontology O is consistent if there exists an interpretation I that satisfies every axiom in O, i.e., O has a model I |= O. Deciding consistency of an ontology written over a DL L is as computationally complex as deciding satisfiability in L (as discussed below).

Entailment An axiom α is entailed by an ontology O, denoted O |= α, if in every model I of O we have that I |= α. Deciding entailment in the SROIQ DL is reducible to deciding satisfiability in SROIQ [49]; checking whether O |= A ⊑ B can be done by checking whether A ⊓ ¬B is unsatisfiable with respect to O.

Realisation Realising an ontology entails finding all atomic concepts each instance is a type of, i.e., testing entailments of the form O |= C(a), where a is an instance and C an atomic concept in Õ. In the worst case, this task requires performing |NC| × |NI| entailment tests.

Classification Classification is the task of computing all subsumptions between atomic concepts (and atomic roles), i.e., testing entailments of the form O |= A ⊑ B, where A, B are atomic concepts (respectively, r ⊑ s, where r, s are atomic roles). In the worst case, this task requires performing a number of entailment tests quadratic in the size of Õ restricted to atomic concepts (respectively, atomic roles) only. The problem of computing role hierarchies can be reduced to a concept hierarchy computation problem, as shown in [31].
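These reasoning services are exposed uniformly by the OWL API's OWLReasoner interface. As a small illustration of the reduction mentioned under Entailment, the sketch below (method and variable names are ours) decides O |= A ⊑ B by testing the satisfiability of A ⊓ ¬B against a reasoner that holds O:

```java
import org.semanticweb.owlapi.model.OWLClassExpression;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public final class EntailmentCheck {
    // O |= A ⊑ B  iff  A ⊓ ¬B is unsatisfiable w.r.t. O (the reasoner holds O).
    public static boolean entailsSubsumption(OWLReasoner reasoner, OWLDataFactory df,
                                             OWLClassExpression a, OWLClassExpression b) {
        OWLClassExpression aAndNotB =
                df.getOWLObjectIntersectionOf(a, df.getOWLObjectComplementOf(b));
        return !reasoner.isSatisfiable(aAndNotB);
    }
}
```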

Note that, for DLs with negation, the first three problems are inter-reducible, and so belong in the same complexity class: verifying whether an axiom α : A ⊑ B follows from O can be done by testing whether A ⊓ ¬B is unsatisfiable (and vice versa). Similarly, consistency of an ontology O is reducible to satisfiability, by verifying whether ⊤ is satisfiable, i.e., whether O ⊭ ⊤ ⊑ ⊥. The set of entailments obtained from the classification of an ontology O is denoted Con(O), and contains all subsumptions between atomic concepts A, B ∈ Õ, excluding the following:

• ⊥ ⊑ A or A ⊑ ⊤, for any A ∈ Õ

• A ⊑ B, where O |= A ⊑ ⊥

So Con(O) represents the transitive closure of the concept hierarchy graph of O. By reducing the set of atomic subsumptions to those of the form A ⊑ C, where there is no B such that O |= A ⊑ B and O |= B ⊑ C, we get the transitive reduction, denoted Conr(O). The classification time of an ontology O using reasoner R, denoted RT(O, R), comprises the time for consistency checking, classification and coherence checking (concept satisfiability). When R is clear from the context, we also use RT(O). Additionally, where both O and R are unambiguous, we simply use RT().
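A minimal sketch of computing Conr(O) with the OWL API, under the exclusions listed above: classification is precomputed, and for each satisfiable atomic concept its direct inferred superconcepts are collected, skipping ⊤ so that trivial subsumptions are excluded.

```java
import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public final class Classification {
    // Collects Con_r(O): direct atomic subsumptions A ⊑ B, excluding A ⊑ ⊤ and
    // any subsumption involving an unsatisfiable concept.
    public static Set<OWLSubClassOfAxiom> transitiveReduction(OWLOntology ont,
                                                              OWLReasoner reasoner,
                                                              OWLDataFactory df) {
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);
        Set<OWLSubClassOfAxiom> conR = new HashSet<OWLSubClassOfAxiom>();
        for (OWLClass a : ont.getClassesInSignature()) {
            if (!reasoner.isSatisfiable(a)) continue; // O |= A ⊑ ⊥: excluded
            for (OWLClass b : reasoner.getSuperClasses(a, true).getFlattened()) {
                if (!b.isOWLThing()) conR.add(df.getOWLSubClassOfAxiom(a, b));
            }
        }
        return conR;
    }
}
```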

2.1.3 Non-Standard Reasoning Services

Aside from the key reasoning services presented in the previous section, there are a variety of “non-standard” services, some of which rely on (standard) reasoning tasks. In this section we present those that are used or referred to throughout this thesis.

Justifications A well-known form of explanation for DL entailments [60], particularly useful for debugging ontologies. A justification J for an entailment α is a ⊆-minimal subset of an ontology O that is sufficient for α to hold, that is, J ⊆ O, J |= α, and there is no J′ ⊊ J such that J′ |= α. A laconic justification J for an axiom α is a justification for α where all axioms in J contain only subconcepts that are necessary for α to hold [46], that is, it discards superfluous parts of regular justification axioms. For example, consider the ontology O := {A ⊑ B ⊓ D, B ⊑ C} and the entailment α : A ⊑ C; a justification J for α is J := O, while a laconic justification for α is J′ := {A ⊑ B, B ⊑ C}.
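For intuition, a single justification can be extracted from a set of axioms known to entail α by simple "contraction"; a minimal sketch, assuming a caller-supplied entails callback that loads the given axiom set into a fresh reasoner and tests whether α still follows (Java 8's Predicate is used here for brevity):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;
import org.semanticweb.owlapi.model.OWLAxiom;

public final class Justifications {
    // Shrinks 'candidate' to one ⊆-minimal justification: drop each axiom in turn,
    // and keep the removal only if the entailment still holds without it.
    public static Set<OWLAxiom> oneJustification(Set<OWLAxiom> candidate,
                                                 Predicate<Set<OWLAxiom>> entails) {
        Set<OWLAxiom> just = new HashSet<OWLAxiom>(candidate);
        for (OWLAxiom ax : new ArrayList<OWLAxiom>(candidate)) {
            just.remove(ax);
            if (!entails.test(just)) just.add(ax); // ax is necessary; put it back
        }
        return just; // no proper subset of 'just' entails the target
    }
}
```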

Conservative Extensions An extensively studied notion in mathematical logic, conservative extensions are a reasoning problem that can be used to facilitate merging or extending documents [100]. We use the DL notions of model and deductive conservative extensions (mCE and dCE, respectively) [30, 75], and associated inseparability relations [89].

Definition 1. Given two ontologies O1, O2 over a DL L, and a signature Σ:

O2 is model Σ-inseparable from O1 (written O1 ≡^mCE_Σ O2) if
  {I|Σ | I |= O1} = {J|Σ | J |= O2}    (2.1)

O2 is deductive Σ-inseparable from O1 w.r.t. L (written O1 ≡^L_Σ O2) if
  {α ∈ L(Σ) | O1 |= α} = {α ∈ L(Σ) | O2 |= α}    (2.2)

diff^L(O1, O2)Σ = {η ∈ L(Σ) | O1 ⊭ η and O2 |= η}    (2.3)

Note that diff^L(O1, O2)Σ = ∅ if and only if O1 ≡^L_Σ O2. Also, bear in mind that O1 ≢^L_Σ O2 implies O1 ≢^mCE_Σ O2 [75]. We use SROIQ GCIs for L, and omit L if this is clear from the context. All diff-related functions, such as the one in Definition 1, are defined asymmetrically. Thus, to get the full diff between two ontologies, we compute diff(O1, O2) for additions and diff(O2, O1) for removals.
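The full diff of Definition 1 ranges over all GCIs in L(Σ) and so is not directly computable by enumeration; restricted to atomic subsumptions over Σ, however, it can be approximated by brute force. A sketch (names are ours; r1 and r2 are reasoners holding O1 and O2, respectively):

```java
import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.model.OWLSubClassOfAxiom;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public final class SemanticDiff {
    // Atomic-subsumption fragment of diff(O1, O2)_Σ: axioms η with O2 |= η and O1 ⊭ η.
    public static Set<OWLSubClassOfAxiom> atomicAdditions(Set<OWLClass> sigma,
                                                          OWLReasoner r1, OWLReasoner r2,
                                                          OWLDataFactory df) {
        Set<OWLSubClassOfAxiom> diff = new HashSet<OWLSubClassOfAxiom>();
        for (OWLClass a : sigma) {
            for (OWLClass b : sigma) {
                if (a.equals(b)) continue;
                OWLSubClassOfAxiom eta = df.getOWLSubClassOfAxiom(a, b);
                if (r2.isEntailed(eta) && !r1.isEntailed(eta)) diff.add(eta);
            }
        }
        return diff; // removals: swap the roles of r1 and r2
    }
}
```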

2.1.4 Structural Notions

Structural Equivalence The notion of structural equivalence in the OWL 2 specification [79] establishes that a concept Φ1 : C ⊓ D (C ⊔ D) is equivalent to Φ2 : D ⊓ C (respectively, D ⊔ C), denoted Φ1 ≡s Φ2, since the order of operands in conjunctions and disjunctions is irrelevant. The order of axioms is likewise unimportant, and the equivalence between them is determined according to the concepts therein. For example, α1 : A ⊑ B ⊓ ∃r.(C ⊔ D) is structurally equivalent to α2 : A ⊑ ∃r.(D ⊔ C) ⊓ B. An ontology O1 is structurally equivalent to an ontology O2 if, for each axiom α ∈ O1, there exists an axiom β ∈ O2 such that α ≡s β, and, analogously, for each axiom β ∈ O2 there exists an axiom α ∈ O1 such that β ≡s α.

Modularity The study of modularity advances techniques to extract modules from ontologies and to partition an ontology into subsets. Techniques such as ε-connections [22, 21] and atomic decomposition [26] allow us to compute a full partitioning of a given ontology, while conservative extensions [75, 67] and locality [17, 18] provide us with theoretical foundations to extract modules from an ontology for a given signature. For the purposes of this thesis we use locality-based modularity, since verifying whether an ontology is a model or deductive conservative extension of another with respect to a logic L is undecidable when L is SROIQ.

Locality-Based Modules Inspired by advances in conservative extensions for DLs, locality is designed to formulate sufficient conditions for an ontology to be a deductive (and consequently model) conservative extension of another. Modules based on locality benefit from desirable logical properties, in particular coverage: a locality-based module M for a signature Σ is a subset of an ontology O (i.e., M ⊆ O) that preserves all consequences of O expressible between terms in Σ [18], i.e., if O |= α and α̃ ⊆ Σ then M |= α. Locality comes in two variants: the first, semantic locality (whose inseparability relation is denoted ≡sem), and the second, syntactic locality (denoted ≡syn), which is an approximation of semantic locality. Both locality variants are approximations of deductive, and consequently model, conservative extensions. Therefore we have the following relations between the mentioned notions [89]:

O1 ≡^syn_Σ O2 ⟹ O1 ≡^sem_Σ O2 ⟹ O1 ≡^dCE_Σ O2 ⟹ O1 ≡^mCE_Σ O2

According to semantic locality, an axiom α is ⊥-local with respect to Σ if, by replacing atomic terms in α̃ \ Σ with the empty set, we have that ∅ |= α [18]. For example, consider α : A ⊑ B ⊓ ∃r.C and Σ := {A}; by replacing terms in α̃ \ Σ as specified beforehand, we end up with the axiom A ⊑ ⊥. Thus α is not ⊥-local with respect to Σ. Given that semantic locality involves reasoning, we use the computationally cheaper syntactic locality variant throughout the thesis.2 A syntactic locality-based x-module M extracted from an ontology O for a signature Σ is denoted x-mod(Σ, O), for x one of ⊤⊥*, ⊤ or ⊥ [89]. The ⊤⊥* module notion involves interleaving ⊤- and ⊥-locality checks until a fixpoint is reached; these modules, though smaller than ⊥- or ⊤-modules, benefit from the same logical guarantees.

2 While for a single module extraction the performance difference between syntactic and semantic module extraction may be negligible, in programs involving hundreds to thousands of module extractions it can have a significant impact on overall run time.
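Syntactic locality-based module extraction ships with the OWL API as SyntacticLocalityModuleExtractor; a sketch of extracting a ⊤⊥*-module for a given seed signature (the loaded ontology and signature are assumed given; TOP and BOT are the analogous module types for ⊤- and ⊥-modules):

```java
import java.util.Set;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLEntity;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import uk.ac.manchester.cs.owlapi.modularity.ModuleType;
import uk.ac.manchester.cs.owlapi.modularity.SyntacticLocalityModuleExtractor;

public final class Modules {
    // x-mod(Σ, O) for x = ⊤⊥*, via ModuleType.STAR.
    public static Set<OWLAxiom> extractStarModule(OWLOntologyManager manager,
                                                  OWLOntology ontology,
                                                  Set<OWLEntity> sigma) {
        SyntacticLocalityModuleExtractor extractor =
                new SyntacticLocalityModuleExtractor(manager, ontology, ModuleType.STAR);
        return extractor.extract(sigma); // coverage: preserves all O-consequences over Σ
    }
}
```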

2.2 OWL: The Web Ontology Language

The Web Ontology Language (OWL) is a language for knowledge representation whose semantics are based on description logics. In OWL two kinds of axioms are allowed: logical axioms and non-logical (annotation) axioms. The latter are analogous, with respect to reasoning, to comments in programming languages. The former, logical axioms, are statements that are asserted to be true under well-defined semantics, which in turn allow inferring axioms that are true though not explicitly asserted: so-called entailments. Primarily developed for the Semantic Web, OWL became a W3C recommendation in 2004, at the time based on the SHOIN DL [50]. The first iteration of OWL specified the following three "species":

• OWL DL: based on the SHOIN DL, whose key inference services have an NExpTime worst case complexity [50].

• OWL Lite: based on the SHIF DL, which benefits from “more tractable” inference services, the worst case complexity being ExpTime [98].

• OWL Full: a syntactic and semantic extension of RDF and RDF Schema (RDFS), and a superset of OWL DL, for which inference services are undecidable. An ontology is in OWL Full if certain constraints imposed on axioms in order to retain decidability in OWL DL [52, 53] are not met.

The second iteration of OWL became a W3C recommendation in 2009, also incorporating several species, now called “profiles”. In particular, OWL 2 specifies the OWL 2 EL profile which benefits from polynomial time worst case complexity inference services.

• OWL 2 DL: based on the SROIQ DL, with a worst case complexity of 2NExpTime for key inference services [48].

• OWL 2 EL: based on the EL++ DL, for which the key reasoning tasks have a PTime worst case complexity [3].

• OWL 2 RL: a subset of OWL 2 DL and an extension of RDFS, the RL profile is motivated by Description Logic Programs (DLP) [42] and pD* [97], and designed to allow for scalable reasoning based on rules.

• OWL 2 QL: based on the DL-Lite family of logics [14], in particular DL-LiteR; it is meant to provide efficient query answering, with a worst case complexity of AC0 with respect to the size of the ABox.

An OWL ontology can be written (and serialised) in several syntaxes: RDF/XML (the primary exchange syntax), OWL/XML, Functional Syntax, Manchester Syntax, and Turtle. Description logic concepts, roles and instances are referred to in the OWL specification as classes, object properties and individuals, respectively. Indeed OWL has different types of "roles", as well as the following features not originally found in DLs:

Annotations As of OWL 2, annotations can be added to terms (referred to as entities in the OWL specification), axioms, and the ontology itself by means of so-called annotation properties. For example, the annotation property rdfs:label is commonly applied to terms, and used to change the way OWL ontology interfaces render term names in an ontology: according to their label value or URI.

Concrete domains OWL draws on advances in augmenting DLs with concrete domains to allow describing concrete qualities of objects, such as weight, height, temperature, and so on [6,9, 73, 74]. To make use of concrete domains, OWL features so-called data properties, which assign to a class or individual a concrete value of some (data) type. The accepted “datatypes”, as they are referred to in the OWL specification, are taken from the set of XML Schema Datatypes,3 the RDF specification,4 and the specification of plain literals.5

Imports In OWL it is possible to import external ontologies (for reuse, for example) via the owl:imports construct [20]. Systems like reasoners that take ontologies as input will typically take into account the entire imports closure when performing reasoning tasks.

3 http://www.w3.org/TR/xmlschema11-2/
4 http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/
5 http://www.w3.org/TR/rdf-plain-literal-1/
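A minimal sketch of loading an ontology and taking its imports closure into account, as reasoners typically do (the file name is hypothetical):

```java
import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;

public final class ImportsDemo {
    public static void main(String[] args) throws Exception {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology ontology =
                manager.loadOntologyFromOntologyDocument(new File("nci.owl")); // hypothetical file
        int total = 0;
        for (OWLOntology o : ontology.getImportsClosure()) // the ontology plus its (transitive) imports
            total += o.getLogicalAxiomCount();
        System.out.println("Logical axioms in the imports closure: " + total);
    }
}
```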

Punning OWL 2 allows using the same name for different kinds of terms. Specif- ically, one can have a class and an individual, or a class and a property with the same name. However, if an ontology has different kinds of properties with the same name, for instance an object and data property, it is no longer a valid OWL 2 ontology.

Any of the OWL species and profiles mentioned allows using concrete domains; however, OWL 1 and 2 differ in their datatype maps (i.e., sets of allowed datatypes).

2.3 Experimental Setup

This section describes global methods and materials employed in experiments throughout subsequent chapters. In particular, we report on the hardware and software infrastructure used, as well as the input objects for experiments.

2.3.1 Infrastructure

Starting with hardware, the experiments carried out throughout this thesis are executed on the following two machines:

Machine 1: Intel Quad-Core Xeon 3.2GHz processor, 32GB 1066MHz DDR3 ECC RAM, running Mac OS X 10.8.4 (Mac Pro mid-2010 model).

Machine 2: Intel Dual-Core i7 2.7GHz processor, 16GB 1333MHz DDR3 ECC RAM, running Mac OS X 10.7.5 (Mac Mini mid-2011 model).

Moving on to software, both machines have Oracle Java Runtime Environment (JRE) v1.7 installed as the default JRE. All experiments are run using the OWL API v3.4.5 [45, 12]. In order to avoid memory allocation issues between the Java Virtual Machine (JVM) and the operating system, the maximum heap memory allocated to the JVM is 30GB on Machine 1 and 14GB on Machine 2. In every experiment where run time is logged, the machines used are "isolated". That is, all asynchronous processes are terminated (for example, software updates, disk indexing, backups) and network access is restricted to Secure Shell (SSH) connections only. Finally, all non-essential system and user processes are also terminated.

The choice of reasoners for our experiments is based on the following criteria: (a) coverage of all of OWL 2, (b) freely available for academic use, and (c) native support for the OWL API (v3.x). For some experiments an extra criterion is used: (d) reasoners based on sound, complete and terminating algorithms. Based on these restrictions, we are left with the following reasoners: FaCT++ [99], HermiT [92], Pellet [95], and JFact.6 Where criterion (d) is not applied, we also use TrOWL [86], a sound but incomplete reasoner that transforms given input to EL++, ELC++ (EL++ with negation) or ELCQ++ (EL++ with negation and cardinality restrictions), and subsequently applies the REL reasoner (a component of TrOWL).

6 http://jfact.sourceforge.net/

2.3.2 Ontology Corpora

The selection of corpora for our experiments is based on the following criteria: (a) publicly available for download, (b) engineered with real-world applications in mind, (c) logically non-trivial, that is, ontologies with high enough expressivity that an OWL 2 DL reasoner is necessary for reasoning, and (d) challenging from a reasoning perspective. Additionally, for certain parts of this thesis, the following requirement must be met: (e) high number of versions. As such, the following corpora were selected for our evaluation:

NCI Thesaurus Since 2000, the Enterprise Vocabulary Services (EVS) project of the National Cancer Institute of the United States (NCI) has been developing and publishing a biomedical reference terminology: the National Cancer Institute Thesaurus (NCIt) [23, 24, 29, 32, 81, 94].7 Part of the development methodology for the NCIt is to base it on an ontology encoded in a description logic. Since 2003, the monthly releases of the NCIt have included a version in OWL, resulting in a set of 112 OWL ontologies freely downloadable from the Web.8 The full corpus is challenging to work with; for example, the latest version has nearly 100,000 atomic concepts and around 150,000 logical axioms. The whole corpus weighs in at 15GB for the uncompressed OWL files. In several versions of the NCIt (the earliest being v14 and v16) we came across a parsing issue; these ontologies make use of an annotation property

and a data property, both named "code". This results in annotation assertions being parsed as data property assertions, thereby leading to the mass creation of instances. However, this is a tool artefact rather than a fundamental modelling choice.9 Since OWL does not allow properties of different types to have the same name, ontology parsers (in our case, OWL API parsers) need to decide how to interpret those properties. In order to circumvent this issue, we remove all ABox axioms (including the data property assertions) before carrying out any analysis of the NCIt. Because this accurately reflects the developers' intent, the "repaired" NCIt corpus is the one used throughout the thesis.

7 http://ncit.nci.nih.gov/
8 http://evs.nci.nih.gov/ftp1/NCI_Thesaurus
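A sketch of the repair step just described, assuming the version is already loaded: all assertional axioms (which include the problematic data property assertions) are removed via the OWL API's set of ABox axiom types.

```java
import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.model.*;

public final class NCItRepair {
    // Strips all assertional (ABox) axioms, incl. the problematic data property
    // assertions, from a given NCIt version held by 'manager'.
    public static void removeABox(OWLOntologyManager manager, OWLOntology ontology) {
        Set<OWLAxiom> abox = new HashSet<OWLAxiom>();
        for (AxiomType<?> type : AxiomType.ABoxAxiomTypes)
            abox.addAll(ontology.getAxioms(type));
        manager.removeAxioms(ontology, abox);
    }
}
```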

NCBO BioPortal BioPortal is a Web-based repository for health care and life science ontologies, containing a large collection of user-contributed, "working" ontologies covering a wide range of biomedical domains [83]. We use a snapshot of (publicly downloadable ontologies from) the BioPortal repository from November 2012, consisting of 292 OWL and OBO parseable ontologies. The average number of logical axioms in the corpus is 28,439 (total: 8,190,504; median: 979 axioms), and 89 of these ontologies contain named individuals. Four ontologies contained no logical axioms at all and were thus discarded. In terms of expressivity, the ontologies range from the inexpressive AL DL to the very expressive SROIQ DL. The BioPortal ontologies are developed and used in a wide range of largely unrelated projects, for a variety of purposes, and using a variety of tools. Many of these ontologies do not fall under our criteria (i.e., they are fairly trivial and non-challenging), so we restrict ourselves to a subset of BioPortal containing 13 ontologies for which at least one reasoner took one minute or more to classify.

9 According to NCIt developers, the intention is purely to produce "code" annotations for all terms in the NCIt.

Chapter 3

Impact Analysis

In this chapter we discuss key aspects of impact analysis in OWL ontologies, and determine, via a case study of the NCIt, desirable services to detect changes and their respective impact on ontologies.

3.1 Impact in OWL

When an ontology engineer makes a change to an ontology, the main consequence of interest is the effect of the change on the ontology, i.e., its impact. The key principle here is that not all change matters: for instance, two ontologies with the same axioms ordered differently would be considered distinct by a syntactic diff. Yet this is a change that does not alter the behaviour of the ontology; since the two contain the same axioms, they are logically equivalent. So we need to determine what constitutes a significant change, where "significant" is defined with respect to the documents under analysis: OWL ontologies. An OWL ontology specifies constraints, in the form of axioms, over terms. From these (asserted) axioms we can derive, via reasoning, entailed axioms of particular forms; one of the commonly used forms being atomic subsumptions computed as a result of classification. In the context of this thesis we restrict ourselves to impact analysis of logical axioms, since annotation axioms have no logical impact (from a DL point of view). Immediately we have here three dimensions of an ontology that can be subject to change: (1) asserted axioms, (2) entailed axioms, and (3) terms. In order to compute entailments one relies on reasoners, analogous to compilers in high-level programming languages. This introduces a fourth, insufficiently well-understood dimension: (4) reasoner performance. Identifying the impact of axiom changes on an ontology, its terms, and its entailments has recently received some attention from the ontology community [68, 58, 77, 69]. Generally though, current solutions to determining such impact leave users with possibly large, unstructured sets of changes, thus making change understanding rather challenging. Because of the various implementations of DL reasoners, with different calculi and optimisations, analysing dimension (4) thoroughly is non-trivial; it is difficult to predict how these systems will behave on arbitrary ontologies, especially since both reasoners and ontologies are evolving systems. Indeed, many an ontology developer will at some point come across such pathological cases as an ontology that classifies fast and, after a batch of (seemingly harmless) changes, ends up classifying significantly slower with a particular reasoner. Yet if one were to switch reasoners there might be no noticeable performance change between these ontology versions. This suggests that, unlike for dimensions (1)-(3), analysing axioms and terms alone does not give us sufficient information to fully understand reasoner behaviour; instead one should consider the ontology/reasoner combination.

3.2 NCIt Case Study

While there have been a number of analyses of the NCIt [15, 23, 32, 29, 81, 90, 94], these have all been of particular versions (typically either a known snapshot, such as the one described in [32], or the “latest” version). Existing, history-oriented comments merely give an indication of the overall rate of growth (for example, 700-900 “new entries” a month). Aside from this, there is little fine-grained information regarding the evolution of the NCIt. Given that the NCIt is a subject of much ontology engineering research,¹ and that its 112 temporally evenly-spaced versions form a unique resource for studying ontology evolution, it is surprising that there has been no prior in-depth analysis of its evolution, both in terms of logical content and reasoner performance.

While the NCIt is distributed with “concept change” logs for each release, these are intended for consumers of the terminology (i.e., those only concerned with the (hierarchy of) terms), not people interested in the ontological portions of the NCIt. Internally, the EVS keeps detailed “edit-based” change logs [24] which are deeply embedded in the development and quality assurance process (according to [24]); however, these are not public. Thus modellers who wish to understand the scope of the changes from version to version need to rely on post facto analysis.

The aim behind the NCIt study is to determine whether the NCIt is indeed a suitable corpus for change analysis (i.e., if there is significant change activity) and, if so, what “services” would be necessary and desirable to identify, explain, or cope with the kinds of impact discussed in Section 3.1.

¹ In addition to driving the development of the Protégé OWL plugin, it has been used in research on reasoning [63], ontology diff analysis [58], and modularity [22], to name but a few.

3.2.1 Methods and Materials

In this section we provide a detailed analysis of the 112 OWL versions of the NCIt, charting its change over the past 10 years. For this purpose, we collect the following data from each version: (1) parsing time, (2) expressivity, (3) number and type of axioms, (4) classification time, (5) asserted atomic subsumptions, (6) inferred atomic subsumptions (both transitive reduction and closure),² and finally (7) unsatisfiable concepts. The parsing times, item (1), should give us an idea of whether parsing OWL ontologies of such size as the NCIt is currently problematic, in terms of time or resource consumption. Data items (2) and (3) could be suggestive of the amount and type of change that occurs, for example, the introduction of new constructs or different types of axioms; this would give us a sense of change activity throughout the corpus. Item (4) is meant to show how well modern reasoners handle the NCIt corpus, and simultaneously reveal whether any major performance changes occur. Items (5) through (7) will allow us to determine whether meaningful inference or logical bugs occur in the NCIt corpus, thus warranting the need for some form of entailment change analysis. The reasoners used in this study are: FaCT++ v1.6.2, Pellet v2.3.1, HermiT v1.3.8, and JFact v1.0. Machine 1, as explained in Section 2.3.1, was used for all data gathering.

² The computation of inferred subsumptions is done according to the criteria set out in Section 2.1.2.
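To indicate how data such as item (4) can be gathered, the sketch below is a minimal illustrative harness, not the actual experimental code; it assumes the HermiT reasoner factory is on the classpath, and the other reasoners plug in analogously via their own OWLReasonerFactory implementations:

    import java.io.File;
    import org.semanticweb.HermiT.ReasonerFactory;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.*;

    public class ClassificationTimer {
        public static void main(String[] args) throws OWLOntologyCreationException {
            OWLOntologyManager man = OWLManager.createOWLOntologyManager();
            OWLOntology ont = man.loadOntologyFromOntologyDocument(new File(args[0]));
            // HermiT's factory; assumes the HermiT jar is on the classpath.
            OWLReasonerFactory rf = new ReasonerFactory();
            OWLReasoner reasoner = rf.createReasoner(ont);
            long start = System.nanoTime();
            reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY); // classification
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("Classified in " + elapsedMs + " ms");
            reasoner.dispose();
        }
    }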

3.2.2 Results

This section presents the analysis of the aforementioned gathered data, organised as follows: parsing times, asserted and inferred axioms, followed by reasoner performance over all NCIt versions.

3.2.2.1 Parsing Times

Given the size of the NCIt, and the complexity of parsing RDF/XML into OWL, parsing time has historically been an issue. Currently, however, it is no longer a major one. The parsing times across the NCIt corpus increase linearly with the number of axioms, and there is very little difference between the times obtained with the OWL API and with Protégé v4.1.³ For instance, an early version, v6, is parsed in 6.1 seconds using the OWL API (6.8 seconds in Protégé). Splitting the logical part from the non-logical part (annotations) into two distinct ontologies, the latter loads in 5 seconds in the OWL API (5.9 seconds in Protégé), and the former in 3 seconds (3.8 seconds in Protégé). So the logical part does not add much overhead to the parsing times of the NCIt; in fact, the non-logical part alone takes nearly as long to load as the whole ontology. On average, and considering versions as they are distributed (i.e., containing both logical and non-logical axioms), the parsing time of each NCIt version is 7.9 seconds, the median being 6.7 seconds and the maximum 12.9 seconds, in the latest (and biggest) version of the corpus.
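The kind of measurement behind these numbers can be sketched as follows (illustrative only; the file path is a placeholder), including the split of logical from non-logical axioms used for the v6 figures:

    import java.io.File;
    import java.util.HashSet;
    import java.util.Set;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;

    public class ParseTimer {
        public static void main(String[] args) throws OWLOntologyCreationException {
            OWLOntologyManager man = OWLManager.createOWLOntologyManager();
            long start = System.nanoTime();
            OWLOntology ont = man.loadOntologyFromOntologyDocument(new File("ncit.owl")); // placeholder path
            long ms = (System.nanoTime() - start) / 1_000_000;
            // Separate the logical from the non-logical (annotation) axioms, as in the
            // v6 experiment; each part can then be saved and re-timed as its own document.
            Set<OWLAxiom> logical = new HashSet<OWLAxiom>(ont.getLogicalAxioms());
            Set<OWLAxiom> nonLogical = new HashSet<OWLAxiom>(ont.getAxioms());
            nonLogical.removeAll(logical);
            System.out.println(ms + " ms to parse; " + logical.size() + " logical / "
                    + nonLogical.size() + " non-logical axioms");
        }
    }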

3.2.2.2 Asserted Axioms

There is a steady growth in terms of axioms over the NCIt corpus (see Figure 3.1), with a few exceptions. The most striking of these are: a drop in annotation and logical axioms from v16 to v17 (annotated as A in Figure 3.1),⁴ a drop in annotation axioms from v59 to v60 (C in Figure 3.1), and finally a rise in annotation axioms from v69 to v70 (D in Figure 3.1). The majority of axioms in each NCIt version are annotations, which make up, on average, 89% of each version over the whole corpus (the minimum being

³ Though Protégé takes some time to render the concept hierarchy, which was not broken out.
⁴ Chronologically, between v16 and v17 there are 2 NCIt releases that are unparseable using the OWL API, meaning that v17 was released 3 months after v16, and so one should expect a high number of changes between these versions.

[Figure 3.1 appears here: a line chart over versions v1-v112 with series Logical Axioms, Annotation Axioms and Total Axioms, and events A-D marked.]

Figure 3.1: Axiom growth of the NCIt, where annotation axioms dominate (x-axis: NCIt version, y-axis: number of axioms).

85% and the maximum 90%). The annotation axiom growth is roughly consistent across the corpus, aside from four cases: (a) a decrease of 71,025 annotations from v16 to v17 (A in Figure 3.1) and of 184,446 from v59 to v60 (C in Figure 3.1), and (b) an increase of 55,567 annotations from v27 to v28 (B in Figure 3.1) and of 86,500 from v69 to v70 (D in Figure 3.1). In terms of logical axioms, aside from a decrease of 27,134 axioms from v16 to v17 (A in Figure 3.1), they grow consistently throughout the versions of the NCIt. The expressivity of the underlying logic increases across the NCIt, except from v62 to v63, where the use of datatypes in role ranges is dropped. From v63 until the final version, v112, datatypes are no longer used. The DL underlying the latest NCIt version is SH. Until v14, inclusive, the majority of logical axioms are subsumptions, with the remainder being role domain and range axioms (see Figure 3.2). From v15 onwards some disjointness axioms start being used, typically around or below 200 of them. In v17, and all versions after that, there is a high number of equivalence axioms, starting at around 4,500 and increasing to around 12,500 in the latest version. Also in v17 we see the first appearance of sub-role axioms, 8 to be exact, which increase in number to 17 in v24. This count of sub-role axioms remains the same until the last version.

[Figure 3.2 appears here: two line charts over versions v1-v112. Panel (a): TBox axioms breakdown and total logical axioms (series: SubClass, EquivalentClass, DisjointClass, RBox Axioms, Total Logical Axioms). Panel (b): RBox axioms breakdown (series: SubObjectProperty, Object Property Domain, Object Property Range, Data Property Range).]

Figure 3.2: Breakdown of logical axioms occurring in the NCIt (x-axis: NCIt version, y-axis: number of axioms). Note that in Figure 3.2a role axioms are grouped together, and then broken down in Figure 3.2b.

3.2.2.3 Inferred Axioms

It is rumoured that, as part of the development process of the NCIt, all entailments computed by classifying an NCIt version Oi are subsequently asserted in Oi prior to its public release.⁵ In order to verify whether this is true, we classify each NCIt version and obtain the sets of asserted and inferred strict atomic subsumptions,⁶ both direct and indirect [11]. That is, we compute, respectively, the transitive reduction and transitive closure of each version’s concept hierarchy, and compare these with the asserted concept hierarchy. If we find that the inferred concept hierarchy (either reduction or closure) coincides with the asserted hierarchy, then we have reason to argue that inferred axioms are indeed asserted in that version. This can be done by verifying whether there are any purely inferred (i.e., not asserted) axioms. The gathered entailment counts are shown in Figure 3.3. Additionally, we compute the set of inferred equivalences, but these only occur in v16 (1), v58 (354), and v59 (355).
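The comparison just described boils down to computing the inferred atomic subsumptions and removing the asserted ones; a minimal OWL API sketch (passing true to getSubClasses yields the direct subsumptions, i.e., the transitive reduction, while false yields all descendants, i.e., the transitive closure):

    import java.util.HashSet;
    import java.util.Set;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class InferredSubsumptions {
        // Returns the purely inferred strict atomic subsumptions (direct if 'direct' is true).
        static Set<OWLAxiom> purelyInferred(OWLOntology ont, OWLReasoner r,
                                            OWLDataFactory df, boolean direct) {
            Set<OWLAxiom> inferred = new HashSet<OWLAxiom>();
            for (OWLClass sup : ont.getClassesInSignature()) {
                for (OWLClass sub : r.getSubClasses(sup, direct).getFlattened()) {
                    if (!sub.isOWLNothing()) {
                        inferred.add(df.getOWLSubClassOfAxiom(sub, sup));
                    }
                }
            }
            inferred.removeAll(ont.getAxioms(AxiomType.SUBCLASS_OF)); // keep only non-asserted ones
            return inferred;
        }
    }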

[Figure 3.3 appears here: two charts of asserted vs. inferred subsumption counts per version, with events A-D marked. Panel (a): transitive closure. Panel (b): transitive reduction.]

Figure 3.3: Entailment growth of NCIt: asserted and inferred entailment counts (x-axis: NCIt version, y-axis: number of axioms).

Aside from three cases in the corpus, the ratio of asserted to inferred entailments is roughly constant: on average 1:7 (minimum 1:6, maximum 1:8) in the transitive closure, and 1:0.2 (minimum 1:0, maximum 1:0.3) in the transitive reduction. The outlier cases are versions 16, 58 and 59, all of which contain a high number of unsatisfiable concepts (37,436, 21,819 and 21,866, respectively). The presence of unsatisfiable concepts explains the decrease in asserted subsumptions in all three versions. Consider the ontologies O1 := {A1 ⊑ A2, A2 ⊑ A3} and O2 := O1 ∪ {A3 ⊑ A4 ⊓ ¬A4}. From O1 we derive three atomic subsumptions, two of which are asserted: Con(O1) := O1 ∪ {A1 ⊑ A3}, while from O2 we derive three atomic subsumptions, none of them asserted: Con(O2) := {Ai ⊑ ⊥ | i ∈ {1..3}}. Since v16 has only 4,009 satisfiable concepts out of 41,535, this causes a large drop in inferred subsumptions when considering the transitive closure (A in Figure 3.3a); each unsatisfiable concept occurs in only a single entailment, of the form A ⊑ ⊥. Looking at the transitive reduction instead (C in Figure 3.3b), there is a rise in inferred subsumptions, with 37,436 out of the 37,560 entailments found being of the form A ⊑ ⊥.

Contrary to v16, both v58 and v59 exhibit a rise in inferred atomic subsumptions, in the transitive closure (B in Figure 3.3a) as well as in the transitive reduction (D in Figure 3.3b). This is due to the fact that these versions have a high number of inferred equivalence axioms (354 and 355, respectively), each containing on average 7 subconcepts. In turn, this causes a blow-up in inferred subsumptions. Take, for example, the ontologies O1 := {A ⊑ B, A ⊑ C ⊓ D} and O2 := O1 ∪ {B ⊑ A}. From O1 we deduce one asserted and two inferred axioms: Con(O1) := {A ⊑ B, A ⊑ C, A ⊑ D}, while from O2 we get two asserted axioms and the rest inferred: Con(O2) := {A ⊑ B, B ⊑ A, A ⊑ C, A ⊑ D, B ⊑ C, B ⊑ D}. So, in the concept hierarchy graph, as unary atomic concept-nodes are collapsed into n-ary nodes (containing equivalent concepts), this can potentially result in a blow-up of inferred atomic subsumptions, such as the one in v58 and v59.

Overall, in the first 15 NCIt versions there are no purely inferred atomic subsumptions in the transitive reduction (see Figure 3.3b), which could indicate that these axioms have indeed been asserted, or simply that there are none to infer in the first place. Note that in the transitive closure there are still purely inferred subsumptions, possibly because tools like Protégé typically compute the reduction rather than the closure. So while the rumour may have been true for these early versions, from v16 onwards it certainly does not hold that (all) inferred subsumptions are asserted, thus proving the rumour false with respect to the whole corpus.

⁵ Perhaps due to the popular misconception about reasoning being easier when one asserts all inferred subsumptions.
⁶ Note that the set of inferred axioms excludes those axioms which are already asserted.

3.2.2.4 Reasoner Performance

It is often the case that, for reasoner testing, only a few or even a single ontology version is used. Prior to our work, there had been no reported reasoner benchmark using a sequence of different versions of a single ontology. So, in the process of analysing the NCIt, we evaluate how modern reasoners handle all published OWL versions of the NCIt. The classification times of all versions are shown in Figure 3.4.

[Figure 3.4 appears here: classification times per NCIt version for Pellet, HermiT, FaCT++ and JFact, with events A-C marked.]

Figure 3.4: Reasoner performance across NCIt (x-axis: NCIt version, y-axis: time in seconds). Versions that could not be classified are not plotted.

Of the reasoners put to the test, FaCT++ is consistently faster than the rest, with its slowest performance being on v112: 74.6 seconds. JFact is unable to classify 18 of the 112 NCIt versions, throwing an IllegalArgumentException error (C in Figure 3.4). In versions v58 and v59 the classification times are much lower than in the preceding (v57) and succeeding (v60) versions (A in Figure 3.4). This is due to the fact that these versions have a high number of unsatisfiable concepts, meaning that many subsumption tests are avoided and thus reasoning is faster. There are also unsatisfiable concepts in v16, more so than in v58 and v59, though the classification time difference is not as noticeable (apart from the difference between v16 and v17), since the classification time is already low.

Note that, from v79 to v80, there is a big difference in HermiT’s classification time: it decreases from around 8 minutes to 1 minute (B in Figure 3.4). From our analysis there is no unusual difference in axiom or term counts, nor in expressivity, between these versions. Similarly peculiar is the increase in Pellet’s classification time from v95 onwards until the final version. These cases suggest that the problem lies in the reasoner rather than the ontology, and so it would be of interest to know which axioms cause it to perform much worse (or much better); this could not only help users find performance-degrading axioms, but also supply reasoner developers with a test case that could be used to profile and attempt to optimise or debug the reasoner.

3.3 Conclusions

From our analysis of the NCIt we were able to identify noteworthy events and trends, particularly (a) versions with unsatisfiable concepts (v16, v58, and v59) and (b) major reasoner performance variations (v79, v80). While normally NCIt versions are logically “well-behaved”, that is, they have no logical bugs, the versions specified in (a) deviate from that norm: they are incoherent. By examining the gathered data, particularly the relation between unsatisfiable concepts, entailments and reasoner performance, we were able to (at least superficially) justify some observed phenomena: for example, reasoners performing faster when the input is incoherent. And although this may suffice for certain goals, for example, confirming the structural growth or logical coherence of the ontology, it is not enough to explain why these events occurred in the first place. In such cases it would be desirable to determine precisely which changes between versions gave rise to the new entailments of the form A ⊑ ⊥, for some A, and also which other terms have been affected by the changes. For this purpose one would have to align asserted axiom changes with inferred axiom changes, as well as with term changes.

Based on the analyses carried out so far, the performance variation from v79 to v80 using HermiT remains unexplained, especially seeing as only one system is so sensitive to the changes between those versions. This suggests that some hard-for-HermiT axioms were removed, which made reasoning easier; to detect these, we would need a diff to compute the changes between these ontologies, as well as performance analytics services to pinpoint those axioms that might have caused the performance degradation.

From our analysis of the NCIt’s expressivity there is no indication that different logical constructs were used between those versions where major performance changes occurred. So, in order to get a better understanding of what changed (in terms of axioms and terms), and why those changes produced certain (in some cases undesirable) effects, such as those noted in (a) and (b), it becomes necessary to resort to more sophisticated and fine-grained analytical methods. Specifically, we highlight the following key services: (1) detection of axiom changes and characterisation of their effect on the ontology, its terms, and its entailments, and, if these changes produce any performance variation (significant or otherwise) for some reasoner, (2) determination of which axioms, or sets of axioms, are probable causes of that variation. The first service (1) would be of particular interest to help users determine and realise the (multi-dimensional) impact of changes between ontologies, while (2) would be useful to understand reasoner performance variations, such as those observed in the NCIt corpus.

Chapter 4

Related Work

There are a variety of services for detecting axiom and term changes between ontologies, typically split into syntactic and semantic diffs, though recently tools such as ContentCVS [57, 58] have started to perform both syntactic and semantic change detection. The key distinction between syntactic and semantic diffs is that syntactic diffs reveal the edit effort between ontologies, while semantic diffs reveal changes of meaning. In other words, semantic diffs strictly detect changes to the meaning of ontologies (i.e., their models), while syntactic diffs detect all changes regardless of whether they affect models or not. Both of these are discussed in Section 4.1.

In terms of reasoner performance “change” detection or prediction, despite the wide range of reasoner optimisations, and even some approximate reasoning techniques (whether sound and incomplete, or unsound and complete), little work has been done on the specific problem we address. Predicting the performance of a reasoner is a known hard problem, recently investigated in [62], though “simply” predicting reasoner performance helps very little, perhaps not at all, when it comes to coping with cases that are unmanageable (for instance, an ontology that takes days to classify). In Section 4.2 we discuss available reasoner performance analytics services.

4.1 Diffing Techniques

Existing diff approaches for OWL ontologies rely on either syntactic or semantic notions of equality, either between the axioms or the terms in an ontology. We start by discussing syntactic diff notions in Section 4.1.1, followed by semantic diffs in Section 4.1.2.

4.1.1 Syntactic

The following diff services rely partly or entirely on syntactic methods to compute ontology differences.

ContentCVS The version control system ContentCVS¹ uses the notion of structural difference to return the sets of structural additions and removals between two ontologies. No distinction is made regarding the logical effect of axiom changes. The axioms in the diff are then presented to the user via an alignment according to the term on the left-hand side of each axiom, while GCIs are grouped under their own label. The ContentCVS tool also performs semantic diff, which will be discussed in Section 4.1.2.

Bubastis Also based on structural difference, the diff tool Bubastis² [77] employs the same change alignment mechanism used within ContentCVS. Additionally, Bubastis presents differences with respect to terms, that is, whether a term has been added to or removed from the ontology’s signature between the two versions.

OWLDiff Similar to Bubastis and ContentCVS, the tool OWLDiff³ [69] also employs structural difference to detect changes. Subsequently, the changes are aligned in the same manner as in ContentCVS and Bubastis, i.e., according to the left-hand side term of each change.

The difference engine described in [84], available as a plugin for Protégé 4.2, computes the structural differences between ontologies, and addresses the problem of aligning renamed terms via a series of heuristics. For instance, the tool inspects annotation property values (such as rdfs:label) specific to each ontology to track a term name change, where the value of that property remains unchanged. Additionally, it explores relations between terms; for example, if a pair of terms t1 and t2 have the same parent and child, the tool pinpoints t2 as a renaming of t1. This plugin builds partially on the diff tool PROMPTDIFF [82], used in Protégé 3 (prior to OWL 2). PROMPTDIFF establishes its own notion of structural difference, based on a set of heuristic matchers; the combination of the results of these matchers produces a structural difference between ontologies. Since the release of OWL 2, this diff tool has not been updated.

¹ http://www.cs.ox.ac.uk/isg/tools/ContentCVS/
² http://www.ebi.ac.uk/efo/bubastis/
³ http://krizik.felk.cvut.cz/km/owldiff/

A different syntactic approach is that of an edit-based diff, wherein change logs are produced by the ontology editor being used, thereby capturing the history of change, as implemented in Swoop [61]. While this type of diff can log modification details (for example, the order in which changes are done, or undone) at a very fine level of granularity, it requires an ontology editor-specific implementation, and thus it does not allow post facto change detection unless change logs are kept during the modification of an ontology.

The diffs mentioned so far, apart from the last which is log-based, do not perform any fine-grained categorisation of reported changes, for instance, whether an axiom in O2 is the result of editing an axiom in O1. This forces the user to analyse each change in the diff, and determine whether it might be related to other changes. In other words, any form of axiom alignment is left up to the user to perform.

4.1.2 Semantic

In addition to the structural diff, ContentCVS also computes deductive differences between OWL 2 DL ontologies. As noted in Section 2.1.3, the problem of verifying whether an L-ontology is an L-deductive conservative extension of another is undecidable when L is SROIQ. So, in order to detect entailment differences, ContentCVS computes a sound, but incomplete approximation of diffΣ(O1, O2): by restricting itself to only considering finite entailment sets, ContentCVS ensures decidability. More precisely, the tool computes differences with respect to entailments η of the form η : A ⊑ C or η : r ⊑ s, where A is an atomic concept, r, s are atomic roles, and C is a concept built from the grammar Gcvs (where B is an atomic concept and r is an atomic role), as follows:

    Grammar Gcvs : C → B | ∃r.B | ∀r.B | ¬B

Grammar Gcvs is based on the designers’ intuitions of what might be an “interesting” yet finite set of entailments. After computing entailment differences, the tool presents them to the user, with the corresponding justifications available for inspection as well.
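The general shape of such a sound-but-incomplete approximation can be sketched as follows (this is an illustration of the idea, not ContentCVS’s actual implementation: candidate entailments are enumerated from the grammar and checked against both ontologies with a reasoner):

    import java.util.HashSet;
    import java.util.Set;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class GrammarDiff {

        // Candidate right-hand sides per grammar G_cvs: B | ∃r.B | ∀r.B | ¬B, built
        // here over o2's signature (a proper diff would restrict to a chosen Σ).
        static Set<OWLClassExpression> candidates(OWLOntology o, OWLDataFactory df) {
            Set<OWLClassExpression> cs = new HashSet<OWLClassExpression>();
            for (OWLClass b : o.getClassesInSignature()) {
                cs.add(b);
                cs.add(df.getOWLObjectComplementOf(b));
                for (OWLObjectProperty r : o.getObjectPropertiesInSignature()) {
                    cs.add(df.getOWLObjectSomeValuesFrom(r, b));
                    cs.add(df.getOWLObjectAllValuesFrom(r, b));
                }
            }
            return cs;
        }

        // Sound approximation of the added entailments: axioms A ⊑ C entailed by O2
        // but not by O1; r1 and r2 are reasoners over O1 and O2 respectively.
        static Set<OWLAxiom> addedEntailments(OWLOntology o2, OWLReasoner r1,
                                              OWLReasoner r2, OWLDataFactory df) {
            Set<OWLAxiom> diff = new HashSet<OWLAxiom>();
            for (OWLClass a : o2.getClassesInSignature()) {
                for (OWLClassExpression c : candidates(o2, df)) {
                    OWLAxiom eta = df.getOWLSubClassOfAxiom(a, c);
                    if (r2.isEntailed(eta) && !r1.isEntailed(eta)) {
                        diff.add(eta);
                    }
                }
            }
            return diff;
        }
    }

The candidate set makes the procedure decidable, but also quadratic in the signature size, which is one reason such approximations deliberately keep the grammar small.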

The diff system CEX⁴ [66, 67] computes deductive differences between acyclic ELHʳ terminologies (acyclic EL terminologies extended with role inclusions and range restrictions), i.e., it computes diffΣ(O1, O2) for O1, O2 acyclic ELHʳ terminologies. Note that deciding whether diffΣ(O1, O2) = ∅ is already 2ExpTime-complete for the basic description logic ALC, PTime for ELHʳ terminologies [65],⁵ and undecidable for SROIQ [75]. CEX implements a decision procedure for ATΣ(O1, O2), defined as follows:

Definition 2. Given ontologies O1 and O2, and signature Σ, the set of affected terms is defined as follows:

    ATΣ(O1, O2) = {A ∈ Σ | there exists A ⊑ C ∈ diffΣ(O1, O2) or C ⊑ A ∈ diffΣ(O1, O2)}

By restricting attention to changes to individual terms, so-called affected terms, CEX produces nicely interpretable and thus manageable diff reports. CEX gets interpretability both by focusing on changes to individual terms in themselves (instead of on coordinated changes to sets of terms) and by exploiting the directionality of its witness axioms. The output of CEX consists of the entailments in the deductive difference, together with the affected terms ATΣ(O1, O2). CEX divides ATΣ(O1, O2) into specialised terms, denoted ATᴸΣ(O1, O2), and generalised terms, denoted ATᴿΣ(O1, O2), depending on whether a term appears on the left- or right-hand side of a witness axiom, respectively (the same term may appear in both).
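Once the witness axioms in the deductive difference are available (however they were computed), splitting the affected terms per Definition 2 amounts to reading off the two sides of each witness. A sketch in Java/OWL API terms, assuming the diff set holds only witnesses of the forms A ⊑ C and C ⊑ A:

    import java.util.Set;
    import org.semanticweb.owlapi.model.*;

    public class AffectedTerms {
        // Splits affected terms into specialised (atomic LHS of a witness) and
        // generalised (atomic RHS of a witness); a term may land in both sets.
        static void split(Set<OWLSubClassOfAxiom> diff,
                          Set<OWLClass> specialised, Set<OWLClass> generalised) {
            for (OWLSubClassOfAxiom witness : diff) {
                if (!witness.getSubClass().isAnonymous()) {
                    specialised.add(witness.getSubClass().asOWLClass());
                }
                if (!witness.getSuperClass().isAnonymous()) {
                    generalised.add(witness.getSuperClass().asOWLClass());
                }
            }
        }
    }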

Observe that the semantic diffs discussed here completely ignore ineffectual changes, that is, they are not regarded as changes at all. This seems rather extreme, seeing as, despite their lack of logical impact, there was effort put into them, which should be appropriately shown.

⁴ http://cgi.csc.liv.ac.uk/~konev/software/
⁵ Though for the restricted case of acyclic ELHʳ terminologies.

4.2 Reasoner Performance

Tweezers A number of ontology profiling techniques are proposed and realised in the tool Tweezers for Pellet [102]. Tweezers allows users to investigate performance data such as the satisfiability checking time of each concept, the number of clashes encountered during satisfiability tests, and model depth and size. This reasoner-specific (Pellet-only) tool purely gathers and returns this information, relying on the user to apply it. Furthermore, no suggestions are given by the authors (or the tool) as to how one might go about exploiting the given data.

OCT Time and OCT Memory In [16] the author proposes techniques to identify potentially expensive “constructs” (i.e., concepts, roles or axioms) with regard to time (OCT Time) or space (OCT Memory) consumed. These techniques search for hard constructs by recursively splitting a given ontology in different manners, and individually testing performance (in terms of time and memory) over the partitions until suspect axioms are found. Recursion is done on the partition that is slowest for the input reasoner, and stops when all partitions are found to be “easy”. The tool then backtracks to the last step and marks every lost axiom in this step as a “suspect” axiom. These axioms are removed from the original ontology and, subsequently, each axiom is added back. If the performance over the altered ontology is acceptable then the axiom is kept; otherwise the axiom is removed and marked as an “expensive axiom”. A similar strategy is used for pinpointing expensive concepts and roles. While this does suggest a potential performance hot spot discovery mechanism (as performance-degrading constructs were indeed found), no guarantees are given with respect to the content or size of the hot spots; they may be too large to even consider removing, or contain unrelated constructs, thus complicating the maintenance of the ontology (i.e., the user may have to do without key axioms covering different topics of the ontology).
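The recursive splitting strategy can be sketched roughly as follows (a schematic reconstruction from the description above, not the actual OCT implementation; the Timer interface and the “easy” threshold are assumptions):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import org.semanticweb.owlapi.model.OWLAxiom;

    public class SplitSearch {

        // Assumed helper: builds an ontology from the axioms and times its classification.
        interface Timer { long classifyMillis(Set<OWLAxiom> axioms); }

        static final long EASY_MS = 1000; // assumed threshold below which a partition is "easy"

        // Recursively narrows down on the slowest partition; returns the suspect axioms.
        static Set<OWLAxiom> suspects(Set<OWLAxiom> axioms, Timer t) {
            if (t.classifyMillis(axioms) <= EASY_MS) {
                return new HashSet<OWLAxiom>(); // easy: nothing suspicious here
            }
            if (axioms.size() == 1) {
                return axioms; // a single slow axiom is itself a suspect
            }
            List<OWLAxiom> list = new ArrayList<OWLAxiom>(axioms);
            Set<OWLAxiom> left = new HashSet<OWLAxiom>(list.subList(0, list.size() / 2));
            Set<OWLAxiom> right = new HashSet<OWLAxiom>(list.subList(list.size() / 2, list.size()));
            long lt = t.classifyMillis(left), rt = t.classifyMillis(right);
            if (lt <= EASY_MS && rt <= EASY_MS) {
                // Both halves are easy but the whole is slow: the interaction lost by
                // splitting points at this set, so its axioms become the suspects.
                return axioms;
            }
            return suspects(lt >= rt ? left : right, t); // recurse on the slower half
        }
    }

Each suspect would then be removed and re-added individually, as described above, to confirm which ones are genuinely expensive.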

JustBench In [10] the authors present a form of OWL reasoner benchmark based on justifications. JustBench⁶ computes all justifications for entailments in a given ontology, and measures the performance of reasoners on those justifications. The goal of this tool is to identify justifications that are hard to classify, thus giving the user a form of “performance hot spot”, or at least an indicator for a hot spot. However, the results of their experiments fall short of supporting this goal, as most justifications are easily classifiable (typically within or around 1 millisecond for naturally-occurring ontologies). Additionally, as noted by the authors, “justification-based tests do not test scalability, nor do they test interactions between unrelated axioms, nor do they easily test non-entailment finding, nor do they test other global effects” [10].

⁶ https://code.google.com/p/justbench/

Pellint A “performance lint” dedicated specifically to the Pellet reasoner, Pellint⁷ [72] draws on the knowledge of the Pellet reasoner developers to generate a set of rules for what sorts of constructs and modelling patterns are likely to cause performance degradation when using Pellet; essentially, it is a Pellet-specific, ontology performance tuning expert system. Pellint not only identifies problem constructs, but suggests approximations (typically by weakening or rewriting axioms) which “should” improve performance. The kinds of problematic constructs identified, and (where possible) repaired by Pellint are:

• GCIs, for example, axioms of the form A ⊓ B ⊑ C. Pellint provides no repair strategy for such axioms.

• Hidden GCIs, for example, α1 : C ≡ A ⊓ B, α2 : C ⊑ ∃r.D. This is repaired by weakening α1 to a subsumption axiom.

• Existential explosion. This occurs when the reasoner creates a high number of new nodes in the completion graph, which is reported as performance-degrading if a certain, configurable number of nodes is reached. No repair is given for this kind of construct.

• Equivalence to a universal restriction, for example, C ≡ ∀r.A. Pellint repairs this axiom by weakening it into a subsumption.

Despite providing a “repair” facility in some cases, Pellint does not verify whether the repaired version of the ontology performs any better than the original. Aside from this, Pellint is tied to a particular reasoner, and given that reasoners vary in performance even for the same ontology, it seems unlikely that performance-degrading axioms identified by Pellint necessarily degrade performance in other reasoners.

⁷ The tool is integrated in the Pellet reasoner distribution, accessible at http://clarkparsia.com/pellet/.
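As an illustration of the style of repair Pellint applies (a sketch of the equivalence-weakening rewrite only, not Pellint’s actual code), one can rewrite each definition into its subsumption directions and keep only those with a named left-hand side:

    import java.util.HashSet;
    import java.util.Set;
    import org.semanticweb.owlapi.model.*;

    public class WeakenEquivalences {
        // Weakens each definition C ≡ D into subsumptions with a named left-hand
        // side, i.e., the C ⊑ D direction, the repair style applied to hidden GCIs
        // and universal-restriction definitions. (If both sides are named, both
        // directions survive; a real tool would pick one.)
        static Set<OWLAxiom> weaken(OWLOntology ont) {
            Set<OWLAxiom> rewritten = new HashSet<OWLAxiom>();
            for (OWLEquivalentClassesAxiom eq : ont.getAxioms(AxiomType.EQUIVALENT_CLASSES)) {
                for (OWLSubClassOfAxiom sub : eq.asOWLSubClassOfAxioms()) {
                    if (!sub.getSubClass().isAnonymous()) {
                        rewritten.add(sub);
                    }
                }
            }
            return rewritten;
        }
    }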

4.3 Discussion

The current state of the art in ontology diff services focuses on detecting changes of a particular type: either asserted or inferred axioms. This is typically followed by an alignment of changes according to the left-hand side of the axiom change; while in CEX and ContentCVS this alignment implies that the term aligned with the change is semantically affected, the same does not necessarily hold for syntactic diffs. It would be desirable, as noted in Section 3.3, to achieve an alignment between asserted changes and the terms these affect, as well as the inferred axiom changes associated with an asserted change. Currently no OWL diff tool produces an alignment that takes into account all three change dimensions. Furthermore, as briefly noted in Section 4.1, we expect that a fine-grained categorisation of (the impact of) changes would be helpful for change analysis, particularly for identifying relations between changed axioms (for example, pinpointing that an added axiom is a rewrite of two previously existing axioms), regardless of these being effectual or ineffectual; this is currently not done by any diff service.

In terms of reasoner performance analytics services, the solutions currently available are either tied to a particular reasoner (Pellint and Tweezers), or otherwise fall short of providing the necessary instrumentation to achieve a performance analysis service as outlined in Section 3.3. The ideal service should be dynamic, so that one can apply it to any given reasoner. OCT Time could potentially allow us to identify expensive axioms with any reasoner, but by its nature the technique driving this service does not attend to the logical “closeness” of the axioms pinpointed as expensive. It would be preferable that the set of expensive axioms returned be somehow logically connected, so as not to part with axioms that concern unrelated parts of the ontology (for example, an axiom concerning human anatomy and another describing plant life). Finally, note that of the approaches described in Section 4.2, only Pellint provides a form of repair mechanism aside from deletion. That is, when expensive axioms are found, Pellint might be able to alter them into a less performance-damaging form, though without any guarantee that the ontology will perform any faster. Ideally, rather than simply removing performance-degrading axioms, it would be preferable to provide approximations of those axioms, so as to reduce the amount of information removed from the ontology.

Chapter 5

Axiom-Centric Impact Analysis

In Chapter 3 we analysed the evolution of an ontology by taking into account synchronic metrics of each version. However, mere differences in, for example, axiom or term counts hardly constitute a reasonable measure of how much change occurred between versions. For instance, we could have two versions with matching axiom counts and misleadingly assume they are equivalent, when actually the same number of axioms was deleted and then added. Thus, in order to establish a sensible mechanism for “measuring change effort”, one must identify differences between ontologies, that is, axiom additions and removals. As described in Chapter 4, current axiom-based diff solutions lack a variety of desirable features, such as change impact analysis or alignment. In this chapter we tackle these problems by providing an axiom-based diff notion that performs a fine-grained categorisation (and alignment) of changes, according to certain relations identifiable by entailment and justification analysis.

5.1 Motivation

Despite recent advances in ontology diff research, diffs for OWL ontologies still lack various abilities that are common in diffs for text documents, in particular when it comes to categorising and aligning changes, as noted in Section 4.1; this is partly due to the fact that in regular text documents order matters, while in ontologies the order of axioms is (semantically) irrelevant. This is indeed a major flaw in current OWL diff algorithms: the failure to establish justifiable relations between changes, such as an axiom in O1 that is found to be in an entailment relation with one in O2, for example, O1 := {α : A ⊑ B} compared to O2 := {α′ : A ⊑ B ⊓ C}.

Without relying on a change log, such an ‘apparent edit operation’ cannot be identified using current diffs. A structural-equivalence-based diff would pinpoint α as a removal and α′ as an addition. Intuitively, however, we see a direct relation between these axioms, namely that {α′} ⊨ α (though {α} ⊭ α′), so we can immediately conclude that this constitutes a form of “strengthening” edit operation,¹ in the sense that there exists a unidirectional entailment relation between two axioms in the change set.

One of the key problems in ontology diffing is defining what constitutes a change of interest. Clearly certain types of change are irrelevant; take, for instance, two notationally distinct yet logically equivalent ontologies, encoded in Manchester syntax and OWL functional syntax, respectively: O1 := {Class: A SubClassOf: B} and O2 := {SubClassOf(A B)}. A character-based diff would flag these as radically different. This reason alone suffices to convey that (standard) textual diff algorithms perform poorly on OWL ontologies, by over-reporting change. Other types of irrelevant change include, for instance, the order of axioms in an ontology, the order of conjuncts and disjuncts in an axiom, or redundancies. For example, O1 := {A ⊑ B ⊓ C, C ⊑ ∃r.D} is equivalent to O2 := {C ⊑ ∃r.D, A ⊑ C ⊓ ∃r.D ⊓ B}. In order to overcome some of these issues, the OWL 2 specification defines a high-level notion of syntactic equivalence, called structural equivalence (and thus the associated notion of structural difference), which abstracts from the order-related, uninteresting types of change mentioned. As discussed in Chapter 4, several diff tools, namely ContentCVS, Bubastis and OWLDiff, exploit this equivalence relation to reveal axioms that have been added and removed between ontologies. However, simply distinguishing additions and removals via structural difference does not suffice to align edit operations, such as the strengthening operation discussed above between α and α′, or, for example, a rewrite such as D ≡ E changed to {D ⊑ E, E ⊑ D}. In other words, reverse-engineering a change tuple (source and target) is not possible without edit-based change logs. Additionally, structural difference does not determine whether a change has any logical impact, i.e., whether a modified (either added, removed or edited) axiom produces a change in the entailment set of either ontology. This must be determined by a semantic diff notion, that is, one that determines whether each change semantically strengthens or weakens an ontology. For instance, if there exists an axiom α ∈ O2 such that α ∉ O1 and O1 ⊭ α, we can say that α strengthens O2, i.e., it introduces new meaning, and thus is an effectual change. In case O1 ⊨ α, the addition of α to O2 produces no new logical effect, so α is an ineffectual change.

The division between effectual and ineffectual changes still leaves the user with possibly large and unstructured sets of changes, with no indication as to what kind of change each represents. A reasonable presentation of changes will cluster them according to relevant properties. The aim of these clusters is to capture the impact of each change, for example, whether it is a redundancy with respect to existing axioms, and to find alignments between (sets of) axioms.

¹ Note that we cannot tell whether this accurately reflects the ‘actual edit operation’; one user could have removed the weaker axiom and another user added a stronger one.

5.2 Overview

As mentioned in Chapter 1, our diff is designed to be log-free: this requires no modifications to editing tools, nor adherence to some log format, and it can be applied to any pair of ontologies, consecutive or not. The intuition behind our diff derives from editing behaviour in ontology engineering: given an ontology O1, one (1) builds an axiom and adds it to O1, or (2) edits an axiom, or (3) removes an axiom from O1, after which we have a new version O2. We refer to change types (1) and (3) as atomic changes, since they do not alter existing axioms. In other words, a linearisation of the change sequence yields, for a type (1) change α: O2 := O1 ∪ {α}, and for a type (3) change α: O2 := O1 \ {α}. For atomic changes, the first thing to determine is whether the modification produces any logical effect, i.e., whether the added (or removed) axiom changes the entailment set of O1 (or O2). This gives rise to a partitioning of atomic changes into four exhaustive categories: effectual and ineffectual additions and removals.

Changes of type (2) above are called compound changes, since they involve a pair of axiom sets (i.e., source and target). A linearisation of this modality yields, for a change α: O2 := (O1 \ Γ) ∪ {α}, where |Γ| ≥ 1. Based on the categorisation of atomic changes we can lay out all possible combinations of α and the axioms β ∈ Γ, and derive a more fine-grained categorisation of compound changes depending on whether α is, and Γ contains, an effectual or ineffectual change. Our categorisation of compound changes aims at instantiating all possible (meaningful) combinations, as outlined in Table 5.1.

Table 5.1: Compound change categories.

    Γ contains \ α is       Effectual Addition                   Ineffectual Addition
    Effectual Removal       Modified Definition                  (New) Redundant
    Ineffectual Removal     Strengthening, Modified Definition   Rewrite, Reshuffle

For a given compound change (Γ, α), we have that Γ puts α in exactly one category. Where there exist multiple Γ sets, α may end up in more than one category. The full categorisation is defined in Section 5.3, where definitions are symmetric (modulo category labels); that is, the categorisation of added axioms is symmetric to that used for removed axioms, the only difference being in the naming of categories. The diff categories are shown in Figures 5.1 and 5.2.

[Figure 5.1 appears here: a tree rooted at Addition, split into Effectual (Strengthening, Strengthening NT, Modified Definition, Modified Definition NT, New Description, Pure Addition, Pure Addition NT) and Ineffectual (Rewrite, Redundancy, Retrospective Redundancy, the latter subdivided into Reshuffle and New Retrospective Redundancy).]

Figure 5.1: Categorisation hierarchy of additions.

[Figure 5.2 appears here: the symmetric tree rooted at Removal, split into Effectual (Weakening, Weakening RT, Modified Definition, Modified Definition RT, Retired Description, Pure Removal, Pure Removal RT) and Ineffectual (Rewrite, Redundancy, Prospective Redundancy, the latter subdivided into Reshuffle and New Prospective Redundancy).]

Figure 5.2: Categorisation hierarchy of removals.

Observe that the effectual change categories include two modalities: those changes involving new (or retired) terms, and changes involving terms shared between both ontologies (i.e., terms in Õ1 ∩ Õ2). This duality of categories depending on the signature of axioms is illustrated in Table 5.2. We can see that there cannot be a strengthening which involves retired terms, or, analogously, a weakening involving new terms. Similarly, new or retired descriptions are based on new or retired terms, accordingly.

Table 5.2: Effectual change categories.

    Category                  Shared Terms   New Terms   Retired Terms
    Additions
      Strengthening           X              X           -
      Modified Definition     X              X           -
      New Description         -              X           -
      Pure Addition           X              X           -
    Removals
      Weakening               X              -           X
      Modified Definition     X              -           X
      Retired Description     -              -           X
      Pure Removal            X              -           X

5.3 Specification

In this section we define and exemplify the axiom-based diff introduced in Section 5.2. The first subsection (Section 5.3.1) contains the categorisation definitions, accompanied by brief examples. This is followed by a more comprehensive example walkthrough in Section 5.3.2.

5.3.1 Axiom Categorisation

In order to distinguish additions and removals between ontologies we use the notion of structural difference, which relies on OWL’s notion of structural equivalence (≡s; as defined in Section 2.1.4). The reason behind this is that structural difference abstracts from some obviously irrelevant changes, and is part of the OWL W3C standard. The sets of structurally added and removed axioms are obtained according to Definition 3.

Definition 3 (Structural Difference). The structural difference between ontologies O1 and O2 comprises the following sets:

    Additions(O1, O2) = {β ∈ O2 | there is no axiom α ∈ O1 s.t. α ≡s β}
    Removals(O1, O2) = {α ∈ O1 | there is no axiom β ∈ O2 s.t. α ≡s β}

According to Definition 3, if there exists an axiom β s.t. β ∈ Additions(O1, O2), this implies that β ∈ O2 \ O1, where ‘set containment’ is determined according to structural equivalence. Analogously, an axiom α ∈ Removals(O1, O2) means that α ∈ O1 \ O2.

From these sets alone we cannot tell whether the identified changes have any effect on entailments between the ontologies. In other words, that an axiom β is contained in Additions(O1, O2) does not mean it produces a change in semantics. So, on top of structural difference, we want to distinguish those additions (removals) which are not entailed by O1 (respectively, O2). Such changes are called logically effectual, in the sense that they alter semantics. Otherwise, if an addition (removal) is entailed by O1 (O2), it is called logically ineffectual. This distinction is specified in Definition 4.

Definition 4 (Logical Difference). The effectual and ineffectual changes between O1 and O2 are the following sets:

    EffAdds(O1, O2) = {β ∈ Additions(O1, O2) | O1 ⊭ β}
    EffRems(O1, O2) = {α ∈ Removals(O1, O2) | O2 ⊭ α}
    IneffAdds(O1, O2) = {β ∈ Additions(O1, O2) | O1 ⊨ β}
    IneffRems(O1, O2) = {α ∈ Removals(O1, O2) | O2 ⊨ α}

The resulting set of ineffectual additions IneffAdds(O1, O2) is composed of added axioms that do not change the set of entailments of O1, i.e., they have no logical effect. Conversely, the set of ineffectual removals IneffRems(O1, O2) contains removed axioms that do not alter the entailment set of O2. The effectual additions EffAdds(O1, O2) are added axioms that have logical impact with respect to O1. Analogously, the effectual removals EffRems(O1, O2) are removed axioms that alter the entailment set of O2.
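Definitions 3 and 4 translate quite directly into OWL API terms; the following sketch (illustrative: it approximates structural equivalence by the OWL API’s axiom equality, and uses a reasoner over O1 for the entailment checks) computes and partitions the additions:

    import java.util.HashSet;
    import java.util.Set;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class LogicalDiff {
        // Definition 3: structural additions, approximated by axiom-set difference.
        static Set<OWLAxiom> additions(OWLOntology o1, OWLOntology o2) {
            Set<OWLAxiom> adds = new HashSet<OWLAxiom>(o2.getLogicalAxioms());
            adds.removeAll(o1.getLogicalAxioms());
            return adds;
        }

        // Definition 4: split additions into effectual (O1 does not entail them)
        // and ineffectual; r1 is a reasoner over O1.
        static void partition(Set<OWLAxiom> additions, OWLReasoner r1,
                              Set<OWLAxiom> effectual, Set<OWLAxiom> ineffectual) {
            for (OWLAxiom beta : additions) {
                if (r1.isEntailed(beta)) ineffectual.add(beta);
                else effectual.add(beta);
            }
        }
    }

Removals and their partition into EffRems and IneffRems are computed symmetrically, with a reasoner over O2.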

5.3.1.1 Ineffectual Change Categorisation

As pointed out in Section 4.1.2, while semantic diffs would consider all ineffectual changes as irrelevant, these are still representative of some degree of engineering effort (even if, in some sense, vacuous). As such, these kinds of changes should be presented, or at least pointed out to users, so that they are aware that some change effort may have been in vain.

In order to help understand why ineffectual changes carry no logical effect, we designed a categorisation of those axioms based on their justifications. By examining the relation between each justification and the change set we can distinguish different kinds of behaviour; for example, one could have rewritten A ≡ B into {A ⊑ B, B ⊑ A}, or added an axiom that is entailed by shared axioms (and is therefore truly redundant). The ineffectual change categorisation is presented in Definition 5, and is followed by examples.

Definition 5 (Ineffectual change categorisation). Let Justs(O, β) be the set of justifications for an axiom β entailed by O.

An axiom β ∈ IneffAdds(O1, O2) is in one of the categories defined below if there exists a justification J ∈ Justs(O1, β) matching the specified criteria:

(1.1) β ∈ ReWrt(O1, O2), i.e., a rewrite, if J ∩ Removals(O1, O2) ≠ ∅ and {β} ⊨ J. If J ⊆ Removals(O1, O2) then β is a complete rewrite, otherwise β is a partial rewrite.

(1.2) β ∈ Red(O1, O2), i.e., an added redundancy, if J ⊆ (O1 ∩ O2).

(1.3) β ∈ RsRed(O1, O2), i.e., a retrospective redundancy, if J ∩ Removals(O1, O2) ≠ ∅ and {β} ⊭ J. Elements of this category are further divided into:

(1.3.1) β ∈ ReShuf(O1, O2), i.e., a reshuffle, if J ⊆ (O1 ∩ O2) ∪ IneffRems(O1, O2) and J ∩ IneffRems(O1, O2) ≠ ∅.

(1.3.2) β ∈ NewRed(O1, O2), i.e., a new redundancy, if J ∩ EffRems(O1, O2) ≠ ∅.

An axiom α ∈ IneffRems(O1, O2) is in one of the categories defined below if there exists a justification J ∈ Justs(O2, α) matching the specified criteria:

(2.1) α ∈ ReWrt(O1, O2), i.e., a rewrite, if J ∩ Additions(O1, O2) ≠ ∅ and {α} ⊨ J. If J ⊆ Additions(O1, O2) then α is a complete rewrite, otherwise α is a partial rewrite.

(2.2) α ∈ Red(O1, O2), i.e., a removed redundancy, if J ⊆ (O1 ∩ O2).

(2.3) α ∈ PsRed(O1, O2), i.e., a prospective redundancy, if J ∩ Additions(O1, O2) ≠ ∅ and {α} ⊭ J. Elements of this category are further divided into:

(2.3.1) α ∈ ReShuf(O1, O2), i.e., a reshuffle, if J ⊆ (O1 ∩ O2) ∪ IneffAdds(O1, O2) and J ∩ IneffAdds(O1, O2) ≠ ∅.

(2.3.2) α ∈ NewRed(O1, O2), i.e., a new redundancy, if J ∩ EffAdds(O1, O2) ≠ ∅.

Intuitively, an axiom α is considered a:

Rewrite if there exists a justification for α that is logically equivalent to the axiom itself, implying that the meaning of α is exactly the same as that of the justification.

Redundancy if there exists a justification for α consisting solely of shared axioms, meaning that the removal of α causes no change in entailments (i.e., α is truly vacuous).

Reshuffle if there exists a justification for α whose axioms contain no new (or lost) meaning with respect to O1 (respectively, O2), thus suggesting a sort of reshuffling of concepts in the ontology.

New redundancy if there exists a justification for α whose axioms contain new (or lost) meaning with respect to O1 (respectively, O2), implying that the justification for α conveys more meaning (in an entailment-changing way) than α.
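For a single justification, the category assignment of Definition 5 is a sequence of set tests. A sketch for ineffectual additions (the justification J, the diff partitions, and the precomputed check of whether {β} ⊨ J are passed in, since justification generation itself is a separate service and not part of this sketch):

    import java.util.Collections;
    import java.util.Set;
    import org.semanticweb.owlapi.model.OWLAxiom;

    public class IneffectualCategoriser {

        enum Category { REWRITE, REDUNDANCY, RESHUFFLE, NEW_REDUNDANCY }

        // Applies Definition 5, cases (1.1)-(1.3.2), to one justification J for an
        // ineffectual addition β. betaEntailsJ is the precomputed test of {β} ⊨ J.
        static Category categorise(Set<OWLAxiom> j, Set<OWLAxiom> removals,
                                   Set<OWLAxiom> effRems, Set<OWLAxiom> ineffRems,
                                   Set<OWLAxiom> shared, boolean betaEntailsJ) {
            if (!Collections.disjoint(j, removals) && betaEntailsJ) {
                return Category.REWRITE;                        // (1.1)
            }
            if (shared.containsAll(j)) {
                return Category.REDUNDANCY;                     // (1.2)
            }
            if (!Collections.disjoint(j, effRems)) {
                return Category.NEW_REDUNDANCY;                 // (1.3.2)
            }
            if (!Collections.disjoint(j, ineffRems)) {
                return Category.RESHUFFLE;                      // (1.3.1)
            }
            throw new IllegalStateException("unreachable by Lemma 1");
        }
    }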

In order to illustrate the intuition behind these categories, we briefly exemplify each of them below (in the same order as they are found in Definition 5):

Example 1. Each categorised axiom β below is an ineffectual addition, i.e., β ∈ O2 and β ∈ IneffAdds(O1, O2).

(1) O1 := {A ⊑ B, B ⊑ A}, O2 := {β : A ≡ B}; β is a complete rewrite of J := O1, since {β} ⊨ J, J ⊨ β, and J ⊆ Removals(O1, O2).

(2) O1 := {A ⊑ B ⊓ C}, O2 := O1 ∪ {β : A ⊑ C}; β is an added redundancy w.r.t. J := O1, as J ⊆ (O1 ∩ O2).

(3) Each of the following are examples of retrospective redundancies, in the sense that β ∈ O2 would be a redundancy in O1 if added to it.

(3.1) O1 := {A ⊑ B ⊓ C, A ⊑ C}, O2 := {A ⊑ C, β : A ⊑ B}; β is a reshuffle w.r.t. J := {A ⊑ B ⊓ C}, as the single axiom in J is an ineffectual removal, i.e., J ⊆ IneffRems(O1, O2).

(3.2) O1 := {A ⊑ ∃r.B ⊓ C}, O2 := {β : A ⊑ C}; β is a new retrospective redundancy w.r.t. J := O1, since the single axiom in J is an effectual removal.

For illustration purposes, in Example 1 we focused on axioms with a single justification; consider now the following example, where an axiom has more than one justification:

Example 2. Let O1 := {A ⊑ B ⊓ C, A ⊑ D, D ⊑ B}, and O2 := {β : A ⊑ B, A ⊑ D, D ⊑ B}, where β is an ineffectual addition, i.e., β ∈ IneffAdds(O1, O2), with justifications Justs(O1, β) := {J1, J2} listed below:

• J1 := {A ⊑ B ⊓ C} shows that β is a new retrospective redundancy.

• J2 := {A ⊑ D, D ⊑ B} indicates that β is an added redundancy.

So we have that axiom β falls under two categories, giving rise to an overlap between these categories.

As demonstrated in Example 2, the categories of ineffectual axioms may overlap. Note that these categories are exhaustive, in the sense that any ineffectual axiom is in at least one of the defined categories, as per Lemma 1.

Lemma 1. Given an axiom β ∈ IneffAdds(O1, O2) and the set of justifications Justs(O1, β), we have that:

1. β is in one of the major categories 1.1 to 1.3 (from Definition 5): ReWrt(O1, O2), Red(O1, O2), or RsRed(O1, O2), respectively.

2. If β ∈ RsRed(O1, O2) (category 1.3), then β is either in category 1.3.1 or 1.3.2: ReShuf(O1, O2) or NewRed(O1, O2), respectively.

Proof 1. Consider a single justification J ∈ Justs(O1, β).

1. If J ⊆ (O1 ∩ O2), then β is a redundancy: β ∈ Red(O1, O2). If J contains at least one removal (ineffectual or otherwise), i.e., J ∩ Removals(O1, O2) ≠ ∅, then β is either a retrospective redundancy, β ∈ RsRed(O1, O2), or, if {β} ⊨ J, a rewrite, β ∈ ReWrt(O1, O2).

2. If J ∩ EffRems(O1, O2) ≠ ∅ then β ∈ NewRed(O1, O2); otherwise, J ⊆ (O1 ∩ O2) ∪ IneffRems(O1, O2) and thus β ∈ ReShuf(O1, O2).

Observe that, from Lemma 1, each justification puts β into exactly one leaf category. Consequently, if |Justs(O1, β)| = 1 then β is in a single category.

5.3.1.2 Effectual Change Categorisation

Having specified a categorisation of ineffectual changes, we now focus on effectual changes. Since these changes are only entailed by the ontology where the respective axioms are asserted, we no longer have justifications with which to align effectual changes. So in our effectual change categorisation we rely on entailment relations, as well as signature analysis. This categorisation is specified in Definition 6.

Definition 6 (Effectual change categorisation). An axiom β ∈ EffAdds(O1, O2) is categorised as follows:

(1.1) If β̃ ⊆ Õ1, β is an element of:

(1.1.1) Strgth(O1, O2), i.e., a strengthening, if there exists a non-tautological axiom α ∈ O1 s.t. {β} ⊨ α.

(1.1.2) ModDef(O1, O2), i.e., a modified definition, if β is of the form C ≡ D, where C ∈ NC, and there exists an axiom α ∈ Removals(O1, O2) of the form C ≡ D′ s.t. ∅ ⊨ D ⊑ D′.

(1.1.3) PrAdd(O1, O2), i.e., a pure addition, otherwise; i.e., if β ∉ Strgth(O1, O2) ∪ ModDef(O1, O2).

(1.2) If β̃ ⊈ Õ1, let T := β̃ \ Õ1 be the set of new terms. Then β is an element of:

(1.2.1) StrgthNT(O1, O2), i.e., a strengthening with new terms, if there exists a non-tautological axiom α ∈ O1 s.t. {β} ⊨ α.

(1.2.2) NewDesc(O1, O2), i.e., a new description, if ⊥-mod(T, {β}) ≠ ∅.

(1.2.3) ModDefNT(O1, O2), i.e., a modified definition with new terms, if β is of the form C ≡ D, where C ∈ NC, and there exists an axiom α ∈ Removals(O1, O2) of the form C ≡ D′ s.t. ∅ ⊨ D ⊑ D′.

(1.2.4) PrAddNT(O1, O2), i.e., a pure addition with new terms, otherwise; that is, if β ∉ StrgthNT(O1, O2) ∪ ModDefNT(O1, O2) ∪ NewDesc(O1, O2).

Symmetrically (modulo category labels and names), an axiom α ∈ EffRems(O1, O2) is categorised as follows:

(2.1) If α̃ ⊆ Õ2, α is an element of:

(2.1.1) Weakng(O1, O2), i.e., a weakening, if there exists a non-tautological axiom β ∈ O2 s.t. {α} ⊨ β.

(2.1.2) ModDef(O1, O2), i.e., a modified definition, if α is of the form C ≡ D, where C ∈ NC, and there exists an axiom β ∈ Additions(O1, O2) of the form C ≡ D′ s.t. ∅ ⊨ D′ ⊑ D.

(2.1.3) PrRem(O1, O2), i.e., a pure removal, otherwise; i.e., if α ∉ Weakng(O1, O2) ∪ ModDef(O1, O2).

(2.2) If α̃ ⊈ Õ2, let T := α̃ \ Õ2 be the set of retired terms. Then α is an element of:

(2.2.1) WeakngRT(O1, O2), i.e., a weakening with retired terms, if there exists a non-tautological axiom β ∈ O2 s.t. {α} ⊨ β.

(2.2.2) RetDesc(O1, O2), i.e., a retired description, if ⊥-mod(T, {α}) ≠ ∅.

(2.2.3) ModDefRT(O1, O2), i.e., a modified definition with retired terms, if α is of the form C ≡ D, where C ∈ NC, and there exists an axiom β ∈ Additions(O1, O2) of the form C ≡ D′ s.t. ∅ ⊨ D′ ⊑ D.

(2.2.4) PrRemRT(O1, O2), i.e., a pure removal with retired terms, otherwise; i.e., if α ∉ WeakngRT(O1, O2) ∪ ModDefRT(O1, O2) ∪ RetDesc(O1, O2).

Intuitively, an axiom α is considered a:

Strengthening if there exists an axiom β such that α entails β (where β is not a tautology) but not the other way around, that is, α is a more constraining (i.e. stronger) version of β. 5.3 Specification 62

Modified definition if α is an equivalence axiom, and there exists another equivalence axiom β where one of the sides of the equivalence is the same, but the other side contains more constraints in α than it does in β.

New or retired description if α describes a new (or retired) term in O2 (respectively in O1).

Pure addition or removal if α does not fall under any other category above.
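Checking the entailment condition behind the strengthening (and, symmetrically, weakening) categories is a small reasoning task over a single axiom. A sketch, under the assumption that a tautology is detected as an entailment of a freshly created empty ontology, and noting that some axiom types may be unsupported by a given reasoner’s entailment checking:

    import java.util.Collections;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;
    import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

    public class StrengtheningCheck {
        // True if {beta} ⊨ alpha for some non-tautological alpha ∈ O1
        // (Definition 6, cases (1.1.1) and (1.2.1)).
        static boolean isStrengthening(OWLAxiom beta, OWLOntology o1, OWLReasonerFactory rf)
                throws OWLOntologyCreationException {
            OWLOntologyManager man = OWLManager.createOWLOntologyManager();
            OWLReasoner empty = rf.createReasoner(man.createOntology()); // tautology checks
            OWLReasoner betaOnly = rf.createReasoner(
                    man.createOntology(Collections.<OWLAxiom>singleton(beta)));
            try {
                for (OWLAxiom alpha : o1.getLogicalAxioms()) {
                    if (!empty.isEntailed(alpha) && betaOnly.isEntailed(alpha)) {
                        return true;
                    }
                }
                return false;
            } finally {
                empty.dispose();
                betaOnly.dispose();
            }
        }
    }

The weakening check of Definition 6 (2.1.1)/(2.2.1) is the mirror image, reasoning over {α} and scanning the axioms of O2.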

In Example 3 we demonstrate each effectual addition category, in the same order as found in Definition 6.

Example 3. Each categorised axiom β below is an effectual addition, i.e., β ∈ O2 and β ∈ EffAdds(O1, O2).

(1) β with β̃ ⊆ Õ1:

(1.1) O1 := {α : A ⊑ C, B ⊑ ∃r.⊤}, O2 := {β : A ⊑ (∃r.B) ⊓ C}; β is a strengthening of α, as {β} ⊨ α.

(1.2) O1 := {α : A ≡ B, C ⊑ D}, O2 := {β : A ≡ B ⊓ C}; β is a modified definition of α, since α and β both have A on one side of the definition, and ∅ ⊨ B ⊓ C ⊑ B.

(1.3) O1 := {A ⊑ ∃r.⊤, B ⊑ C}, O2 := {β : ∃r.B ⊑ ∃r.C}; β is a pure addition, as it is neither a strengthening nor a modified definition of any axiom in O1.

(2) β with β̃ ⊈ Õ1:

(2.1) O1 := {α : A ⊑ C}, O2 := {β : A ⊑ C ⊓ D}; β is a strengthening with new terms of α, as {β} ⊨ α and β contains the term D ∉ Õ1.

(2.2) O1 := {B ⊑ ∃r.⊤}, O2 := {β : A ⊑ ∃r.B}; β is a new description, with A being the new term, since ⊥-mod({A}, {β}) = {β}.

(2.3) O1 := {α : A ≡ B}, O2 := {β : A ≡ B ⊓ C}; β is a modified definition with new terms of α, since α and β share A on one side of the definition, ∅ ⊨ B ⊓ C ⊑ B, and C ∉ Õ1.

(2.4) O1 := {A ⊑ ∃r.⊤}, O2 := {β : ∃r.B ⊑ ∃r.C}; β is a pure addition with new terms, as {B, C} ⊈ Õ1 and β is not a strengthening, modified definition or new description.

Modified definition axioms, i.e., those in ModDef(O1, O2), are not categorised as strengthenings because, unlike the case of a strengthened axiom, a modified definition is strengthened in “one direction” but weakened in the other. Take, for example, η : A ≡ B and η′ : A ≡ B ⊓ C: if we rewrite these definitions into dual subsumptions we have η := {η1 : A ⊑ B, η2 : B ⊑ A} and η′ := {η′1 : A ⊑ B ⊓ C, η′2 : B ⊓ C ⊑ A}. On the one hand, we have that {η′1} ⊨ η1, and therefore η′1 would be a strengthening of η1. On the other hand, we have that {η2} ⊨ η′2, which means η′2 is a weakening of η2. Thus categorising η′ as a strengthening of η would be logically inaccurate.

5.3.2 Example Walkthrough

In order to demonstrate the potential usefulness of the devised change categori- sation, this section contains a simple example walkthrough on the comparison of two ontologies O1 and O2 as defined in Table 5.3.

Table 5.3: Example ontologies O1 and O2.

O1 O2

α1 : A v C β1 : A v B t C β9 : G v ∃s.H u H α2 : B v C β2 : A v B β10 : D v F u ∃p.> α3 : E ≡ D β3 : B v C β11 : F v G u I α4 : D v F β4 : B v K β12 : K v ∃r.F α5 : F v G β5 : E v D β13 : D v F u ∃s.A α6 : G v H u ∃s.H β6 : D v E α7 : F v I β7 : E v B t ∃r.C α8 : F v G u I u J β8 : D v E t G

From O1 and O2 we have the following structural differences:

 Additions(O1, O2) = {β1, β2, β4, β5, β6, β7, β8, β10, β11, β12, β13}

 Removals(O1, O2) = {α1, α3, α4, α5, α7, α8}

O 1 ∩ O2 = {α2 ≡s β3, α6 ≡s β9}

Note that α6 is not syntactically equal to β9 (α6 6= β9), however they are structurally equivalent (α6 ≡s β9). Therefore these axioms are not reported as changes. Given the sets of structural additions and removals from O1 to O2 5.3 Specification 64 we now distinguish between effectual and ineffectual changes, resulting in the following:

 EffAdds(O1, O2) = {β2, β4, β7, β10, β12, β13}

 EffRems(O1, O2) = {α8}

 IneffAdds(O1, O2) = {β1, β5, β6, β8, β11}

 IneffRems(O1, O2) = {α1, α3, α4, α5, α7}

There are several ineffectual changes in the change set, while effectual changes are mostly additions (and a single removal). The changes are categorised in Tables 5.4 (removals) and 5.5 (additions).

Table 5.4: Categorisation of removals in diff(O1, O2).

Type Category Axiom(s) Alignment

ReWrt(O1, O2) α3 : E ≡ D {β5 : E v D, β6 : D v E}

α1 : A v C {β2 : A v B, β3 : B v C}

NewRed(O1, O2) {β13 : D v F u ∃s.A} α4 : D v F {β10 : D v F u ∃p.>}

α1 : A v C {β1 : A v B t C, β3 : B v C} Ineffectual

ReShuf(O1, O2) α5 : F v G {β11 : F v G u I} α7 : F v I

WeakngRT(O1, O2) α8 : F v G u I u J {β11 : F v G u I} Effectual

In the removals set (Table 5.4) there is only one effectual axiom: α8, which represents a weakening of β11 with retired terms (J is not mentioned in O2). This is an example of a removal that might be worth revisiting, since its meaning has been weakened. The remaining changes are ineffectual, for example, axiom

α3 is a rewrite of axioms {β5, β6} as they are logically equivalent. Note that the existence of a rewritten axiom from O1 to O2 does not imply that the same holds in the opposite direction, since we find alignments for individual axioms 5.3 Specification 65

Table 5.5: Categorisation of additions in diff(O1, O2).

Type Category Axiom(s) Alignment

ReWrt(O1, O2) β11 : F v G u I {α5 : F v G, α7 : F v I}

β11 : F v G u I {α8 : F v G u I u J}

β8 : D v E t G {α4 : D v F, α8 : F v G u I u J}

β1 : A v B t C {α1 : A v C} NewRed(O1, O2)

β5 : E v D

Ineffectual {α3 : E ≡ D} β6 : D v E

{α3 : E ≡ D}, β8 : D v E t G {α4 : D v F, α5 : F v G}

Strgth(O1, O2) β13 : D v F u ∃s.A {α4 : D v F }

StrgthNT(O1, O2) β10 : D v F u ∃p.> {α4 : D v F }

NewDesc(O1, O2) β12 : K v ∃r.F -

PrAdd(O1, O2) β2 : A v B - Effectual

β7 : E v B t ∃r.C - PrAddNT(O1, O2)

β4 : B v K -

rather than sets of axioms, for example, β5 is involved in a rewrite but is not a rewritten axiom itself, though α3 is. This is applicable to all categories. Consider now axiom α1: it falls under the reshuffle and new prospective redundancy cat- egories. One justification for α1 is J1 = {β2, β3}, which indicates that not only the meaning of α1 is preserved, but also “new meaning” has been added (since

β2 ∈ EffAdds(O1, O2)). A second justification, J2 = {β1, β3}, gets α1 categorised as a reshuffle, seeing as the axiom is aligned with shared and ineffectual axioms. Moving on to the additions set (Table 5.5), consider the ineffectual addition

β11: it is a rewrite of {α7, α5}, as well as a new retrospective redundancy, due to

α8. The pure additions appear to be adjustments to the concept hierarchy, some associated with new terms in O2. Both axioms β10 and β13 are strengthenings of α4, which suggests they could be merged, especially since there exists some 5.4 Implementation 66

intra-axiom redundancy. Finally there is a new term K in O2 being described via axiom β12. Generally speaking, with such a categorisation it is possible to understand axiom changes, their impact, and the reasons for such impact. Even the often dismissed ineffectual changes reveal useful information about the changes between

O1 and O2, for example, that the meaning of the axiom α4 is “strengthened” in two distinct, yet partially superfluous ways, i.e., via β10 and β13. Similarly we discover that axiom β11 is weakened, from α8, which should be reconsidered as we now have that O2 6|= F v J (J becoming a retired term).

5.4 Implementation

The algorithms to compute the diff are shown in Section 5.4.1, and heavily rely on decision procedures for entailments [51], justification finding [60], and module extraction algorithms [18]. The algorithms are described in a symmetric way, so in order to compute the additions and removals between two ontologies one would perform diff(O1, O2) and then diff(O2, O1). This applies to all algorithms in this chapter.

5.4.1 Algorithms

The computation of structural differences between two ontologies is described in Algorithm1.

Algorithm 1 computeStructuralDifferences

Input: Ontologies O1, O2 Output: Set of added axioms

1: Additions(O1, O2) ← ∅ 2: for logical axioms α ∈ O2 do 3: if there is no β ∈ O1 s.t. α ≡s β then 4: Add α to Additions(O1, O2) 5: end if 6: end for 7: return Additions(O1, O2)

Having the results of structural difference in hand, one can then distinguish between effectual and ineffectual changes according to Algorithm2. 5.4 Implementation 67

Algorithm 2 computeLogicalDifferences

Input: Ontology O1, set of axioms Additions(O1, O2) Output: Sets of effectual and ineffectual additions

1: EffAdds(O1, O2) ← ∅, IneffAdds(O1, O2) ← ∅ 2: for axioms α ∈ Additions(O1, O2) do 3: if O1 6|= α then 4: Add α to EffAdds(O1, O2) 5: else 6: Add α to IneffAdds(O1, O2) 7: end if 8: end for 9: return hEffAdds(O1, O2), IneffAdds(O1, O2)i

Now with the division of effectual and ineffectual changes we can apply the categorisation of the former according to Algorithm3, and of the latter via Al- gorithm4. Algorithm2 and the ‘compute justifications’ step of Algorithm4 are im- plemented both as specified (i.e., sequentially), as well as concurrently. The concurrent version of Algorithm2 is straightforward; the computation of hEffAdds(O1, O2), IneffAdds(O1, O2)i and hEffRems(O1, O2), IneffRems(O1, O2)i is done on two separate threads, which may or may not be executed on different cores depending on whether the processor has multiple cores. The concurrent version of the ‘compute justifications’ step of Algorithm4 is implemented using a divide-and-conquer mechanism, provided by the fork-join framework in Java 7. Using this approach, each (small) set of axioms, from the set of all axioms to be categorised, is processed (sequentially) in a separate thread, and the final results aggregated once all threads have terminated. Algorithm2 could also be implemented in this way, but the performance gain is little to none, since the operation is already reasonably fast for big ontologies.

5.4.2 Preliminary Evaluation

We start by determining how well our diff implementations perform. Table 5.6 shows a breakdown of operation times per comparison, including timings of se- quential versus concurrent implementations of the algorithms in Section 5.4.1. Based on the run times of diffing every consecutive pair of NCIt versions, our fastest implementation takes on average around 1 minute to compare two NCIt 5.4 Implementation 68

Algorithm 3 categoriseEffectualChanges

Input: Ontologies O1, O2, set of axioms EffAdds(O1, O2)

1: Strgth(O1, O2) ← ∅, StrgthNT(O1, O2) ← ∅ 2: ModDef(O1, O2) ← ∅, ModDefNT(O1, O2) ← ∅ 3: PrAdd(O1, O2) ← ∅, PrAddNT(O1, O2) ← ∅ 4: NewDesc(O1, O2) ← ∅ 5: for axioms α ∈ EffAdds(O1, O2) do 6: categorised = false 7: for axioms β ∈ IneffRems(O1, O2) ∪(O1 ∩ O2) do 8: if {α} |= β and β is not a tautology then 9: if αe ⊆ Oe1 then 10: Add α to Strgth(O1, O2) 11: else 12: Add α to StrgthNT(O1, O2) 13: end if 14: categorised = true 15: end if 16: end for 17: if α is of the form C ≡ D where C is atomic then 18: for axioms β ∈ >⊥*-mod(α,e O1) do . reducing search space 19: if β is of the form C ≡ E and ∅ |= E v D then 20: if αe ⊆ Oe1 then 21: Add α to ModDef(O1, O2) 22: else 23: Add α to ModDefNT(O1, O2) 24: end if 25: categorised = true 26: end if 27: end for 28: end if 29: if ⊥-mod(Oe1 \ α,e {α}) 6= ∅ then 30: Add α to NewDesc(O1, O2) 31: categorised = true 32: end if 33: if categorised == false then 34: if αe ⊆ Oe1 then 35: Add α to PrAdd(O1, O2) 36: else 37: Add α to PrAddNT(O1, O2) 38: end if 39: end if 40: end for 5.4 Implementation 69

Algorithm 4 categoriseIneffectualChanges

Input: Ontologies O1, O2, set of axioms IneffAdds(O1, O2)

1: Red(O1, O2) ← ∅, ReWrt(O1, O2) ← ∅ 2: ReShuf(O1, O2) ← ∅, NewRed(O1, O2) ← ∅ 3: for axioms α ∈ IneffAdds(O1, O2) do 4: for justifications J ∈ Justs(O1, α) do . compute justifications 5: if {α} |= J and J ∩ Removals(O1, O2) 6= ∅ then 6: Add α to ReWrt(O1, O2) 7: if J ⊆ Removals(O1, O2) then 8: α is a complete rewrite 9: else 10: α is a partial rewrite 11: end if 12: end if 13: if J ⊆ O1 ∩ O2 then 14: Add α to Red(O1, O2) 15: end if 16: if J ⊆ (O1 ∩ O2) ∪ IneffRems(O1, O2) then 17: Add α to ReShuf(O1, O2) 18: end if 19: if J ∩ EffRems(O1, O2) 6= ∅ then 20: Add α to NewRed(O1, O2) 21: end if 22: end for 23: end for versions, using Machine 1 (see Figure 5.3). Over half of that time is typically spent computing justifications, the most time-consuming subroutine. Note that the average time per input axiom in Table 5.6 is calculated with respect to the input of the operation, for example, in structural diff: all axioms, in logical diff: structural changes, and so on. Regarding the change categorisation, specifically the categorisation of com- pound changes: one could potentially come up with additional sub-categories that might be interesting. In particular, we investigated the possibility of distin- guishing, among those new prospective or retrospective redundancies, between those whose (laconic) justification was truly new or not. That is, extending Defi- nition5 with the following sub-categories of 3.2. Let Laconic(O, β) be the set of laconic justifications of O for an ineffectual addition axiom β, then:

0 (3.2.1) β ∈ Nov(O1, O2), i.e. a novel redundancy, if there exists a J ∈ 5.4 Implementation 70

Table 5.6: Operation times per comparison throughout the NCIt (in seconds).

Average time Operation Average time per input axiom Structural diff 0.24 0.0000023 Logical diff [sequential] 5.1 0.005 Logical diff [concurrent] 4.05 0.004 Effectual additions categorisation 6.28 0.006 Effectual removals categorisation 4.07 0.02 Ineffectual additions categorisation 0.27 0.003 ‘compute justifications’ [sequential] 35.03 0.54 ‘compute justifications’ [concurrent] 15.59 0.41 Ineffectual removals categorisation 0.52 0.002 ‘compute justifications’ [sequential] 86.93 0.39 ‘compute justifications’ [concurrent] 26.13 0.17 Total [sequential] 137.43 0.014 Total [concurrent] 57.14 0.0006

0.24, 1%

4.05, 7%

Structural Diff 10.34, 18% Logical Diff Effectual Change Categorisaon 0.79, 1% Ineffectual Change Categorisaon Jusficaon Finding 41.72, 73%

Figure 5.3: Diff operation times per NCIt comparison.

0 00 00 Laconic(J , β) s.t. J 6≡ J for all J ∈ Laconic(O2, β).

(3.2.2) β ∈ PsdNov(O1, O2), i.e. a pseudo-novel redundancy, otherwise; i.e., if

β∈ / Nov(O1, O2). 5.5 Empirical Evaluation 71

Below are examples of retrospective novel and pseudo-novel redundancies, where their justifications have a non-empty intersection with effectual removals:

(3.2.1) O1 := {A v B,B v C}, O2 := {B v C, β : A v C}; β is an added novel

retrospective redundancy w.r.t. J := O1, as {A v B} ∈ J is an effectual removal and Laconic(J , β) contains no justification that is equivalent to

another justification in Laconic(O2, β).

(3.2.2) O1 := {A v C u ∃r.B u D,D v C}, O2 := {β : A v C}; β is a pseudo-novel retrospective redundancy w.r.t. J := {A v C u ∃r.B u D}, since the single axiom in J is an effectual removal, and Laconic(J , β) contains a laconic justification J 0 := A v C that is equivalent to the laconic

justification {A v C} ∈ Laconic(O2, β).

Now consider an alternative, logically equivalent way to encode O1 as defined in (3.2.2) above: O1 := {A v ∃r.B u D,D v C}. Now β ∈ O2 := {β : A v C} is categorised as a novel retrospective redundancy rather than pseudo- novel, because the entailment is derived from two subsumptions instead of one. Cases like these can complicate the analysis of such changes, particularly without detailed knowledge of the underlying categorisation mechanism. Aside from the cognitive overhead, and the high sensitivity to the syntactic form of axioms, this extension also introduces performance problems. Finding la- conic justifications takes, on average, 73.89 seconds for ineffectual removals, and 61.35 seconds for ineffectual additions throughout the NCIt corpus (see Figure 5.4). By including the distinction between novel and pseudo-novel redundan- cies, the average diff time is over three times slower at 192.37 seconds per NCIt comparison. While such finer-grained categorisation might be of interest for proficient users, the marginal gain for the average user may well be non-existent, perhaps even generate confusion. So, given the issues pointed out regarding this potential extension, we decided to not include it in the diff and empirical evaluation of our approach.

5.5 Empirical Evaluation

In the previous section we verified the computational feasibility of our diff im- plementation. Now with the case study in Section 5.5.1 we aim at determining 5.5 Empirical Evaluation 72

0.24, 0% 4.05, 2% 10.34, 5% 0.79, 1%

Structural Diff

41.72, 22% Logical Diff Effectual Change Categorisaon Ineffectual Change Categorisaon Jusficaon Finding 135.23, 70% Laconic Jusficaon Finding

Figure 5.4: Diff operation times including laconic justification finding. whether the devised categories are realised throughout ontology corpora. Addi- tionally, in Section 5.5.2 we demonstrate how our categorisation can be used to support change analysis via a walkthrough of a particular comparison instance, for example, by allowing users to focus on changes which have a particular kind of impact (as opposed to having to browse through an unstructured change set while figuring out what effect each change has), we can reduce the cognitive load associated with change analysis.

5.5.1 Case Study

In order to evaluate our diff method, we carry out a study of the NCIt corpus where all pairwise, consecutive diffs between NCIt versions are extracted. Based on this we show that all categories are instantiated throughout the corpus.

5.5.1.1 Coarse-Grained Change Analysis

Overall there are more additions than removals throughout the NCIt corpus, with an average of 62% additions versus 38% removals (see Table 5.7), which is hardly surprising for a constantly evolving ontology. Throughout the corpus 89% of changes are effectual, while the remaining 11% are ineffectual. There are, however, cases where the number of effectual and ineffectual changes is more balanced, such as d23 where 52% of changes are ineffectual, or d26, d28 and d29, 5.5 Empirical Evaluation 73 with 48% ineffectual changes (see Figures 5.5 and 5.6). A few diffs consist mostly of effectual changes, for instance d3 and d107 where 99.7% of changes are effectual.

Table 5.7: Coarse-grained changes throughout the NCIt.

Change type Additions Removals Structural 235,420 146,687 Effectual 225,059 113,429 Ineffectual 10,361 33,258

10000

1000

100

10

1 d1 d3 d5 d7 d9 d11 d15 d21 d25 d31 d35 d41 d45 d51 d55 d61 d65 d71 d75 d81 d85 d91 d95 d13 d17 d19 d23 d27 d29 d33 d37 d39 d43 d47 d49 d53 d57 d59 d63 d67 d69 d73 d77 d79 d83 d87 d89 d93 d97 d99 d101 d103 d105 d107 d109 d111

Effectual Ineffectual

Figure 5.5: Breakdown of effectual vs ineffectual additions (x-axis: NCIt version, y-axis: number of axioms in a logarithmic scale).

Generally there is a high enough number of ineffectual changes occurring to warrant some attention on our part, especially since semantic diffs ignore these changes. Now it would be useful to determine whether and how negligible such ineffectual changes really are, which is what we aim to find in the following section.

5.5.1.2 Ineffectual Changes

Throughout the corpus there is a high number of ineffectual removals, reaching percentages of 97% in d16 and 93% in d29 out of all removals. On average, inef- fectual removals account for 34% of all removals. Most of these (94%) turn out to be new prospective redundancies (see Table 5.8), which indicates continuous refinement of content throughout the corpus. For example, d26 reveals 3,104 new 5.5 Empirical Evaluation 74

10000

1000

100

10

1 d1 d5 d3 d7 d9 d11 d13 d15 d17 d19 d21 d23 d25 d27 d29 d31 d33 d35 d37 d39 d41 d43 d45 d47 d49 d51 d53 d55 d57 d59 d61 d63 d65 d67 d69 d71 d73 d75 d77 d79 d81 d83 d85 d87 d89 d91 d93 d95 d97 d99 d101 d103 d105 d107 d109 d111

Effectual Ineffectual

Figure 5.6: Breakdown of effectual vs ineffectual removals (x-axis: NCIt version, y-axis: number of axioms in a logarithmic scale). prospective redundancies out of 3,157 ineffectual removals, and 3,843 removals in total. There are a total of 450 removed redundancies throughout the NCIt, constituting on average 4% of the ineffectual removal set of each diff. This shows that there is some pruning of logically redundant information going on in the corpus. Specifically, there are removed redundancies in 58 out of the 111 diffs, some cases with more than others: for example, in d6 63 out of 146 ineffectual removals are redundancies.

Table 5.8: Ineffectual removals of diff(Oi, Oi+1), for 1 ≤ i ≤ 112.

Prospective Redundant Rewrite Redundant Reshuffle New Total 117 (0.4%) 450 (1.4%) 958 (3%) 31,873 (96%) 33,258

On average 4.4% of additions are ineffectual, though there are some high val- ues such as 2,924 ineffectual additions out of 4,757 additions (61%) in d23. The majority of ineffectual additions are new retrospective redundancies (see Table 5.9), followed by redundancies. Generally there is a somewhat high number of added redundancies in the corpus, with higher frequency up until d7 (average of 112 added redundancies per comparison). Upon investigating these axioms, we found that many such added redundancies are those derived from the transitivity of the subsumption relationship, though not only atomic subsumptions, for ex- ample, O1 = {α1 : A v ∃r.B, α2 : C v A}, O2 = {α1, α2, α3 : C v ∃r.B}. From 5.5 Empirical Evaluation 75

the example we see that α3 is redundant; C v A suffices for C v ∃r.B to hold, yet entailments of this form constitute the majority of added redundancies.

Table 5.9: Ineffectual additions of diff(Oi, Oi+1), for 1 ≤ i ≤ 112.

Retrospective Redundant Rewrite Redundant Reshuffle New Total 118 (1%) 1,020 (10%) 415 (4%) 8,882 (86%) 10,361

As shown in Table 5.8 and 5.9, a number of rewrites were identified in the corpus. In particular, there are 227 rewritten axioms in d32. Upon inspecting these axioms, we noticed that such changes are not only syntactic but also trivial and easily detected. The changes typically have the following form:

α : A ≡ B u (C u ∃r.D)

Jα : A ≡ (B u (C u ∃r.D)) While ideally the underlying structural diff would not include these, at least with our categorisation and alignment with source axioms, it is easy to spot and recognise the triviality. Clearly certain ineffectual changes, such as these rewrites, are in fact “refactorings”, albeit in the case of prospective or ret- rospective redundancies the refactoring would have to be of a set of axioms rather than a single axiom. Note that prospective or retrospective redundancy axioms, despite using seemingly “new content”, do not necessarily mean that the ontology is strengthened, since the change might either introduce a redun- dancy or redistribute information from other axioms. Consider an ontology 0 O1 = {α1 : A v B, α2 : A v C}, and a change of α1 into α1 : A v B u C.

The axiom α1 is a retrospective redundancy (a reshuffle to be precise), but the 0 resulting ontology O2 = {α1 : A v B u C, α2 : A v C} is not strengthened.

In cases where the number of ineffectual changes is quite high (for example, d23 where 52% of changes are ineffectual) note that semantic diffs would significantly understate the amount of activity performed. While diffs based on structural equivalence detect these changes, their logical impact is not characterised at all.

5.5.1.3 Effectual Changes

Among the categories of effectual changes we discovered that the majority of these are, in the additions: new descriptions (46%), and in the removals: pure 5.5 Empirical Evaluation 76 removals with retired terms (62%). The latter are usually axioms that have as a superconcept either a removed atomic concept, or a restriction involving a removed term. The axioms describing these removed entities are appropriately categorised as retired descriptions, which account for 11% (12,883 in total) of effectual removals (see Table 5.10). In the additions there are many more of this type of change, with 103,806 new descriptions in total (see Table 5.11).

Table 5.10: Effectual removals of diff(Oi, Oi+1), for 1 ≤ i ≤ 112.

Axiom Modified Retired Pure Terms Weakening Definition Description Removal Total Shared 1,137 2,263 - 26,046 29,446 Retired 516 0 12,883 70,584 83,983 Total 1,653 2,263 12,883 96,630 113,429

Table 5.11: Effectual additions of diff(Oi, Oi+1), for 1 ≤ i ≤ 112.

Axiom Modified New Pure Terms Strengthening Definition Description Addition Total Shared 4,731 2,236 - 38,939 45,906 New 4,945 400 103,806 70,002 179,153 Total 9,676 2,636 103,806 108,941 225,059

The highest number of new descriptions occurs in d5, with 18,912 such changes. This high number is unsurprising as the terminology keeps evolving over time. Also in d5 we have the highest number of retired descriptions (11,835). It could be possible though, that some of these new or retired entities have in fact been renamed,2 in which case we could restrict our attention to new and retired descriptions to find the refactoring axiom pair. Together, strengthenings with shared and new terms average around 4% of effectual additions, indicating refinements with additional constraints. There are cases where the number of strengthenings is high, such as d26 which has 142 with new terms and 1,065 without, and d16 where 4,195 strengthenings with new

2Without edit-based logs or any refactoring information it is difficult to detect term renam- ings. One could apply ontology matching [56, 55, 27] algorithms, and with our categorisation at least the search space is reduced: since a renaming event occurs between a term in a retired description and another in a new description. 5.5 Empirical Evaluation 77 terms occurred. The latter typically reflect an introduction of new terms in order to further constrain existing ones, or existing terms being re-described based on newly introduced terms and axioms. Pure additions account for 48% of effectual additions, split between 31% of additions with new terms and 17% with shared ones. The highest number of pure additions with new terms occurs in d24, with 11,746 axioms. This version also contains the most pure removals with retired terms, amounting to 10,645 axioms. In terms of pure additions with shared terms, d79 contains the most, amounting to 4,462 axioms. Coincidently this same version also has the highest number of pure removals with shared terms, with a total of 4,562 such axioms. Typically pure changes suggest adjustments to the concept hierarchy, possibly to accommodate the introduction of new terms. The average of pure removals throughout the corpus is 85% of all effectual removals, split between 62% with retired terms and 23% with shared ones. There are not as many weakenings in the corpus as there are strengthenings. This tells us that typically there is not much reduction of constraints from version to version. The average percentage of weakenings (with or without retired terms) throughout the NCIt is of 1.5% of effectual removals, with the highest percentage being 69% (739) of effectual removals in d23. The next diff, d24, contains the highest number of weakenings with retired terms: 508.

5.5.1.4 Discussion

As shown in Tables 5.10 and 5.11 of the previous section, all but one of the effectual change categories are instantiated in the NCIt corpus; there are no modified definitions with retired terms, though there do exist modified definitions with shared terms. A good example of that is d32, between versions O32 and O33, which contains changes in nearly all categories, with the exception of weakenings and modified definitions with retired terms. In terms of ineffectual changes, we found that there exist axioms in all devised categories throughout the NCIt. Even though we found no “proper” rewrites, this category was instantiated with axioms that were trivially rewritten; the only difference between each rewritten axiom and its justifications is in the nesting of subconcepts. This, however, proved useful in singling out and aligning axioms that structural equivalence would consider regular changes. 5.5 Empirical Evaluation 78

5.5.2 NCIt Axiom Diff Walkthrough

In order to elaborate on the potential usefulness of the devised categories, in this 3 section we examine a particular version comparison: diff(O32, O33), and carry out an interpretation of the diff output from a user perspective in a manner inspired by cognitive walkthroughs [103]. The envisioned scenario is as follows: the user, an ontology engineer with some generic domain knowledge whose role is change quality assurance, intends to pinpoint axioms that may need revising by the appropriate domain experts. For that purpose, the user needs to (1) ensure that effectual changes are appropriate, (2) confirm that no (possibly) unintended loss occurred, and (3) verify whether, and if so why committed changes have no logical effect. The output of diff(O32, O33) is summarised in Tables 5.12 through to 5.15. For conciseness of axioms that will be presented, we abbreviate NCIt term names according to Table 5.16.

Table 5.12: Effectual additions in diff(O32, O33).

Axiom Modified New Pure Terms Strengthening Definition Description Addition Total Shared 108 79 - 218 405 New 8 1 106 29 144 Total 116 80 106 247 549

Table 5.13: Ineffectual additions in diff(O32, O33).

Retrospective Redundant Rewrite Redundant Reshuffle New Total 114 1 19 30 164

Overall, by looking at the diff change count, the user immediately spots that the changes are balanced between effectual and ineffectual, with most of the change “activity” being materialised as effectual additions and ineffectual re- movals. The user is firstly drawn to added information, and begins by inspecting the effectual additions (Table 5.12). Among these, the new descriptions are of particular interest, as they can be thought of as “new features”, while the remain- ing changes could be interpreted as refinements of existing features. Browsing

3Corresponding to NCIt releases 06.08d and 06.09d, respectively. 5.5 Empirical Evaluation 79

Table 5.14: Effectual removals in diff(O32, O33).

Axiom Modified Retired Pure Terms Weakening Definition Description Removal Total Shared 2 65 - 201 268 New 0 0 8 5 13 Total 2 65 8 206 281

Table 5.15: Ineffectual removals in diff(O32, O33).

Prospective Redundant Rewrite Redundant Reshuffle New Total 113 2 9 335 459 through new descriptions the user instantly becomes aware of the new terms that have been introduced in the new version, as the following axioms demonstrate:

α1 : OG v PA

α2 : AV P v PNSP

α3 : AV P v ∃fP artOf.V P

The axiom α1 introduces the term OG, while α2 and α3 introduce AV P . In this category, experts could restrict their attention to those axioms that involve terms within their domain expertise (so-called domain axioms and domain terms, respectively). The user continues the change analysis, now onto strengthenings. These are bound to introduce new constraints with respect to to (previously) existing ax- ioms, so it suffices to confirm that these new constraints are appropriate, for 0 example, axiom α4 is a strengthening of α4, and α5 a strengthening with new 0 terms (AF M) of α5:

α4 : SAA ≡ A u BESN u BSAN 0 α4 : SAA v A

α5 : AF MM ≡ FMM u ∀hasP AS.AF M 0 α5 : AF MM v FMM In these, as with many other strengthening axioms, a subsumption is altered 5.5 Empirical Evaluation 80

Table 5.16: Mapping of term names in the NCIt to abbreviations.

Term Name Abbreviation Oxidized Glutathione OG Protective Agent PA Anterior Visual Pathway AVP Nervous System Part PNSP Anatomic Structure Is Physical Part Of fPartOf Visual Pathway VP Skin Appendage Adenoma SAA Adenoma A Benign Epithelial Skin Neoplasm BESN Benign Skin Appendage Neoplasm BSAN Anterior Foramen Magnum Meningioma AFMM Foramen Magnum Meningioma FMM Disease Has Primary Anatomic Site hasPAS Anterior Foramen Magnum AFM Benign Sebaceous Neoplasm BSN Sebaceous Neoplasm SN Epidermal Involvement EI Cutaneous Involvement CI UMLS Cross-Reference Concept UCRC c Concept Status cCS Renal Cell Carcinoma with t X 1 p11 p34 RCC Xp11 Translocation-Related Renal Cell Carcinoma XTRCC Disease Has Cytogenetic Abnormality hasCA t X 1 p11 p34 tX Disease Has Molecular Abnormality hasMA PSF-TFE3 Fusion Protein Expression PFPE CS-1008 CS1 Chemical Or Drug Has Mechanism Of Action hasMoA Antigen Binding Interaction ABI Monoclonal Antibody MA Metastatic Malignant Neoplasm MMN Malignant Neoplasm MaN Metastatic Neoplasm MeN Disease Has Finding hasF Metastatic Lesion ML Lactic Acid L LAL Industrial Product IP Pharmaceutical Excipient PE Industrial Aid IA 5.5 Empirical Evaluation 81

0 into an equivalence. In the first case, where α4 is changed to α4, we have that

SAA has two new superconcepts: BESN and BSAN. Also α4 states that any instance of the conjunction of A, BESN and BSAN is necessarily an instance of SAA. After verifying that throughout strengthening axioms all such new constraints are correct, the user progresses onto the modified definitions.

α6 : BSN ≡ BESN u BSAN u SN 0 α6 : BSN ≡ BSAN u SN Similar to strengthenings, modified definitions introduce new constraints (some involving new terms). The user employs the same analysis procedure as for strengthenings to determine whether any change needs revisions.

Finally the user goes through the pure additions, for example, axiom α7 is a pure addition, and α8 is a pure addition with a new term cCS:

α7 : EI v CI

α8 : UCRC v cCS This type of axiom generally indicates adjustments to the concept hierarchy, which the user may want to verify in more detail where appropriate. Switching over to ineffectual changes, we denote in each justification axiom the superscripts: “[e]” to denote an effectual change and “[i]” an ineffectual change in diff(O32, O33), and “[s]” a shared axiom (i.e., an axiom in O32 ∩ O33). The user begins by inspecting the rewrites. Since rewritten axioms convey the same logical meaning in both O32 and O33, the user wants to understand why these are presented by the diff, and finds via our alignment that, for example, axiom α9 is a rewrite of another axiom contained in the justification for α9: Jα9 , as follows:

α9 : RCC ≡ XTRCC u ((∀hasCA.tX) u (∀hasMA.P F P E)) [i] Jα9 := {RCC ≡ XTRCC u (∀hasCA.tX) u (∀hasMA.P F P E) }

The change from the single axiom in Jα9 to α9 is purely syntactic and ir- relevant; the axioms are logically equivalent. While OWL 2’s notion of axiom

(structural) equality would have axiom α9 pinpointed as a change, our diff de- tects and aligns these pseudo-changes appropriately, making them easy to single 5.5 Empirical Evaluation 82 out and ignore, if appropriate. Seeing as all inspected rewrites have this very same form, the user foregoes further analysis of this category. Next the user inspects redundant axioms; since redundancies do not add any logical meaning the user may want to prune them. Such redundancies are guar- anteed not to alter the semantics of the ontology, and so can be immediately disposed of. A single redundancy is found in the diff: α10, with a justification

Jα10 ⊆ O33, as follows:

α10 : CS1 v ∃hasMoA.ABI [s] [s] Jα10 := {CS1 v MA ,MA v ∃hasMoA.ABI } Upon finding this, the user could proceed to remove the axiom from the ontology. Meantime it would be helpful to enquire with the corresponding domain experts as to why such redundancy was added in the first place. The user carries on to inspect retrospective redundancies, starting with the reshuffles. There are 9 such changes, and the user begins by examining α11 and its justification Jα11 ⊆ O33:

α11 : MMN ≡ MaN u MeN [i] [i] Jα11 := {MMN ≡ MaN u MeN u (∀hasF.ML) , MeN v ∀hasF.ML }

The user first notices that the axioms in the justification are all ineffectual and, second, that the previously existing extra constraint on the definition of MMN,

∀hasF.ML, would still hold in O33 if the subsumption MeN v ∀hasF.ML also holds. Seeing as the first axiom in Jα11 is entailed by O33, then it must certainly hold, and so what occurred here was a kind of reshuffling of constraints without loss of information. Having confirmed that other reshuffles exhibit a similar form, the user then carries on to inspect new retrospective redundancies; among these the user comes across several of the same kind as α12, with corresponding justification Jα12 ⊆ O32:

α12 : LAL v IP [e] [s] [s] Jα12 := {LAL v PE ,PE v IA ,IA v IP } This change reveals that LAL lost a superconcept, PE, via the first axiom in

Jα12 . Though now α12 explicitly asserts an otherwise lost subsumption between

LAL and IP , which used to hold in O32. Hence this change introduces “new” 5.6 Conclusions 83 content, and so is categorised as a new retrospective redundancy. This kind of change introduces information that existed in O32, but given some effectual removals would not exist in O33. We have shown here an example of how such categorisation makes change analysis more manageable, as opposed to having users inspect a whole set of changes with little to no guidance.

5.6 Conclusions

The devised diff method with its categorisation mechanism addresses a major problem with existing diff methodologies: the lack of alignment between source and target ontologies and structure in the resulting change sets. The diachronic study of the NCIt revealed that all changes occur throughout the corpus, with the exception of the retired terms variant of modified definitions – though there were modified definitions in the NCIt. By means of this categorisa- tion we can group changes according to their impact, allowing users to shift their attention to specific types of changes, rather than going through an unstructured change set while inspecting both ontologies. With our correspondence of changes between source and target ontologies we can show the changed axioms and what they are a change of. Consequently, by analysing changes in this way there is no need for constantly having to inspect ontologies manually. So with the method described we can support users in understanding the impact of their changes (or lack thereof), and refine these before publishing newer versions. We found that ineffectual changes account for a significant amount of changes throughout the NCIt. Despite the fact that semantic diffs ignore these changes in their output, we show that they provide helpful modelling insights, and thus are worth examining. For instance we discovered a high number of redundant axioms in the NCIt, all of which could be disposed of. Also we found a number of structurally distinct ineffectual changes that are clearly irrelevant: the rewritten axioms. These indicate the need for improvement of the underlying structural equivalence notion in OWL 2, though with our change alignment the user can immediately spot that these rewrites are irrelevant, and may skim through such axioms until a “true” rewrite is found. On the other hand, if there were no alignment, we can see this as a major drawback, since a shallow inspection of the change set would convince users that these were actual changes. In general the 5.6 Conclusions 84 inspection of ineffectual changes is helpful to prevent, for example, re-doing work or introducing redundancy. Relying on semantic diffs alone one would be missing out on these ineffectual changes, which could have helped users recognising the lack of impact of certain changes, thus helping to adjust their modelling practice accordingly. Chapter 6

Term-Centric Impact Analysis

In the previous chapter we addressed the problem of identifying and characteris- ing the impact of changes by, for instance, pinpointing whether changes produce any logical effect. This work focused exclusively on axiom level analysis. Since OWL ontologies are sets of axioms, this is a natural level of analysis. However, ontologies often serve as means to manage taxonomies, that is, the set of ax- ioms is a development time artefact supporting the delivery of a hierarchically organised set of categorical terms. Furthermore, ontology development tends to be very term-oriented, with editing tools such as Prot´eg´eadopting a frame-based interface. Consequently, end users are most directly concerned with changes to the (hierarchy of) terms, and may not even have access to the axioms. Thus, the modeller must not only be aware of the axioms they have touched, but how those changes affect the concepts in the ontology.

6.1 Motivation

For the purpose of identifying differences at the term-level, recent notions of se- mantic difference based on conservative extensions, so-called Σ-difference, have provided a robust theoretical and practical basis for analysing these logical effects

[68]. Unfortunately Σ-difference, i.e., diff(O1, O2)Σ as defined in Section 2.1.3, is computationally expensive (ExpTime-complete) even for inexpressive logics such as EL [76]. For the very expressive logics such as SROIQ (the DL underly- ing OWL 2) it is undecidable, while in the ALC, ALCQ, and ALCQI DLs its computational complexity is 2ExpTime-complete [75]. The set of Σ-differences diff(O1, O2)Σ (from Definition1 in Section 2.1.3) alone, if non-empty, tells us 6.1 Motivation 86 that there are new entailments expressed in the designated signature, though it does not extrapolate from those differences which Σ-concepts are affected. So, similar to CEX, we focus on restricted elements of diff(O1, O2)Σ – subsumptions with an atomic left hand (respectively right hand) side, i.e., of the form A v C (respectively C v A) where A is atomic and C is a possibly complex concept, called the witness concept. All terms that appear in those positions in axioms in diff(O1, O2)Σ form the set of affected terms AT(O1, O2)Σ (from Definition2 in Section 4.1.2). Aside from the computational feasibility problem, standard Σ-difference runs into other difficulties in more expressive logics when we consider differences with respect to non-shared terms between two ontologies. In particular, if we compare entailment sets over logics with disjunction and negation we end up with vacuously altered terms: any logically effectual change will alter the mean- ing of every term. Consider the following ontologies: O1 = {A v B}, and

O2 = {A v B,C v D}. Clearly O2 is a conservative extension of O1 with re- spect to Σ = {A, B}, but if we consider Σu := Oe1 ∪ Oe2 then that is no longer the case; a witness axiom for the separability would be η := A v ¬C t D. This 0 witness “witnesses” a change to every concept A ∈ Σu; for each witness axiom 0 0 0 0 η : A v ¬C t D we have that O1 6|= η , while O2 |= η . Such a witness would suffice to pinpoint, according to Σ-difference, that all terms in Σu have changed: AT(O , O ) = Σ since > v ¬C t D. Consequently, this kind of witness is 1 2 Σu u uninteresting for any particular concept aside from >. Likewise, a change A v ⊥ implies that, for all B in the signature of the ontology in question, we have that A v B. Yet these consequences are of no interest to any concept B. Similar to the case of the least common subsumer [70], the presence of disjunc- tion (and negation) trivialises definitions that are meaningful in less expressive logics. In our case, a direct extension of Σ-difference for more expressive logics such as ALC would be futile; when we step beyond EL as a witness language and allow witnesses in ALC instead, if O1 6≡ O2 then AT(O1, O2)Σ contains all terms in Σ. Thus we need to refine our diff notion when dealing with propositionally closed witness languages. To address the vacuity problem, we present a non-trivializable notion of se- mantic difference of concepts, which includes a mechanism for distinguishing di- rectly and indirectly affected concepts. To address the undecidability of even our refined semantic difference problem for the SROIQ DL, we define a series of 6.2 Specification 87 motivated semantic diff approximations for expressive description logics. These algorithms are evaluated on a select subset of the National Cancer Institute (NCI) Thesaurus (NCIt) corpus, by a comparison of the changes found via the proposed approximations and related approaches. Our experiments show that our strongest approximation, “Grammar diff”, finds significantly more changes than all other methods across the corpus, and far more than are identified in the NCIt change logs. We show that distinguishing direct and indirect changes is necessary and effective for making concept based change logs manageable.

6.2 Specification

Taking into account the shortcomings of existing methodologies, as well as the triviality of Σu-difference in expressive ontologies, we present a semantic diff method in accordance with the following requirements: (1) determines which concepts have been affected by changes, (2) differentiates concepts that have been directly or indirectly affected, (3) is computationally feasible for OWL 2 DL, and finally (4) based on a principled entailment grammar.

Given two, non equivalent ontologies O1 := ∅ and O2 := {> v ∃r.(∃s.>) u

∃r.(∀s.C),A v B}, we know that O1 and O2 are not Σu-inseparable with respect mCE to model inseparability, that is, O1 6≡Σu O2, since they have different models by

Definition1. Between O1 and O2 we have that A and B are relevantly changed 0 (O1 6|= A v B while O2 |= A v B), and so one would assume that Σ := {A, B} is mCE one minimal witness signature for the separability, i.e., O1 6≡Σ0 O2. However, this is an invalid assumption; the two ontologies are separable with respect to a smaller signature: {A}, as we cannot enforce that all models I|{A} |= O1 coincide mCE with all models J |{A} |= O2, and consequently O1 6≡{A} O2. Similarly, we have mCE that the empty signature also witnesses the separability, that is, O1 6≡∅ O2. Of course, this involves deciding whether, for a given signature Σ, two ontologies are mCE-inseparable with respect to Σ. And seeing as mCE-inseparability is un- decidable for SROIQ [75],1 as pointed out in Section 2.1.3, we rely on deductive rather than model inseparability-based approximations of AT(O1, O2)Σ.

Consider the example ontologies O1 and O2 defined in Table 6.1; they will be used throughout this section as a running example.

1Indeed mCE-inseparability is already undecidable for general EL ontologies [76]. 6.2 Specification 88

Table 6.1: Example ontologies O1 and O2.

O1 O2

α1 : A v B β1 : A v B α2 : B v C β2 : B v C u D α3 : D v ∃r.E β3 : D v ∃r.E α4 : E v ∀s.G β4 : E v ∀s.(G u F ) α5 : ∃r.I v J β5 : ∃r.I v J β6 : ∀t.H v I

6.2.1 Characterising Concept Impact

Prior to determining how a concept in a signature Σ has changed (for example, it has a new superconcept), we employ a diff function Φ which, given two ontologies and Σ, formulates a set of witness axioms over Σ, denoted Φ diff(O1, O2)Σ, such that, for each η ∈ Φ diff(O1, O2)Σ: O1 6|= η and O2 |= η. Now given such a set

Φ diff(O1, O2)Σ, we can tell apart specialised and generalised concepts depending on whether the witness concept is on the right or left hand side of the witness axiom, accordingly. Furthermore, we regard an atomic concept A as directly specialised (generalised) via some witness C if there is no atomic concept B that is a superconcept (subconcept) of A, such that C is also a witness for a change in B. Otherwise A changed indirectly.

Definition 7. Given Φ diff(O1, O2)Σ ⊆ diff(O1, O2)Σ, the sets of affected atomic concepts for a signature Σ are defined as follows: ( > {>} if there is a > v C ∈ Φ diff(O1, O2)Σ Φ- AT(O1, O2) = Σ ∅ otherwise ( ⊥ {⊥} if there is a C v ⊥ ∈ Φ diff(O1, O2)Σ Φ- AT(O1, O2) = Σ ∅ otherwise L Φ- AT(O1, O2)Σ = {A ∈ Σ | there exists A v C ∈ Φ diff(O1, O2)Σ and

> v C/∈ Φ diff(O1, O2)Σ}

R Φ AT(O1, O2)Σ = {A ∈ Σ | there exists C v A ∈ Φ diff(O1, O2)Σ and

C v ⊥ ∈/ Φ diff(O1, O2)Σ}

S Y Φ- AT(O1, O2)Σ = Y ∈{L,R,>,⊥} Φ- AT(O1, O2)Σ 6.2 Specification 89

L R Given an atomic concept A ∈ Φ-AT(O1, O2)Σ (analogously A ∈ Φ-AT(O1, O2)Σ), and a signature Σ+ := Σ ∪ {>, ⊥}, we define the following notions:

A direct change of A is a witness C s.t. A v C (C v A) ∈ Φ diff(O1, O2)Σ + and there is no B ∈ Σ s.t. O2 |= A v B (O2 |= B v A), O2 6|= A ≡ B, and

B v C (C v B) ∈ Φ diff(O1, O2)Σ.

An indirect change of A is a witness C s.t. A v C (C v A) ∈ Φ diff(O1, O2)Σ + and there is at least one B ∈ Σ s.t. O2 |= A v B (O2 |= B v A),

O2 6|= A ≡ B and B v C (C v B) ∈ Φ diff(O1, O2)Σ. A concept A is purely directly changed (i.e. specialised or generalised) if it is only directly changed (specialised or generalised), and analogously for purely indirectly changed (specialised or generalised).

Once again consider the ontologies in Table 6.1; we have that B is purely directly specialised via witness D: O1 6|= B v D and O2 |= B v D, while A is indirectly specialised via the same witness, since O1 6|= A v D, O2 |= A v D,

O2 |= A v B and B v D ∈ diff(O1, O2). In other words, concept A changes via B. Additionally, the concept D is directly generalised via B, but indirectly generalised via A. Thus D is not purely directly changed, but rather we have a mixed effect on the concept. The distinction between directly and indirectly affected atomic concepts, in addition to the separation of concepts affected via > and ⊥, allows us to over- come the problem described in Section 6.1, with respect to propositionally closed description logics. If there exists a global change to > (analogously to ⊥), it is singled out from the remaining localised changes, and its effect is appropriately marked as an indirect change to every atomic concept. Thus the diff results are no longer “polluted” by vacuous witnesses such as those exemplified and discussed in Section 6.1. Similarly, (purely) indirect changes are appropriately isolated from the primary changes of interest: direct ones. The notion of “change effect” as per Definition7 is applicable to any diff function Φ that produces a set of witness axioms Φ diff(O1, O2)Σ. 6.2 Specification 90

6.2.2 Diff Functions

Given the undecidability of mCE-inseparability (as discussed in Section 6.2), we devise several sound but incomplete approximations to the problem of computing diff(O1, O2)Σ in expressive logics: “Subconcept” diff, denoted SubDiff(O1, O2)Σ, and “Grammar” diff, denoted GrDiff(O1, O2)Σ. These approximations are de- signed to capture as much change as possible while maintaining both computa- tional feasibility and legibility of witness axioms. The set of differences that would be captured by a simple comparison of concept hierarchies between two ontologies, i.e., differences in atomic subsumptions, is denoted AtDiff(O1, O2)Σ. Hereafter we refer to the semantic diff notion used within ContentCVS as CvsDiff(O1, O2)Σ.

The SubDiff(O1, O2)Σ approximation is based on subconcepts explicitly as- serted in the input ontologies, and returns those differences in entailments of type C v D, where C and D are possibly complex concepts from the set of Σ- subconcepts of O1 and O2 (see Definition8). It is at least conceivable that many entailments will involve complex concepts in their asserted form, and, if that is the case, those would be witnesses that the user could understand and, indeed, may have desired. Using this approach we can detect changes in a principled and relatively cheap way, for example, from the ontologies in Table 6.1 we have that

O1 6|= A v ∃r.E, and O2 |= A v ∃r.E, pinpointing A as affected. In order to avoid only considering witnesses in their explicitly asserted form we designed GrDiff(O1, O2)Σ, which detects differences in additional types of entailments using the following grammars (where SC, SC0, with SC 6= SC0, stand for subconcepts of O1 ∪ O2, and r an atomic role):

0 Grammar GL : C −→ SC | SC t SC | ∃r.SC | ∀r.SC | ¬SC

0 Grammar GR : C −→ SC | SC u SC | ∃r.SC | ∀r.SC | ¬SC

GrDiff(O1, O2)Σ combines the basic intuitions about interesting logical forms with the ontology specific information available from SubDiff(O1, O2)Σ. By re- stricting fillers of the restrictions to the (inherently) finite set of subconcepts, we ensure termination. The grammars are designed to avoid pointless redundancies, such as testing for A v C u D which is equivalent to A v C and A v D.

In terms of complexity of computing Φ-diff(O1, O2)Σ, there are two dimen- sions to be considered: (1) the complexity of deciding entailment in the input language, and (2) the number of entailment tests. Regarding the latter, the 6.2 Specification 91 maximum number of candidate witness axioms is polynomial in the number of the inputs’ subconcepts, namely quadratic for SubDiff(O1, O2)Σ and cubic for

GrDiff(O1, O2)Σ. The semantic difference between ontologies with respect to each specified diff function, including CEX and CvsDiff(O1, O2)Σ, is boiled down to finding an en- tailment of a certain form that holds in O2 but not O1; what varies between each function is the kind of entailment grammar used (i.e., witness language), which in turn dictates the computational feasibility of the diff function. The entailment grammar used by CvsDiff(O1, O2)Σ, denoted Gcvs, is defined in Section 4.1.2.

Definition 8. Given two ontologies, a diff function Φ, and a signature Σ, the set of Σ-differences is:

Φ diff(O1, O2)Σ := diff(O1, O2)Σ ∩ Φ-ax where the set Φ-ax is defined as follows:

At-ax = {C v D | C,D ∈ Σ}

Sub-ax = {C v D | C,D are subconcepts in O1 ∪ O2, where Ce ∪ De ⊆ Σ}

Gr-ax = {C v D | C ∈ Σ and D is a concept over GL, where De ⊆ Σ, or

D ∈ Σ and C is a concept over GR, where Ce ⊆ Σ}

Cvs-ax = {C v D | C ∈ Σ and D is a concept over Gcvs, where De ⊆ Σ} CEX-ax = {C v D | C,D are subconcepts in L(Σ)}

Applying the diff functions At, Sub, and Gr from Definition8 to our example ontologies from Table 6.1, we get the sets of affected terms shown in Table 6.2. The differences in atomic subsumptions are easily identifiable: A, B and D are affected via the addition of β2, the first two concepts are specialised while the last one is generalised. In addition to these, SubDiff(O1, O2)Σ pinpoints the axioms

β4 and β5 as new entailments in O2, thus concept E is specialised via β4, and

I generalised via β5. Finally GrDiff(O1, O2)Σ spots two more affected concepts: D is specialised via witness axiom D v ∃r.∀s.(F u G), and J is generalised via ∃r.∀t.H v J. Naturally the more we expand our entailment grammar, the closer we get to the actual change set. As specified in Section 6.2, a key requirement of our diff functions is to remain decidable, and as long as the language generated by the 6.2 Specification 92

Table 6.2: Affected concepts (specialised, generalised and total) between O1 and O2 according to the mentioned diff notions.

Φ = At Φ = Sub Φ = Gr

> Φ-AT(O1, O2)Σ ∅ ∅ ∅ ⊥ Φ-AT(O1, O2)Σ ∅ ∅ ∅ L Φ-AT(O1, O2)Σ {A, B}{A, B, E}{A, B, D, E} R Φ-AT(O1, O2)Σ {D}{D,I}{D,I,J}

Φ-AT(O1, O2)Σ {A, B, D}{A, B, D, E, I}{A, B, D, E, I, J} grammar is finite then decidability is ensured. Having discussed the computa- tional upper bound above, we will comment on the performance of our imple- mentation in Section 6.3.2. In order to be able to compare our diffs with CEX, in the following section we specify ways in which CEX can be used to compare more expressive ontologies.

6.2.2.1 CEX-Based Functions

The current implementation of CEX only takes as input acyclic ELHr terminolo- gies, that is, ELHr TBoxes which are 1) acyclic and 2) every concept appears (alone) on the left-hand side of an axiom exactly once. In order to apply CEX to ontologies that are more expressive than ELHr, one must rely on approximation algorithms. An EL approximation does not suffice, as there may exist cycles, GCIs, or more than one axiom with the same left hand side. Therefore, as a means to apply CEX to expressive ontologies, we use two ELHr approximations.

Definition 9. For an ontology O, we define the approximation function r r ELH App1(O) that approximates O into ELH as follows:

(a) Remove all subsumptions with a non-atomic left hand side and all non-EL axioms.

(b) Remove all but one axiom with a given atomic left-hand side; if there is an equivalence axiom with an atomic concept X on either side, and a non-empty set of subsumptions Ψ that have X on their left hand side, remove all axioms in Ψ and preserve the equivalence. 6.2 Specification 93

(c) Break cycles by non-deterministically removing axioms in cycles until the resulting ontology is cycle-free.

r r The approximation function ELH App2(O) is the same as ELH App1(O) but with Step (b) replaced with (d) as follows:

(d) Replace the set of axioms with a common left hand side concept A, for exam- ple, {A v C,A v D}, with a subsumption between A and the conjunction of all concepts on the right hand side of all such axioms, for example, A v CuD.

r The intuition behind the second CEX-based approximation ELH App2(O) is that by applying such rewrites there could be less loss of information with r respect to ELH App1(O). One could apply other, more sophisticated forms of approximation, though our intention here is to verify whether CEX, even using such basic approximations to ELHr, can detect changes that our diff functions miss. If that does indeed occur, it would suggest that combining diff functions, or attempting other forms of ELHr approximations could be desirable for users. Based on these approximation algorithms, we can now use CEX as a subrou- tine in a diff function for non-ELHr ontologies.

Definition 10. Given two ontologies, a signature Σ, and an ELHr approximation r function ELH Appi(O), the set of Σ-differences CexiDiff(O1, O2)Σ is:

r 0 1. For each j ∈ {1, 2}, execute ELH Appi(Oj), resulting in Oj.

0 0 2. Apply CEX to (O1, O2, Σ), resulting in the change set: TempCS.

3. For each α ∈ T empCS, add α to CexiDiff(O1, O2)Σ if O1 6|= α and O2 |= α.

Given the loss of axioms during the input approximation step (via the ELHr approximation functions), especially due to its non-deterministic nature, we may well introduce spurious changes. Thus Step 3 in Definition 10 is designed to en- sure that changes detected within the ELHr approximations (obtained in Step 2) are sound changes with respect to the whole (untouched) input ontologies. In other words, we verify which detected changes are due to the approximation step. Obviously, this approximation-based procedure throws away a lot of information and is not deterministic. However, even such an approximation can offer useful insight, particularly if it finds changes that other methods do not. There are 6.2 Specification 94 more elaborate existing approximation approaches (for example, [86]), but they generally do not produce ELHr terminologies, so their use requires either chang- ing the approximation output or updating CEX to take non-terminological EL input.

6.2.3 Example Walkthrough

In the preceding sections we have briefly shown the result of applying our concept- based diff functions to a simple pair of artificial ontologies. In this section we reuse the ontologies in Table 6.1 to demonstrate the full result of the distinction between direct and indirectly affected concepts. Having the set of affected atomic concepts from Table 6.2, we distinguish between those that are directly and indi- rectly affected. Note that a specialised concept A in O2 means that A has a new superconcept, while a specialised concept B in O1 means B lost a superconcept (and analogously for generalised concepts). Tables 6.3, 6.4, and 6.5 contain the breakdown, for AtDiff(O1, O2)Σ, SubDiff(O1, O2)Σ, and GrDiff(O1, O2)Σ respec- tively, of specialised and generalised atomic concepts depending on whether these are purely directly, purely indirectly, or both directly and indirectly affected.

Table 6.3: Breakdown of concept impact in At-AT(O1, O2)Σ.

Effect       Type               Concept name  Witness axiom(s)
Specialised  Purely directly    B             B v D
             Purely indirectly  A             A v D
Generalised  Mixed              D             B vd D, A vi D

In Table 6.3 we show the changes to the concept hierarchy, where B gets a new superconcept D and, as a consequence, so does A. Given these changes, D is generalised both directly and indirectly, via B and A respectively.

Table 6.4: Breakdown of concept impact in Sub-AT(O1, O2)Σ.

Effect       Type               Concept name  Witness axiom(s)
Specialised  Purely directly    E             E v ∀s.(F u G)
             Purely indirectly  A             A v D, A v C u D, A v ∃r.E
             Mixed              B             B vd D, B vd C u D, B vi ∃r.E
Generalised  Purely directly    I             ∀t.H v I
             Mixed              D             B vd D, A vi D

Once we also consider subsumptions involving subconcepts, as shown in Table 6.4, we find two additional affected concepts: E and I. Moreover, this diff finds extra witnesses, not identified via AtDiff(O1, O2)Σ, for the concept changes of A and B. In particular, observe that one of the new witnesses for the change to B, ∃r.E, is derived from the fact that O2 ⊨ B v D and O2 ⊨ D v ∃r.E, so it is an indirect witness. Seeing as B is purely directly affected according to AtDiff(O1, O2)Σ, with the discovery of this witness B shifts to the mixed affected concepts. This goes to show that, while more affected concepts are found as we increase the expressivity of our witness language, the categories they fall under may change as a consequence. Both the newly found specialised atomic concept E and the generalised atomic concept I are purely directly affected via explicitly asserted subconcepts, in the added axioms β4 and β6, respectively.

The final diff notion, GrDiff(O1, O2)Σ, finds an additional specialised concept D, purely directly affected via witness ∃r.(∀s.(F u G)). Another newly found affected concept is J, which is purely directly generalised via ∃r.(∀t.H). Finally, note that two new witnesses are found for each of A and B, though these do not change the categorisation of these concepts with respect to SubDiff(O1, O2)Σ (i.e., they remain purely indirectly and mixed affected, respectively).

Table 6.5: Breakdown of concept impact in Gr-AT(O1, O2)Σ.

Effect       Type               Concept name  Witness axiom(s)
Specialised  Purely directly    E             E v ∀s.(F u G)
                                D             D v ∃r.(∀s.(F u G))
             Purely indirectly  A             A v D, A v C u D, A v ∃r.E,
                                              A v ∃r.(∀s.G), A v ∃r.(∀s.(F u G))
             Mixed              B             B vd D, B vd C u D, B vi ∃r.E,
                                              B vi ∃r.(∀s.G), B vi ∃r.(∀s.(F u G))
Generalised  Purely directly    I             ∀t.H v I
                                J             ∃r.(∀t.H) v J
             Mixed              D             B vd D, A vi D

6.3 Implementation

A naive algorithm for any of the diffs mentioned could be to generate and test all possible witness axioms for the language considered (for example, generate subsumptions between each atomic concept A and each subconcept C, for all A and C). However, this is bound to incur performance and memory problems for even modest input, especially using the closest approximations. Given these issues, we devise a less naive approach where the goal is to avoid explicit entailment tests. Instead, we manipulate the input in such a way that our entailments will be tested during classification, thus exploiting built-in reasoner optimisations for the task. The algorithms to compute our diff functions are presented in Section 6.3.1. Subsequently, in Section 6.3.2, we briefly discuss how our implementation performs on naturally occurring ontologies.

6.3.1 Algorithms

Our implementation of SubDiff(O1, O2)Σ is outlined in Algorithm 5, where we use the following notation: DirAT^L(O1, O2)Σ (respectively IndAT^L(O1, O2)Σ) denotes the set of directly (respectively indirectly) affected terms, and A vd B denotes a direct subsumption between A and B, i.e., one that is not mediated by another atomic concept. The intuition of the algorithm is as follows: we create equivalences between subconcepts and new atomic concepts, add them to both ontologies, and classify. This relays onto the reasoner the work of computing subsumptions between atomic concepts and, as a consequence of introducing the equivalences, between the subconcepts in the ontologies. Then, when computing AT^L(O1, O2)Σ, for each concept in Σ we compare its sets of, firstly, indirect and, secondly, direct superconcepts from O1 and O2 to determine which concepts changed, and whether the witness is an indirect or a direct one (as outlined in Algorithm 6 for the computeDifferences(...) subprocedure). To retrieve the change witnesses, or witness axioms, it would be necessary to replace the newly added atomic concepts with their respective subconcepts in the equivalence axioms created. Note that, to get the full diff, we must compute SubDiff^L(O1, O2)Σ and SubDiff^R(O1, O2)Σ for the pair (O1, O2), and then for the reversed pair (O2, O1). Algorithm 5 is easily altered to compute SubDiff^R(O1, O2)Σ: it suffices to replace v with w throughout. Similarly, we compute AtDiff(O1, O2)Σ using a reduced version of Algorithm 5, by simply skipping lines 2 - 7; i.e., we use the same procedure but do not inflate the ontologies with equivalence axioms.

The implementation of GrDiff(O1, O2)Σ is outlined in Algorithm7, where the only difference with respect to Algorithm5 for SubDiff( O1, O2)Σ is the second step of candidate witness generation; after defining the subconcepts and adding them to the ontologies, we formulate the remaining possible concepts according to our entailment grammar. For that purpose we add all atomic concepts in Σ (line 8) to the set of subconcepts, and then generate candidate negation, existential, universal and disjunction witnesses. Algorithm7 can be modified to compute the concept-based difference with respect to the entailment grammar employed in CvsDiff(O1, O2)Σ: it suffices to consider only (atomic) Σ-concepts in line 9 instead of subconcepts, and ignore lines 14 - 18 since these produce disjunctions, a form of witness not included in the entailment grammar of CvsDiff(O1, O2)Σ. 6.3 Implementation 98

Algorithm 5 computeSubDiff^L(O1, O2)Σ

Input: Ontologies O1, O2, and signature Σ
Output: Sets of (directly and indirectly) affected atomic concepts

1: DirAT^L(O1, O2)Σ ← ∅, IndAT^L(O1, O2)Σ ← ∅
2: SC ← {C | C is a complex subconcept in O1 ∪ O2, with C̃ ⊆ Σ}
3: for subconcept C ∈ SC do
4:     for 1 ≤ j ≤ 2 do
5:         Add X ≡ C to Oj    ▷ X is a new atomic concept
6:     end for
7: end for
8: Classify O1 and O2
9: for atomic concept A ∈ Σ do
10:    computeDifferences(O1, O2, Σ, A)    ▷ update sets in line 1 (see Alg. 6)
11: end for
12: return DirAT^L(O1, O2)Σ, IndAT^L(O1, O2)Σ
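To give a flavour of how lines 2 - 7 might be realised with the OWL API, the sketch below introduces a fresh name X ≡ C for each complex subconcept. We rely on the OWL API's getNestedClassExpressions to enumerate subconcepts; the IRI scheme is our own, both ontologies are assumed to be held by one manager, and the restriction to subconcepts with C̃ ⊆ Σ is omitted for brevity:

import java.util.*;
import org.semanticweb.owlapi.model.*;

public class SubconceptNamer {
    private int counter = 0;

    // Algorithm 5, lines 2-7: define a fresh atomic concept X ≡ C for every
    // complex subconcept C of O1 ∪ O2, so that a single classification computes
    // all subsumptions between subconcepts.  Returns the map X ↦ C, needed
    // later to translate named subsumptions back into witness axioms.
    public Map<OWLClass, OWLClassExpression> nameSubconcepts(OWLOntology o1, OWLOntology o2,
                                                             OWLDataFactory df) {
        Set<OWLClassExpression> subconcepts = new HashSet<>();
        for (OWLOntology o : Arrays.asList(o1, o2)) {
            for (OWLClassExpression c : o.getNestedClassExpressions()) {
                if (c.isAnonymous()) {          // skip atomic concepts
                    subconcepts.add(c);
                }
            }
        }
        Map<OWLClass, OWLClassExpression> freshNames = new HashMap<>();
        OWLOntologyManager man = o1.getOWLOntologyManager(); // assumed to also hold o2
        for (OWLClassExpression c : subconcepts) {
            OWLClass x = df.getOWLClass(IRI.create("urn:diff#X" + counter++));
            freshNames.put(x, c);
            man.addAxiom(o1, df.getOWLEquivalentClassesAxiom(x, c));
            man.addAxiom(o2, df.getOWLEquivalentClassesAxiom(x, c));
        }
        return freshNames;
    }
}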

Algorithm 6 computeDifferences

Input: Ontologies O1, O2, signature Σ, atomic concept A
Output: Updated sets of (directly and indirectly) affected atomic concepts

1: ind1 ← {C | O1 ⊨ A v C}
2: ind2 ← {C | O2 ⊨ A v C}
3: if ind1 ≠ ind2 then
4:     dir1 ← {C | O1 ⊨ A vd C}    ▷ vd is a direct subsumption
5:     dir2 ← {C | O2 ⊨ A vd C}
6:     ind1 ← ind1 \ dir1
7:     ind2 ← ind2 \ dir2
8:     if dir1 ≠ dir2 then
9:         for concept C ∈ (dir1 ∩ Σ) do
10:            if C ∉ (dir2 ∩ Σ) then
11:                Add A to DirAT^L(O1, O2)Σ
12:            end if
13:        end for
14:    end if
15:    for concept C ∈ (ind1 ∩ Σ) do
16:        if C ∉ (ind2 ∩ Σ) then
17:            Add A to IndAT^L(O1, O2)Σ
18:        end if
19:    end for
20: end if
21: return DirAT^L(O1, O2)Σ and IndAT^L(O1, O2)Σ
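The comparison step maps directly onto the OWL API's reasoner interface: getSuperClasses(A, false) yields all (strict) superconcepts, and getSuperClasses(A, true) only the direct ones. A minimal sketch of Algorithm 6 under that assumption, with illustrative names (r1 and r2 are reasoners over the inflated O1 and O2; the Σ-check also filters out the fresh names introduced above):

import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class AffectedTermFinder {
    public final Set<OWLClass> dirAT = new HashSet<>(); // DirAT^L(O1, O2)Σ
    public final Set<OWLClass> indAT = new HashSet<>(); // IndAT^L(O1, O2)Σ

    // Algorithm 6 for a single atomic concept A
    public void computeDifferences(OWLClass a, OWLReasoner r1, OWLReasoner r2,
                                   Set<OWLClass> sigma) {
        Set<OWLClass> ind1 = new HashSet<>(r1.getSuperClasses(a, false).getFlattened());
        Set<OWLClass> ind2 = new HashSet<>(r2.getSuperClasses(a, false).getFlattened());
        if (ind1.equals(ind2)) return;             // no change for A
        Set<OWLClass> dir1 = new HashSet<>(r1.getSuperClasses(a, true).getFlattened());
        Set<OWLClass> dir2 = new HashSet<>(r2.getSuperClasses(a, true).getFlattened());
        ind1.removeAll(dir1);                      // keep strictly indirect superconcepts
        ind2.removeAll(dir2);
        for (OWLClass c : dir1) {                  // direct Σ-superconcept missing in O2?
            if (sigma.contains(c) && !dir2.contains(c)) dirAT.add(a);
        }
        for (OWLClass c : ind1) {                  // indirect Σ-superconcept missing in O2?
            if (sigma.contains(c) && !ind2.contains(c)) indAT.add(a);
        }
    }
}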

Algorithm 7 computeGrDiff^L(O1, O2)Σ

Input: Ontologies O1, O2, and signature Σ
Output: Sets of (directly and indirectly) affected atomic concepts

1: DirAT^L(O1, O2)Σ ← ∅, IndAT^L(O1, O2)Σ ← ∅
2: SC ← {C | C is a complex subconcept in O1 ∪ O2, with C̃ ⊆ Σ}
3: for subconcept C ∈ SC do
4:     for 1 ≤ j ≤ 2 do
5:         Add X ≡ C to Oj    ▷ X is a new atomic concept
6:     end for
7: end for
8: Add all atomic concepts A ∈ Σ to SC
9: for subconcept C ∈ SC do    ▷ X, Y, Z are new atomic concepts
10:    for 1 ≤ j ≤ 2 do
11:        Add X ≡ ¬C to Oj
12:        for atomic role r ∈ Σ do
13:            Add Y ≡ ∃r.C to Oj and Z ≡ ∀r.C to Oj
14:        end for
15:        for subconcept C′ ∈ SC do
16:            if C ≠ C′ then
17:                Add Z ≡ C t C′ to Oj
18:            end if
19:        end for
20:    end for
21: end for
22: Classify O1 and O2
23: for atomic concept A ∈ Σ do
24:    computeDifferences(O1, O2, Σ, A)    ▷ update sets in line 1 (see Alg. 6)
25: end for
26: return DirAT^L(O1, O2)Σ, IndAT^L(O1, O2)Σ
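The extra generation step (lines 9 - 21) can be sketched in the same style as before; the class below is again purely illustrative, with a hypothetical fresh-IRI scheme, and simply produces the negation, existential, universal, and disjunction candidates:

import java.util.*;
import org.semanticweb.owlapi.model.*;

public class WitnessGenerator {
    private final OWLDataFactory df;
    private final OWLOntologyManager man;   // assumed to hold both ontologies
    private final List<OWLOntology> onts;
    private int counter = 0;

    WitnessGenerator(OWLDataFactory df, OWLOntologyManager man, List<OWLOntology> onts) {
        this.df = df; this.man = man; this.onts = onts;
    }

    // Add X ≡ C (fresh X) to both ontologies, as in lines 11, 13 and 17
    private void name(OWLClassExpression c) {
        OWLClass x = df.getOWLClass(IRI.create("urn:diff#W" + counter++));
        for (OWLOntology o : onts) man.addAxiom(o, df.getOWLEquivalentClassesAxiom(x, c));
    }

    // Generate candidate witnesses ¬C, ∃r.C, ∀r.C and C t C′ over SC and the roles in Σ
    void generate(Set<OWLClassExpression> sc, Set<OWLObjectProperty> roles) {
        for (OWLClassExpression c : sc) {
            name(df.getOWLObjectComplementOf(c));
            for (OWLObjectProperty r : roles) {
                name(df.getOWLObjectSomeValuesFrom(r, c));
                name(df.getOWLObjectAllValuesFrom(r, c));
            }
            for (OWLClassExpression c2 : sc) {
                if (!c.equals(c2)) name(df.getOWLObjectUnionOf(c, c2));
            }
        }
    }
}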

6.3.2 Preliminary Evaluation

In terms of computation times, using Machine 1, the average number of concepts processed (i.e., determined to be affected or not) per minute by each diff function is shown in Table 6.6.² For a pair of NCIt versions released between July 2005 and August 2006, the average time to compute affected concepts and their respective witnesses ranges from seconds for AtDiff(O1, O2)Σ, to ≈30 minutes for Cex1Diff(O1, O2)Σ, Cex2Diff(O1, O2)Σ, and Sub-AT(O1, O2)Σ, to days for GrDiff(O1, O2)Σ and CvsDiff(O1, O2)Σ.

² Note that, originally, CvsDiff(O1, O2)Σ only computes AT^L(O1, O2)Σ, but in order to provide a direct comparison with our diff functions we also compute AT^R(O1, O2)Σ according to the Gcvs entailment grammar.


Table 6.6: Number of concepts processed per minute by each diff function Φ.

                      Φ = Cex1  Φ = Cex2  Φ = At   Φ = Sub  Φ = Cvs  Φ = Gr
#Concepts per minute     151       143    13,547      127       58      50

Table 6.6 shows that all but AtDiff(O1, O2)Σ are rather computationally expensive, especially GrDiff(O1, O2)Σ and CvsDiff(O1, O2)Σ, given the high number of witnesses constructed from their respective entailment grammars. Indeed, we are unable to fully compute either of those diffs for two versions of the NCIt from 2005, containing ≈65,000 axioms, ≈45,000 atomic concepts, and 103 atomic roles. The sheer number of candidate witnesses very quickly consumes the entire 30GB of RAM dedicated to the JVM (in Machine 1), and within an hour the process terminates in error due to insufficient memory.

6.4 Empirical Evaluation

In the previous section we discussed the performance of our implementations. This section now aims at demonstrating the quality of our concept-based diff notions. We first do this by means of a case study of the NCIt corpus in Section 6.4.1. In particular, we set out to determine whether the distinction between direct changes that users should attend to, and indirect changes derived from direct ones, gives a sensible enough structuring of the change set to yield a reduction of the cognitive load associated with term change analysis. Subsequently, in Section 6.4.2, we use a particular comparison instance to demonstrate, via a walkthrough, how users may interpret and benefit from the presented diff notions.

6.4.1 Case Study

The object of our case study is a subset of the NCIt corpus, specifically 14 versions (from release 05.06f to 06.08d) that contain concept-based change logs. The NCIt versions considered range from ≈70,000 to ≈85,000 logical axioms, and from ≈43,000 to ≈57,000 atomic concepts. In order to investigate the applicability of our approach we (1) compare the results obtained via our approximations with those output by Cex1Diff(O1, O2)Σ, Cex2Diff(O1, O2)Σ and CvsDiff(O1, O2)Σ, (2) compare the number of (purely) directly and indirectly affected concepts, and, finally, (3) inspect whether the devised approximations capture changes not reported in the NCIt change logs. To start with, we perform consecutive, pairwise comparisons between selected versions, and present the output of the diff functions in Sections 6.4.1.1 and 6.4.1.2, followed by the NCIt change log analysis in Section 6.4.1.3. Due to computational issues regarding GrDiff(O1, O2)Σ and CvsDiff(O1, O2)Σ, instead of comparing each pair of NCIt versions with respect to Σu, we take a random sample of the terms in Σu (generally n ≈ 1800), such that a straightforward extrapolation allows us to determine that the true proportion of changed terms in Σu lies within a ±3% confidence interval at a 99% confidence level.
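Although the calculation is not spelled out above, the reported sample size is consistent with the standard formula for estimating a proportion, taking a 99% confidence level (z ≈ 2.576), a ±3% margin of error e, and the conservative choice p = 0.5:

n = z^2 · p(1 − p) / e^2 = (2.576^2 × 0.25) / 0.03^2 ≈ 1844

A finite-population correction over the ≈43,000 to ≈57,000 atomic concepts per version brings this down to roughly 1,780, in line with the n ≈ 1800 used here.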

6.4.1.1 Diff Function Comparison

For the purpose of comparing the discussed diff functions, we collect the set of affected concepts detected by each of them. In addition to these functions, we introduce a new one in order to verify how much more change a combination of diff functions would detect: Un-AT, which is the combination of CEX and Sub-AT, i.e., Un-AT := {Cex1-AT ∪ Cex2-AT ∪ Sub-AT}. The comparison of the total number of affected atomic concepts found is presented in Table 6.7. Note that the values presented for Gr-AT are extrapolated from the set of affected atomic concepts found within the signature sample. The absolute number of concepts in Gr-AT, as well as Cvs-AT, is shown in Figure 6.1.

In general, GrDiff(O1, O2)Σ, even taking into account the confidence interval, consistently detects more affected concepts (both specialised and generalised) than all other diffs. The CEX-based approximation Cex1Diff(O1, O2)Σ performs poorly across the board, consistently capturing fewer affected concepts than even a comparison of atomic subsumptions. The second CEX-based approximation Cex2Diff(O1, O2)Σ, however, typically detects more affected terms than AtDiff(O1, O2)Σ, apart from two cases (d6 and d10), but still fewer than SubDiff(O1, O2)Σ. Regardless of this result, it is not the case that SubDiff(O1, O2)Σ is always better than Cex2Diff(O1, O2)Σ, as the latter actually detects more generalised atomic concepts than SubDiff(O1, O2)Σ in all but one case. The gathered evidence suggests that combining these approaches would indeed result in a solution preferable to either one alone, as exhibited by the higher average coverage of 59% in Un-AT (although this is only an 8% increase

Table 6.7: Number of affected atomic concepts found by each diff function for Σ := Σu, and their respective coverage w.r.t. Gr-AT(Oi, Oi+1)Σ.

Comparison  Cex1-AT  Cex2-AT  At-AT   Sub-AT  Un-AT   Gr-AT
d1           1,134    1,922   1,416    2,131   3,311  43,096
d2             877    1,746   1,208    1,816   3,307  43,928
d3           5,415    6,287   6,135    6,528   8,818  45,639
d4           2,145    6,198   3,676   45,932  45,932  46,929
d5           3,964    7,656   4,978   15,691  15,758  48,075
d6           2,298    3,718   3,923    6,203   8,570  48,629
d7           1,893    3,393   3,217    6,330   7,508  49,189
d8           6,387    7,397   6,806    7,428   8,957  54,870
d9           1,655    4,460   2,745    5,329   6,913  55,555
d10          1,512    3,681   4,553    6,415   8,147  55,948
d11          1,102    3,026   1,714    4,325   5,916  57,036
Avg. Cov.      18%      23%     27%      55%     59%
Min. Cov.       3%       8%      5%      18%     21%
Max. Cov.      47%      49%     52%     100%    100%

Figure 6.1: Comparison of the number of specialised concepts found by CvsDiff(O1, O2)Σ and GrDiff(O1, O2)Σ within the signature samples of the NCIt (y-axis: number of atomic concepts; x-axis: comparison identifier).

with respect to the average coverage of Sub-AT). As expected, GrDiff(O1, O2)Σ captures more affected concepts than CvsDiff(O1, O2)Σ in all cases, as evidenced in Figure 6.1.³

6.4.1.2 Splitting Direct and Indirect Changes

With the results of each diff function at hand, that is, the set of affected concepts and, for each of these, the corresponding set of witnesses, we now inspect the results of distinguishing those atomic concepts that are directly, indirectly, or both directly and indirectly affected. Figure 6.2 shows the total number of purely direct, purely indirect, and both directly and indirectly affected concepts found within At-AT and Sub-AT. Note that the number of purely direct, indirect, or mixed concepts found via SubDiff(O1, O2)Σ can be smaller than via AtDiff(O1, O2)Σ, which happens in d3 and d4. For these particular cases, we bring to the foreground the smaller value (i.e., that of SubDiff(O1, O2)Σ), and the value of AtDiff(O1, O2)Σ becomes the increment. Also, this figure presents the total number of changes, i.e., the union of AT^L(O1, O2)Σ and AT^R(O1, O2)Σ.

In general, the number of purely directly changed concepts is much smaller than the number of purely indirect or mixed ones. One case is particularly surprising: in d4 the set Sub-AT contains 43,326 purely indirect changes, and only 1,122 purely direct ones. The former amount to a large number of changes that one can refrain from inspecting, seeing as they are a consequence of the significantly smaller set of (purely) direct changes. In an ontology engineering scenario, where one or more people need to analyse such change sets, having this mechanism for isolating the changes of most interest is conceivably a preferable means to analyse a change set, in addition to providing a basis for producing more intelligible change logs.

6.4.1.3 Change Log Analysis

The change logs supplied with each version of the NCIt contain atomic concepts that were subject to changes according to the edit history. The types of changes reported in these logs are the following:

create: A new atomic concept is introduced.

modify: The URI of an atomic concept has changed.

retire: An atomic concept is relieved of its superconcepts, and declared a subsumee of Retired Kind.

merge: Two atomic concepts are merged into a single one, where one is retired while the other remains.

split: A single atomic concept is split into two, where one remains and the other is created.

³ Note that, since Gr-AT^R(O1, O2)Σ = Cvs-AT^R(O1, O2)Σ, we only present in Figure 6.1 the results of Gr-AT^L(O1, O2)Σ and Cvs-AT^L(O1, O2)Σ.

Figure 6.2: Comparison of purely directly (“P.D.”), purely indirectly (“P.I.”), and both directly and indirectly (“Mix”) affected concepts found within At-AT(O1, O2)Σ (denoted “At”) and Sub-AT(O1, O2)Σ (denoted “Sub”) in NCIt versions (y-axis: number of atomic concepts; x-axis: comparison identifier).

However, it is unclear whether each reported concept change is associated with any relevant axiom changes; it could be the case that a reported concept change is purely involved in ineffectual changes. In spite of this ambiguity, it should be expected that a change log contains at least those atomic concepts that are directly changed, and this is what we aim to find out in our next experiment: we extract the atomic concepts mentioned in the change log, and verify whether the obtained direct changes for each NCIt comparison are contained in said change logs. The results are shown in Table 6.8, comparing the number of directly affected atomic concepts found within At-AT(O1, O2)Σ and Sub-AT(O1, O2)Σ, and how many of those are not present in the NCIt change logs. Overall, we determined that the change logs are missing many direct changes. More specifically, on average, At-AT(O1, O2)Σ contains 767 directly affected atomic concepts not mentioned in the change logs, while Sub-AT(O1, O2)Σ contains 908 such atomic concepts per NCIt comparison.

Table 6.8: Number of directly affected concepts (1) in AT^L(O1, O2)Σ (denoted “L”), (2) in AT^R(O1, O2)Σ (denoted “R”), (3) in the union of those two sets (denoted “Total”), and (4) that do not appear in the NCIt change logs (denoted “Missed”), found by AtDiff(O1, O2)Σ and SubDiff(O1, O2)Σ for Σ := Σu.

NCIt            At-AT(O1, O2)Σ                     Sub-AT(O1, O2)Σ
diff       L      R    Total   Missed         L      R    Total   Missed
d1       646    294      896      798       820    298    1,060      953
d2       565    274      772      149     1,147    294    1,298      211
d3     2,321    891    2,991      315     2,791    898    3,090      445
d4     1,624  1,187    2,683      190     2,725  1,198    2,814      432
d5     1,555  1,009    2,465      243     8,038  1,186    9,142      317
d6       890    385    1,130      199     1,306    401    1,485      199
d7     1,190    704    1,637      273     2,720    780    2,935      511
d8     6,075  1,421    6,389    5,546     6,411  1,465    6,693    5,723
d9     1,481    420    1,766      207     2,607    478    2,782      322
d10    3,321    370    3,579      216     4,964    427    5,217      298
d11      753    378    1,043      300     1,404    472    1,643      582
Total 20,421  7,333   25,351    8,436    34,933  7,897   38,159    9,993

Subsequently we verify how many of the affected concepts in Cex1-AT(O1, O2)Σ, Cex2-AT(O1, O2)Σ, At-AT(O1, O2)Σ and Sub-AT(O1, O2)Σ are actually contained in the NCIt change logs. The results of this verification are presented in Table 6.9. Overall we see that none of the diffs captures the exact number of reported concept changes in the logs. The maximum coverage of the change log occurs in comparisons d4 and d5, where Sub-AT(O1, O2)Σ captures 96% and 91% of the atomic concepts mentioned in the logs, respectively. By taking the union of affected concepts found by the CEX-based approximations and Sub-AT(O1, O2)Σ, the average coverage of the change logs increases to 73%.

Table 6.9: Number of affected atomic concepts, AT(Oi, Oi+1)Σ, found by each diff function (in addition to Un-AT := {Cex1-AT ∪ Cex2-AT ∪ Sub-AT}) for Σ := Σu within the NCIt change logs.

NCIt     Change   Cex1-AT  Cex2-AT  At-AT   Sub-AT  Un-AT
diff     log
d1        2,159      107      168     103      126     269
d2        1,399      520      773     725      974   1,013
d3        4,234    2,497    2,973   3,102    3,148   3,150
d4        8,447    1,327    1,598   2,734    8,117   8,117
d5        3,847    1,595    2,655   2,602    3,503   3,504
d6        2,470      866    1,147   1,141    1,312   1,406
d7        5,302    1,217    1,253   1,982    2,668   2,699
d8        2,556      688      885     875      993   1,003
d9        3,945    1,060    2,205   1,878    2,530   2,755
d10       6,046      978    3,824   3,551    4,076   6,046
d11       2,065      628      764     853    1,091   1,168
Avg. Coverage        27%      43%     46%      67%     73%

Regarding the atomic concepts in the logs that the union of Cex1-AT(O1, O2)Σ and SubDiff(O1, O2)Σ diffs does not contain (total of 19,781, amounting to 47% of all atomic concepts), we hypothesise that one or more of the following applies: the missed concept a) represents a change not captured by either diff (since they are all incomplete), b) was subject to annotation changes, or c) represents a change that was logically ineffectual. Upon investigating whether there are annotation changes related to each atomic concept, we found that 72% (14,217) of missed changes had indeed been annotation changes, i.e., they may be related to annotations only. If indeed it is the case that annotation changes suffice to include a concept in the change log, then the number of missed changes is reduced to 13% of all the change logs.

6.4.2 NCIt Concept Diff Walkthrough

In this section we select one of the diffs performed in Section 6.4.1, and analyse the result of the comparison in more detail from a user perspective.

Specifically, we consider the diff d4, and the diff notions AtDiff(O1, O2)Σ and SubDiff(O1, O2)Σ. The coarse-grained output of each diff is shown in Table 6.10, where AtDiff(O1, O2)Σ finds no more than 8% of the affected concepts detected by SubDiff(O1, O2)Σ. While the latter only finds 10 more generalised concepts than the former, it detects 43,467 more specialised concepts. Immediately one sees that a high number of concepts is affected via complex concepts, rather than atomic ones. Note that there are no global changes, nor changes involving unsatisfiable concepts, among these versions; therefore AT^⊤(O1, O2)Σ and AT^⊥(O1, O2)Σ are both empty. Looking at the sets of generalisations, despite their size it seems somewhat feasible (though it would take a long time) for a single person to inspect all those changes, for either diff notion. When it comes to specialisations, however, particularly using SubDiff(O1, O2)Σ, analysing such a change set without any kind of structure would certainly be challenging.

Table 6.10: Affected concepts (specialised, generalised and total) between O1 and O2 according to the mentioned diff notions.

Change kind               Φ = At   Φ = Sub
|Φ-AT^⊤(O1, O2)Σ|              0         0
|Φ-AT^⊥(O1, O2)Σ|              0         0
|Φ-AT^L(O1, O2)Σ|          2,358    45,825
|Φ-AT^R(O1, O2)Σ|          1,485     1,495
|Φ-AT(O1, O2)Σ|            3,676    45,932

By applying the distinction between directly and indirectly affected concepts, we get the sets shown in Table 6.11 for AtDiff(O1, O2)Σ, and in Table 6.12 for SubDiff(O1, O2)Σ. Notice in both tables how few purely direct specialisations there are in comparison with indirect ones.

Table 6.11: Number of changes in At-AT(O1, O2)Σ according to impact.

Change set       Direct  Indirect  Mixed  Purely direct  Purely indirect
AT^L(O1, O2)Σ     1,624     2,232  1,498            126              734
AT^R(O1, O2)Σ     1,187       444    146          1,041              298

Table 6.12: Number of changes in Sub-AT(O1, O2)Σ according to impact.

Change set       Direct  Indirect  Mixed  Purely direct  Purely indirect
AT^L(O1, O2)Σ     2,725    45,747  2,647             78           43,100
AT^R(O1, O2)Σ     1,198       445    148          1,050              297

A common occurrence in the sets of specialisations of both diffs is the introduction of mediator concepts, for example the concept D as follows:

O1 := {A v B, B v C, C v E}
O2 := {A v B, B v C, C v D, D v E}

In this case, C is directly specialised via the witness axiom α : C v D, while A and B are purely indirectly specialised. The impact of changes such as α trickles down the concept hierarchy, causing other concepts to be indirectly specialised. Between these ontologies we also have that D is directly generalised via C, and indirectly through A and B. From this example, notice how a concept X can only be purely directly generalised when the witness W is a leaf concept; otherwise there will be other subconcepts of W via which X is indirectly generalised.

Inspecting the output of SubDiff(O1, O2)Σ, many of the purely indirect specialisations are due to top-level atomic concepts having new role restrictions, including some with newly introduced atomic roles. So on top of the mediator concepts discussed before, here we also have changes at the top level of the ontology, which cause changes throughout the hierarchy.

6.5 Conclusions

Based on the results presented in the previous section, one of the main remarks is that, among the least complete approximations, Sub-AT(O1, O2)Σ finds more affected concepts than At-AT(O1, O2)Σ, Cex1-AT(O1, O2)Σ, and Cex2-AT(O1, O2)Σ, while often not reaching close to the projected values of the more complete Gr-AT(O1, O2)Σ (the average coverage being 55%). Indeed, the diff function SubDiff(O1, O2)Σ finds many differences that would not show up if we restricted ourselves to either atomic subsumptions or specific forms of entailments, in the manner of CvsDiff(O1, O2)Σ.

The GrDiff(O1, O2)Σ function detects more specialised concepts within the selected signatures than CvsDiff(O1, O2)Σ, while the number of generalised concepts is the same for both diffs (i.e., over the full signature). Considering the high number of affected concepts in Sub-AT(O1, O2)Σ on comparisons d4 and d5 of the NCIt, one can argue that analysing such a change set without any sensible structuring would be difficult. By categorising atomic concepts in the change set according to whether they are directly or indirectly affected, we get a more succinct representation of the change set, thus significantly reducing information overload.

Note that in d4 there are 45,825 specialised concepts, out of which only 78 are purely directly specialised; the majority of the remainder are purely indirectly specialised concepts (43,100). Similarly in d5, of 15,254 specialised concepts there are only 1,527 purely direct specialisations. Immediately we see that this mechanism can provide an especially helpful means to (1) assist change analysis, for instance by confining the changes shown upfront to only (purely) direct ones, and (2) generate more informative concept-based change logs. While the number of affected concepts found, even when restricted to (purely) direct ones, is rather high in some cases, this is somewhat expected from an ontology of such magnitude as the NCIt; bear in mind that teams of people work on this ontology, hence change analysis and quality assurance are unlikely to be carried out by a single user. When a high number of changes is detected, it might be useful to provide post-diff filtering mechanisms in order to minimise the cognitive load, for example, by showing only changes concerning specific terms or an entire subtree of the ontology chosen by the user.

In summary, we have formulated the problem of finding the set of affected terms between ontologies via model inseparability, and presented feasible approximations to finding this set. We have shown that each of the approximations can find considerably more changes than those visible in a comparison of concept hierarchies, and that both sound approximations devised capture more changes than the CEX-based approximations. Regarding the latter, the restrictions imposed by CEX on the input ontologies make change-preserving approximations a challenge, as we have seen in our attempt to reduce the NCIt to EL in a less naive way.

The proposed distinction between (purely) direct and indirect changes allows users to focus on those changes which are specific to a given concept, in addition to masking possibly uninteresting changes to any and all atomic concepts (such as those obtained via witnesses constructed with negation and disjunction), thereby making change analysis more straightforward. As demonstrated by the NCIt change log analysis, we have found an often high number of direct changes that are not contained in the NCIt change logs, which leads us to believe that the logging of changes does not follow from even a basic concept hierarchy comparison, but rather from a seemingly ad hoc mechanism.

Chapter 7

Term and Axiom Change Alignment

In Chapters 5 and 6 we presented axiom-based and term-based diff notions. The problem we address in this chapter is the combination of these notions; that is, for a given term change, determine which axiom changes gave rise to that change and, dually, for a given axiom change, determine the terms affected by that change. In order to achieve this alignment between term and axiom changes, we design a mechanism, specified in Section 7.2, that ties together the change dimensions discussed. The change alignment algorithm is described in Section 7.3.1 and, together with all presented diff notions, implemented in a diff tool named ecco that aligns (a) axiom changes according to their categorisation, and (b) axiom changes with the terms they affect (directly or indirectly), and vice versa. The ecco tool is further discussed in Section 7.3.2.

7.1 Motivation

Thus far we have specified and evaluated diff methods targeted at either asserted axioms or terms (and consequently inferred axioms), but have yet to combine these change dimensions in a practical way. While users could use solely axiom- or term-based diff, by focusing on a single change form one could potentially miss useful information. Indeed, it may happen that users fail to correctly interpret the given change set because of such missing details, especially seeing as these change dimensions are so closely related.

With this in mind, we design an alignment mechanism that takes into account terms and (asserted and inferred) axioms to produce a unified change set. In particular, given a set of axiom changes, it would be useful to identify the set of terms each axiom (directly or indirectly) affects, and vice versa for term changes, i.e., for each affected term determine which asserted axiom(s) gave rise to that change. Additionally, one might also be interested in knowing why certain asserted axioms give rise to term changes, in which case it suffices to present the (preferably direct) witness axioms for the term change.

7.2 Specification

In order to align axiom and concept changes, we present a method that relies on justifications for entailments in the difference (i.e., witness axioms for concept changes) to pinpoint axioms that possibly gave rise to a concept change. Once we have the justifications it suffices to single out those axioms that are effectual changes in the diff between ontologies, i.e., axioms in EffAdds(O1, O2) or EffRems(O1, O2). Our method is specified in Section 7.2.1, followed by an example walkthrough using toy ontologies, in Section 7.2.2.

7.2.1 Aligning Changes

Once the set of affected terms is determined and axiom changes are distinguished between effectual and ineffectual, we can then align these change types according to Definition 11. Specifically, we determine which axiom changes have an impact on which concepts by taking into account witness axioms for the concept changes and their justifications.

Definition 11. Let Φ be a concept diff function that returns a set of witness axioms Φ diff(O1, O2)Σ.

An axiom α is said to have a Φ-impact on an atomic concept A, denoted α →Φ A, if α ∈ EffAdds(O1, O2) and there exists a witness β ∈ Φ diff(O1, O2)Σ for A and a justification J ⊆ O2 for β with α ∈ J.

Analogously, an axiom α is said to have a Φ-impact on an atomic concept A, also denoted α →Φ A, if α ∈ EffRems(O1, O2) and there exists a witness β ∈ Φ diff(O1, O2)Σ for A and a justification J ⊆ O1 for β with α ∈ J.

If β is a witness to a direct change in A, then α is said to have a direct Φ-impact on A, denoted α →dΦ A. Otherwise, if β is a witness to an indirect change in A, then α is said to have an indirect Φ-impact on A, denoted α →iΦ A.

Note that there may be more than one effectual axiom change in the justification for a change to A.

7.2.2 Example Walkthrough

In order to demonstrate our alignment method, we carry out an example walkthrough using the ontologies defined in Table 7.1. Our walkthrough uses solely the diff functions presented in Chapter 6. The ontologies used in this example are also used for the walkthrough of the diff tool ecco, in Section 7.3.2.

Table 7.1: Example ontologies O1 and O2.

O1                      O2
α1 : A v B              β1 : A v B
α2 : B v C              β2 : B v C u F
α3 : C v ∃r.X           β3 : C v ∃r.X
α4 : ∃r.X v ∃r.Y        β4 : X v D
α5 : X v D u E          β5 : F v ∃r.Y u G
α6 : F v ∃r.Y

To start with, we compute the sets of effectual additions and removals between O1 and O2, resulting in the following sets:

• EffAdds(O1, O2) = {β2, β5}

• EffRems(O1, O2) = {α4, α5}

Next, we apply all of our concept-based diff functions to the example ontologies, and align the concept changes found with the effectual axiom changes detected. The output of this is shown in Tables 7.2 and 7.3. Since the output of SubDiff(O1, O2)Σ is a superset of that of AtDiff(O1, O2)Σ, in Table 7.3 we omit the concept changes already displayed in Table 7.2. There may be repeated mappings in the change alignment; for example, in Table 7.2 the witness axiom B v G has a direct alignment with β2 and an indirect one via β5, where the first

Table 7.2: Affected concepts in At-AT(O1, O2)Σ with corresponding witness axioms and justifications.

             Affected  Effect               Witness  Justification(s)  Axiom
             concept                        axioms                     alignment
Specialised  A         Gained superconcept  A v F    {β1, β2}          {β2} →i A
                                            A v G    {β1, β2, β5}      {β5} →i A
             B         Gained superconcept  B v F    {β2}              {β2} →d B
                                            B v G    {β2, β5}          {β5} →i B
             X         Lost superconcept    X v E    {α5}              {α5} →d X
Generalised  G         Gained subconcept    F v G    {β5}              {β5} →d G
                                            B v G    {β2, β5}          {β2} →i G
                                            A v G    {β1, β2, β5}      — ” —
             F         Gained subconcept    B v F    {β2}              {β2} →d F
                                            A v F    {β1, β2}          — ” —
             E         Lost subconcept      X v E    {α5}              {α5} →d E

Table 7.3: Additional affected concept in Sub-AT(O1, O2)Σ with corresponding witness axiom and justification.

Affected concept  Effect             Witness axiom(s)  Justification(s)  Axiom alignment
C                 Lost superconcept  C v ∃r.Y          {α3, α4}          {α4} →d C

is also an alignment for the witness axiom B v F. For presentational purposes, repeated mappings are omitted where clarity is compromised.

The SubDiff(O1, O2)Σ function detects an additional affected concept C (via witness ∃r.Y) due to the removal of axiom α4. Seeing as the change is not mediated by another atomic concept, this axiom has a direct impact on C. Both the GrDiff(O1, O2)Σ and CvsDiff(O1, O2)Σ functions find no more affected concepts than SubDiff(O1, O2)Σ, only additional witnesses.¹ Therefore these are not explicitly included in the example.

Observe how the alignment makes explicit the (possibly, for some users, unexpected) effect of axioms on terms. For instance, β2 affects four concepts: the (direct) effect of β2 on B and F can be extrapolated directly from the axiom and its alignment with the weaker axiom α6, but the (indirect) effect on A and G would require previous knowledge of the ontologies, or manually inspecting them. Furthermore, if users are interested in how changes affect query results, then inspecting the affected concepts and their corresponding witnesses allows one to immediately see how the results might change. For instance, if one queries O2 for atomic subconcepts of G, the result will include three answers that would not be returned by querying O1: A, B, and F. Similarly, by using the SubDiff(O1, O2)Σ function, we have in O2 an answer to a query for subconcepts of ∃r.Y, namely C, but get none in O1.

7.3 Implementation

The algorithm for computing the change alignment is presented and discussed in Section 7.3.1. Further on, in Section 7.3.2, we present the diff system ecco, which implements and brings together all diff notions presented thus far. The tool is open-source, and available online at http://github.com/rsgoncalves/ecco.

7.3.1 Algorithms

The implementation of the change alignment is outlined in Algorithm 8, where we assume the set of affected terms AT(O1, O2) has been computed, and a map Z of all such terms to their respective witness axioms has been produced. Another precondition is the set of effectual additions EffAdds(O1, O2) (respectively EffRems(O1, O2) for aligning removals). In Algorithm 8 we iterate through each affected concept C, and compute the justifications of all witness axioms for C. Subsequently, we check which justifications contain effectual changes and, in those that do, we add for each effectual change α the mapping (α → C) to the map of aligned changes.

¹ Though in Chapter 6 we found that often, but not always, there is a significant difference in the number of affected concepts detected by the various diff functions.

Algorithm 8 alignChanges

Input: Ontology O2, map Z of affected concepts to their witness axioms, set of effectual additions EffAdds(O1, O2)
Output: Map of axiom changes to the concepts these affect
1: alignmentMap ← ∅
2: for concept C ∈ Z do
3:     for witness axiom w s.t. (C → {w}) ∈ Z do
4:         for justification J ∈ Justs(w, O2) do
5:             if J ∩ EffAdds(O1, O2) ≠ ∅ then
6:                 for α ∈ J ∩ EffAdds(O1, O2) do
7:                     Add (α → C) to alignmentMap
8:                 end for
9:             end if
10:        end for
11:    end for
12: end for
13: return alignmentMap

Algorithm 8, as described, is a post-process of the axiom and term diff computation, since it needs (some) output from both. However, it could easily be interleaved within the term diff computation: whenever a witness axiom w for a change to A is found, compute the justifications for w and immediately map any effectual changes therein to A, essentially skipping the first two ‘for’ loops in Algorithm 8.
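A direct Java rendering of Algorithm 8 might look as follows. The justification computation is abstracted behind a function argument, since any off-the-shelf explanation generator can be plugged in; all names here are our own illustration:

import java.util.*;
import java.util.function.Function;
import org.semanticweb.owlapi.model.*;

public class ChangeAligner {
    // Algorithm 8: map each effectual addition to the concepts it impacts.
    public static Map<OWLAxiom, Set<OWLClass>> alignChanges(
            Map<OWLClass, Set<OWLAxiom>> witnessMap,               // Z: concept ↦ witness axioms
            Set<OWLAxiom> effectualAdds,                           // EffAdds(O1, O2)
            Function<OWLAxiom, Set<Set<OWLAxiom>>> justifications) // w ↦ Justs(w, O2)
    {
        Map<OWLAxiom, Set<OWLClass>> alignmentMap = new HashMap<>();
        for (Map.Entry<OWLClass, Set<OWLAxiom>> entry : witnessMap.entrySet()) {
            for (OWLAxiom witness : entry.getValue()) {
                for (Set<OWLAxiom> just : justifications.apply(witness)) {
                    for (OWLAxiom alpha : just) {
                        if (effectualAdds.contains(alpha)) {       // effectual change in J
                            alignmentMap.computeIfAbsent(alpha, k -> new HashSet<>())
                                        .add(entry.getKey());      // record (α → C)
                        }
                    }
                }
            }
        }
        return alignmentMap;
    }
}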

7.3.2 ecco: A Diff Tool for OWL 2 Ontologies

The ecco diff tool contains our implementation of both the axiom and term diff algorithms, together with the change alignment described in Algorithm 8. It is implemented in Java, and made available as a command-line tool with advanced features, as well as a Web-based application. The latter is distributed online at http://github.com/rsgoncalves/ecco-webui.²

2A demonstration instance is located at http://owl.cs.manchester.ac.uk/diff. 7.3 Implementation 117 to inspect. By default, the Web front end of ecco categorises asserted axioms between ontologies, and aligns them with affected concepts detected via a diff function chosen by the user (by default AtDiff(O1, O2)Σ).

Figure 7.1: Entry point to ecco on the Web.

The command-line version of ecco gives users more flexibility in what exactly is computed; it can be used as follows:

[ecco] -ont1 [ontology] -ont2 [ontology] [options]

[ecco]      In Windows: use ecco.bat; in UNIX-based systems: use ecco.sh
[ontology]  Input ontology file path or URL
[options]
    -o      Output directory for generated files
    -t      Transform XML diff report into HTML
    -c      Compute one of: [ at | sub | gr | cvs ] concept diff
    -r      Analyse root ontologies only, i.e., ignore imports
    -n      Normalise URIs; establish a common namespace for all terms
    -x      Absolute file path of XSLT file
    -i      Ignore ABox axioms
    -j      Maximum number of justifications per change
    -v      Verbose mode
    -h      Print help message
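By way of illustration, a comparison using SubDiff(O1, O2)Σ that also produces the HTML report in a chosen output directory could be invoked as follows (the file names are, of course, illustrative):

./ecco.sh -ont1 ontologies/O1.owl -ont2 ontologies/O2.owl -c sub -t -o diff-report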

The output of ecco is an XML³ change set file containing the asserted axioms in the diff and their respective alignment with terms and witness axioms. Users are given the option of which concept diff notion to use; for example, executing ecco with the option -c sub computes SubDiff(O1, O2)Σ. In order to present the change set in a more sensible way, and allow user interaction with the output, the XML file is transformed into an HTML⁴ document, using Extensible Stylesheet Language (XSL)⁵ Transformations (XSLT).⁶ The resulting HTML file, together with the supplied Cascading Style Sheets (CSS)⁷ and JavaScript⁸ files, produces a hands-on front end to the diff result. The produced website allows users to browse through the change set by focusing on general categories, for example, all additions or only effectual ones, or more specific ones, for example, strengthenings. For each effectual change the terms that it directly affects are shown inline, and witness axioms appear when hovering over each concept. In the following section we demonstrate more in-depth usage of the tool, by means of a walkthrough.

³ http://www.w3.org/TR/xml
⁴ http://www.w3.org/TR/html
⁵ http://www.w3.org/TR/xsl
⁶ http://www.w3.org/TR/xslt
⁷ http://www.w3.org/TR/CSS
⁸ https://developer.mozilla.org/en/JavaScript

7.4 Tool Walkthrough

The walkthrough of the tool uses the example ontologies in Table 7.1 of Section 7.2.2. After executing the diff between the mentioned ontologies, the first visible item in the resulting webpage is a change summary table, shown in Figure 7.2 with the categories collapsed, and in Figure 7.3 with expanded categories. The top-level links allow users to, respectively, get the source XML file, show or hide all changes, and generate a permanent link (“permalink”) of the categories currently visible, which, together with the website, can be applied later to re-trigger the categories that were logged in the link.

Figure 7.2: Summary of changes between O1 and O2.

Just below the links there are buttons that toggle whether axioms should be shown using term names, term labels (rdfs:label), or automatically generated symbols (“gensyms”). The latter can be used to mask term names by replacing them with shorter symbols, which, in cases where term names are too long, reduces the amount of on-screen information, making pattern analysis easier. In the change summary we present a hierarchical structure of the axiom categories, and the number of changes in each of them. Additionally, there are “information” buttons next to each category that explain (when hovered over) what that category represents. Below the change summary there are links to expand or collapse all categories, so users get a more fine-grained view of the kinds of changes detected (as in Figure 7.3). If there are axioms in a category, for example, strengthenings with shared terms, then the category name contains a link to the appropriate axioms. When the category is selected, the relevant changes are shown and the link can be used to navigate to those changes.

The basic layout of changes displays removed axioms on the left-hand side, in red, and added axioms on the right-hand side, in green.

Figure 7.3: Summary of changes between O1 and O2 with categories expanded.

Each change has an associated ID in the XML document, which is shown in the leftmost column. Furthermore, if an effectual category is selected, for instance strengthenings (see Figure 7.4), then the concepts directly affected by each change in that category are shown in the two rightmost columns.⁹

Figure 7.4: Strengthenings between O1 and O2.

The first of these two columns shows concepts that were specialised due to the axiom change (which for an added axiom means those concepts have gained superconcepts), while the second column contains generalised concepts (gained subconcepts). When a concept is hovered over, the witness axioms for the change to that concept are shown, as demonstrated in Figure 7.5.

⁹ As a design choice we only show directly affected concepts, though indirectly affected ones could easily be added, as they are included in the XML change set.

Figure 7.5: Strengthenings between O1 and O2, where witness axioms for concept changes are in focus.

In Figure 7.5 we have the added axiom F v G u ∃r.Y, which is found to specialise F, since O1 ⊭ F v G and, of course, O1 ⊭ F v G u ∃r.Y. Additionally, this added axiom causes G to be generalised.

Moving on to the ineffectual changes, in Figure 7.6 we focus on the only ineffectual additions in the example: new retrospective redundancies. In such cases we present the justifications for each change (note that there can be more than one justification, so these are numbered). Furthermore, the tool flags those axioms in the justifications that are shared, effectual, or ineffectual changes between both ontologies; for instance, the single axiom in the justification shown in Figure 7.6 is an effectual removal, having a superscript “[e]” which, when hovered over, informs the user that it is effectual.

Figure 7.6: Ineffectual additions between O1 and O2.

Finally, in any category involving new or retired terms, such as the strengthening with new terms shown in Figure 7.5, these terms are appropriately enumerated below the axiom.

7.5 Conclusions

Based on the alignment method presented, we effectively combine the three dimensions of change discussed in Section 3.1, that is, inferred axiom, asserted axiom, and term changes, to produce a structured, multi-dimensional diff report: asserted axioms are aligned with each other based on the categorisation presented in Chapter 5; term changes, which can be computed according to the concept diffs presented in Chapter 6, are aligned with axiom changes; and, finally, inferred axiom changes (i.e., witness axioms) are aligned with the concepts whose change they “witness”. In cases where the number of detected changes is too high, one could rely on post-diff filters to allow users to focus on changes involving particular (sets of) terms. For example, a user concerned with modelling rare disorders might not want to visualise changes to drugs and chemicals, so by specifying rare disorders as the subdomain of interest one could reduce the on-screen information to only that concerning rare disorders.

The implementation of the diff tool ecco, discussed in Section 7.3.2, strictly follows the design just mentioned. Using Web-related technologies to produce and facilitate interaction with its output, we can easily deploy ecco on the Web, as well as integrate it into other Web services. Additionally, the resulting website, together with the XML change set, can easily be serialised and shared. In particular, the output of the tool includes a feature to generate a “permanent-view” link, which essentially logs the items (i.e., axiom categories) currently in view, allowing users to share the set of changes currently under analysis. Overall, the difference detection engine implemented in ecco advances the state of the art in ontology diff by:

1. Aligning axiom changes between each other

2. Computing term differences based on entailment differences built from more expressive or cheaper finite grammars

3. Distinguishing between terms that are directly and/or indirectly affected

4. Aligning axiom and term changes

Seeing as currently little to no effort is made to characterise change sets at such a fine level of granularity, the evaluation of these components would, in principle, solely revolve around usability aspects rather than conceptual ones.

Point (2) above, however, raises the question of whether users find our witness forms intuitive and useful. If analysing differences between atomic subsumptions is straightforward for an ontology engineer, one can argue that the same holds for, at least, SubDiff(O1, O2)Σ, since it only uses explicitly asserted (possibly complex) concepts. In particular, when users are familiar with the ontology, this diff function is ideal for retrieving more complete results at a low performance cost, whilst keeping the cognitive overhead to a minimum. Now GrDiff(O1, O2)Σ, the most expressive diff function, can potentially generate far more complex concepts than those asserted in either ontology. However, note that our goal is fundamentally to find affected concepts; witness axioms are the means to uncover those, and so, regardless of their form, detecting more affected concepts than a less expressive function is a positive outcome. Although it has a higher performance cost than SubDiff(O1, O2)Σ, we expect that for small to medium ontologies the performance overhead of computing GrDiff(O1, O2)Σ will not be significant.

The ecco tool has been employed by an NCIt biomedical informatics specialist who has been involved in the development of the ontology in OWL since 2003. The tool was used to analyse certain unusual cases in the NCIt lifetime, such as incoherent versions, and the outcome of these comparisons was discussed internally among other NCIt developers. It was reported that the alignment of axioms with affected terms helped them to understand, for example, the appearance of ≈20,000 unsatisfiable atomic concepts in NCIt versions v58 and v59 (as discussed in Section 3.2.2.3) following a batch of changes; by inspecting the axioms that were aligned with the unsatisfiable atomic concepts (and their corresponding entailment changes), the developer(s) were able to recognise the problematic axiom changes. Without resorting to our alignment of changes, an alternative solution for identifying the axiom changes that rendered these versions incoherent would require computing justifications for entailments of the form A v ⊥, for each unsatisfiable atomic concept A, and then manually co-relating the axioms in each justification with axiom changes. The tool also helped NCIt developers become aware of the multiple occurrences of redundant changes throughout the corpus.

Additionally, a bioinformatician with substantial experience in developing ontologies, and who regularly collaborates with other ontology developers and users, has found the ecco tool to be useful and practical enough to integrate it in an (internal) ontology versioning system. The ontology engineer mentioned that the category-based axiom alignment is a sensible way to structure the change set, and to drive change analysis among developers. The tool allowed the engineer to flag changes in particular categories for further discussion and analysis by other developers; this was done via the permalink feature of the tool, which was originally requested by this same user. Moreover, the engineer stated that the term-based impact alignment provides a satisfactory means to facilitate the understanding of the impact of axiom changes on terms between ontologies.

In terms of user interface, ecco could benefit from certain presentation features employed in the ContentCVS tool, particularly when presenting entailment and term changes (even though the former are only shown on demand). The conflict resolution feature of ContentCVS presents axioms based on a dependency tree, similar to the way justifications are laid out in Protégé 4. This feature could be adapted and applied to ecco in order to: (a) hierarchically present affected terms, where those terms indirectly affected by some axiom would appear under directly affected terms, and (b) present entailment differences based on the exact same axiom dependency tree.

Chapter 8

GrDiff(O1, O2)Σ, the most expressive diff function, can potentially generate far more complex concepts than those asserted in either ontology. However, note that our goal is fundamentally to find affected concepts; witness axioms are the means to uncover those, and so, regardless of their form, detecting additional affected concepts than a less expressive function is a positive outcome. Although it has a higher performance cost than SubDiff(O1, O2)Σ, we expect that for small to medium ontologies the performance overhead of computing GrDiff(O1, O2)Σ will not be significant. The ecco tool has been employed by an NCIt biomedical informatics specialist who has been involved in the development of the ontology in OWL since 2003. The tool was used to analyse certain unusual cases in the NCIt lifetime such as incoherent versions, and the outcome of these comparisons was discussed inter- nally among other NCIt developers. It was reported that the alignment of axioms with affected terms helped them to understand, for example, the appearance of ≈20,000 unsatisfiable atomic concepts in NCIt versions v58 and v59 (as discussed in Section 3.2.2.3) following a batch of changes; by inspecting the axioms that were aligned with the unsatisfiable atomic concepts (and their corresponding en- tailment changes), the developer(s) were able to recognise the problematic axiom changes. Without resorting to our alignment of changes, an alternative solution for identifying the axiom changes that rendered these versions incoherent would require computing justifications for entailments of the form A v ⊥, for each un- satisfiable atomic concept A, and then manually co-relating the axioms in each justification with axiom changes. The tool also helped NCIt developers become aware of the multiple occurrences of redundant changes throughout the corpus. Additionally, a bioinformatician with substantial experience in developing on- tologies, and who regularly collaborates with other ontology developers and users, has found the ecco tool to be useful and practical enough to integrate it in an (internal) ontology versioning system. The ontology engineer mentioned that the 7.5 Conclusions 124 category-based axiom alignment is a sensible way to structure the change set, and to drive change analysis among developers. The tool allowed the engineer to flag changes in particular categories for further discussion and analysis by other de- velopers; this was done via the permalink feature of the tool, which was originally requested by this same user. Moreover, the engineer stated that the term-based impact alignment provides a satisfactory means to facilitate the understanding of the impact of axiom changes on terms between ontologies. In terms of user interface, ecco could benefit from certain presentation features employed in the ContentCVS tool, particularly when presenting entailment and term changes (even though the former are only shown on demand). The conflict resolution feature of ContentCVS presents axioms based on a dependency tree, similar to the way justifications are laid out in Prot´eg´e4. This feature could be adapted and applied to ecco in order to: (a) hierarchically present affected terms, where those terms indirectly affected by some axiom would appear under directly affected terms, and (b) present entailment differences based on the exact same axiom dependency tree. Chapter 8

Performance Heterogeneity and Homogeneity

Chapters5 and6 addressed the problem of computing axiom and term differences between ontologies, and subsequently Chapter7 covered a method to align these two forms of change. From the goals set out in Chapters1 and3, we have yet to address reasoner performance impact analysis. One of the points mentioned in Chapter4 regarding this topic is that, currently, there exists no mechanism to predict whether an ontology contains performance-degrading elements (i.e., axioms or concepts) for some reasoner. Knowing this would be useful because, if none exist, performing any kind of search for those elements is somewhat futile. In this chapter, we design and evaluate a technique to predict the performance profile of ontology/reasoner combinations, that is, predict whether a given ontol- ogy is likely to contain subsets that, possibly in combination with the remaining axioms, are a performance bottleneck for some reasoner.

8.1 Motivation

Reasoning tasks in expressive description logics, such as that underlying OWL 2 DL, have a high worst case complexity: 2NExpTime for the SROIQ DL [48]. While in practice modern reasoners hardly reach the worst case (largely owing to decades of research into reasoner optimisations), there are still cases for which performance is unacceptable.1 Knowing that seemingly harmless changes to an

1Obviously different reasoners might have different behaviour on the same input, not neces- sarily only dependent on the underlying calculi. 8.1 Motivation 126 ontology can potentially shift reasoning time by orders of magnitude, one way to explain why a reasoner performs poorly on some ontology is to predict whether the ontology contains such performance-altering axioms. In a preliminary study we analysed the striking performance variation of Her- miT between NCIt versions 79 and 80, O79 and O80 respectively; the event marked as B in Figure 3.4 of Section 3.2.2.4. Our goal was to verify whether there existed performance-degrading axioms in O79 (and even earlier versions that have com- parable performance to RT(O79)) which got removed in the subsequent version, seeing as the first ontology O79 takes ≈8 minutes (precisely 8 minutes and 20 sec- onds) to classify, while O80 takes only 1 minute. Because O80 is quickly classified, we assume the performance-degrading changes were removals in diff(O79, O80), i.e., Rem := O79 \O80. We attempt to find these performance-degrading changes as follows: for each α ∈ Rem, if RT(O80 ∪{α})  RT(O80) then add α to a set P.

In order to find a test case that “witnesses” the performance of O79 (that is, an axiom change that might explain the performance variation), we set the minimum time for RT(O80 ∪ {α}) to 8 minutes. This procedure terminated after ≈3 weeks of computation time, returning a set of 13 axioms.2 Curiously, we determined that adding all these axioms to O80 caused the classification time to rise beyond

9 hours, i.e., RT(O80 ∪ P) > 9h. Not only is this surprising all by itself, but it also implies that the remaining axioms in the removals exhibit a protective effect, that is, if we add not only those 13 axioms but all removals to O80, then the classification time is roughly the same as O79, i.e., RT(O80 ∪ Rem) ≈ RT(O79). The problem with such brute-force approaches as the one above is that they are computationally expensive, and indeed, in some cases, it may be wasted ef- fort: if all axioms in an ontology “consume” roughly the same amount of reasoning time, then our approach would exhaust the search space without returning any axioms. So, prior to performing a potentially expensive search for performance- degrading axioms (so-called “hot spots”), it would be sensible to predict whether there might exist hot spots in the first place. That is, we need a means to deter- mine whether a given ontology is likely to contain hot spots for some reasoner, thus called a performance-heterogeneous ontology/reasoner combination. If no hot spots are predicted, then there is little point in performing any search, and the ontology/reasoner combination is said to be performance-homogeneous. In

² Most of these axioms contained universal restrictions, which we found, via diffing, had been converted into existential restrictions in the subsequent version.

In Section 8.2 we formalise this theory of performance heterogeneity and homogeneity, and go on to implement and test it in Sections 8.3 and 8.4, respectively.

8.2 Specification

Our general hypothesis is that, if an ontology/reasoner combination is performance-homogeneous, then we would see a roughly linear correlation between classification time and the size of the ontology’s subsets. This hypothesis arose from the observation that certain ontologies are seemingly hard to classify due to sheer size, and that, in some cases, by extracting a module from such an ontology, the classification time of that module is reduced by a factor comparable to that of the size reduction, suggesting (though in particular cases) a directly proportional relation between the number of axioms and classification time. On the other hand, we expect to find that the performance of certain ontology/reasoner combinations (O, R) is rather fragile, in the sense that the performance of R is sensitive to small changes in O. If there are (significant) discrepancies in the classification times of equal-sized ontology subsets, then the performance of R is indeed (highly) susceptible to changes in O, and (O, R) is performance-heterogeneous. Otherwise, if the performance of R is roughly stable across equal-sized subsets of O, and the overall performance growth of increasingly bigger subsets is roughly linear, then (O, R) is performance-homogeneous.

Definition 12. Given an ontology O and a reasoner R, we say that O is performance-homogeneous for R if there is a linear factor L such that, for all M ⊆ O and k with k · |M| = |O|, we have L · k · RT(M) ≈ RT(O). Otherwise, O is said to be performance-heterogeneous for R.

The naive approach to determining the performance heterogeneity of a given ontology/reasoner combination is to enumerate and test subsets of the ontology. But seeing as realistic ontologies can be large, reaching up to hundreds of thousands of axioms, this approach is obviously infeasible.

Instead, we design a performance-profile detection mechanism that splits the ontology into equal-sized partitions containing randomly selected axioms, and measures the performance of these subsets as increments, building up to the whole ontology. This simulates an ontology evolution scenario, albeit one where all changes are additions. If the performance over increments is highly variable, then we can immediately determine that the given ontology/reasoner combination is performance-heterogeneous.

8.3 Implementation

Given Definition 12, our next step is to present a feasible mechanism to verify whether an ontology/reasoner combination is performance-heterogeneous or homogeneous. For this purpose, we implement the incremental ontology evolution scenario set out in the previous section with two different partition sizes: one coarse-grained, which we expect to reveal some (even if not so evident) performance variability within increments, or a linear tendency throughout; and the other fine-grained, which should either confirm the coarse-grained results or resolve any cases for which the coarse-grained division was not discriminating enough to capture more intricate performance variations.

For the coarse-grained partitioning, we assume that splitting the ontology in halves would not be very informative, as there would be a single point in the growth curve at which to observe any performance variation. So we split the ontology into quarters, in order to have exactly two intermediate points between the first and final partitions (the latter giving RT(O)), where we presume that performance will be rather stable across the board.

Naturally, finding a particular performance growth pattern in a single execution of our prediction method is insufficient to determine, at least with statistical significance, that the same would occur if the partitions were different, i.e., if they contained other axioms. So we need to perform this test as many times as needed (for statistical significance, for example) for the given input.

The implementation of our prediction framework is described in Algorithm 9, which begins by executing the sub-procedure described in Algorithm 10 a specified number of times. In the latter, we start by shuffling the axioms of the ontology into a list. Then the ontology is partitioned into the specified number of partitions i, where each partition pi contains the axioms of the preceding partition pi−1. The classification times of these partitions are measured and added, together with the partition sizes, into a map. This map is returned to Algorithm 9 which, having gathered the values of each map produced via Algorithm 10, verifies for each partition size whether the classification times of partitions of that size are consistently stable and growing according to the linear factor L. In other words, if the classification times of all partitions grow roughly linearly with size, then the given ontology/reasoner combination is performance-homogeneous; otherwise it is performance-heterogeneous.

Algorithm 9 checkHeterogeneity
Input: Ontology O, reasoner R, integers nrParts and nrRuns, linear factor L, function ≈
Output: true if (O,R) is performance-heterogeneous, false if performance-homogeneous
1: map ← ∅   ▷ maps each partition size to a set of classification times
2: for 1 ≤ i ≤ nrRuns do
3:   map ← map ∪ splitAndClassify(O, R, nrParts)
4: end for
5: hmg ← true
6: for partition size p ∈ map do
7:   k ← |O| ÷ p
8:   for classification time t ∈ map(p) do
9:     if L × k × t ≉ RT(O) then
10:      hmg ← false
11:      break
12:    end if
13:  end for
14: end for
15: return ¬hmg

Algorithm 10 splitAndClassify
Input: Ontology O, reasoner R, integer nrParts
Output: Map of partition sizes to classification times
1: axioms[] ← shuffle(O)³
2: map ← ∅
3: for 0 ≤ i < nrParts do
4:   n ← |O| × (i + 1) ÷ nrParts − 1
5:   Oi ← axioms[0 .. n]
6:   t ← RT(Oi, R)   ▷ classify partition
7:   add (|Oi| → t) to map
8: end for
9: return map

³Our implementation uses Java’s built-in Collections.shuffle() method, which implements the Fisher-Yates shuffle algorithm [28].
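To make the two procedures concrete, the following is a minimal Java sketch of Algorithms 9 and 10 on top of the OWL API. The helper classificationTime(...) is hypothetical: it stands for loading the axioms into a fresh ontology, classifying it with the chosen reasoner (in our actual implementation, in a fresh JVM; see the note below), and returning wall-clock seconds. The tolerance parameter plays the role of the ≈ function.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;

public class HeterogeneityCheck {

    /** Algorithm 10: measure classification time of incremental partitions. */
    static Map<Integer, Double> splitAndClassify(OWLOntology o, int nrParts) {
        List<OWLAxiom> axioms = new ArrayList<>(o.getAxioms());
        Collections.shuffle(axioms); // Fisher-Yates shuffle, as in the thesis
        Map<Integer, Double> map = new LinkedHashMap<>();
        for (int i = 0; i < nrParts; i++) {
            int n = axioms.size() * (i + 1) / nrParts; // end of i-th increment
            List<OWLAxiom> partition = axioms.subList(0, n);
            map.put(partition.size(), classificationTime(partition));
        }
        return map;
    }

    /** Algorithm 9: flag (O,R) as heterogeneous if any partition's scaled
     *  classification time deviates from RT(O) by more than the tolerance. */
    static boolean isHeterogeneous(OWLOntology o, int nrParts, int nrRuns,
                                   double L, double tolerance, double rtO) {
        for (int run = 0; run < nrRuns; run++) {
            for (Map.Entry<Integer, Double> e
                     : splitAndClassify(o, nrParts).entrySet()) {
                double k = (double) o.getAxiomCount() / e.getKey();
                if (Math.abs(L * k * e.getValue() - rtO) > tolerance * rtO) {
                    return true;
                }
            }
        }
        return false;
    }

    // Hypothetical helper: build an ontology from the axioms, classify it
    // with the chosen OWL reasoner, and return the time taken in seconds.
    static double classificationTime(List<OWLAxiom> axioms) {
        throw new UnsupportedOperationException("reasoner-specific");
    }
}
```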

Note that, while splitting and testing are merged in Algorithm 10, in our implementation we first split and serialise the ontology partitions, and then test each partition on a new JVM instance. This is done to ensure that all system resources used by previous tasks are appropriately freed and available for the classification test.

8.4 Empirical Evaluation

The key aspect of our evaluation is forming a suitable test corpus; it is somewhat futile to verify the performance profile of an ontology that is acceptably fast with every reasoner. So, generally, we focus on ontologies that are hard for at least one reasoner. In Section 8.4.1 we discuss in more detail how our approach is evaluated. Subsequently, in Section 8.4.2, the results of our experiment are presented and discussed.

8.4.1 Materials and Methods

In order to test our methods we derived a corpus of “problematic” ontologies from the NCBO BioPortal. To start off, we performed a reasoner performance benchmark across all 216 ontologies in BioPortal. The reasoner versions used are Pellet v2.2.2, HermiT v1.3.6, FaCT++ v1.5.3, and JFact v0.9, and Machine 1 was used for the benchmark. From the entire BioPortal corpus, we discard all ontologies with classification times below 60 seconds for all reasoners (i.e., the “easy” ontologies). This leaves us with 13 ontologies (25 ontology/reasoner combinations), 3 of which did not classify within our timeout of 10 hours: the IMGT⁴ ontology with Pellet, GALEN⁵ with all reasoners, and GO-Ext (Gene Ontology Extension)⁶ with FaCT++ and JFact.

⁴http://www.imgt.org/IMGTindex/ontology.html
⁵http://www.co-ode.org/galen/
⁶http://www.geneontology.org/

In this experiment, after each input ontology is divided into 4 and 8 subsets, the classification times of these subsets as increments are measured (i.e., for the 4-part division: RT(O1), RT(O1 ∪ O2), RT(O1 ∪ O2 ∪ O3), and RT(O), where O1, O2, O3 are subsets of O). We use a classification timeout of 5 hours per ontology/partition throughout the experiment. In order to be able to generalise (with a confidence level of 95% and a 5% confidence interval) to all subsets of a specific size, we need to measure the classification time of the randomly incremented subsets (of the same size) a total of n = 384 times. This sample size was fixed based on the ontology with the fewest axioms (IMGT, with 1,112 axioms); for any larger ontology n will not change.
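The figure n = 384 is consistent with the standard sample-size formula for an infinite population at maximal variance (p = 0.5), with z = 1.96 for 95% confidence and a margin of error e = 0.05; a quick check (our assumption as to how the number was derived):

```latex
n = \frac{z^2 \, p(1-p)}{e^2}
  = \frac{1.96^2 \times 0.5 \times 0.5}{0.05^2}
  = \frac{0.9604}{0.0025}
  \approx 384.16 \approx 384
```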

8.4.2 Results

The set of ontology/reasoner combinations used in this experiment comprises those cells of Table 8.1 with classification times of at least 60 seconds, or a timeout. Note that a number of reasoners timed out for some input (marked as “timeout” in Table 8.1), and there were also a few errors (marked as “err”), namely datatypes unsupported by FaCT++ and a malformed-literal exception thrown by HermiT.

Ontology     L(O)      |O|       |Õ|       HermiT     Pellet     FaCT++      JFact
ChEBI*       ALE+      60,085    28,869    10.3       65.8       7.9         11
CCLO         ALCHI     139,004   28,084    8.3        923.5      156.1       154.8
EFO          SHOIF     7,493     4,143     2.98       61.1       2.3         8
Gazetteer*   ALE+      263,725   225,870   143.7      47.2       12,248.5    12,551.2
GO-Ext*      ALEH+     60,293    30,282    12.5       268.4      timeout     timeout
ICF          ALCHOIF   17,726    1,596     103.1      1.2        0.4         1.7
NCIt         SH        116,273   83,722    430.1      57.1       28.6        48.1
NEMO         SHIQ      2,405     1,422     76.3       7.9        err         13.6
OBI          SHOIN     25,257    3,060     61.6       119.8      err         72.1
PRPPO        SH        121,383   87,576    err        118.9      err         58.4
VO           SHOIN     8,488     3,530     17         4,275.9    err         5.98
IMGT         ALCIN     1,112     112       80.4       timeout    0.2         0.2
GALEN        ALEHIF+   37,696    23,136    timeout    timeout    timeout     timeout

Table 8.1: Basic metrics and classification times (in seconds) of selected BioPortal ontologies. Ontologies marked with ∗ are in the OWL 2 EL profile.

A number of ontology/reasoner combinations were unfeasible to work with: even half of GALEN gave reasoning times of over 10 hours with any one of the tested reasoners, and so it was discarded from the test set. Additionally, IMGT with Pellet, and GO-Ext with FaCT++, JFact and Pellet, all have highly variable classification times within the random subsets tested, often timing out (when the whole ontology does not). These are therefore clearly performance-heterogeneous combinations. Note that the classification times reported in Table 8.1 were measured using Machine 1, while for the remainder of the experiment both machines are used; where the classification time of an ontology in subsequent figures does not roughly coincide with that reported in Table 8.1, it is because Machine 2 was used.

In the ontology/reasoner combinations with roughly linear growth (Figures 8.4 to 8.7), the curves are similar in both the 4-partition and the 8-partition experiments. In the case of VO (Figure 8.10), the 4-partition test reveals a smooth non-linear curve, while the 8-partition test shows non-monotonic growth, though both tests indicate a performance-heterogeneous ontology/reasoner combination. In the remaining cases, the 8-partition curves are also consistent with the 4-partition ones. So the coarse-grained partitioning clearly suffices to determine whether a given ontology/reasoner combination is performance-heterogeneous or homogeneous. For example, Figure 8.8a shows that even the coarse 4-part division can detect strange performance patterns, although the finer-grained division is, predictably, more detailed (Figure 8.8b). Contrariwise, Figure 8.10a shows a rather smooth, though non-linear, curve. One might think that this smoothness indicates a somewhat predictable performance profile, but as we see in the finer-grained view (Figure 8.10b) this is not so; the 8-part division reveals highly variable performance towards the 6th and 7th increments (particularly partitions that take much longer to classify than the original ontology), which goes undetected using the 4-part division.

Overall, 4 out of 13 ontologies (8 out of 25 ontology/reasoner combinations) exhibit roughly linear performance growth in our tests (Figures 8.4 to 8.7), which is highly surprising given the worst-case complexity of deciding subsumption in expressive logics. The remaining 8 ontologies (13 ontology/reasoner combinations) exhibited non-linear (Figures 8.1 to 8.3) and sometimes highly variable performance behaviour (Figures 8.8 to 8.11). While one would expect non-linear performance behaviour, given the computational complexity of the problem, it is striking to find such variable non-linear performance patterns as those in Figure 8.8.

During the execution of this experiment we noted a curious phenomenon: while for all other ontologies we obtained consistent overall classification times on each run, for the GO-Ext ontology this did not happen. Strikingly, the classification time of GO-Ext with Pellet, under precisely the same experimental conditions, varies from seconds to hours; more specifically, it ranges from 27 seconds to 1 hour and 14 minutes (see Figure 8.12). Note that the classification time of the whole ontology is measured without any alteration to the original ontology file; that is, we take the original file, parse it with the OWL API, and pass it on to the reasoner. Unique as this case may be (at least in our corpus), it suffices to illustrate not only the need for performance analysis solutions, but also the difficulty of the problem in cases such as this one.

[Figure 8.1: ChEBI: Chemical Entities of Biological Interest (times in seconds). Panels: (a) 4-part division (Pellet); (b) 8-part division (Pellet).]

[Figure 8.2: NCIt: National Cancer Institute Thesaurus (times in seconds). Panels: (a) 4-part division (HermiT); (b) 8-part division (HermiT).]

[Figure 8.3: IMGT Ontology (times in seconds). Panels: (a) 4-part division (HermiT); (b) 8-part division (HermiT).]

[Figure 8.4: CCLO: Coriell Cell Line Ontology (times in seconds). Panels: (a) 4-part division (JFact); (b) 8-part division (JFact).]

[Figure 8.5: Gazetteer ontology (times in seconds). Panels: (a) 4-part division (HermiT); (b) 8-part division (HermiT).]

[Figure 8.6: ICF: International Classification of Functioning-Disability and Health (times in seconds). Panels: (a) 4-part division (HermiT); (b) 8-part division (HermiT).]

[Figure 8.7: PRPPO: Patient Research Participant Permissions Ontology (times in seconds). Panels: (a) 4-part division (Pellet); (b) 8-part division (Pellet).]

[Figure 8.8: EFO: Experimental Factor Ontology (times in seconds). Panels: (a) 4-part division (Pellet); (b) 8-part division (Pellet).]

[Figure 8.9: NEMO: Neural Electromagnetic Ontologies (times in seconds). Panels: (a) 4-part division (HermiT); (b) 8-part division (HermiT).]

[Figure 8.10: VO: Vaccine Ontology (times in seconds). Panels: (a) 4-part division (Pellet); (b) 8-part division (Pellet).]

We speculate that this may be due to 1) the triggering of the EL classifier built into Pellet in many (presumably fast), but not all, runs, 2) different reasoner-internal orderings of axioms, or 3) differing choices during GCI absorption (note that GO-Ext contains 4,407 hidden GCIs).

Our experiments revealed three patterns of performance growth: (1) monotonic and linear, (2) monotonic but non-linear, and (3) non-monotonic. The performance phenomena (2) and especially (3) show that some ontology/reasoner combinations are highly sensitive to changes, while (1) confirms that there exist ontologies for which reasoner performance depends roughly on the number of axioms alone. Taking into account the different reasoners tested per ontology, that is, across the tested combinations of the same ontology with different reasoners, none of the ontologies falls into all three patterns listed above. However, there is one (the Vaccine Ontology, Figure 8.10) whose performance growth shifts between patterns (2) and (3) from the 4-part division to the 8-part division.

Generally, there is no identifiable relation between expressivity or OWL 2 profile and the performance patterns mentioned above. Indeed, considering just the OWL 2 EL ontologies, ChEBI fits into pattern (2), Gazetteer into (1), and GO-Ext into (3), yet their expressivity varies only in the use of role hierarchies. The remaining DL ontologies also span all three performance growth patterns.

8.5 Conclusions

In our attempt to better understand reasoner performance, we discovered an interesting phenomenon: while one would expect the performance of classification to grow at least quadratically, the fact that there are ontologies with a linear performance growth curve (i.e., pattern (1) from the growth patterns discussed earlier) is particularly surprising. Almost as surprising is finding such highly variable performance in the performance-heterogeneous combinations with growth pattern (3), such as EFO/Pellet or NEMO/HermiT. These reasoner performance phenomena were previously unknown to the ontology community, and the devised mechanism to uncover them has several potential applications (as discussed next), both for reasoner developers and ontology engineers.

[Figure 8.11: OBI: Ontology for Biomedical Investigations (times in seconds). Panels: (a) 4-part division (Pellet); (b) 8-part division (Pellet); (c) 4-part division (HermiT); (d) 8-part division (HermiT); (e) 4-part division (JFact); (f) 8-part division (JFact).]

[Figure 8.12: Classification times (in seconds) of the GO-Ext ontology with Pellet. Panels: (a) times in chronological order; (b) times in ascending order.]

This kind of analysis is of interest to reasoner developers attempting to optimise for, or understand why, a reasoner performs so poorly on some input: if the ontology/reasoner combination is performance-homogeneous, then the slow performance could be due to scalability issues, whereas if the input is performance-heterogeneous, it is possible that there exist (small) sets of axioms that are a performance bottleneck for that reasoner. The question of how these can be found is addressed in the following chapter.

Ontology engineers could also benefit from this service; it allows one to verify, for instance, whether the performance profile of an ontology changed from one version to another. Additionally, for someone who encounters an ontology for the first time and wishes to reuse it, performing this heterogeneity detection across reasoners can aid in selecting a reasoner that “reacts best” to changes, or has a more desirable performance profile for the input ontology. For instance, given an ontology O and reasoners R1 and R2, if (O,R1) falls under pattern (2) and (O,R2) under pattern (3), then one could infer that (O,R1) is more robust to change than (O,R2).

Based on our prediction mechanism, we know that ontology/reasoner combinations with performance growth patterns (2) and (3), as discussed beforehand, contain hot spots, while those with performance growth pattern (1) have no hot spots. In the subsequent chapter we address the problem of finding (small) hot spots in the ontology/reasoner combinations found to be performance-heterogeneous.

Chapter 9

Performance Hot Spots

Previously, in Chapter 8, we introduced the notions of performance heterogeneity and homogeneity, and investigated means to test whether a given ontology/reasoner combination is performance-homogeneous or heterogeneous. In this chapter, we explore the following hypothesis: if an ontology is performance-heterogeneous for some reasoner, then it contains (small) subsets whose interaction with the remainder is performance-degrading, so-called hot spots, and it is possible to find them. In the absence of a hot spot finding service, different users have developed varying coping mechanisms, one of which is to remove all disjointness axioms: this might improve reasoning time, but we also know of cases where it dramatically degrades it. In performance-homogeneous combinations, on the other hand, the performance of ontology subsets is roughly proportional to their size, so it is unlikely that modifying the ontology will cause a major performance change. In this chapter we design and evaluate methods to identify, and remove or approximate, hot spots in such a way that reasoner performance over the resulting “repaired” ontology improves significantly.

9.1 Motivation

Based on the findings reported in Chapter 8, we learned that there exist subsets of ontologies which, by themselves, perform worse than the whole ontology. This suggests that in such performance-heterogeneous cases there could exist hot spots with a positive or negative effect (that is, subsets whose removal improves or degrades reasoner performance, respectively), though we are only concerned with positive hot spots. We use the term “hot spot” by way of analogy with program profilers, though, unlike in program profilers, we cannot tell upfront whether the hot spots by themselves take a long time to classify, or whether they have a more intricate performance-degrading effect.

A naive approach to finding hot spots would be an exhaustive search over “small” ontology subsets. However, the search space is unmanageable: for a number of axioms n, and considering only subsets of size below 20% of n, the possible subsets are all unique combinations of n elements of size k, for 1 ≤ k ≤ 0.2n. The verification of each such “hot spot candidate”, i.e., a subset that is possibly a hot spot, is a costly operation: we have to measure RT(O \ M). So we need some other method for producing good hot spot candidates.

In [102], the authors suggest that the satisfiability checking time of an atomic concept is an indicator of the total time the reasoner spends on or “around” those atomic concepts during classification. In particular they observe that, in their examined ontologies, the satisfiability checks of a few concepts (2 to 10 out of thousands) took much longer than those of the remaining concepts, which was also noted earlier in [47]. Taking this into account, the authors were able to “repair” their sample ontologies by removing a small number of axioms based on guidance from satisfiability checking times.

Since subsumption testing in expressive logics such as SROIQ is, as a last resort, reduced to satisfiability testing, it is perfectly plausible that the standalone satisfiability checking time of a concept C is correlated with a hot spot. Especially seeing as classification typically finds far more non-subsumptions (i.e., satisfiable concepts of the form C ⊓ ¬D) than subsumptions, it is reasonable to presume that the work put into determining whether C is satisfiable may be repeated; for a given ontology O, the worst case would be testing the satisfiability of C ⊓ ¬D for all D ∈ Õ \ {C}. Obviously, it could happen that C is rarely approached during classification (for example, if unsatisfiable), or its satisfiability in combination with other atomic concepts could be easier. Though given the evidence discussed above, it seems reasonable to hypothesise that (high) satisfiability checking times can be effective hot spot indicators, that is, hints as to possible sources of performance degradation.

9.2 Specification

Previously we described the general intuition of a hot spot. Now we formalise this notion in Definition 13.

Definition 13 (Hot spot). Given an ontology O and reasoner R, we say that a subset M ⊊ O is a hot spot if |M| ≪ |O| and RT(O \ M, R) ≪ RT(O, R).

In Section 9.2.1 we present a hot spot finding method partly inspired by the findings reported in [102]. Subsequently, in Section 9.2.2, we specify a variety of mechanisms that exploit hot spots in order to speed up classification of a given ontology/reasoner combination.

9.2.1 Finding Hot Spots

As briefly described in Section 9.1, we begin hot spot finding by measuring the satisfiability checking times of all atomic concepts in a given ontology O. Just knowing the “hard” concepts does not give us a corresponding set of axioms. Using a concept C ∈ Õ as a starting point, we first extract the set of axioms of O in which C occurs, i.e., the so-called usage of C. This gets us the explicit presence of the concept, which simulates what a user might do: identify the problem (C) and then cast aside axioms that involve it. Additionally, one might want to remove the implicit presence of the concept as well. For that purpose, we use the ⊤⊥*-module for the usage signature as the module “around” C. We rely on ⊤⊥*-modules as these were shown to be the smallest kind of syntactic locality-based module [89].

The module or usage of a concept is a valid hot spot candidate if it is acceptably small (hence the use of ⊤⊥*-modules). In general, we consider subsets of less than 20% of the size of the ontology, and give precedence to subsets containing both the explicit and implicit presence of a given concept, since these are guaranteed to preserve all entailments between the seed term C and those terms that co-occur with C in axioms. If the size of the ⊤⊥*-module is not below our threshold but the size of the usage is, we take the usage as our hot spot candidate. In terms of performance, we regard a hot spot candidate M as a (verified) hot spot if it gives rise to a classification speedup of at least 80% over the original time, i.e., RT(O \ M) ≤ 0.2 × RT(O). We discuss the implementation of our technique in Section 9.3.
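As a rough sketch of the candidate construction (not the thesis’s actual code), both the usage and the ⊤⊥*-module for the usage signature can be obtained with the OWL API’s syntactic locality module extractor, assuming an OWL API 3/4-style interface:

```java
import java.util.HashSet;
import java.util.Set;

import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLEntity;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;

import uk.ac.manchester.cs.owlapi.modularity.ModuleType;
import uk.ac.manchester.cs.owlapi.modularity.SyntacticLocalityModuleExtractor;

public class HotSpotCandidates {

    // Explicit presence of c: all axioms of the ontology that mention c
    public static Set<OWLAxiom> usage(OWLOntology ont, OWLClass c) {
        return ont.getReferencingAxioms(c);
    }

    // Implicit presence: the ⊤⊥*-module for the signature of the usage
    public static Set<OWLAxiom> usageModule(OWLOntologyManager man,
                                            OWLOntology ont, OWLClass c) {
        Set<OWLEntity> seedSignature = new HashSet<>();
        for (OWLAxiom ax : usage(ont, c)) {
            seedSignature.addAll(ax.getSignature()); // terms co-occurring with c
        }
        SyntacticLocalityModuleExtractor extractor =
            new SyntacticLocalityModuleExtractor(man, ont, ModuleType.STAR);
        return extractor.extract(seedSignature);
    }
}
```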

9.2.2 Reasoning with Hot Spots

The hot spot finding mechanism presented in the previous section is, by itself, potentially useful for reasoner developers. However, for end users, the applicability of our hot spot finding method depends largely on whether, and how much, information they are willing to lose. Indeed, in a realistic edit-compile-deploy scenario, users may be wary of disposing of parts of their ontology. Thus, in order to avoid this predicament, we define a series of approximation and knowledge compilation techniques in Sections 9.2.2.1 and 9.2.2.2, respectively. Whether one uses the former or the latter, it is necessary to “maintain” the property of interest (that is, acceptable performance) as the ontology is altered over time. One way to cope with this is to split the ontology into two files, the remainder and the hot spot, where the remainder imports the hot spot via owl:imports. Subsequently, one could continue to engineer the remainder alone,¹ while being able to approximate or compile the full concept hierarchy based on our techniques. This would naturally require using a reasoner that implements the appropriate algorithms to deal with hot spots.

¹In which case, the ontology editor should allow one to instantiate a reasoner on selected elements of the imports closure.
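A sketch of the proposed split with the OWL API, assuming the hot spot has already been saved as its own ontology document (the IRI is a placeholder):

```java
import org.semanticweb.owlapi.model.AddImport;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;

public class RemainderImportsHotSpot {

    // Make the remainder import the hot spot document via owl:imports,
    // so the full ontology is recoverable as the imports closure.
    public static void link(OWLOntologyManager man, OWLOntology remainder) {
        IRI hotSpotIri = IRI.create("http://example.org/ontology/hotspot.owl");
        man.applyChange(new AddImport(
            remainder,
            man.getOWLDataFactory().getOWLImportsDeclaration(hotSpotIri)));
    }
}
```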

9.2.2.1 Approximation Techniques

We designed a series of techniques to approximate, soundly but incompletely, the concept hierarchy of a given ontology. The intuition is that we classify the parts (i.e., hot spot and remainder) separately and union the results. Recall that Con(O) stands for the set of all atomic subsumptions inferred from O.

Definition 14. Given an ontology O and a hot spot M, we define the following approximations:

n-Ap(O) := O \ M
c-Ap(O) := Con(O \ M) ∪ Con(M)
nc-Ap(O) := (O \ M) ∪ Con(M)

Based on a hot spot M from an ontology O, we have the immediate approximation n-Ap(O); it is guaranteed (by Definition 13 of a hot spot) to be much easier to classify than the original ontology, though possibly too incomplete with respect to Con(O), that is, it may miss a substantial number of the atomic subsumptions inferred from O. Then we have the c-Ap(O) approximation, which should be more complete than n-Ap(O) alone,² and nc-Ap(O), where we expect that the interaction between the inferred subsumptions of M and the remainder O \ M will bring us closer to Con(O).

In order to provide a basis of comparison for our approximations, we use a known approximate reasoning method based on a reduction of the input into a polynomial-time fragment of SROIQ, namely EL, which is used in TrOWL [86]. We implemented the EL reduction algorithm so as to delegate the approximation to any given reasoner other than REL (the EL reasoner used within TrOWL), and thus provide a fair basis of comparison between approximations. Our EL reduction-based approximation is denoted tr-Ap(O).

²In Section 9.4.2.1 we find out by how much.

9.2.2.2 Compilation Techniques

Another way to deal with slow performance is by compiling a given ontology into a form that is faster to query (in our case, we are only concerned with atomic subsumptions). Compiling, in the typical sense, means translating the ontology into a tractable form, from which query answering should be faster than querying the original ontology, while giving the same answers [13, 91]. In this sense, compiling, rather than approximating, would certainly be preferable to end users (so long as the classification time of the compilation is not (much) higher than that of the approximation) since classification results would be complete. As such, we devise a series of knowledge compilation techniques based on hot spots, in order to verify whether these benefit from similar performance speedups as the approximations specified above. The rationale behind these is that by adding inferred knowledge, for example, from a hot spot, to the original ontology, reasoners might not need to perform certain, possibly expensive subsumption tests, and, as a consequence, should (at least intuitively) perform faster. Our knowledge compilations are specified in Definition 15.

Definition 15. Given an ontology O and a hot spot M, we define the following knowledge compilations:

om-Kc(O) := O ∪ Con(M)
or-Kc(O) := O ∪ Con(O \ M)
orm-Kc(O) := O ∪ Con(M) ∪ Con(O \ M)

In the om-Kc(O) compilation we simply add to O the consequences of the hot spot, while in or-Kc(O) we add the consequences of the remainder. The final compilation, orm-Kc(O), adds to the original ontology O the sets of consequences of both hot spot and remainder. These techniques also serve to demystify, albeit in a restricted sense, the claim that asserting in O inferred knowledge from O, i.e., O ∪ Con(O), makes reasoning faster. It is restricted in this case because we do not assert the full set Con(O), but rather subsets thereof.
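As a minimal sketch (under the same OWL API assumptions as before; axiom sets stand in for ontologies), Con(·) and the sets of Definitions 14 and 15 could be computed as follows:

```java
import java.util.HashSet;
import java.util.Set;

import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLDataFactory;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

public class HotSpotReasoning {

    // Con(O'): the atomic subsumptions inferred from an ontology, read off
    // the classified hierarchy (includes owl:Thing; harmless for a sketch).
    public static Set<OWLAxiom> con(OWLOntology ont, OWLReasonerFactory rf,
                                    OWLDataFactory df) {
        OWLReasoner r = rf.createReasoner(ont);
        r.precomputeInferences(InferenceType.CLASS_HIERARCHY);
        Set<OWLAxiom> inferred = new HashSet<>();
        for (OWLClass c : ont.getClassesInSignature()) {
            // false: all (not just direct) inferred superclasses
            for (OWLClass sup : r.getSuperClasses(c, false).getFlattened()) {
                inferred.add(df.getOWLSubClassOfAxiom(c, sup));
            }
        }
        r.dispose();
        return inferred;
    }

    // nc-Ap(O) = (O \ M) ∪ Con(M), given a precomputed Con(M)
    public static Set<OWLAxiom> ncAp(OWLOntology o, Set<OWLAxiom> hotSpot,
                                     Set<OWLAxiom> conOfHotSpot) {
        Set<OWLAxiom> result = new HashSet<>(o.getAxioms());
        result.removeAll(hotSpot);
        result.addAll(conOfHotSpot);
        return result;
    }

    // om-Kc(O) = O ∪ Con(M)
    public static Set<OWLAxiom> omKc(OWLOntology o, Set<OWLAxiom> conOfHotSpot) {
        Set<OWLAxiom> result = new HashSet<>(o.getAxioms());
        result.addAll(conOfHotSpot);
        return result;
    }
}
```

The remaining sets (c-Ap, or-Kc, orm-Kc) follow by the analogous unions of axiom sets and Con(·) results.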

9.3 Implementation

The hot spot finding technique is described in Algorithm 11, where we start by collecting the satisfiability checking time of each atomic concept, and sort these in descending order of time. Then, beginning with the concept whose satisfiability check took longest, we extract both the usage and the module for the usage signature. If the size of the module is within our size threshold, we proceed with it as a hot spot candidate. Otherwise, if the size of the usage is acceptable, we take the usage as the candidate. Finally, we verify whether the classification time of the ontology without the hot spot candidate is much faster than that of the whole ontology, in which case a hot spot is found.

While not specified in Algorithm 11, the first step is attempting to classify the given ontology O until a timeout is reached, in order to set the value of RT(O); so O is never classified more than once, contrary to what line 18 of Algorithm 11 might suggest. In cases where the classification of O times out, RT(O) is set to the timeout value. Note that both the size and performance speedup thresholds are easily parameterisable, though for exposition purposes we hard-coded them as defined in Section 9.2, and as used throughout our experiments.

The algorithms to produce the specified approximations and knowledge compilations are straightforwardly derivable from the corresponding definitions.

Algorithm 11 findHotspots
Input: Ontology O, reasoner R, number of hot spots n
Output: Set of hot spots
1: hotspots ← ∅, candidates ← ∅, times ← ∅
2: for atomic concept C ∈ Õ do
3:   add ⟨C, SATtime(C)⟩ to times
4: end for
5: ordTimes ← times sorted in descending order of SATtime(C)
6: for concept C ∈ ordTimes do
7:   M ← ∅
8:   usage ← {α ∈ O | C ∈ α̃}
9:   mod ← ⊤⊥*-mod(ũsage, O)
10:  if |mod| ≤ 0.2 × |O| then
11:    M ← mod
12:  else
13:    if |usage| ≤ 0.2 × |O| then
14:      M ← usage
15:    end if
16:  end if
17:  if M ≠ ∅ then
18:    if RT(O \ M) ≤ 0.2 × RT(O) then
19:      add M to hotspots
20:      if |hotspots| = n then
21:        break for-loop
22:      end if
23:    end if
24:  end if
25: end for
26: return hotspots
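Lines 2 to 5 of Algorithm 11 amount to timing one satisfiability call per atomic concept and sorting; a minimal sketch (the reasoner instance is assumed to be already initialised on O):

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.reasoner.OWLReasoner;

public class SatTimes {

    // Measure the satisfiability checking time of every atomic concept,
    // then return the concepts in descending order of time.
    public static Map<OWLClass, Long> sortedSatTimes(OWLOntology ont,
                                                     OWLReasoner reasoner) {
        Map<OWLClass, Long> times = new LinkedHashMap<>();
        for (OWLClass c : ont.getClassesInSignature()) {
            long start = System.nanoTime();
            reasoner.isSatisfiable(c); // the timed satisfiability check
            times.put(c, System.nanoTime() - start);
        }
        Map<OWLClass, Long> sorted = new LinkedHashMap<>();
        times.entrySet().stream()
             .sorted(Map.Entry.<OWLClass, Long>comparingByValue(
                         Comparator.reverseOrder()))
             .forEachOrdered(e -> sorted.put(e.getKey(), e.getValue()));
        return sorted;
    }
}
```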

9.4 Empirical Evaluation

The general hypothesis for the evaluation is that, if an ontology/reasoner combination is performance-heterogeneous, then our hot spot finding technique can find hot spots, and do so much faster than an approach based on randomly picking atomic concepts as candidates, i.e., one where, in line 6 of Algorithm 11, an atomic concept C is randomly picked from the set of all atomic concepts; we call this the “random baseline approach”. These hot spot finding approaches are compared in Section 9.4.1, followed by a brief analysis of the hot spots found in Section 9.4.1.1.

In Section 9.4.1.2 we compare the results of our hot spot finding approach with those retrieved by running the Pellint profiling tool on the same set of ontologies. Finally, in Section 9.4.2, we explore a series of hot spot based approximation and knowledge compilation techniques to speed up classification, and compare the former with the approximate reasoner TrOWL.

9.4.1 Finding Hot Spots

To test whether our satisfiability-based indicator is effective, that is, whether the candidate hot spots generated from indicated concepts are indeed hot spots, we compare it to candidate subsets generated using the random baseline approach described above. The corpus we use consists of the 12 “hard” ontology/reasoner combinations from BioPortal used in Chapter 8. For each ontology/reasoner combination we attempt to find 3 hot spots, while testing no more than 1,000 hot spot candidates. In each case, we select candidate hot spots using both the satisfiability-guided and the random baseline methods. The results are shown in Table 9.1, where “nr. tests” is the number of candidate hot spots tested before either finding 3 hot spots or exhausting the search space (either the number of concepts in the ontology or 1,000, whichever is smaller).

The first striking result is that we found hot spots in all performance-heterogeneous combinations; if an ontology had a linear performance growth curve, then neither method found a hot spot, whereas if the growth curve was non-linear then we found at least 1 hot spot, and usually 3. Of course, this could be because our hot spot size or speedup criteria are too constraining for linear-growth ontologies, or because we failed to find the significant hot spots in those cases. Note that, for the GO-Ext ontology, we use (in Table 9.1 and subsequent ones) the median value from the wide range of obtained classification times.

Both techniques were able to find hot spots most of the time, though the random approach failed in two cases. For the NEMO/HermiT combination, both approaches failed to find 3 hot spots within the 1,000-test limit, which suggests that hot spots are scarce. Contrariwise, for NCIt/HermiT, while the random approach failed to find any hot spots, the SAT-guided approach found 3 in 7 tests. In general, though not always, the SAT-guided approach found 3 hot spots in far fewer tests than the random approach (on average, in 129 versus 426 tests, respectively), validating concept satisfiability as an efficient indicator. Note that, at this point, we only present classification time speedups; the completeness of classification results is discussed in Section 9.4.2.

O       |O|      Concepts  R       RT(O)    Method  Hot spots  Avg. RT(O\M)  Avg. speedup  Nr. tests  Avg. |M|  Avg. %|O|  Avg. RT(M)
ChEBI   60,085   28,869    Pellet  65.8     SAT     3          12.3          5.8x          3          186       0.3%       0.55
                                            rand    3          3.5           18.9x         89         522       1%         0.72
EFO     7,493    4,143     Pellet  61.1     SAT     3          9.6           5.3x          128        68        1%         0.13
                                            rand    3          10.9          5.6x          863        70        1%         0.14
GO-Ext  60,293   30,282    Pellet  268.4    SAT     3          29.6          9.5x          36         98        0.2%       0.08
                                            rand    3          31.9          8.5x          419        17        0.03%      0.06
IMGT    1,112    112       Pellet  >54,000  SAT     1          26.1          2072x         112        98        9%         0.09
                                            rand    1          26.1          2072x         112        98        9%         0.09
                           HermiT  80.4     SAT     3          7.8           15.9x         86         35        3%         8.86
                                            rand    3          7.1           21.7x         103        36        3%         10.4
NEMO    2,405    1,422     HermiT  76.3     SAT     1          5.5           13.9x         1,000      44        2%         4.63
                                            rand    0          -             -             1,000      -         -          -
OBI     25,257   3,060     HermiT  61.6     SAT     3          2.3           25.9x         3          570       2%         1.56
                                            rand    3          4.3           14.2x         189        576       2%         1.48
                           JFact   72.1     SAT     3          1.1           14x           3          570       2%         1.12
                                            rand    3          7.4           9.8x          57         576       2%         1.19
                           Pellet  119.8    SAT     3          11.1          12.5x         29         708       3%         2.05
                                            rand    3          21.6          5.7x          133        593       2%         1.76
VO      8,488    3,530     Pellet  4,275.9  SAT     3          30.4          142x          11         322       4%         1.56
                                            rand    3          371.7         17.8x         725        262       3%         0.61
NCIt    116,587  83,722    HermiT  430.1    SAT     3          16.1          8.1x          7          3,611     3%         16.14
                                            rand    0          -             -             1,000      -         -          -

Table 9.1: Comparison of hot spots found via the SAT-guided (“SAT” rows) and random (“rand” rows) concept selection approaches. CPU times in seconds.

A difficulty of the SAT-guided approach is the time it takes to test all concepts for satisfiability. For example, we were unable to retrieve precise satisfiability checking times for the GO-Ext ontology with FaCT++ and JFact; instead, we used a timeout of 60 seconds on each concept satisfiability test. One way to overcome this difficulty is to integrate hot spot finding as a sub-process within SAT-based reasoners, in such a way that when a concept with (unusually) high satisfiability testing time is found, a hot spot test is performed concurrently with classification. This strategy could be exploited for “anytime reasoning”, whereby, when a hot spot is found, the reasoner can return preliminary results while full classification is still running; thus the longer one waits, the more complete the results will be.

9.4.1.1 Hot Spot Analysis

Having the hot spots in hand, we now investigate whether the removal of each hot spot from the original ontology happened to shift expensive constructs from the main input to the subset. This could help explain why such a performance speedup occurs when the hot spot is removed. To start with, in Table 9.2 we show the expressivity of each ontology, its hot spots, and the ontologies without the hot spots. Notice that, in several cases, the removal of the hot spot does not change the expressivity of the remainder with respect to the whole ontology, for example in ChEBI. However, in a few other cases there is a reduction in expressivity; for instance, the hot spots found in EFO leave the remainder without nominals. Similarly, in NEMO the remainder no longer has qualified cardinality restrictions.

Ontology  L(O)     L(O\Mi)       L(Mi)
ChEBI     ALE+     ALE+          ALE+
EFO       SHOIF    SHIF          SHOIF
GO-Ext    ALEH     ALEH          AL, ALEH, ALE
IMGT      ALCIN    ALC, ALCIN    ALCI, ALCIN
NCIt      SH       ALCH          S
NEMO      SHIQ     SHIF          SHIQ
OBI       SHOIN    SHOIN         SHOIF, SHOIN
VO        SHOIN    SHOIN         SHOIF

Table 9.2: Expressivity of each original ontology (O), its various hot spots (Mi, for 1 ≤ i ≤ 3) and corresponding remainders (O\Mi).

Overall, there is no conclusive evidence of any (significant) relation between expressivity alone and poor reasoner performance, at least not within the analysed corpus. So we move on to investigating a different metric: GCIs. Such axioms, particularly those that cannot be efficiently dealt with via lazy unfolding [5, 54],³ are a known potential source of hardness. An optimisation technique called absorption [43] is typically employed before classification and coherence checking (not only in tableau-based algorithms, but also in other calculi, for example, in resolution-based algorithms for propositional and first-order logics), which transforms these GCIs into primitive concept inclusions (that is, a GCI C ⊑ D is transformed, when possible, into CIs of the form A ⊑ D′, where A is a primitive atomic concept) so that lazy unfolding can be applied. Though a highly significant optimisation, GCI absorption gives rise to disjunctions in the transformed axioms, which is also a potential source of hardness.

³For instance, axioms with complex concepts on both sides, such as A ⊓ B ⊑ ∃r.C.
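As a standard illustration (not drawn from the studied ontologies) of why absorption trades a GCI for a disjunction, the GCI from the footnote above can be absorbed into the primitive atomic concept A at the cost of a disjunct on the right-hand side, the two axioms being logically equivalent:

```latex
A \sqcap B \sqsubseteq \exists r.C
\quad\rightsquigarrow\quad
A \sqsubseteq \neg B \sqcup \exists r.C
```

Lazy unfolding can then expand A during tableau expansion, but each expansion now introduces a disjunction, i.e., a non-deterministic choice point.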

That said, it seems reasonable to expect some relation between the loss of GCIs in O after removing some hot spot and the resulting performance speedup. So we verify whether the removal of the hot spots from each ontology happens to shift a substantial number of GCIs from the ontology into the hot spot. The results gathered are shown in Table 9.3.

Ontology  O     O\M1  O\M2  O\M3  M1  M2  M3  Average reduction
EFO       172   163   164   164   9   8   8   4.8%
GO-Ext    4407  4398  4382  4382  9   25  16  0.4%
NCIt      42    37    36    36    5   6   6   13.5%
NEMO      31    30    -     -     1   -   -   3.2%
OBI       227   182   193   193   44  33  33  16.6%
VO        235   196   201   197   39  34  38  15.7%
IMGT      38    0     0     0     38  38  38  100%

Table 9.3: Number of GCIs contained in each ontology, its hot spots, and their corresponding remainders. The “average reduction” represents the percentage of GCIs removed from O into Mi, averaged over 1 ≤ i ≤ 3.

The obvious point to notice here is that the removal of each of the 3 hot spots found within IMGT (for HermiT) leaves the remainder with no GCIs at all. In the remaining cases there is no such radical difference. Overall, on average 22% of GCIs are shifted from O into the hot spot upon its removal. In NEMO and NCIt only 1 to 6 GCIs are removed from the original ontology (the lowest absolute numbers of removed GCIs), which could be because these few GCIs could not be absorbed.

It is clear that we cannot immediately take the sheer number of removed GCIs as sufficient indication of performance variations, especially since these GCIs were removed in combination with other axioms in the hot spots. We expect that a glass-box approach, where the “handling” of such GCIs is traced during the execution of the classification algorithm, may help disentangle performance difficulties in specific reasoners, and in particular the relation between the removed GCIs and the resulting performance variation.

9.4.1.2 Comparison with Pellint

As a final check, we compared our hot spot finding technique with Pellint [72]. Recall from Section 4.2 that Pellint identifies potentially performance-degrading axioms, so-called “lints”, and then attempts to repair the ontology, typically by weakening those axioms. If the number of axioms altered by Pellint’s repair strategies is sufficiently small, and the gain sufficiently large, then Pellint will have identified a hot spot (though at most 1). Since we presume that performance-homogeneous ontology/reasoner combinations have no hot spots (indeed we found none), we would expect that, while possibly improving performance with its repairs, Pellint would not identify a hot spot. Similarly, for non-Pellet reasoners, we would expect no improvements at all. To check these hypotheses, we ran Pellint on all our ontologies and compared reasoning times both for the Pellint-altered versions and for O with the identified lints removed (thus providing a direct comparison with Table 9.1). The results are shown in Table 9.4, where those ontologies for which Pellint found no lints are omitted (5 in total). If Pellint found lints but could not repair them, then the number of altered axioms reads as 0 and no tests were performed.

O        R        RT(O)    Nr. axioms       %|O|      Altered(O)           O \ {lints}
                           altered (lints)  altered   RT()      Speedup    RT()      Speedup
ChEBI    Pellet   65.8     0                -         -         -          -         -
EFO      Pellet   61.1     172              2%        3.7       16.7x      3.1       19.7x
GO-Ext   Pellet   268.4    4407             7%        19.4      13.8x      5.85      45.9x
VO       Pellet   4275.9   231              3%        119.7     35.7x      3.32      1288x
NCIt     HermiT   430.1    42               0.04%     443.4     0.13x      448.1     0.13x
Coriell  Pellet   923.5    46               0.03%     642.3     1.4x       631.0     1.5x
         FaCT++   156.1                               159.2     0.98x      159.1     0.98x
         JFact    154.8                               154.2     1.004x     143.9     1.08x
PRPPO    Pellet   118.9    0                -         -         -          -         -

Table 9.4: Ontology/reasoner combinations for which Pellint found lints.

A first note is that Pellint was unable to find any hot spots in the performance-homogeneous ontology/reasoner combinations, though for one (Coriell/Pellet) it was able to provide a significant performance speedup (32%). This further confirms our linear/homogeneous hypothesis. Second, Pellint found hot spots in 3 out of 8 performance-heterogeneous ontology/reasoner combinations, performing worse than even the random baseline approach. When found, the hot spots were competitive, but not all repaired lints improved performance (for instance, in NCIt/HermiT). In general, Pellint failed to find hot spots in our experiments due to finding no lints (in 5 ontologies), having no repairs (in 2 ontologies),⁴ or simply failing to produce a significant enough (or any) effect (in 4 ontology/reasoner combinations, most being non-Pellet). As expected, Pellint found no hot spots or performance improvements for reasoners other than Pellet; of course, this might just be due to its overall poor hot spot finding. Finally, Pellint’s alterations had a noticeable negative effect on reasoning time compared to simple removal. Whether these approximations significantly prevent entailment loss remains to be investigated; given the high development and maintenance costs of Pellint, it hardly seems viable compared to (reasoner-independent) search-based methods such as the one we devised.

⁴The set of suspect axioms might be a hot spot (or a subset thereof), but without access to them we cannot test.

9.4.2 Hot Spot Based Reasoning

In this section we compare the approximation and knowledge compilation techniques specified in Section 9.2.2, i.e., in Sections 9.2.2.1 and 9.2.2.2, respectively.

9.4.2.1 Approximations

To start with, a comparison of the approximations is shown in Table 9.5, containing, for each of the 3 approximations as well as tr-Ap(O) (with the respective reasoner), the classification times and the completeness with respect to the full classification results, i.e., Con(O). In order to verify the latter, we compute Con(O) with the respective reasoner or, where that takes too long, we use all other tested reasoners to compute Con(O) and take the most “agreed-upon” set as the result. The completeness of the classification results of each approach is denoted “compl.” in Table 9.5.

Overall, the closest approximation is nc-Ap(O), which yields an average completeness of 99.84% and an average speedup of 89.3% over the original times. tr-Ap(O) is typically more complete than our approximations, though in several cases classifying an ontology with tr-Ap(O) is much slower than the original RT(O); for example, tr-Ap(O) failed to classify the NCIt within 5 hours, compared to ≈7 minutes originally. Similarly, with ChEBI and OBI the approximation is no faster than the original times. The average speedup provided by tr-Ap(O) is non-existent, particularly due to the NCIt case.

                 n-Ap(O)            c-Ap(O)            nc-Ap(O)           tr-Ap(O)
O       R        Compl.  Speedup    Compl.  Speedup    Compl.  Speedup    Compl.  Speedup
ChEBI   Pellet   55%     9.4x       55%     9.3x       100%    6.3x       100%    0.3x
EFO     Pellet   78%     7.2x       79%     7.2x       100%    5.3x       100%    2.7x
NCIt    HermiT   75%     9.8x       80%     7.8x       100%    9.4x       -¹²     0.04x
NEMO    HermiT   97%     26.1x      98%     12.6x      100%    27.8x      99.94%  13.7x
OBI     HermiT   51%     22.8x      55%     17.2x      100%    18x        100%    1.2x
OBI     JFact    51%     11.3x      55%     10.4x      99.92%  6.4x       99.95%  0.9x
OBI     Pellet   50%     8.7x       54%     8.5x       100%    7.1x       100%    2.2x
IMGT    Pellet   68%     10,412x    76%     10,330x    100%    26,382x    100%    47,979x
IMGT    HermiT   92%     12.4x      97%     4.7x       99.92%  12.3x      100%    284x
VO      Pellet   50%     51.2x      52%     50.7x      98.36%  15.7x      100%    39.5x
GO-Ext  Pellet   95%     10.1x      96%     10.1x      100%    5.4x       100%    1.5x

Table 9.5: Reasoning times and degree of completeness of the devised approximations, and of tr-Ap(O). The degree of completeness is denoted “compl.”.

¹²The classification of the NCIt was interrupted after running for 5 hours, well above the original classification time.

By excluding that one case, tr-Ap(O)’s average classification time speedup is 33.7%, which is still nowhere near the speedups provided by nc-Ap(O). From the cases of OBI (with any reasoner), ChEBI, and VO, we observe that the n-Ap(O) and c-Ap(O) approximations are much worse than nc-Ap(O); the latter performs just as well as n-Ap(O), while being far more complete.

Applying the original TrOWL system, with its internal reasoner REL, is not much better than using standard DL reasoners on the EL approximations, particularly since some DL reasoners (like Pellet or FaCT++) are finely tuned to the EL fragment of OWL. Nevertheless, we analysed those results, only to find that TrOWL has the exact same problem with the NCIt, and only outperforms tr-Ap(O) in 4 out of 7 cases, by mere seconds.

9.4.2.2 Compilations

Moving on to the compilation techniques, we compare these in Table 9.6 with respect to their reasoning time (and hence speedup). Overall, we observe that adding the inferred concept hierarchy of the parts does not necessarily improve classification time over the whole. There are cases, such as OBI/JFact, where all compilations took much longer to classify than the original ontology (note that the operation timed out after 5 hours).

                 om-Kc(O)           or-Kc(O)           orm-Kc(O)
O       R        RT()     Speedup   RT()     Speedup   RT()     Speedup
ChEBI   Pellet   74.5     1.2x      73.1     1.2x      73.6     1.2x
EFO     Pellet   51.3     1.4x      63       1.2x      62.9     1.2x
NCIt    HermiT   616.1    1.1x      603.2    1.1x      614.5    1.1x
NEMO    HermiT   94.9     1.1x      94.6     1.1x      98.6     1.02x
OBI     HermiT   71       0.97x     69.1     1x        70.7     0.98x
OBI     JFact    >5hrs    0x        >5hrs    0x        >5hrs    0x
OBI     Pellet   264      0.6x      207.5    0.76x     276.6    0.6x
IMGT    Pellet   36000    0.5x      36000    0.5x      36000    0.5x
IMGT    HermiT   94.8     0.96x     94.9     0.96x     94.8     0.96x
VO      Pellet   1704.4   2.5x      1066.2   4x        2136.2   2x
GO-Ext  Pellet   161.4    2.3x      30.1     12.3x     30.6     12x

Table 9.6: Compilation results for the devised compilation techniques (times in seconds, where no unit is shown).

On the other hand, there are cases of mild to noteworthy improvement: for instance, VO classifies 75% faster when we use the or-Kc(O) compilation technique, a significant improvement with no loss of classification results. Similarly, the GO-Ext ontology classifies 92% faster with both the or-Kc(O) and orm-Kc(O) compilation techniques. Nevertheless, the results gathered are not nearly as stable with respect to classification time improvement as our approximations, and the improvements obtained are also not as high as those shown in Section 9.4.2.1.

9.5 Conclusions

Based on the results gathered and discussed in Section 9.4, we have shown that our hot spot finding mechanism (described in Algorithm 11) is feasible, and successfully identified hot spots in all ontology/reasoner combinations deemed performance-heterogeneous. Using the satisfiability checking times of concepts in an ontology as a “hardness” indicator typically allows us to form a valid candidate set of hot spot axioms, though at a (time) cost. Although random selection of concepts did not succeed in finding hot spots within as few tests as the goal-directed approach, it bypasses a potentially very time-consuming step of the algorithm. However, the reward for this step is shown to be worthwhile (particularly in those cases where random selection found no hot spots), especially seeing as it can be incorporated into any reasoner as a form of dynamic optimisation; essentially, the reasoner should keep track of concept satisfiability testing times, and when a concept takes “unusually” long, or much longer than previous concepts, attempt to find a hot spot based on that concept in parallel with classification of the whole ontology.

The run time of our entire procedure (from the start of the satisfiability tests to outputting the approximated concept hierarchy) is lower than the original classification time in 4 out of 11 cases, including one case (IMGT/Pellet) where classification did not terminate within 15 hours. In general, the hot spots found were extremely good: typically much smaller than our size limit (only in IMGT/Pellet were they above 5% of the ontology), and often giving massive speedups (for instance, IMGT/Pellet). There is no indication that hot spots, on their own, are particularly hard, which suggests an interaction effect, as expected. So, unlike with hot spots in programs, there is no straightforward relationship between the performance of a hot spot in isolation and the effect it has on the ontology as a whole (see the last column of Table 9.1). This is somewhat similar to the fact that, in a program, an individually fast function that is called sufficiently often may be the performance bottleneck for that program; that is, looking at the performance of the function in isolation is not sufficient to determine its impact on the overall run time. However, in our case, there are many possible ways in which a performance hot spot might affect overall run time (we investigated the classification time of hot spots, as well as expressivity and GCI variations between the original ontology and the remainders and hot spots) and yet not exhibit pathological behaviour on its own.

Of course, simply finding hot spots does not provide any explanation of performance patterns; it merely provides tools for investigating them. On the other hand, it is a purely black-box technique and thus, unlike Pellint, does not require such insight to be effective, as demonstrated. For end users, the envisioned service is straightforward: present the user with a selection of hot spots, and let them select the most appropriate one to “set aside” (permanently or temporarily) or to rewrite into a less damaging approximation (for example, using one of the approximation or compilation techniques specified in Section 9.2.2).

Although we aimed at finding hot spots resulting in substantial speedups to reasoning time, one might also want to consider hot spots with different properties: for instance, that the remainder ontology is a module rather than the hot spot, or that both remainder and hot spot are modules. Indeed, the latter would allow us to engineer a complete classifier in a manner similar to MORe, described in [87]. Our techniques could benefit reasoner developers as well; for example, a hot spot gives the developer a pair of almost identical ontologies with vastly different performance behaviour. By comparing profiling reports of their reasoners processing these inputs, the developer might gain additional insight into what causes the performance variation. Additionally, since we found that in every case GCIs are shifted from the original ontology into the hot spot (in IMGT, all of them), it seems reasonable to hypothesise that there is some interaction effect worth investigating, both in isolation and in combination with the remaining hot spot axioms.

Chapter 10

Conclusions

This chapter outlines the contributions of this thesis, as well as their respective significance, in Section 10.1. Then, a series of possible directions for future work is discussed in Section 10.2.

10.1 Contributions and Significance

The contributions of this thesis to the area of impact analysis in DL ontologies can be broadly grouped into the following topics: analysis of the evolution of the NCIt, logical change analysis methods for detecting and aligning axiom and term changes, and techniques for reasoner performance analysis and optimisation. Since logical changes can easily (and sometimes surprisingly) cause significant reasoner performance variations, having performance analytics services at hand is important for impact analysis in ontologies, allowing ontology engineers as well as reasoner developers to disentangle (undesirable) reasoner performance phenomena.

Diachronic ontology analysis In Chapter 3 we describe a diachronic study of the NCIt across 10 years’ worth of monthly versions; an unprecedented study in ontology research that revealed surprising logical and reasoner performance change phenomena which would not have been found by standard single-version studies. In particular, it showed some surprising performance ‘jumps’ between NCIt versions. Our findings suggest that reasoner developers would benefit from inspecting ontologies where the performance of one reasoner changed significantly while the other reasoners’ performance did not vary. Such diachronic studies are also helpful for repairing errors in previous versions, in other words, as a means to curate a version set. Indeed, during the course of our study several errors were reported to and fixed by NCIt developers, for instance, problems with the serialisations or incoherent versions.

Logical change analysis The work presented in Chapter 5 advances the state of the art in axiom-based diffing with a coarse- and fine-grained change categorisation mechanism that aligns (source and target) axiom changes between ontologies, without relying on change logs. Our change alignment allows ontology engineers to see the changes, as well as what they are a change of, similar to the way text document diffs align changes between phrases. Furthermore, because the categories designed follow from common editing patterns (for example, describing a new term, or adding constraints to existing terms), we expect that ontology engineers will find that they facilitate both browsing and understanding changes between ontologies. In our study of the NCIt we found instances of every category that, even for non-domain-experts, were sufficient to identify editing patterns employed by NCIt developers; for instance, over time many axioms encoding necessary conditions for some term are turned into equivalences, typically with further conditions; we were able to extrapolate this simply from analysing instances of the ‘strengthening’ category.

The diff notion presented in Chapter 6 advances the state of the art in term-based diffing with several approximations of the deductive difference for SROIQ ontologies, as well as a method that distinguishes terms having their meaning affected directly or indirectly (or both) between ontologies. Such a distinction allows one to hide or ignore potentially many uninteresting changes in the change set. In hiding these, ontology engineers can focus on the (purely) direct term changes, which should be of more interest.

Our diff tool ecco implements the combined detection of axiom and term changes, according to the method specified in Chapter 7, hence bringing together the axiom- and term-based notions discussed beforehand. Based on our combination of axiom and term changes, ontology engineers can effectively inspect changed axioms together with the terms that each of those affects directly (or indirectly), and how each of those terms is affected. Our diff is useful not only for developers of an ontology, but also for users who are examining ontologies in the wild, for instance, searching for an ontology to reuse in their system; by comparing versions of an ontology, or different ontologies covering the same domain (or overlapping domains),¹ the user gains insight into, for example, editing methodologies, which terms were altered most recently (hence giving a sense of “latest activity”), as well as which domains one ontology models that the other does not (and vice versa).

¹For comparing these ontologies one might have to align them beforehand, using a tool such as LogMap [55].

Reasoner performance analysis Having formulated a theory of performance heterogeneity and homogeneity for ontology/reasoner combinations (in Chapter 8), its verification revealed three previously undiscovered phenomena: ontologies where reasoner performance is monotonic and linear, monotonic but non-linear, and non-monotonic with respect to their size. Ontology/reasoner combinations that exhibit one of the latter two growth patterns are called performance-heterogeneous, and are hypothesised to contain small hot spots which can be found. This hypothesis is subsequently verified in Chapter 9, where our satisfiability-guided hot spot finding method is able to identify several hot spots in each performance-heterogeneous ontology/reasoner combination. As an alternative to removing the hot spot from the ontology, we evaluate several approximation and compilation techniques for classification, one of which showed particular promise in significantly speeding up classification time while maintaining near-completeness of classification results (the minimum being 98.3%, though in most cases the results are complete). In a realistic scenario, however, one would not be able to determine the degree of incompleteness, that is, which subsumptions are missing from the results, without fully classifying the ontology; this is a major inherent drawback of incomplete reasoning.

This work has various applications, both for ontology engineering practice and reasoner development/optimisation. For ontology engineers, the key application area is coping with an unacceptably slow-performing ontology/reasoner combination: the first thing to determine is whether the pair is performance-heterogeneous; if so we can find a hot spot, otherwise we expect that the slow performance is due to scalability issues rather than complex interactions between axioms in the ontology. Upon finding one or more hot spots, the user is expected to select one for removal or approximation, allowing users to continue working on a faster-performing ontology version that is nearly as complete as the original with respect to classification. For reasoner developers, our work provides, to start with, an indication of whether the reasoner is not scaling well for a particular input, or whether that input contains hot spots for the tested reasoner. By searching for hot spots, each one that is found gives the developer a pair of nearly identical ontologies with radically different performance. A subsequent comparison of profiling reports of the reasoner on these different inputs, that is, the original ontology and an alternative ontology either without, or with an approximated, hot spot, should allow the developer to trace how hot spots cause a performance bottleneck for the whole system.
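One natural way to read these completeness figures (a hedged reconstruction for illustration, not a verbatim definition from Chapter 9) is as the proportion of entailed atomic subsumptions that the approximate classification recovers; for a soundness-preserving approximation, in the sense of [86], every reported subsumption is genuine, so:

    \[
      \mathit{completeness}(\mathcal{O}, \mathcal{O}') \;=\;
      \frac{\big|\{\, A \sqsubseteq B \mid \mathcal{O}' \models A \sqsubseteq B \,\}\big|}
           {\big|\{\, A \sqsubseteq B \mid \mathcal{O} \models A \sqsubseteq B \,\}\big|}
    \]

where O is the original ontology, O′ its approximated counterpart, and A, B range over atomic concepts. On this reading, the 98.3% minimum means that at most 1.7% of the entailed subsumptions went missing.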

10.2 Research Impact and Future Directions

The work presented in this thesis has the potential to be extended in a variety of ways. In what follows, we describe desirable extensions to our work as well as unexplored methods to address the problems we tackled.

Axiom-centric impact analysis The axiom-based diff method could benefit from a combination with a term renaming detection mechanism (such as the one presented in [59, 55]). Currently, if a term is renamed between ontologies, all axioms in which the term appears are identified as changes. Of course, one could detect renames and produce two ontologies where those are "undone", after which diffing would be accurate with respect to the (found) renames, though a combination is clearly more helpful. Secondly, we discovered that structural equivalence is sensitive to insignificant changes in concept nesting; for example, α := A ⊑ (B ⊓ C) and α′ := A ⊑ B ⊓ C are considered different according to the current implementation of (structural) equality in the OWL API. While our diff aligns α and α′ as rewrites of one another, this is an implementation issue that needs resolving; that is, stronger notions of 'obviously equivalent' need to be taken into account.
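A minimal OWL API sketch of this nesting sensitivity (the class names are hypothetical, and the nested/flattened intersection pair is a variant of the example above):

    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;

    public class NestingSensitivity {
        public static void main(String[] args) {
            OWLDataFactory df = OWLManager.getOWLDataFactory();
            OWLClass a = df.getOWLClass(IRI.create("urn:ex#A"));
            OWLClass b = df.getOWLClass(IRI.create("urn:ex#B"));
            OWLClass c = df.getOWLClass(IRI.create("urn:ex#C"));
            OWLClass d = df.getOWLClass(IRI.create("urn:ex#D"));

            // A SubClassOf (B and (C and D)): nested intersection
            OWLAxiom nested = df.getOWLSubClassOfAxiom(a,
                    df.getOWLObjectIntersectionOf(b, df.getOWLObjectIntersectionOf(c, d)));
            // A SubClassOf (B and C and D): flattened intersection
            OWLAxiom flat = df.getOWLSubClassOfAxiom(a,
                    df.getOWLObjectIntersectionOf(b, c, d));

            // Logically equivalent, yet structurally distinct, so this prints: false
            System.out.println(nested.equals(flat));
        }
    }

A stronger notion of 'obviously equivalent' would, at a minimum, flatten such nested conjunctions before comparing axioms.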

Term-centric impact analysis In our term-based diff notion, we have experimented with different ways to approximate the deductive difference between OWL 2 DL ontologies. However, since deductive inseparability is decidable for less expressive logics such as ALCQI, it would be of interest to compute the (sound and complete) deductive difference for ALCQI axioms before resorting to approximations. This would be easily integrated into our diff framework and, as shown even by computing deductive differences over ELHr, could find additional affected concepts that our approximations might miss.
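As a reminder of what our diff functions approximate, the deductive difference between O1 and O2 with respect to a signature Σ can be stated as follows (following the formulation of [68, 75]; the exact notation of Chapter 6 may differ):

    \[
      \mathsf{Diff}_\Sigma(\mathcal{O}_1, \mathcal{O}_2) \;=\;
      \{\, \eta \mid \mathsf{sig}(\eta) \subseteq \Sigma,\
           \mathcal{O}_1 \not\models \eta \ \text{and}\ \mathcal{O}_2 \models \eta \,\}
    \]

Deciding whether this set is empty is what becomes feasible for logics such as ALCQI, while for SROIQ one must resort to approximations.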

As pointed out in Chapter 6, computing GrDiffΣ(O1, O2) is computationally expensive. And although the reward in comparison with our simpler diff functions is rather obvious in some cases, the performance of this diff is a bottleneck (especially when dealing with big ontologies such as the NCIt). So it would be valuable to further optimise our implementation, in order to process larger inputs within reasonable time.

One of the problems we addressed in our term-centric analysis work was minimising the amount of change (specifically, affected concepts) shown to users, by structuring the change set according to whether concepts are (purely) directly or indirectly affected between ontologies. In some cases, though, there is still a high number of (directly) affected concepts found by our methods, and while this is somewhat expected from a month's worth of changes in the NCIt case, it suggests that further structuring might be desirable for comparing ontology versions, especially if these versions are "far apart" in terms of change effort.

Aligning term and axiom changes The output of our diff tool ecco currently emphasises axiom changes, showing affected terms by their side; an alternative, term-centric view might be of interest for certain users whose development methodology is focused on terms (for the delivery of a term hierarchy, for example). Also, in order to suit a broader set of users, the tool is developed independently of any ontology editor. In addition to this editor-independent service, though, an editor-integrated diff (for instance, in the form of a Protégé plugin) could be highly desirable for ontology engineers. It is also possible to integrate ecco in Web-based services, such as ontology repositories. The NCBO BioPortal currently uses a diff that outputs an XML file with added or removed terms and axioms (according to structural equivalence), providing no user interface to browse these changes. Instead of outputting this file, the repository could easily request the diff from ecco and present its users with the resulting webpage.

Performance heterogeneity and homogeneity In our current approach, we partitioned the test ontologies into 4 and 8 equally sized subsets. And though this was sufficient to determine the performance profile of most ontology/reasoner combinations, a few, like GALEN (for all reasoners), took too long to classify beyond the first or second partition. In such cases, a more fine-grained partitioning might be preferable, especially because, unlike performance-homogeneity, heterogeneity of an ontology/reasoner combination (O, R) can potentially be determined within fewer increments (typically, at least, without testing RT(O, R)). So by verifying smaller partitions of such challenging ontologies, we could get hints of performance heterogeneity earlier on, thus reducing overall computation time (a sketch of this incremental protocol is given below).

Our technique could also serve as a means to reduce the search space of the hot spot finder, by keeping track of increments that significantly degrade reasoner performance. Of course, for this purpose one might prefer a more fine-grained partitioning of the input ontology, but even the 8-part division already reveals striking performance variability phenomena with reasonably small increments. In this scenario, instead of randomly testing satisfiability of all concepts in a given ontology, one could design an alternative approach that starts with concepts in the signature of (the most) performance-degrading increments.
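The following is a minimal sketch of such an incremental profiling run, under stated assumptions: the OWL API for ontology handling, HermiT as the example reasoner, and a fixed random seed. It illustrates the idea, rather than reproducing the exact harness used in Chapter 8.

    import java.util.*;
    import org.semanticweb.HermiT.Reasoner;
    import org.semanticweb.owlapi.apibinding.OWLManager;
    import org.semanticweb.owlapi.model.*;
    import org.semanticweb.owlapi.reasoner.InferenceType;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class IncrementalProfiler {
        public static void profile(OWLOntology o, int k) throws OWLOntologyCreationException {
            // Shuffle the logical axioms and split them into k roughly equal parts.
            List<OWLAxiom> axioms = new ArrayList<OWLAxiom>(o.getLogicalAxioms());
            Collections.shuffle(axioms, new Random(0));
            int part = (int) Math.ceil(axioms.size() / (double) k);

            OWLOntologyManager man = OWLManager.createOWLOntologyManager();
            OWLOntology grown = man.createOntology();
            for (int i = 0; i < k; i++) {
                int from = i * part, to = Math.min((i + 1) * part, axioms.size());
                man.addAxioms(grown, new HashSet<OWLAxiom>(axioms.subList(from, to)));

                // Classify the cumulative union and record wall-clock time;
                // a super-linear or non-monotonic time series flags heterogeneity.
                OWLReasoner r = new Reasoner.ReasonerFactory().createReasoner(grown);
                long start = System.nanoTime();
                r.precomputeInferences(InferenceType.CLASS_HIERARCHY);
                System.out.printf("increment %d/%d: %.1f s%n",
                        i + 1, k, (System.nanoTime() - start) / 1e9);
                r.dispose();
            }
        }
    }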

Performance hot spots Currently, we have concentrated on standalone satisfiability checking time of atomic concepts as the indicator for hot spots. There are clearly alternatives, for example, small atoms [26] or brute-force methods [16]. Additionally, the hot spot finder coupled with a concept hierarchy approximation could be exploited in a manner similar to an "anytime" system [25]. Consider the following scenario: a user requests the classification of an ontology and, while classifying, the reasoner passes the satisfiability checking times of concepts on to a hot spot finder as these tests are performed; by listening to these times and concurrently searching for and returning hot spots (and corresponding approximations), it would be possible to engineer an anytime reasoner for DL classification (a sketch of the underlying satisfiability-timing indicator is given below).

Finally, it would be interesting to see whether it is possible to derive Pellint-like rules (for different reasoners) directly from hot spots extracted from a large number of ontologies.
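As an illustration of the satisfiability-timing indicator, here is a simplified sketch (not the full hot spot finder of Chapter 9): a concept is merely flagged when its satisfiability check is much slower than the running mean observed so far; HermiT and the threshold factor are assumptions for the example.

    import java.util.*;
    import org.semanticweb.HermiT.Reasoner;
    import org.semanticweb.owlapi.model.OWLClass;
    import org.semanticweb.owlapi.model.OWLOntology;
    import org.semanticweb.owlapi.reasoner.OWLReasoner;

    public class SatTimingIndicator {
        // Returns atomic concepts whose satisfiability check takes more than
        // `factor` times the running mean: candidate entry points for hot spots.
        public static List<OWLClass> flagSlowConcepts(OWLOntology o, double factor) {
            OWLReasoner r = new Reasoner.ReasonerFactory().createReasoner(o);
            List<OWLClass> flagged = new ArrayList<OWLClass>();
            double totalMs = 0; int n = 0;
            for (OWLClass c : o.getClassesInSignature()) {
                long start = System.nanoTime();
                r.isSatisfiable(c);
                double ms = (System.nanoTime() - start) / 1e6;
                n++; totalMs += ms;
                if (n > 1 && ms > factor * (totalMs / n)) {
                    flagged.add(c); // unusually expensive check
                }
            }
            r.dispose();
            return flagged;
        }
    }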

Bibliography

[1] Robert S. Arnold and Shawn A. Bohner. Software Change Impact Analysis. Practitioners Series. IEEE Computer Society Press, 1996. Cited on page 15.

[2] Michael Ashburner, Catherine A. Ball, Judith A. Blake, David Botstein, Heather Butler, J. Michael Cherry, Allan P. Davis, Kara Dolinski, Selina S. Dwight, Janan T. Eppig, et al. Gene Ontology: tool for the unification of biology. Nature Genetics, 25(1):25–29, 2000. Cited on page 14.

[3] Franz Baader, Sebastian Brandt, and Carsten Lutz. Pushing the EL envelope. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI), 2005. Cited on pages 24 and 28.

[4] Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Peter F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003. Cited on pages 14, 21, and 22.

[5] Franz Baader, Enrico Franconi, Bernhard Hollunder, Bernhard Nebel, and Hans-Jürgen Profitlich. An empirical analysis of optimization techniques for terminological representation systems, or: Making KRIS get a move on. Applied Artificial Intelligence, 4:109–132, 1994. Cited on page 149.

[6] Franz Baader and Philipp Hanschke. A schema for integrating concrete domains into concept languages. In Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI), 1991. Cited on page 29.

[7] Franz Baader and Ulrike Sattler. Tableau algorithms for description logics. In Proceedings of the International Conference on Automated Reasoning with Analytic Tableaux and Related Methods (TABLEAUX), 2000. Cited on page 23.

[8] Franz Baader and Ulrike Sattler. An overview of tableau algorithms for description logics. Studia Logica, 69:5–40, 2001. Cited on page 23.

[9] Franz Baader and Ulrike Sattler. Description logics with aggregates and concrete domains. Information Systems, 28(8):979–1004, 2003. Cited on page 29.

[10] Samantha Bail, Bijan Parsia, and Ulrike Sattler. JustBench: A framework for OWL benchmarking. In Proceedings of the 9th International Semantic Web Conference (ISWC), 2010. Cited on pages 48 and 49.

[11] Samantha Bail, Bijan Parsia, and Ulrike Sattler. Extracting finite sets of entailments from OWL ontologies. In Proceedings of the 24th International Workshop on Description Logics (DL), 2011. Cited on page 39.

[12] Sean Bechhofer, Raphael Volz, and Phillip Lord. Cooking the semantic web with the OWL API. In Proceedings of the 2nd International Semantic Web Conference (ISWC), 2003. Cited on page 30.

[13] Marco Cadoli and Francesco M. Donini. A survey on knowledge compilation. AI Communications—The European Journal for Artificial Intelligence, 10(3):137–150, 1997. Cited on page 144.

[14] Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, and Riccardo Rosati. Tractable reasoning and efficient query answering in description logics: The DL-Lite family. Journal of Automated Reasoning, 39(3):385–429, 2007. Cited on page 29.

[15] Werner Ceusters, Barry Smith, and Louis Goldberg. A terminological and ontological analysis of the NCI Thesaurus. Methods of Information in Medicine, 44(4):498–507, 2005. Cited on page 34.

[16] Sadiq Charaniya. Facilitating DL reasoners through ontology partitioning. Master's thesis, Nagpur University, India, 2006. Cited on pages 18, 48, and 162.

[17] Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov, and Ulrike Sattler. A logical framework for ontology modularity. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007. Cited on page 27.

[18] Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov, and Ulrike Sattler. Modular reuse of ontologies: Theory and practice. Journal of Artificial Intelligence Research, 31(1):273–318, 2008. Cited on pages 27 and 66.

[19] Bernardo Cuenca Grau, Ian Horrocks, Boris Motik, Bijan Parsia, Peter F. Patel-Schneider, and Ulrike Sattler. OWL 2: The next step for OWL. Journal of Web Semantics, 6(4):309–322, 2008. Cited on page 14.

[20] Bernardo Cuenca Grau, Bijan Parsia, and Evren Sirin. Working with multiple ontologies on the semantic web. In Proceedings of the 3rd International Semantic Web Conference (ISWC), 2004. Cited on page 29.

[21] Bernardo Cuenca Grau, Bijan Parsia, and Evren Sirin. Combining OWL ontologies using E-connections. Journal of Web Semantics, 4(1):40–59, 2006. Cited on page 26.

[22] Bernardo Cuenca Grau, Bijan Parsia, Evren Sirin, and Aditya Kalyanpur. Modularity and Web ontologies. In Proceedings of the 10th International Conference on the Principles of Knowledge Representation and Reasoning (KR), 2006. Cited on pages 26 and 34.

[23] Sherri de Coronado, Margaret W. Haber, Nicholas Sioutos, Mark S. Tuttle, and Lawrence W. Wright. NCI Thesaurus: Using science-based terminology to integrate cancer research results. Studies in Health Technology and Informatics, 107(1):33–37, 2004. Cited on pages 31 and 34.

[24] Sherri de Coronado, Lawrence W. Wright, Gilberto Fragoso, Margaret W. Haber, Elizabeth A. Hahn-Dantona, Francis W. Hartel, Sharon L. Quan, Tracy Safran, Nicole Thomas, and Lori Whiteman. The NCI Thesaurus quality assurance life cycle. Journal of Biomedical Informatics, 42(3):530–539, 2009. Cited on pages 31 and 35.

[25] Thomas Dean and Mark Boddy. An analysis of time-dependent planning. In Proceedings of the 7th National Conference on Artificial Intelligence (AAAI), 1988. Cited on page 162.

[26] Chiara Del Vescovo, Bijan Parsia, Ulrike Sattler, and Thomas Schneider. The modular structure of an ontology: Atomic decomposition. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), 2011. Cited on pages 26 and 162.

[27] Jérôme Euzenat and Pavel Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg (DE), 2007. Cited on page 76.

[28] Ronald A. Fisher and Frank Yates. Statistical tables for agricultural, bio- logical and medical research. Edinburgh: Oliver & Boyd, 1953. Cited on page 129.

[29] Gilberto Fragoso, Sherri de Coronado, Margaret Haber, Frank Hartel, and Larry Wright. Overview and utilization of the NCI Thesaurus. Comparative and Functional Genomics, 5(8):648–654, 2004. Cited on pages 31 and 34.

[30] Silvio Ghilardi, Carsten Lutz, and Frank Wolter. Did I damage my ontology? A case for conservative extensions in description logics. In Proceedings of the 10th International Conference on the Principles of Knowledge Representation and Reasoning (KR), 2006. Cited on page 26.

[31] Birte Glimm, Ian Horrocks, Boris Motik, and Giorgos Stoilos. Optimising ontology classification. In Proceedings of the 9th International Semantic Web Conference (ISWC), 2010. Cited on page 24.

[32] Jennifer Golbeck, Gilberto Fragoso, Frank W. Hartel, Jim Hendler, Jim Oberthaler, and Bijan Parsia. The National Cancer Institute's Thésaurus and Ontology. Journal of Web Semantics, 1(1), 2003. Cited on pages 31 and 34.

[33] Rafael S. Gonçalves, Nicolas Matentzoglu, Bijan Parsia, and Ulrike Sattler. The empirical robustness of description logic classification. In Proceedings of the 26th International Workshop on Description Logics (DL), 2013. Cited on page 17.

[34] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Analysing multiple versions of an ontology: A study of the NCI Thesaurus. In Proceedings of the 24th International Workshop on Description Logics (DL), 2011. Cited on page 20.

[35] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Analysing the evolution of the NCI Thesaurus. In Proceedings of the 24th IEEE International Symposium on Computer-Based Medical Systems (CBMS), 2011. Cited on page 20.

[36] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Categorising logical differences between OWL ontologies. In Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), 2011. Cited on page 20.

[37] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Facilitating the analysis of ontology differences. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn), 2011. Cited on page 20.

[38] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Concept-based semantic difference in expressive description logics. In Proceedings of the 11th International Semantic Web Conference (ISWC), 2012. Cited on page 20.

[39] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Concept-based semantic difference in expressive description logics. In Proceedings of the 25th International Workshop on Description Logics (DL), 2012. Cited on page 20.

[40] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Ecco: A hybrid diff tool for OWL 2 ontologies. In Proceedings of the 9th International Workshop on OWL: Experiences and Directions (OWLED), 2012. Cited on page 20.

[41] Rafael S. Gonçalves, Bijan Parsia, and Ulrike Sattler. Performance heterogeneity and approximate reasoning in description logic ontologies. In Proceedings of the 11th International Semantic Web Conference (ISWC), 2012. Cited on page 20.

[42] Benjamin N. Grosof, Ian Horrocks, Raphael Volz, and Stefan Decker. Description logic programs: Combining logic programs with description logic. In Proceedings of the 12th International World Wide Web Conference (WWW), 2003. Cited on page 29.

[43] Volker Haarslev and Ralf Möller. High performance reasoning with very large knowledge bases: A practical case study. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI), 2001. Cited on page 149.

[44] Frank W. Hartel, Sherri de Coronado, Robert Dionne, Gilberto Fragoso, and Jennifer Golbeck. Modeling a description logic vocabulary for cancer research. Journal of Biomedical Informatics, 38(2):114–129, 2005. Cited on page 14.

[45] Matthew Horridge and Sean Bechhofer. The OWL API: A Java API for working with OWL 2 ontologies. In Proceedings of the 6th International Workshop on OWL: Experiences and Directions (OWLED), 2009. Cited on page 30.

[46] Matthew Horridge, Bijan Parsia, and Ulrike Sattler. Laconic and precise justifications in OWL. In Proceedings of the 7th International Semantic Web Conference (ISWC), 2008. Cited on page 25.

[47] Ian Horrocks. Optimising Tableaux Decision Procedures for Description Logics. PhD thesis, University of Manchester, 1997. Cited on page 141.

[48] Ian Horrocks, Oliver Kutz, and Ulrike Sattler. The even more irresistible SROIQ. In Proceedings of the 10th International Conference on the Principles of Knowledge Representation and Reasoning (KR), 2006. Cited on pages 14, 24, 28, and 125.

[49] Ian Horrocks and Peter F. Patel-Schneider. Reducing OWL entailment to description logic satisfiability. Journal of Web Semantics, 1(4):345–357, 2004. Cited on page 24.

[50] Ian Horrocks, Peter F. Patel-Schneider, and Frank van Harmelen. From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 1(1):7–26, 2003. Cited on page 28.

[51] Ian Horrocks and Ulrike Sattler. A tableaux decision procedure for SHOIQ. Journal of Automated Reasoning, 39(3):249–276, 2007. Cited on page 66.

[52] Ian Horrocks, Ulrike Sattler, and Stephan Tobies. Practical reasoning for expressive description logics. In Proceedings of the 6th International Conference on Logic for Programming and Automated Reasoning (LPAR), 1999. Cited on page 28.

[53] Ian Horrocks, Ulrike Sattler, and Stephan Tobies. Practical reasoning for very expressive description logics. Logic Journal of IGPL, 8(3):239–264, 2000. Cited on page 28.

[54] Ian Horrocks and Stephan Tobies. Reasoning with axioms: Theory and practice. In Proceedings of the 7th International Conference on the Principles of Knowledge Representation and Reasoning (KR), 2000. Cited on page 149.

[55] Ernesto Jim´enez-Ruiz and Bernardo Cuenca Grau. LogMap: Logic-based and scalable ontology matching. In Proceedings of the 10th International Semantic Web Conference (ISWC), 2011. Cited on pages 76, 159, and 160.

[56] Ernesto Jim´enez-Ruiz, Bernardo Cuenca Grau, and Ian Horrocks. On the feasibility of using OWL 2 DL reasoners for ontology matching problems. In Proceedings of the OWL Reasoner Evaluation Workshop (ORE), 2012. Cited on page 76.

[57] Ernesto Jim´enez-Ruiz, Bernardo Cuenca Grau, Ian Horrocks, and Rafael Berlanga Llavori. Building ontologies collaboratively using ContentCVS. In Proceedings of the 22nd International Workshop on Description Logics (DL), 2009. Cited on page 44.

[58] Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Ian Horrocks, and Rafael Berlanga Llavori. Supporting concurrent ontology development: Framework, algorithms and tool. Data and Knowledge Engineering, 70(1):146–164, 2011. Cited on pages 15, 17, 34, and 44.

[59] Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Yujiao Zhou, and Ian Horrocks. Large-scale interactive ontology matching: Algorithms and implementation. In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI), 2012. Cited on page 160.

[60] Aditya Kalyanpur, Bijan Parsia, Matthew Horridge, and Evren Sirin. Finding all justifications of OWL DL entailments. In Proceedings of the 6th International Semantic Web Conference (ISWC/ASWC), 2007. Cited on pages 25 and 66.

[61] Aditya Kalyanpur, Bijan Parsia, Evren Sirin, Bernardo Cuenca Grau, and James Hendler. Swoop: A web ontology editing browser. Journal of Web Semantics, 4(2):144–153, 2006. Cited on pages 16 and 46.

[62] Yong-Bin Kang, Yuan-Fang Li, and Shonali Krishnaswamy. Predicting reasoning performance using ontology metrics. In Proceedings of the 11th International Semantic Web Conference (ISWC), 2012. Cited on page 44.

[63] Yevgeny Kazakov. Consequence-driven reasoning for Horn SHIQ ontologies. In Proceedings of the 21st International Joint Conference on Artificial Intelligence (IJCAI), 2009. Cited on page 34.

[64] Yevgeny Kazakov and Boris Motik. A resolution-based decision procedure for SHOIQ. Journal of Automated Reasoning, 40(2-3):89–116, 2008. Cited on page 23.

[65] Boris Konev, Michel Ludwig, Dirk Walther, and Frank Wolter. The logical difference for the lightweight description logic EL. Journal of Artificial Intelligence Research, 44(1):633–708, 2012. Cited on page 47.

[66] Boris Konev, Michel Ludwig, and Frank Wolter. Logical difference computation with CEX 2.5. In Proceedings of the 6th International Joint Conference on Automated Reasoning (IJCAR), 2012. Cited on page 47.

[67] Boris Konev, Carsten Lutz, Dirk Walther, and Frank Wolter. Logical difference and module extraction with CEX and MEX. In Proceedings of the 21st International Workshop on Description Logics (DL), 2008. Cited on pages 27 and 47.

[68] Boris Konev, Dirk Walther, and Frank Wolter. The logical difference problem for description logic terminologies. In Proceedings of the 4th International Joint Conference on Automated Reasoning (IJCAR), 2008. Cited on pages 15, 17, 34, and 85.

[69] Petr Křemen, Marek Šmíd, and Zdenek Kouba. OWLDiff: A practical tool for comparison and merge of OWL ontologies. In Proceedings of the 22nd International Conference on Database and Expert Systems Applications (DEXA), 2011. Cited on pages 17, 34, and 45.

[70] Ralf Küsters. Non-Standard Inferences in Description Logics, volume 2100 of Lecture Notes in Artificial Intelligence. Springer-Verlag, 2001. Cited on page 86.

[71] Michelle L. Lee. Change Impact Analysis Of Object-Oriented Software. PhD thesis, George Mason University, 1998. Cited on page 15.

[72] Harris Lin and Evren Sirin. Pellint - a performance lint tool for Pellet. In Proceedings of the 5th International Workshop on OWL: Experiences and Directions (OWLED-08EU), 2008. Cited on pages 18, 49, and 150.

[73] Carsten Lutz. Reasoning with concrete domains. In Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI), 1999. Cited on page 29.

[74] Carsten Lutz. Description logics with concrete domains—A survey. In Advances in Modal Logic, volume 4. King's College Publications, 2003. Cited on page 29.

[75] Carsten Lutz, Dirk Walther, and Frank Wolter. Conservative extensions in expressive description logics. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), 2007. Cited on pages 26, 27, 47, 85, and 87.

[76] Carsten Lutz and Frank Wolter. Conservative extensions in the lightweight description logic EL. In Proceedings of the 21st Conference on Automated Deduction (CADE), 2007. Cited on pages 85 and 87.

[77] James Malone, Ele Holloway, Tomasz Adamusiak, Misha Kapushesky, Jie Zheng, Nikolay Kolesnikov, Anna Zhukova, Alvis Brazma, and Helen Parkinson. Modeling sample variables with an experimental factor ontology. , 26(8):1112–1118, 2010. Cited on pages 17, 34, and 45.

[78] Jose L. V. Mejino and Cornelius Rosse. Symbolic modeling of structural relationships in the foundational model of anatomy. In Proceedings of the Workshop on Formal Biomedical Knowledge Representation (KR-MED), 2004. Cited on page 14.

[79] Boris Motik, Peter F. Patel-Schneider, and Bijan Parsia. OWL 2 Web Ontology Language: Structural specification and functional-style syntax. W3C recommendation, 2009. Cited on page 26.

[80] Boris Motik, Rob Shearer, and Ian Horrocks. Hypertableau reasoning for description logics. Journal of Artificial Intelligence Research, 36:165–228, 2009. Cited on page 23.

[81] Natalya F. Noy, Sherri de Coronado, Harold Solbrig, Gilberto Fragoso, Frank W. Hartel, and Mark A. Musen. Representing the NCI Thesaurus in OWL: Modeling tools help modeling languages. Applied Ontology, 3(3):173–190, 2008. Cited on pages 31 and 34.

[82] Natalya F. Noy and Mark A. Musen. PROMPTDIFF: A fixed-point algorithm for comparing ontology versions. In Proceedings of the 18th National Conference on Artificial Intelligence (AAAI), 2002. Cited on page 45.

[83] Natalya F. Noy, Nigam H. Shah, Patricia L. Whetzel, Benjamin Dai, Michael Dorf, Nicholas Griffith, Clement Jonquet, Daniel L. Rubin, Margaret-Anne Storey, Christopher G. Chute, and Mark A. Musen. BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Research, 37:170–173, 2009. Cited on page 32.

[84] Natasha Noy and Timothy Redmond. Computing the changes between ontologies. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn), 2011. Cited on page 45.

[85] Alan Rector and Jeremy Rogers. Ontological and practical issues in using a description logic to represent medical concept systems: Experience from GALEN. In Reasoning Web, volume 4126 of Lecture Notes in Computer Science, pages 197–231, 2006. Cited on page 14.

[86] Yuan Ren, Jeff Z. Pan, and Yuting Zhao. Soundness preserving approximation for TBox reasoning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI), 2010. Cited on pages 31, 94, and 144.

[87] Ana Armas Romero, Bernardo Cuenca Grau, and Ian Horrocks. MORe: Modular combination of OWL reasoners for ontology classification. In Proceedings of the 11th International Semantic Web Conference (ISWC), 2012. Cited on page 156.

[88] Ulrike Sattler. A concept language extended with different kinds of transitive roles. In Proceedings of the 20th Annual German Conference on Artificial Intelligence (KI), 1996. Cited on page 22.

[89] Ulrike Sattler, Thomas Schneider, and Michael Zakharyaschev. Which kind of module should I extract? In Proceedings of the 22nd International Workshop on Description Logics (DL), 2009. Cited on pages 26, 27, and 142.

[90] Stefan Schulz, Daniel Schober, Ilinca Tudose, and Holger Stenzhorn. The pitfalls of thesaurus ontologization – the case of the NCI Thesaurus. In Proceedings of the American Medical Informatics Association (AMIA) 2010 Annual Symposium, 2010. Cited on page 34.

[91] Bart Selman and Henry Kautz. Knowledge compilation and theory approximation. Journal of the ACM, 43(2):193–224, 1996. Cited on page 144.

[92] Rob Shearer, Boris Motik, and Ian Horrocks. HermiT: A highly-efficient OWL reasoner. In Proceedings of the 5th International Workshop on OWL: Experiences and Directions (OWLED-08EU), 2008. Cited on page 31.

[93] František Simančík, Yevgeny Kazakov, and Ian Horrocks. Consequence-based reasoning beyond Horn ontologies. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), 2011. Cited on page 23.

[94] Nicholas Sioutos, Sherri de Coronado, Margaret W. Haber, Frank W. Hartel, Wen-Ling Shaiu, and Lawrence W. Wright. NCI Thesaurus: A semantic model integrating cancer-related clinical and molecular information. Journal of Biomedical Informatics, 40(1):30–43, 2007. Cited on pages 31 and 34.

[95] Evren Sirin, Bijan Parsia, Bernardo Cuenca Grau, Aditya Kalyanpur, and Yarden Katz. Pellet: A practical OWL-DL reasoner. Journal of Web Semantics, 5(2):51–53, 2007. Cited on pages 18 and 31.

[96] Kent A. Spackman. SNOMED RT and SNOMED CT. Promise of an international clinical ontology. M.D. Computing, 17(6):29, 2000. Cited on page 14.

[97] Herman J. ter Horst. Completeness, decidability and complexity of entailment for RDF Schema and a semantic extension involving the OWL vocabulary. Journal of Web Semantics, 3(2-3):79–115, 2004. Cited on page 29.

[98] Stephan Tobies. Complexity Results and Practical Algorithms for Logics in Knowledge Representation. PhD thesis, RWTH Aachen, 2001. Cited on page 28.

[99] Dmitry Tsarkov and Ian Horrocks. FaCT++ description logic reasoner: System description. In Proceedings of the 3rd International Joint Conference on Automated Reasoning (IJCAR), 2006. Cited on page 31.

[100] Wladyslaw M. Turski and Thomas S. E. Maibaum. The Specification of Computer Programs. Addison-Wesley, 1987. Cited on page 25.

[101] W3C OWL Working Group. OWL 2 Web Ontology Language: Document overview. W3C Recommendation, 2009. http://www.w3.org/TR/owl2-overview/. Cited on page 14.

[102] Taowei David Wang and Bijan Parsia. Ontology performance profiling and model examination: First steps. In Proceedings of the 6th International Semantic Web Conference (ISWC/ASWC), 2007. Cited on pages 48, 141, and 142.

[103] Cathleen Wharton, John Rieman, Clayton Lewis, and Peter Polson. Usability Inspection Methods, chapter The Cognitive Walkthrough Method: A Practitioner's Guide, pages 105–140. John Wiley & Sons, Inc., New York, NY, USA, 1994. Cited on page 78.