Semantic Similarity in OWL

SEMANTIC SIMILARITY IN OWL

THIRD YEAR PROJECT REPORT
May 2016

Project Supervisor: Professor Uli Sattler

By Olumide Mosinmileoluwa Adenmosun
BSc.(Hons) Computer Science with Industrial Experience
University of Manchester

Contents

Abstract
Declaration
Acknowledgements
1 Introduction
  1.1 Semantic Similarity
    1.1.1 Existing approaches
  1.2 Why is this tool needed?
  1.3 Project goals
  1.4 Report structure
2 Background
  2.1 The Semantic Web
  2.2 What is an ontology?
  2.3 Description Logics
  2.4 OWL
    2.4.1 Overview
    2.4.2 The OWL API
    2.4.3 OWL Reasoners
  2.5 Protégé
  2.6 The Similarity Measures
    2.6.1 Atomic Similarity (AtomicSim)
    2.6.2 Subconcept Similarity (SubSim)
    2.6.3 Grammar Similarity (GrammarSim)
3 Design
  3.1 Design approach
  3.2 Project Requirements
    3.2.1 Functional requirements
    3.2.2 Non-functional requirements
  3.3 Project Architecture
  3.4 Development Decisions
    3.4.1 Protégé 4 (P4)
    3.4.2 OWL API
  3.5 Plugin Architecture
  3.6 UI Designs
    3.6.1 Plugin interface design
    3.6.2 Debuggr!
  3.7 Summary
4 Implementation
  4.1 Implementing the Similarity Measures
    4.1.1 Atomic Similarity
    4.1.2 Subconcept Similarity
  4.2 Debuggr!
  4.3 Developing the Plugin UI
  4.4 Developing the manager class
  4.5 Challenges
5 Results
  5.1 Walkthrough and Functional Requirements
  5.2 Non-functional requirements
6 Testing and Evaluation
  6.1 Manual Testing
  6.2 Unit Testing
  6.3 Evaluating the Similarity Measures
    6.3.1 Results
7 Conclusions
  7.1 Summary
  7.2 Achievements
    7.2.1 Things I would have liked to do
  7.3 Future work
  7.4 Final remarks
A NCI Atomic Sim Output

List of Figures

2.1 Protégé user interface
3.1 Component interactions
3.2 Initial Plugin Architecture
3.3 The plugin design
3.4 The Debuggr! design
4.1 The finished Debuggr! design
5.1 Plugin tab view
5.2 Reasoner not initialised
5.3 Selecting classes using the class selection panel
5.4 Classes not correctly selected
5.5 Selecting a similarity measure
5.6 Results of the similarity calculation
6.1 Results from the NCI group
6.2 Results from the SNOMED group

List of Listings

4.1 Pseudo-code for getting subsuming classes
4.2 Pseudo-code for AtomicSim calculations
4.3 Approach 1 - Pseudo-code for collecting subsuming class expressions
4.4 Approach 1 - Pseudo-code for SubSim calculations
4.5 Approach 2 - Pseudo-code for adding subsuming class to the ontology
4.6 Pseudo-code for selection listener
4.7 Pseudo-code for update selection method
4.8 Pseudo-code to parse class expression
4.9 Example class expression in functional OWL syntax

List of Tables

6.1 Table of Results
6.2 Pearson coefficient between Expert values and similarity measures

Abstract

Different disciplines have taken varying views of the concept of similarity. As the volume of data grows, the Semantic Web aims to extend the principles of the World Wide Web to the retrieval of data. Semantic similarity has many possible applications, which makes a standard framework for calculating semantic similarity measurements important. This report details the design and implementation of a tool for calculating semantic similarity in OWL ontologies, built as a plugin for the ontology editor Protégé. An evaluation of the similarity measures is also carried out.

Project Title: Semantic Similarity in OWL.
Author: Olumide Mosinmileoluwa Adenmosun.
Degree: BSc(Hons) in Computer Science with Industrial Experience.
Supervisor: Uli Sattler.

Declaration

No portion of the work referred to in the thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Acknowledgements

Thank you Uli for the encouragement, help and for explaining the tricky stuff. I would also like to give huge thanks to Nico for answering my questions and giving advice on implementing said tricky stuff.

This has been an interesting academic year. I am so thankful for the help and support I have received from so many these last few months. Thanks to my project mates who made the project sessions fun and enjoyable, offering advice on everything from test cases to LaTeX templates.

Last and definitely not least I would like to say a huge thanks to Tiny, Beaw and M. Your love, support and encouragement got me through the year.

P.S. I finished!!!

Chapter 1
Introduction

1.1 Semantic Similarity

Semantic similarity between two terms or sets of documents is defined as the degree of "sameness" between them, measured by comparing the information describing their properties.

1.1.1 Existing approaches

Existing semantic similarity measures are classified into two categories [MPL+14]:

Path-based measures are based on the shortest path between the terms in a taxonomy, i.e., a tree or possibly a directed acyclic graph [Als15, MPL+14].

• Rada et al. [RMBB89] define a measure that traverses the shortest path between two concepts. The path length is calculated by counting the number of edges between the two concepts through their least common subsumer (LCS) [MPL+14]:
  \[ \mathrm{sim}(c_1, c_2) = \frac{1}{\mathrm{minpath}(c_1, c_2)} \]
  The LCS is the closest common ancestor, i.e., the one with the minimum number of IS-A links to concepts $c_1$ and $c_2$ [Sli13].
• Wu and Palmer extend this by considering the depth of the LCS in the calculation [MPL+14, Sli13]:
  \[ \mathrm{sim}(c_1, c_2) = \frac{2 \cdot \mathrm{depthToRoot}(LCS)}{\mathrm{depthToLCS}(c_1) + \mathrm{depthToLCS}(c_2) + 2 \cdot \mathrm{depthToRoot}(LCS)} \]
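The two path-based measures can be illustrated with a minimal Python sketch. It assumes a tree-shaped taxonomy stored as a child-to-parent map; the concept names and all function names are hypothetical and chosen for illustration only, and they are not part of the plugin described in this report. The formulas are undefined for identical concepts when counting edges, so the sketch returns 1.0 in that case, which is an assumption rather than part of the cited definitions.

```python
# Hypothetical toy taxonomy (child -> parent); "Thing" is the root.
PARENT = {
    "Animal": "Thing",
    "Plant": "Thing",
    "Mammal": "Animal",
    "Bird": "Animal",
    "Dog": "Mammal",
    "Cat": "Mammal",
}


def path_to_root(concept):
    """Return [concept, parent, ..., root] by following IS-A links."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lcs(c1, c2):
    """Least common subsumer: the first ancestor of c1 that also subsumes c2."""
    ancestors_of_c2 = set(path_to_root(c2))
    for candidate in path_to_root(c1):
        if candidate in ancestors_of_c2:
            return candidate
    raise ValueError("concepts do not share a root")


def depth_to_lcs(concept, subsumer):
    """Number of IS-A edges from concept up to subsumer."""
    return path_to_root(concept).index(subsumer)


def depth_to_root(concept):
    """Number of IS-A edges from concept up to the root."""
    return len(path_to_root(concept)) - 1


def rada_similarity(c1, c2):
    """1 / minpath, where minpath counts edges through the LCS."""
    if c1 == c2:
        return 1.0  # assumption: identical concepts get the maximum score
    subsumer = lcs(c1, c2)
    minpath = depth_to_lcs(c1, subsumer) + depth_to_lcs(c2, subsumer)
    return 1.0 / minpath


def wu_palmer_similarity(c1, c2):
    """Wu and Palmer: also takes the depth of the LCS into account."""
    if c1 == c2:
        return 1.0  # assumption: avoids 0/0 when both concepts are the root
    subsumer = lcs(c1, c2)
    lcs_depth = depth_to_root(subsumer)
    denominator = (depth_to_lcs(c1, subsumer)
                   + depth_to_lcs(c2, subsumer)
                   + 2 * lcs_depth)
    return 2 * lcs_depth / denominator


print(rada_similarity("Dog", "Cat"))        # 0.5
print(wu_palmer_similarity("Dog", "Bird"))  # 0.4
```

In this toy taxonomy, Dog and Cat meet at Mammal after one edge each, so the shortest path has two edges. Dog and Plant only meet at the root, so the Wu and Palmer score is 0 because the LCS has depth 0.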

Information content-based measures are based on the probability of a concept occurring in a hierarchy. The supposition is that the more frequently a concept appears, the less specific it is. Therefore, a concept with a high information content (IC) value is more specific to a topic than one with a low IC value. IC is defined as $IC_c = -\log P_c$, where $P_c$ is the probability of encountering an instance of concept $c$ [Als15, MPL+14].

In [Als15], Alsubait identifies the shortcomings of these similarity measures. Path-based approaches are imprecise as they only consider atomic subsumption, ignoring any complex subsumers. The other measures impose requirements on the ontology, such as low expressivity. A new family of measures is required to precisely measure the semantic similarity of terms in ontologies. These individual measures vary in their accuracy and computational cost based on what features they consider [APS15].

1.2 Why is this tool needed?

1.3 Project goals

The goals of the project are to:

• Implement a similarity tool as a Protégé plugin that allows any user to calculate the semantic similarity between classes in an OWL ontology.
• Test and evaluate the correctness of the results of the similarity measures implemented.

1.4 Report structure

This section gives an overview of the structure of this document and a summary of the information in the following chapters of the report.

Chapter 2 provides relevant background information on the tools and technologies used. I also further explain the similarity measures to be implemented, as defined in [Als15].
Chapter 3 gives information about design decisions, requirements (functional and non-functional), and the architecture of the deliverable.
Chapter 4 gives details about the technical aspects behind the component development and implementation.
Chapter 5 details the achievements of the project.
I give a walkthrough of the finished product and compare it against the functional and non-functional requirements defined.
Chapter 6 details the testing and evaluation of the finished product.
Chapter 7 concludes the report, giving my final view on the achievements of the project, possible future work and things I would have done differently.

Chapter 2
Background

This chapter provides a high-level explanation of the tools and technologies used in the development of the Semantic Similarity deliverable to achieve the objectives outlined in Chapter 1. Terms used in the report are defined and explained here.

2.1 The Semantic Web

The intention of the Semantic Web¹ is to extend the principles of the World Wide Web [FH09] and make data easily accessible using general Web technology. The Semantic Web provides a common framework for data to be "shared and reused across application, enterprise, and community boundaries" [FH09].

¹ https://www.w3.org/standards/semanticweb/

2.2 What is an ontology?

In the context of the Semantic Web, an ontology is a "machine-processable encyclopaedia" [Sat15]. It allows electronic agents to infer possible relationships between data and to resolve domain-specific user queries. Researchers and experts have developed specialised ontologies to annotate and share domain-specific information within communities [NM00, SRH10]. An example is the National Cancer Institute Thesaurus (NCIt)², a public ontology that contains over 100,000 textual definitions of terms and over 400,000 links between concepts.

2.3 Description Logics

Description Logics (DLs) [KSH12] are "a family of knowledge representation languages that present the knowledge of an application domain in a structured and formally well-understood way" [BHS08].
