Enhanced Ontological Searching of Medical Scientific Information
University of Manchester
School of Computer Science
Degree Programme of Advanced Computer Science

Enhanced Ontological Searching of Medical Scientific Information

Christos Karaiskos

A dissertation submitted to The University of Manchester for the degree of Master of Science in the Faculty of Engineering and Physical Sciences

Master's Thesis
2013

Contents

Abstract
Declaration
Intellectual Property Statement
Acknowledgements
List of Abbreviations
List of Tables
List of Figures
1 Introduction
  1.1 Problem Context
  1.2 Motivation
  1.3 Contribution
  1.4 Thesis Organization
2 Ontologies
  2.1 Modern Ontology Definition
  2.2 Ontology vs. Terminology
  2.3 Notable Biomedical Ontologies and Terminologies
    2.3.1 SNOMED CT
    2.3.2 NDF-RT
    2.3.3 ICD-10
    2.3.4 MedDRA
    2.3.5 NCI Thesaurus
3 Similarity Metrics
  3.1 Similarity Metric vs. Distance Metric
  3.2 Lexical Similarity
    3.2.1 Character-based Similarity Measures
      Longest Common Substring
      Hamming Similarity
      Levenshtein Similarity
      Jaro Similarity
      Jaro-Winkler Similarity
      N-gram Similarity
    3.2.2 Word-based Similarity Measures
      Dice Similarity
      Jaccard Similarity
      Cosine Similarity
      Manhattan Similarity
      Euclidean Similarity
  3.3 Ontological Semantic Similarity
    3.3.1 Intra-ontology Semantic Similarity
      Distance-based Metrics
      Information-Based Metrics
      Feature-Based Measures
    3.3.2 Inter-ontology Semantic Similarity
4 Search Interfaces
  4.1 Information Seeking Models
  4.2 Query Specification
  4.3 Presentation of Search Results
  4.4 Query Reformulation
5 Requirements
  5.1 Feature Specification
6 Design
  6.1 Stage I: Access to Medical Ontologies
    6.1.1 Database and Table Creation
    6.1.2 Populating the Database Tables
  6.2 Stage II: Computation of Semantic Similarity
    6.2.1 Term Neighborhoods
    6.2.2 Semantic Similarity Calculation
  6.3 Stage III: Interface Design Data Presentation
  6.4 Summary of Technology Choices
7 Implementation
  7.1 Structure
  7.2 Search Entry Form
  7.3 Handling the Input Query
    7.3.1 Typing Speed
    7.3.2 Querying the Database
    7.3.3 Ranking and Grouping of Search Results
    7.3.4 Return-key or Mouse-click Search
    7.3.5 Auto-completion Search
  7.4 Error Correction
  7.5 Term Information Presentation
  7.6 Navigation
8 Evaluation
  8.1 Testing the Failed Queries
  8.2 Comparison to BioPortal Search Services
    8.2.1 Auto-completion
    8.2.2 Results Ranking
    8.2.3 Error Correction
    8.2.4 Visualization
  8.3 Comments from an AstraZeneca Search Specialist
9 Conclusions and Future Work
  9.1 Conclusions
  9.2 Future Work
Bibliography

Number of Words in the Document: 25648

ABSTRACT OF MASTER'S THESIS

Author: Christos Karaiskos
Title: Enhanced Ontological Searching of Medical Scientific Information
Supervisors: Prof. Andrew Brass (University of Manchester)
             Dr. Jennifer Bradford (AstraZeneca)

Abstract: An enormous amount of biomedical knowledge is encoded in narrative textual format. In an attempt to discover new or hidden knowledge, extensive research is being conducted to extract and exploit term relationships from plain text, with the aid of technology. A common approach for the identification of biomedical entities in plain text involves usage of ontologies, i.e., knowledge bases which provide formal machine-understandable representations of domains of variable specificity. In addition to term extraction, ontologies may be used as controlled vocabularies or as a means for automatic knowledge acquisition through their inherent inference capabilities.
Visualization of the content of ontologies is, thus, very important for researchers in the biomedical domain. Unfortunately, many of these researchers find it difficult to deal with formal logic and would prefer that ontology search interfaces completely hide any structural or functional references to ontologies. This thesis proposes a strategy for building a web-based ontology search application that exploits ontologies behind the scenes, transparently to the end user, and presents relevant concept information in such a way that searchers can successfully and quickly find what they are looking for. The proposed search interface features various search tools for enhanced ontological searching, including term auto-completion, error correction, clever results ranking, and similar term visualizations based on semantic similarity metrics. Evaluation of the developed application shows that its features can improve enterprise-strength ontology search applications, such as BioPortal.

Keywords: search interface design, ontology hiding, biomedical ontology, semantic similarity, usability, data integration

Declaration

No portion of the work referred to in the dissertation has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Intellectual Property Statement

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns certain copyright or related rights in it (the 'Copyright') and he has given The University of Manchester certain rights to use such Copyright, including for administrative purposes.

ii. Copies of this dissertation, either in full or in extracts and whether in hard or electronic copy, may be made only in accordance with the Copyright, Designs and Patents Act 1988 (as amended) and regulations issued under it or, where appropriate, in accordance with licensing agreements which the University has entered into. This page must form part of any such copies made.

iii. The ownership of certain Copyright, patents, designs, trade marks and other intellectual property (the 'Intellectual Property') and any reproductions of copyright works in the dissertation, for example graphs and tables ('Reproductions'), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and commercialisation of this dissertation, the Copyright and any Intellectual Property and/or Reproductions described in it may take place is available in the University IP Policy (see http://documents.manchester.ac.uk/display.aspx?DocID=487), in any relevant Dissertation restriction declarations deposited in the University Library, The University Library's regulations (see http://www.manchester.ac.uk/library/aboutus/regulations) and in The University's Guidance for the Presentation of Dissertations.

Acknowledgements

I am deeply grateful to my supervisors, Prof. Andrew Brass (University of Manchester) and Dr. Jennifer Bradford (AstraZeneca), for their invaluable guidance and support throughout the duration of this project. I have greatly benefited from experiencing the different perspectives of academia and industry, which have both contributed to shaping the final outcome of this project.
I would like to thank Sebastian Philipp Brandt (University of Manchester), for his suggestions on making the search application even better. Also, I would like to express my gratitude to Julie Mitchell (AstraZeneca), for taking the time to evaluate the application, and Paul Metcalfe (AstraZeneca), for his advice on improving the performance and security of the application.

Finally, I would like to thank Matina for her patience and love, and my parents, Ioannis and Stavroula, for always being there.

List of Abbreviations

AI         Artificial Intelligence
AJAX       Asynchronous JavaScript and XML
API        Application Programming Interface
CSS        Cascading Style Sheets
DAG        Directed Acyclic Graph
HLGT       High Level Group Term
HLT        High Level Term
HTTP       Hypertext Transfer Protocol
IC         Information Content
ICD        International Classification of Diseases
JDBC       Java Database Connectivity
JSON       JavaScript Object Notation
LCS        Least Common Subsumer
MedDRA     Medical Dictionary for Regulatory Activities
NCIT       National Cancer Institute Thesaurus
NDF-RT     National Drug File Reference Terminology
NHS        UK National Health Service
NLP        Natural Language Processing
OBO        Open Biomedical Ontologies
OWL        Web Ontology Language
PHP        PHP Hypertext Preprocessor
PT         Preferred Term
RDF        Resource Description Framework
RDF-S      Resource Description Framework Schema
REST       Representational State Transfer
RF2        Release Format 2
SNOMED CT  Systematized Nomenclature of Medicine Clinical Terms
SNOMED RT  Systematized Nomenclature of Medicine Reference Terminology
SOC        System Organ Class
UMLS       Unified Medical Language System
URI        Uniform Resource Identifier
URL        Uniform Resource Locator
UX         User Experience
VA         U.S. Department of Veterans Affairs
WHO        World Health Organization
XHTML      Extensible HyperText Markup Language
XML        Extensible Markup Language

List of Tables

5.1 Documented failed queries and suggested reasons for failure.
5.2 Documented failed queries and suggested reasons for failure (cont.).
6.1 'Ontologies' database table structure.
6.2 Examples of URI formats for BioPortal RESTful services.
6.3 Technology choices for the project.
7.1 PHP files used in the search application.
7.2 XHTML files used in the search application.
7.3 CSS files used in the search application.
7.4 JavaScript files used in the search application.
8.1 Testing previously failed queries.

List of Figures

2.1 The structure of the MedDRA terminology comprises a fixed-depth hierarchy.
4.1 The Google search engine entry form.
4.2 Facebook uses grayed-out descriptive text to help in the formulation of user queries.
4.3 Bing's search interface features a powerful dynamic search suggestion, where prefixes are highlighted with grayed-out font and the remaining text is in bold.