Semi-Automated Ontology Generation for Biocuration and Semantic Search
Total Page:16
File Type:pdf, Size:1020Kb
Semi-automated Ontology Generation for Biocuration and Semantic Search Dissertation zur Erlangung des akademischen Grades Doktoringenieur (Dr.-Ing.) vorgelegt an der Technischen Universität Dresden Fakultät Informatik von Dipl.-Inf. Thomas Wächter geboren am 25. August 1978 in Jena Betreuender Hochschullehrer: Prof. Dr. Michael Schroeder Technische Universität Dresden Gutachter: Prof. Dr. Lawrence Hunter University of Colorado, Denver/Boulder, USA Tag der Einreichung: 04. 06. 2010 Tag der Verteidigung: 27. 10. 2010 Abstract Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely miss- ing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontolo- gies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child re- lations from text in PubMed, the web, and PDF repositories. The system is seam- lessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern- based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. Definitions can be retrieved for up to 78% of terms, child ancestor relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing an ontology has been developed that contains 17,151 terms of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands this request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available on-line under www.Go3R.org. Acknowledgements The research presented in this thesis has been carried out at the Biotechnology Cen- ter (BIOTEC) of the Technische Universität Dresden. I am extremely fortunate and grateful to have enjoyed the benefits of this excellent environment. Over the years, I received help, support, and friendship from a number of great people. It is a pleasure to thank those who made this thesis possible. First of all, I would like to thank Michael Schroeder, my excellent supervisor, for his guidance, ideas, and support on this journey. His support was in excess of anything that I could have wished and made everything achieved possible. Special thanks to Heiko Dietze, Andreas Doms, and Loic A. Royer for compan- ionship, advice, trust, and professionalism right from the beginning. It is a pleasure for me to thank my colleagues Dimitra Alexopulou, Rainer Winnenburg, and An- dreas Henschel for the pleasant, effective, and successfull joint work, as well as all members of the bioinformatics group for the valuable feedback and support they always provided. I would like to express my deepest respect and gratitude to Ursula Sauer and Bar- bare Grune for their dedication and the confidence they showed during the creation of the Go3R search engine. In this regard I also want to thank all former and current developers and em- ployees of Transinsight GmbH and in particular Matthias Zschunke for his great support for this project. It has been a pleasure for me to work in such a creative and productive team. Furthermore, I feel I must thank Dr. Monica Sturm and Prof. Michael Göttfert who during my studies gave me this insight beyond the curriculum, which made this little but so important difference. Finally, I dedicate this work to my family. I am thankful and deeply appreciate and acknowledge their continuing support, patience, and understanding, each in their own way. Especially thank you Sylvia, thank you Elisa, and thank you Elena. Contents Table of Contents ......................................................... iv List of Tables ............................................................. viii List of Figures ............................................................ xii Publications .............................................................. xvii 1 Introduction .......................................................... 1 1.1 Methods for semi-automated ontology generation...................1 1.2 Automated ontology generation support for biocuration.............3 1.3 Semantic search for alternative methods to animal testing............3 1.4 Overview.........................................................5 2 Background ........................................................... 7 2.1 Introduction......................................................7 2.1.1 Development of biomedical ontologies........................8 2.1.2 Biocuration..................................................8 2.1.3 Ontology-based literature search..............................9 2.1.4 Alternative methods to animal testing......................... 10 2.2 Preliminaries...................................................... 11 2.2.1 Ontology.................................................... 11 2.2.2 Biomedical ontologies and controlled vocabularies............. 12 2.2.3 Biomedical text-mining...................................... 18 2.2.4 Evaluation methodologies.................................... 22 2.3 Related Work: Ontology Learning.................................. 26 2.3.1 Automatic term recognition methods......................... 27 2.3.2 Finding synonyms........................................... 37 2.3.3 Abbreviation detection....................................... 40 2.3.4 Generating textual definitions................................ 43 2.3.5 Taxonomy generation........................................ 48 2.3.6 Availability of ontology generation methods and tools......... 53 2.4 Summary and Discussion.......................................... 55 vi Contents 3 Terminology Generation ............................................... 61 3.1 DOG4DAG Term Generation Method............................... 64 3.2 Proof of concept................................................... 68 3.3 LMO Benchmark: lipoprotein metabolism ............................. 72 3.4 Evaluation of the quality of generated terms........................ 72 3.4.1 Noun phrases as term candidates............................. 73 3.4.2 Comparison of different term generation methods............. 75 3.4.3 Comparison of term ranking measures: C-Value vs. tf-idf....... 77 3.4.4 Quality of terms in dependence of the scoring method......... 78 3.5 Stability of the ranking............................................. 84 3.5.1 Dependency on the part-of-speech tagger..................... 85 3.5.2 Dependency on the global corpus: Google vs. PubMed......... 90 3.5.3 Dependency on the ranking score: tf-idf vs. probability of occurrence.................................................. 96 3.6 Summary and Discussion.......................................... 101 3.7 Future Work...................................................... 104 4 Definition Extraction .................................................. 107 4.1 DOG4DAG Definition Extraction Method........................... 108 4.2 Evaluation: Answering TREC2003 definitional questions............. 113 4.3 Evaluation: Generation of GO and MeSH definitions................. 116 4.4 Summary and Discussion.......................................... 118 4.5 Future Work...................................................... 120 5 Taxonomy Generation ................................................. 121 5.1 DOG4DAG Taxonomy Generation Method.......................... 122 5.2 Evaluation: Taxonomy generation based on generated definitions..... 123 5.3 Pattern-based relation extraction – Superstring prediction............ 124 5.4 Co-occurrence analysis – Algorithm by Heymann et. al............... 128 5.5 Summary and Discussion.......................................... 139 5.6 Future Work...................................................... 142