Strategies for Amassing, Characterizing, and Applying Third-Party Metadata in Bioinformatics
Total Page:16
File Type:pdf, Size:1020Kb
STRATEGIES FOR AMASSING, CHARACTERIZING, AND APPLYING THIRD-PARTY METADATA IN BIOINFORMATICS by BENJAMIN MCGEE GOOD B.Sc., The University of California at San Diego, 1998 M.Sc., The University of Sussex, 2000 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Bioinformatics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) April 2009 © Benjamin McGee Good, 2009 Abstract Bioinformatics resources on the Web are proliferating rapidly. For biomedical researchers, the vital data they contain is often difficult to locate and to integrate. The semantic Web initiative is an emerging collection of standards for sharing and integrating distributed information resources via the World Wide Web. In particular, these standards define languages for the provision of the metadata that facilitates both discovery and integration of distributed resources. This metadata takes the form of ontologies used to annotate information resources on the Web. Bioinformatics researchers are now considering how to apply these standards to enable a new generation of applications that will provide more effective ways to make use of increasingly diverse and distributed biological information. While the basic standards appear ready, the path to achieving the potential they entail is muddy. How are we to create all of the needed ontologies? How are we to use them to annotate increasingly large bodies of information? How are we to judge the quality of these ontologies and these proliferating annotations? As new metadata generating systems emerge on the Web, how are we to compare these to previous systems? The research conducted for this dissertation seeks new answers to these questions. Specifically, it investigates strategies for amassing, characterizing, and applying metadata (the substance of the semantic Web) in the context of bioinformatics. The strategies for amassing metadata orient around the design of systems that motivate and guide the actions of many individual, third-party contributors in the formation of collective metadata resources. The strategies for characterizing metadata focus on the derivation of fully automated protocols for evaluating and comparing ontologies and related metadata structures. New applications demonstrate how distributed information sources can be dynamically integrated to facilitate both information visualization and analysis. Brought together, these different lines of research converge towards the genesis of systems that will allow the biomedical research community to both create and maintain a semantic Web for the life sciences and to make use of the new capabilities for knowledge sharing and discovery that it will enable. ii Table of Contents Abstract .......................................................................................................................ii Table of Contents...........................................................................................................iii List of Tables ................................................................................................................vii List of Figures..............................................................................................................viii Glossary .......................................................................................................................x References for glossary ............................................................................................xiii Preface ....................................................................................................................xiv Acknowledgements....................................................................................................... xv Co-authorship statement..............................................................................................xvii 1 Introduction.............................................................................................................1 1.1 Dissertation overview ......................................................................................1 1.2 Informatics and bioinformatics ........................................................................2 1.3 Metadata..........................................................................................................3 1.4 Third-party metadata........................................................................................4 1.5 The World Wide Web......................................................................................4 1.5.1 Unique identification................................................................................6 1.6 The semantic Web ...........................................................................................6 1.6.1 The Resource Description Framework......................................................7 1.6.2 Ontology..................................................................................................8 1.6.3 The Web ontology language.....................................................................9 1.6.4 Ontologies in bioinformatics ....................................................................9 1.7 Prospects for a semantic Web for the life sciences ......................................... 10 1.8 Background on crowdcasting for knowledge acquisition................................ 13 1.9 Passive distributed metadata generation via social tagging............................. 15 1.10 Dissertation objectives and chapter summaries............................................... 16 References................................................................................................................. 21 2 Fast, cheap and out of control: a zero-curation model for ontology development ... 24 2.1 Introduction ................................................................................................... 24 2.2 Experimental context and target application for the YI Ontology ................... 25 2.3 Motivation and novelty of conference-based knowledge capture.................... 25 2.4 Interface design ............................................................................................. 26 2.4.1 Specific challenges faced in the conference domain ............................... 27 2.5 Methods - introducing the iCAPTURer.......................................................... 27 2.5.1 Preprocessing......................................................................................... 27 2.5.2 Priming the knowledge acquisition templates - term selection ................ 27 2.5.3 Term evaluation ..................................................................................... 28 2.5.4 Relation acquisition................................................................................ 28 2.5.5 Volunteer recruitment and reward .......................................................... 29 2.6 Observations.................................................................................................. 29 2.7 Quantitative results ........................................................................................ 30 2.7.1 Volunteer contributions.......................................................................... 30 2.7.2 Composition of the YI Ontology ............................................................ 30 2.7.3 Relationships in the YI Ontology ........................................................... 31 2.8 Quality assessment ........................................................................................ 31 iii 2.9 Summary ....................................................................................................... 32 2.10 Discussion ..................................................................................................... 33 References................................................................................................................. 37 3 Ontology engineering using volunteer labour......................................................... 38 3.1 Introduction ................................................................................................... 38 3.2 Creating an OWL version of MeSH ............................................................... 39 3.3 Experiment .................................................................................................... 39 3.3.1 Test data ................................................................................................ 39 3.4 Results........................................................................................................... 40 3.4.1 Performance of aggregated responses..................................................... 40 3.5 Discussion and future work............................................................................ 41 References................................................................................................................. 43 4 OntoLoki: an automatic, instance-based method for the evaluation of biological ontologies on the semantic Web .................................................................................... 44 4.1 Background ................................................................................................... 44 4.2 Results........................................................................................................... 45 4.2.1 OntoLoki method................................................................................... 45 4.2.2 Implementation