Strategies for Amassing, Characterizing, and Applying Third-Party Metadata in Bioinformatics

Total Page:16

File Type:pdf, Size:1020Kb

Strategies for Amassing, Characterizing, and Applying Third-Party Metadata in Bioinformatics STRATEGIES FOR AMASSING, CHARACTERIZING, AND APPLYING THIRD-PARTY METADATA IN BIOINFORMATICS by BENJAMIN MCGEE GOOD B.Sc., The University of California at San Diego, 1998 M.Sc., The University of Sussex, 2000 A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES (Bioinformatics) THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver) April 2009 © Benjamin McGee Good, 2009 Abstract Bioinformatics resources on the Web are proliferating rapidly. For biomedical researchers, the vital data they contain is often difficult to locate and to integrate. The semantic Web initiative is an emerging collection of standards for sharing and integrating distributed information resources via the World Wide Web. In particular, these standards define languages for the provision of the metadata that facilitates both discovery and integration of distributed resources. This metadata takes the form of ontologies used to annotate information resources on the Web. Bioinformatics researchers are now considering how to apply these standards to enable a new generation of applications that will provide more effective ways to make use of increasingly diverse and distributed biological information. While the basic standards appear ready, the path to achieving the potential they entail is muddy. How are we to create all of the needed ontologies? How are we to use them to annotate increasingly large bodies of information? How are we to judge the quality of these ontologies and these proliferating annotations? As new metadata generating systems emerge on the Web, how are we to compare these to previous systems? The research conducted for this dissertation seeks new answers to these questions. Specifically, it investigates strategies for amassing, characterizing, and applying metadata (the substance of the semantic Web) in the context of bioinformatics. The strategies for amassing metadata orient around the design of systems that motivate and guide the actions of many individual, third-party contributors in the formation of collective metadata resources. The strategies for characterizing metadata focus on the derivation of fully automated protocols for evaluating and comparing ontologies and related metadata structures. New applications demonstrate how distributed information sources can be dynamically integrated to facilitate both information visualization and analysis. Brought together, these different lines of research converge towards the genesis of systems that will allow the biomedical research community to both create and maintain a semantic Web for the life sciences and to make use of the new capabilities for knowledge sharing and discovery that it will enable. ii Table of Contents Abstract .......................................................................................................................ii Table of Contents...........................................................................................................iii List of Tables ................................................................................................................vii List of Figures..............................................................................................................viii Glossary .......................................................................................................................x References for glossary ............................................................................................xiii Preface ....................................................................................................................xiv Acknowledgements....................................................................................................... xv Co-authorship statement..............................................................................................xvii 1 Introduction.............................................................................................................1 1.1 Dissertation overview ......................................................................................1 1.2 Informatics and bioinformatics ........................................................................2 1.3 Metadata..........................................................................................................3 1.4 Third-party metadata........................................................................................4 1.5 The World Wide Web......................................................................................4 1.5.1 Unique identification................................................................................6 1.6 The semantic Web ...........................................................................................6 1.6.1 The Resource Description Framework......................................................7 1.6.2 Ontology..................................................................................................8 1.6.3 The Web ontology language.....................................................................9 1.6.4 Ontologies in bioinformatics ....................................................................9 1.7 Prospects for a semantic Web for the life sciences ......................................... 10 1.8 Background on crowdcasting for knowledge acquisition................................ 13 1.9 Passive distributed metadata generation via social tagging............................. 15 1.10 Dissertation objectives and chapter summaries............................................... 16 References................................................................................................................. 21 2 Fast, cheap and out of control: a zero-curation model for ontology development ... 24 2.1 Introduction ................................................................................................... 24 2.2 Experimental context and target application for the YI Ontology ................... 25 2.3 Motivation and novelty of conference-based knowledge capture.................... 25 2.4 Interface design ............................................................................................. 26 2.4.1 Specific challenges faced in the conference domain ............................... 27 2.5 Methods - introducing the iCAPTURer.......................................................... 27 2.5.1 Preprocessing......................................................................................... 27 2.5.2 Priming the knowledge acquisition templates - term selection ................ 27 2.5.3 Term evaluation ..................................................................................... 28 2.5.4 Relation acquisition................................................................................ 28 2.5.5 Volunteer recruitment and reward .......................................................... 29 2.6 Observations.................................................................................................. 29 2.7 Quantitative results ........................................................................................ 30 2.7.1 Volunteer contributions.......................................................................... 30 2.7.2 Composition of the YI Ontology ............................................................ 30 2.7.3 Relationships in the YI Ontology ........................................................... 31 2.8 Quality assessment ........................................................................................ 31 iii 2.9 Summary ....................................................................................................... 32 2.10 Discussion ..................................................................................................... 33 References................................................................................................................. 37 3 Ontology engineering using volunteer labour......................................................... 38 3.1 Introduction ................................................................................................... 38 3.2 Creating an OWL version of MeSH ............................................................... 39 3.3 Experiment .................................................................................................... 39 3.3.1 Test data ................................................................................................ 39 3.4 Results........................................................................................................... 40 3.4.1 Performance of aggregated responses..................................................... 40 3.5 Discussion and future work............................................................................ 41 References................................................................................................................. 43 4 OntoLoki: an automatic, instance-based method for the evaluation of biological ontologies on the semantic Web .................................................................................... 44 4.1 Background ................................................................................................... 44 4.2 Results........................................................................................................... 45 4.2.1 OntoLoki method................................................................................... 45 4.2.2 Implementation
Recommended publications
  • Semantic Web Tools: an Overview 7Th International CALIBER 2009 Semantic Web Tools: an Overview
    Semantic Web Tools: An Overview 7th International CALIBER 2009 Semantic Web Tools: An Overview D Shivalingaiah Umesha Naik Abstract The WWW is the largest single information resource humanity. Unfortunately, despite its dependence on computers to operate at all, most of the information is only understandable by humans and not by computers. While computers can use the syntax of HTML documents to display in a web browser, Web computers can’t understand the content the semantics. Human beings are capable of using the Web to carry out tasks such as finding the information. However, a computer cannot accomplish the same tasks without human direction because web pages are designed to be read by public, not machines. The semantic web is a vision of information that is understandable by computers, so that public can perform more of the tedious work involved in finding, sharing and combining information on the web. The paper emphasizes the semantic tools available. Keywords: Semantic Web, Ontology, Resource Description Framework, Web Ontology Language, Web Service Modeling Language, Web Service Modeling Framework, DARPA 1. Introduction the resulting network of Linked Data the Giant Global Graph, in contrast to the HTML-based The Semantic Web takes the solution further. The WWW. Semantic Web is not a separate entity from the WWW. It is an extension to the Web that adds new Web 2.0 is focused on people the Semantic Web is data and metadata to existing Web documents, focused on machines. The Web requires a human extending those documents into data. This operator, using computer systems to perform the extension of Web documents to data is what will tasks required to find, search and aggregate its enable the Web to be processed automatically by information.
    [Show full text]
  • Gbrowse Moby: a Web-Based Browser for Biomoby Services Mark Wilkinson*
    Source Code for Biology and Medicine BioMed Central Software review Open Access Gbrowse Moby: a Web-based browser for BioMoby Services Mark Wilkinson* Address: Assistant Professor, Department of Medical Genetics, University of British Columbia, James Hogg iCAPTURE Centre for Cardiovascular and Pulmonary Research, St. Paul's Hospital, Rm. 166, 1081 Burrard St., Vancouver, BC, V6Z 1Y6, Canada Email: Mark Wilkinson* - [email protected] * Corresponding author Published: 24 October 2006 Received: 08 July 2006 Accepted: 24 October 2006 Source Code for Biology and Medicine 2006, 1:4 doi:10.1186/1751-0473-1-4 This article is available from: http://www.scfbm.org/content/1/1/4 © 2006 Wilkinson; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Abstract Background: The BioMoby project aims to identify and deploy standards and conventions that aid in the discovery, execution, and pipelining of distributed bioinformatics Web Services. As of August, 2006, approximately 680 bioinformatics resources were available through the BioMoby interoperability platform. There are a variety of clients that can interact with BioMoby-style services. Here we describe a Web-based browser-style client – Gbrowse Moby – that allows users to discover and "surf" from one bioinformatics service to the next using a semantically-aided browsing interface. Results: Gbrowse Moby is a low-throughput, exploratory tool specifically aimed at non- informaticians. It provides a straightforward, minimal interface that enables a researcher to query the BioMoby Central web service registry for data retrieval or analytical tools of interest, and then select and execute their chosen tool with a single mouse-click.
    [Show full text]
  • Biological Data Integration Using Semantic Web Technologies
    Biological data integration using Semantic Web technologies Pasquier C Phone: +33 492 07 6947 Fax: +33 492 07 6432 Email: [email protected] Institute of Signaling, Developmental Biology & Cancer CNRS - UMR 6543, University of Nice Sophia-Antipolis Parc Valrose, 06108 NICE cedex 2, France. Summary Current research in biology heavily depends on the availability and efficient use of information. In order to build new knowledge, various sources of biological data must often be combined. Semantic Web technologies, which provide a common framework allowing data to be shared and reused between applications, can be applied to the management of disseminated biological data. However, due to some specificities of biological data, the application of these technologies to life science constitutes a real challenge. Through a use case of biological data integration, we show in this paper that current Semantic Web technologies start to become mature and can be applied for the development of large applications. However, in order to get the best from these technologies, improvements are needed both at the level of tool performance and knowledge modeling. Keywords Data integration, Semantic Web, OWL, RDF, SPARQL, Knowledge Base System (KBS) Introduction Biology is now an information-intensive science and research in genomics, transcriptomics and proteomics heavily depend on the availability and the efficient use of information. When data were structured and organized as a collection of records in dedicated, self-sufficient databases, information was retrieved by performing queries on the database using a specialized query language; for example SQL (Structured Query Language) for relational databases or OQL (Object Query Language) for object databases.
    [Show full text]
  • Using Ontology and Semantic Web Services to Support Modeling in Systems Biology
    Centre for Mathematics & Physics in the Life Sciences and EXperimental Biology University College London Using Ontology and Semantic Web Services to Support Modeling in Systems Biology Zhouyang Sun Submitted for the degree of Doctor of Philosophy At University College London December 2008 Revised for the final submission 2009 I, Zhouyang Sun, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. Signature: Date: Abstract This thesis addresses the problem of collaboration among experimental biologists and modelers in the study of systems biology by using ontology and Semantic Web Services techniques. Modeling in systems biology is concerned with using experimental information and mathematical methods to build quantitative models across different biological scales. This requires interoperation among various knowledge sources and services. Ontology and Semantic Web Services potentially provide an infrastructure to meet this requirement. In our study, we propose an ontology-centered framework within the Semantic Web infrastructure that aims at standardizing various areas of knowledge involved in the biological modeling processes. In this framework, first we specify an ontology-based meta-model for building biological models. This meta-model supports using shared biological ontologies to annotate biological entities in the models, allows semantic queries and automatic discoveries, enables easy model reuse and composition, and serves as a basis to embed external knowledge. We also develop means of transforming biological data sources and data analysis methods into Web Services. These Web Services can then be composed together to perform parameterization in biological modeling.
    [Show full text]
  • Ontology Based Data Integration in Life Sciences
    Ontology based data integration in Life Sciences Dmitry Repchevskiy A questa tesi doctoral està subjecta a la llicència Reconeixement 3.0. Espanya de Creative Commons . Esta tesis doctoral está sujeta a la licencia Reconocimi ento 3.0. España de Creative Commons . Th is doctoral thesis is licensed under the Creative Commons Att ribution 3.0. Spain License . UNIVERSITY OF BARCELONA FACULTY OF BIOLOGY Doctorate Program: Biomedicine Research Line: Bioinformatics 2010-2015 Ontology based data integration in life sciences Submitted by Dmitry Repchevskiy in fulfillment of the requirements for the doctoral degree by the University of Barcelona Supervisor: Dr. Josep Lluís Gelpí Buchaca Department of Biochemistry and Molecular Biology University of Barcelona Dmitry Repchevskiy Barcelona Supercomputing Center National Institute of Bioinformatics ACKNOWLEDGEMENTS “It is not the consciousness of men that determines their being, but, on the contrary, their social being that determines their consciousness.” Karl Marx Without any doubts this thesis wouldn’t be possible without many people that I had a pleasure to work with. If only I tried to enumerate all their invaluable help, all the chats and communications we had, this thesis would require a second volume. First and foremost, I would like to thank my supervisor Josep Lluís Gelpí, who gave me a lot of freedom in defining projects I worked on. Not all of them have been included into this thesis and some of them never reached the end, but the real experience gathered in these years allowed me to go get into this final point. I would also like to thank my colleague José María Fernández from Spanish National Cancer Research Centre (CNIO), who always found a time to discuss technological aspects of my projects and to our entire group just to be with me all these long years.
    [Show full text]
  • The Semantic Automated Discovery and Integration (SADI) Web Service
    Wilkinson et al. Journal of Biomedical Semantics 2011, 2:8 http://www.jbiomedsem.com/content/2/1/8 JOURNAL OF BIOMEDICAL SEMANTICS DATABASE Open Access The Semantic Automated Discovery and Integration (SADI) Web service Design-Pattern, API and Reference Implementation Mark D Wilkinson*, Benjamin Vandervalk and Luke McCarthy * Correspondence: Abstract [email protected] Department of Medical Genetics, Background: The complexity and inter-related nature of biological data poses a Heart + Lung Institute at St. Paul’s Hospital, University of British difficult challenge for data and tool integration. There has been a proliferation of Columbia, Vancouver, BC, Canada interoperability standards and projects over the past decade, none of which has been widely adopted by the bioinformatics community. Recent attempts have focused on the use of semantics to assist integration, and Semantic Web technologies are being welcomed by this community. Description: SADI - Semantic Automated Discovery and Integration - is a lightweight set of fully standards-compliant Semantic Web service design patterns that simplify the publication of services of the type commonly found in bioinformatics and other scientific domains. Using Semantic Web technologies at every level of the Web services “stack”, SADI services consume and produce instances of OWL Classes following a small number of very straightforward best-practices. In addition, we provide codebases that support these best-practices, and plug-in tools to popular developer and client software that dramatically simplify deployment of services by providers, and the discovery and utilization of those services by their consumers. Conclusions: SADI Services are fully compliant with, and utilize only foundational Web standards; are simple to create and maintain for service providers; and can be discovered and utilized in a very intuitive way by biologist end-users.
    [Show full text]