Proquest Dissertations
Total Page:16
File Type:pdf, Size:1020Kb
Modeling Uncertainty in Data Integration for Improving Protein Function Assignment Brenton E. Louie A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Washington 2008 Program Authorized to Offer Degree: Medical Education and Biomedical Informatics UMI Number: 3318214 INFORMATION TO USERS The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. ® UMI UMI Microform 3318214 Copyright 2008 by ProQuest LLC. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC 789 E. Eisenhower Parkway PO Box 1346 Ann Arbor, Ml 48106-1346 University of Washington Graduate School This is to certify that I have examined this copy of a doctoral dissertation by Brenton E. Louie and have found that it is complete and satisfactory in all respects, and that any and all revisions required by the final examining committee have been made. Chair of the Supervisory Committee: Peter Tar^zv-Hornocfey- h Reading Committee: &/&L f OMA^J "("fcf^^cM Peter Tarczy-Hornoch Dan Suciu V In presenting this dissertation in partial fulfillment of the requirements for the doctoral degree at the University of Washington, I agree that the Library shall make its copies freely available for inspection. I further agree that extensive copying of the dissertation is allowable only for scholarly purposes, consistent with "fair use" as prescribed in the U.S. Copyright Law. Requests for copying or reproduction of this dissertation may be referred to ProQuest Information and Learning, 300 North Zeeb Road, Ann Arbor, MI 48106-1346, 1-800-521-0600, to whom the author has granted "the right to reproduce and sell (a) copies of the manuscript in microform and/or (b) printed copies of the manuscript made from microform." Signature ^CVy^-OWiVu Date vUW University of Washington Abstract Modeling Uncertainty in Data Integration for Improving Protein Function Assignment Brenton E. Louie Chair of the Supervisory Committee: Professor Peter Tarczy-Hornoch Department of Medical Education and Biomedical Informatics In this work we describe the development and evaluation of the BioMiner system for protein functional annotation. BioMiner is the implementation of a novel uncertainty model for annotation and is based on the Uncertainty in Information Integration (UII) system, a general-purpose data integration system with extended functionality to handle uncertainty in data. The informatics contributions of our work are as follows: 1) we develop and implement a first-in-class uncertainty model for annotation and illustrate the validity of the model, 2) we show that the uncertainty model is reliable by evaluating its robustness through a principled methodology, and 3) we demonstrate that the uncertainty model performs better that existing, commonly utilized, approaches through a rigorous performance evaluation. The application of BioMiner also contributes to the expansion of domain knowledge by accurately identifying functions for proteins of unknown function, a problem of utmost importance to biology. TABLE OF CONTENTS Page LIST OF FIGURES iii LIST OF TABLES x GLOSSARY xiii Chapter 1 : INTRODUCTION 1 1.1 Biological Background and Significance 1 1.2 Research Questions 6 1.3 Related Work 7 1.4 Contributions of this Dissertation 8 1.5 Outline of this dissertation 9 Chapter 2 : ASSIGNING FUNCTION TO PROTEINS 10 2.1 Background: Protein Annotation 10 2.2 Related Work: Current Approaches In Protein Annotation 15 2.3 Challenges for Computational Annotation Systems 21 2.4 Discussion 26 Chapter 3 : BIOLOGICAL DATA INTEGRATION 29 3.1 Overview Of Data Integration Concepts 29 3.2 Related Work: Existing Data Integration Systems 33 3.3 The BioMediator Data Integration System 37 3.4 Uncertainty: The Uncertainty In Information Integration (UII) Proj ect 43 3.5 Discussion 55 Chapter 4 : THE BIOMINER SYSTEM 56 4.1 Design of the BioMiner System for Protein Annotation 56 4.2 Related Work 63 4.3 Implementation of the BioMiner System 64 I 4.4 Evolution of the BioMiner system 72 4.5 Discussion 73 Chapter 5 : THE ROBUSTNESS OF BIOMINER ....76 5.1 The stability of results from the BioMiner system 76 5.2 Related Work: Sensitivity analyses in Bayesian Networks 79 5.3 A Sensitivity Analysis of the BioMiner System 81 5.4 Study Evaluation Protocol 87 5.5 Results .....95 5.6 Discussion 106 Chapter 6 : EVALUATION OF BIOMINER 110 6.1 Multiple Evaluations of the BioMiner System 110 6.2 Related Work: Evaluating Computational Annotation Systems 112 6.3 Proof of Concept Evaluation #1: Relevance Ranking 113 6.4 Proof of Concept Evaluation #2: protein annotation 117 6.5 BioMiner Evaluation Study: hypothetical protein annotation 124 6.6 Discussion 137 Chapter 7 : SUMMARY AND CONCLUSIONS 141 7.1 Research Contributions 141 7.2 Limitations 145 7.3 Summary of Research 148 Bibliography 150 Appendix A: Annotation Datasets 162 Appendix B: Annotation Method 167 ii LIST OF FIGURES Figure Number Page Figure 1.1: A genome is the blueprint for creating an organism. It is encoded in DNA, a base-four sequence of nucleic acids (A, T, G, and C). Within a genome are specific subsequences called genes. A gene is a template which encodes a base- twenty sequence of amino acids, or a protein. Proteins carry out all the biological processes in an organism. The term gene and protein are sometimes used interchangeably in the biomedical literature. This is understandable given the "gene-codes-for-protein" relationship, although this is confusing to non-experts. In this dissertation, the terms gene and protein are generally interpreted as synonyms 1 Figure 1.2: Novel proteins in the bacterium S. oneidensis allow it to "breathe" hexavalent uranium. This converts hexavalent uranium to uranitite which is insoluble in water and much easer to clean-up (courtesy of Eugene Kolker) 3 Figure 1.3: The number of biological databases has continued to increase over time, creating a significant data integration challenge in computational protein annotation (courtesy of [12]) 5 Figure 2.1: The glycolysis pathway, which provides an illustration of protein function. The hexokinase protein is an enzyme which catalyzes the first reaction of the process by which the human body converts sugar to ATP, a form of energy usable by cells (www.biocarta.com) 10 Figure 2.2: To annotate proteins in the current paradigm, biological researchers manually query and compile data and information from multiple, non-interoperable biological data sources which greatly slows the pace of annotation. (Figure adapted and modified from [12]) 13 Figure 2.3: A new paradigm for annotation using data integration. Users query software which handles the specific querying and compiling of results from the individual data sources. To the user, all the individual data sources appear as a single database (adapted from [28]) 14 Figure 2.4: An illustration of the proportions of various types of annotations in the organism Shewanella oneidensis. Only a small percentage of annotations can be ascribed to actual experiments (about 5%). Most proteins have been annotated by computational means (sequence comparisons) or are of unknown function (Courtesy of Eugene Kolker, PhD) 17 in Figure 2.5: Systems for computational protein annotation generally follow this three- tiered approach. Tier one is a data-management system, Tier 2 includes a way to query the data, and Tier 3 provides views of the data. This figure is courtesy of [27] 20 Figure 3.1: Architecture of the BioMediator system. The core components are the data interfaces or wrappers, the Source Knowledge Base which contains the mediated schema, the Query Processor, and the User Query Interfaces (courtesy of [55]).. 38 Figure 3.2: BioMediator result viewed as a graph with nodes akin to mediated schema concepts and edges referring to relationships between concepts. Results are derived from an initial seed query for the HK1 gene (Gene:symbol="HKl",) after 1 expansion 42 Figure 3.3: The same result set as in Figure 3.4 after 4 expansions. The size of the result set (e.g. nodes and edges) becomes large very quickly. Users have difficulty analyzing and selecting relevant information from result sets of this size 43 Figure 3.4: Illustration of the Ps metric. The Ps metric is a user's belief in the quality of a particular mediated schema entity from a particular source, which is interpreted probabilistically (0.0-1.0) value 49 Figure 3.5: Illustration of the Qs metric. The Qs metric is a user's belief in the quality of a particular relationship, as defined in the mediated schema, between two sources, which is interpreted probabilistically (0.0-1.0 value) 49 Figure 3.6: Illustration of the Pr metric. This metric is a users belief in a particular data record of the same mediated schema type and data source. It is interpreted probabilistically as a 0.0-1.0 value, which is dynamically determined when a record is retrieved 50 Figure 3.7: Illustration of the Qr metric. This metric is a user's belief in a particular cross-reference, or link, between two data records. It is interpreted probabilistically as a 0.0-1.0 value, and is dynamically determined when records are retrieved, much like the Pr metric 50 Figure 3.8: An illustration of a result graph from the UII system annotated with uncertainty metrics. Ps and Pr metrics are assigned to the nodes, which correspond to mediated schema entities.