Towards Data Cleaning in Large Public Biological Databases
Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations
2021

Towards data cleaning in large public biological databases
Hamid Bagheri
Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation
Bagheri, Hamid, "Towards data cleaning in large public biological databases" (2021). Graduate Theses and Dissertations. 18448. https://lib.dr.iastate.edu/etd/18448

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository.

Towards data cleaning in large public biological databases
by
Hamid Bagheri

A dissertation submitted to the graduate faculty in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee:
Hridesh Rajan, Major Professor
James Reecy
Samik Basu
David Fernandez-Baca
Xiaoqiu Huang

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University
Ames, Iowa
2021

Copyright © Hamid Bagheri, 2021. All rights reserved.

DEDICATION

To my family and friends. Without their support, this thesis would have not been completed.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
ABSTRACT

CHAPTER 1. INTRODUCTION
  1.1 Contributions
  1.2 Outline

CHAPTER 2. RELATED WORK
  2.1 Domain-Specific Language and Large-scale infrastructure for genomics analysis
  2.2 Data parallelization framework and public databases in biology
  2.3 Taxonomic misclassification
  2.4 Functional misannotations

CHAPTER 3. SHARED DATA SCIENCE INFRASTRUCTURE FOR GENOMICS DATA
  3.1 Background
  3.2 Potential for data parallelization framework in biology
  3.3 Choice of Biological repository for prototype implementation
  3.4 Design and implementation considerations
    3.4.1 Genomics-specific Language and data schema
    3.4.2 Output Aggregators in BoaG
    3.4.3 BoaG database and new data type integration
    3.4.4 Data availability
    3.4.5 Run BoaG on Docker container and Jupyter
  3.5 Application of BoaG to the RefSeq database
  3.6 Results on the RefSeq database
    3.6.1 Summary statistics of RefSeq
    3.6.2 The largest and smallest genome in the RefSeq database
    3.6.3 Study the changes of average number of exons per gene over time
    3.6.4 Popularity of bacterial genome assembly programs
    3.6.5 Study the quality of metazoan assembly programs over time
  3.7 Discussion and Future work
    3.7.1 Database storage efficiency and computational efficiency with Hadoop
    3.7.2 Comparison between MongoDB and BoaG
    3.7.3 Comparison between Python and BoaG
  3.8 Conclusion

CHAPTER 4. A CYBERINFRASTRUCTURE FOR ANALYZING LARGE-SCALE BIOLOGICAL DATA
  4.1 Introduction
  4.2 Methods and Materials
    4.2.1 Overview architecture
    4.2.2 BoaG domain-specific language
    4.2.3 Cluster NR at different level of sequence similarity
    4.2.4 Generate BoaG database from the raw dataset
    4.2.5 Submit queries on the BoaG infrastructure
    4.2.6 Interpreting BoaG’s outputs
  4.3 Results
    4.3.1 NR Proteins are not evenly distributed across tree of life
    4.3.2 Proteins in NR vary greatly in length
    4.3.3 Clustering of similar protein sequences indicate a much lower number of unique proteins in NR
    4.3.4 Almost as many Taxa as proteins
    4.3.5 Highly conserved protein functions
    4.3.6 Provenance of annotations
    4.3.7 Redundancy and ambiguity of annotations
  4.4 Discussion
    4.4.1 Storage and computational efficiency in BoaG
    4.4.2 Programming efficiency in BoaG
  4.5 Conclusion

CHAPTER 5. DETECTING AND CORRECTING MISCLASSIFIED SEQUENCES IN THE LARGE-SCALE PUBLIC DATABASES
  5.1 Introduction
  5.2 Materials and methods
    5.2.1 An overview of the method
    5.2.2 Approach to detect taxonomic misclassification
    5.2.3 The most probable taxonomic assignment for detected misclassifications
    5.2.4 Simulated and literature dataset
    5.2.5 Sensitivity analysis
  5.3 Results
    5.3.1 Detected taxonomically misclassified proteins
    5.3.2 Performance on simulated and real-world dataset
    5.3.3 Detected misassignments in clusters
    5.3.4 Correcting Taxonomic Misclassification
    5.3.5 Running time
  5.4 Discussion and conclusion
    5.4.1 Applications and limitations
    5.4.2 Conclusion

CHAPTER 6. IMPROVING DATA QUALITY OF TAXONOMIC ASSIGNMENTS IN LARGE-SCALE PUBLIC DATABASES
  6.1 Introduction
  6.2 Materials and methods
    6.2.1 Dataset generation and definitions
    6.2.2 Improve data quality of clusters
    6.2.3 Approach to give suspicious or reliable label to assignments
    6.2.4 Proposing the most reliable assignments for detected sequences
    6.2.5 Simulated and literature dataset
  6.3 Results
    6.3.1 Improve data quality of clusters at different sequence similarity
    6.3.2 Identified suspicious or reliable taxonomic assignments
    6.3.3 Propose taxonomic assignment for the identified mislabeled sequences
    6.3.4 Performance on simulated and real-world dataset
    6.3.5 Manual study
    6.3.6 Running time
  6.4 Discussion and conclusion
    6.4.1 Applications and limitations
    6.4.2 Conclusion

CHAPTER 7. DATA CLEANING TECHNIQUE FOR PROTEIN FUNCTIONS IN THE LARGE-SCALE PUBLIC DATABASES
  7.1 Introduction
  7.2 Methods
    7.2.1 Dataset generation
    7.2.2 Detecting mispredicted functions
    7.2.3 Correcting mispredicted functions
...