Iowa State University Capstones, Theses and Dissertations
Graduate Theses and Dissertations

2021

Towards data cleaning in large public biological databases

Hamid Bagheri Iowa State University

Follow this and additional works at: https://lib.dr.iastate.edu/etd

Recommended Citation
Bagheri, Hamid, "Towards data cleaning in large public biological databases" (2021). Graduate Theses and Dissertations. 18448. https://lib.dr.iastate.edu/etd/18448

This Dissertation is brought to you for free and open access by the Iowa State University Capstones, Theses and Dissertations at Iowa State University Digital Repository. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

by

Hamid Bagheri

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee:
Hridesh Rajan, Major Professor
James Reecy
Samik Basu
David Fernandez-Baca
Xiaoqiu Huang

The student author, whose presentation of the scholarship herein was approved by the program of study committee, is solely responsible for the content of this dissertation. The Graduate College will ensure this dissertation is globally accessible and will not permit alterations after a degree is conferred.

Iowa State University

Ames, Iowa

2021

Copyright © Hamid Bagheri, 2021. All rights reserved.

DEDICATION

To my family and friends. Without their support, this thesis would not have been completed.

TABLE OF CONTENTS

Page

LIST OF TABLES ...... vii

LIST OF FIGURES ...... viii

ACKNOWLEDGMENTS ...... ix

ABSTRACT ...... xi

CHAPTER 1. INTRODUCTION ...... 1
1.1 Contributions ...... 2
1.2 Outline ...... 3

CHAPTER 2. RELATED WORK ...... 4
2.1 Domain-Specific Language and Large-scale infrastructure for genomics analysis ...... 4
2.2 Data parallelization framework and public databases in biology ...... 5
2.3 Taxonomic misclassification ...... 6
2.4 Functional misannotations ...... 7

CHAPTER 3. SHARED DATA SCIENCE INFRASTRUCTURE FOR GENOMICS DATA ...... 8
3.1 Background ...... 8
3.2 Potential for data parallelization framework in biology ...... 9
3.3 Choice of Biological repository for prototype implementation ...... 10
3.4 Design and implementation considerations ...... 11
3.4.1 Genomics-specific Language and data schema ...... 12
3.4.2 Output Aggregators in BoaG ...... 13
3.4.3 BoaG database and new data type integration ...... 13
3.4.4 Data availability ...... 14
3.4.5 Run BoaG on Docker container and Jupyter ...... 14
3.5 Application of BoaG to the RefSeq database ...... 14
3.6 Results on the RefSeq database ...... 14
3.6.1 Summary statistics of RefSeq ...... 15
3.6.2 The largest and smallest genome in the RefSeq database ...... 15
3.6.3 Study the changes of average number of exons per gene over time ...... 16
3.6.4 Popularity of bacterial genome assembly programs ...... 18
3.6.5 Study the quality of metazoan assembly programs over time ...... 19
3.7 Discussion and Future work ...... 20
3.7.1 Database storage efficiency and computational efficiency with Hadoop ...... 20
3.7.2 Comparison between MongoDB and BoaG ...... 22
3.7.3 Comparison between Python and BoaG ...... 23
3.8 Conclusion ...... 25

CHAPTER 4. A CYBERINFRASTRUCTURE FOR ANALYZING LARGE-SCALE BIOLOGICAL DATA ...... 27
4.1 Introduction ...... 27
4.2 Methods and Materials ...... 30
4.2.1 Overview architecture ...... 30
4.2.2 BoaG domain-specific language ...... 30
4.2.3 Cluster NR at different level of sequence similarity ...... 31
4.2.4 Generate BoaG database from the raw dataset ...... 33
4.2.5 Submit queries on the BoaG infrastructure ...... 34
4.2.6 Interpreting BoaG's outputs ...... 34
4.3 Results ...... 34
4.3.1 NR Proteins are not evenly distributed across tree of life ...... 35
4.3.2 Proteins in NR vary greatly in length ...... 37
4.3.3 Clustering of similar protein sequences indicate a much lower number of unique proteins in NR ...... 38
4.3.4 Almost as many Taxa as proteins ...... 38
4.3.5 Highly conserved protein functions ...... 40
4.3.6 Provenance of annotations ...... 40
4.3.7 Redundancy and ambiguity of annotations ...... 41
4.4 Discussion ...... 42
4.4.1 Storage and computational efficiency in BoaG ...... 42
4.4.2 Programming efficiency in BoaG ...... 43
4.5 Conclusion ...... 44

CHAPTER 5. DETECTING AND CORRECTING MISCLASSIFIED SEQUENCES IN THE LARGE-SCALE PUBLIC DATABASES ...... 45
5.1 Introduction ...... 45
5.2 Materials and methods ...... 48
5.2.1 An overview of the method ...... 49
5.2.2 Approach to detect taxonomic misclassification ...... 52
5.2.3 The most probable taxonomic assignment for detected misclassifications ...... 54
5.2.4 Simulated and literature dataset ...... 55
5.2.5 Sensitivity analysis ...... 55
5.3 Results ...... 56
5.3.1 Detected taxonomically misclassified proteins ...... 56
5.3.2 Performance on simulated and real-world dataset ...... 58
5.3.3 Detected misassignments in clusters ...... 60
5.3.4 Correcting Taxonomic Misclassification ...... 61
5.3.5 Running time ...... 62
5.4 Discussion and conclusion ...... 62
5.4.1 Applications and limitations ...... 63
5.4.2 Conclusion ...... 63

CHAPTER 6. IMPROVING DATA QUALITY OF TAXONOMIC ASSIGNMENTS IN LARGE-SCALE PUBLIC DATABASES ...... 64
6.1 Introduction ...... 64
6.2 Materials and methods ...... 67
6.2.1 Dataset generation and definitions ...... 67
6.2.2 Improve data quality of clusters ...... 70
6.2.3 Approach to give suspicious or reliable label to assignments ...... 71
6.2.4 Proposing the most reliable assignments for detected sequences ...... 73
6.2.5 Simulated and literature dataset ...... 74
6.3 Results ...... 74
6.3.1 Improve data quality of clusters at different sequence similarity ...... 74
6.3.2 Identified suspicious or reliable taxonomic assignments ...... 75
6.3.3 Propose taxonomic assignment for the identified mislabeled sequences ...... 76
6.3.4 Performance on simulated and real-world dataset ...... 77
6.3.5 Manual study ...... 77
6.3.6 Running time ...... 78
6.4 Discussion and conclusion ...... 79
6.4.1 Applications and limitations ...... 79
6.4.2 Conclusion ...... 79

CHAPTER 7. DATA CLEANING TECHNIQUE FOR PROTEIN FUNCTIONS IN THE LARGE-SCALE PUBLIC DATABASES ...... 81
7.1 Introduction ...... 81
7.2 Methods ...... 83
7.2.1 Dataset generation ...... 83
7.2.2 Detecting mispredicted functions ...... 85
7.2.3 Correcting mispredicted functions ...... 87
7.2.4 Manual study ...... 88
7.2.5 Literature and simulated dataset ...... 88
7.3 Results ...... 88
7.3.1 Detecting potential functional misannotations ...... 89
7.3.2 Correcting mispredicted functional annotations ...... 89
7.3.3 Common PRO name for NR clusters at different similarity ...... 90
7.3.4 Manual analysis ...... 90
7.3.5 Provenance of mispredicted annotations ...... 91
7.3.6 Case Study ...... 91
7.3.7 Performance on the simulated and literature dataset ...... 91
7.3.8 Running time ...... 92
7.3.9 Discussions ...... 92

CHAPTER 8. CONCLUSION AND FUTURE WORK ...... 93
8.1 Future Work ...... 93
8.1.1 Language and Infrastructure Extension ...... 93
8.1.2 Data cleaning for Different Metadata ...... 94
8.1.3 Crowd-Source Data cleaning ...... 94
8.1.4 Machine Learning Model in the Infrastructure ...... 94

8.1.5 Recommendation for Public Databases ...... 94

BIBLIOGRAPHY ...... 95

LIST OF TABLES

Page
Table 3.1 Domain types for genomics data in BoaG ...... 12
Table 3.2 The BoaG aggregators list ...... 12
Table 3.3 Exon Statistics for years >= 2016 ...... 17
Table 3.4 Exon Statistics for years < 2016 ...... 18
Table 3.5 List of top three most used assembly programs for Metazoa (Year >= 2016) ...... 20
Table 3.6 List of top three most used assembly programs for Metazoa (Year < 2016) ...... 20
Table 3.7 Kingdoms and average summary statistics (Years >= 2016) ...... 21
Table 3.8 Kingdoms and average summary statistics (Years <= 2015) ...... 22
Table 3.9 Comparison between MongoDB and BoaG ...... 25
Table 4.1 Domain specific types for the NR database and its clustering information ...... 33
Table 4.2 Proteins that have large numbers of taxonomic assignments ...... 39
Table 4.3 Examples of protein functions and their appearances in sequences that have more than 10 distinct taxa ...... 40
Table 5.1 Detected misclassified taxonomic proteins in the NR database ...... 56
Table 5.2 Accuracy of detecting misassignments and the comparison with work presented in SATIVA [40] ...... 59
Table 5.3 Proposed taxa for the detected misclassified sequences in NR ...... 60
Table 6.1 Identified misclassified sequences at different similarity level ...... 80
Table 6.2 Potentially misclassified proteins in the NR database that have a single assignment ...... 80
Table 6.3 Proposed taxa for detected misclassified sequences as shown in Figure 6.2 ...... 80
Table 7.1 Number of records and types of protein sequences from different public databases based on primary keys ...... 89
Table 7.2 Protein functions misassignments and the proposed one ...... 90

LIST OF FIGURES

Page
Figure 3.1 BoaG Architecture and Data Generation ...... 11
Figure 3.2 Code to find the smallest and largest genomes in RefSeq ...... 16
Figure 3.3 Number of exons, genes, and exons per gene after 2016 ...... 17
Figure 3.4 Bacterial assembly programs popularity over time ...... 18
Figure 3.5 Assembler programs over the years ...... 19
Figure 3.6 Assembly statistics for genomes for years after 2016 ...... 21
Figure 3.7 Scalability of BoaG programs (time is in log base 2 (sec)) ...... 22
Figure 3.8 The BoaG database size comparison with the raw data in the RefSeq as well as the JSON version of the dataset ...... 23
Figure 3.9 Line of code comparison between BoaG and MongoDB ...... 23
Figure 3.10 Comparison of Line of Code (LOC) and performance ...... 24
Figure 3.11 Example of BoaG programs to compute different tasks on the full RefSeq dataset ...... 26
Figure 4.1 BoaG infrastructure and web-interface ...... 32
Figure 4.2 Frequencies of taxonomic assignments for each protein cluster at different sequence similarity ...... 32
Figure 4.3 The protein frequency by log(2) of protein length ...... 35
Figure 4.4 Distributions of proteins in the tree of life ...... 36
Figure 4.5 Frequency of protein sequences with different taxonomic assignments ...... 39
Figure 4.6 Provenance and frequency of annotations from each database ...... 40
Figure 4.7 The BoaG dataset is compared with the raw data and the equivalent of MongoDB ...... 44
Figure 4.8 Scalability of Boa programs (time in log 2 seconds) ...... 44
Figure 5.1 Overview architecture of the proposed method to detecting misclassifications ...... 49
Figure 5.2 Phylogenetic tree generated for sequence ID NP_001026909 ...... 57
Figure 5.3 Compare running time of the proposed work with the SATIVA method ...... 60
Figure 6.1 Single vs multiple assignments ...... 65
Figure 6.2 Examples of NR95 clusters with different misclassification levels ...... 67
Figure 6.3 Frequency of taxonomic assignment for identifying taxonomically misclassified protein sequences in the NR95 ...... 68
Figure 6.4 Lineage from a cluster that contains several taxonomic assignments ...... 72
Figure 6.5 Compare sequence assignment with the most frequent assignments in the respective clusters ...... 76
Figure 6.6 Compare running time of the proposed work with the previous works ...... 78
Figure 7.1 BoaG script for the list of protein functions for proteins that have more than 10 distinct taxonomic assignments ...... 84
Figure 7.2 Ontology graph generated from the list of functional annotations ...... 84
Figure 7.3 Ontology graph generated from the list of functional annotations of protein sequence ID WP_000184067 ...... 85
Figure 7.4 An overview of the approach. Each node in the graph will be represented as a vector that preserves its semantics ...... 86

ACKNOWLEDGMENTS

I would like to take this opportunity to express my thanks to Dr. Hridesh Rajan, my major advisor, for his incredible support, guidance, and patience throughout my Ph.D. This dissertation would not have been possible without him, especially during the hard times of my work. I would like to thank my committee members Dr. James Reecy, Dr. Samik Basu, Dr. David Fernandez-Baca, and Dr. Xiaoqiu Huang for their efforts and contributions to this dissertation. I would also like to thank Dr. Andrew Severin for all the training in biology and computational insight toward this dissertation. I also want to thank the other Genome Information Facility members at Iowa State University for their feedback and comments. I would like to thank Dr. Robert Dyer for his support in building the Boa infrastructure for genomics. I would also like to thank my colleagues in the Laboratory of Software Design: Md Johirul Islam, Shibbir Ahmed, Rangeet Pan, Samantha Khairunnesa, Mohammad Warda, Sumon Biswas, Giang Nguyen, and Yijia Huang. They all provided insightful and timely feedback during my graduate study. I would also like to thank my senior colleagues Dr. Hoan Nguyen, Dr. Mehdi Bagherzadeh, and Dr. Ganesha Upadhyaya, who provided support and guidance during my early years of graduate study. I would also like to thank Nicole Lewis, the graduate student advisor in Computer Science, for her incredible assistance with official matters. Finally, I would like to thank my friends and family for their support during my graduate research. This dissertation was supported in part by grants from the National Science Foundation (NSF) under Grants CCF-15-18897 and CNS-15-13263, and by the VPR office at Iowa State University.

The content of this dissertation is based on published and resubmitted papers. Chapter 3 is based on the published BMC Bioinformatics paper [12]. Chapter 4 is currently being revised for submission to a bioinformatics journal. Chapter 5 is based on the published paper at Oxford Bioinformatics [13]. Chapter 6 is also being revised for submission to a bioinformatics journal. Finally, Chapter 7 is work in progress that addresses data quality of protein functions at scale. I also had the chance to conduct a bug study of data science programs in R and Python libraries; this work has been submitted to the Transactions on Software Engineering journal.

ABSTRACT

As the cost of sequencing decreases, the amount of data deposited into public repositories is increasing rapidly. As sequencing data accumulates in these online repositories, scientists can increasingly use multi-tiered data to better answer biological questions. One main challenge for public biological repositories is the quality of their metadata. Unfortunately, most public databases do not have methods for identifying errors in their metadata, leading to the potential for error propagation.

To perform cleaning at large scale, scalable infrastructure and algorithms need to be developed. In this dissertation, we built a domain-specific language and large-scale infrastructure, called BoaG, to analyze the wealth of genomics data. We used BoaG's interface to reason about the provenance, frequencies, and quality of annotations.

The second part of the dissertation focuses on cleaning the public repositories at scale. Most public databases, such as the non-redundant protein database (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of taxonomic misclassification in the entire database had not been quantified. We proposed and developed an automatic approach to detect and remove suspicious taxonomic assignments and mispredicted functional annotations.

We also addressed the widely used sequence clustering information of the public databases. The usefulness of clusters for different biological analyses has been shown for functional annotation, family classification, systems biology, structural genomics, and phylogenetic analysis [71]. We utilized CD-HIT [31] to cluster NR sequences at different similarity levels, i.e., 95%, 90%, 85%, down to 65%. To improve the data quality of the clusters, we removed anomalies and then provided a confidence score based on the lineage of all sequences within each cluster.

For the functional annotations, we utilized the Protein Ontology (PRO) [56] and the Gene Ontology [?], knowledge-based graphs, to detect potentially mispredicted functions. Ontologies have long been utilized to express domain knowledge; in this dissertation, we leveraged them to improve the quality of public genomics databases. We proposed a computational method that abstracts ontology graphs into a lower-dimensional network representation that makes reasoning about inconsistencies among a protein's functional annotations easier.

We found that the BoaG infrastructure required fewer lines of code, reduced storage size, and provided automatic parallelization for large-scale analyses on the NR dataset. BoaG's web interface is also implemented and made publicly available for researchers to test different hypotheses and share them with others. We identified 29,175,336 proteins in the NR database that have more than one distinct taxonomic assignment, among which 2,238,230 (7.6%) are potentially taxonomically misclassified. We also found that the total number of potential misclassifications in clusters at 95% similarity, above the genus level, is 3,689,089 out of 88M clusters, i.e., about 4% of the total clusters. This level of misclassification in NR has a significant impact due to the potential for error propagation in downstream analysis. The methods proposed in this dissertation will be a valuable tool in cleaning up large-scale public databases, and the techniques could be extended to address other kinds of annotation errors in public databases at scale.

CHAPTER 1. INTRODUCTION

The amount of biological sequencing data generated every year continues to grow exponentially, as evidenced by the growth of sequencing databases. As sequencing data continues to accumulate in online public repositories of biological data [65], scientists can increasingly use multi-tiered data to better answer biological questions. A critical challenge for these public repositories is the quality of their metadata. For example, NCBI's NR metadata contains comprehensive knowledge of proteins, including taxonomic and functional annotations. Unfortunately, most public repositories do not have methods for identifying errors in their metadata, leading to the potential for error propagation. There are three root causes of annotation errors in these databases: user metadata submission, contamination errors in the biological samples, and computational methods [66].

Therefore, it would be beneficial for researchers to have a quality control method that detects misclassified sequences and mispredicted functions and proposes the most reliable annotations. However, exploring public sequence databases and curating annotations at large scale is challenging. Previous research on the NR database focused on a small subset or a manual study of the database [48, 66] and analyzed annotation errors. To the best of our knowledge, the amount of misannotation in the entire database has not been well quantified. Thus, there is an urgent need to address the quality of metadata in large-scale biological databases to improve data quality and reduce error propagation. To address large-scale data cleaning, it is crucial to utilize a scalable infrastructure, and new algorithms need to be developed to significantly reduce manual labor costs. A robust distributed infrastructure will significantly improve our ability to detect and correct misannotations in the NR database.

We built an efficient and more accessible query interface for the NR database, provided large-scale support to detect and correct misannotations, and improved the quality of public databases. BoaG's web interface is also implemented and made publicly available for researchers to test different hypotheses and share them with others. We also built an automatic approach to detect and correct taxonomic misassignments and mispredicted functions in public databases. We showed that there are more than 2.3 million potentially misclassified protein sequences in the NR database.

1.1 Contributions

In this dissertation, to address large-scale curation in the public biological databases, we implemented Boa for genomics, BoaG, to analyze NR's functional and taxonomic annotations at scale. Our specific contributions are:

In Chapter 3, we implemented Boa for genomics, BoaG, to analyze RefSeq's 153,848 annotation and assembly file metadata. We showed that BoaG outperforms current solutions such as Python and MongoDB and requires fewer lines of code. Such an efficient infrastructure will enable us to analyze, quantify, and improve the data quality of large-scale public genomics databases.

In Chapter 4, we built the infrastructure and made it publicly available for researchers to test different hypotheses. We extended the BoaG infrastructure by integrating the sequencing data of the NR database and its clustering information to illustrate the potential of the infrastructure to analyze the information contained in large public sequence databases. To that end, the BoaG database and schema were generated, and the compiler was modified. We then took this clustering information and combined it with the sequence metadata corresponding to protein function and taxonomic assignment. Using this information, we are able to better quantify the content, taxonomic distribution of proteins, and protein functions in the NR database.

In Chapter 5, we proposed an automatic approach for detecting and correcting taxonomic misassignments at large scale for the NR database that is low-cost and easy to use.

In Chapter 6, we proposed a scalable algorithm that improves the data quality of NR clusters at different similarity levels, i.e., the NR95, NR90, ..., NR65 clusters. We also provided scalable identification of suspicious taxonomic assignments for proteins that have a single assignment. We proposed the most reliable assignments for the suspicious sequences by utilizing the consensus information of the clusters at 95% and 90% sequence similarity.

In Chapter 7, we presented a computational method that abstracts protein and gene ontology graphs into a lower-dimensional network representation that makes reasoning about functional inconsistencies easier. We utilized network embedding to detect inconsistencies in protein functions.

1.2 Outline

In the next chapter, we provide a detailed literature review on large-scale genomics infrastructure, data cleaning techniques for taxonomic misclassification, functional annotations in public databases, and their limitations at large scale. In Chapter 3, we present a prototype of BoaG to address large-scale analysis of genomics data. In Chapter 4, we make the BoaG infrastructure publicly available; we also extend it by integrating the sequencing data of the NR database and its clustering information. In Chapter 5, we present a heuristic-based approach to detect and correct taxonomic misassignments at large scale. In Chapter 6, we discuss a scalable algorithm that addresses NR clusters at different similarity levels, i.e., NR95, NR90, ..., NR65. In Chapter 7, we present a large-scale algorithm for detecting and correcting functional mispredictions in the NR database.

CHAPTER 2. RELATED WORK

2.1 Domain-Specific Language and Large-scale infrastructure for genomics analysis

Genomics-specific languages are common in high-throughput sequencing analysis, such as S3QL, which aims to support biological discovery by harnessing Linked Data [26]. In addition, there are libraries like BioJava [59], BioPerl [68], and Biopython [22] that provide tools to process biological data. MongoDB is an open-source NoSQL database that also supports many features of traditional databases, like sorting, grouping, aggregating, and indexing. MongoDB has been used to handle large-scale semi-structured or NoSQL data; datasets are stored in a flexible JSON format and can therefore support a data schema that evolves over time. MapReduce [24] is a framework that has been used for scalable analysis of scientific data, and Hadoop is an open-source implementation of MapReduce. In the MapReduce programming model, mappers and reducers are the data processing primitives and are specified via user-defined functions. A mapper function takes key-value pairs of input data and produces key-value pairs as output or as input for the reduce stage; a reducer function takes these key-value pairs, aggregates the data based on the keys, and produces the final output. Some organizations have used the power of MongoDB and the Hadoop framework together [3] to address challenges in Big Data. Genomics England [2] runs the 100,000 Genomes Project [75] using MongoDB to harness the huge amount of data in bioinformatics. There are also several tools in the field of high-throughput sequencing analysis that use the power of Hadoop and the MapReduce programming model. Computation-heavy applications like BLAST, GSEA, and GRAMMAR have been implemented in Hadoop [72]. SARVAVID [44] has implemented five well-known applications for running on Hadoop: BLAST, MUMmer, E-MEM, SPAdes, and SGA. BLAST [9] was also rewritten for Hadoop by Leo et al. [43]. In addition to these programs, there are other efforts based on Hadoop to address RNA-Seq and sequence alignment [42, 57, 64].
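The mapper/reducer contract described above can be sketched as a toy in plain Python. This simulates the model in-process rather than using the Hadoop API, and the example records and field names are invented for illustration:

```python
from collections import defaultdict

# Toy MapReduce run: count how often each assembler appears in genome
# records. The records and their field names are hypothetical examples.
records = [
    {"species": "E. coli", "assembler": "SPAdes"},
    {"species": "B. subtilis", "assembler": "SPAdes"},
    {"species": "H. sapiens", "assembler": "Celera"},
]

def mapper(record):
    # Emit (key, value) pairs, one per input record.
    yield (record["assembler"], 1)

def reducer(key, values):
    # Aggregate all values that share the same key.
    return (key, sum(values))

# "Shuffle" phase: group the mapped pairs by key.
groups = defaultdict(list)
for record in records:
    for key, value in mapper(record):
        groups[key].append(value)

result = dict(reducer(k, vs) for k, vs in groups.items())
print(result)  # {'SPAdes': 2, 'Celera': 1}
```

In a real Hadoop job, the shuffle and the parallel execution of mappers and reducers are handled by the framework; only the two user-defined functions change from task to task.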

A significant barrier to utilizing the Hadoop framework in bioinformatics is the difficulty of the interface and the amount of expertise needed to write MapReduce programs [8]. The proposed work tries to abstract away these complexities and open the door for more bioinformatics applications. Most applications could be called from MapReduce rather than being reimplemented. Unfortunately, there currently does not exist a tool that combines the ability to query databases with the advantages of a domain-specific language and the scalability of Hadoop into a Shared Data Science Infrastructure for large biology datasets. Boa, on the other hand, is such a tool, but it is currently only implemented for mining very large software repositories like GitHub and SourceForge. It has recently been applied to address the potentials and challenges of Big Data in transportation [37].

2.2 Data parallelization framework and public databases in biology

There are several very large data repositories in biology that could take advantage of a biology-specific implementation of BoaG: the National Center for Biotechnology Information (NCBI), The Cancer Genome Atlas (TCGA), and the Encyclopedia of DNA Elements (ENCODE). NCBI hosts 45 literature and molecular biology databases and is the most popular resource for obtaining raw data for analysis. NCBI and other web resources like Ensembl are data warehouses for storing and querying raw data, sequences, and genes. TCGA contains data that characterizes changes in 33 types of cancer. This repository contains 2.5 petabytes of data and metadata with matched tumor and normal tissues from more than 11,000 patients. The repository comprises eight different data types: whole exome sequence, mRNA sequence, microRNA sequence, DNA copy number profile, DNA methylation profile, whole genome sequence, and reverse-phase protein array expression profile data. ENCODE is a repository whose goal is to identify all the functional elements contained in human, mouse, fly, and worm. This repository contains more than 600 terabytes (personal communication with @EncodeDCC and @mike_schatz) of data with more than 40 different data types, the most abundant being ChIP-Seq, DNase-Seq, and RNA-Seq. These databases represent only the tip of the iceberg of potential large data repositories that could benefit from the BoaG framework.

While it is common to download and analyze small subsets of data (tens of terabytes, for example) from these repositories, analyses on larger subsets or the entire repository are currently computationally and logistically prohibitive for all but the most well-funded and staffed research groups. While BioMart [67], Galaxy, and other web-based infrastructures provide easy-to-use tools for users without programming knowledge to download subsets of the data, the needs of advanced users working with the entire database are not met, as evidenced by the plethora of bash, R, and Python scripts that are widely utilized and reinvented by bioinformaticians. Retrieving genomics data and performing data-intensive computation can be challenging using existing APIs. Biomartr [27] is an R package for retrieving raw genomics data that tries to minimize some of this complexity.

2.3 Taxonomic misclassification

To address the taxonomic misclassification problem, there are two approaches in the literature: a phylogenetic-based approach and a functional approach. For the first approach, Kozlov et al. [40] proposed a phylogeny-aware method to detect and correct misclassified sequences in public databases; they utilized the Evolutionary Placement Algorithm (EPA) to identify mislabeled taxonomic annotations. Edgar [30] studied annotation error in rRNA databases, showing that the annotation error rate in the SILVA and Greengenes databases is about 17%; this work also used the phylogenetic-based approach.

In the second approach, it is a common technique for quality control and data cleaning to utilize domain knowledge in the form of ontologies [21]. The Gene Ontology (GO) [10] has been suggested for inferring aspects of protein function based on sequence similarity [34]. The MisPred [53] and FixPred [54] programs address the identification and correction of misclassified sequences in public databases. The FixPred and MisPred methods are based on the principle that an annotation is likely to be erroneous if its features violate our knowledge about proteins [52]. MisPred [53] is a tool developed to detect incomplete, abnormal, or mispredicted protein annotations, with a web interface to check protein sequences online. MisPred uses protein-coding genes and protein knowledge to detect erroneous annotations at the protein function level. For example, its authors found, for a subset of protein databases, that violations of domain integrity account for the majority of mispredictions. Modha et al. proposed a pipeline to pinpoint taxonomic errors as well as identify novel viral species [50]. There is another web server for exploratory analysis and quality control of proteome-wide sequence searches [47] that requires a protein sequence in FASTA format. The European Bioinformatics Institute (EMBL-EBI) developed InterPro (http://www.ebi.ac.uk/interpro/) to classify protein sequences at the superfamily, family, and subfamily levels. UniProt has also developed two prediction systems, UniRule and the Statistical Automatic Annotation System (SAAS; https://www.uniprot.org/help/saas), to annotate the UniProtKB/TrEMBL protein database automatically. CDD is a Conserved Domain Database for the functional annotation of proteins [45].

2.4 Functional misannotations

SPARQL has been used to query ontologies [33]; however, this approach is too time-consuming for reasoning about large-scale data cleaning of functional annotations. Different strategies have been proposed in the literature. For example, Bengtsson-Palme et al. [14] proposed a cleaning strategy that flags annotations manually. However, this requires significant human effort, and it would be very challenging to address the huge number of new sequences.

CHAPTER 3. SHARED DATA SCIENCE INFRASTRUCTURE FOR GENOMICS DATA

In this chapter, we present the background of computational infrastructure, potential benefits to biology, and also introduce BoaG language and infrastructure.

3.1 Background

As sequencing data continue to accumulate in online repositories [65], scientists can increasingly use multi-tiered data to better answer biological questions. A major barrier to these analyses lies in attaining a scalable computational infrastructure that is available to domain experts with minimal programming knowledge. The lengthy time investment required for data wrangling tasks like organization, extraction, and analysis is increasing and is a well-known problem in bioinformatics [73]. As this trend continues, a more robust system for reading, writing, and storing files and metadata will be needed.

This can be achieved by borrowing methods and approaches from computer science. BoaG is a language and infrastructure that abstracts away details of parallelization and storage management by providing a domain-specific language with a simple syntax [49]. The main features of BoaG are inspired by existing languages for data-intensive computing. These features include robust input/output, querying of data using types/attributes, and efficient processing of data using functions and aggregators. BoaG can be deployed inside a Docker container or as a Shared Data Science Infrastructure (SDSI) [62]. Running on a Hadoop cluster [29], it manages the distributed parallelization and collection of data and analyses. BoaG can process and query terabytes of raw data. It has also been shown to substantially reduce programming effort, thus lowering the barrier of entry to analyzing very large datasets, and to drastically improve scalability and reproducibility [29]. Raw data files are described to BoaG with attribute types so that all the information contained in a raw data file can be parsed and stored in a binary database. Once this is complete, reading, writing, storing, and querying the data from these files is straightforward and efficient, as BoaG creates a dataset that is uniform regardless of the input file standard (GFF, GFF3, etc.). The size of the data in binary format is also smaller.

3.2 Potential for data parallelization framework in biology

There are several very large data repositories in biology that could take advantage of a biology-specific implementation of BoaG: the National Center for Biotechnology Information (NCBI), The Cancer Genome Atlas (TCGA), and the Encyclopedia of DNA Elements (ENCODE). NCBI hosts 45 literature/molecular biology databases and is the most popular resource for obtaining raw data for analysis. NCBI and other web resources like Ensembl are data warehouses for storing and querying raw data, sequences, and genes. TCGA contains data that characterize changes in 33 types of cancer. This repository holds 2.5 petabytes of data and metadata with matched tumor and normal tissues from more than 11,000 patients, and comprises seven data types: whole exome sequence, mRNA sequence, microRNA sequence, DNA copy number profile, DNA methylation profile, whole genome sequence, and reverse-phase protein array expression profile data. ENCODE is a repository whose goal is to identify all the functional elements in human, mouse, fly, and worm. It contains more than 600 terabytes (personal communication with @EncodeDCC and @mike_schatz) of data across more than 40 different data types, the most abundant being ChIP-Seq, DNase-Seq, and RNA-Seq. These databases represent only the tip of the iceberg of potentially large data repositories that could benefit from the BoaG framework.

While it is common to download and analyze small subsets of data (tens of terabytes, for example) from these repositories, analyses on larger subsets or the entire repository are currently computationally and logistically prohibitive for all but the most well-funded and well-staffed research groups. While BioMart [67], Galaxy, and other web-based infrastructures provide easy-to-use tools for users without programming knowledge to download subsets of the data, the needs of advanced users working with the entire database are not met, as evidenced by the plethora of Bash, R, and Python scripts that are widely utilized and reinvented by bioinformaticians. Retrieving genomics data and performing data-intensive computation can be challenging using existing APIs. Biomartr [27] is an R package for retrieving raw genomics data that tries to minimize some of this complexity.

Here we discuss an initial implementation of Boa for genomics (BoaG) on a small test dataset, NCBI RefSeq, a database containing data and metadata for 153,848 genome annotation (GFF) files. We show the potential of BoaG in a comparative context with Python and MongoDB by assessing various statistics of the RefSeq database and answering the following four questions.

• What is the smallest and largest genome in RefSeq?

• How has the average number of exons per gene in genomes of a clade changed for genomes deposited before and after 2016?

• How has the popularity of the top five assembly programs in bacteria changed over time?

• How has assembly quality changed for genomes deposited before and after 2016?

This chapter presents our choice of dataset, the dataset generation, and the design and implementation choices of BoaG. It also shows the overall architecture of the proposed work. We will also describe how BoaG can be integrated with general-purpose languages like Python in Jupyter notebooks.

3.3 Choice of Biological repository for prototype implementation

RefSeq is a relatively small dataset containing information on well-annotated sequences spanning the tree of life: plants, animals, fungi, archaea, and bacteria. The smaller database size permits rapid iterations of BoaG applied to biology and illustrates the benefits of a genomics-specific language. RefSeq also has a decent amount of metadata about genome assemblies and their annotations, which, as far as we know, has not been explored as a whole. Unfortunately, due to the rapid advancement of sequencing technologies and genome assembly/annotation programs, deriving biologically meaningful information from comparisons of assembly stats across the entire dataset is not possible; however, as a demonstration of the usefulness of the BoaG infrastructure, we show how straightforward it is to ask questions about how the database and its metadata have changed over time, which gives insight into how improvements in sequencing technology and assembly/annotation programs have affected the data contained in this repository. This type of information would be challenging to procure directly from the online repository.

3.4 Design and implementation considerations

As with any domain-specific language, careful consideration must be taken in designing BoaG and its Hadoop-based infrastructure implementation for RefSeq data. The overall workflow requires a program written in BoaG that is submitted to the BoaG infrastructure (Figure 3.1 (a)). The infrastructure compiles the submitted program with the BoaG compiler and executes it on a distributed Hadoop cluster against a BoaG-formatted database of the raw data. BoaG has aggregators, functions that run on the entire database or a large subset of it; these take advantage of the BoaG database, which is designed to distribute both data and computation across a Hadoop cluster.

(a) An overview of BoaG Architecture (b) An overview of Data Generation in BoaG

Figure 3.1 BoaG Architecture and Data Generation

A BoaG infrastructure provides the following benefits for exploring large datasets:

• A computational framework on top of Hadoop that can query large datasets in minutes.

• An efficient data schema that provides storage efficiency and parallelization.

• An expandable database integration.

• A domain-specific language that can be incorporated in a container, in the Galaxy framework, or alongside any language like R or Python in a Jupyter notebook.

3.4.1 Genomics-specific Language and data schema

Table 3.1 Domain types for genomics data in BoaG

Genome
  taxid: Taxonomy ID of each species
  refseq: RefSeq ID of the GFF file
  Sequence: List of sequence reads in each GFF file
  AssemblerRoot: List of assembly programs associated with this genome

Sequence
  accession: Accession number
  header: Header of the sequence
  FeatureRoot: List of features (exon, gene, mRNA, and CDS) associated with this sequence
  seq: Actual DNA sequences from FASTA files

FeatureRoot
  refseq: The key ID
  feature: The list of features associated with this ID

Feature
  accession: Accession code of the sequence
  seqid: Sequence ID
  source: A text qualifier that describes the algorithm or procedure that generated this feature
  ftype: Type of the feature
  start: Starting point of the feature
  end: End point of the feature
  score: Score of the feature (a floating-point number)
  strand: + or - for positive and negative strand, respectively
  phase: Phase of the feature (one of the integers 0, 1, or 2)
  Attribute: List of attributes for each feature

Attribute
  parent: The parent of the attribute
  id: Attribute ID
  tag: Attribute tag (gbkey, etc.)
  value: Value of the tag

AssemblerRoot
  Assembler: List of assembly programs
  total-length: Total length, or genome size (base pairs)
  total-gap-length: Total gap length after genome assembly
  scaffold-N50: Scaffold N50 metric
  scaffold-count: Scaffold count metric
  contig-N50: Contig N50 metric
  contig-count: Contig count metric

Assembler
  name: Assembly program used to assemble the genome
  desc: Program attributes: program name, program version, etc.

Table 3.2 The BoaG aggregators list

MeanAggregator: Calculates the average
MaxAggregator: Finds the maximum value
SumAggregator: Calculates the sum of the values emitted to the reducer
MinAggregator: Finds the minimum value
TopAggregator: Takes an integer argument and returns the top elements for the given argument
StDevAggregator: Calculates the standard deviation

To create the domain-specific language for biology in BoaG, we created domain types, attributes, and functions for the RefSeq dataset, which includes the following raw file types: FASTA, GFF, and associated metadata. As shown in Table 3.1, Genome, Sequence, Feature, and Assembler are types in the BoaG language, and taxid, refseq, etc., are attributes of the Genome type. We created the data schema based on the Google protocol buffer, an efficient representation of genomic data that provides both storage efficiency and efficient computation for BoaG.

3.4.2 Output Aggregators in BoaG

Table 3.2 shows the predefined aggregators in the BoaG language, for example, top, mean, maximum, and minimum. These aggregators are also available in traditional RDBMSs and MongoDB [20]; however, BoaG is flexible enough to allow new aggregators to be defined. BoaG provides a special kind of type, called an output type, that collects and aggregates data and produces a single result. When a BoaG script runs in parallel, it emits values to the output aggregator, which collects all the data and produces the final output. Aggregators can also have indices, which act as a grouping operation similar to those of traditional query languages.
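The semantics of an output aggregator with indices can be sketched in a few lines of plain Python. The class and method names below are illustrative stand-ins, not part of the BoaG runtime:

```python
from collections import defaultdict

# A minimal sketch of BoaG-style output aggregation with indices.
# Each emitted value is grouped by its index tuple (like sum[refseq][taxid]),
# and the aggregator combines values per group, similar to a GROUP BY.
class SumAggregator:
    def __init__(self):
        self.groups = defaultdict(int)

    def emit(self, indices, value):
        # indices plays the role of BoaG's [refseq][taxid] output indices
        self.groups[tuple(indices)] += value

    def result(self):
        return dict(self.groups)

counts = SumAggregator()
records = [("GCF_1", "562", 1), ("GCF_1", "562", 1), ("GCF_2", "4932", 1)]
for refseq, taxid, one in records:
    counts.emit((refseq, taxid), one)

print(counts.result())  # {('GCF_1', '562'): 2, ('GCF_2', '4932'): 1}
```

In a real BoaG run, each mapper emits values in this fashion and the reducers merge the per-group results, so the script author only writes the emit statements.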

3.4.3 BoaG database and new data type integration

The BoaG infrastructure is designed to fully utilize the data parallelization facilities of the Hadoop infrastructure. The raw data for file types and metadata was parsed into a BoaG database on top of a Hadoop sequence file (Figure 3.1 (b)). A compiler, file reader, and converter were written in Java to generate this database and are provided in the GitHub repository (https://github.com/boalang/bio/tree/master/compiler). To integrate a new dataset, the data schema in protocol buffer format needs to be modified, and a data reader in Java is needed that reads the raw data (for example, GFF, TXT, Fastq, etc.) and converts it to the binary format of the BoaG database. An additional example is provided in the GitHub repository.
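As a rough sketch of what such a reader does, the following Python fragment parses tab-delimited GFF3 feature lines into records. The dict layout is an assumption for illustration; the actual BoaG reader is written in Java and emits protocol buffer messages.

```python
# Hypothetical sketch of a GFF reader: each non-comment line has nine
# tab-separated columns (seqid, source, type, start, end, score, strand,
# phase, attributes). A real BoaG reader would serialize each record
# into a binary protocol buffer message instead of a dict.
def parse_gff_line(line):
    cols = line.rstrip("\n").split("\t")
    if line.startswith("#") or len(cols) != 9:
        return None  # header, comment, or malformed line
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    return {
        "seqid": seqid,
        "source": source,
        "ftype": ftype,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        # the attributes column holds semicolon-separated key=value pairs
        "attributes": dict(kv.split("=", 1) for kv in attrs.split(";") if "=" in kv),
    }

line = "NC_000913.3\tRefSeq\tgene\t190\t255\t.\t+\t.\tID=gene-thrL;gbkey=Gene"
rec = parse_gff_line(line)
print(rec["ftype"], rec["start"], rec["attributes"]["ID"])  # prints: gene 190 gene-thrL
```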

BoaG's efficiency was tested on a shared Hadoop cluster on Bridges (https://portal.xsede.org/psc-bridges) with 5 nodes and up to 256 map tasks.

3.4.4 Data availability

All scripts, the step-by-step process of scientific discovery, and additional examples of BoaG queries used in this dissertation can be found in our repository. The raw data files, the BoaG database, and the JSON MongoDB files can be obtained from an online repository (https://boalang.github.io/bio/). A Docker container with BoaG scripts, a BoaG sequence file of a subset of the raw files, and instructions on how to use BoaG can also be downloaded from this location. We have generated a subset of GFF files and assembly statistics files for all fungi data contained in RefSeq. This data subset is 5.4 GB and can be used to test BoaG queries for reproducible results.

3.4.5 Run BoaG on Docker container and Jupyter

For the fungal data subset, users can run a containerized version of a 3 node Hadoop cluster for BoaG as well as Jupyter versions on a single machine. These integrations with current technologies can help users test and run queries and reproduce our results. Instructions on how to run a Docker version and a Jupyter version of BoaG are available on this website: https://boalang.github.io/bio/.

3.5 Application of BoaG to the RefSeq database

A total of 153,848 annotation (GFF) files, assembly (FASTA) files, and metadata were downloaded from NCBI RefSeq [60] and written to a BoaG database. The metadata included genome assembly statistics (genome size, scaffold count, scaffold N50, contig count, contig N50) and the assembler software used to generate the assembly from which each genome annotation file was created.

3.6 Results on the RefSeq database

This chapter presents some examples of questions that can be asked of the RefSeq dataset. It is also worth mentioning that these are only a few examples; the infrastructure provides an interface with which users can test different hypotheses.

3.6.1 Summary statistics of RefSeq

While it is straightforward to use the RefSeq website (https://www.ncbi.nlm.nih.gov/refseq/) to look up this information for your favorite species, it is cumbersome to look it up for tens to hundreds of species. Similarly, while each of these genomes has an annotation file, querying and summarizing the information contained in the annotation files of several related genomes, such as the average number of genes, the average number of exons per gene, and the average gene size, requires downloading and organizing the annotation files of interest prior to calculating the statistics.

Data from the RefSeq database was downloaded, a schema was designed, and a Hadoop sequence file was generated for use with BoaG, a domain-specific language and shared data infrastructure. The RefSeq data used in this first implementation of BoaG contains GFF files and metadata from bacterial (143,907), archaeal (814), animal (480), fungal (284), and plant (110) genomes. Each genome has metadata related to the quality of its assembly (genome size, scaffold count, scaffold N50, contig count, contig N50), the assembler software, and the genic data contained within the GFF annotation file.

Our goal was to implement BoaG on a biological dataset to demonstrate a means to explore large datasets. In the following subsections, we will answer the four questions posed in the introduction and explore BoaG efficiency in storage, speed, and coding complexity.

3.6.2 The largest and smallest genome in the RefSeq database

Researchers might be interested in finding the largest and smallest genomes in RefSeq. As of February 16th, 2019, the largest genome in the RefSeq database was Orycteropus afer (aardvark, GCF_000298275.1), at a length of 4,444,080,527 bp. The smallest genome is RYMV, a small circular viroid-like RNA hammerhead ribozyme sequenced from rice and annotated as a Rice yellow mottle virus satellite (viruses). Its complete genome has a length of 220 bases and RefSeq id GCF_000839085.1.

With the full RefSeq dataset in a Hadoop sequence file, these statistics required only seven lines of BoaG code (Figure 3.2). In line one, the variable g is defined as a Genome, which is a top-level type in our language. MaxGenome and MinGenome are output aggregators that produce the maximum and minimum genome lengths, respectively. Lines five and seven emit the assembly total length to the reducer for all the genomes in the dataset; the reducer then identifies the largest and smallest genomes. It took BoaG approximately 30 seconds to finish this query on a single node without Hadoop. The equivalent query in Python took approximately one hour on a single core.

1 g: Genome = input;
2 MaxGenome: output maximum(1) of string weight int;
3 MinGenome: output minimum(1) of string weight int;
4 asm := getAssembler(g.refseq);
5 MaxGenome << g.refseq weight asm.total_length;
6 if(asm.total_length > 0)
7 MinGenome << g.refseq weight asm.total_length;

Figure 3.2 Code to find the smallest and largest genomes in RefSeq.
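For comparison, a single-core equivalent of this query in plain Python must scan every assembly record itself. The records below are a tiny illustrative sample, not the real metadata files:

```python
# Hypothetical single-core equivalent of the BoaG min/max query.
# Each record pairs a RefSeq accession with its assembly total length;
# in practice these values come from 153,848 assembly-stats files.
records = [
    ("GCF_000298275.1", 4444080527),  # Orycteropus afer (aardvark)
    ("GCF_000839085.1", 220),         # Rice yellow mottle virus satellite
    ("GCF_000005845.2", 4641652),     # a bacterial genome, for contrast
]

largest = max(records, key=lambda r: r[1])
# mirror the BoaG script's guard against zero-length entries
smallest = min((r for r in records if r[1] > 0), key=lambda r: r[1])
print(largest[0], smallest[0])  # prints: GCF_000298275.1 GCF_000839085.1
```

The hour-long Python runtime reported above comes from reading all the raw files sequentially; BoaG distributes the same scan across Hadoop mappers.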

3.6.3 Study the changes of average number of exons per gene over time

Researchers might also be interested in how the average number of exons per gene in a species clade has changed for genomes deposited before and after 2016. Due to the rapid advancement of sequencing technologies and genome assembly/annotation programs, any meaningful biological changes in gene and exon frequency will be confounded with these advancements. We explored seven clades, five kingdoms and two phyla, to see how exon number, gene number, gene length, and exons per gene have changed before and after 2016. These branches of the tree of life included Bacteria, Archaea, Fungi, Ascomycota (a fungal phylum), Viridiplantae (plants), Eudicotyledons (a clade of flowering plants), and Metazoa (a clade of animals). In the last two years, the number of sequenced bacterial genomes has nearly quadrupled, while all other clades have seen at least a 50% increase in the RefSeq database (Table 3.3, Table 3.4). The number of genes, number of exons, and exons per gene have increased for all clades (Table 3.3, Table 3.4). Since prokaryotes do not have exons, Bacteria and Archaea were excluded from this query for exon number and exons per gene (N/A). A higher number of exons per gene for the eukaryotes suggests that gene models are improving and becoming less fragmented. This improvement could be due to improvements in gene annotation software or assembly contiguity.

We find fewer genes in archaea than in bacteria, at 2.9k and 4.3k genes, respectively. The highest gene numbers in eukaryotes belong to plants (43k), with animals and fungi having fewer genes at 24.9k and 10k, respectively [39]; however, the mean gene length for these clades has not changed between time points, indicating that the increased exon content per gene is likely due to an improvement in annotation software.

1 g: Genome = input;
2 geneCounts: output sum[string][string] of int;
3 exonCounts: output sum[string][string] of int;
4 adata := getAssembler(g.refseq);
5 asYear := yearof(adata.assembly_date);
6 if(asYear >= 2016)
7 foreach(i:int; def(g.sequence[i])){
8 fdata := getFeature(g.refseq, g.sequence[i].accession);
9 foreach(j:int; def(fdata.feature[j])){
10 if(match("gene", fdata.feature[j].ftype))
11 geneCounts[g.refseq][g.taxid] << 1;
12 if(match("exon", fdata.feature[j].ftype))
13 exonCounts[g.refseq][g.taxid] << 1;
14 }
15 }

Figure 3.3 Number of exons, genes, and exons per gene after 2016. The output is shown in Table 3.3.

Table 3.3 Exon statistics for years >= 2016

Name                     Total species  Exon number    Gene number    Gene length  Exons per gene
Bacteria                 92287          N/A            4.3k ± 1.5k    890 ± 64     N/A
Fungi                    90             32.3k ± 1.8k   10k ± 3.5k     1.6k ± 171   2.9 ± 1.3
Archaea                  338            N/A            2.9k ± 0.9k    851 ± 31     N/A
Viridiplantae            46             385k ± 155k    43k ± 21k      4.1k ± 1.3k  9.2 ± 1.9
Metazoa                  185            462k ± 280k    24.9k ± 10.3k  23k ± 11.8k  17.7 ± 6.4
Ascomycota               70             28.4k ± 13.7k  10.4k ± 3.1k   1.6k ± 142   2.5 ± 0.8
Eudicotyledons (dicots)  37             397k ± 167k    45k ± 22k      3.8k ± 688   9 ± 1.3

This query required 15 lines of BoaG code (Figure 3.3) and took approximately 42 minutes on a five-node shared Hadoop cluster on Bridges with 64 mappers. The equivalent query, written in 45 lines of Python, took approximately 20 hours on a single core.
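The extra Python lines come largely from doing the aggregation by hand. A minimal sketch of that per-genome tally, assuming features have already been parsed into (refseq, taxid, ftype) tuples, looks like this:

```python
from collections import defaultdict

# Sketch of the gene/exon tally that the BoaG script expresses in a few
# emit statements. The feature tuples are illustrative stand-ins for
# parsed GFF records, not real RefSeq data.
features = [
    ("GCF_A", "4932", "gene"), ("GCF_A", "4932", "exon"),
    ("GCF_A", "4932", "exon"), ("GCF_A", "4932", "exon"),
    ("GCF_B", "562", "gene"),
]

gene_counts = defaultdict(int)
exon_counts = defaultdict(int)
for refseq, taxid, ftype in features:
    if ftype == "gene":
        gene_counts[(refseq, taxid)] += 1
    elif ftype == "exon":
        exon_counts[(refseq, taxid)] += 1

for key, genes in gene_counts.items():
    # exons per gene for each genome; prokaryotes simply have zero exons
    print(key, exon_counts.get(key, 0) / genes)
```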

Table 3.4 Exon statistics for years < 2016

Name                     Total species  Exon number    Gene number    Gene length  Exons per gene
Bacteria                 51537          N/A            3.8k ± 1.5k    885 ± 65     N/A
Fungi                    194            29k ± 20k      9.2k ± 3.5k    1.6k ± 254   2.8 ± 1.5
Archaea                  474            N/A            2.9k ± 0.8k    855 ± 40     N/A
Viridiplantae            61             273k ± 153k    32k ± 17k      4.1k ± 2.3k  8 ± 2.5
Metazoa                  262            314k ± 211k    22.3k ± 9.6k   22k ± 12k    13.4 ± 5.4
Ascomycota               143            25.2k ± 14.3k  9.5k ± 3.1k    1.6k ± 205   2.4 ± 1
Eudicotyledons (dicots)  41             328k ± 133k    38k ± 16k      4k ± 1.4k    8.6 ± 1.3

3.6.4 Popularity of bacterial genome assembly programs

Another research question might be how the popularity of bacterial genome assembly programs has changed. The choice of program to assemble a genome depends on many factors, including but not limited to user familiarity with the program in the domain, ease of use, assembly quality, and turnaround time. Looking at the number of genomes assembled by the top five most popular assemblers in bacteria indicates that more genomes are being assembled over time, that there was a brief period of popularity for AllPaths in 2014, and that there has been a rapid rise in popularity of the SPAdes assembler in the last couple of years. CLC Workbench offers a GUI to users without programming experience and has consistently maintained a slice of the user market (Figure 3.5).

This query required six lines of BoaG code (Figure 3.4) and took approximately 30 seconds on a five-node Hadoop cluster with 32 mappers. The equivalent single-core Python query took approximately one hour with 35 lines of code.

1 g: Genome = input;
2 counts: output sum[int][string][string] of int;
3 asm := getAssembler(g.refseq);
4 asYear := yearof(asm.assembly_date);
5 foreach(k:int; def(asm.assembler[k]))
6 counts[asYear][g.taxid][asm.assembler[k].name] << 1;

Figure 3.4 Bacterial assembly programs popularity over time. The output of this script is shown in Figure 3.5.

Figure 3.5 Assembler programs for Bacteria over the years

3.6.5 Study the quality of metazoan assembly programs over time

Another research question is how metazoan assembly quality has changed for genomes deposited before and after 2016. To minimize bias from organismal variation and assembly software, we limited our comparison to metazoans and the top three assembly programs. The most popular assembly program for metazoans has been AllPaths after 2016, while SOAPdenovo was the most popular before 2016. A high-quality assembly is characterized by a low scaffold count and a high N50, stats that dramatically improved at the 2016 transition. As can be seen in Table 3.5 and Table 3.6, the scaffold count decreased for all three assemblers after 2016, while the contig N50 metric increased. This is not a surprise, as assembly algorithms are expected to improve over time. Newbler had a dramatic decrease in scaffold count after 2016. The highest average N50 among metazoans belongs to AllPaths.

This query (Figure 3.6) required six lines of BoaG code and took approximately 30 seconds on a five-node Hadoop cluster with 32 mappers. An equivalent single-core Python query took approximately one hour and 32 lines of code.

Table 3.5 List of the top three most used assembly programs for Metazoa (year >= 2016)

Program     Species  Total length  Scaffold count  Scaffold N50  Contig count  Contig N50
SOAPdenovo  21       1B ± 0.8B     38k ± 49k       7.8M ± 11M    86k ± 66k     98k ± 208k
AllPaths    48       0.9B ± 0.7B   7.1k ± 7k       4.3M ± 1.4M   33k ± 38k     188k ± 335k
Newbler     7        0.8B ± 0.9B   3.3k ± 2.2k     877k ± 910k   56k ± 80k     75k ± 60k

Table 3.6 List of the top three most used assembly programs for Metazoa (year < 2016)

Program     Species  Total length  Scaffold count  Scaffold N50  Contig count  Contig N50
SOAPdenovo  98       1.2B ± 0.7B   40k ± 38k       4.5M ± 13M    116k ± 79k    42k ± 48k
AllPaths    54       1.5B ± 1.1B   11k ± 13k       7.4M ± 9.7M   119k ± 97k    38k ± 32k
Newbler     18       0.9B ± 0.9B   87k ± 117k      2.1M ± 2.3M   133k ± 157k   34k ± 27k

3.7 Discussion and Future work

In this section, we describe the storage and computational efficiency of BoaG infrastructure. We discuss and compare BoaG with Python and MongoDB. Finally, we summarize our work and provide suggestions for future work.

3.7.1 Database storage efficiency and computational efficiency with Hadoop

One benefit of the BoaG database is the significant reduction in the storage required for the raw data. The downloaded NCBI RefSeq data was 379 GB but was reduced to 64 GB (a 6.2-fold reduction) in the BoaG database.

This data size reduction is due to the binary format of the Hadoop sequence file, which also makes disk writes faster than for a text file (Figure 3.8). A fungi-only subset of the RefSeq data was dramatically reduced from 5.4 GB to 0.5 GB (a 10-fold reduction). This variability in size reduction is presumably due to variability in the number and size of files among phyla.

The second benefit of BoaG is its ability to take advantage of parallelization and distribution during computation. Increasing the number of Hadoop mappers for a BoaG job decreases the query turnaround time. Taking the four queries we posed in the introduction, we varied the number of Hadoop mappers to show the speedup that results from adding additional mappers to an analysis. Figure 3.7 demonstrates the exponential decrease in required computation time with a corresponding increase in the number of Hadoop mappers. If the number of mappers is not optimized for the amount of computational infrastructure, the second query takes approximately 350 minutes to complete on 2 mappers; however, as more mappers are added, the time required levels out to less than one minute on assembly-related queries. The lower bound of this relationship is presumably due to the overhead of splitting and gathering data across the mappers. As we add more mappers, the running time decreases; for example, with 256 mappers the runtime is 22 minutes on the entire RefSeq. It is not difficult to see the benefit of using a domain-specific language like BoaG and a Hadoop infrastructure to query much larger biological datasets than RefSeq.

1 g: Genome = input;
2 counts: output collection[string][string][int][int][int][int][int][int] of int;
3 adata := getAssembler(g.refseq);
4 asYear := yearof(adata.assembly_date);
5 if(asYear >= 2016)
6 counts[g.refseq][g.taxid][adata.total_length][adata.total_gap_length][adata.scaffold_count][adata.scaffold_N50][adata.contig_count][adata.contig_N50] << 1;

Figure 3.6 Assembly statistics for genomes for years after 2016. The output is shown in Table 3.7.
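This leveling-out is consistent with a simple scaling model in which each query has a parallel portion divided evenly across the mappers plus a fixed per-job overhead. The constants below are illustrative assumptions, not measured values:

```python
# Toy model of map-task scaling: t(m) = parallel_work / m + overhead.
# With a hypothetical 700 minutes of single-mapper work and a fixed
# 0.7-minute job overhead, doubling the mappers roughly halves the
# runtime until the overhead term dominates.
def runtime_minutes(mappers, parallel_work=700.0, overhead=0.7):
    return parallel_work / mappers + overhead

for m in [2, 8, 32, 128, 256]:
    print(m, round(runtime_minutes(m), 2))
```

Under these assumed constants, 2 mappers give roughly 350 minutes while 256 mappers give a few minutes, matching the qualitative shape of Figure 3.7.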

Table 3.7 Kingdoms and average summary statistics for their genome assemblies (years >= 2016)

Tax ID  Name                     Species  Total length   Scaffold count  Scaffold N50   Contig count  Contig N50
2       Bacteria                 92290    4.3M ± 1.6M    66 ± 78         0.9M ± 1.4M    132 ± 176     0.39M ± 0.86M
4751    Fungi                    90       29M ± 15M      139 ± 159       1.3M ± 0.9M    360 ± 688     0.78M ± 1M
2157    Archaea                  338      2.9M ± 0.98M   52 ± 40         0.38M ± 0.43M  74 ± 121      0.53M ± 71M
33090   Viridiplantae            46       0.97B ± 0.88B  9.1k ± 18.3k    31M ± 49M      38k ± 43k     1.8M ± 4.9M
33208   Metazoa                  185      1.2B ± 0.95B   20.6k ± 43.7k   22M ± 36M      53k ± 77k     2.5M ± 7.9M
71240   Eudicotyledons (dicots)  37       0.91B ± 0.76B  6.4k ± 10.6k    26M ± 50M      40k ± 44k     1.6M ± 4.3M

Taking advantage of the Hadoop-based infrastructure, all the queries in Table 3.7 and Table 3.8 that describe the genome assembly statistics before and after the 2016 transition required less than a minute.

Table 3.8 Kingdoms and average summary statistics for their genome assemblies (years <= 2015)

Tax ID  Name                     Species  Total length     Scaffold count  Scaffold N50   Contig count   Contig N50
2       Bacteria                 51962    3.8M ± 1.6M      45 ± 82         1.3M ± 1.5M    126 ± 177      0.27M ± 0.55M
4751    Fungi                    202      29M ± 17M        341 ± 699       2M ± 1.7M      858 ± 1433     0.55M ± 0.75M
2157    Archaea                  470      2.9M ± 1M        17 ± 16         1.35M ± 1.17M  110 ± 126      0.38M ± 0.7M
33090   Viridiplantae            67       0.62B ± 0.68B    22.9k ± 46.6k   14.7M ± 24.9M  52.5k ± 71.6k  0.47M ± 1.8M
33208   Metazoa                  295      1.3B ± 1B        37.4k ± 64.2k   7.2M ± 14M     118.6k ± 119k  0.13M ± 1.2M
71240   Eudicotyledons (dicots)  46       0.754B ± 0.750B  26.3k ± 53.5k   17M ± 27M      58.8k ± 74k    0.3M ± 1.6M

Figure 3.7 Scalability of BoaG programs (time is in log base 2 of seconds). Queries 1, 2, 3, and 4 are the four questions investigated here.

3.7.2 Comparison between MongoDB and BoaG

Because the data schema in MongoDB needs to be saved along with the data, the output files are larger and take longer to write (Figure 3.8). The JSON file is larger, on average more than double the size of the raw RefSeq data. While experts in MongoDB may write this query more efficiently, the BoaG language requires fewer lines of code (Figure 3.9), thereby providing an easier interface for bioinformaticians to explore big data.

The performance of MongoDB and Hadoop has been previously compared [25], showing that Hadoop has a lower read-write overhead (Table 3.9).

Figure 3.8 The BoaG database size comparison with the raw data in the RefSeq as well as the JSON version of the dataset.

(a) MongoDB

1 var mapFunc1 = function(){
2   emit(this.taxid, 1);};
3 var reduceFunc1 = function(key, values){
4   return Array.sum(values); };
5 db.refseq1.mapReduce(
6   mapFunc1,
7   reduceFunc1, {out:"sum_taxid"}
8 ).find()

(b) BoaG

1 g: Genome = input;
2 counts: output sum[string] of int;
3 counts[g.taxid] << 1;

Figure 3.9 Comparison of the code needed to query the number of assembler programs per taxon ID on the RefSeq data. On the left, the MongoDB code needs eight lines of code, whereas the BoaG script needs only three.

3.7.3 Comparison between Python and BoaG

A general-purpose language like Python could also be utilized to execute the same queries investigated here; however, the Python code would be larger and require learning how to use Python libraries. To illustrate, we wrote an example program in Python to calculate the top three most used assembly programs required only five lines of code in BoaG language. In Python, a similar analysis required 38 lines of code

(Figure 3.10). Because Python needs to aggregate the output data itself, it needs more lines of code and a longer runtime. This advantage, inherent to domain-specific languages, will speed up a researcher's ability to query large datasets.

Figure 3.10 Comparison of lines of code (LOC) and performance for the query "What are the top three most used assembly programs?" run on RefSeq data. On the left side, the BoaG code needs only five lines, whereas the equivalent Python program needs 38 lines of code. The BoaG program is:

1 g: Genome = input;
2 counts: output top(3) of string weight int;
3 assemblerData := getAssembler(g.refseq);
4 foreach (i: int; def(assemblerData.assembler[i]))
5   counts << assemblerData.assembler[i].name weight 1;

Table 3.9 Comparison between MongoDB and BoaG

Feature               MongoDB                                    BoaG
Lines of Code         larger                                     smaller, as BoaG abstracts details of data analysis
Data generation time  longer, due to the larger file             faster, because of the binary file
Data file             JSON, 2.7 times larger than the raw data   Hadoop sequence file, 5 times smaller than the raw data
Schema flexibility    Yes; supports semi-structured data         Yes; schema and compiler can be modified
MapReduce             Yes                                        Yes
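For reference, the aggregation performed by the Python side of Figure 3.10 can be sketched as follows. The "Assembly method:" line format, the inline demo data, and the function names are illustrative assumptions, not the exact script from the figure:

```python
from collections import Counter

def parse_assembler(line):
    """Extract the assembler name from a line of the hypothetical
    form 'Assembly method: SOAPdenovo v. 1.05'."""
    key, _, value = line.partition(":")
    if key.strip() == "Assembly method":
        return value.strip().split(" v.")[0]
    return None

def top_assemblers(lines, n=3):
    """Count assembler names and return the n most used ones."""
    counts = Counter()
    for line in lines:
        name = parse_assembler(line)
        if name:
            counts[name] += 1
    return counts.most_common(n)

# Inline demo in place of reading the per-genome assembly_stats files.
demo = [
    "Assembly method: SOAPdenovo v. 1.05",
    "Assembly method: AllPaths v. 41070",
    "Assembly method: SOAPdenovo v. 2.04",
    "Assembly method: Celera Assembler v. 7.0",
    "Assembly method: SOAPdenovo v. 1.05",
]
print(top_assemblers(demo))
```

In the real 38-line script, most of the extra code handles directory traversal, file parsing, and sorting, which BoaG's `top` aggregator performs implicitly.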

More comparisons in terms of runtime and lines of code are given in Figure 3.11. These tests were performed on an iMac system with a 4 GHz Intel Core i7 processor and 32 GB of 1867 MHz DDR3 memory.

An analysis in BoaG requires fewer lines of code than other available languages such as MongoDB and Python (Figure 3.9). The file size in the BoaG database is much smaller than the JSON file used in MongoDB, as BoaG utilizes a binary format.

BoaG also provides an external-function mechanism that allows users to bring their own implementations from Python, Perl, Bash, etc. Users of the infrastructure cannot, however, run arbitrary scripts on it: a script must be converted to a DSL function so that it cannot cause security issues for the infrastructure.

3.8 Conclusion

In this chapter, we presented BoaG, a domain-specific language and shared data science infrastructure that takes advantage of Hadoop for large-scale computations. BoaG's infrastructure opens the exploration of large datasets in ways that were previously not possible without deep expertise in data acquisition, data storage, data retrieval, data mining, and parallelization. The RefSeq database was used as an example dataset from biology to show how to apply the domain-specific language BoaG to biological discovery. BoaG can answer most queries on the RefSeq dataset in under two minutes, offering substantial time savings over other methods. Many examples, tutorials, and a Docker container are available in a GitHub repository. This chapter provides a proof of concept for the BoaG infrastructure and its ability to scale to much larger datasets. This is the first step towards providing a shared data science infrastructure to explore large biological datasets.

Figure 3.11 Comparison of lines of code and runtime for BoaG programs computing different tasks on the full RefSeq dataset. The Python programs were run on a single core; the Hadoop infrastructure on Bridges has 5 shared nodes with 32 mappers. While these queries can be written to run in parallel in Python, this needs more lines of code and more programming skill.

                                                                  Lines of Code (LOC)      Run Time (min)
Task                                                              Python  BoaG  Diff       Python  BoaG  Speedup
A. Summary statistics across all species
  1. The average number of genome features in GFF files             39      7   4.2x         784    67     11x
  2. Compute the mean and counts of feature size (base pairs)       33      7   4.7x         878   108      8x
  3. Find the top 10 genomes with the largest and smallest genes    35     12   2.9x        1120   131      8x
B. Genome related questions
  1. Compute the number of reads for each genome                    30      4   7.5x         620    20     31x
  2. List of all tax ids and their counts                           37      4   9.25x         54   1.2     45x
  3. List the genomes within a specific genome size range           33      7   4.7x          62     1     62x
  4. Find the smallest genome size within the clade                 34      8   4.25x         55   1.5     36x
C. Feature related questions
  1. List of GenBank IDs of all gene features in a specific range   32     15   2.1x         948   110    8.6x
  2. The top genomes with smallest and largest mRNA length          30     11   2.72x        884    85     10x
  3. What is the average mRNA length in each GFF?                   27      9   3x           796    71     11x
D. Attribute related questions
  1. Gene, exon count, gene length, and average exons per gene      40     12   3.33x       1260    60     21x
  2. Compute the shortest and largest CDS                           44     12   3.66x       1124    75     15x
  3. Compute the average number of mRNA per gene                    45     11   4.09x       1320    65     20x
E. Assembler related questions
  1. What is the top used assembler within a clade?                 32      5   6.4x          62   0.7     88x
  2. The popularity of assembly programs over the years?            35      6   5.8x          60   0.7     85x
  3. The most commonly used assembler in RefSeq                     32      7   4.5x          50   0.5    100x
  4. The best performing assembler in terms of contig N50?          31      6   5.1x          55   0.6     91x

In the future, we will integrate new data types including the Non-Redundant protein database, biological ontologies, SRA, etc. We will also update the BoaG database and provide a publicly available web-interface for researchers to run queries on our infrastructure.

CHAPTER 4. A CYBERINFRASTRUCTURE FOR ANALYZING LARGE-SCALE BIOLOGICAL DATA

4.1 Introduction

The amount of sequencing data generated every year continues to grow exponentially. GenBank [15] has more than doubled in the last three years, from 317 million sequences with 1.3 trillion bases to over

773.7 million sequences with 3.6 trillion bases. A researcher can choose to deposit a sequence into one of several different databases, and frequently the same sequence is deposited into multiple databases. The resulting redundant information inflates the apparent size of the set of all known sequences. To address the growing challenge of sequence redundancy in public databases, the Non-Redundant (NR) database was introduced by the National Center for Biotechnology Information (NCBI) [7]. NCBI defines redundancy as protein sequences that have 100% identity and the same length. This means that sequences that are shorter but have

100% identity are retained in the database and may or may not be labeled as partial sequences. There is still redundant information contained in NCBI's non-redundant database, the extent of which is not widely known. The NR database encompasses protein sequences from both non-curated (low-quality) and curated (high-quality) databases:

• GenBank/GenPept: These are unreviewed, lower-quality sequences submitted directly by individuals and laboratories.

• TrEMBL: This is an unreviewed subset of UniProt [23]; these sequences are annotated with computational tools.

• SwissProt: These are manually annotated protein sequences [17].

• RefSeq: These are manually reviewed sequences from GenBank, maintained by NCBI staff [61].

• PIR: This is a non-redundant annotated protein sequence database [77].

• PDB: This database is annotated experimentally and also contains structures of proteins and nucleic acids [16].
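To make the merging rule concrete, a minimal sketch of how identical sequences collapse into one NR record, with definition lines joined by the Ctrl-A separator NCBI uses, might look like this. The sequences and deflines are invented for illustration:

```python
def build_nr(entries):
    """Collapse 100%-identical, equal-length sequences into a single
    record, merging their definition lines with the Ctrl-A ('\\x01')
    separator used in NR FASTA headers."""
    nr = {}
    for defline, seq in entries:
        if seq in nr:
            nr[seq] = nr[seq] + "\x01" + defline
        else:
            nr[seq] = defline
    return nr

# Two identical sequences from different submitters plus one unique one.
entries = [
    (">WP_0001 hypothetical protein [Escherichia coli]", "MKVLA"),
    (">ABC123.1 hypothetical protein [E. coli K-12]", "MKVLA"),
    (">XYZ9.1 unique protein [Bacillus subtilis]", "MSTPQ"),
]
nr = build_nr(entries)
print(len(nr))  # two non-redundant records remain
```

This is exactly why a single NR entry can carry many deflines: each merged submission contributes its own annotation to the shared record.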

Researchers use BLAST [9] to query the NR database to identify homologous sequences, and use that information to try to make an informed decision about the taxonomic assignment and function of unknown protein sequences.

The main advantage of NR is that it is comprehensive and solves redundancy at the identical-sequence level; however, the amount of redundancy and ambiguity of annotations at large scale remains largely unknown and problematic for the user. Each sequence can have multiple annotations (i.e., taxonomic assignments and protein functions) resulting from the merging of definition lines from an identical sequence found in multiple databases. This redundancy impacts the ability of researchers to use, curate, and explore the NR database. For example, it is difficult to assess the confidence of an annotation because it is hard to determine where (the provenance) and how many times (the frequency) a given annotation was assigned to a known sequence across the contributing databases. In addition, there are unseen biases in the sequences contained in the NR database, with significantly more coverage for some species/clades in the tree of life and little to no sequence for others. To fully leverage the sequence data contained in the NR database, clustering the proteins by sequence similarity would be greatly beneficial.

A robust, distributed infrastructure is needed to analyze and quantify the content of the NR database and its clustering information. To this end, we utilized BoaG to address these challenges at scale. BoaG belongs to the Boa family of domain-specific languages and shared infrastructures, which has been applied to address challenges in mining software repositories [28], genomics data [12], and big transportation data [36]. Boa can process and query terabytes of raw data and uses a MapReduce-based backend to effectively distribute computational analyses and querying tasks. MapReduce is a framework that has been used for scalable analysis of scientific data; Hadoop is an open-source implementation of MapReduce.

BoaG has been shown to substantially reduce programming effort, thus lowering the barrier to entry for analyzing very large datasets, and to drastically improve scalability and reproducibility. BoaG's aggregators are functions that run on the entire database or a large subset of it, and they therefore take advantage of the BoaG database, which is designed so that both the data and the computation are distributed across the Hadoop cluster.

This work builds on previous work in which we introduced BoaG as a domain-specific language and shared, Hadoop-based data science infrastructure for genomic data [12]. We demonstrated the computational power of BoaG on a proof-of-concept dataset, RefSeq, in a VirtualBox and a Docker container. We also showed an application of BoaG in detecting and correcting misclassified sequences in the NR database [13].

Here, we built the infrastructure and made it publicly available for researchers to test different hypotheses. We extended the BoaG infrastructure by integrating the sequencing data of the NR database and its clustering information to illustrate the potential of the infrastructure to analyze the information contained in large public sequence databases. To that end, the BoaG database and schema were generated, and the compiler was modified. We took the protein sequences in the NR database and clustered them using

CD-HIT at several sequence similarity levels, then combined this clustering information with the sequence metadata corresponding to protein function and taxonomic assignment. Using this information, we are able to better quantify the content, taxonomic distribution of proteins, and protein functions in the

NR database. Specifically, we answer the following questions:

• What are the provenance and frequency of annotations in the NR database?

• What are the levels of ambiguity and redundancy in the taxonomic assignment and protein functions?

• How many conserved proteins are there in the NR database?

• What is the taxonomic distribution of proteins across the tree of life?

• What are summary statistics for clustering information at different similarity levels?

• What is the distribution of protein lengths in the NR database?

We found that BoaG can perform queries on this large dataset to quickly determine the average length of protein sequences, the most common taxonomic assignments and functional annotations, and the areas of the tree of life that are less explored by researchers. For all the analyses, the BoaG infrastructure required fewer lines of code, reduced the storage size, and provided automatic parallelization. BoaG's web-interface is also implemented and made publicly available for researchers to test different hypotheses and share them with others. The output of BoaG may require further post-processing or visualization; we used Python libraries for post-processing.

The rest of this chapter is organized as follows. In Section 4.2, we present methods and materials for dataset generation, the BoaG infrastructure, and interpreting the output. In Section 4.3, we present some interesting insights from the NR database and its clustering information by utilizing BoaG. Then, we discuss the performance and efficiency of the BoaG language and infrastructure and compare it with Python and MongoDB. In Section 4.4, we conclude with suggestions for the future.

4.2 Methods and Materials

In this section, we describe the overall architecture of the publicly available BoaG language and infrastructure. Then, we discuss the BoaG language types that support NR and its clustering information.

Next, we describe data generation steps. Finally, we explain how to write an arbitrary BoaG query and submit it to our infrastructure.

4.2.1 Overview architecture

BoaG is a domain-specific language that uses a Hadoop-based infrastructure for biological data [12].

A BoaG program is submitted to the infrastructure through the web-interface, as seen in Figure 4.1. It is compiled and executed on a distributed Hadoop cluster, querying the BoaG-formatted database built from the raw data. BoaG has aggregators that can be run on the entire database or a subset of it, taking advantage of a protobuf-based schema optimized for the Hadoop cluster for both the data and the computation. These aggregators are similar to, but not limited to, the aggregators traditionally found in SQL databases and in NoSQL databases like MongoDB.

4.2.2 BoaG domain-specific language

To utilize the potential of BoaG for our raw data, we created domain types, attributes, and functions specific to the non-redundant protein (NR) dataset and its clustering information. As shown in Table 4.1,

Sequence, Cluster, and Annotation are types in our domain-specific language, and tax_id, tax_name, and defline are attributes of the Annotation type. The Sequence and Annotation types and their attributes represent the NR database in the BoaG language, and the Cluster type with its attributes represents the CD-HIT clustering information. We created the data schema based on the Google protocol buffer, an efficient representation of genomic data that provides both storage and computation efficiency on Hadoop. The raw data, i.e., the flat file, was parsed into a Hadoop sequence file. When a BoaG program executes in parallel, it emits values to the output aggregator, which collects all the data and produces the final output. Aggregators, for example top, mean, maximum, and minimum, can also take indices, which act as a grouping operation similar to those in traditional query languages [28].
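The indexed-aggregator semantics described above can be mimicked in plain Python. This is an illustrative stand-in for BoaG's `sum` aggregator with indices, not its actual distributed implementation:

```python
from collections import defaultdict

class SumAggregator:
    """Minimal stand-in for a BoaG sum aggregator with indices:
    values emitted under the same index tuple are added together,
    the way a grouped SUM works in a traditional query language."""
    def __init__(self):
        self.table = defaultdict(int)

    def emit(self, *indices_and_value):
        # Last argument is the value; everything before it is an index.
        *indices, value = indices_and_value
        self.table[tuple(indices)] += value

clstr_out = SumAggregator()
# Emulates: clstrOut[similarity][cid][tax_name] << 1;
for sim, cid, tax in [(65, 10659416, "Paracoccus versutus")] * 3:
    clstr_out.emit(sim, cid, tax, 1)
print(clstr_out.table[(65, 10659416, "Paracoccus versutus")])
```

In BoaG, the same accumulation happens in the reduce phase across all mappers, so the programmer never writes the grouping logic explicitly.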

4.2.3 Clustering NR at different levels of sequence similarity

We utilized the CD-HIT program [31] to cluster the protein sequences in NR using XSEDE computational resources [58]. CD-HIT produces protein clusters and a representative sequence for each cluster at the specified similarity level. CD-HIT [31] (version v4.6.8-2017-1208) was run with the following parameters: -n 5 -g 1 -G 0 -aS 0.8 -d 0 -p 1 -T 28 -M 0. These parameters use a word length of 5 and require that the alignment cover at least 80% of the length of the shorter sequence. The representative sequences, defined as the longest sequence in each cluster, were then clustered using the same parameters at 90% similarity. Clusterings at lower similarity were generated in 5% decrements down to 65%, so that clusterings at 95%, 90%, 85%, 80%, 75%, 70%, and 65% similarity were obtained. The database size at 95% similarity (the entire NR) was about 100 GB, and the CD-HIT computation at this level required six days and 20 hours on a compute node with two CPUs of 14 cores each (Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30 GHz). For the analyses of representative sequences at 90%, 85%, 80%, 75%, 70%, and 65% similarity, the database sizes were 40 GB, 33 GB, 28 GB, 24 GB, 21 GB, 18 GB, and 16 GB, respectively, and the running times were three days; one day and 21 hours; one day and 12 hours; one day and two hours; 20 hours; and 16 hours, respectively.
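The round-by-round procedure can be summarized as a small driver that emits one CD-HIT command per similarity level, feeding each round's representative sequences into the next round. The file names (nr.fasta, nr95, ...) are illustrative assumptions; the parameters are the ones quoted in the text:

```python
def cdhit_commands(start=0.95, stop=0.65, step=0.05):
    """Build the iterative clustering commands: cluster NR at 95%,
    then re-cluster the representatives at each lower threshold."""
    cmds, sim, fasta = [], start, "nr.fasta"
    while sim >= stop - 1e-9:
        out = f"nr{int(round(sim * 100))}"
        cmds.append(
            f"cd-hit -i {fasta} -o {out} -c {sim:.2f} "
            "-n 5 -g 1 -G 0 -aS 0.8 -d 0 -p 1 -T 28 -M 0"
        )
        fasta = out  # this round's representatives feed the next round
        sim = round(sim - step, 2)
    return cmds

cmds = cdhit_commands()
print(len(cmds))  # 7 clustering rounds: 95% down to 65%
```

Because each round operates only on the previous round's representatives, the input shrinks with every step, which is why the lower-similarity rounds finish in hours rather than days.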

Figure 4.1 Users submit scripts through the web-interface to the BoaG infrastructure, and the results can be visualized with any general-purpose language such as R or Python.

1 s: Sequence = input;
2 clstrOut: output sum[int][string][string] of int;
3 foreach (i: int; def(s.annotation[i]))
4   foreach (j: int; def(s.cluster[j]))
5     clstrOut[s.cluster[j].similarity][s.cluster[j].cid][s.annotation[i].tax_name] << 1;

#Following are few lines of output clstrOut[65][10659416][Paracoccus versutus] = 3 clstrOut[65][12239365][Trypanosoma vivax] = 8 clstrOut[65][13139204][Thermotoga maritima] = 6

Figure 4.2 Frequencies of taxonomic assignments for each protein cluster at different sequence similarity levels. The variable clstrOut is the output aggregator that produces the sum of the output indexed over similarity level, cluster id, and taxonomic name. The BoaG script and results are publicly available here: http://boa.cs.iastate.edu/boag/?q=boa/job/public/128

Table 4.1 Domain specific types for the NR database and its clustering information

Type        Attribute       Description
Sequence    seqid           sequence id
            annotation      Annotation type
            cluster         Cluster type
Annotation  keyID           protein sequence id
            defline         definition line in the protein annotations
            tax_id          taxonomic assignment from XML files
            tax_name        taxonomic name
Cluster     similarity      similarity level, from 65 to 95
            cid             cluster id
            representative  representative sequence of the cluster
            length          length of the cluster
            seq_start       sequence starting point
            seq_stop        sequence stop
            rep_start       representative start
            rep_stop        representative stop
            match           percentage match in sequence similarity

4.2.4 Generate BoaG database from the raw dataset

The NCBI NR protein FASTA files were downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ on Oct 22, 2018. Taxonomic information was obtained from XML files downloaded from https://ftp.ncbi.nlm.nih.gov/blast/temp/DB_XML/. A flat file was generated by appending cluster information and taxonomic assignments to the annotation of each sequence in the NR database. A line of the raw flattened data file used for data generation is structured as follows: the def-lines, as they appear in the original NR database, are separated by the ˆA character, followed by the ð character as a separator marking the start of the clustering information. Since one sequence might appear at different similarity levels, we store a list of clustering records separated by the ü character; for example, a sequence may be a representative at nr95, nr90, etc., or a representative of a cluster at nr95 and a member of another cluster at nr90. As shown in Figure 4.1, the raw dataset was converted to a BoaG dataset based on the data schema shown in Table 4.1. The converter program is written in Java and took about two hours to run. Data preprocessing, downloading, and clustering took about three days.
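As a concrete illustration, a record in this layout can be parsed with a few lines of Python. The Ctrl-A (ˆA), ð, and ü separators are the ones described above; the internal field layout of each cluster record (e.g., "nr95|12345|rep") is a hypothetical placeholder, not the exact format of the thesis data:

```python
def parse_flat_record(line):
    """Split one line of the flattened NR+cluster file:
    def-lines separated by Ctrl-A ('\\x01'), then a 'ð' separator,
    then cluster records separated by 'ü'."""
    ann_part, sep, clstr_part = line.partition("ð")
    deflines = ann_part.split("\x01")
    clusters = clstr_part.split("ü") if sep else []
    return deflines, clusters

record = ("hypothetical protein [E. coli]\x01unnamed protein product"
          "ðnr95|12345|repünr90|6789|member")
deflines, clusters = parse_flat_record(record)
print(len(deflines), len(clusters))
```

In the actual pipeline this parsing is done once by the Java converter, which writes the parsed records into the protobuf-based Hadoop sequence file.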

4.2.5 Submit queries on the BoaG infrastructure

We have implemented BoaG and provide a web-based interface to BoaG's infrastructure [11]. Researchers can go to BoaG's web-interface, submit their query, and download or share the results with others. For example, the program to determine the number of taxonomic assignments in each cluster at different similarity levels requires only five lines of BoaG code (Figure 4.2). In the first line, the variable s is defined as a sequence in NR, which is a top-level type in our language. In the second line, the variable clstrOut is an output aggregator that produces the sum of the output indexed over cluster similarity level, cluster id, and taxonomic name. For example, the first line of the output shows that in the cluster with id 10659416 at 65% similarity, the taxonomic name Paracoccus versutus appears three times. For each sequence, lines three and four iterate over all the annotations and clusters. Line five emits the value to the reducer for all the protein sequences in NR, which provides the final results. The output can be downloaded and used for post-processing tasks in downstream analyses. For example, we used the ETE3 toolkit [35] in a Python Jupyter notebook to generate the tree of life in Section 4.3.1. The compiler, data generation, and other documentation are provided in our GitHub repository.

4.2.6 Interpreting BoaG’s outputs

For all the analyses, we utilized the proposed infrastructure to run the most computationally expensive parts. Some outputs of BoaG may require further post-processing or data visualization in R, Python, Bash, etc. For each application, we provide a Jupyter notebook, which can be found in our repository here: https://github.com/boalang/NR_Dataset/tree/master/jupyter_notebooks.

For example, analyzing NR sequences and clusters with BoaG's aggregators takes only a few minutes, while the same analysis on a single machine may take days to complete. Further comparison is provided in the discussion section.

4.3 Results

In this section, we present several interesting findings obtained by utilizing the BoaG language and its infrastructure to analyze 174M protein sequences and their 88M clusters. First, we discuss protein length in the NR database.

Figure 4.3 The protein frequency by log(2) of protein length.

Later, we discuss the distribution of proteins across the tree of life. Then, we present clustering statistics from 95% down to 65% similarity. Other analyses include the frequency of taxonomic assignments in the proteins, statistics about highly conserved proteins, the provenance of annotations, and the redundancy and ambiguity of annotations in the NR database. To interpret the outputs, Jupyter notebooks are provided for these analyses.

4.3.1 NR proteins are not evenly distributed across the tree of life

We performed an analysis to understand how researchers have explored the known phyla, described by Ruggiero et al. [63], across the tree of life. The distribution of the protein sequences at the 95% sequence similarity level among all known phyla in the tree of life is shown in Figure 4.4.

Figure 4.4 Distribution of proteins in the tree of life. The number in each node represents all proteins rooted at that node. Percentages less than 1 are not shown.

The majority of the protein sequences are in Bacteria (74%), followed by Eukaryota (23%) and finally Archaea (2.21%). The bacterial phyla with the most sequenced proteins include Actinobacteria (14%), (31%), and Firmicutes (12%). In Eukaryota, the phyla with the most abundant sequenced proteins are Ascomycota (4.65%), Chordata (4.366%), Arthropoda (2.44%), and Basidiomycota (2.09%). In Archaea, only the Euryarchaeota phylum (1.68%) had sequenced proteins above 1% of the total. At the opposite extreme, several phyla had little to no protein sequences. Specifically, the eukaryotic phyla Nematomorpha (44), Loricifera (0), Kinorhyncha (41), Gastrotricha (78), Cycliophora (3), Gnathostomulida (36), and Rhombozoa (83) each have fewer than 100 sequenced proteins after 95% CD-HIT clustering, whereas the phyla with the lowest numbers of sequenced proteins in Bacteria and Archaea still had more than 5000 protein sequences, specifically Dictyoglomi (5503) and Chrysiogenetes in Bacteria and Crenarchaeota (237862) in Archaea. The BoaG query is shown in supplemental Figure 1; the results were generated in 52 minutes, and we used the ETE3 toolkit [35] to generate the tree.

4.3.2 Proteins in NR vary greatly in length

The lengths of protein sequences in the NR database appear to be normally distributed with a mean of 365 amino acids, a standard deviation of 353 amino acids, and a long tail to the right (Figure 4.3). The smallest proteins are 11-amino-acid peptides. These peptides enter the NR database from the PDB database, where small peptides are commonly part of a larger protein-peptide structure. They can be synthetic constructs of viruses or fragments of larger proteins involved in protein-peptide or protein binding. The 100 longest proteins in NR have one of three protein functions: a hypothetical protein (DBY08_01055) found in a Clostridiales bacterium with unknown function (PWL95011); Titin, found in multiple species and involved in the passive elasticity of muscle; or a LEPR-XLL domain-containing protein found in Chlorobium chlorochromatii, with sizes of 74,488, 38,105, and 36,805 amino acids, respectively. The hypothetical protein (DBY08_01055) was submitted in March of 2018 and is now the longest known protein sequence, superseding the previous holder of the record, Titin, by more than two-fold. Figure 4.3 shows the protein frequency by log(2) of protein length, which is normally distributed around a median length of 256 amino acids (2^8). Researchers may also explore and analyze the length of proteins in different subsets of NR. This query required five lines of code in BoaG (see supplementals).
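The log2 binning behind Figure 4.3 can be sketched as follows. This is a toy illustration of the binning step, not the BoaG query used for the full database; the lengths are invented:

```python
import math
from collections import Counter

def log2_length_histogram(lengths):
    """Bin protein lengths by floor(log2(length)), mirroring the
    x-axis of the protein length frequency plot."""
    return Counter(int(math.log2(n)) for n in lengths)

# Toy lengths; the real distribution peaks in the 2^8 = 256 bin.
hist = log2_length_histogram([11, 256, 300, 365, 511, 1024])
print(sorted(hist.items()))
```

With the full database, the same binning is emitted by a BoaG aggregator and only the final histogram needs to be plotted locally.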

4.3.3 Clustering of similar protein sequences indicates a much lower number of unique proteins in NR

We clustered NR at 95% sequence similarity and then at lower similarities in 5% intervals, using the longest sequence in each cluster, until we reached 65% similarity. As we would expect, the number of clusters, proteins, amino acid content, and taxa decreases as we form clusters at lower similarity using only the representative sequences from the previous clusters. Approximately half of the protein sequences fall into clusters at 95% sequence similarity when requiring the alignment to cover over 80% of the length of the shorter sequence. However, 64 million of the 174 million proteins at 95% sequence similarity remain unclustered. At 65% similarity, the NR database can be clustered into 34.4 million clusters, containing 23% of the original unclustered proteins and 21.5% of the original amino acid content. However, of the 40 million proteins at this similarity, 30.6 million (76.5%) are unclustered, covering 11.9% of the original unclustered taxa. The amount of similar data in the NR database has important consequences, assumptions, and limitations. The presence of 159 million taxa for 174 million sequences suggests that the taxa names are almost as unique as the sequence ids themselves. In reality, additional information is often added to the taxonomy to add specificity about the line, cultivar, or sample.

4.3.4 Almost as many taxa as proteins

Since the taxonomic assignments for the protein sequences in NR were merged from several databases, there can be anywhere from one to 10,034 taxonomic assignments for a given protein sequence. We analyzed the frequencies of taxonomic assignments in the NR database and identified sequences with large numbers of taxa. The protein sequence with the highest number of assigned taxonomic classifications was an Influenza A virus protein with more than 10k assignments. Table 4.2 shows a few examples of proteins that have a large number of taxonomic assignments. The protein sequences with the highest numbers of taxonomic assignments most likely arise because viruses and bacteria are given a strain identifier appended

Figure 4.5 Frequency of protein sequences with different taxonomic assignments. The x-axis shows the number of taxonomic assignments and the y-axis the frequencies of protein sequences.

Table 4.2 Proteins that have large numbers of taxonomic assignments

Sequence ID    Protein Name                                      # of taxa
AAX11496       Influenza A virus (A/New York/32/2003(H3N2))      10,034
Q76V02         RecName: Full=Matrix protein 1; Short=M1           9,227
AAD31614       histone H3, partial [Euperipatoides leuckartii]    8,735
AAZ38596       Influenza A virus (A/New York/391/2005(H3N2))      7,854
YP_009118623   Influenza A virus (A/California/07/2009(H1N1))     7,536

to their taxonomy name, resulting in many taxonomic assignments for the same sequence. Figure 4.5 shows the frequency of proteins that have a certain number of taxonomic assignments. For example, 17,496,167 protein sequences have two annotations, and 5,921,066 proteins have three annotations. This implies that the annotations contain a large amount of redundancy that impacts exploring and analyzing the NR database.

More details are discussed in Section 4.3.7.

Similarly, we generated the frequency of clusters at 95% similarity that have a certain number of taxonomic assignments (see supplemental Figure 4). For example, 12,960,476 clusters have two taxonomic assignments, and 4,683,663 clusters have three taxonomic assignments. To generate this output, the BoaG script shown in Figure 4.2 needs five lines of code and takes two hours on the BoaG infrastructure.
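Post-processing a BoaG output of per-sequence taxa counts into the frequency-of-frequencies behind Figure 4.5 is a one-liner in Python; the counts below are illustrative, not the real data:

```python
from collections import Counter

def annotation_frequency_spectrum(taxa_counts):
    """Given, for each sequence, its number of distinct taxonomic
    assignments, count how many sequences share each count."""
    return Counter(taxa_counts)

# Toy per-sequence counts, including an Influenza-like outlier.
spectrum = annotation_frequency_spectrum([1, 1, 2, 2, 2, 3, 10034])
print(spectrum[2])  # three sequences carry exactly two assignments
```

The same spectrum computed over all 174M sequences produces the values quoted above (17,496,167 sequences with two annotations, and so on).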

Table 4.3 Examples of protein functions and their appearances in sequences that have more than 10 distinct taxa.

Category           Protein function                 # of functions
Unknown            hypothetical/unknown/unnamed     27,649,805
Highly conserved   conserved hypothetical protein       96,348
Highly conserved   membrane protein                    204,891
Highly generic     transcriptional regulator           192,757
Highly conserved   rRNA                                 21,836

Figure 4.6 Provenance and frequency of annotations from each database for sequence with primary id of NP_000311.2. Most annotations originate from GenBank.

4.3.5 Highly conserved protein functions

We used BoaG aggregators to query the NR database and identify highly conserved protein sequences.

We defined highly conserved as protein sequences with at least 10 distinct taxonomic assignments. Some examples of the top protein functions and their frequencies are shown in Table 4.3. For example, as we would expect, we see the highly conserved rRNA protein function in the list. In addition, we see an abundance of uninformative/generic functions like unknown function, membrane protein, and transcriptional regulator.

The BoaG query is shown in supplemental Figure 3.

4.3.6 Provenance of annotations

We refer to provenance as the database of origin for the annotations, i.e., the taxonomic assignments and protein functions. Protein annotations come from different databases that are curated manually or computed automatically. Therefore, in terms of metadata quality, it would be beneficial for researchers to know the origin of each taxonomic assignment as they explore their protein of interest. For each protein, users can create a phylogenetic tree from the list of taxonomic assignments.

Figure 4.6 provides an example of the provenance and frequency of each taxonomic assignment for the protein sequence with id NP_000311.2. Leaves are annotated with the frequency of each taxonomic assignment as a bar chart from all reviewed and unreviewed databases, i.e., RefSeq, GenBank, PDB, UniProt/SwissProt, and UniProt/TrEMBL, respectively. Details on generating the tree are in the GitHub repository (https://github.com/boalang/NR_Dataset/tree/master/jupyter_notebooks).

As shown in the previous work [13], the provenance information could be utilized to clean the NR database by assigning more weight to the manually reviewed annotations.

4.3.7 Redundancy and ambiguity of annotations

There is significant redundancy in the protein annotations of the NR database, for both taxonomic assignment and protein function, due to the integration of different databases. Using BoaG, we generated a non-redundant version of the annotations, collapsing all identical annotations and recording the number of times each annotation was present in the original ID description. As can be seen in Table 4.2, some proteins have thousands of taxonomic assignments. We previously explored the NR database for taxonomically misclassified sequences [13]. The non-redundant version of the annotations improves the usage and querying of the NR database. The running time for this query was 19 minutes, as the output size was 54 GB, which takes longer to write to disk.
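The collapsing step can be sketched in Python. This toy version counts identical annotation strings on a handful of invented deflines, whereas the thesis runs the equivalent as a BoaG job over the full database:

```python
from collections import Counter

def dedupe_annotations(deflines):
    """Collapse identical annotation strings and record how often each
    occurred, producing a non-redundant annotation view with counts."""
    return Counter(d.strip() for d in deflines)

anns = dedupe_annotations([
    "hypothetical protein",
    "hypothetical protein",
    "Matrix protein 1",
    "hypothetical protein ",   # trailing whitespace collapses too
])
print(anns["hypothetical protein"])
```

Keeping the count alongside each collapsed annotation preserves the frequency signal, so a user can still see how often a given annotation was assigned across the merged databases.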

In addition, researchers independently use different words to refer to the same biological concept; for example, unnamed protein, hypothetical protein, and unknown protein have all been used to describe an unknown protein function in different public databases. Another example is rRNA, which appears in 21,836 functions in different combinations with other words. Protein function analysis therefore requires a substantial natural language processing effort. This ambiguity in annotations negatively impacts the usage of the NR database and implies that we need a better annotation methodology to improve the quality of metadata and answer different biological questions. One way of improving the annotations in public databases would be to utilize ontologies, for example GO [10] and PRO [55]. NCBI provides tools to limit the effects of redundancy. For example, the Conserved Domain Database (CDD) maintained by NCBI is a resource for proteins that clusters redundant homologous families to reduce redundancy [45]. However, this operates at the level of sequences and does not address annotations and metadata.

4.4 Discussion

In this work, we implemented the BoaG infrastructure and made it publicly available. We used BoaG to explore the NR database along with its clustering information. We discussed the average length of proteins, the distribution of proteins across the tree of life, the top taxonomic assignments, and the top protein functions. We also showed the annotation redundancy and ambiguity that affect the quality of metadata. Here, we utilized BoaG's aggregators to generate a summarized, non-redundant version of the annotations. Summarizing these annotations will help researchers utilize the wealth of public protein annotation databases in areas of biological research that include, but are not limited to, phylogenetics, taxonomy, and medical research to identify the causes of genetic diseases.

4.4.1 Storage and computational efficiency in BoaG

Exploring the entire non-redundant (NR) database is computationally expensive. Most researchers use subsets of the NR database to test their hypotheses, while BoaG provides a facility to explore the NR database in its entirety. There has been prior work on deduplication and reducing the size of NR. Yu et al. [78] developed a pipeline, based on BLAST-MEGAN, to construct a subset of the NCBI NR database for quick similarity search and annotation of large metagenomic datasets. Another approach uses MD5 checksums to provide a non-redundant protein database [76] by splitting sequence data into a single FASTA file and metadata into a SQL database.

In this work, we integrated all 174 million protein sequences and 159 million annotations. The BoaG data schema is based on protocol buffers and is stored in a binary file format, which significantly reduces the storage size. The translated dataset is much smaller even though the Hadoop file system by default keeps a replication factor of two to provide reliable data storage across machines in a Hadoop cluster [18]. No data is lost in this translation, since the BoaG database is a binary encoding of the raw data. Figure 4.7 shows the storage efficiency of BoaG: the BoaG database file is much smaller than the JSON file used in MongoDB. More details about the comparison between BoaG, MongoDB, and the original raw data are shown at https://github.com/boalang/NR_Dataset/tree/master/supplemental.
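As a rough illustration of why a binary, protocol-buffer-style encoding shrinks the data relative to a text format like JSON, the following sketch compares the two for a single toy record. The field names and values are hypothetical and this is plain Python, not BoaG's actual protobuf schema:

```python
import json
import struct

# Toy record resembling one annotation entry (hypothetical fields).
record = {"id": 12345678, "taxid": 511145, "cluster": 987654, "length": 312}

# Text encoding: field names and punctuation are repeated in every record.
json_bytes = json.dumps(record).encode("utf-8")

# Fixed-width binary encoding: four 32-bit integers, no field names.
packed = struct.pack("<IIII", record["id"], record["taxid"],
                     record["cluster"], record["length"])

print(len(json_bytes), len(packed))  # the binary form is several times smaller
```

A real protobuf schema adds varint compression and optional fields on top of this, but the basic saving, dropping repeated field names and text digits, is the same.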

Figure 4.8 shows the decrease in the required computation time with a corresponding increase in the number of Hadoop mappers. Query 1 is the analysis of protein length frequencies described in Section 4.3.2. Query 2 is the analysis of protein distribution across the tree of life described in Section 4.3.1. Query 3 is the analysis of highly conserved proteins discussed in Section 4.3.5. Query 4 is the cluster analysis of NR described in Section 4.3.4. For these queries, we varied the number of mappers on the 5-node shared Hadoop cluster to evaluate the speedup obtained by adding additional mappers to an analysis. As we added more mappers, the running time decreased significantly.

4.4.2 Programming efficiency in BoaG

The analyses discussed in this work required fewer lines of code in the BoaG language and were automatically translated to larger parallel Java code run on Hadoop, reducing programming effort. For example, the analysis of protein length discussed in Section 4.3.2 took BoaG only 2 minutes on a 5-node shared Hadoop cluster, while a more complex analysis required 42 minutes and produced 40 GB of output.

The analyses performed here could also have been done using MongoDB with post-processing in Python, or directly on the raw flat file using Bash. However, that would have taken more time and more lines of code (see Supplemental Figure 5). Utilizing a genomics-specific language like BoaG, which provides data preprocessing and curation at the data generation phase, makes downstream analyses based on the generated dataset more reliable. For example, while generating the BoaG dataset from the raw data, we found anomalies in the NR database: we detected several sequences that carried no information and contained only X (unknown amino acid). These sequence IDs were reported to NCBI, and a list of detected anomalies is given in the supplemental folder of our GitHub repository.

To summarize, BoaG provides automatic parallelization on top of Hadoop, reduces programming errors by abstracting details into a few lines of code, and produces a curated dataset that is smaller than the original raw data.

We anticipate that the BoaG strategy may facilitate the exploration of other biological databases. We will also provide facilities for researchers to clean the NR database.

Figure 4.7 The BoaG dataset compared with the raw data and the equivalent MongoDB database.

Figure 4.8 Scalability of BoaG programs (time in log2 seconds). The four queries are the analyses of protein length frequencies, distribution across the tree of life, highly conserved proteins, and clusters.

4.5 Conclusion

In this work, we explored the NR database and its clustering information at different similarity levels and showed the computational power of the BoaG language and infrastructure. We showed BoaG's storage efficiency and automatic parallelism, achieved with less effort than in general-purpose languages. We described the average length of protein sequences found in NR, the most common taxonomic assignments, the top protein functions, and the redundancy and ambiguity of annotations in NR. Redundancy and ambiguity at the annotation level impact the usage, curation, and exploration of the NR database. The BoaG infrastructure will greatly improve the usage and exploration of NR; it is publicly available for bioinformaticians to test different hypotheses and share with others.

CHAPTER 5. DETECTING AND CORRECTING MISCLASSIFIED SEQUENCES IN THE LARGE-SCALE PUBLIC DATABASES

5.1 Introduction

Researchers use BLAST on the non-redundant (NR) database on a daily basis to identify the source and function of a protein or DNA sequence. The NR database encompasses protein sequences from non-curated (low quality) and curated (high quality) databases. It contains non-redundant sequences from GenBank translations (i.e., GenPept) together with sequences from other databases (RefSeq [61], PDB [16], SwissProt [17], PIR [77], and PRF). NR removes 100% identical sequences and merges their annotations and sequence IDs.

We have identified three root causes for annotation errors in public databases: user metadata submission, contamination in biological samples, and computational methods. NCBI relies on the accuracy of the metadata provided by researchers who deposit sequencing data into the database. Data are deposited into NCBI as BioSamples and BioProjects in the form of raw data, genome assemblies, and transcriptomes. BioSamples contain metadata describing the data type, scope, organism, publication, authors, and attributes, which include cultivar, biomaterial provider, collection date, tissue, developmental stage, geographical location, coordinates, and additional notes. This metadata is then propagated to the deposited sequences. For example, if DNA sequences were deposited by a plant researcher studying soybeans obtained from soybean roots, then all sequences tied to that metadata will be labeled with the organism name Glycine max. If the researcher had in fact been working on Glycine soja, this would result in a misassignment for all of those Glycine max-labeled sequences.

The second key challenge that all large databases face is contamination [66]. For example, if the aforementioned hypothetical soybean researcher did not remove the soybean root nodules during sample processing, then the tissue sample could also contain DNA from nitrogen-fixing soil bacteria that infect nodules, leading to contamination of the sequences and ultimately of the sequence database.

NCBI is aware of the potential for contamination in sequence databases and describes potential sources of contamination that include DNA recombination techniques (vectors, adaptors, linkers and PCR primers, transposons, and insertion sequences) and sample impurities (organelle DNA/RNA, multiple organisms).

NCBI encourages the use of programs to reduce issues with contamination. Specifically, they recommend screening the sequences used in sequencing library preparation for contamination using VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/) and BLAST. More broadly, they recommend BLAST to screen out bacterial, yeast, and Escherichia coli sequences, and BLASTing against the NR database to identify potential contaminating sequences. Unfortunately, despite these efforts, sequences that are incorrectly taxonomically classified still end up in the NR database.

This can limit our ability to identify contamination in future sequence submissions, as BLASTing against the database could propagate these types of errors as the database grows in size [66]. The contamination problem is not unique to NCBI but can be found in all large databases. A large-scale study of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database revealed that 2,250 genomes are contaminated by human sequences [19]. Breitwieser et al. reported 3,437 spurious protein entries currently present in the NR and TrEMBL protein databases.

The third key challenge is that there are annotation errors caused by computational tools that predict annotations based on homology to existing sequences [66]. These errors have affected annotation accuracy and database quality over the years. Annotation errors are not limited to contamination or computationally predicted ones; for instance, there is evidence suggesting that some experimentally derived annotations may also be incorrect [66].

Therefore, it will be beneficial for researchers to utilize a quality control method to detect misclassified sequences and propose the most probable taxonomic assignment.

To address these well-known problems, there are two approaches in the literature: a phylogenetic-based approach and a functional approach. For the first approach, Kozlov et al. [40] proposed a phylogeny-aware method to detect and correct misclassified sequences in public databases; they utilized the Evolutionary Placement Algorithm (EPA) to identify mislabeled taxonomic annotations. Edgar [30], also using a phylogenetic-based approach, studied taxonomy annotation error in rRNA databases and showed that the annotation error rate in the SILVA and Greengenes databases is about 17%.

In the second approach, it is a common quality control and data cleaning technique to utilize domain knowledge in the form of ontologies [21]. The Gene Ontology (GO) [10] has been suggested as a way to infer aspects of protein function based on sequence similarity [34]. The MisPred [53] and FixPred [54] programs address the identification and correction of misclassified sequences in public databases. The FixPred and MisPred methods are based on the principle that an annotation is likely to be erroneous if its features violate our knowledge about proteins [52]. MisPred [53] is a tool developed to detect incomplete, abnormal, or mispredicted protein annotations, with a web interface to check protein sequences online. MisPred uses protein-coding genes and protein knowledge to detect erroneous annotations at the protein function level; for example, its authors found, for a subset of protein databases, that violation of domain integrity accounts for the majority of mispredictions. Modha et al. proposed a pipeline to pinpoint taxonomic errors as well as identify novel viral species [50]. There is also a web server for exploratory analysis and quality control of proteome-wide sequence searches [47] that requires a protein sequence in FASTA format. The European Bioinformatics Institute (EMBL-EBI) developed InterPro (http://www.ebi.ac.uk/interpro/) to classify protein sequences at the superfamily, family, and subfamily levels. UniProt has developed two prediction systems, UniRule and the Statistical Automatic Annotation System (SAAS; https://www.uniprot.org/help/saas), to annotate the UniProtKB/TrEMBL protein database automatically. CDD is a Conserved Domain Database for the functional annotation of proteins [45].

Exploring public sequence databases and curating annotations at large scale is challenging. Due to the computational requirements, previous research on the NR database focused on small subsets when analyzing annotation error. One study [66] examined misclassification levels for molecular function in a model set of 37 enzyme families. To the best of our knowledge, the amount of misclassification in the entire database has not been well quantified.

Here, we attempt to address these limitations in detecting and correcting annotations at large scale and make the following contributions:

• We utilize a genomics-specific language, BoaG, which uses a Hadoop cluster [12], to explore annotations in the NR database in a way that is not available in other works.

• We present a heuristic-based method to detect misclassified taxonomic assignments in the NR database that is low-cost and easy to use. We automatically generate a taxonomy tree from a list of taxonomic assignments and use the tree, along with the frequency and provenance (database of origin) of each taxonomic annotation and clustering information from NR at 95% similarity, to identify potential misclassifications and propose the most probable taxonomic assignment.

• The technique proposed in this work could be generalized to other public databases and to different kinds of annotations, such as protein functions. In this work, we address taxonomic annotation error in protein databases; we also tested our approach on an RNA dataset introduced in the literature.

We have identified 29,175,336 proteins in the NR database that have more than one distinct taxonomic assignment, among which 2,238,230 (7.6%) are potentially taxonomically misclassified. We also found that the total number of potential misclassifications above the genus level in clusters at 95% similarity is 3,689,089 out of 88M clusters, i.e., 4% of the total clusters. This level of misclassification in NR has a significant impact due to the potential for error propagation in downstream analyses [51].

The rest of the chapter is organized as follows. In Section 2, we present the methods and materials for dataset generation and our approach. In Section 3, we discuss the results on taxonomically misclassified proteins within sequences and in NR clustered at 95% similarity, and present the correction approach for the detected sequences. In Section 4, we conclude with suggestions for future work.

5.2 Materials and methods

In this section, we describe the overall architecture of our detection and correction approach. Then, we describe the dataset generation and how we generate a phylogenetic tree from taxonomic assignments. Next, we discuss our detection algorithm for finding misclassified sequences. Then, we describe our approach for proposing taxonomic assignments for the sequences identified as misclassified. Finally, we describe the sensitivity analysis of how changing the different parameters affects the proposed taxonomic assignments.

Figure 5.1 Overview architecture of the proposed method for detecting taxonomically misclassified sequences in the NR database. The diagram shows the raw dataset and the steps of the proposed work.

5.2.1 An overview of the method

Figure 5.1 shows an overview of our approach. The NCBI NR database files were downloaded from ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/ on Oct 20, 2018. Taxonomic information was obtained from XML files on NCBI (https://ftp.ncbi.nlm.nih.gov/blast/temp/DB_XML/). CD-HIT [31] (version v4.6.8-2017-1208) was used to cluster NR protein sequences at 95% similarity using the following parameters: -n 5 -g 1 -G 0 -aS 0.8 -d 0 -p 1 -T 28 -M 0. These parameters use a word length of 5 and require that the alignment covers at least 80% of the shorter sequence's length. The data acquisition, preprocessing, and clustering took about 3 days; the detection and correction part took about 8 hours.

We took the NR protein FASTA files, whose definition lines contain annotations from different databases, and generated the BoaG format, which took about 2 hours. Each definition line in the raw data includes a protein ID and a protein name followed by an organism name in square brackets, e.g., ">AAB18559 unnamed protein product [Escherichia coli str. K-12 substr. MG1655]". BoaG is a domain-specific language that uses a Hadoop-based infrastructure for biological data [12]. A BoaG program is submitted to the BoaG infrastructure, where it is compiled and executed on a distributed Hadoop cluster to query the BoaG-formatted database of the raw data. BoaG has aggregators that can be run on the entire database or a subset of it, taking advantage of a protobuf-based schema design optimized for a Hadoop cluster for both the data and the computation. These aggregators are similar to, but not limited to, the aggregators traditionally found in SQL databases and NoSQL databases like MongoDB. A BoaG script requires fewer lines of code, provides storage efficiency, and automatically parallelizes large-scale analyses.

5.2.1.1 Dataset generation

To describe our dataset, let D denote the protein and clustering dataset in our study: D = {P, C, τ, z}. Here, P = {P1, P2, ..., Pm} is the set of all proteins in the NR database, and C = {C1, C2, ..., Cn} is the set of all clusters at 95% similarity; |P| and |C| in our dataset are about 174M and 88M, respectively. τ is the set of taxonomic assignments for proteins, and z is the set of functions in the NR database. In this work, we focus on exploring the taxonomic assignments.
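The components of D can be pictured with a minimal in-memory sketch. The container types here are hypothetical illustrations, not BoaG's actual protobuf schema; the two tax IDs for AAB18559 are the ones discussed later in this section, and the sequence string is a placeholder:

```python
# Minimal sketch of the dataset D = {P, C, tau, z} (hypothetical containers).
dataset = {
    "P":   {"AAB18559": "MK..."},                      # proteins: ID -> sequence (placeholder)
    "C":   {"C1": ["AAB18559"]},                       # 95% clusters: ID -> member proteins
    "tau": {"AAB18559": {"511145": 1, "723603": 1}},   # taxonomic assignments: taxid -> frequency
    "z":   {"AAB18559": ["unnamed protein product"]},  # protein functions
}

# Number of distinct taxonomic assignments for one protein:
print(len(dataset["tau"]["AAB18559"]))  # 2
```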

Definition 1 Cluster. We define a cluster as a set of protein sequences that are 95% similar in sequence and at least 80% similar in length. Every particular cluster Cj has k members:

Cj = {P1, P2, ..., Pk}, with k ∈ [1, |P|]   (5.1)

In Definition 5.1, each protein sequence belongs to exactly one cluster at 95% similarity, and each cluster has one representative sequence. If a protein is not sufficiently similar in sequence and length to any other protein, it falls into a cluster with no other member.

5.2.1.2 Generating phylogenetic tree from taxonomic assignments

We take the list of taxonomic assignments that originate from different databases (manually reviewed and computationally created) and build a phylogenetic tree utilizing the ETE3 library [35]. This library uses the NCBI taxonomy database, which is updated frequently.

Definition 2 Annotation List. Each phylogenetic tree is associated with one particular protein, Pi, and has the set of taxonomic assignments originating from different databases. Here, Ai,j denotes annotation number j for protein Pi:

τ(Pi) = {Ai,1, Ai,2, ..., Ai,j}, j ∈ [1, |τ|]   (5.2)

For example, the protein sequence AAB18559 has the taxonomic assignments 511145 and 723603, each of which appears once.

Definition 3 Provenance. For a particular protein Pi, we define the provenance prov(Ai,a) of annotation Ai,a as the set of databases that the annotation Ai,a originates from:

prov(Ai,a) ∈ {GenBank, TrEMBL, PDB, RefSeq, SwissProt}   (5.3)

In Definition 5.3, annotations from GenBank, TrEMBL, and PDB are calculated computationally, while annotations from RefSeq and SwissProt are manually reviewed. For example, prov(511145) = GenBank means that the tax ID 511145 for the sequence AAB18559 originates from the GenBank database.

Definition 4 Annotation Probability. We define the probability of each taxonomic assignment as the frequency of that annotation divided by the weighted total over all taxonomic assignments from the different databases:

prob(Ai,a) = [freq(Ai,a∈Comp) + w × freq(Ai,a∈Rev)] / [Σ_{j∈Comp} freq(Ai,j) + Σ_{j∈Rev} w × freq(Ai,j)]   (5.4)

In Definition 5.4, Ai,a∈Comp represents an annotation calculated computationally (Comp) from the databases GenBank, TrEMBL, and PDB, and Ai,a∈Rev denotes a reviewed (Rev) annotation from RefSeq or SwissProt. One annotation might originate from both reviewed and computationally created databases. We use a conservative weighting factor w to denote the importance of experimental (manually reviewed) annotations, where w is an integer and w ≥ 1.

The upper bound for the total number of proteins, i.e., |P|, is 174M at the time we downloaded NR. Each leaf node, Va, in the phylogenetic tree is annotated with the information described in Definitions 5.2, 5.3, and 5.4. Each leaf carries a list of frequencies and provenances, shown as a bar chart, since one particular taxonomic annotation might originate from several databases:

Va = {prob(Ai,a), freq(Ai,a), prov(Ai,a)}   (5.5)

For a particular protein Pi, we define the most probable annotation (MPA) as MPA(Pi) = Ai,j, the annotation with the highest probability among the set of annotations. In addition, we define the least probable annotation (LPA), with the lowest probability, which is potentially misclassified, as LPA(Pi) = Ai,k, in which k ≠ j.
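Given the per-annotation probabilities of Definition 5.4, the MPA and LPA are simply the argmax and argmin. The probabilities below are hypothetical values for a single protein:

```python
# Hypothetical Eq. 5.4 probabilities for one protein's taxonomic assignments.
probs = {"9606": 0.75, "10090": 0.20, "32630": 0.05}

mpa = max(probs, key=probs.get)  # most probable annotation (MPA)
lpa = min(probs, key=probs.get)  # least probable annotation (LPA), candidate misassignment
print(mpa, lpa)  # 9606 32630
```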

Definition 5 Conserved Proteins. We define a conserved protein as a protein that has at least 10 distinct taxonomic assignments. The list of these conserved proteins is shown in our repository (https://github.com/boalang/nr).

Pi such that |τ(Pi)| ≥ 10   (5.6)

5.2.2 Approach to detect taxonomic misclassification

Our approach is as follows. First, we run a BoaG query (Supplementary Fig. 1) on the full NR database in the Hadoop cluster. Algorithm 1 describes the detection approach for misclassified sequences. It iterates over the entire NR database. In line 4, it takes a protein Pi and generates a phylogenetic tree from the set of taxonomic assignments for Pi. Then, in line 5, it checks for a misclassification. If the lowest common ancestor (LCA) is at the root level, there is a considerable distance between the taxonomic assignments for that particular protein sequence; therefore, there is a potential misassignment among the taxonomic assignments, due to contamination in the sample, error in the computational method, or data entry by the researchers who deposited the sequence. We call this a root violation or conflict. We also consider superkingdom, phylum, class, order, and family violations.

Algorithm 1 The NR misassignment detection algorithm. Input comes from the BoaG query (see supplementals).

1: procedure DETECTMISASSIGNMENTS(D)
2:   NRLength ← |P|                        ▷ m = 174M proteins
3:   while i ≤ NRLength do
4:     phylo ← PhyloTree(Pi)
5:     if misassigned(phylo) && not conserved(Pi) then
6:       print(misassignment found in Pi)

7: procedure PHYLOTREE(Pi)
8:   ncbi ← ncbiTAXA()                     ▷ used to generate the phylogeny tree
9:   phyloTree ← ncbi.get_topology(Pi)     ▷ from the taxa list
10:  for Ai,a in τ(Pi) do
11:    Va ← prob(Ai,a), list(freq(Ai,a), prov(Ai,a))
12:  return phyloTree

In addition, we looked at the highly conserved proteins to remove false positives, because conserved proteins might appear in species that are far from each other, i.e., belong to different domains in the phylogenetic tree.
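The core of the violation check in Algorithm 1 is locating the rank of the LCA of a protein's taxonomic assignments. The following self-contained sketch replaces the real NCBI taxonomy (which Algorithm 1 obtains via ETE3) with a small hand-written lineage table, so the lineages and rank list are simplified assumptions for illustration only:

```python
# Toy lineage table: taxid -> ranked ancestors, superkingdom first (simplified).
LINEAGE = {
    "9606":   ["Eukaryota", "Chordata", "Mammalia", "Primates", "Hominidae"],
    "10090":  ["Eukaryota", "Chordata", "Mammalia", "Rodentia", "Muridae"],
    "511145": ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
               "Enterobacterales", "Enterobacteriaceae"],
}
RANKS = ["superkingdom", "phylum", "class", "order", "family"]

def lca_rank(taxids):
    """Return the deepest rank shared by all taxids; 'root' means a root violation."""
    lineages = [LINEAGE[t] for t in taxids]
    shared = "root"
    for level, rank in enumerate(RANKS):
        if len({lin[level] for lin in lineages}) > 1:
            return shared          # lineages diverge below `shared`
        shared = rank
    return shared

print(lca_rank(["9606", "10090"]))   # "class": human and mouse diverge at order
print(lca_rank(["9606", "511145"]))  # "root": potential root violation
```

In the real pipeline the LCA check is a constant-time fetch on the ETE3-generated tree; this sketch only makes the rank logic concrete.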

We did not remove the conserved proteins from the dataset, since they contain taxonomic information that is utilized for proposing taxonomic assignments for the misclassified sequences. Assume Pi belongs to Cj. Once we detect a violation in Pi, we look at the cluster Cj and consider the most frequent taxonomic assignment to be the correct taxon. Details are given in Section 5.2.3.

Algorithm 1 requires O(|P| × |τ|) time. Here, |P| is the number of proteins in the NR database and |τ| is the upper bound on the number of taxonomic assignments per protein. In line 5, misassigned(phylo) verifies whether the LCA of the generated tree shows a root violation or any other violation; this requires O(1) time because it is a straightforward fetch, and we have a pointer to the root of the tree to check the LCA. Also in line 5, the conserved(Pi) expression checks whether the protein sequence is a conserved one (see Eq. 5.6); checking that a protein is not in the conserved list (Definition 5) is a membership test and takes O(1) time. This conserved list is precomputed from our dataset and is available in our repository. We wrote multi-threaded Python code, and the total run time of the algorithm was seven hours for the entire NR database on an iMac (Retina 5K, 27-inch, Late 2015) with a Core i7 and 32 GB RAM. For the second procedure, in line 11, the algorithm requires O(|τ|) time to calculate the probability of each leaf in the generated phylogenetic tree.

Algorithm 2 Annotation correction: the most probable annotation for the misclassified sequences. Input comes from the BoaG query (see supplementals).

1: procedure MOSTPROBABLE(Pi, p, c)
2:   top_ann ← max(prob(τ(Pi)))           ▷ most probable taxon
3:   if prob(top_ann) ≥ p then
4:     return top_ann
5:   else
6:     cluster ← Cj in which Pi ∈ Cj
7:     top_ann ← ClusterMostProbable(cluster, p, c)
8:     return top_ann

9: procedure CLUSTERMOSTPROBABLE(cluster, p, c)
10:  if size(cluster) ≥ c then
11:    for Ai,a in τ(cluster) do
12:      Va ← prob(Ai,a), list(freq(Ai,a), prov(Ai,a))
13:    top_ann ← max(prob(τ(cluster)))    ▷ most probable taxon
14:    if prob(top_ann) ≥ p then
15:      return top_ann
16:  else
17:    return null                        ▷ cannot fix the misclassification

5.2.3 The most probable taxonomic assignment for detected misclassifications

For the detected misclassified sequences, we defined criteria to propose the most probable taxonomic assignment (MPA). First, we ran a BoaG query (Supplementary Fig. 2) to retrieve the annotations and clustering information at 95% similarity. As shown in Definition 4, we considered the provenance (database of origin) and frequency of annotations to calculate the most probable taxonomic assignment (MPA), i.e., the one with the highest probability. Assume that Pi belongs to cluster Cj. If the algorithm does not find an MPA within a certain threshold, probability p, then we look at the 95% similarity cluster that the sequence belongs to and find the most probable taxonomic assignment in Cj. If a particular taxonomic assignment is the most frequent one in Cj, we return that annotation as the MPA for the protein sequence Pi. For example, if in cluster Cj seven sequences out of 10 have a specific annotation, then we consider this annotation to be the most probable annotation for protein sequence Pi with 70% confidence.

Details are shown in Algorithm 2. In line 2, for a particular protein Pi, it returns the most frequent taxonomic assignment if it exceeds a certain threshold p; for example, we may require a taxonomic assignment that appears more than 70% of the time. If the algorithm does not find an MPA, it checks the 95% similarity cluster Cj that the sequence belongs to and finds an assignment with a certain probability, p, provided the cluster size is at least c (line 7). In line 9, ClusterMostProbable takes the cluster ID and finds the most probable taxonomic assignment in the cluster (line 13).
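The fallback logic of Algorithm 2 can be sketched as below. The probabilities are toy values (hypothetical tax IDs), precomputed rather than derived from Eq. 5.4, to keep the control flow in focus:

```python
# Sketch of Algorithm 2: if no annotation of the protein itself reaches
# probability p, fall back to its 95% cluster, provided the cluster has at
# least c members. Returns None when the misclassification cannot be fixed.
def most_probable(protein_probs, cluster_probs, p, c, cluster_size):
    top = max(protein_probs, key=protein_probs.get)
    if protein_probs[top] >= p:
        return top                       # line 3: the protein's own MPA suffices
    if cluster_size >= c:                # line 10: only trust large-enough clusters
        top = max(cluster_probs, key=cluster_probs.get)
        if cluster_probs[top] >= p:
            return top                   # line 15: cluster-level MPA
    return None                          # line 17: cannot fix the misclassification

protein = {"9606": 0.5, "32630": 0.5}    # tie: neither annotation reaches p = 0.7
cluster = {"9606": 0.8, "32630": 0.2}    # e.g. 8 of 10 cluster members say 9606
print(most_probable(protein, cluster, p=0.7, c=3, cluster_size=10))  # 9606
```

With a cluster of only two members (below c = 3), the same call returns None, matching the "cannot fix" branch of the algorithm.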

Algorithm 2 requires O(|τ(Pi)|) time (Definition 5.2) to find the maximum-probability annotation in the list of annotations.

5.2.4 Simulated and literature dataset

To evaluate the performance of our taxonomic misclassification approach, we generated a simulated dataset. We took a subset of one million proteins from the reviewed dataset, i.e., the RefSeq database, and randomly misclassified 50% of the proteins in the sample by adding a taxonomic assignment from another phylum or kingdom. We then tested whether the approach could detect these sequences. We also tested our approach for detecting and correcting misclassified sequences on real-world data presented in the literature [30, 40]. These works focused on RNA datasets and quantified misclassified RNA sequences. We again used CD-HIT to cluster the RNA databases at 95% sequence similarity. Further details on the simulated dataset, scripts, and data files can be accessed at https://github.com/boalang/nr.

5.2.5 Sensitivity analysis

We define sensitivity analysis as the study of how an input parameter affects the output of the proposed approach. Here, the probability based on annotation frequencies and the cluster size are the two input parameters that affect what percentage of the detected misclassified sequences we can fix, i.e., assign an MPA, as in Algorithm 2 on the NR dataset. The algorithm will not give the same suggestion for different parameter values; for example, if we change the cluster size (the number of proteins in the cluster), it may or may not find the correct taxon. We conducted a sensitivity analysis based on the probability of each annotation, defined in Definition 5.4, and the size of the 95% cluster that the sequence belongs to. We ran the algorithm to find the most probable taxonomic assignments (MPA) with different cluster sizes, c, and different probabilities, p. As shown in Supplementary Figure 3, with a probability of 0.4 and without giving more weight to the experimentally verified annotations, we could provide a most probable taxonomic assignment for about 60% of the proteins that we detected as misclassified. We then extended the sensitivity analysis by giving more weight to the experimental taxonomic assignments; with the same probability of 0.4, we could provide a most probable taxonomic assignment for more than 80% of the sequences identified as misclassified.

Table 5.1 Detected misclassified taxonomic proteins in the NR database.

taxa   total        root    Kingdom  Phylum   Class   Order    Family
2      17,496,167   30,237  47,271   202,205  59,606  177,132  290,065
3      5,921,066    14,376  19,666   107,705  38,575  104,709  236,515
4      2,132,971    4,673   21,587   64,801   17,662  47,914   94,054
5      1,022,482    3,143   9,469    34,322   10,050  27,295   53,276
6      642,760      2,509   5,662    24,136   7,333   23,324   37,998
7      388,794      1,572   3,959    12,972   5,905   13,488   27,221
8      262,682      1,121   2,803    5,988    5,375   10,075   16,340
9      190,756      783     2,647    3,825    3,173   7,557    12,681
10     156,767      667     1,843    3,805    2,451   6,413    11,327
>10    960,891      10,940  23,232   30,048   38,679  46,391   107,679

5.3 Results

In this section, we present the number of proteins that are taxonomically misclassified. We also present the performance of our approach on the simulated dataset and on datasets presented in the literature. Then, we describe our findings on misassignments in the clusters. Next, we present the correction of taxonomic misclassifications. Finally, we discuss a case study in which we explored in depth a subset of clusters containing sequences with a taxonomic misclassification.

5.3.1 Detected taxonomically misclassified proteins

We found 29,175,336 proteins in the NR database that have more than one distinct taxonomic assignment. The rest of the proteins have identical taxonomic assignments, even though they originate from different databases. The total number of potentially taxonomically misclassified sequences was 2,238,230 out of 29,175,336 (7.6%) at the time of download. This percentage of NR is significant because of error propagation in downstream analyses [51]. Table 5.1 shows the number of violations, from the superkingdom to the family level, detected in the protein sequences in NR by applying distance in the phylogenetic tree. The second column shows the number of proteins that have a given number of taxonomic assignments. For example, there are 17,496,167 protein sequences in NR that have two taxonomic assignments, of which 30,237 have potential root violations and 47,271, 202,205, 59,606, 177,132, and 290,065 have kingdom, phylum, class, order, and family violations, respectively. For the NR dataset, we did a study of 1000 samples and manually found 5.5% misassignments; the share of potentially misclassified sequences detected by the approach, around 7.6%, is consistent with this manually found rate.

Figure 5.2 Phylogenetic tree generated for sequence ID NP_001026909. Taxonomic assignments originate from the GenBank, TrEMBL, PDB, RefSeq, and SwissProt databases.

Table 5.1 shows proteins that have between two and 10 taxonomic assignments; the last row aggregates all other proteins with more than 10 assignments. The first two rows (bold) show the highest potential misassignments, because if a protein has only two or three taxonomic assignments and shows a root or kingdom violation, it is more likely to be misclassified. The full list of detected misclassified proteins and detailed analysis are shown in our GitHub repository. We did not report genus conflicts, since the probability of a false-positive misclassification at that level is much higher compared to other taxonomic levels of conflict, such as root and superkingdom.

Figure 5.2 shows one example of a detected misclassified protein, with the ID NP_001026909. Since the lowest common ancestor in this tree is the root, the taxonomic assignments belong to different kingdoms. Leaves are annotated with the frequency of each taxonomic assignment as a bar chart, from all reviewed and unreviewed databases, i.e., RefSeq [61], GenBank [15], PDB [16], UniProt\SwissProt [17], and UniProt\TrEMBL [23], respectively. As shown in the annotations, there are potential misassignments even though the key IDs originate from the reviewed databases, i.e., RefSeq and SwissProt. In this example, synthetic construct is the misassignment, and the MPA for this protein is Homo sapiens.

We also explored some clusters in depth as a case study and identified proteins that are taxonomically misclassified as Glycine, which are in fact contamination in the sample (Supplementary Sec 1.6).

5.3.2 Performance on simulated and real-world dataset

Our approach to detecting taxonomically misclassified proteins on the simulated dataset showed 87% recall and 97% precision. We define a true positive (TP) as a sequence that is misclassified in the sample and that our approach retrieves. False positives (FP) are sequences that do not have misassignments but that our approach classified as misclassified. A false negative (FN) is a sequence that the algorithm incorrectly labels as correct (not misclassified) while it is in fact misclassified. Some of these false negatives are due to changes in the taxonomies over time: some taxonomic IDs might be obsolete, deleted, or merged into other tax IDs. We also found that some of the trees generated by the NCBI API have a root named "Cellular Organisms" with rank equal to "no rank", which does not fall into any of the taxonomic ranks.

We use the following formulas to calculate precision and recall:

precision = TP / (TP + FP);  recall = TP / (TP + FN)

Table 5.2 Accuracy of detecting misassignments and the comparison with work presented in SATIVA [40]. The best values are highlighted.

           Precision   Recall   Runtime
SATIVA     0.93        0.98     116 min
Proposed   0.98        0.90     12 min

We extended our experiment and added more than two random assignments to the proteins, and the precision increased. The reason is that adding more random assignments increases the distance among tax IDs in the phylogeny tree and hence increases the chance of detection by the approach. We also tested our approach on the dataset presented by [30], in which the Greengenes and SILVA databases were explored for taxonomic error. Our method reproduced their findings on annotation conflicts between the SILVA and Greengenes [46] databases. We did not run their approach on the simulated dataset since it was designed to detect misassignments in rRNA sequences, not proteins. For evaluating our work, we looked for similar works that focused on detecting taxonomic misassignments; however, their approaches were hard-coded for RNA sequences. Therefore, we modified our approach to test on their dataset. The proposed work focuses on inconsistencies among the list of taxonomies, and it can be applied to RNA sequences as well. We clustered their dataset at 95% similarity and used the same consensus-based technique to detect conflicts between sequences and clusters. The phylogeny-aware technique proposed by Kozlov et al., called SATIVA, identifies and corrects misclassified sequences for RNA databases [40]. They utilized the Evolutionary Placement Algorithm (EPA) to detect misclassified sequences. In their approach, a reference tree is created. Then, to estimate the most likely placements of the query sequence in the reference tree, they use EPA. We took their RNA dataset, clustered the sequences at 95% similarity, and then utilized our technique to check whether the annotation of each sequence conflicts with the cluster that the sequence belongs to. There is a difference between the NR dataset and the RNA dataset presented by Kozlov et al. in terms of the number of taxonomic annotations. In their experiment, they have one taxonomic label for each sequence; however, in the NR database, there are several annotations for each protein sequence. Therefore, their technique is not designed to detect misclassification in a set of given annotations. In terms of running time, clustering at 95% is less expensive than running sequence alignment, generating a phylogeny tree, and verifying each query sequence. Therefore, our approach is scalable for large-scale sequence databases. In general, examining the distance on the phylogenetic tree of multiple annotations for the shorter sequences performs better compared to the alignment-based approaches with the reference databases. Table 5.2 shows the standard values for precision and recall, as well as the running time comparison. Our approach to detecting misassignments on the sample RNA dataset has a lower recall. This is due to the relatively smaller datasets, which caused some clusters to have few members and made it challenging to detect misclassified sequences.
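As a minimal illustration of these definitions, precision and recall can be computed as follows; the counts used here are placeholders, not values from our experiments:

```python
def precision_recall(tp, fp, fn):
    # precision = TP / (TP + FP); recall = TP / (TP + FN)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Placeholder counts for illustration only.
precision, recall = precision_recall(tp=87, fp=3, fn=13)
```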

Figure 5.3 Comparison of the running time of the proposed work with the SATIVA method, using the dataset from the SATIVA paper.

Table 5.3 Proposed taxa for the detected misclassified sequences in NR. The last column shows the confidence score (CS).

Protein ID   Cluster ID   Original taxa                Proposed taxa         CS
AAB18559     18982245     uncultured actinobacterium   Escherichia coli      1
AAT83007     21005513     Mycobacteroides abscessus    Cutibacterium acnes   0.8
CCW09133     9901357      Streptococcus pneumoniae     Bacillus cereus       0.5
KFV03115     13041247     Tauraco erythrolophus        Pelodiscus sinensis   0.4
YP_950729    83178931     Staphylococcus virus PH15    firmicutes            0.8

5.3.3 Detected misassignments in clusters

There are 12,960,476 clusters at 95% similarity that have two taxonomic assignments, of which 17,099 have potential root violations and 92,526, 263,844, 100,560, 267,251, and 461,795 have kingdom, phylum, class, order, and family violations, respectively. The number of root violations for two tax assignments in clusters is less than that for sequences because there are protein sequences that do not belong to any cluster at 95% similarity: 64M out of 174M proteins (36%) in the NR database are unclustered (Supplementary Table 1). The total number of potential misclassifications for clusters at 95% similarity, excluding the genus level, is 3,689,089 out of 25,159,866 clusters that have more than one taxon, which is 15% of the total clusters. Detailed numbers of misclassified sequences in the clusters, along with an example of detected taxonomically assigned annotations in a cluster, are shown in the supplemental files.

5.3.4 Correcting Taxonomic Misclassification

Each protein sequence belongs to one and only one cluster. We analyzed the set of top three taxonomic annotations of each sequence and compared them with the top three taxonomic annotations of the cluster the sequence belongs to. For example, the top three taxonomic assignments for the sequence with ID AAA32344 are '10743', '1182665', and '656390'. This sequence falls in the cluster with ID 8461728, and the top tax IDs in this cluster are '562', '83334', and '621'. We consider this a conflict between sequence AAA32344 and cluster 8461728. All three annotations are different; therefore, we consider this case as three conflicts. If two annotations out of three are different, we classify this as two conflicts. If one taxonomic annotation is different between the two sets, we classify it as one conflict. Finally, if the three annotations are identical, there is no conflict. The percentages of the different numbers of conflicts from the subset of one million sequences are shown in supplementary Fig. 5.
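The conflict-counting rule above can be sketched as a simple set comparison; `count_conflicts` is a hypothetical helper for illustration, not part of our released implementation:

```python
def count_conflicts(seq_top3, cluster_top3):
    """Number of the sequence's top-three annotations that are absent from
    the cluster's top-three annotations: 0 (no conflict) up to 3."""
    return len(set(seq_top3) - set(cluster_top3))

# The example from the text: sequence AAA32344 vs. cluster 8461728.
count_conflicts(['10743', '1182665', '656390'], ['562', '83334', '621'])  # -> 3
```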

Table 5.3 shows several examples of the protein sequences that we have found to be misclassified in the NR database. The first column is the sequence ID, and the second column is the cluster ID corresponding to the sequence. The third column shows the original taxonomic assignment, and the fourth column is the proposed taxonomy based on the consensus information from the clusters of the NR database at 95% similarity. The last column is the Confidence Score (CS), a number between 0 and 1 that shows how confident we are in proposing a new taxonomic assignment based on the consensus information from the clusters at 95% similarity. This score is calculated from the clusters' information as the frequency of the top, i.e., most frequent, taxonomic assignment in the cluster divided by the total number of taxa in the cluster. The assumption here is that the consensus of multiple independent sequence annotations can catch simple misannotation errors. For example, the protein sequence with ID YP_950729 has Staphylococcus virus PH15 as its taxonomic assignment. It falls in cluster ID 83178931, and the recommended annotation is firmicutes. We also conducted a similar analysis on the dataset by SATIVA and could reproduce the proposed taxa based on the consensus information from the clusters. For the dataset by Edgar [30], since the number of sequences was small, we could not get clusters with enough members to suggest annotations.
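The confidence score described above (frequency of the most frequent assignment divided by the total number of assignments in the cluster) can be sketched as follows; the cluster contents are hypothetical:

```python
from collections import Counter

def confidence_score(cluster_taxa):
    """CS = frequency of the most frequent taxon / total assignments."""
    taxon, freq = Counter(cluster_taxa).most_common(1)[0]
    return taxon, freq / len(cluster_taxa)

# Hypothetical cluster in which 4 of 5 assignments agree -> CS = 0.8.
confidence_score(['1239', '1239', '1239', '1239', '1280'])
```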

5.3.5 Running time

We conducted an analysis of the RNA dataset presented by SATIVA with different samples of sequences.

Firstly, we took 100 sequences and ran SATIVA on the sample. Next, we took 500 sequences. In two other experiments, we took 1000 and 2000 additional sequences and recorded the running time. Figure 5.3 shows the comparison in terms of running time between the proposed work and the SATIVA method. The most time-consuming part of our approach is the clustering time (run by CD-HIT). By adding more sequences, the runtime increased only slightly. In contrast, for the SATIVA method, as we increase the number of sequences, the running time increases significantly. The computationally expensive part of the SATIVA approach is the phylogenetic method (the Evolutionary Placement Algorithm) it employs. The comparison between the proposed approach and the SATIVA method was made on a local iMac (Retina 5K, 27-inch, Late 2015) with a Core i7 and 32 GB RAM.

5.4 Discussion and conclusion

In this work, we addressed taxonomically misclassified sequences in large publicly available databases by utilizing our domain-specific language and Hadoop-based infrastructure. We focused on misassignments at the taxonomic level and, similar to MisPred [53], we utilized the current knowledge of organismal classification to detect annotation errors. Similar to [34], we utilized this form of knowledge-based reasoning for quality control and to detect annotation errors. Compared to other works, our work differs in that we do not need to run a sequence similarity search to explore annotations and find taxonomic inconsistencies for each query sequence in the NR database. Instead, we first clustered the NR proteins at the data generation phase, which is a one-time cost, and used the clustering information later to detect annotation errors and propose the most probable annotations. In this work, we proposed a heuristic method to find inconsistencies in the metadata, i.e., taxonomic assignments. With our method, we proposed the most probable taxonomic assignment for each protein sequence. We applied this method to the entire database. We also provided a Python implementation that can be used for analyzing a list of annotations for any protein of interest and finding misclassifications. The violations reported in this dissertation in Table 5.1 are an upper bound on the misassignments. A more stringent filter would include hypothetical protein and membrane protein functions in the list of conserved proteins, which would lower the number of identified misclassifications. We use the open-source CD-HIT clustering software only at the data generation phase, and we could utilize any other clustering software. Steinegger et al. have built a novel clustering tool that clusters a huge protein database in linear time [69]. Since this one-time cost happens only in the data generation phase, our approach to detect misassignments and propose the most probable taxonomic assignment is scalable.

5.4.1 Applications and limitations

At 95% similarity, 64M sequences in NR remain unclustered. Therefore, if a particular protein remains unclustered, there is not enough consensus information to correct the annotation for that protein. A solution for this might be to take the Evolutionary Placement Algorithm (EPA) approach [40] for these sequences, which remains future work. The proposed technique to detect misassignments may fail with recent horizontal gene transfer (HGT) events, since horizontally transferred genes are not passed from parent to offspring. However, the consensus information from the clusters might still reveal annotation errors. The proposed heuristic technique and findings could also be applied to other databases. The current work focuses on detecting and correcting misassignments at the level of taxonomic assignments, and we do not address protein function annotations.

5.4.2 Conclusion

Misclassification can lead to significant error propagation in the downstream analysis. In this work, we proposed a heuristic approach to detect misclassified taxonomic assignments and find the most probable annotations for misclassified sequences. This method will be a valuable tool for cleaning up large-scale public databases. The technique we proposed could be extended in the form of ontologies to address other annotation errors, such as protein functions.

CHAPTER 6. IMPROVING DATA QUALITY OF TAXONOMIC ASSIGNMENTS IN LARGE-SCALE PUBLIC DATABASES

6.1 Introduction

Error propagation is a well-known problem in large public databases such as NCBI's Non-redundant (NR) protein database. User metadata submission, contamination errors in the biological samples, and computational methods are three main causes of annotation errors in the public databases [13, 66]. This dissertation focuses on taxonomic mislabeling or misassignment in the public reference database NR (non-redundant). The NCBI's NR database encompasses protein sequences from non-curated (low quality) and curated (high quality) databases. It contains non-redundant sequences from GenBank translations (i.e., GenPept) together with sequences from other databases (RefSeq [61], PDB [16], SwissProt [17], PIR [77], and PRF). NR removes 100% identical sequences and merges the annotations and protein IDs. These annotations could be utilized to assign taxonomic assignments to a query sequence, but this process leads to inconsistencies in taxonomic assignment even within a single sequence entry.

Sequence clustering information of the public databases continues to be used widely [71]. The usefulness of clusters for different biological analyses has been shown for functional annotation, family classification, systems biology, structural genomics, and phylogenetic analysis [71]. One common challenge is assigning taxonomy for clusters with few proteins. We surmount this issue by using intra-cluster assignment and the consistency between the members of each cluster to assign taxonomy with higher confidence. To the best of our knowledge, there has not been a scalable solution to clean data from large-scale biological sequence databases after clustering. In brief, the methods described below utilize CD-HIT [31] to cluster NR sequences at different similarity levels, i.e., 95%, 90%, 85%, down to 65%. To improve the data quality of the clusters, anomalies are removed and then a confidence score is provided based on the lineage of all sequences within each cluster.

Figure 6.1 Single vs. multiple assignments. About 83% of proteins in the NR database have a single assignment; some of these assignments may originate from multiple databases.

There are two different kinds of taxonomic assignment: first, there are sequences with multiple assignments that originate from multiple databases; second, there are sequences that have a single assignment from one or multiple databases. About 16% of the NR database has more than one distinct taxonomic assignment, while 83% of the database has only one assignment (Fig. 6.1). For multiple assignments to a single sequence, it is known that there is likely to be inconsistency among the list of taxonomic assignments [13, 66]. It is also possible that a protein sequence has the same incorrect taxonomic assignment from one or several databases, which is challenging to identify, resulting in a false sense of reliability given to sequences that have a single assignment from multiple databases/resources. This, in turn, can lead to error propagation in the downstream analysis. How can we ensure that these assignments are not the result of error propagation? There is a need for a different technique, one that is also scalable, to identify erroneous taxonomic assignments for these kinds of sequences in the public reference databases.

Previous work has focused on curating the biological databases. For example, Schnoes et al. [66] studied annotation errors for molecular function in a model set of 37 enzyme families. Meola et al. created a manually curated reference database of 16S rRNA gene sequences to improve taxonomy annotations [48]. However, these works address only a small subset of biological sequences and have issues with scalability. In previous work, we addressed the identification and correction of taxonomically misclassified sequences with multiple distinct assignments originating from multiple databases [13]. We found that of the 29 million sequences that contained multiple taxonomic assignments, more than two million (7.6%) could potentially be misclassified.

In this work, we set out to determine whether taxonomic misassignments could be identified for protein sequences that have only a single taxonomic assignment. It is important for researchers to have some idea of the level of confidence in the taxonomic assignment of a sequence. Therefore, in this work, we designated sequences as reliable or suspicious depending on the level of support. For the detected suspicious assignments, we utilize the consensus information of clusters at 95% sequence similarity in which all members have at least 80% sequence length overlap. About 35% of the NR database did not cluster with other proteins at 95% sequence similarity. This means these sequences are the only members of their clusters; hence, there is no information to propose the most likely assignment. For these sequences, we therefore utilized clusters at 90% similarity that had other members and assigned a lower score to the annotations.

Here, we attempt to improve the quality of sequences and their clusters in the large-scale public databases and make the following contributions:

• A scalable algorithm that improves the data quality of clusters at different similarity levels, i.e., the NR95, NR90, ..., NR65 clusters.

• Scalable identification of suspicious taxonomic assignments for proteins that have a single assignment.

• Utilizing the consensus information of clusters to provide a confidence score for the sequences whose assignments are highly reliable.

• Proposing the most reliable assignments for the suspicious sequences by utilizing the consensus information of the clusters at 95% and 90% sequence similarity.

Among sequences with only one taxonomic assignment, we found more than four million that were potentially taxonomically misclassified. We also found that 36% of the NR dataset has highly reliable or trustworthy assignments. We also cleaned NR clusters at different sequence similarity levels, from 95% down to 65%. For the sequences that remained unclustered, we utilized information from the clusters at lower similarity and assigned a lower weight.

The rest of the chapter is organized as follows. In Section 2, we present methods and materials for the approach. In Section 3, we discuss the results of data cleaning of clusters and sequences. We also present the correcting approach for misclassified sequences. In Section 4, we conclude with suggestions for the future.

6.2 Materials and methods

6.2.1 Dataset generation and definitions

Figure 6.2 Examples of NR95 clusters with different misclassification levels: (a) root conflict, (b) superkingdom conflict, (c) phylum conflict, (d) class conflict, (e) order conflict, (f) family conflict. The highlighted assignment is the highly suspicious assignment in the cluster that will be removed at the first stage of data cleaning.

The NR database was downloaded from the NCBI FTP (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) [13]. Taxonomic information was obtained from XML files on NCBI (https://ftp.ncbi.nlm.nih.gov/blast/temp/DB_XML/).

The following are important steps and definitions in this work. 68

Dataset generation. To describe our dataset, let D denote the protein and clustering dataset in this work: D = {P, C, τ, z}. We define P = {P1, P2, ..., Pm} as the set of proteins in the NR database. C = {Cs,1, Cs,2, ..., Cs,n} represents the set of all clusters at similarity level s, where s ∈ {65, 70, ..., 95}. The sets of taxonomic assignments and functions in the NR database are defined by τ and z, respectively. In this work, we primarily focus on the taxonomic assignments, i.e., τ(Pi) = {A1, A2, ..., Aj}, where the maximum value of j is 729,233; the NCBI taxonomy database currently contains 729,233 different entries. The data acquisition, preprocessing, and clustering (NR95, NR90, ..., NR65) took about seven days. The most time-consuming part was generating NR95. We took the NR FASTA files, whose definition lines contain annotations from different databases, and generated the BoaG format [12]. The detection and correction for single-assignment proteins took about an hour.

We utilized BoaG to retrieve the clustering information and metadata. The BoaG infrastructure provides an easy-to-use web interface to query large-scale datasets in a few minutes. The BoaG web interface is publicly available for researchers to run different queries and share them with others. In this work, we utilized BoaG to explore taxonomic assignments in the clusters at 95% down to 65% similarity. The queries and public links to the results are also made available in our GitHub repository (https://github.com/boalang/quality). The dataset is the same as we used in the previous work [13] and was downloaded on Oct 20, 2018.

1 s: Sequence = input;
2 counts: output sum[string][string] of int;
3 foreach (i: int; s.cluster[i].similarity == 95)
4   foreach (j: int; def(s.annotation[j]))
5     counts[s.cluster[i].cid][s.annotation[j].tax_id] << 1;

# Following are a few lines of the output
counts[10000006][348824] = 2
counts[10000006][379] = 1
counts[10000006][501024] = 3

Figure 6.3 Frequency of taxonomic assignments for identifying taxonomically misclassified protein sequences in NR95. In line 1, s is defined as a Sequence type. In line 2, counts is an output aggregator that produces the sum of the output indexed over cluster ID and taxonomic ID. See the public url here: http://boa.cs.iastate.edu/boag/?q=boa/job/public/129

NR Clusters. We used CD-HIT clustering [31] to cluster the NR proteins at 95% sequence similarity with an 80% sequence length overlap threshold, using the following parameters (-n 5 -g 1 -G 0 -aS 0.8 -d 0 -p 1 -T 28 -M 0). This threshold is important as it improves the consistency among annotations [71]. Then, we took the representative sequences of NR95, where the representative is defined as the longest sequence in the cluster, and generated NR90, the clusters with at least 90% sequence similarity, using similar parameters. We followed the same procedure for other similarity levels down to 65%. The clusters of lower similarity were generated using CD-HIT at 5% decrements down to 65% similarity. Therefore, cluster ∈ {NR95, NR90, NR85, NR80, NR75, NR70, NR65}.

Cluster size. Refers to the number of proteins in each cluster. It is a critical factor in the consensus information used to detect and correct misassignments. A smaller cluster leads to a lower confidence, while a larger cluster gives a higher confidence.

Constructing phylogenetic trees. Each cluster contains several taxonomic assignments, which we used to build a phylogeny tree. We utilized the ETE3 library [35]. This library utilizes the NCBI taxonomy database, which is updated frequently.

Provenance. We define provenance as the database of origin for a taxonomic assignment, which is a categorical variable. In correcting misclassified sequences, we give a higher weight to the assignments that originate from manually curated databases. For a particular protein Pi, we define Prov(Ai) as the set of databases that the annotation Ai originates from, i.e., Prov(Ai) ∈ {GenBank, trEMBL, PDB, RefSeq, SwissProt}.

Confidence Score (CS). We define the confidence score (CS) as a measure, between zero and one, of how confident we are when we propose an assignment for a misclassified sequence, based on the provenance and frequency of manual vs. computational assignments and the consensus information of the NR95 and NR90 clusters. We give a confidence score to each assignment in a cluster based on its frequencies originating from multiple databases and different members of the cluster, divided by the total frequency of assignments in the cluster. We give more weight, for example, a factor of two, to those taxonomic assignments that come from manually verified databases such as RefSeq. We define the objective function for NR95 and NR90 to find the assignment Ai such that the following function is maximized:

maximize over Ai:  CS_manual + CS_computational

subject to

    CS_manual = w × freq(Ai, M) / total    (1)
    CS_computational = freq(Ai, C) / total
    total = Σ_j freq(Aj, C) + w × Σ_j freq(Aj, M)
    C ∈ {GenBank, trEMBL, PDB}
    M ∈ {RefSeq, SwissProt}, w ≥ 2

In this function, a pair (Ai, D) represents an assignment in which the database D may have been created computationally (C), as for GenBank, trEMBL, and PDB, or manually reviewed (M), as for RefSeq and SwissProt. One annotation might originate from both manually and computationally created databases. We use a conservative weighting factor w to denote the importance of the experimental (manually reviewed) annotation, where w is an integer, w ≥ 2, that can be set upfront.
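A minimal sketch of this weighted scoring, assuming assignments are given as (taxon, database) pairs; the function and data layout are illustrative, not the production BoaG implementation:

```python
from collections import Counter

MANUAL = {'RefSeq', 'SwissProt'}              # M: manually reviewed
COMPUTATIONAL = {'GenBank', 'trEMBL', 'PDB'}  # C: computationally created

def best_assignment(assignments, w=2):
    """Return the taxon maximizing CS_manual + CS_computational, where
    manual occurrences are weighted by the factor w (w >= 2)."""
    man = Counter(t for t, db in assignments if db in MANUAL)
    comp = Counter(t for t, db in assignments if db in COMPUTATIONAL)
    total = sum(comp.values()) + w * sum(man.values())
    score = lambda t: (w * man[t] + comp[t]) / total
    best = max(set(man) | set(comp), key=score)
    return best, score(best)

# Two manual votes for taxon 9606 outweigh three computational votes for 32630.
best_assignment([('9606', 'RefSeq'), ('9606', 'SwissProt'),
                 ('32630', 'GenBank'), ('32630', 'trEMBL'), ('32630', 'PDB')])
```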

6.2.2 Improve data quality of clusters

To improve the assignment quality of the clusters, i.e., NR95, ..., NR65, we take two steps, as described in Algorithm 3. The first step is to identify assignments that are highly suspicious. At this step, the algorithm gives either a reliable label or a suspicious label to the sequence assignment. We remove the highly suspicious assignments. The second step is to provide the common taxon and rank for each cluster. The proposed algorithm is highly scalable and, due to the data cleaning in the first step, it provides highly reliable assignments. Cluster Cj has a set of proteins P = {P1, P2, ..., Pr, ..., Pm}, where Pr is the representative sequence of cluster Cj. Each Pi has a set of assignments that we used to construct the tree Ti = {A1, A2, ..., An}. To remove anomalies in each cluster Cj, we detect highly suspicious sequences whose assignments are taxonomically far from the most frequent assignments in cluster Cj. We build a phylogeny tree Ti for all the assignments in each cluster and check whether there is inconsistency among them. For example, we examine the root of each tree, and if the root is a superkingdom, we label the cluster Cj as a superkingdom conflict. For the next step, the Check_Lineage function of Algorithm 3, we utilize the lineage of all species within each cluster and, with high confidence, provide the taxonomy name and rank that is common among all members. For example, as shown in Fig. 6.4, the cluster C11844617 contains several taxonomic assignments. In the first step of data cleaning, we identify a misclassification and remove 9606, i.e., Homo sapiens. Then we report Influenza A virus as the common taxon for this cluster. There are algorithms that address the least common ancestor of two nodes in a binary tree. However, the problem of finding the least common ancestor of a set of nodes at scale is computationally expensive. To improve the efficiency, once we remove root, kingdom, phylum, and class conflicts, we start from right to left and find the most common assignment in the array of lineages. This verification step can be done in parallel for all clusters at different similarity levels. We looked at all clusters from 95% sequence similarity down to 65%. In terms of computational complexity, the time to run the algorithm is O(|C|), where |C| is the number of clusters. The algorithm on average requires seven comparisons to find the common taxon and rank in the lineage.

Algorithm 3 Improve data quality of NR clusters. Input comes from the BoaG query (Fig. 6.3)
 1: procedure CHECK_LINEAGE(TaxList)
 2:   for tax in TaxList do                        ▷ All taxa in a cluster
 3:     lineages ← lineages + get_lineage(tax)     ▷ NCBI's API
 4:   Taxa ← find_common(lineages)                 ▷ Traverse from right to left
 5:   return (Taxa.name, Taxa.rank)                ▷ Returns the common taxon and rank as shown in Fig. 6.4
 6: procedure VERIFY_TREE(tree)                    ▷ Tree generated from a list of taxa
 7:   lca ← Lowest_Common_Ancestor(tree)
 8:   if lca in {root, kingdom, phylum, class, order, family} then
 9:     return lca conflict
10:   else
11:     return no conflict
12: procedure CLEAN_CLUSTER(C)                     ▷ Detect suspicious assignments as shown in Fig. 6.2
13:   ClusterSize ← |C|                            ▷ Size of NR database
14:   while j ≤ ClusterSize do
15:     taxa_list ← get_taxa(Cj)                   ▷ Assignments of the cluster
16:     phylo ← ncbi.get_topology(taxa_list)
17:     if verify_tree(phylo) == 'conflict' then   ▷ LCA conflicts
18:       remove(taxa)                             ▷ Remove suspicious assignment
19:     Check_Lineage(taxa_list)
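The Check_Lineage step of Algorithm 3 can be sketched in Python as follows, assuming each taxon's lineage has already been fetched (e.g., via ETE3's NCBI taxonomy interface) as a root-to-leaf list of taxon IDs; the lineages below are illustrative:

```python
def check_lineage(lineages):
    """Scan the first lineage from right (most specific) to left and
    return the deepest taxon present in every lineage."""
    first, rest = lineages[0], lineages[1:]
    for taxon in reversed(first):
        if all(taxon in lineage for lineage in rest):
            return taxon
    return None

# Two illustrative root-to-leaf lineages that diverge below Influenza A virus (11320).
lin_a = ['1', '10239', '11308', '11320', '380331']
lin_b = ['1', '10239', '11308', '11320', '352520']
check_lineage([lin_a, lin_b])  # -> '11320'
```

In practice the lineage lookup would be delegated to the NCBI taxonomy database; this sketch only shows the right-to-left scan for the common taxon.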

6.2.3 Approach to give suspicious or reliable label to assignments

Let's assume Pi belongs to cluster Cj in NR95. To verify an assignment of a sequence Pi in the NR database, we can compare its assignment with the most frequent assignments of the cluster Cj.

Figure 6.4 Lineage from cluster C11844617, which contains the taxonomic assignments 380331, 352520, and 9606. In the first step of cleaning, we remove 9606, i.e., Homo sapiens. Then, we report Influenza A virus as the common taxon.

Algorithm 4 Check for suspicious or reliable assignments in sequences. Input comes from the BoaG query (Fig. 6.3).
 1: procedure DETECTMISASSIGNMENTS(D)
 2:   NRLength ← |P|                               ▷ Single-assignment sequences (83% of NR)
 3:   while i ≤ NRLength do
 4:     cluster_taxas = top3_taxa(cluster(Pi))
 5:     if taxa(Pi) in cluster_taxas then          ▷ Pi has a single assignment
 6:       print(reliable assignment in Pi)
 7:     else
 8:       taxa_list = cluster_taxas + taxa(Pi)
 9:       phylo ← PhyloTree(taxa_list)
10:       if verify_tree(phylo) == 'conflict' then ▷ Checks lowest common ancestor
11:         print(misassignment in Pi)             ▷ Suspicious assignment

The output of Algorithm 4 is either a suspicious label along with a level of conflict, or a reliable label with a confidence score, which will be discussed in the next section. This process does not require generating a phylogeny tree.

It only requires at most three comparisons between sequence assignment and the top frequent assignments of clusters (line 5). If the assignment is not in the top three most frequent assignments of the cluster, we put them in a list of preliminary misannotation list, then we generate a phylogeny tree from the list of protein assignment and the most frequent taxa in the clusters (line 9). Next, we check if there is significant difference among these assignments by calling a procedure verify_tree at line 10. To that end, we look at the lowest common ancestor (LCA) of the tree and check the conflict level, i.e. root, kingdom, phylym, class, order, and family. Some examples of conflicts are shown in Figure 6.2, e.g. Fig. 6.2(a) shows the protein

ID = XP_025947399 has Alligator mississippiensis as an assignment, while the most frequent assignmentin its cluster are Gallus gallus, Lonchura striata domestica, and Apteryx rowi. The sequence assignment is taxonomically far from the most frequent assignments in its cluster and hence potentially misclassified. If the protein assignment is in the set of the most frequent taxonomic assignment in the respective cluster, we 73 label it as a reliable assignments (see Algorithm6). If the generated tree does not have a conflict level at root, superkingdom, phylum, etc, then this is also not a suspicious assignment.

As shown in line 5 of Algorithm 4, we check whether the taxonomic assignment of the protein Pi is in the list of the most frequent assignments (top three) of its respective cluster Cj. This requires much less time than the previous work (see Fig. 6.6), which generated trees or ran BLAST against the public databases. Therefore, the required time for this algorithm is O(|P|), where |P| is the size of the NR database.
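The constant-time membership check in line 5 can be sketched in Python. This is a minimal illustration, not the dissertation's implementation: the dictionary-based inputs and the helper names (`top3_taxa`, `detect_misassignments`) are assumptions for the example.

```python
from collections import Counter

def top3_taxa(member_ids, taxa_of):
    """Return the three most frequent taxonomic assignments in a cluster."""
    counts = Counter(taxa_of[p] for p in member_ids)
    return [taxon for taxon, _ in counts.most_common(3)]

def detect_misassignments(proteins, cluster_of, clusters, taxa_of):
    """Flag proteins whose assignment is outside the top-3 cluster taxa.

    Flagged proteins are only candidates; the full method additionally
    builds a small phylogeny and checks the lowest common ancestor.
    """
    suspicious = []
    for p in proteins:
        top3 = top3_taxa(clusters[cluster_of[p]], taxa_of)
        if taxa_of[p] not in top3:  # at most three comparisons per protein
            suspicious.append(p)
    return suspicious
```

With a cluster dominated by Gallus gallus, Lonchura striata, and Apteryx rowi, a lone Alligator mississippiensis assignment would be flagged as a candidate, mirroring the Fig. 6.2(a) example.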

6.2.4 Proposing the most reliable assignments for detected sequences

Algorithm 5 Annotation correction: the most probable annotation for the misclassified sequences in NR95. Input comes from the BoaG query (Fig. 6.3).
1: procedure MOSTPROBABLE(Pi, p, c)
2:   cluster ← Cj such that Pi ∈ Cj
3:   top_ann ← ClusterMostProbable(cluster, 95)
4:   return top_ann
5: procedure CLUSTERMOSTPROBABLE(cluster, similarity)
6:   if size(cluster) ≥ c then
7:     for Pi in cluster do                          ▷ w: weighting factor
8:       sum ← sum + man(taxa(Pi)) + w · comp(taxa(Pi))
9:     top1 ← max((man(taxa(Pi)) + w · comp(taxa(Pi))) / sum)
10:    if top1 then
11:      return top1                                 ▷ clusters nr95
12:  else
13:    return ClusterMostProbable(cluster, 90)       ▷ utilize nr90 clusters

We utilized the annotation provenance and the consensus information of clusters at 95% and 90% sequence similarity to identify misclassified sequences, if any. Otherwise, we label the assignment as a reliable assignment that conforms to the cluster information. We give a lower weight to the assignments from 90% clusters. As shown in line 2 of Algorithm 5, the protein Pi belongs to the cluster Cj. The algorithm then calls the procedure ClusterMostProbable (line 5), which starts with NR95. If this procedure cannot find the most probable assignment, it recursively checks NR90 at line 13. Similarly, for the other similarity levels, if the algorithm cannot find a reliable assignment, it checks the clusters with lower similarity.
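The weighted scoring in Algorithm 5 can be sketched as follows. The function name, the dictionary inputs, and the default weighting factor `w` are illustrative assumptions; the dissertation does not state a value for `w`.

```python
def most_probable_taxon(manual_counts, comp_counts, w=0.5):
    """Pick the taxon with the highest weighted support in a cluster.

    manual_counts / comp_counts map taxon -> number of manual /
    computational assignments; w down-weights computational evidence.
    Returns (taxon, confidence_score), where the confidence is the
    taxon's share of the total weighted support, or None if the
    cluster carries no evidence at all.
    """
    taxa = set(manual_counts) | set(comp_counts)
    scores = {t: manual_counts.get(t, 0) + w * comp_counts.get(t, 0)
              for t in taxa}
    total = sum(scores.values())
    if total == 0:
        return None
    best = max(scores, key=scores.get)
    return best, scores[best] / total
```

Weighting manual evidence above computational evidence matches the intent of giving curated databases more say in the proposed assignment.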

6.2.5 Simulated and literature dataset

To test the performance of the proposed work, we built a simulated dataset and report precision and recall. We assigned a random taxon to a sample of one million sequences and then used our approach to check whether the mislabeled sequences could be identified. We repeated this experiment three times and reported the average precision and recall. The probability of randomly drawing the same taxon as the top assignment in the cluster is close to zero, as the number of taxonomic assignments in NCBI's taxonomy database is 729,233. We also tested our approach on the UniRef dataset provided by UniProt [5] and the literature dataset introduced by Kozlov et al. [40]. In addition, we compared the current approach with the previous work [13] and reported the running time. Further details can be accessed from https://github.com/boalang/quality.
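The simulation protocol described above (randomly relabel a sample, run the detector, score the result) can be sketched as a small harness. The names and the oracle detector in the test are hypothetical stand-ins for the real detection pipeline.

```python
import random

def simulate(proteins, all_taxa, detector, n_corrupt, seed=42):
    """Corrupt n_corrupt labels at random, run the detector, and report
    precision/recall of the detected set against the known corruptions.

    proteins maps protein id -> true taxon; detector takes the corrupted
    mapping and returns the ids it believes are mislabeled.
    """
    rng = random.Random(seed)
    truth = set(rng.sample(sorted(proteins), n_corrupt))
    corrupted = {p: (rng.choice(all_taxa) if p in truth else t)
                 for p, t in proteins.items()}
    detected = set(detector(corrupted))
    tp = len(detected & truth)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(truth)
    return precision, recall
```

Averaging the returned pair over several seeds reproduces the repeated-experiment protocol in miniature.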

6.3 Results

In this section, we present the data quality improvement of clusters at different sequence similarity levels. Then, we present the suspicious or reliable taxonomic assignments. Next, we discuss how we propose the most reliable assignments for the detected misclassified sequences. Then, we summarize a manual analysis of a sample of the NR to reflect on the proposed approach. We also present the performance of our work on the simulated and literature datasets. Finally, we describe the running time.

6.3.1 Improve data quality of clusters at different sequence similarity

In this section, we present the highly suspicious sequence assignments for NR clusters at different similarity levels. It is important to note that the members of the clusters at each similarity level are the representative sequences at the next higher similarity level. For example, the representative sequences of the NR95 clusters were used to generate the NR90 clusters. As shown in Table 6.1, we identified 76,660 clusters at 95% sequence similarity in which there was an inconsistency among sequences, i.e. a root conflict. The bold numbers show the conflict levels that are highly suspicious. At a lower sequence similarity level, the probability of misclassification at a lower rank in the taxonomy is higher. For example, clusters at 80% sequence similarity have a higher number of Order conflicts than the previous similarity level, i.e. 85%. This is due to the lower threshold, which may result in more diverse species in a cluster at the 80% similarity level.

6.3.2 Identified suspicious or reliable taxonomic assignments

Table 6.2 shows that more than four million proteins were identified as potentially misclassified. These proteins have single assignments, some of which might have originated from multiple databases, that were not reported before. The first column shows the size, i.e. the number of protein sequences, of each cluster. The second column shows a root conflict between the protein assignments and the most frequent taxonomic assignments in their respective clusters, i.e. NR95. For example, in line one, we found 3,952 protein sequences whose taxonomic assignments have a root conflict with the most frequent taxonomic assignments in their respective clusters; these clusters have fewer than ten members. Similarly, in line two, we found 13,816 proteins that have a root conflict with the three most frequent assignments in clusters that have between 10 and 20 members. The larger the size of the cluster, the higher the confidence in the detected misclassified sequences. Larger clusters also have more data for proposing the most reliable assignments.

In this table, the seven million genus conflicts are not shown, since the probability of false-positive misclassification is higher there than at other conflict levels such as root and kingdom.

It would be important for the public databases to have a reliability label for each taxonomic assignment. We found that only about 36 million out of 174 million proteins in the NR (20%) are highly reliable. Such a trusted label will give confidence to researchers when they transfer an annotation from a public database to an unknown or query sequence using BLAST. To that end, we made the assumption that if a taxonomic assignment of a protein in the NR is the same as the most frequent taxonomic assignment of its respective cluster at 95% sequence similarity, then its assignment is highly reliable. Fig. 6.5 shows that 36M protein sequences in the NR database, i.e. 20% of the dataset, are highly reliable. In addition, this will address the challenge of the cumbersome and lengthy process of manually verifying computationally annotated sequences, including taxon, protein function, and literature curation [4]. At the time of this writing, there are more than 560 thousand verified sequences on UniProtKB, while more than 188 million sequences await manual annotation [5]. Therefore, an automatic verification technique for taxa will improve some steps of this time-consuming process.

Figure 6.5 Comparison of sequence assignments with the most frequent assignments in their respective clusters. The right-hand side (red) shows the potential misclassifications, i.e. 6.3% of the NR, while 36M protein sequences in the NR database, i.e. 20% of the dataset, are highly reliable.

6.3.3 Propose taxonomic assignment for the identified mislabeled sequences

In the previous section, we described sequences that were identified as potentially mislabeled. Here, we attempt to propose the most probable assignment from the consensus information of the clusters with at least 95% and 90% sequence similarity. Table 6.3 shows the proposed taxonomic assignments for the detected misassignments shown in Fig. 6.2. The fourth and fifth columns are the numbers of manual and computational assignments for the current taxon. The last column provides a confidence score (CS) for the proposed taxon (equation 6.2.1). For example, line one shows a confidence score of 0.84 for the protein ID RAV24885, whose assignment had a root violation with the most frequent assignments of its cluster. The current assignment is Staphylococcus warneri, for which the approach proposed Escherichia coli as the most probable assignment.

6.3.4 Performance on simulated and real-world dataset

We used the standard metrics of precision and recall to analyze the performance of our classification approach: precision = TP / (TP + FP); recall = TP / (TP + FN). Our approach showed a precision of 98% and a recall of 95%. True positives (TP) were the proteins that were mislabeled and that our approach correctly identified. False negatives (FN) were the mislabeled sequences that the approach did not identify. The main reason for the false negatives is that taxonomic assignments change over time; since we used the NCBI taxonomy database to generate the tree, we could not generate a tree for some older taxonomies. We also tested our approach on the literature dataset introduced by Kozlov et al. [40]. Their dataset contained rRNA sequences. The approach proposed by Kozlov et al. showed a higher recall, as they compared all query sequences against the public databases, while the proposed approach relies heavily on the consensus information of the clusters. Since their dataset was much smaller than the public reference databases, there were sequences with smaller numbers of proteins in the 95% clusters. We also tested our data cleaning approach on the UniRef 90% similarity clusters [6] and found several misclassifications. Examples are shown in the repository here:

(https://github.com/boalang/quality/tree/main/evaluation)
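The precision and recall definitions above can be checked with a two-line helper; the counts in the test are toy values, not the dissertation's data.

```python
def precision_recall(tp, fp, fn):
    """precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)
```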

6.3.5 Manual study

We also conducted a manual study of a sample of 1000 proteins in the NR database and found that about 22% of the sample had misclassifications. We found six root violations, nine kingdom violations, 10 phylum violations, and 122 family violations in the sample. We also found 460 proteins whose assignments had a genus conflict with the most frequent taxa in their clusters. In this sample, 33% of the proteins did not have a violation, and their taxonomic assignments were indeed among the most frequent assignments of their clusters at 95% similarity. The manual observations support the cleaning, detection, and correction approaches of Sections 6.3.1, 6.3.2, and 6.3.3.

Figure 6.6 Comparison of the running time of the proposed work with the previous phylogeny-based method [13].

6.3.6 Running time

The detection approach, i.e. Algorithm 4, in this work is computationally low-cost, as it compares the assignment of a query protein with the three most frequent assignments in its cluster at 95% similarity. This operation needs only three comparisons of the protein assignment with the three most frequent assignments of the cluster, while previous works for detecting mislabeled taxa required generating a phylogeny tree [13] or comparing a query sequence against the public databases, both of which are computationally expensive tasks [40].

Once it detects misassignments, it generates a tree for the detected sequences, which are a small subset of the NR. The required time to generate the tree depends on the number of leaves, i.e. the species and their taxonomic assignments.

Fig. 6.6 shows the comparison in running time between the proposed work and the previous work based on phylogeny trees [13]. In the proposed approach, we do not generate a tree in the first step; we only compare the sequence assignment with the three most frequent assignments in the cluster. Next, we generate a tree if the assignment is not among the top three most frequent assignments (line 9 of Algorithm 4). The number of taxonomic assignments was at most four, while in the previous technique there was on average a significantly larger number of taxonomic assignments. The comparison between the proposed approach and previous techniques was made on a local iMac (late 2015) with Core i7 CPUs and 32GB RAM.

The following shows the total time for running the algorithms, including preprocessing and data cleaning in clusters and sequences. Tpreprocessing is a one-time cost for generating the dataset and running the BoaG queries. The BoaG analysis was run on a Hadoop cluster, which significantly speeds up querying the dataset at scale. Tcluster time is the time required for assessing the taxonomic assignments of all clusters at the different sequence similarity levels, which can be checked in parallel. Ttree generation represents the time needed to construct a phylogeny tree for the entire NR, which can also be run in parallel. Here, |P| is the size of the protein database in NR.

Ttotal = Tpreprocessing + Σ_{s=65}^{95} Tcluster time + Σ_{i=1}^{|P|} Ttree generation

6.4 Discussion and conclusion

6.4.1 Applications and limitations

We used BoaG to explore NCBI's NR sequences and clusters at scale; however, any language such as Python could be utilized. The approach relies on the consensus information of the clusters. It can greatly help large public databases improve the quality of their annotations. However, for smaller datasets it cannot provide high-confidence results, due to the smaller cluster sizes. Some sequences remain unclustered; to mitigate that, we utilized the clusters with lower similarity and assigned a lower score. There is a preprocessing cost to generate the dataset using CD-HIT. However, this is a one-time cost, and the dataset can be incrementally updated.

6.4.2 Conclusion

In this work, we proposed a scalable algorithm to clean NR clusters at different sequence similarity levels. For these clusters, we also provided the common taxonomic rank and common taxonomic name for all members of the cluster. We addressed misclassified sequences in the NR database that have only one taxonomic assignment. Our approach assigns each protein a label so that researchers can easily identify potentially misassigned sequences. We utilized the consensus information of the NR clusters at 95% and 90% sequence similarity. The technique proposed in this work could be utilized to improve the data quality of other kinds of annotations, such as protein functions, or of any dataset that can be clustered by means unrelated to its metadata. It can also be utilized for RNA sequences and is not limited to protein databases.

Table 6.1 Identified misclassified sequences at different similarity levels.

Dataset  Root    Kingdom  Phylum   Class    Order    Family
NR95     76,660  287,389  662,541  340,977  889,357  1,432,165
NR90     29,418  130,920  528,437  208,349  408,261  777,598
NR85     21,337  119,661  518,318  209,607  381,079  725,249
NR80     21,230  147,052  556,936  279,392  560,039  1,009,928
NR75     18,431  160,484  563,198  310,322  545,169  927,156
NR70     16,463  178,017  560,494  330,082  513,525  819,236
NR65     15,696  205,334  554,307  345,162  479,654  717,606

Note: The bold numbers are clusters that have highly suspicious assignments.

Table 6.2 Potentially misclassified proteins in the NR database that have a single assignment. These numbers are for clusters with 95% sequence similarity (NR95).

Cluster Size  Root    Cellular  Kingdom  Phylum   Class    Order    Family
< 10          3,952   4,241     25,878   73,035   51,611   154,134  205,142
10-20         13,816  5,664     25,452   32,625   58,851   194,518  295,284
20-30         12,107  3,227     11,842   14,049   41,661   137,233  187,329
30-40         11,000  1,759     9,302    7,363    17,231   90,526   126,109
40-50         9,725   1,191     6,611    5,974    12,534   67,303   81,758
> 50          79,699  7,557     73,858   41,357   199,183  502,576  1,092,599

Note: The first column is the cluster size, i.e. the number of sequences.

Table 6.3 Proposed taxa for detected misclassified sequences, as shown in Figure 6.2. The fourth and fifth columns are the numbers of manual (M) and computational (C) assignments. The last column is the confidence score (CS) obtained by utilizing the consensus information of clusters.

Protein ID    Current taxa                Proposed taxa         M  C  CS
RAV24885      Staphylococcus warneri      Escherichia coli      0  1  0.84
XP_025947399  Alligator mississippiensis  Gallus gallus         2  0  0.41
XP_011853137  Mandrillus leucophaeus      Homo sapiens          2  0  0.40
KOU22909      Streptomyces sp. WM6368     Streptomyces sp. Mg1  0  2  0.42
WP_111758327  Paucimonas lemoignei        syringae              1  1  0.55
1AQB          Sus scrofa domesticus       Sus scrofa            0  1  0.50

M: Manual assignments, C: Computational assignments, CS: Confidence Score

CHAPTER 7. DATA CLEANING TECHNIQUE FOR PROTEIN FUNCTIONS IN THE LARGE-SCALE PUBLIC DATABASES

7.1 Introduction

Public protein databases, such as those of the National Center for Biotechnology Information (NCBI), provide great resources for scientists to use the current knowledge of functional annotations to understand and predict unknown sequences. However, functional annotation errors in the public databases due to computational error have caused a compounding quality issue. The false-positive matching of rRNA pyrosequencing reads to the NCBI non-redundant protein database (NR) is close to 90% [74].

The NCBI NR database encompasses protein sequences from non-curated (low-quality) and curated (high-quality) databases. It contains non-redundant sequences from GenBank translations (i.e. GenPept) together with sequences from other databases (RefSeq [61], PDB [16], SwissProt [17], PIR [77], and PRF). NR removes 100% identical sequences and merges their annotations and protein IDs. These annotations can be utilized to assign functional annotations to a query sequence.

Taxonomic misprediction has been addressed in the literature. For example, Kozlov et al. proposed a phylogeny-based approach to detect and correct misassigned sequences [40]. In previous work, we also addressed taxonomic classification in the NR database [13]. However, to the best of our knowledge, there has not been a large-scale work addressing mispredicted functional annotations in the public databases.

In this work, we address functional annotation quality by detecting potentially mispredicted annotations using the Protein Ontology (PRO) [56] and the Gene Ontology [?]. Ontologies have been utilized to express knowledge; they can also be leveraged for data quality purposes. Sequence clustering of the public databases is another useful source of information for exploring different biological analyses such as family classification, functional annotation, systems biology, structural genomics, and phylogenetic analysis [71]. We also address the data cleaning issue of clusters and provide a highly reliable ontology term for each cluster at different sequence similarity levels.

SPARQL has been used to query ontologies [33]; however, this approach is too time-consuming for reasoning about large-scale data cleaning of functional annotations. Different strategies have been proposed in the literature. For example, Bengtsson-Palme et al. [14] proposed a cleaning strategy that flags manual annotations. However, this requires significant human effort, and it would be very challenging to address the huge number of new sequences.

In this work, we present a computational method that abstracts ontology graphs into a lower-dimensional network representation that makes reasoning about inconsistencies easier. We have utilized network embedding to detect inconsistencies in protein functions. Here, we attempt to improve the quality of functional annotations and make the following contributions:

• Scalable identification of mispredicted protein functions in the NR database. We provide lower-dimensional representations of the GO and PRO ontologies that are utilized to identify inconsistencies.

• A scalable algorithm that improves the data quality of clusters at different similarity levels, i.e., the NR95, NR90, ..., NR65 clusters, and provides a common ontology term for each cluster.

• Reliable functional annotations for the detected sequences, derived from the provenance and consensus information of clusters at 95% and 90% sequence similarity.

We found about 61 million proteins in the NR database (about 35%) that have more than one distinct functional annotation. Among these proteins, we found that about 7% of the functional annotations appear to be mispredicted; this is a conservative percentage for the NR annotations. We also found more than 27 million proteins in the NR database (about 14%) whose labels contain unknown, hypothetical, or unnamed. In the next step, we use the frequencies, the provenance of the annotations, and the consensus information of the clusters at 95% and 90% sequence similarity to propose the most reliable functional annotations. We also provide common ontology terms for the different similarity levels, i.e., the NR95, NR90, ..., NR65 clusters.

The rest of the paper is organized as follows. In Section 2, we present the methods, the dataset, and the detection and correction approaches. In Section 3, we discuss the results of functional data cleaning at the sequence and cluster levels. In Section 4, we conclude with suggestions for the future.

7.2 Methods

In this section, we describe an overview of our data cleaning approach for functional annotations. Then, we describe the dataset generation and how we generate an ontology graph from functional annotations. Next, we discuss our detection strategy to identify potentially mispredicted functional annotations. Then, we describe our approach to propose the most reliable annotations based on the provenance of the annotations and the clusters at 95% and 90% sequence similarity. Finally, we describe the simulated and literature datasets used to test our approach.

7.2.1 Dataset generation

The NR dataset was downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/) as described in the previous work [13]. Functional annotations were obtained from the definition line of each protein sequence. For example, for the protein with ID WP_000135199, the definition line contains a list of database IDs, protein functions, and taxonomic assignments in brackets, as follows:

>WP_000135199.1 MULTISPECIES: 30S ribosomal protein S18 [Bacteria] NP_313205.1 30S ribosomal subunit protein S18 [Escherichia coli O157:H7 str. Sakai]

This sequence belongs to a cluster with ID 77576702 at NR95 that contains 54 members.
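Extracting (accession, function, taxon) triples from such a definition line can be sketched with a regular expression. This is a simplified illustration: real NR deflines separate the merged entries with control characters and contain edge cases (missing taxa, nested brackets) that this sketch ignores.

```python
import re

# Each entry in a (simplified) NR defline looks like
# "<accession> <function> [<taxon>]".
ENTRY = re.compile(r"(\S+)\s+(.*?)\s*\[([^\]]+)\]")

def parse_defline(defline):
    """Split an NR definition line into (accession, function, taxon) tuples."""
    return ENTRY.findall(defline.lstrip(">"))
```

Applied to the example defline above, this yields one tuple per merged database entry, which is the raw material for the per-protein annotation lists used later.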

We used CD-HIT [31] to cluster the NR dataset at 95% sequence similarity with an 80% sequence length overlap threshold. This threshold means all members of a given cluster have at least 80% length overlap; it is important because it improves the consistency among annotations [71]. Then, we took the representative sequences of NR95 and generated NR90, clusters with at least 90% sequence similarity. We followed the same procedure for the other similarity levels down to 65%. Therefore, cluster ∈ {NR95, NR90, NR85, NR80, NR75, NR70, NR65}. The data acquisition, preprocessing, and clustering (NR95, NR90, ..., NR65) took about seven days; the most time-consuming part was generating NR95. We took the NR FASTA files, whose definition lines contain annotations from different databases, and generated the BoaG format [12].
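The cascade of clustering runs described above can be expressed as a sequence of cd-hit invocations, each taking the representatives of the previous level as input. The sketch below only builds the command lines rather than running them; the file names are hypothetical, `-c` is cd-hit's identity threshold, and `-aL 0.8` requests 80% alignment coverage of the longer sequence.

```python
def cdhit_cascade(levels=(0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65)):
    """Build the cd-hit command for each similarity level.

    Each level clusters the representative sequences produced by the
    previous level, mirroring the NR95 -> NR90 -> ... -> NR65 pipeline.
    """
    cmds, previous = [], "nr.fasta"
    for c in levels:
        out = f"nr{int(round(c * 100))}"
        cmds.append(["cd-hit", "-i", previous, "-o", out,
                     "-c", str(c), "-aL", "0.8"])
        previous = out  # next level clusters these representatives
    return cmds
```

The commands could then be executed with `subprocess.run`, assuming the cd-hit binary is on the PATH; for the real NR, memory and thread options would also be needed.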

Figure 7.4 shows an overview of the proposed approach. We take functional annotations from the NR database; we used a BoaG query to generate the list of functions from the NR database, as shown in Figure 7.1. In this work, we used the BoaG infrastructure, which is publicly available at http://boa.cs.iastate.edu/boag/. The BoaG infrastructure provides an easy-to-use web interface to query large-scale datasets in a few minutes. We could also utilize any general-purpose programming language, such as Python, to retrieve the NR annotations.

s: Sequence = input;
protOut: output sum[string][string] of int;
distinctTax := function (seq: Sequence): int {
    taxSet: set of string;
    foreach (i: int; def(seq.annotation[i]))
        add(taxSet, seq.annotation[i].tax_id);
    return (len(taxSet));
};

foreach (i: int; def(s.annotation[i]))
    protOut[s.annotation[i].defline][s.seqid] << 1;

# Following are a few lines of output
protOut[(d)CMP kinase][WP_004849475] = 21
protOut[16S rRNA methyltransferase][WP_002298874] = 19
protOut[2-component regulator][WP_001251544] = 131

Figure 7.1 BoaG script listing the protein functions for proteins that have more than 10 distinct taxonomic assignments. The output of this query is used to generate the top protein functions. The BoaG query and its result are publicly available on the cluster here: http://boa.cs.iastate.edu/boag/?q=boa/job/public/40

Figure 7.2 Ontology graph generated from the list of functional annotations of the protein sequence with ID WP_000135199. The node with label ulaA appears to be mispredicted.

Figure 7.3 Ontology graph generated from the list of functional annotations of the protein sequence with ID WP_000184067. The node with label ybdZ appears to be mispredicted. Unlike the other functions, it also appeared only once in the list of annotations.

7.2.2 Detecting mispredicted functions

As shown in Algorithm 6, we iterate over all the sequences in the dataset. In the first step, we map the functional annotation text to nodes in the ontology graph. Then, we generate a knowledge graph from the list of these nodes for each protein sequence.

Algorithm 6 Detecting mispredicted functional annotations. Input comes from the BoaG query shown in Figure 7.1.
1: procedure BUILD_EMBEDDING(FunctionList)
2:   for func in FunctionList do
3:     emb ← emb + Node2Vec(func)                  ▷ using Node2Vec
4:   return emb                                    ▷ vector represents ontology
5: procedure DETECT_MISPREDICTED(C)               ▷ detect suspicious functions
6:   NRSize ← |P|                                  ▷ size of NR database
7:   while j ≤ NRSize do
8:     function_list ← get_annotation(Pj)          ▷ list of functions
9:     emb ← Build_Embedding(function_list)
10:    if verify_embedding(emb) == 'conflict' then ▷ PRO conflicts
11:      remove(func)                              ▷ remove suspicious function

7.2.2.1 Map functional annotations to the ontology graph

In the first step, we mapped all the functional annotations to the closest node in the PRO or GO knowledge graph. We obtained the obo-formatted files for these ontologies. We utilized Goatools, a Python library for handling Gene Ontology (GO) and Protein Ontology (PRO) terms [38], for parsing and visualizing the ontology graphs. We used Word2Vec [32] with the PRO names, definitions, and synonyms to generate vectors to compare against the functional annotation text. In this process, we found that several functions can map to a single node in the ontology graph. We also found that for some protein functions, the equivalent ontology term was closer in GO than in PRO; therefore, we utilized both ontologies. For example, for the protein sequence WP_000135199.1, the functions 30s ribosomal protein s18, ribosomal protein s18, and 30s ribosomal protein s18 putative are mapped to a node in the graph with ID PR:000023871. These functions appeared 20,116, 2,718, and 1,240 times, respectively, originating from multiple databases.

Figure 7.4 An overview of the approach: functional annotations are extracted from the NR database with BoaG, an ontology graph is built, misclassifications are checked, and the NR clusters are used to propose a list of reliable functional annotations. Each node in the graph is represented as a vector that preserves its semantics.

7.2.2.2 Graph Embedding

Node embedding has been applied in several applications, such as evaluating biological knowledge [70], protein-protein interactions [79], and relation prediction for ontology population [41]. Here, we used both the Protein Ontology (PRO) and the Gene Ontology (GO) to detect inconsistencies and conflicting functions among the list of annotations assigned to a single protein sequence. We represent each node in the graph with a vector that preserves the semantics of the node in the ontology graph. At the time of this writing, the PRO and GO terms contained 331,980 and 7,963,579 different nodes or annotations, respectively.

Figure 7.2 shows the graph generated from the list of functional annotations for the protein sequence WP_000135199.1 discussed in the previous section. The list also contained a function named ascorbate-specific pts system eiic component, originating from the computational database GenBank with ID ARE54176.1, that appears to be mispredicted: in terms of semantics and domain knowledge, it is far from the other terms in the graph. This protein is also annotated with rpsr, which seems to be an acronym of the other functions, and it contains several hypothetical protein labels among its functional annotations. In another example, for the protein sequence with ID WP_000184067, the ontology graph generated from the list of functional annotations is shown in Figure 7.3. The node with label ybdZ appears to be mispredicted; unlike the other functions, it also appeared only once in the list of annotations.

7.2.3 Correcting mispredicted functions

As shown in line 7 of Algorithm 6, we iterate over all the proteins in the NR database. In line 9, we call the procedure Build_Embedding, which takes a list of protein functions. In line 3, we use Node2Vec to generate the embedding vector for each protein function annotation. If an embedding differs significantly from the rest of the graph, we label it as a suspicious functional annotation and remove it from the list of annotations for the protein Pj in line 11. Later on, we utilize the consensus information of clusters at 95% and 90% sequence similarity to propose the most reliable protein functions. Otherwise, we label the function as a reliable annotation that conforms to the cluster information. We give a lower weight to the functional annotations from 90% clusters.

7.2.4 Manual study

We randomly took a sample of 1000 protein sequences and manually studied the level of functional misannotation. We utilized the Protein Ontology website and checked the different functional annotations that originated from multiple databases. For example, for a given protein sequence, we searched each functional annotation and found the closest node in the ontology. Sometimes we needed to check the synonyms and definition of the ontology name to find the proper node in the PRO graph. Then, we generated the ontology graph from the list of PRO IDs found in the first step. Finally, we checked whether there was a significant difference in the list of PRO names highlighted in the graph. (https://proconsortium.org/app/visual/cytoscape/pro/PR:000023871,PR:Q4ZYX0,PR:000024175,PR:P62269-1)

7.2.5 Literature and simulated dataset

We tested the proposed approach on the literature dataset presented by [74], in which misannotations in a subset of the NR database were explored. To test the performance of the proposed work, we also built a simulated dataset and reported precision and recall: we took a sample of the NR database, randomly reassigned protein functions, and used our approach to check whether we could identify the misannotations. The PRO ontology contains 331,980 different classes, from which we drew a random function to assign to each sampled sequence.

7.3 Results

In this section, we present the proteins that potentially have mispredicted functional annotations. Next, we present the most reliable functional annotations for the sequences and clusters at different sequence similarity levels. We also provide a common ontology name for each cluster at different sequence similarity levels. Then, we describe our findings from the manual study of a 1000-protein sample of the NR. We also highlight the database of origin for each mispredicted function. Finally, we discuss the case study, the manual analysis, and the running time of the approach.

Table 7.1 Number of records and types of protein sequences from different public databases, based on primary keys.

Database  Protein Number  AA number       Total taxa   Total functions
RefSeq    119,371,598     45,377,890,558  213,310,807  230,510,702
GenBank   54,692,570      18,149,422,963  66,206,578   63,301,649
UniProt   78,787          23,643,873      99,596       315,610
PDB       90,408          29,187,501      245,987      186,141

7.3.1 Detecting potential functional misannotations

We found about 61 million proteins in the NR database (about 35%) that have more than one distinct functional annotation. The RefSeq database contains 230 million total protein functions, while the NR annotations originating from the GenBank database number about 63 million, as shown in Table 7.1.

As described in Section 7.2, we analyzed the list of functional annotations for each sequence to reason about potential inconsistencies among them. We found that about 7% of the NR annotations are inconsistent with each other. Table 7.2 provides several examples of such detected annotations. More details and the full list of mispredicted functions are available in the GitHub repository.

We also found that more than 27 million proteins were labeled as hypothetical, unknown, or unnamed proteins; we removed these in our cleaning process.
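The removal of such uninformative labels amounts to a case-insensitive substring filter. A minimal sketch follows; the term list and function name are illustrative rather than the exact ones used in the pipeline.

```python
# substring patterns treated as uninformative (illustrative list)
UNINFORMATIVE = ("hypothetical protein", "unknown", "unnamed protein")

def clean_annotations(annotations):
    """Drop annotations that carry no functional information (case-insensitive)."""
    return [a for a in annotations
            if not any(term in a.lower() for term in UNINFORMATIVE)]
```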

7.3.2 Correcting mispredicted functional annotations

Several examples of potentially mispredicted functional annotations are shown in Table 7.2. The first row shows protein NP_008839.2, which carries the functional annotations gtpase and unknown. The most reliable function for this sequence, based on the frequency and consensus information of the NR95 clusters and on manually reviewed databases such as RefSeq, is ras-related c3 botulinum toxin substrate 1 isoform rac1.

The NR95 clustering contains about 88 million clusters, in which 64 million of the 174 million proteins remain unclustered at 95% sequence similarity. For these unclustered sequences, the correcting approach falls back to the clusters at 90% sequence similarity and assigns them a lower score.
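The consensus step can be sketched as a weighted majority vote over a cluster's annotations, with curated sources such as RefSeq counted more heavily and NR90 fallback clusters given a reduced cluster-level weight. The weight values, the `DB_WEIGHT` table, and the function name are hypothetical choices for illustration, not the exact scoring used.

```python
from collections import Counter

# hypothetical per-database reliability weights; curated sources count more
DB_WEIGHT = {"refseq": 2.0, "swissprot": 2.0, "genbank": 1.0, "pdb": 1.0}

def propose_function(cluster_annotations, cluster_level_weight=1.0):
    """Pick the highest-scoring function for a cluster by weighted frequency.

    cluster_annotations: list of (function, source_db) pairs over all cluster
    members; pass cluster_level_weight < 1.0 for NR90 fallback clusters.
    """
    scores = Counter()
    for func, db in cluster_annotations:
        scores[func] += DB_WEIGHT.get(db, 1.0) * cluster_level_weight
    return scores.most_common(1)[0][0]
```

Under this scheme a single curated RefSeq annotation can outweigh an isolated computational one, matching the intuition behind the NP_008839.2 example above.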

Table 7.2 Protein function misassignments and the proposed corrections.

Protein ID       Potential mislabeled function                  Proposed function                                             Conflict level
NP_008839.2      gtpase/unknown                                 ras-related c3 botulinum toxin substrate 1 isoform rac1       gene
WP_001056273.1   monosaccharide-transporting atpase             ribose abc transporter substrate-binding protein rbsb         sequence
WP_003092007.1   glycosyltransferase_gtb_type superfamily       rhamnosyltransferase                                          sequence
WP_000127927.1   peptidoglycan o-acetyltransferase              d-alanyl-lipoteichoic acid biosynthesis protein dltb          sequence
WP_000135199.1   ascorbate-specific pts system eiic component   30s ribosomal protein s18                                     gene
WP_003409891.1   mw2488                                         o-acetyltransferase oata                                      sequence
WP_000182899.1   pts system protein                             galactitol-specific phosphotransferase enzyme iia component   gene
WP_001190781.1   orf_o230                                       atp-binding protein                                           organism-gene
WP_001019484.1   m42                                            glutamyl aminopeptidase family protein peptidase              sequence
WP_000358960.1   rplr                                           50s ribosomal protein l18                                     organism-gene
WP_000184067.1   enterobactin biosynthesis protein ybdz         mbth family protein, antibiotic transporter                   modification

7.3.3 Common PRO name for NR clusters at different similarity

We provide a common PRO name for each cluster at each sequence similarity level. For example, the sequence discussed in the method section (Figure 7.2) belongs to cluster 77576702 at NR95, which contains 54 members. In this cluster, 30S ribosomal protein S18 is the most frequent function, appearing more than 22,000 times across all members.

We removed the annotations hypothetical protein PAP10c_2698, Putative uncharacterized protein, and Tail-specific protease precursor, as they are not very informative or are potentially mispredicted. Each of these annotations appeared in the cluster only once.

7.3.4 Manual analysis

We conducted a manual study of a sample of 1000 proteins in the NR database and found that about 7% of the sample had mispredicted functional annotations. About 10% of the samples had unknown, hypothetical, or acronym-only functional annotations. In these samples, after associating each functional annotation with its PRO name in the graph, we found that multiple annotations often mapped to a single node in the graph, which made it easier to reason about inconsistencies. We also found that almost all sequences had redundant annotations. For example, for the protein sequence WP_000135199.1, the functions 30s ribosomal protein s18, ribosomal protein s18, and 30s ribosomal protein s18 putative appeared 20,116, 2,718, and 1,240 times respectively, originating from multiple databases. This protein also carried 26 hypothetical protein and unnamed protein product annotations. The results and further details can be found in the GitHub repository.

7.3.5 Provenance of mispredicted annotations

For each protein sequence, we identified the database of origin of the mispredicted annotations. For example, for the protein sequence WP_000135199.1, the functions 30s ribosomal protein s18, ribosomal protein s18, and 30s ribosomal protein s18 putative all map to the graph node PR:000023871. Figure 7.2 shows the graph generated from the list of functional annotations for this protein sequence. These functions appeared 20,116, 2,718, and 1,240 times respectively, originating from multiple databases. The protein also carried a function named ascorbate-specific pts system eiic component that originated from the computationally annotated GenBank record ARE54176.1 and appears to be mispredicted.
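Tracing provenance reduces to grouping a protein's annotation occurrences by their source database once each occurrence has been mapped to a PRO node. A minimal sketch, with `provenance_of` and the record layout as hypothetical names:

```python
from collections import defaultdict

def provenance_of(annotation_records, pro_node):
    """Group a protein's annotation occurrences that map to one PRO node
    by their database of origin, so a misprediction can be traced back.

    annotation_records: list of (function, source_db, mapped_pro_node) tuples.
    """
    by_db = defaultdict(list)
    for func, db, node in annotation_records:
        if node == pro_node:
            by_db[db].append(func)
    return dict(by_db)
```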

7.3.6 Case Study

To identify a small subset of clusters for closer study of their functional annotations, we explored clusters containing the genus Glycine in more depth.

The five clusters we identified were at the 95% similarity level and satisfied three criteria: each contained at least one protein from the genus Glycine, at least one other genus but no more than eight genera in total, and no proteins functionally annotated as ribosomal proteins. The cluster IDs identified at 95% similarity are 13741561, 26559833, 37311095, 64827049, and 81606138. Each of these clusters happened to contain a single protein sequence taxonomically classified in the genus Glycine.

The protein sequence IDs in cluster 13741561 are: 2GGD, 2PQB, 2PQD, AAB36219, AAL67577, ABY76172, AEP17820, AII71485, AMM72836, ARW80139, ARW80140, CBG92008, KIF05476, Q9R4E4, WP_037405162, WP_050745479, and WP_117368106. In this cluster, CP4 EPSPS and EPSPS CTP, partial are the most frequent functional annotations among all members.
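The cluster-selection criteria above can be expressed as a simple filter over cluster membership records. This sketch assumes a hypothetical in-memory representation (cluster ID mapped to a list of (genus, function) member records); the function name is illustrative.

```python
def select_case_study_clusters(clusters):
    """Apply the case-study filter: keep clusters whose members include at
    least one Glycine protein, span between 2 and 8 distinct genera, and
    carry no ribosomal-protein annotations.

    clusters: dict mapping cluster ID -> list of (genus, function) records.
    """
    selected = []
    for cid, members in clusters.items():
        genera = {g for g, _ in members}
        if "Glycine" in genera and 2 <= len(genera) <= 8 \
                and not any("ribosomal" in f.lower() for _, f in members):
            selected.append(cid)
    return selected
```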

7.3.7 Performance on the simulated and literature dataset

We used the standard metrics of precision and recall to analyze the performance of our classification approach, computed as precision = TP / (TP + FP) and recall = TP / (TP + FN). Our approach achieved a precision of 98% and a recall of 90%. True positives (TP) are mislabeled proteins that our approach correctly identified; false negatives (FN) are mislabeled sequences that the approach failed to identify. The main cause of false negatives is that we used Word2Vec to find similar texts in the protein ontology, and it did not always find the correct PRO name.

7.3.8 Running time

For the entire NR at 95% similarity, the database size was about 100GB, and the CD-HIT computation required six days and 20 hours on a compute node with two CPUs of 14 cores each (Model: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz). For the analyses of representative sequences at 90%, 85%, 80%, 75%, 70%, and 65% similarity, the database sizes were 40GB, 33GB, 28GB, 24GB, 21GB, 18GB, and 16GB respectively, and the running times were three days, one day and 21 hours, one day and 12 hours, one day and two hours, 20 hours, and 16 hours respectively. The detection approach, i.e., Algorithm 6, requires building an ontology graph from the list of functional annotations; it also requires an embedding using the Node2Vec library [32], which took eight hours. The postprocessing analysis, i.e., embedding and verifying the ontology graph, was run on a local iMac (late 2015) with Core i7 CPUs and 32GB RAM. The BoaG analysis was run on a Hadoop cluster, which significantly speeds up querying the dataset at scale.

7.3.9 Discussions

We used BoaG to explore NCBI's NR functional annotations at the sequence and cluster levels at scale; however, any language such as Python could be used instead. To detect mispredicted functions, we used the protein ontology (PRO) graph and verified whether the set of functions mapped to nodes in the PRO is consistent.

The approach relies on the consensus information of the clusters at 95% and 90% sequence similarity. It can greatly help large public databases improve the quality of their functional annotations. For smaller datasets, however, it cannot provide high-confidence results because of the smaller cluster sizes. There is a preprocessing cost to generate the dataset using CD-HIT, but this is a one-time cost, and the dataset can be updated incrementally.

CHAPTER 8. CONCLUSION AND FUTURE WORK

The cost of sequencing is decreasing while the amount of data used to answer ever more complex biological questions is increasing. These datasets pose significant data quality challenges for researchers testing different hypotheses. Metadata errors due to contamination in biological samples, computational misprediction, and user metadata submission can lead to significant error propagation. It will therefore be important for public databases to provide automatic data quality control. In this dissertation, we proposed and implemented a large-scale infrastructure and an automatic data cleaning approach that significantly improve the trustworthiness of genomics metadata. The proposed infrastructure opens up possibilities to explore data in ways previously not possible without deep expertise in data acquisition, storage, retrieval, mining, and parallelization. We also proposed a heuristic method to find inconsistencies in the metadata, i.e., taxonomic misassignments and functional mispredictions, and proposed the most reliable annotations for each protein sequence. We applied this method to the entire NR database, which contains the current comprehensive knowledge of protein information, and to its clustering information at different sequence similarity levels.

8.1 Future Work

The contributions made in this dissertation can lead to several potential research directions. Some of these future research directions are discussed below:

8.1.1 Language and Infrastructure Extension

BoaG will change the way researchers run large-scale analyses on multi-tiered biological data, with greater ease, a shorter learning curve, and higher reusability. To explore evolutionary properties across tens of thousands of genomes spanning the tree of life and to lower the barrier for more bioinformaticians, it would be beneficial to provide an easy-to-use web interface such as BoaG's. One direction, therefore, would be to integrate more omics datasets into the current infrastructure. One important dataset that could be integrated is SARS-CoV-2, along with other viruses, to study their evolutionary properties; this would help identify the functional importance of particular regions of the coronavirus protein sequence.

8.1.2 Data cleaning for Different Metadata

Different ontologies and knowledge graphs are in use by scientists. An immediate research direction would therefore be to find a common understanding of these knowledge graphs and a low-cost embedding that could be leveraged to improve the data quality of public databases.

8.1.3 Crowd-Source Data cleaning

A significant amount of metadata still needs to be manually curated. It would be beneficial for public databases and current algorithms to integrate crowd-sourcing support so that the community can help detect and remove erroneous metadata.

8.1.4 Machine Learning Model in the Infrastructure

By improving data quality, this project will decrease the cost of error propagation in downstream analysis, enabling more significant and dependable data-intensive scientific discovery in this area. Several machine learning algorithms could also be integrated into the current infrastructure.

8.1.5 Recommendation for Public Databases

The proposed infrastructure and cleaning strategy in this dissertation will help not only individual scientists but especially public repository holders, such as NCBI, to provide higher-quality data and metadata for researchers.

BIBLIOGRAPHY

[1] Generic Feature Format version 3. http://gmod.org/wiki/GFF. Accessed: 2019-02-07.

[2] Genomics England. https://www.genomicsengland.co.uk/. Accessed: 2020-10-04.

[3] Hadoop and MongoDB. https://www.mongodb.com/hadoop-and-mongodb. Accessed: 2019-02-07.

[4] How do we manually annotate a UniProtKB entry? https://www.uniprot.org/help/manual_curation. Accessed: 2020-10-04.

[5] UniProt knowledgebase (UniProtKB) results. https://www.uniprot.org/uniprot/. Accessed: 2020-10-04.

[6] UniProt knowledgebase (UniProtKB) UniRef 2020_05 results. https://www.uniprot.org/uniref/?query=&fil=identity:0.9. Accessed: 2020-10-28.

[7] (2019). Non-redundant database (NR). https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/. Accessed: 2019-06-10.

[8] Alnasir, J. and Shanahan, H. (2018). The application of Hadoop in structural bioinformatics. BioRxiv, page 376467.

[9] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. Journal of molecular biology, 215(3):403–410.

[10] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., et al. (2000). Gene ontology: tool for the unification of biology. Nature genetics, 25(1):25.

[11] Bagheri, H., Dyer, R., and Rajan, H. BoaG web-interface for genomics data. http://boa.cs.iastate.edu/boag/. Accessed: 2020-05-10.

[12] Bagheri, H., Muppirala, U., Masonbrink, R., Severin, A. J., and Rajan, H. (2019). Shared data science infrastructure for genomics data, doi: https://doi.org/10.21203/rs.2.4295/v3. BMC Bioinformatics.

[13] Bagheri, H., Severin, A., and Rajan, H. (2020). Detecting and correcting misclassified sequences in the large-scale public databases. Bioinformatics. btaa586.

[14] Bengtsson-Palme, J., Boulund, F., Edström, R., Feizi, A., Johnning, A., Jonsson, V. A., Karlsson, F. H., Pal, C., Pereira, M. B., Rehammar, A., et al. (2016). Strategies to improve usability and preserve accuracy in biological sequence databases. Proteomics, 16(18):2454–2460.

[15] Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., and Sayers, E. W. (2008). GenBank. Nucleic acids research, 37(suppl_1):D26–D31.

[16] Berman, H. M., Bourne, P. E., Westbrook, J., and Zardecki, C. (2003). The protein data bank. In Protein Structure, pages 394–410. CRC Press.

[17] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.-C., Estreicher, A., Gasteiger, E., Martin, M. J., Michoud, K., O’donovan, C., Phan, I., et al. (2003). The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research, 31(1):365–370.

[18] Borthakur, D. et al. (2008). Hdfs architecture guide. Hadoop Apache Project, 53:1–13.

[19] Breitwieser, F. P., Pertea, M., Zimin, A. V., and Salzberg, S. L. (2019). Human contamination in bacterial genomes has created thousands of spurious proteins. Genome research, 29(6):954–960.

[20] Chodorow, K. (2013). MongoDB: the definitive guide: powerful and scalable data storage. O'Reilly Media, Inc.

[21] Chu, X., Morcos, J., Ilyas, I. F., Ouzzani, M., Papotti, P., Tang, N., and Ye, Y. (2015). Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1247–1261.

[22] Cock, P. J., Antao, T., Chang, J. T., Chapman, B. A., Cox, C. J., Dalke, A., Friedberg, I., Hamelryck, T., Kauff, F., Wilczynski, B., et al. (2009). Biopython: freely available python tools for computational molecular biology and bioinformatics. Bioinformatics, 25(11):1422–1423.

[23] Consortium, U. (2014). UniProt: a hub for protein information. Nucleic acids research, 43(D1):D204– D212.

[24] Dean, J. and Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113.

[25] Dede, E., Govindaraju, M., Gunter, D., Canon, R. S., and Ramakrishnan, L. (2013). Performance evaluation of a MongoDB and Hadoop platform for scientific data analysis. In Proceedings of the 4th ACM workshop on Scientific cloud computing, pages 13–20. ACM.

[26] Deus, H. F., Correa, M. C., Stanislaus, R., Miragaia, M., Maass, W., De Lencastre, H., Fox, R., and Almeida, J. S. (2011). S3QL: A distributed domain specific language for controlled semantic integration of life sciences data. BMC bioinformatics, 12(1):285.

[27] Drost, H.-G. and Paszkowski, J. (2017). Biomartr: genomic data retrieval with r. Bioinformatics, 33(8):1216–1217.

[28] Dyer, R., Nguyen, H. A., Rajan, H., and Nguyen, T. N. (2013). Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the 2013 International Conference on Software Engineering, pages 422–431. IEEE Press.

[29] Dyer, R., Nguyen, H. A., Rajan, H., and Nguyen, T. N. (2015). Boa: Ultra-large-scale software repository and source-code mining. ACM Transactions on Software Engineering and Methodology (TOSEM), 25(1):7.

[30] Edgar, R. (2018). Taxonomy annotation and guide tree errors in 16S rRNA databases. PeerJ, 6:e5030.

[31] Fu, L., Niu, B., Zhu, Z., Wu, S., and Li, W. (2012). CD-HIT: accelerated for clustering the next- generation sequencing data. Bioinformatics, 28(23):3150–3152.

[32] Grover, A. and Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pages 855–864.

[33] Hassan, M. and Bansal, S. K. (2018). RDF data storage techniques for efficient SPARQL query processing using distributed computation engines. In 2018 IEEE International Conference on Information Reuse and Integration (IRI), pages 323–330. IEEE.

[34] Holliday, G. L., Davidson, R., Akiva, E., and Babbitt, P. C. (2017). Evaluating functional annotations of enzymes using the gene ontology. In The Gene Ontology Handbook, pages 111–132. Humana Press, New York, NY.

[35] Huerta-Cepas, J., Serra, F., and Bork, P. (2016). Ete 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular biology and evolution, 33(6):1635–1638.

[36] Islam, M. J., Sharma, A., and Rajan, H. (2018). A cyberinfrastructure for bigdata transportation engineering. arXiv preprint arXiv:1805.00105.

[37] Islam, M. J., Sharma, A., and Rajan, H. (2019). A cyberinfrastructure for big data transportation engineering. Journal of Big Data Analytics in Transportation.

[38] Klopfenstein, D., Zhang, L., Pedersen, B. S., Ramírez, F., Vesztrocy, A. W., Naldi, A., Mungall, C. J., Yunes, J. M., Botvinnik, O., Weigel, M., et al. (2018). Goatools: A python library for gene ontology analyses. Scientific reports, 8(1):1–17.

[39] Koonin, E. V. and Wolf, Y. I. (2008). Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic acids research, 36(21):6688–6719.

[40] Kozlov, A. M., Zhang, J., Yilmaz, P., Glöckner, F. O., and Stamatakis, A. (2016). Phylogeny-aware identification and correction of taxonomically mislabeled sequences. Nucleic Acids Research, 44(11):5022–5033.

[41] Kulmanov, M., Schofield, P. N., Gkoutos, G. V., and Hoehndorf, R. (2018). Ontology-based validation and identification of regulatory phenotypes. Bioinformatics, 34(17):i857–i865.

[42] Langmead, B., Hansen, K. D., and Leek, J. T. (2010). Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol., 11(8):R83.

[43] Leo, S., Santoni, F., and Zanetti, G. (2009). Biodoop: bioinformatics on Hadoop. In Parallel Processing Workshops, 2009. ICPPW'09. International Conference on, pages 415–422. IEEE.

[44] Mahadik, K., Wright, C., Zhang, J., Kulkarni, M., Bagchi, S., and Chaterji, S. (2016). SARVAVID: A domain specific language for developing scalable computational genomics applications. In Proceedings of the 2016 International Conference on Supercomputing, ICS ’16, pages 34:1–34:12, New York, NY, USA. ACM.

[45] Marchler-Bauer, A., Lu, S., Anderson, J. B., Chitsaz, F., Derbyshire, M. K., DeWeese-Scott, C., Fong, J. H., Geer, L. Y., Geer, R. C., Gonzales, N. R., et al. (2010). CDD: a conserved domain database for the functional annotation of proteins. Nucleic acids research, 39(suppl_1):D225–D229.

[46] McDonald, D., Price, M. N., Goodrich, J., Nawrocki, E. P., DeSantis, T. Z., Probst, A., Andersen, G. L., Knight, R., and Hugenholtz, P. (2012). An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. The ISME journal, 6(3):610.

[47] Medlar, A. J., Törönen, P., and Holm, L. (2018). AAI-profiler: fast proteome-wide exploratory analysis reveals taxonomic identity, misclassification and contamination. Nucleic acids research, 46(W1):W479–W485.

[48] Meola, M., Rifa, E., Shani, N., Delbès, C., Berthoud, H., and Chassard, C. (2019). DAIRYdb: a manually curated reference database for improved taxonomy annotation of 16S rRNA gene sequences from dairy products. BMC genomics, 20(1):560.

[49] Mernik, M., Heering, J., and Sloane, A. M. (2005). When and how to develop domain-specific languages. ACM computing surveys (CSUR), 37(4):316–344.

[50] Modha, S., Thanki, A. S., Cotmore, S. F., Davison, A. J., and Hughes, J. (2018). Victree: an automated framework for taxonomic classification from protein sequences. Bioinformatics, 34(13):2195–2200.

[51] Mukherjee, S., Huntemann, M., Ivanova, N., Kyrpides, N. C., and Pati, A. (2015). Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Standards in genomic sciences, 10(1):18.

[52] Nagy, A., Hegyi, H., Farkas, K., Tordai, H., Kozma, E., Bányai, L., and Patthy, L. (2008). Identification and correction of abnormal, incomplete and mispredicted proteins in public databases. BMC Bioinformatics, 9(1):353.

[53] Nagy, A. and Patthy, L. (2013). MisPred: a resource for identification of erroneous protein sequences in public databases. Database, 2013.

[54] Nagy, A. and Patthy, L. (2014). FixPred: a resource for correction of erroneous protein sequences. Database, 2014.

[55] Natale, D. A., Arighi, C. N., Blake, J. A., Bona, J., Chen, C., Chen, S.-C., Christie, K. R., Cowart, J., D’Eustachio, P., Diehl, A. D., et al. (2016). Protein Ontology (PRO): enhancing and scaling up the representation of protein entities. Nucleic acids research, 45(D1):D339–D346.

[56] Natale, D. A., Arighi, C. N., Blake, J. A., Bona, J., Chen, C., Chen, S.-C., Christie, K. R., Cowart, J., D’Eustachio, P., Diehl, A. D., et al. (2017). Protein Ontology (PRO): enhancing and scaling up the representation of protein entities. Nucleic acids research, 45(D1):D339–D346.

[57] Niemenmaa, M., Kallio, A., Schumacher, A., Klemelä, P., Korpelainen, E., and Heljanko, K. (2012). Hadoop-BAM: directly manipulating next generation sequencing data in the cloud. Bioinformatics, 28(6):876–877.

[58] Nystrom, N. A., Levine, M. J., Roskies, R. Z., and Scott, J. R. (2015). Bridges: A uniquely flexible hpc resource for new communities and data analytics. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure, XSEDE ’15, pages 30:1–30:8, New York, NY, USA. ACM.

[59] Prlić, A., Yates, A., Bliven, S. E., Rose, P. W., Jacobsen, J., Troshin, P. V., Chapman, M., Gao, J., Koh, C. H., Foisy, S., et al. (2012). BioJava: an open-source framework for bioinformatics in 2012. Bioinformatics, 28(20):2693–2695.

[60] Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2006a). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research, 35(suppl_1):D61–D65.

[61] Pruitt, K. D., Tatusova, T., and Maglott, D. R. (2006b). NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic acids research, 35(suppl_1):D61–D65.

[62] Rajan, H. (2017). Bridging the digital divide in data science. SPLASH-I: The ACM SIGPLAN conference on Systems, Programming, Languages and Applications: Software for Humanity (SPLASH).

[63] Ruggiero, M. A., Gordon, D. P., Orrell, T. M., Bailly, N., Bourgoin, T., Brusca, R. C., Cavalier-Smith, T., Guiry, M. D., and Kirk, P. M. (2015). A higher level classification of all living organisms. PloS one, 10(4).

[64] Sadasivam, G. S. and Baktavatchalam, G. (2010). A novel approach to multiple sequence alignment using Hadoop data grids. In Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud, MDAC ’10, pages 2:1–2:7, New York, NY, USA. ACM.

[65] Schmidt, B. and Hildebrandt, A. (2017). Next-generation sequencing: big data meets high performance computing. Drug Discovery Today.

[66] Schnoes, A. M., Brown, S. D., Dodevski, I., and Babbitt, P. C. (2009). Annotation error in public databases: misannotation of molecular function in enzyme superfamilies. PLoS computational biology, 5(12):e1000605.

[67] Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., and Kasprzyk, A. (2009). BioMart–biological queries made easy. BMC genomics, 10(1):22.

[68] Stajich, J. E., Block, D., Boulez, K., Brenner, S. E., Chervitz, S. A., Dagdigian, C., Fuellen, G., Gilbert, J. G., Korf, I., Lapp, H., et al. (2002). The bioperl toolkit: Perl modules for the life sciences. Genome research, 12(10):1611–1618.

[69] Steinegger, M. and Söding, J. (2018). Clustering huge protein sequence sets in linear time. Nature communications, 9(1):1–8.

[70] Su, C., Tong, J., Zhu, Y., Cui, P., and Wang, F. (2020). Network embedding in biomedical data science. Briefings in bioinformatics, 21(1):182–197.

[71] Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B., Wu, C. H., and Consortium, U. (2015). UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics, 31(6):926–932.

[72] Taylor, R. C. (2010). An overview of the Hadoop/MapReduce/HBase framework and its current appli- cations in bioinformatics. BMC Bioinformatics, 11 Suppl 12:S1.

[73] Terrizzano, I. G., Schwarz, P. M., Roth, M., and Colino, J. E. (2015). Data wrangling: The challenging journey from the wild to the lake. In CIDR.

[74] Tripp, H. J., Hewson, I., Boyarsky, S., Stuart, J. M., and Zehr, J. P. (2011). Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies. Nucleic acids research, 39(20):8792–8802.

[75] Turnbull, C., Scott, R. H., Thomas, E., Jones, L., Murugaesu, N., Pretty, F. B., Halai, D., Baple, E., Craig, C., Hamblin, A., et al. (2018). The 100000 genomes project: Bringing whole genome sequencing to the NHS. BMJ: British Medical Journal (Online), 361.

[76] Wilke, A., Harrison, T., Wilkening, J., Field, D., Glass, E. M., Kyrpides, N., Mavrommatis, K., and Meyer, F. (2012). The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools. BMC bioinformatics, 13(1):141.

[77] Wu, C. H., Yeh, L.-S. L., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R. S., Suzek, B. E., et al. (2003). The protein information resource. Nucleic acids research, 31(1):345–347.

[78] Yu, K. and Zhang, T. (2013). Construction of customized sub-databases from NCBI-nr database for rapid annotation of huge metagenomic datasets using a combined blast and megan approach. PLoS One, 8(4):e59831.

[79] Zhang, J., Zhu, M., and Qian, Y. (2020). protein2vec: Predicting protein-protein interactions based on lstm. IEEE/ACM Transactions on Computational Biology and Bioinformatics.