Mendler Kerrin Msc Thesis
Total Page:16
File Type:pdf, Size:1020Kb
Large-scale phylogenomic visualization and analysis of functional traits in bacteria by Kerrin Mendler A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Science in Biology Waterloo, Ontario, Canada, 2019 c Kerrin Mendler 2019 Author’s Declaration This thesis consists of material all of which I authored or co-authored: see Statement of Contri- butions included in this document. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners. I understand that my thesis may be made electronically available to the public. ii Statement of Contributions The manuscript presented in this thesis is the work of Kerrin Mendler, in collaboration with her co-authors Han Chen, Donovan Parks, Laura Hug, and Andrew Doxey. The manuscript was submitted as a preprint to the journal bioRxiv and for publication to the journal Nucleic Acids Research. Text and figures from the manuscript were modified for use in this dissertation. • Mendler, K., Chen, H., Parks, D. H., Hug, L. A. and Doxey, A. C. (2018). AnnoTree: visualization and exploration of a functionally annotated microbial tree of life. bioRxiv, 463455 For the manuscript, H.C and K.M. built the front-end interface of AnnoTree and back-end database. K.M. performed data analysis. D.P. assisted with bioinformatic analysis and genome annotation. L.H. assisted with phylogenetics and tool design. A.D., K.M., and L.H. wrote the manuscript. A.D. conceived the project and tool design. Additional contributors that were not included as co-authors on the paper are Briallen Lobb, who defined the ‘saturation’ and ‘catchment’ metrics for the classification of lineage-specific traits (unpublished) and Lee Bergstrand, for his assistance with the Docker-Compose implementation of the AnnoTree application. iii Abstract The growth of genomic information in public databases has dramatically improved our view of the tree of life and at the same time expanded our knowledge of protein diversity. Through the use of automated annotation pipelines, researchers can predict many of the functional capabilities of organisms directly from their genome sequence. Although there exist numerous phylogenetic and protein databases, there have been fewer attempts to combine these data, which is essential for the study of protein evolution. The web application AnnoTree (annotree.uwaterloo.ca) was developed as part of this thesis to facilitate the exploration and visualization of protein fami- lies (Pfams) and KEGG orthologs (KOs) on a phylogeny composed of nearly 24,000 bacterial genomes. The visualization includes an interactive tree of life, a summary of the taxonomic distribution of the query, basic taxonomic information, and annotation confidence scores. All protein sequences, visualizations, and summary information can be downloaded directly from the interface. The AnnoTree framework is open-source and can be modified to incorporate any custom tree, taxonomy, and proteome dataset. AnnoTree allows users to visualize the phylo- genetic distribution of a Pfam of interest, which, in combination with obtained gain/loss data, promotes hypothesis-generation in the context of protein evolution. To identify functions that are more tightly associated with evolutionary mechanisms such as horizontal gene transfer and evolutionary conservation, the pre-computed annotation data were combined with the bacterial tree of life in a phylogenomics analysis. The phyletic patchiness of all Pfam and KO annotations iv was measured using the normalized consistency index (CI), a measure of disagreement between the presence/absence states of traits across the tree and the tree topology. Pfams and KOs with the highest normalized CI represent functions known to be associated with mobile genetic elements and viral defence. These annotations were most commonly found within the genomes of symbiotic and pathogenic bacteria. The most highly conserved Pfams and KOs were functions related to core processes such as transcription, DNA replication, and protein synthesis as well as those required for oxygenic photosynthesis and sporulation. Lineage-specific Pfams and KOs were classified in many bacterial taxa, revealing many clade-defining functions in the Bacillus A genus, the Oxyphotobacteria class, and the Actinobacteria class, among others. An additional phyloge- nomics analysis was performed to identify branches of a phylogeny encompassing representatives from all three domains of life undergoing the most Pfam gain and loss events. The branches dividing the three taxonomic domains had the highest density of gain events, all of which were associated with well-known clade-defining functions. Missing data influenced the frequency of Pfam losses in lower taxonomic levels, but some characterized genome streamlining events within Eukaryotes were uncovered. Ultimately, the development of AnnoTree and accompanying analyses provide new insights into large-scale bacterial phylogenomics and the evolution and distributions of bacterial protein domains and gene families. v Acknowledgements I am most grateful for having Dr. Andrew Doxey as a supervisor, whose overwhelming kindness and enthusiasm made my experience in his lab a comfortable and enjoyable one. I would also like to thank the other members of my committee, Dr. Laura Hug and Dr. Brendan McConkey, whose further guidance allowed me to develop the critical thinking and technical skills required for the completion of this thesis. My gratification is extended to the other members of the Doxey lab for the productive (and also unproductive) conversations that we had. It is rare to have such a knowledgeable and insightful group of people to bounce ideas off of. My final thanks go out to my friends and family for their unwavering support. vi Table of Contents List of Tablesx List of Figures xi List of Abbreviations xii 1 Introduction1 1.1 Overview of phylogenomics and trees of life....................1 1.2 Overview of protein functional annotation methods and databases........3 1.3 Overview of factors shaping functional diversity in bacterial genomes......6 1.4 Measuring phylogenetic dispersion of binary traits.................9 1.5 Research aims and objectives............................ 15 2 The AnnoTree web application 16 2.1 Background..................................... 16 2.2 Overview of AnnoTree: features and capabilities................. 18 2.3 Construction of AnnoTree............................. 19 2.3.1 Front-end.................................. 19 2.3.2 Back-end.................................. 21 Data sources................................ 22 AnnoTree database............................. 22 Server-side application........................... 25 2.4 Installation and system requirements........................ 26 2.4.1 Setup with the latest AnnoTree data.................... 26 2.4.2 Setup with custom or new data....................... 27 2.5 AnnoTree use case examples............................ 27 2.5.1 Visualizing the taxonomic distribution of annotations........... 27 2.5.2 Observing the taxonomic distribution of an NCBI BLAST result..... 29 vii 2.6 Conclusion..................................... 30 3 The phylogenetic distribution of functional traits 32 3.1 Distribution of Pfam and KEGG annotations in bacteria.............. 32 3.1.1 Background................................. 32 3.1.2 Methods.................................. 33 Gene prediction, annotation, and profile generation............ 33 Benchmarking the measurement of homoplasy metrics in R....... 34 Contamination sensitivity analysis of normalized CI........... 35 Calculating the significance of phylogenetic conservation......... 36 Classification of lineage-specific traits................... 37 Taxonomic rank homoplasy enrichment analysis............. 37 3.1.3 Results and Discussion........................... 38 Homoplasy and phylogenetically scattered traits............. 38 Lineage-specific traits........................... 47 3.1.4 Conclusion................................. 55 3.2 Distribution of protein families across all life................... 55 3.2.1 Background................................. 55 3.2.2 Methods.................................. 56 Data retrieval and profile generation.................... 56 Evolutionary model estimation and stochastic mapping.......... 57 3.2.3 Results and Discussion........................... 57 The phylogenetic distribution of Pfam gains and losses.......... 57 3.2.4 Conclusion................................. 62 4 Conclusion and Future Directions 65 4.1 Contributions.................................... 65 4.1.1 AnnoTree web application......................... 66 4.1.2 Phylogenetic distribution of functional traits................ 67 4.2 Future Directions.................................. 69 4.2.1 AnnoTree web application......................... 69 4.2.2 Phylogenetic distribution of functional traits................ 70 References 72 APPENDICES 94 A Measured phylogenetic dispersion of functional annotations 95 viii B Lineage-specific annotations 118 ix List of Tables Table 1.1 Metrics for the quantification of phylogenetic dispersion of binary traits.. 14 Table 2.1 AnnoTree repository locations........................ 26 Table 3.1 Taxon enrichment contingency table..................... 38 Table 3.2 Results of the homoplasy metric benchmarking experiment......... 40 Table 3.3 Lineage-specific Pfams also listed as essential DUFs............ 54 Table 3.4 Genera exhibiting the most Pfam gains................... 59