Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance

Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance by Matthew Daratha Whiteside B.Sc., (Hons., Bioinformatics), University of Waterloo, 2006 Thesis Submitted In Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy in the Department of Molecular Biology and Biochemistry Faculty of Science © Matthew Daratha Whiteside 2013 SIMON FRASER UNIVERSITY Spring 2013 Approval Name: Matthew Daratha Whiteside Degree: Doctor of Philosophy Title of Thesis: Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance Examining Committee: Chair: Dr. Ralph Pantophlet Assistant Professor, Faculty of Health Sciences Dr. Fiona S.L. Brinkman Senior Supervisor Professor, Department of Molecular Biology and Biochemistry Dr. Jack Chen Supervisor Associate Professor, Department of Molecular Biology and Biochemistry Dr. Margo M. Moore Supervisor Professor, Department of Biological Sciences Dr. Ryan D. Morin Internal Examiner Assistant Professor, Department of Molecular Biology and Biochemistry Dr. Rosemary J. Redfield External Examiner Professor, Department of Zoology University of British Columbia Date Defended/Approved: March 8th, 2013 ii Partial Copyright Licence iii Abstract Orthologs are genes that diverged from an ancestral gene when the species diverged. High-throughput computational methods for ortholog prediction are a key component of many computational biology analyses. A fundamental premise in these analyses is that orthologs (when predicted correctly) are functionally equivalent and can be used to transfer gene annotations across species. Currently, many existing ortholog prediction methods generate a sizeable number of incorrect ortholog predictions, especially in cases of complex gene evolution. My thesis examines the functional equivalence hypothesis further and presents one solution that increases the precision of ortholog prediction. To examine the use of orthologs in computational analysis, I conducted and evaluated three projects that employ ortholog prediction in distinct ways. In these projects, orthologs were used to (1) identify conserved, unique genes in metazoan species, (2) validate predicted gene regulatory modules in Pseudomonas aeruginosa, and (3) construct a transcriptional regulatory network in Aspergillus fumigatus. I identified factors affecting ortholog prediction in these specific use cases, demonstrating how successive gene duplications, incomplete genomes and rapid evolution of gene regulation can impact the results for such analyses. To improve ortholog prediction, I evaluated and augmented an existing method called Ortholuge. Ortholuge is a computational method that increases the precision of ortholog prediction in a high-throughput setting. I evaluated the performance of Ortholuge, showing that its approach of classifying orthologs based on their relative phylogenetic divergence does identify orthologs that are more functionally equivalent. I compared Ortholuge to contemporary methods QuartetS and OMA, and showed that Ortholuge consistently identifies functionally-equivalent orthologs across a range of taxonomic distances. I also further developed Ortholuge’s functionality by reducing run-time, increasing accuracy and improving usability through a number of modifications. Lastly, to make Ortholuge results available to the research community, I developed a database of Ortholuge ortholog predictions for bacteria and archaea species. This online iv database provides high-level visualization of orthologs and the ability to easily run complex queries to retrieve genes that are shared or unique between specified taxa. Overall, this work contributes an enhanced method for precise high-throughput ortholog identification and increases our understanding of the functional equivalences between orthologs. Keywords: Orthology; Comparative Genomics; Bioinformatics; Phylogenomics; Evolution v Dedication Aan mijn liefde Joske. Dank u voor jou eeuwige steun. Je geeft me de moed om mijn dromen te volgen. vi Acknowledgements I would like to thank the many people who have helped me in the completion of my PhD thesis. The first is my senior supervisor, Dr. Fiona Brinkman. You have my sincerest gratitude. Thank you for this opportunity, which has taught me so much. My PhD has been more than an education; it has been a life changing experience. I would like to thank past and present members of the Brinkman lab. You have made this experience truly unforgettable. Thank you so much for your support and shared laughs during these past six years. I am very thankful for the invaluable guidance and contribution of my supervisory committee; Dr. Margo Moore and Dr. Jack Chen. I would like to acknowledge all project collaborators. Your help and the opportunities you have provided me have made this work possible. I would like to extend thanks to: • Drs. Jinko Graham and Brad McNeney, Jeong Eun Min – OL.locfdr Project • Drs. Melissa Frederic and Michel Leroux – Metazoan Project • Dr. Margo Moore, Linda Pinto and Jason Catterson – Aspergillus fumigatus Iron-Limitation Project • Geoff Winsor and Matthew Laird – OrtholugeDB Project Thank you to the funding agencies: the Michael Smith Foundation for Health Research and Simon Fraser University for their financial support during my PhD. Finally, I would like to thank my family. Without your support and encouragement, none of this would be possible. vii Table of Contents Approval .............................................................................................................................ii Partial Copyright Licence .................................................................................................. iii Abstract .............................................................................................................................iv Dedication .........................................................................................................................vi Acknowledgements .......................................................................................................... vii Table of Contents ............................................................................................................ viii List of Tables .................................................................................................................... xii List of Figures.................................................................................................................. xiii List of Acronyms ...............................................................................................................xv Glossary .......................................................................................................................... xvi 1. Introduction to Homology and Comparative Genomics ..................................... 1 1.1. Gene Evolution and Its Impact on Gene Function ................................................... 2 1.1.1. Overview of the Mechanisms of Gene Evolution .......................................... 2 1.1.2. Definition of Orthology and Paralogy ............................................................ 3 1.1.3. Correspondence between Mode of Gene Evolution and Functional Divergence ................................................................................................... 5 Fates of Genes after Duplication .................................................................. 6 Neofunctionalization ..................................................................................... 7 Subfunctionalization ..................................................................................... 7 1.2. Detection of Orthologs ............................................................................................. 8 1.2.1. Phylogenetic Tree-based Detection of Orthologs ......................................... 8 1.2.2. Graph-based Detection of Orthologs ............................................................ 9 1.2.3. Hybrid Approaches ..................................................................................... 10 1.2.4. Resolving Complex Ortholog Relationships ............................................... 11 Detection of In-paralogs ............................................................................. 11 Grouping Orthologs from Multiple Species ................................................. 11 1.3. Factors Affecting Ortholog Prediction Accuracy ..................................................... 14 Gene Loss and Incomplete Genomes ........................................................ 14 Gene Fusion and Fission ........................................................................... 15 Issues Identifying the Nearest Neighbour with the BLAST Algorithm ........ 15 Horizontal Gene Transfer ........................................................................... 16 1.4. Improving Ortholog Prediction ................................................................................ 16 1.4.1. Review of Existing Strategies for Improved Ortholog Prediction ................ 16 Phylogenetic-based Ortholog Prediction without a Species Tree .............. 16 Overcoming Limitations of the BLAST algorithm in Identifying the Nearest Neighbour ............................................................................... 17 A Domain-Centric Approach ....................................................................... 17 Synteny .....................................................................................................

Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance

Cerebral Autosomal Dominant Arteriopathy with Subcortical Infarcts and Leukoencephalopathy Revisited Genotype-Phenotype Correlations of All Published Cases

Genome-Wide Analysis of LRR-RLK Gene Family in Four Gossypium Species and Expression Analysis During Cotton Development and Stress Responses

How Chaotic Is Genome Chaos?

On the Role of Chromosomal Rearrangements in Evolution

The Biocyc Collection of Microbial Genomes and Metabolic Pathways Peter D

Biological Capacities Clearly Define a Major Subdivision in Domain Bacteria 4 5 Raphaël Méheust+,1,2, David Burstein+,1,3,8, Cindy J

Hundreds of Novel Composite Genes and Chimeric Genes with Bacterial

Psortm: a Bacterial and Archaeal Protein Subcellular Localization Prediction Tool for Metagenomics Data Michael A

Role of ECDYSONELESS in ERBB2/HER2 Mediated Breast Oncogenesis

Genes of Innate Immunity and Their Significance in Evolutionary Ecology of Free Livings Rodents Alena Fornuskova

GLP1/GLP1 Receptors

Pathway Tools Version 24.0: Integrated Software for Pathway/Genome Informatics and Systems Biology