Ortholugedb: a Bacterial and Archaeal Orthology Resource for Improved Comparative Genomic Analysis Matthew D

D366–D376 Nucleic Acids Research, 2013, Vol. 41, Database issue Published online 29 November 2012 doi:10.1093/nar/gks1241 OrtholugeDB: a bacterial and archaeal orthology resource for improved comparative genomic analysis Matthew D. Whiteside, Geoffrey L. Winsor, Matthew R. Laird and Fiona S. L. Brinkman* Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia V5A 1S6, Canada Received September 28, 2012; Revised and Accepted November 1, 2012 ABSTRACT INTRODUCTION Prediction of orthologs (homologous genes that The increase in genomic sequencing throughput has diverged because of speciation) is an integral generated a rapid increase in the number of available bac- component of many comparative genomics terial and archaeal genome sequences (1,2). To make methods. Although orthologs are more likely to effective use of this growing resource, computational methods for comparative genomics analysis must keep have similar function versus paralogs (genes that pace, ideally without sacrificing accuracy or performance. diverged because of duplication), recent studies Computationally predicted orthologs are integral to many have shown that their degree of functional conser- comparative genomics analyses. Orthologs, related genes vation is variable. Also, there are inherent problems between species that have diverged as a result of speci- with several large-scale ortholog prediction ation, are thought to more likely have similar functions approaches. To address these issues, we previously than paralogs, which are homologous genes that have developed Ortholuge, which uses phylogenetic arisen through gene duplication (3). This ortholog func- distance ratios to provide more precise ortholog tional conservation hypothesis or conjecture is the basis assessments for a set of predicted orthologs. for many comparative genomics methods using computa- However, the original version of Ortholuge required tionally predicted orthologs to infer gene functions across manual intervention and was not easily accessible; species. Gene annotation transfer, phylogenetic profiling, metabolic network reconstruction and identification of therefore, we now report the development of gene regulatory elements by phylogenetic foot printing OrtholugeDB, available online at http://www. are examples of methods that use computationally pathogenomics.sfu.ca/ortholugedb. OrtholugeDB predicted orthologs. Several platforms for comparative provides ortholog predictions for completely genomic analysis in microbial species have been sequenced bacterial and archaeal genomes from developed, including the Comprehensive Microbial NCBI based on reciprocal best Basic Local Resource (4), MicrobesOnline (5), Microbial Genome Alignment Search Tool hits, supplemented with Database (6) and the Integrated Microbial Genomes further evaluation by the more precise Ortholuge database (7). Broadly, these resources specialize in data method. The OrtholugeDB web interface facilitates integration and functional inference based on comparative user-friendly and flexible ortholog analysis, from genomics. Some of these platforms could benefit from an single genes to genomes, plus flexible data ortholog prediction method that is both high throughput and highly precise. A number of high-throughput ortholog download options. We compare Ortholuge with prediction methods have been developed. Although there similar methods, showing how it may more consist- are >30 ortholog databases, OMA (8), QuartetS (9,10), ently identify orthologs with conserved features OrthoMCL (11,12), RoundUP (13), eggNOG (14) and across a wide range of taxonomic distances. HOGENOM (15,16) have associated web sites that OrtholugeDB facilitates rapid, and more accurate, provide ortholog predictions for significant numbers of bacterial and archaeal comparative genomic microbial genomes (not necessarily specific to bacteria analysis and large-scale ortholog predictions. and archaea species). Ortholog prediction methods have *To whom correspondence should be addressed. Tel: +1 778 782 5646; Fax: +1 778 782 5583; Email: [email protected] ß The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/3.0/), which permits non-commercial reuse, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]. Nucleic Acids Research, 2013, Vol. 41, Database issue D367 been classified into two categories: tree-based methods, similar functions and may be better suited for many com- which use phylogenetic trees to resolve orthologs, and parative genomic analyses. The phylogenetic approach graph-based methods, which use pairwise sequence used in Ortholuge does not require a separate species similarities computed across the entire genome (hybrid tree, making it especially suited for microbial genomes approaches have also been developed) (17). where species tree construction can be complicated by Shortcomings have been associated with both types (17). horizontal gene transfer and widely differing degrees of Tree-based methods require reliable automated tree divergence between the species being compared. building, accurate tree rooting and accurate species trees. However, the Ortholuge method, initially made princi- Many of the methods do not scale well (17). Although pally for our own internal use, has not been easily access- graph-based methods typically scale better, as these ible. Ortholuge-based predictions for the human, mouse methods often do not consider the broader phylogenetic and cow genomes are available through InnateDB—a context, they can generate false-positive ortholog predic- platform facilitating system-based analyses of the innate tions when the ortholog is missing (an ortholog can be immune response (28)—and predictions for Pseudomonas missing because of incomplete genome annotations or species genomes are available through the Pseudomonas gene deletions) (17,18). In the evaluations of ortholog pre- Genome Database (29). However, none of these predic- diction methods, the graph-based methods performed tions are queryable, and there is no flexibility in the well, even in comparison with at least one sophisticated display of ortholog prediction results. To make tree-based algorithm (19,20). The graph-based methods Ortholuge predictions widely available, we now report range significantly in their specificity and coverage. the development of OrtholugeDB, a web-accessible Some methods, such as the ortholog group-centric database of pre-computed Ortholuge-based predictions approaches: OrthoMCL, eggNOG and COG, have high for completely sequenced bacterial and archaeal coverage (producing many ortholog predictions), but species (http://www.pathogenomics.sfu.ca/ortholugedb). lower specificity (higher false prediction rate) (20). Two However, OrtholugeDB is not limited to providing methods have developed approaches to mitigate the Ortholuge results. OrtholugeDB was also developed as a misprediction due to missing ortholog genes. OMA and user-friendly and flexible interface for querying RBB- QuartetS are graph-based approaches that use additional based orthologs predictions, in-paralog predictions outgroup genes to examine the broader phylogeny of a (recently duplicated genes relative to the species being predicted ortholog pair to assess whether the predicted compared) and ortholog groups. orthologs are more likely paralogs (8,9). The reliability of the ortholog conjecture, that orthologs are more likely to have similar functions than paralogs, ORTHOLUGE has recently been examined on a larger scale (21–26). These studies showed that for similar levels of divergence, Ortholuge generates precise ortholog predictions between orthologs tend to more often have functions that are more two species on a genome-wide scale using an additional conserved than paralogs. The difference in function con- outgroup genome for reference (18). It computes phylo- servation, however, is not considerable and can vary genetic distance ratios for each pair of orthologs that between species and gene families (22,24–26). Although reflect the relative rate of divergence for the predicted orthologs provide predictive power over paralogs when orthologs (Figure 1). Two ratios are needed to summarize it comes to inferring genes functions across species, the the relative branch lengths for both ingroup genes. These implication from these studies is that ortholog prediction phylogenetic ratios allow the user to distinguish between methods could be improved by targeting the set of predicted orthologs with phylogenetic distance that is orthologs that are functionally similar rather than all evo- comparable with the species divergence [termed lutionary orthologs. supporting-species-divergence (SSD) orthologs] and pre- Ortholuge is a high-throughput method that improves dicted orthologs with unusual divergence (non-SSD). the specificity of ortholog prediction (18). It provides the SSD orthologs and orthologs undergoing unusual diver- benefits of graph-based methods, including scalability, but gence have distinct ratio distributions. These ortholog limits false positives generated by missing orthologs, as it types are observable when the ratios are plotted on a considers the phylogenetic context of predicted orthologs. genome-wide scale (Figure 1D) (27). SSD and non-SSD Ortholuge first predicts orthologs using the

Ortholugedb: a Bacterial and Archaeal Orthology Resource for Improved Comparative Genomic Analysis Matthew D

The Biocyc Collection of Microbial Genomes and Metabolic Pathways Peter D

Biological Capacities Clearly Define a Major Subdivision in Domain Bacteria 4 5 Raphaël Méheust+,1,2, David Burstein+,1,3,8, Cindy J

Hundreds of Novel Composite Genes and Chimeric Genes with Bacterial

Psortm: a Bacterial and Archaeal Protein Subcellular Localization Prediction Tool for Metagenomics Data Michael A

Pathway Tools Version 24.0: Integrated Software for Pathway/Genome Informatics and Systems Biology

A Sample Article Title

Computational Prediction and Comparative Analysis of Protein Subcellular Localization in Bacteria

Ontoloki: an Automatic, Instance-Based Method for the Evaluation of Biological Ontologies on the Semantic Web1

View Course Material

Dbmloc: a Database of Proteins with Multiple Subcellular Localizations Song Zhang†, Xuefeng Xia†, Jincheng Shen, Yun Zhou and Zhirong Sun*

Psortdb—An Expanded, Auto-Updated, User-Friendly Protein Subcellular Localization Database for Bacteria and Archaea Nancy Y

Computational Ortholog Prediction: Evaluating Use Cases and Improving High-Throughput Performance