Prospects & Overviews Orthology Prediction Methods: a Quality Assessment Using Curated Protein Families
Total Page:16
File Type:pdf, Size:1020Kb
Prospects & Overviews Methods, Models & Techniques Orthology prediction methods: A quality assessment using curated protein families Kalliopi Trachana1), Tomas A. Larsson1)2), Sean Powell1), Wei-Hua Chen1), Tobias Doerks1), Jean Muller3)4) and Peer Bork1)5)Ã The increasing number of sequenced genomes has Introduction prompted the development of several automated orthology prediction methods. Tests to evaluate the accuracy of pre- The analysis of fully sequenced genomes offers valuable insights into the function and evolution of biological systems dictions and to explore biases caused by biological and [1]. The annotation of newly sequenced genomes, comparative technical factors are therefore required. We used 70 man- and functional genomics, and phylogenomics depend on ually curated families to analyze the performance of five reliable descriptions of the evolutionary relationships of public methods in Metazoa. We analyzed the strengths protein families. All the members within a protein family and weaknesses of the methods and quantified the impact are homologous and can be further separated into orthologs, of biological and technical challenges. From the latter part which are genes derived through speciation from a single ancestral sequence, and paralogs, which are genes resulting of the analysis, genome annotation emerged as the largest from duplication events before and after speciation (out- and single influencer, affecting up to 30% of the performance. in-paralogy, respectively) [2, 3]. The large number of fully Generally, most methods did well in assigning orthologous sequenced genomes and the fundamental role of orthology group but they failed to assign the exact number of genes in modern biology have led to the development of a plethora of forhalfofthegroups.Thepublicly available benchmark set methods (e.g. [4–11]) that automatically predict orthologs among organisms. Current approaches of orthology assign- (http://eggnog.embl.de/orthobench/) should facilitate the ment can be classified into (i) graph-based methods, which improvement of current orthology assignment protocols, cluster orthologs based on sequence similarity of proteins, and which is of utmost importance for many fields of biology (ii) tree-based methods, which not only cluster, but also rec- and should be tackled by a broad scientific community. oncile the protein family tree with a species tree (Box 1). Despite the fact that orthology and paralogy are ideally illus- Keywords: trated through a phylogenetic tree, where all pairwise relation- .metazoan; orthology; quality assessment ships are evident, tree-based methods are computationally DOI 10.1002/bies.201100062 1) Structural and Computational Biology Unit, European Molecular Biology Abbreviations: Laboratory, Heidelberg, Germany COG, clusters of orthologous group; EGF, epidermal growth factor; HMM, 2) Developmental Biology Unit, European Molecular Biology Laboratory, hidden Markov models; LCA, last common ancestor; MSA, multiple sequence Heidelberg, Germany alignment; OG, orthologous group; RefOG, reference orthologous groups; 3) Institute of Genetics and Molecular and Cellular Biology, CNRS, INSERM, VWC, von Willebrand factor type C; VWD, von Willebrand factor type D. University of Strasbourg, Strasbourg, France 4) Genetic Diagnostics Laboratory, CHU Strasbourg Nouvel Hoˆ pital Civil, Authors’ contributions: Strasbourg, France T. A. L., K. T., T. D., W. H. C., and S. P. did the phylogenetic analysis of the 5) Max-Delbruck-Centre for Molecular Medicine, Berlin-Buch, Germany families. K. T., T. A. L., and W. H. C. performed the analysis. S. P. created the website. T. D., J. M., and P. B. designed the analysis. All the authors *Corresponding author: contributed to write the paper. Peer Bork E-mail: [email protected] Supporting information online Bioessays 33: 769–780,ß 2011 WILEY Periodicals, Inc. www.bioessays-journal.com 769 K. Trachana et al. Prospects & Overviews .... Box 1 Comparison of orthology prediction methods Orthology prediction methods can be classified based on into OGs. OMA removes from the initial graph BBHs the methodology they use to infer orthology into (i) graph- characterized by high evolutionary distance; a concept based and (ii) tree-based methods [12, 16, 17]. Different similar to RoundUp. After that, it performs clustering based graph-based methods are designed to assign orthology on maximum weight cliques. Unique database character- for two (pairwise) or more (multiple) species. Graph-based istics are the hierarchical groups (OGs in different taxonomic methods assign proteins into OGs based on their similarity levels) and ‘‘pure orthologs’’ (generate groups of one-to- scores, while tree-based methods infer orthology through one orthologs without paralogs), which has been introduced tree reconciliation. only by OMA (indicated as ÃÃ in the figure). Hierarchical Pairwise species methods (e.g. BHR, InParanoid, groups can substitute the view of phylogenetic trees. RoundUp): Multi-species tree-based methods (e.g. TreeFam, Based on these methods, orthologs are best bi-directional Ensembl Compara, PhylomeDB, LOFT): hits (BBH) between a pair of species. BRH [46] is the first Tree-based prediction methods can be separated into automated method and does not detect paralogs. approaches that do (like EnsemblCompara, TreeFam, InParanoid [47] implements an additional step for the and PhylomeDB) and do not, e.g. LOFT [49], use tree- detection of paralogs. RoundUp [48] uses evolutionary reconciliation. Tree-based methods also initially use distances instead of BBH. In addition to the restriction homology searches; however, their criteria are more of only two-species at a time, these methods are disad- relaxed, as the orthology is resolved through tree top- vantageous for long evolutionary distances. ology. Although a reconciled phylogenetic tree is the most Multi-species graph-based methods (e.g. COG, appropriate illustration of orthology/paralogy assignment, eggNOG, OrthoDB, OrthoMCL, OMA): there are a few caveats to such an approach, namely their Due to the fast implementation and high scalability, there are scalability and sensitivity to data quality. Methods, Models & Techniques many graph-based methods for multi-species compari- For a more detailed and extensive discussion of the differ- sons. So far, all of them use either BLAST or Smith- ences among orthology methodology, we recommend Waterman (e.g. PARALIGN, SIMG) as sequence-similarity refs. [12, 16, 17]. search algorithms. However, they are quite diverse regard- Phylogenetic distribution describes the species range of ing the clustering algorithms. COG, eggNOG, and OrthoDB each database. Homology search shows a few technical share the same methodology: they identify three-way BBHs differences for recruiting orthologs. §: Supplies OGs in three different species and then merge triangles that whose members share only orthologous relationships. Ã: share a common side. OrthoMCL is a probabilistic method The user can compare any two genomes spanning a that uses a Markov clustering procedure to cluster BBH phylogenetic distance from bacteria to animals. Phylogenetic Homology Clustering Hierarchical distribution Paralogs search strategy groups None - BRH ALL* NO BLAST A B None InParanoid ALL* YES BLAST - Evol. RoundUp ALL* NO None - comparison Distance Pairwise species A COG ALL YES BLAST Triangles NO eggNOG ALL YES BLAST Triangles YES C B OrthoDB Eukaryotes YES PARALIGN Triangles YES GRAPH-BASED METHODS comparison Multi-species SIMD Maximum ALL YES** YES D OMA & Evol. weight Distance cliques OrthoMCL ALL YES BLAST Markov NO Clustering Family tree Species tree Metazoa YES BLAST Hierarchical - c b TreeFam D A & HMM clustering B A Ensembl D C Metazoa YES BLAST Hierarchical - Out Compara clustering Reconciled tree PhylomeDB § - D ALL YES BLAST None TREE-BASED METHODS d c a D C b B A Out 770 Bioessays 33: 769–780,ß 2011 WILEY Periodicals, Inc. ....Prospects & Overviews K. Trachana et al. expensive and at times fail due to the complexity of the family same release of genome repositories and tested across all or to the substantial number of species in the comparison [12]. methods [16]. As a trade-off between speed and accuracy, the evolutionary Methods, Models & Techniques relationships among proteins in comparisons that include a large number of species are better explored using graph-based Benchmarking orthology prediction methods. During the first large-scale orthology assignment methods using a phylogeny approach project of multiple species, the concept of clusters of ortho- logous groups (COGs) was introduced [4]. A COG consists of Despite the acknowledged necessity of a phylogeny-based proteins that have evolved from a single ancestral sequence evaluation of orthology, thus far the majority of quality assess- existing in the last common ancestor (LCA) of the species that ment tests are based on the functional conservation of pre- are being compared, through a series of speciation and dupli- dicted orthologs [18–21]. However, orthology is an cation events [4]. The orthologous/paralogous relationships evolutionary term defined by the relationships among the among proteins of multiple species are better resolved through sequences under study, and functional equivalences are not orthologous groups (OGs) rather than pairs of orthologs. This always inferable [13]. Moreover, the functional divergence is particularly evident in the instances of complex protein between orthologs and paralogs (sub-/neo-functionalization family histories (e.g. tubulins) or families