Published online 18 October 2007 Nucleic Acids Research, 2008, Vol. 36, Database issue D271–D275 doi:10.1093/nar/gkm845

OrthoDB: the hierarchical catalog of eukaryotic orthologs Evgenia V. Kriventseva3, Nazim Rahman1, Octavio Espinosa1 and Evgeny M. Zdobnov1,2,4,*

1Department of Genetic Medicine and Development, University of Geneva Medical School, 2Swiss Institute of Bioinformatics, 1 rue Michel-Servet, 3Department of Structural Biology and Bioinformatics, University of Geneva Medical School, 1 rue Michel-Servet, 1211 Geneva, Switzerland and 4Imperial College London, South Kensington Campus, SW7 2AZ London, UK

Received August 20, 2007; Revised September 24, 2007; Accepted September 25, 2007

ABSTRACT particularly challenging with complex eukaryotic genomes. The fast growing number of available complete The concept of orthology is widely used to relate genomes facilitates a much better resolution of the across different species using comparative genealogies, while at the same time greatly increasing the genomics, and it provides the basis for inferring computational challenges. gene function. Here we present the web accessible There are two main approaches to delineate ortholo- OrthoDB database that catalogs groups of ortho- gous genes: (i) from reconciliation of gene trees with logous genes in a hierarchical manner, at each the species phylogeny and (ii) from classification of all- radiation of the species phylogeny, from more against-all sequence comparisons of complete genomes. general groups to more fine-grained delineations The phylogeny approach takes advantage of well-studied between closely related species. We used a of conserved cores of globular proteins using COG-like and Inparanoid-like ortholog delineation quantitative models of amino acid substitutions (4–6). The notable examples of the tree-based approach to procedure on the basis of all-against-all Smith- delineate orthologous genes are HOVERGEN (7) and Waterman sequence comparisons to analyze TreeFam (8). The expert curation of the phylogenetic trees 58 eukaryotic genomes, focusing on vertebrates, and the underlying multiple sequence alignments is both, insects and fungi to facilitate further compara- advantageous, providing better accuracy and disadvanta- tive studies. The database is freely available at geous, limiting the comprehensiveness, homogeneity of http://cegg.unige.ch/orthodb quality and expandability to new species. Although given the appropriate data phylogenetic methods are likely to give more accurate models of ancestral sequences and INTRODUCTION therefore to yield more accurate orthology prediction, their applicability to current genomic data is hindered Identification of orthologous genes is the cornerstone of by several factors, most importantly: (i) they require sub- comparative genomics, which is increasingly becoming stantially more computational resources, (ii) the reconci- an essential part of modern molecular biology. liation of gene and species trees relies on poorly quantified Functions of orthologous genes are often preserved models of gene duplication and loss, and (iii) they are through evolution, as by definition, orthologous genes sensitive to completeness of predicted genes as the descend by speciation from the common ancestor gene evolutionary models are designed for only well-conserved (1–3). Although the conservation of ortholog functions globular cores of proteins and missing data (gaps) render is not required or guaranteed, it is the most likely the approach inapplicable. The tree-based approaches also evolutionary scenario and provides a strong working require the knowledge of the species phylogeny, and hypothesis, particularly when the ortholog copy-number although the consensus on phylogeny seems to be is preserved over a long period of time. Identification close, it is still constantly challenged. The alternative of orthologs is intricate as it assumes knowledge of approach of clustering orthologous genes on the basis of ancestral state of the genes, and it requires knowledge their whole-length similarity around Best-Reciprocal-Hits of the complete gene repertoires. It is also complicated (BRHs, also known as SymBets, bi-directional BeTs by gene duplication, fusion and exon shuffling, as well and best–best hits, denoting sequences most similar to as pseudonization and loss, which make the problem each other in between-genome comparisons) was first

*To whom correspondence should be addressed. Tel: +41 22 379 59 73; Fax: +41 22 379 57 06; Email: [email protected]

ß 2007 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. D272 Nucleic Acids Research, 2008, Vol. 36, Database issue introduced by the database of Clusters of Orthologous METHODS Groups (COGs) (9). Triggered by the earlier availability Orthology delineation of much smaller and simpler bacterial genomes the database has quickly gained wide recognition and was Groups of orthologous genes were automatically identi- later extrapolated to eukaryotic genomes (KOGs) (9). fied using a strategy employed previously (19–22) that is The identification of BRHs is widely adopted currently based on all-against-all protein sequence comparisons in the field of comparative genomics for its simplicity and using the Smith-Waterman algorithm as implemented feasibility of application to large-scale data. In terms of in ParAlign (23) with default parameters, followed by phylogenetic trees, BRHs could be interpreted as genes clustering of best reciprocal hits from highest scoring ones À6 À10 from different species with the shortest connecting path to 10 e-value cutoff for triangulating BRH or 10 over the distance-based tree. The simplest application of cutoff for unsupported BRH, and requiring a sequence this approach using BLAST (10) for interspecies compar- alignment overlap of at least 30 amino acids across all isons suffers from inaccuracies of sequence distance members of a group. Furthermore, the orthologous estimates and ignores many gene duplications after the groups were expanded by genes that are more similar to speciation that are, in fact, co-orthologs that are difficult each other within a proteome than to any gene in any of to differentiate functionally. However, using these genes the other species, and by very similar copies that share as anchors of orthologous groups in different species, over 97% sequence identity, which were identified initially additional co-orthologs can be identified as genes that using CD-Hit (24). We considered only the longest trans- are more similar to them in intra-genome comparisons cript per gene or the most common as specified in UniProt than to any other gene in the other genomes, as (25). The outlined procedure was first applied to all species popularized by the pairwise Inparanoid approach (11). considered, and then to each subset of species according to There are also a few alternative clustering heuristics with the radiation of the . varying compromises between specificity and selectivity (12) that focus on the growing number of available Phylogeny reconstruction eukaryotic genomes such as the probabilistic approach of To guide computation of the ortholog hierarchy we OrthoMCL (13) and the vertebrate-centric Ensembl- produced the multiple alignment of concatenated single- Compara (14). copy orthologs, using well-aligned regions extracted with Another important feature of orthology and paralogy Gblocks (26) from individually aligned orthologous classification, which is currently underappreciated, is that sequences using Muscle (27). This was used to compute it is relative to a particular ancestor, as orthology of genes the phylogenetic trees using the maximum-likelihood is defined by their descent from a common ancestor gene method as implemented in PHYML (28), employing the by speciation (1–3). Therefore, the more distantly related JTT model, a gamma correction with four discrete classes, species are considered the more general (inclusive) and an estimated alpha parameter and proportion of orthologous groups become, because all lineage-specific invariable sites. duplications since this last common ancestor should be considered as co-orthologs. Inversely, orthologous groups become more fine-grained (more 1 : 1 relations) when DATABASE CONTENT closely related species are considered, as there was less Overview statistics time for gene duplications to occur. The concept of hierarchical orthologous groups has already prompted As detailed in Table 1 we analyzed 23 complete proteomes of fungal species, 19 insects and 15 vertebrates at different development of Levels of Orthology From Trees (LOFT) levels of the species phylogeny. Overall, this effort spans (15) tool to interpret the gene-trees in the context of 870 737 genes, 82% of which have been classified into species tree, COrrelation COefficient-based CLustering 10 876 orthologous groups in fungi, 19 835 in insects and (COCO-CL) (16) methodology to refine clusters of 23 940 in vertebrates, providing the first systematic homologous genes, and PHOG approach (17) to resolve classification of the wealth of data that will provide the orthology at each taxonomy node using explicit modeling basis for further comparative evolutionary analyses. of the ancestral sequences and relying on PHOG-BLAST (18) profile–profile comparisons. Aiming to fuel comparative genomic studies we focused WEB INTERFACE here on the most represented eukaryotic phyla, namely, we analyzed 23 fungi, 19 insect (plus one crustacean) The database is freely accessible from http://cegg.unige. and 15 vertebrate species with complete proteomes ch/ available (Table 1). For this analysis, we employed our own implementation of COG-like and Inparanoid-like Hierarchy of the orthologous groups ortholog identification procedures from all-against-all Orthology and paralogy classification is relative to the set sequence comparisons across multiple species (19–22), of species considered [namely, to the particular ancestor and here we explicitly delineate the hierarchy of the (1–3)] and is more general (inclusive) for distantly related orthologous groups, consistently applying the procedure species, and more fine-grained (specific) for closely related to the sets of species with varying levels of relatedness species. We therefore delineated orthologous groups at according to the species tree (Figure 1). each radiation node of the species phylogeny. To clearly Nucleic Acids Research, 2008, Vol. 36, Database issue D273

Table 1. Sets of covered complete proteomes

Lineage Species name Abbreviation No. of genesà Classified (%) Source

Vertebrates Bos taurus Btar 21755 88 Ensembl v45 Jun2007 Canis familiaris Cfam 19305 94 Ensembl v45 Jun2007 Danio rerio Drer 24961 82 Ensembl v45 Jun2007 Gasterosteus aculeatus Gacu 20791 88 Ensembl v45 Jun2007 Gallus gallus Ggal 16736 83 Ensembl v45 Jun2007 Homo sapiens Hsap 22937 96 Ensembl v45 Jun2007 Monodelphis domestica Mdom 19520 91 Ensembl v45 Jun2007 Macaca mulatta Mmul 21944 90 Ensembl v45 Jun2007 Mus musculus Mmus 24496 87 Ensembl v45 Jun2007 Ornithorhynchus anatinus Oana 15723 87 Ensembl v45 Jun2007 Pan troglodytes Ptro 20965 97 Ensembl v45 Jun2007 Rattus norvegicus Rnov 22993 89 Ensembl v45 Jun2007 Tetraodon nigroviridis Tnig 28005 71 Ensembl v45 Jun2007 Takifugu rubripes Trub 21880 91 Ensembl v45 Jun2007 Xenopus tropicalis Xtro 18025 81 Ensembl v45 Jun2007 InsectsÃÃ Aedes aegypti Aaeg 16789 89 AaegL1.1 Anopheles gambiae Agam 13133 87 AgamP3.45 Apis mellifera Amel 10330 87 GLEAN+curated_set Bombyx mori Bmor 21302 48 SW_ge2k_BGF Culex pipiens Cpip 23165 66 JCVI.CpipJ1.0_5 Drosophila ananassae Dana 22551 74 CAP freeze_20061030 Drosophila erecta Dere 16880 91 CAP freeze_20061030 Drosophila grimshawi Dgri 16901 87 CAP freeze_20061030 Drosophila melanogaster Dmel 13733 98 CAP freeze r4.3.FB Drosophila mojavensis Dmoj 17738 84 CAP freeze_20061030 Drosophila persimilis Dper 23029 77 CAP freeze_20061030 Drosophila pseudoobscura Dpse 17328 91 CAP freeze_20061030 Daphnia pulex Dpul 30940 42 FrozenGC_2007_07_03 Drosophila sechellia Dsec 21332 81 CAP freeze_20061030 Drosophila simulans Dsim 17049 89 CAP freeze_20061030 Drosophila virilis Dvir 17679 86 CAP freeze_20061030 Drosophila willistoni Dwil 20211 77 CAP freeze_20061030 Drosophila yakuba Dyak 18816 86 CAP freeze_20061030 Pediculus humanus Phum 11206 82 TIGR.061807 Tribolium castaneum Tcas 16616 67 GLEAN+curated_set Fungi Ashbya gossypii ASHG 4720 97 UniProt v12 Jul2007 Aspergillus clavatus ASPC 9120 95 UniProt v12 Jul2007 Aspergillus fumigatus ASPF 9629 95 UniProt v12 Jul2007 Aspergillus oryzae ASPO 12055 84 UniProt v12 Jul2007 Aspergillus terreus ASPT 10405 90 UniProt v12 Jul2007 Candida glabrata CANG 5180 95 UniProt v12 Jul2007 Chaetomium globosum CHAG 11040 78 UniProt v12 Jul2007 Coccidioides immitis COCI 10435 69 UniProt v12 Jul2007 Cryptococcus neoformans CRYN 6438 83 UniProt v12 Jul2007 Debaryomyces hansenii DEBH 6311 89 UniProt v12 Jul2007 Encephalitozoon cuniculi ENCC 1909 62 UniProt v12 Jul2007 Kluyveromyces lactis KLUL 5326 92 UniProt v12 Jul2007 Lodderomyces elongisporus LODE 5781 91 UniProt v12 Jul2007 Magnaporthe grisea MAGG 12685 71 UniProt v12 Jul2007 Neosartorya fischeri NEOF 10403 95 UniProt v12 Jul2007 Neurospora crassa NEUC 10076 75 UniProt v12 Jul2007 Phaeosphaeria nodorum PHAN 16451 59 UniProt v12 Jul2007 Pichia guilliermondii PICG 5919 93 UniProt v12 Jul2007 Pichia stipitis PICS 5797 96 UniProt v12 Jul2007 Schizosaccharomyces pombe SCHP 5008 88 UniProt v12 Jul2007 Ustilago maydis USTM 6546 78 UniProt v12 Jul2007 Yarrowia lipolytica YARL 6525 81 UniProt v12 Jul2007 Saccharomyces cerevisiae YEAS 6214 88 UniProt v12 Jul2007

ÃOnly the longest transcript per gene was considered. ÃÃIncluding one crustacean. show the hierarchy level of the classifications and to selecting a radiation of interest on the phylogeny. allow easy navigation along the hierarchy we display Each result page provides a precompiled Bookmarklet, a the interactive species tree (Figure 1). The default level for snippet of JavaScript code that can be easily bookmarked an initial user query is set to fungi, or in the user browser, to allow direct query to a particular vertebrates and the level can be adjusted afterwards by phylogeny level. D274 Nucleic Acids Research, 2008, Vol. 36, Database issue

Figure 1. Example screenshot of the OrthoDB web interface (http://cegg.unige.ch/orthodb). The left panel enumerates the modes to query and browse the database by: (1) a keyword, (2) a user specified phylogenetic gene copy-number profile, (3) a common phylogenetic profile, or (4) search; the middle panel is reserved for displaying results; and the right panel accommodates help and query history messages.

Stable identifiers be done by activating the set of selectors next to the phylogenetic tree and specifying the ortholog copy- We assigned short identifiers using Noid utility to the number requirements in the species of interest, where ‘?’ generated orthologous groups that we will maintain unique across subsequent updates of the database to notation stands for no restriction (‘any number’) and ‘0’, allow stable references to the data. ‘1’, ‘>1’ are self explanatory. The ‘Filter’ button in the ‘Specify copy-number profile’ section will execute the Querying by keywords corresponding query. In addition, we provide a set of precompiled queries for phylogenic profiles of common The database is searchable by the relevant identifiers used interest (via the selection list) that are more complicated for the proteins or orthologous groups, as well as by to express, e.g. ‘all but one’ type: all single-copy orthologs keywords associated with the protein annotation in but allowing for a loss or run-away in one of the species, UniProt (25) or Ensembl (14). Currently the search is or multigene orthologs in all but one species, etc. This implemented as mySQL full-text index and the query is allows viewing of the gene clusters that have undergone interpreted in Boolean mode that allows use of ‘+’ and expansions or losses in the specific lineages, which is À ‘ ’operators to indicate that a word is required to be informative in the evolutionary context (29). These present or absent, respectively, for a match to occur; queries, as well as text search, are performed with respect à parentheses used to group words into subexpressions; ‘ ’ to the selected speciation root, marked by red on the serves as the wildcard operator; and a phrase is matched phylogenetic tree. literally if it is enclosed within quotes (e.g. ‘cytochrome c’). The results always refer to the relevant orthologous groups, not separate genes. Query by sequence homology Not all protein identifiers are widely known, particularly Filtering by phylogenic profile for automatically annotated genomes, and functional Another feature of the database interface is filtering annotations for many genes are still anticipated. orthologous groups by a phylogenic profile. This can We therefore provide data querying by sample sequences, Nucleic Acids Research, 2008, Vol. 36, Database issue D275 e.g. a user submitted sequence is matched using Blast 7. Duret,L., Mouchiroud,D. and Gouy,M. (1994) HOVERGEN: a against the collected proteomes, and the top five matches database of homologous vertebrate genes. Nucleic Acids Res., 22, 2360–2365. from distinct proteomes are shown to the user and used to 8. Li,H., Coghlan,A., Ruan,J., Coin,L.J., Heriche,J.K., Osmotherly,L., retrieve the associated orthologous groups, ranking by Li,R., Liu,T., Zhang,Z. et al. (2006) TreeFam: a curated database the number of hits to each group. Please note that if a of phylogenetic trees of animal gene families. Nucleic Acids Res., 34, sequence of an as yet unanalyzed species is used, the query D572–D580. 9. Tatusov,R.L., Fedorova,N.D., Jackson,J.D., Jacobs,A.R., will return the best matching ortholog cluster, however, Kiryutin,B., Koonin,E.V., Krylov,D.M., Mazumder,R., this may not be sufficient to assume orthology. Mekhedov,S.L. et al. (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics, 4, 41. Export of data 10. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and All results or particular groups can be retrieved as PSI-BLAST: a new generation of protein database search programs. tab-delimited text or as Fasta formatted protein sequences Nucleic Acids Res., 25, 3389–3402. 11. O’Brien,K.P., Remm,M. and Sonnhammer,E.L. (2005) with annotation of the orthologous group. Inparanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res., 33, 476–480. 12. Chen,F., Mackey,A.J., Vermunt,J.K. and Roos,D.S. (2007) FUTURE PERSPECTIVES Assessing performance of orthology detection strategies applied to eukaryotic genomes. PLoS ONE, 2, e383. All current approaches to identify orthologous groups 13. Chen,F., Mackey,A.J., Stoeckert,C.J. Jr and Roos,D.S. (2006) of genes have different deficiencies and there are ways to OrthoMCL-DB: querying a comprehensive multi-species collection improve their sensitivity and specificity. The implemented of ortholog groups. Nucleic Acids Res., 34, D363–D368. 14. Hubbard,T.J., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., infrastructure in principal does not depend on the Chen,Y., Clarke,L., Coates,G., Cunningham,F. et al. (2007) particular choice of the method, although our own imple- Ensembl 2007. Nucleic Acids Res., 35, D610–D617. mentation of a COG-like and Inparanoid-like ortholog 15. van der Heijden,R.T., Snel,B., van Noort,V. and Huynen,M.A. identification procedure seems to produce reliable results (2007) Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics, 8, 83. according to extensive checks in the frame of our previous 16. Jothi,R., Zotenko,E., Tasneem,A. and Przytycka,T.M. (2006) research projects. We plan also to test other available COCO-CL: hierarchical clustering of homology relations based on orthology delineation procedures. evolutionary correlations. Bioinformatics, 22, 779–788. 17. Merkeev,I.V., Novichkov,P.S. and Mironov,A.A. (2006) PHOG: a database of supergenomes built from proteome complements. BMC Evol. Biol., 6, 52. ACKNOWLEDGEMENTS 18. Merkeev,I.V. and Mironov,A.A. (2006) PHOG-BLAST–a new We thank Thomas Junier for technical support, Dr Stefan generation tool for fast similarity search of protein families. BMC Evol. Biol., 6, 51. Wyder, Dr Ivo Pedruzzi and Robert M Waterhouse for 19. Waterhouse,R.M., Kriventseva,E.V., Meister,S., Xi,Z., Alvarez,K.S., feedback and quality checks. We would like to acknowl- Bartholomay,L.C., Barillas-Mury,C., Bian,G., Blandin,S. et al. (2007) edge UniProt, Ensembl, Vectorbase, Flybase, the AAA Evolutionary dynamics of immune-related genes and pathways in fly genomes initiative, genome analysis consortiums disease-vector mosquitoes. Science, 316, 1738–1743. 20. Zdobnov,E.M. and Bork,P. (2007) Quantification of insect genome and sequencing centers Agencourt, Baylor, BGI, Broad divergence. Trends Genet., 23, 16–20. Institute, JCVI and JGI for the data. Foundation Giorgi- 21. Zdobnov,E.M., von Mering,C., Letunic,I., Torrents,D., Suyama,M., Cavaglieri and Swiss National Science Foundation are Copley,R.R., Christophides,G.K., Thomasova,D., Holt,R.A. et al. acknowledged for funding (SNF 3100A0-112588 to EMZ). (2002) Comparative genome and proteome analysis of Anopheles Funding to pay the Open Access publication charges gambiae and Drosophila melanogaster. Science, 298, 149–159. 22. Consortium. (2006) Insights into social insects from the genome of for this article was provided by SNF 3100A0-112588. the honeybee Apis mellifera. Nature, 443, 931–949. 23. Saebo,P.E., Andersen,S.M., Myrseth,J., Laerdahl,J.K. and Conflict of interest statement. None declared. Rognes,T. (2005) PARALIGN: rapid and sensitive sequence similarity searches powered by parallel computing technology. Nucleic Acids Res., 33, 535–539. REFERENCES 24. Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. 1. Fitch,W.M. (1970) Distinguishing homologous from analogous Bioinformatics, 22, 1658–1659. proteins. Syst. Zool., 19, 99–113. 25. Consortium. (2007) The Universal Protein Resource (UniProt). 2. Koonin,E.V. (2005) Orthologs, paralogs, and evolutionary Nucleic Acids Res., 35, D193–D197. genomics. Annu. Rev. Genet., 39, 309–338. 26. Castresana,J. (2000) Selection of conserved blocks from multiple 3. Sonnhammer,E.L. and Koonin,E.V. (2002) Orthology, paralogy alignments for their use in phylogenetic analysis. Mol. Biol. Evol., and proposed classification for paralog subtypes. Trends Genet., 18, 17, 540–552. 619–620. 27. Edgar,R.C. (2004) MUSCLE: a multiple sequence alignment 4. Dayhoff,M.O. (1976) The origin and evolution of protein method with reduced time and space complexity. BMC superfamilies. Fed. Proc., 35, 2132–2138. Bioinformatics, 5, 113. 5. Jones,C.H., Tatti,K.M. and Moran,C.P. Jr (1992) Effects of amino 28. Guindon,S., Lethiec,F., Duroux,P. and Gascuel,O. (2005) PHYML acid substitutions in the À10 binding region of sigma E from Online–a web server for fast maximum likelihood-based Bacillus subtilis. J. Bacteriol., 174, 6815–6821. phylogenetic inference. Nucleic Acids Res., 33, W557–559. 6. Henikoff,S. and Henikoff,J.G. (1992) Amino acid substitution 29. Wyder,S., Kriventseva,E.V., Schro¨der,R., Kadowaki,T. and matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, Zdobnov,E.M. (2007) Quantification of ortholog losses in insects 10915–10919. and vertebrates. Genome Biol., (in press).