The Quest for Orthologs Benchmark Service and Consensus Calls in 2020

Nucleic Acids Research, 2020 1 doi: 10.1093/nar/gkaa308 The Quest for Orthologs benchmark service and consensus calls in 2020 Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gkaa308/5831173 by University College London user on 13 May 2020 Adrian M. Altenhoff 1,2, Javier Garrayo-Ventas 3, Salvatore Cosentino 4, David Emms 5, Natasha M. Glover 1,6,7,AnaHernandez-Plaza´ 8, Yannis Nevers 1,6,7,9, Vicky Sundesha 3, Damian Szklarczyk1,10, Jose´ M. Fernandez´ 3, Laia Codo´ 3, the Quest for Orthologs Consortium, Josep Ll Gelpi 3,11, Jaime Huerta-Cepas 8, Wataru Iwasaki 4, Steven Kelly 5, Odile Lecompte 9, Matthieu Muffato 12, Maria J. Martin 12, Salvador Capella-Gutierrez 3, Paul D. Thomas 13, Erik Sonnhammer 14,* and Christophe Dessimoz 1,6,7,15,16,* 1SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland, 2ETH Zurich, Department of Computer Science, Zurich, Switzerland, 3Life Sciences Department, Barcelona Supercomputing Center (BSC), Barcelona, Spain, 4Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Tokyo, Japan, 5Department of Plant Sciences, University of Oxford, South Parks Road, Oxford, UK, 6Department of Computational Biology, University of Lausanne, Lausanne, Switzerland, 7Center for Integrative Genomics, University of Lausanne, Lausanne, Switzerland, 8Centro de Biotecnologia y Genomica de Plantas, Universidad Politecnica´ de Madrid (UPM) - Instituto Nacional de Investigacion´ y Tecnolog´ıa Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223, Pozuelo de Alarcon,´ Madrid, Spain, 9Department of Computer Science, ICube, UMR 7357, University of Strasbourg, CNRS, Fed´ eration´ de Medecine´ Translationnelle de Strasbourg, Strasbourg, France, 10Institute of Molecular Life Sciences, University of Zurich, Winterthurerstrasse 190, Zurich, 8057, Switzerland, 11Department of Biochemistry and Molecular Biomedicine. University of Barcelona. Barcelona, Spain, 12European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK, 13Division of Bioinformatics, Department of Preventive Medicine, University of Southern California, Los Angeles, USA, 14Science for Life Laboratory, Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden, 15Department of Genetics, Evolution & Environment, University College London, London, UK and 16Department of Computer Science, University College London, London, UK Received February 20, 2020; Revised April 16, 2020; Editorial Decision April 17, 2020; Accepted April 20, 2020 ABSTRACT tholog calls derived from public benchmark submissions are provided on the Alliance of Genome Re- The identiﬁcation of orthologs––genes in different sources website, the joint portal of NIH-funded model species which descended from the same gene in organism databases. their last common ancestor––is a prerequisite for many analyses in comparative genomics and molecular evolution. Numerous algorithms and resources INTRODUCTION have been conceived to address this problem, but The identification of orthologs––pairs of genes in different benchmarking and interpreting them is fraught with species which have evolved from a common gene in the last difﬁculties (need to compare them on a common in- ancestor of the species (1)––is essential for a wide range put dataset, absence of ground truth, computational of analyses in phylogenetics and comparative genomics (re- cost of calling orthologs). To address this, the Quest viewed in 2). However, because ortholog inference requires for Orthologs consortium maintains a reference set reconstructing evolutionary events which have taken place of proteomes and provides a web server for con- in single genes potentially hundreds of millions of years in tinuous orthology benchmarking (http://orthology. the past, inference can be a challenging task (reviewed in 3). benchmarkservice.org). Furthermore, consensus or- Forover10years,theQuest for Orthologs consortium has brought together orthology method developers, orthol- *To whom correspondence should be addressed. Tel: +41 21 692 4155; Email: [email protected] Correspondence may also be addressed to Erik Sonnhammer. Tel: +46 70 5586395; Email: [email protected] C The Author(s) 2020. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 2 Nucleic Acids Research, 2020 ogy and model organism database providers, and orthol- for download in different formats: the protein sequences ogy users to improve inference, provision and interpretation as FASTA and SeqXML files, CDS sequences for most of ortholog data (4–8). Among the main achievements of proteins as FASTA files, and, for an increasing number of the consortium are agreements on standard data file format species, the genomic locus coordinates are available in XML for orthology (9), an orthology ontology (10,11), curation format. Downloaded from https://academic.oup.com/nar/advance-article-abstract/doi/10.1093/nar/gkaa308/5831173 by University College London user on 13 May 2020 of a reference set of proteomes (5), curation of a reference The benchmark service aims to follow the yearly release species tree (12) and community-led benchmarking (13). cycle of the Reference Proteome datasets, with a lag of a few However, the available genomes and methods change months. Users are encouraged to always benchmark against over time, requiring both reference proteomes and bench- the latest available dataset. In this publication, we will dis- marking to be continuously updated and re-interpreted. cuss the results on the 2018 Reference Proteome dataset. Furthermore, since benchmarking requires gathering and comparing ortholog calls obtained from various methods, BENCHMARK SERVICE there is an opportunity to provide the community with consensus calls. As introduced in (13), the Quest for Orthologs benchmark Here, we provide updates of the Quest for Orthologs Ref- service takes ortholog predictions as input (in OrthoXML erence Proteomes and benchmark service, and introduce or tab-delimited text format) and returns performance esti- a new ortholog consensus service, which is integrated to mates on a variety of tests in comparison with other submis- the website of the Alliance for Genome Resources––the joint sions in the form of summary graphs and downloadable raw portal of NIH-funded model organism databases (14). results (Figure 1). Since its introduction, it has processed over 1000 user submissions. In the rest of this section, we describe improvements to the benchmark service, provide REFERENCE PROTEOMES an update of the results since the last publication, and con- Orthology benchmarking needs a consensus dataset of pro- clude with open problems in orthology benchmarking. teomes and common file formats to be used by the different orthology prediction methods for standardized bench- Improving the generalized species tree discordance test marks and result interpretation. Since 2011, the Quest for Orthologs (QfO) makes available a Reference Proteomes The most substantial change to the benchmarks themselves dataset, providing a representative protein for each gene relate to the generalized species tree discordance bench- in the genome of selected species. These datasets have mark. This benchmark assesses the quality of orthology been generated annually from the UniProt Knowledge- predictions by comparing gene trees built from predicted or- base (UniProtKB) (15) using an automatic gene-centric thologs with the underlying species phylogeny. The bench- pipeline which identifies all protein isoforms for a gene mark requires an undisputed, bifurcating species phylogeny. and selects the canonical protein sequence as represen- However, the set of species in the QfO proteome dataset in- tative of the set. The canonical sequences are typically cludes some unresolved nodes––either due to inference un- the longest isoform which best describes the sequence an- certainty or genuine biological complications (such as in- notations e.g. domains, isoforms, polymorphisms, post- complete lineage sorting, which causes bona fide local vari- translational modificationshttps://www.uniprot.org/help/ ( ation in the gene trees). When we introduced the benchmark canonical and isoforms). The QfO community has been (13), we showed how we can sample undisputed bifurcating working with UniProt over the years to identify species for species trees from the multifurcating QfO tree by randomly the reference data set, and now comprises well-annotated pruning the multifurcating nodes. In brief, we recursively model organisms, organisms of interest for biomedical re- traverse the multifurcating species tree and sample two ran- search and species broadly covering the tree of life for phy- dom subtrees whenever a visited node has more than two logenetic interpretation. The QfO Reference Proteomes are subtrees. This results in a fully bifurcating subtree. a manually compiled subset of the UniProt Reference Pro- Since the benchmark was introduced, we noticed some teomes and their generation is now an integral part of the biases in the resulting trees. Many gene trees covered only a UniProt release pipelines. few species, especially on the Last Universal Common An- The QfO

The Quest for Orthologs Benchmark Service and Consensus Calls in 2020

Applied Category Theory for Genomics – an Initiative

UNIVERSITY of CALIFORNIA SAN DIEGO Making Sense of Microbial

Standardised Benchmarking in the Quest for Orthologs

Multiscale Modeling in Systems Biology

OMA, a Comprehensive, Automated Project for the Identiﬁcation of Orthologs from Complete Genome Data: Introduction and First Achievements

REST API and the Packages for R and Python Omadb

Daniel Aalberts Scott Aa

1. Introduction

Annual Scientific Report 2011 Annual Scientific Report 2011 Designed and Produced by Pickeringhutchins Ltd

An Expanded Evaluation of Protein Function Prediction Methods Shows an Improvement in Accuracy

UCLA UCLA Electronic Theses and Dissertations

Ismb/Eccb 2019 Proceedings Papers Committee