Ensembl Genomes 2020––Enabling Non-Vertebrate Genomic Research Kevin L

Published online 10 October 2019 Nucleic Acids Research, 2020, Vol. 48, Database issue D689–D695 doi: 10.1093/nar/gkz890 Ensembl Genomes 2020––enabling non-vertebrate genomic research Kevin L. Howe 1, Bruno Contreras-Moreira1, Nishadi De Silva1, Gareth Maslen1, Wasiu Akanni1, James Allen1, Jorge Alvarez-Jarreta1, Matthieu Barba1, Dan M. Bolser 1, Lahcen Cambell1, Manuel Carbajo1, Marc Chakiachvili1, Mikkel Christensen1, Carla Cummins1, Alayne Cuzick2, Paul Davis1, Silvie Fexova1, Astrid Gall1, Nancy George1, Laurent Gil1, Parul Gupta3, Kim E. Hammond-Kosack 2, Erin Haskell1, Sarah E. Hunt 1, Pankaj Jaiswal 3, Sophie H. Janacek1, Paul J. Kersey1, Nick Langridge1, Uma Maheswari1, Thomas Maurel1, Mark D. McDowall1, Ben Moore1, Matthieu Muffato1, Guy Naamati1, Sushma Naithani 3, Andrew Olson4, Irene Papatheodorou1, Mateus Patricio1, Michael Paulini1, Helder Pedro 1,EmilyPerry 1, Justin Preece3, Marc Rosello1, Matthew Russell 1, Vasily Sitnik1, Daniel M. Staines1, Joshua Stein4, Marcela K. Tello-Ruiz4, Stephen J. Trevanion1, Martin Urban2, Sharon Wei4, Doreen Ware4,5, Gary Williams1, Andrew D. Yates 1 and Paul Flicek 1,* 1European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2Department of Biointeractions and Crop Protection, Rothamsted Research, Harpenden, Hertfordshire AL5 2JQ, UK, 3Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA, 4Cold Spring Harbor Laboratory, 1 Bungtown Rd, Cold Spring Harbor, NY 11724, USA and 5USDA ARS NAA Robert W. Holley Center for Agriculture and Health, Agricultural Research Service, Ithaca, NY 14853, USA Received September 14, 2019; Revised September 29, 2019; Editorial Decision September 30, 2019; Accepted October 02, 2019 ABSTRACT Ensembl project, which forms a key part of our fu- ture strategy for dealing with the increasing quantity Ensembl Genomes (http://www.ensemblgenomes. of available genome-scale data across the tree of life. org) is an integrating resource for genome-scale data from non-vertebrate species, complementing the resources for vertebrate genomics developed in the INTRODUCTION context of the Ensembl project (http://www.ensembl. The goal of Ensembl Genomes is to support and facili- org). Together, the two resources provide a consis- tate non-vertebrate genome research by integrating high- tent set of interfaces to genomic data across the quality reference genomes and annotation for every species tree of life, including reference genome sequence, for which this is available, providing tools and displays gene models, transcriptional data, genetic variation that allow users to interrogate and compare these genomes, and integrating with and connecting to other resources and comparative analysis. Data may be accessed providing non-vertebrate reference genomic data. To this via our website, online tools platform and program- end, we work closely with the Ensembl project (1)topro- matic interfaces, with updates made four times per duce five additional sites, each focused on a particular non- year (in synchrony with Ensembl). Here, we provide vertebrate clade of life: bacteria (http://bacteria.ensembl. an overview of Ensembl Genomes, with a focus on org), protists (http://protists.ensembl.org), fungi (http:// recent developments. These include the continued fungi.ensembl.org), plants (http://plants.ensembl.org)and growth, more robust and reproducible sets of ortho- invertebrate metazoa (http://metazoa.ensembl.org). These logues and paralogues, and enriched views of gene sites are updated four times a year, using the same platform expression and gene function in plants. Finally, we as, and in synchrony with, Ensembl. report on our continued deeper integration with the Core data available for all species include genome sequence and annotations of protein-coding and non-coding *To whom correspondence should be addressed. Tel: +44 1223 494468; Email: [email protected] C The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. D690 Nucleic Acids Research, 2020, Vol. 48, Database issue genes. Transcriptional data, genetic variation and com- tinued access to the website beyond when they have been parative analysis data are additionally available for many superseded by new releases. Archive releases to date are species. For most species, data are imported directly from 37 (e.g. http://eg37-plants.ensembl.org) and 40 (e.g. http: the archives of the International Nucleotide Sequence //eg40-fungi.ensembl.org). Database Collaboration (INSDC) (2), the European Vari- Finally, for programmatic access, we provide a compre- ation Archive (http://www.ebi.ac.uk/eva) and other open- hensive language-agnostic RESTful Application Program- access sources. For a few species of particular research or ming Interface (API). A significant update this year has socioeconomic importance, additional data sets are identi- been the consolidation of the APIs for Ensembl and En- fied and integrated. However, according to our agreement sembl Genomes into a single service (https://rest.ensembl. with NCBI and UCSC, we only present data on genome as- org), for the first time providing access to both vertebrate semblies that have been accessioned by the INSDC. and non-vertebrate data through a single programmatic in- We collaborate closely with a number of other inter- terface. A training manual for the consolidated API can national resources in various domains of life, including be found at https://www.ebi.ac.uk/training/online/course/ Gramene (http://www.gramene.org)(3) for plants, Vector- ensembl-rest-api. Base (http://www.vectorbase.org)(4) for invertebrate vectors of human pathogens and WormBase (http://www. NEW AND IMPROVED GENOMES wormbase.org)(5) for nematodes and flatworms. Together, we develop common data sets and integration/analysis The past 2 years have seen continued growth in our genome workflows, which are made available through both Ensembl collections (see Table 1). Contents of Ensembl Bacteria have Genomes and project-specific websites and services. been frozen while our focus has been on other divisions in this period. We are working with partners to develop a co- herent strategy for prokaryotes, especially in the context of ACCESS the growing availability of metagenomic data, and antici- All data in the resource are open access and can be ex- pate significant progress on this in the next year. plored in multiple ways. Interactive and visual access is pro- Of the 250 genomes added in Ensembl Fungi and En- vided through our websites. A key component of these sites sembl Protists, significant additions include Puccinia coro- is our embedded genome browser, enabling users to scroll nata, a new and fast-spreading pathogen of barley and oats; through a graphical representation of a DNA molecule at Phytophthora megakarya, an oomycete pathogen causing different levels of resolution and visualize various types of black pod disease in cocoa trees; and the ice-loving diatom annotations as tracks on the browser, including gene mod- Fragilariopsis cylindrus, found in Arctic and Antarctic sea- els, genomic variants, repetitive elements and homology se- water. quences aligned to the genome. We also display functional In metazoa, notable additions have been Folsomia can- information on genes and gene products, imported from dida and Orchesella cincta as examples of organisms impor- the UniProt Knowledgebase (6) and analysis of peptide se- tant for the assessment of soil quality; Bombus terrestris (the quence (using the classification tool InterProScan (7)). Var- buff-tailed bumblebee), a declining insect species; Daph- ious tools for text and sequence search, data upload and nia magna, which is used in studying toxicity (such as data analysis are provided, allowing users to examine their polystyrene nanoplastics, lead and pesticides) in aquatic en- own data in the context of the reference sequence and an- vironments; Teleopsis dalmanni (the stalk eyed fly), a model notation. organism for studying sexual selection mechanisms; and All data are made available as database dumps, and com- the disease vectors Leptotrombidium deliense (scrub typhus) mon data sets (e.g. DNA, RNA and protein sequence sets, and Culicoides sonorensis (bluetongue arbovirus in live- and sequence alignments) can be directly downloaded in stock). bulk via FTP (ftp://ftp.ensemblgenomes.org)inavarietyof In plants, we have significantly increased coverage of formats. For certain data types, the web application utilizes asterids, including Actinidia chinensis (kiwifruit), Coffea data files stored in archival resources (such as the European canephora (robusta coffee), Cynara cardunculus (globe arti- Nucleotide Archive), avoiding the need for database builds choke), Daucus carota (carrot), Capsicum annuum (hot pep- and improving the speed of response. These files are also per) and Helianthus annuus (sunflower). Elsewhere in the made available to users. taxonomy, we have included Arabidopsis halleri, a brassica Data are also made available through a series of data model for ecological genomics due to its heavy metal hyper- warehouses, optimized around common (gene- and variant- accumulation ability; and Marchantia polymorpha,aliver-

Ensembl Genomes 2020––Enabling Non-Vertebrate Genomic Research Kevin L

Ensembl Genomes: Extending Ensembl Across the Taxonomic Space P

Abstracts Genome 10K & Genome Science 29 Aug - 1 Sept 2017 Norwich Research Park, Norwich, Uk

The ELIXIR Core Data Resources: Fundamental Infrastructure for The

Annual Scientific Report 2013 on the Cover Structure 3Fof in the Protein Data Bank, Determined by Laponogov, I

Strategic Plan 2011-2016

Browsing Genomes with Ensembl Annotation

The Genomic Basis of Circadian and Circalunar Timing Adaptations in a Midge Tobias S

ALEXA: a Microarray Design Platform for Alternative Expression Analysis

Gene3d: Multi-Domain Annotations for Protein Sequence and Comparative Genome Analysis Jonathan G

What's New in Rnacentral and Rfam

Browsing Genes and Genomes with Ensembl

Rfam 14: Expanded Coverage of Metagenomic, Viral and Microrna Families Ioanna Kalvari, Eric P