The European Bioinformatics Institute's Data Resources

© 2003 Oxford University Press. Nucleic Acids Research, 2003, Vol. 31, No. 1, 43–50. DOI: 10.1093/nar/gkg066

Catherine Brooksbank*, Evelyn Camon, Midori A. Harris, Michele Magrane, Maria Jesus Martin, Nicola Mulder, Claire O’Donovan, Helen Parkinson, Mary Ann Tuli, Rolf Apweiler, Ewan Birney, Alvis Brazma, Kim Henrick, Rodrigo Lopez, Guenter Stoesser, Peter Stoehr and Graham Cameron

EMBL–European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received October 10, 2002; Revised and Accepted October 14, 2002

*To whom correspondence should be addressed. Tel: +44 1223492525; Fax: +44 1223494468; Email: [email protected]
Correspondence may also be addressed to Rodrigo Lopez: Tel: +44 1223494423; Fax: +44 1223494468; Email: [email protected]

ABSTRACT

As the amount of biological data grows, so does the need for biologists to store and access this information in central repositories in a free and unambiguous manner. The European Bioinformatics Institute (EBI) hosts six core databases, which store information on DNA sequences (EMBL-Bank), protein sequences (SWISS-PROT and TrEMBL), protein structure (MSD), whole genomes (Ensembl) and gene expression (ArrayExpress). But just as a cell would be useless if it couldn’t transcribe DNA or translate RNA, our resources would be compromised if each existed in isolation. We have therefore developed a range of tools that not only facilitate the deposition and retrieval of biological information, but also allow users to carry out searches that reflect the interconnectedness of biological information. The EBI’s databases and tools are all available on our website at www.ebi.ac.uk.

INTRODUCTION: WHAT IS THE EBI?

The roots of the European Bioinformatics Institute (EBI) lie in the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Data Library (1), which was established in 1980 at the EMBL in Heidelberg, Germany and was the world’s first nucleotide sequence database. What began as a modest task of abstracting information from the literature soon became a major database activity with direct electronic submissions of data and the need for highly skilled informatics staff. The task grew with the start of the genome projects and, in 1992, the decision was taken to create a new EMBL Outstation dedicated to bioinformatics. The EBI was established on the Wellcome Trust Genome Campus in the United Kingdom, in close proximity to major sequencing efforts taking place at the Wellcome Trust Sanger Institute and the Human Genome Mapping Project (HGMP) Resource Centre.

The EBI fulfils three important functions: it provides some of the world’s largest and most heavily used bioinformatics services in a free and unrestricted manner; it carries out research; and it trains the bioinformaticians of the future. We will limit our discussion here to the EBI’s databases and their services (Fig. 1).

Figure 1. Summary of the services available at the EBI’s website.

DATABASES

The six core molecular databases hosted by the EBI reflect the methods used by biologists to collect information on how cells and organisms work. These store information on DNA and RNA sequences (EMBL-Bank), protein sequences (SWISS-PROT and TrEMBL), protein structure (MSD), whole genomes (Ensembl) and gene expression experiments (ArrayExpress).

All the EBI’s databases are annotated: features pertaining to gene structure and transcription mechanisms as well as those describing protein structure and function are stored in them and are predicted, inferred, interpreted and validated from many sources. Much of this annotation is performed by highly qualified biologists and the automated annotation that we do is subjected to rigorous quality control. Furthermore, our databases are extensively cross-referenced: our curators have added over 5 million database cross-references. This makes it straightforward for users to determine the relationships that exist between the different types of molecules represented in the EBI’s databases, as well as providing easy ways for linking out to resources that are provided by our collaborators at other institutions.

DNA AND RNA SEQUENCES

It would now be impossible for molecular biologists to do their research without free access to nucleotide sequence data. EMBL-Bank is Europe’s primary nucleotide sequence resource. It is produced in an international collaboration with GenBank in the USA (2) and the DNA Database of Japan (DDBJ) (3). Each of the three groups collects a proportion of the total sequence data reported worldwide, and all new and updated database entries are exchanged between them on a daily basis.

EMBL-Bank contains tens of millions of DNA and RNA sequences, ranging from as few as ten base pairs to entire genomes. Its sequences come from three main sources: individual research groups, genome-sequencing projects and patent applications. EMBL-Bank is a primary database: the submitters own the data. Although large-scale sequencing projects have now become EMBL-Bank’s main source of new sequence data, the importance of short sequences submitted by individual researchers should not be underestimated. These submissions often embody the results of detailed research into gene function and contribute to the body of knowledge to an extent that belies the modest volume of data. EMBL-Bank’s curators make sure that this information meets the high standards of annotation required for public distribution, and these highly annotated sequences are often updated as more knowledge about them is determined by the original authors (or other contributors; see below). By contrast, genome-sequencing projects are usually long-term collaborations between a sequencing group and EMBL-Bank, resulting in huge amounts of automatically annotated data being submitted directly to the database. Finally, EMBL-Bank has a direct pipeline from the European Patent Office, so all sequences in patent applications are incorporated into EMBL-Bank as soon as they become publicly available.

EMBL-Bank has been growing exponentially for more than ten years now. More than 38% of the total nucleotide count in the database is from the human genome alone. This is followed closely by mouse and rat, which account for another 40%. Owing to their economic importance, plant genomes, particularly that of rice, now form a significant proportion of EMBL-Bank’s content, with three different strains available in the database. Model organisms such as the fruit fly and the nematode represent 2% and 3% of total nucleotide count, respectively. Rapid and accurate DNA sequencing has also significantly enhanced the fight against disease. Most of the completed genomes available in EMBL-Bank are from sequencing projects that aim to deal with cancer, hunger (through the plant-genome-sequencing projects) and infectious disease. More than 100 bacterial genomes have been completed in the past two years and EMBL-Bank contains more than 2400 viral genomes.

Accessing EMBL-Bank

The main ways to access EMBL-Bank are through a query tool called the Sequence Retrieval System (SRS; see later) (4), which is available from the EBI at srs.ebi.ac.uk/, or by downloading data from the EBI’s FTP server. Data can be retrieved from any of the three collaborating databases, regardless of which one it was submitted to. The EBI also provides a range of tools (Table 1) so that users can run sequence similarity searches against EMBL-Bank. At each major release of the database a DVD set is produced with the compressed content of the database. This set is available on request from [email protected].

The EBI’s Genome Webserver (www.ebi.ac.uk/genomes/) allows users to browse all the completed genomes that have been submitted to EMBL-Bank. Unfinished genomic data can also be accessed using the EBI’s Genome Monitoring Table (Genome MOT) (5), in which the data are sorted by chromosome. Genome MOT is updated daily and provides direct access to individual EMBL-Bank entries.

Submitting data to EMBL-Bank

Most journals now require authors to submit sequence information to EMBL-Bank, GenBank or DDBJ before publication. This maximizes its benefit to research, by ensuring that it will always be available and readily accessible to scientists. EMBL-Bank assigns each submitted sequence an accession number that permanently identifies it, and that authors include when they publish a sequence. This procedure ensures availability and distribution of new sequence data in a timely fashion. Upon submission, authors can either make their sequence publicly available immediately or wait until their paper is published.

Table 1. A selection of data-access tools available at the EBI

Tool: Description

Homology and similarity
Fasta and blast tools: Sequence similarity and homology searching against nucleotide and protein databases. Fasta can be very specific when identifying long regions of low similarity, especially for highly diverged sequences. Blast is fast and sensitive, yielding functional and evolutionary clues about the structure and function of your query sequence.
Genome and Proteome Fasta3 server: Sequence similarity and homology searching against complete proteome or genome databases using Fasta.
SNP-Fasta3 server: Fasta3 searches of the European database of single nucleotide polymorphisms (HGVbase).
MPsrch: Aneda’s very fast implementation of the true Smith and Waterman algorithm for biological sequence comparison. MPsrch identifies hits often missed by other database-searching methods.
Scanps2.3: Scan Protein Sequence: fast implementation of the true Smith and Waterman algorithm for protein database searches.

Sequence analysis
ClustalW: Produces biologically meaningful multiple sequence alignments of divergent sequences.
GeneMark: A gene-prediction service that assesses the coding potential of DNA sequences by using Markov models of coding and non-coding regions within a sliding window.
Genetic Code Viewer: Highlights genetic code differences in different organisms.

Protein functional analysis
CluSTr Search: Search the EBI’s protein sequence databases (SWISS-PROT and TrEMBL) by accession numbers.
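Table 1 describes MPsrch and Scanps as implementations of the Smith and Waterman algorithm. To show what that local-alignment method computes, here is a minimal Python sketch. It is not the code behind either tool: the function name smith_waterman, the scoring values (match = 2, mismatch = -1) and the linear gap penalty of -2 are assumptions chosen only for illustration, whereas production servers typically use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for one best local alignment.

    Scoring values are illustrative assumptions, not those of any EBI tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1]
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 is what makes the alignment local:
            # a poor prefix is dropped rather than carried forward.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (8, 'ACGT', 'ACGT')
```

The full dynamic-programming matrix is filled for every query and database sequence pair, so the method is exhaustive but costs time proportional to the product of the two sequence lengths. Heuristic tools such as Fasta and Blast trade away some of that exhaustiveness for speed, which is consistent with the table's remark that MPsrch finds hits other methods miss.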
