Uniprot: the Universal Protein Knowledgebase the Uniprot Consortium1,2,3,4,*

D158–D169 Nucleic Acids Research, 2017, Vol. 45, Database issue Published online 28 November 2016 doi: 10.1093/nar/gkw1099 UniProt: the universal protein knowledgebase The UniProt Consortium1,2,3,4,* 1European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland, 3Protein Information Resource, Georgetown University Medical Center, 3300 Whitehaven Street NW, Suite 1200, WA 20007, USA and 4Protein Information Resource, University of Delaware, 15 Innovation Way, Suite 205, Newark DE 19711, USA Received September 19, 2016; Revised October 25, 2016; Editorial Decision October 26, 2016; Accepted November 05, 2016 ABSTRACT ple, ribosomal profiling experiments are telling us when and where ribosomes bind for translation while CHIP-seq and The UniProt knowledgebase is a large resource of CRAC give us information on the DNA and RNA binding protein sequences and associated detailed anno- properties of proteins. We are also gaining new insights into tation. The database contains over 60 million se- the mechanics of large assemblies of proteins through the in- quences, of which over half a million sequences credible strides being made in electron microscopy technol- have been curated by experts who critically review ogy. However, this wealth of molecular data will be worth experimental and predicted data for each protein. little without it being available to and interpretable by the The remainder are automatically annotated based on scientific community. rule systems that rely on the expert curated knowl- UniProt is a long-standing collection of databases edge. Since our last update in 2014, we have more that enable scientists to navigate the vast amount of than doubled the number of reference proteomes to sequence and functional information available for proteins. The UniProt Knowledgebase (UniProtKB) is the 5631, giving a greater coverage of taxonomic diver- central resource that combines UniProtKB/Swiss-Prot sity. We implemented a pipeline to remove redundant and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot con- highly similar proteomes that were causing exces- tains over 550 000 sequences that have been created by our sive redundancy in UniProt. The initial run of this expert biocuration team. For these entries experimental in- pipeline reduced the number of sequences in UniProt formation has been extracted from the literature and orga- by 47 million. For our users interested in the acces- nized and summarized, greatly easing scientists access to sory proteomes, we have made available sets of pan protein information. UniProtKB/TrEMBL provides a fur- proteome sequences that cover the diversity of se- ther 60 million sequences that have been largely derived quences for each species that is found in its strains from high throughput sequencing of DNA. These entries and sub-strains. To help interpretation of genomic are annotated by our rule based automatic annotation sys- variants, we provide tracks of detailed protein infor- tems. We also provide a series of UniRef databases that provide sequence sets trimmed at various levels of sequence mation for the major genome browsers. We provide identity (1,2). Finally we provide the UniProt Archive (Uni- a SPARQL endpoint that allows complex queries of Parc) that provides a complete set of known sequences, in- the more than 22 billion triples of data in UniProt cluding historical obsolete sequences (3). (http://sparql.uniprot.org/). UniProt resources can be In this article we describe the major developments that we accessed via the website at http://www.uniprot.org/. have made since our last update published in this journal in 2014 (4). These updates have helped to improve the scala- bility and usability of UniProt for our end users. We also INTRODUCTION include a call to all life science authors to include UniProt Protein science is entering a new era that promises to un- accession numbers for proteins within their papers to help lock many of the mysteries of the cell’s inner workings. Next improve the connections between the literature and scien- generation sequencing is transforming the way that we ac- tific databases. cess DNA information and, as the variety of protein assays that can be linked to a DNA or RNA read-out grows, we are gaining protein information at an increasing rate. For exam- *To whom correspondence should be addressed. Tel: +44 1223 494 100; Fax: +44 1223 494 468; Email: [email protected] Present address: Alex Bateman, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK. C The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2017, Vol. 45, Database issue D159 Growth of reference proteomes The UniProt proteome pages (http://www.uniprot.org/ proteomes/) provide access to proteomes for over 56 000 (56 317, release 2016 07) species with completely sequenced genomes. The majority of these proteomes are based on the translation of genome sequence submis- sions to the International Nucleotide Sequence Database Consortium (INSDC) but also include predictions from Ensembl, RefSeq reference genomes and vector/parasite- specific databases like VectorBase (5)andWormBasePar- aSite (6). Some proteomes may also include protein sequences based on high quality cDNAs that cannot be mapped to the current genome assemblies (due to sequencing errors or gaps) and have been manually reviewed follow- ing supporting evidence, and/or careful analysis of homol- ogous sequences from closely related organisms. Proteomes can include both manually reviewed (UniProtKB/Swiss- Prot) and unreviewed (UniProtKB/TrEMBL) entries. The proportion of reviewed entries varies between proteomes, and is obviously greater for the proteomes of intensively curated model organisms. For example, the proteomes for Figure 1. Growth of the number of sequences in UniProt databases. The Saccharomyces cerevisiae 288C and E. coli strain K12 con- blue line shows the growth in UniProtKB/TrEMBL entries from January sist entirely of reviewed entries. The unreviewed records 2010 to date. The sharp drop in UniProtKB entries corresponds to the proteome redundancy minimization (PRM) procedure implemented in March are updated by our automatic annotation systems for ev- 2015. Note that the post-PRM growth in UniProtKB is no longer exponen- ery release, ensuring consistency and up-to-date annota- tial. tions. To help in tracking and displaying proteomes, we use a proteome identifier that uniquely identifies the set of proteins corresponding to a single assembly of a completely se- PROGRESS AND NEW DEVELOPMENTS quenced genome. An example of a proteome can be found Identifying redundant proteomes here: http://www.uniprot.org/proteomes/UP000005640. It is important to realize that there may be additional The accelerating growth of sequenced genomes poses great records for a particular species which do not belong to challenges for databases with their major growth being the defined proteome. For more information, please read driven by sequencing of very similar and almost identical http://www.uniprot.org/help/proteome. strains of the same bacterial species. UniProtKB saw an ex- As the number of new proteomes increases, it is im- ponential growth in the 2014–2015 time period, with April portant to organize the data in a way that allows users 2014 seeing a very high increase, reaching a peak of 90 mil- to effectively navigate the proteomes while keeping rele- lion sequences with a high level of redundancy (e.g. 4080 vant and pertinent information easily accessible. To achieve proteomes of Staphylococcus aureus comprising 10.88 mil- this, UniProt provides reference proteomes to give a selec- lion proteins), see Figure 1. While this wealth of protein tion of species representing a broad coverage of the tree of information presents our users with new opportunities for life. These reference proteomes are selected via consulta- proteome-wide analysis and interpretation, it also creates tion with the research community or computationally deter- challenges in capturing, searching, preserving and present- mined from proteome clusters (7) using an algorithm that ing proteome data to the scientific community. In 2015, considers the best overall annotation score, an automati- we introduced a method to identify and remove highly re- cally calculated score which provides a heuristic measure of dundant bacterial proteomes within species groups, for ex- the annotation content of a proteome. There are currently ample, strains of Escherichia coli, in UniProtKB while still 5,631 reference proteomes representing a cross-section of making them available for users in UniParc. Briefly, the the taxonomic diversity found in UniProtKB (Figure 2). method identifies redundant proteomes by performing pair- They are the focus of both manual and automatic anno- wise alignments of sets of sequences for pairs of proteomes tation, aiming to provide the best annotated protein sets and subsequently applies graph theory to find dominating for the selected

Uniprot: the Universal Protein Knowledgebase the Uniprot Consortium1,2,3,4,*

Clinical Spectrum and Genetic Landscape for Hereditary Spastic

Analysis of the Impact of Sequencing Errors on Blast Using Fault Injection

Advancing Solutions to the Carbohydrate Sequencing Challenge † † † ‡ § Christopher J

Identification of the Drosophila Melanogaster Homolog of the Human Spastin Gene

Microtubule-Severing Enzymes at the Cutting Edge

"Phylogenetic Analysis of Protein Sequence Data Using The

A Deficiency Screen for Genetic Interactions with Spastin Overexpression Reveals a Role for P21-Activated Kinase 3

EMBL-EBI Powerpoint Presentation

A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide

Genomic Sequencing of SARS-Cov-2: a Guide to Implementation for Maximum Impact on Public Health

Supplementary Table S4. FGA Co-Expressed Gene List in LUAD

The Biogrid Interaction Database