The Universal Protein Resource (Uniprot) Amos Bairoch, Rolf Apweiler1,*, Cathy H
Total Page:16
File Type:pdf, Size:1020Kb
D154–D159 Nucleic Acids Research, 2005, Vol. 33, Database issue doi:10.1093/nar/gki070 The Universal Protein Resource (UniProt) Amos Bairoch, Rolf Apweiler1,*, Cathy H. Wu2, Winona C. Barker3, Brigitte Boeckmann, Serenella Ferro, Elisabeth Gasteiger, Hongzhan Huang2, Rodrigo Lopez1, Michele Magrane1, Maria J. Martin1, Darren A. Natale2, Claire O’Donovan1, Nicole Redaschi and Lai-Su L. Yeh3 Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 rue Michel Servet, 1211 Geneva 4, Switzerland, 1The EMBL Outstation—The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 2Department of Biochemistry and Molecular Biology and 3National Biomedical Research Foundation, Georgetown University Medical Center, 3900 Reservoir Road, NW, Box 571414, WA 20057-1414, USA Received September 14, 2004; Revised and Accepted October 5, 2004 ABSTRACT residue level mapping of sequences from the The Universal Protein Resource (UniProt) provides Protein Data Bank entries onto corresponding the scientific community with a single, centralized, UniProt entries. For convenient sequence searches authoritative resource for protein sequences and we provide the UniRef non-redundant sequence functional information. Formed by uniting the databases. The comprehensive UniParc database Swiss-Prot, TrEMBL and PIR protein database activ- stores the complete body of publicly available ities, the UniProt consortium produces three layers of protein sequence data. The UniProt databases can protein sequence databases: the UniProt Archive be accessed online (http://www.uniprot.org) or down- (UniParc), the UniProt Knowledgebase (UniProt) and loaded in several formats (ftp://ftp.uniprot.org/pub). the UniProt Reference (UniRef) databases. The New releases are published every two weeks. UniProt Knowledgebase is a comprehensive, fully classified, richly and accurately annotated protein sequence knowledgebase with extensive cross- INTRODUCTION references. This centrepiece consists of two Previously, Swiss-Prot + TrEMBL (1) and PIR-PSD (2) coex- sections: UniProt/Swiss-Prot, with fully, manually isted as protein databases with differing sequence coverage curated entries; and UniProt/TrEMBL, enriched with and annotation priorities. In 2002, the Swiss-Prot + TrEMBL automated classification and annotation. During groups at the SIB (Swiss Institute of Bioinformatics) and 2004, tens of thousands of Knowledgebase records EBI (European Bioinformatics Institute) and the PIR (Protein got manually annotated or updated; we introduced a Information Resource) group at Georgetown University new comment line topic: TOXIC DOSE to store infor- Medical Center and National Biomedical Research Founda- tion joined forces as the UniProt consortium (3). mation on the acute toxicity of a toxin; the UniProt The UniProt consortium maintains three database layers: keyword list got augmented by additional keywords; we improved the documentation of the keywords and (i) The UniProt Archive (UniParc) provides a stable, com- are continuously overhauling and standardizing prehensive, non-redundant sequence collection by storing the annotation of post-translational modifications. the complete body of publicly available protein sequence Furthermore, we introduced a new documentation data. (ii) The UniProt Knowledgebase (UniProt) provides the file of the strains and their synonyms. Many new central database of protein sequences with accurate, database cross-references were introduced and we consistent and rich sequence and functional annotation. started to make use of Digital Object Identifiers. (iii) The UniProt Reference (UniRef) databases provide We also achieved in collaboration with the non-redundant data collections based on the UniProt Macromolecular Structure Database group at EBI an Knowledgebase and UniParc in order to obtain complete improved integration with structural databases by coverage of sequence space at several resolutions. *To whom correspondence should be addressed: Tel: +44 0 1223 494435; Fax: +44 0 1223 494468; Email: [email protected] The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use permissions, please contact [email protected]. ª 2005, the authors Nucleic Acids Research, Vol. 33, Database issue ª Oxford University Press 2005; all rights reserved Nucleic Acids Research, 2005, Vol. 33, Database issue D155 THE UniProt ARCHIVE information), we attach other annotation information both manually and automatically. Although most protein sequence data are derived from the Manual annotation is performed by biologists and is based translation of DDBJ/EMBL/GenBank (4) sequences, primary on literature curation and sequence analysis. The annotation protein sequence data are also submitted directly to UniProt principles were described in detail previously (3,12). During or appear in patent applications or in entries from the Protein 2004, tens of thousands of records were manually annotated or Data Bank (PDB) (5). The UniParc (6) is designed to capture updated. We also have introduced a new comment (CC) line all available protein sequence data—not just from the topic: TOXIC DOSE. This topic is used to store information aforementioned databases, but also from sources such as on the poisoning potential (acute toxicity) of a toxin. Generally Ensembl (7), the International Protein Index (IPI) (8), RefSeq this topic holds information on the LD50 and PD50. LD stands (9), FlyBase (10) and WormBase (11). This combination of for ‘Lethal Dose’. LD50 is the amount of a toxin, given all at sources makes UniParc the most comprehensive publicly once, which causes the death of 50% (one-half) of a group of accessible, non-redundant protein sequence database test animals. PD50 stands for ‘Paralytic dose’. It is the amount available. of a toxin, which causes the paralysis of 50% of a group of UniParc represents each protein sequence once and only test animals. once, assigning it a unique UniParc identifier. The UniParc Examples: release 2.6 from September 2004 contained 4 375 775 unique sequences from 11 978 094 original source records. UniParc CC -!- TOXIC DOSE: PD50 is 1.72 mg/kg by injection in cross-references the accession numbers of the source data- blowfly larvae. bases, using flags to indicate the status of the entry in the CC -!- TOXIC DOSE: LD50 is 0.015 mg/kg by intravenous original source database, with ‘active’ indicating that the injection for sarafotoxin-A and sarafotoxin-B, and 0.3 mg/kg entry is still present in the source database and ‘obsolete’ for sarafotoxin-C. indicating that the entry no longer exists in the source data- base. A UniParc sequence version is incremented each time Automatic classification and annotation. Much progress the underlying sequence changes, making it possible to was made during 2004 in our attempt to provide automatic observe sequence changes in all source databases. A sample large-scale functional characterization and annotation, which UniParc report can be found at http://www.uniprot.org/entry/ is generated with limited human interaction. UPI0000000C37. UniParc records carry no annotation, but this information can be found in the UniProt Knowledgebase InterPro classification. We use InterPro (13) to recognize or other underlying databases. domains and to classify all the protein sequences in UniProt into families and superfamilies. InterPro is an integrated resource of protein families, domains and sites that amal- THE UniProt KNOWLEDGEBASE gamates the efforts of the member databases: Pfam (14), PROSITE (15), PRINTS (16), ProDom (17), SMART (18), The UniProt Knowledgebase merges Swiss-Prot, TrEMBL PIRSF (19), Superfamily (20) and TIGRFAMs (21). Approxi- and PIR-PSD to provide a central database of protein mately 80% of all UniProt Knowledgebase records are sequences with annotations and functional information. All classified according to their InterPro domains and familes. suitable PIR-PSD sequences missing from Swiss-Prot + TrEMBL were incorporated into UniProt and bi-directional Automatic functional annotation of UniProt/TrEMBL. For cross-references were created to allow the easy tracking of PIR- automatic annotation, systems for standardized transfer of PSD entries. The transfer into UniProt of references and annotation from well-characterized proteins in the UniProt/ experimentally verified data present in PIR but missing Swiss-Prot to non-annotated UniProt/TrEMBL entries have from Swiss-Prot + TrEMBL is ongoing. been implemented. RuleBase (22) uses a semi-automatic The UniProt Knowledgebase has two parts: a section of approach, while the Spearmint approach is completely auto- fully, manually annotated records resulting from literature mated and is based on decision trees (23). InterPro is then used information extraction and curator-evaluated computational to assign UniProt entries into groups. The annotation shared by analysis, and a section with computationally analysed records the functionally characterized UniProt/Swiss-Prot proteins of awaiting full manual