
High-quality protein knowledge resource: SWISS-PROT and TrEMBL

Claire O’Donovan, Maria Jesus Martin, Alexandre Gattiker, Elisabeth Gasteiger, Amos Bairoch and Rolf Apweiler

Date received (in revised form): 25th June 2002

Reference: O'DONOVAN, Claire, et al. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Briefings in Bioinformatics, 2002, vol. 3, no. 3, p. 275-84. DOI: 10.1093/bib/3.3.275. PMID: 12230036. Available at: http://archive-ouverte.unige.ch/unige:40346

© HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002

Claire O’Donovan is the large-scale annotation coordinator and is responsible for the TrEMBL database production at the EMBL Outstation – EBI.

Maria Jesus Martin coordinates software development and is responsible for the TrEMBL database production at the EMBL Outstation – EBI.

Alexandre Gattiker is undertaking a PhD in the SWISS-PROT group at the SIB.

Elisabeth Gasteiger coordinates software development in the SWISS-PROT group at the SIB and is in charge of the ExPASy server.

Amos Bairoch heads the SWISS-PROT group at the SIB and is a professor at the Medical Biochemistry Department of the University of Geneva.

Rolf Apweiler heads the SWISS-PROT, TrEMBL and InterPro database activities at the EMBL Outstation – EBI.

Keywords: evidence attribution, protein sequence, functional annotation, automatic annotation

Correspondence: Claire O’Donovan, EMBL Outstation – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Tel: +44 (0) 1223 494 460
Fax: +44 (0) 1223 494 468
E-mail: [email protected]

Abstract

SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Together with its automatically annotated supplement TrEMBL, it provides a comprehensive and high-quality view of the current state of knowledge about proteins. Ongoing developments include the further improvement of functional and automatic annotation in the databases including evidence attribution with particular emphasis on the human, archaeal and bacterial proteomes and the provision of additional resources such as the International Protein Index (IPI) and XML format of SWISS-PROT and TrEMBL to the user community.

INTRODUCTION

SWISS-PROT

SWISS-PROT¹ is an annotated protein sequence database maintained by the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI).

The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria: (i) annotations, (ii) minimal redundancy and (iii) integration with other databases.

Annotation

In SWISS-PROT two classes of data can be distinguished: the core data and the annotation. For each sequence entry the core data consist of the sequence data, the citation information and the taxonomic data, while the annotation consists of the description of the following items:

• function(s) of the protein;
• post-translational modification(s), eg glycosylation, phosphorylation, acetylation, glycosylphosphatidylinositol (GPI)-anchor;
• domains and sites, eg calcium-binding regions, ATP-binding sites, zinc fingers, homeoboxes, SH2 and SH3 domains;
• secondary structure, eg alpha helix, beta sheet;
• quaternary structure, eg homodimer, heterodimer;
• similarities to other proteins;
• disease(s) associated with deficiency(ies) in the protein;
• sequence conflicts, variants, etc.

As much annotation information as possible is included in SWISS-PROT. To obtain this information we use, in addition to the publications reporting new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts who have been recruited to send us their comments and updates concerning specific groups of proteins.² The systematic recourse, both to publications other than those reporting the core data and to subject referees, represents a unique and beneficial feature of SWISS-PROT. In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in the keyword lines (KW). Comments are classified by ‘topics’ to facilitate the easy retrieval of specific categories of data from the database.

Minimal redundancy

Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISS-PROT we try as much as possible to merge all these data so as to minimise the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding SWISS-PROT entry.

Integration with other databases

Each SWISS-PROT entry should be seen as a central hub for the data available about each protein. It provides the core data directly, but additionally links to all relevant third party databases to provide access to the most comprehensive annotation for each protein. SWISS-PROT provides exhaustive cross-references to more than 43 external databases and is committed to increasing this as more databases are developed.³ In addition to cross-references to the underlying DNA sequence database entries in the DDBJ/EMBL/GenBank nucleotide sequence databases, cross-references are derived from a number of different resources. These include model organism databases, genome databases, signature databases, protein family characterisation databases, post-translational modification (PTM) databases, 2D and 3D protein structure databases, National Center for Biotechnology Information (NCBI) taxonomy and the PubMed literature resource.

TrEMBL

Owing to the increased data flow from genome projects to the sequence databases, SWISS-PROT faced a number of challenges to its time- and labour-intensive way of database annotation. The rate-limiting step in the production of SWISS-PROT is the careful and detailed annotation of every entry with information retrieved from the scientific literature and from rigorous sequence analysis. We do not wish to compromise on these standards but do wish to make new sequences available as soon as possible. To address this, the EBI introduced TrEMBL (Translation of EMBL nucleotide sequence database) in 1996. TrEMBL¹ consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the EMBL Nucleotide Sequence Database,⁴ which are not yet integrated into SWISS-PROT. It is subdivided into two sections: SP-TrEMBL, which contains those sequences that will eventually be incorporated into SWISS-PROT, and REM-TrEMBL, which contains the sequences that will not. These include immunoglobulins and T-cell receptors, synthetic sequences, patent application sequences, fragments of fewer than eight amino acids and coding sequences where there is strong experimental evidence that the sequence does not code for a real protein.

In addition, there is a weekly update to TrEMBL called TrEMBLnew. TrEMBLnew is produced from new nucleotide sequences deposited in the EMBL nucleotide sequence database. At each TrEMBL release, the annotation of the TrEMBLnew entries is upgraded, any entries redundant against TrEMBL/SWISS-PROT⁵ are merged and the remainder then progress into TrEMBL.

The same approach to extensive cross-referencing described above for SWISS-PROT is implemented in TrEMBL. There is also a serious commitment to enhancement of the annotation present in the TrEMBL entries through automatic annotation, which is described in more detail below.

SWISS-PROT and TrEMBL are available from the web sites.⁶ Figure 1 shows how SWISS-PROT and TrEMBL have grown.

ONGOING DEVELOPMENTS

InterPro and automatic functional annotation in

[...] characterisation and annotation, which is generated with limited human interaction. To enhance the annotation of uncharacterised protein sequences in TrEMBL, the SWISS-PROT/TrEMBL group at the EBI developed a novel method for the prediction of functional information.⁷ This methodology for the automated large-scale functional annotation of proteins requires three components:

• A reference database must serve as the source of annotation. SWISS-PROT is used