Protein Sequence Annotation in the Genomeera: the Annotation Concept of SWISS-PROT + TREMBL
Total Page:16
File Type:pdf, Size:1020Kb
From: ISMB-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved. Protein Sequence Annotation in the GenomeEra: The Annotation Concept of SWISS-PROT + TREMBL Rolf Apweiler, Alain Gateau(*), Sergio Contrino, Maria Jesus Martin, Vivien Junker, Claire O’Donovan, Fiona Lang, Nicoletta Mitaritonna, Stephanie Kappus, Amos Bairoch(*) The EMBLOutstation - The European Bioinformatics Institute Wellcome Trust Genome Campus Hinxton, Cambridge CB10 1SD, UK [email protected] (*) Department of Medical Biochemistry University of Geneva Geneva, Switzerland Abstract The SWISS-PROTdatabase distinguishes itself from other SWISS-PROTis a curated protein sequence database protein sequencedatabases by three distinct criteria: whichstrives to provide a high level of annotation, a minimallevel of redundancyand high level of integration Minimal Redundancy with other databases. Ongoinggenome sequencing projects have dramatically increased the number of protein Manysequence databases contain, for a given protein sequences to be incorporated into SWlSS-PROT.Since we sequence, separate entries which correspond to different do not want to dilute the quality standards of SWISS- literature reports. In SWISS-PROTwe try as much as PROTby incorporating sequences without proper possible to merge all these data so as to minimize the sequenceanalysis and annotation, we cannot speed up the redundancy of the database. If conflicts exist between incorporation of newincoming data indefinitely. However, various sequencing reports, they are indicated in the as we also wantto makethe sequencesavailable as fast as feature table of the correspondingentry. possible, we introduced TREMBL(TRanslation of EMBL nucleotide sequence database), a supplementto SWISS- PROT.TREMBL consists of computer-annotated entries Integration with other databases in SWlSS-PROTformat derived fromthe translation of all It is important to provide the users of biomolecular coding sequences(CDS) in the EMBLnucleotide sequence databases with a degree of integration between the three database, except for CDSalready included in SWISS- types of sequence-related databases (nucleic acid PROT.While TREMBLis already of immensevalue, its sequences, protein sequences and protein tertiary computer-generatedannotation does not matchthe quality structures) as well as with specialized data collections. of SWISS-PROTs.The main difference is in the protein SWISS-PROTis currently cross-referenced with 26 functional informationattached to sequences.With this in mind,we are dedicating substantial effort to developand different databases (Table 1). Cross-references are apply computer methods to enhance the functional provided in the form of pointers to information related to information attached to TREMBLentries. SW/SS-PROTentries and found in data collections other than SWISS-PROT. IntroductionI Annotation SWISS-PROT (Bairoch and Apweiler 1997) is One of SWISS-PROT’sleading concepts from the very annotated protein sequence database established in 1986 beginning was to provide far more than a simple and maintained collaboratively, since 1987, by the collection of protein sequences, but rather a critical view Department of Medical Biochemistry of the University of of what is known or postulated about each of these Geneva and the EMBLData Library (now the EMBL sequences. Outstation - The European Bioinformatics Institute (EBI) In SWISS-PROT,as in most other sequence databases, (Stoesser et al. 1997)). two classes of data can be distinguished: the core data and the annotation. For each sequence entry the core data 1Copyright© 1997, AmericanAssociation for Artificial Intelligence consists of the sequence data, the citation information (www.aaai.org).All fights reserved. (bibliographical references), and the taxonomic data (description of the biological source of the protein), while Apweiler 33 EMBLDatabase EMBLNucleotide Sequence Database DICTYDB Dictyostelium discoideum genome database ECO2DBASE Escherichia coli gene-protein database (2D gel spots) ECOGENE Escherichia coli K12 genomedatabase (EcoGene) ENZYME ENZYMEdata bank FLYBASE Drosophila genome database (FlyBase) GCRDB G-protein--coupled receptor database (GCRDb) HIV HIV sequence database HSC-2DPAGE Hare field Hospital 2Dgel protein databases HSSP Homology-derivedsecondary structure of proteins database (HSSP) LISTA Yeast (Saccharomyces cerevisiae) genome database MAIZEDB Maize genome database (MaizeDB) MAIZE-2DPAGE Maize genome2D Electrophoresis database MEDLINE Medline from the National Library of Medicine (NLM) MIM Mendelian Inheritance in ManDatabase PDB Brookhaven Protein Data Bank PIR Protein sequence database of the Protein Information Resource PROSITE PROSITEdictionary of sites and patterns in proteins REBASE Restriction enzymedatabase AARHUS/GHENT-DPAGE Humankeratinocyte 2D gel protein database from Aarhus and Ghent universities SGD Saccharomyces Genome Database STYGENE Salmonella typhimurium LT2 genome database (StyGene) SUBTILIST Bacillus subtilis 168 genomedatabase (SubtiList) SWISS-2DPAGE Human2D Gel Protein Database from the University of Geneva TRANSFAC Transcription factor database (Transfac) WORMPEP Caenorhabditis elegans genomesequencing project protein database 0Normpep) YEPD Yeast electrophoresis protein database Table 1: List of the databases cross-referenced to SWISS-PROT the annotation consists of the description of the following experts, who have been recruited to send us their items: commentsand updates concerning specific groups of o Function(s) of the protein proteins. Webelieve that our having systematic recourse o Post-translational modification(s). For example both to publications other than those reporting the core carbohydrates, phosphorylation, acetylation, GPI-anchor, data and to subject referees represents a unique and etc. beneficial feature of SWISS-PROT. o Domains and sites. For example calcium binding However, due to the increased data flow from genome regions, ATP-binding sites, zinc fingers, homeobox, projects to the sequence databases we face a number of kringle, etc. challenges to our way of database curation. The o Secondary structure attachment of biological knowledge abstracted from o Quaternary structure publications to the sequences is a skilled and labour- o Similarities to other proteins intensive task. Maintaining the high quality of sequence o Disease(s) associated with deficiencie(s) in the protein and annotation in SWISS-PROTrequires careful sequence o Sequenceconflicts, variants, etc. analysis and detailed curation of every entry. It is the rate- limiting step in the production of SWISS-PROT.The In SWISS-PROT,annotation is mainly found in the ever-increasing rate of determination of new sequences commentlines (CC), in the feature table (FT) and in requires new approaches if SWISS-PROTis to keep up. keyword lines (KW). We use a controlled vocabulary While we do not wish to relax the high editorial standards whenever possible; this approach permits the easy of SWISS-PROT,it is clear that there is a limit to how retrieval of specific categories of data from the database. much we can speed up such painstaking work. On the We include as much annotation as possible in SWISS- other hand, it is also vital that we makenew sequences PROT.To obtain this information we use, in addition to available as quickly as possible. To address this concern, the publications that report new sequence data, review we introduced, with SWISS-PROTrelease 33, TREMBL articles to periodically update the annotations of families (TRanslation of EMBLnucleotide sequence database), or groups of proteins. Wealso makeuse of external supplement to SWlSS-PROT. 34 ISMB-97 EMEL FLAT Figure 1: FII.~ PRODUCTION OF TREMBL SP-~L I 1 Psmdo.d~ < PROSITE P~ta, n Searching Apweiler 35 Fortunately, in the near future, Genbank(Benson et al. 1997), EMBL,DDBJ (Tateno and Gojobori 1997) Recent Developments: TREMBL SWISS-PROT are going to adopt a new common taxonomic scheme. The EMBLkeywords are included in the TREMBLentry, The Production of TREMBL but only when they match a subset of SWISS-PROT keywords which have the same meaning. This occurs only in cases where the EMBLentry has just one CDSso that Translation and Entry Creation no ambiguity is possible. Someextra keywords derived TREMBLconsists of computer-annotated entries in from the features and description lines are added. SWISS-PROTformat derived from the translation of all A subset of SWISS-PROTfeatures can be derived from coding sequences (CDS) in the EMBLnucleotide the EMBLentry features. These are: sequence database, except for CDSalready included in SWlSS-PROT.The production of TREMBLis illustrated SIGNALfrom sig_peptide in Figure 1. All the EMBLnucleotide sequence database m TRANSITfrom transit peptide divisions are scanned for CDSfeatures and these are m CHAINfrom mat_peptide translated to produce TREMBLdivision files containing VARIANTfrom allele, variation, misc_difference TREMBLentries in SWISS-PROTformat. and mutation The program to produce TREMBLis written in C and W CONFLICTfrom conflict makesuse of the srs4_02 library (Etzold and Argos 1993), which provides the basis for a first level parsing of EMBL Two examples of TREMBLentries, created in the way database entries. This level allows text data to fit in described before, are shownin Figure 2. In addition to this structures such as ordered lists of features or bibliographic information parsed into TREMBLentries, data is put in references, to assemble the coding sequences and to the annotator’s section of the entry, whichis