<<

Article

High-quality protein knowledge resource: SWISS-PROT and TrEMBL.

O'DONOVAN, Claire, et al.

Abstract

SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy and a high level of integration with other databases. Together with its automatically annotated supplement TrEMBL, it provides a comprehensive and high-quality view of the current state of knowledge about proteins. Ongoing developments include the further improvement of functional and automatic annotation in the databases including evidence attribution with particular emphasis on the human, archaeal and bacterial proteomes and the provision of additional resources such as the International Protein Index (IPI) and XML format of SWISS-PROT and TrEMBL to the user community.

Reference

O'DONOVAN, Claire, et al. High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Briefings in , 2002, vol. 3, no. 3, p. 275-84

DOI : 10.1093/bib/3.3.275 PMID : 12230036

Available at: http://archive-ouverte.unige.ch/unige:40346

Disclaimer: layout of this document may differ from the published version.

1 / 1 Claire O’Donovan High-quality protein is the large-scale annotation coordinator and is responsible for the TrEMBL database knowledge resource: production at the EMBL Outstation – EBI. SWISS-PROT and TrEMBL Maria Jesus Martin coordinates software Claire O’Donovan, Maria Jesus Martin, Alexandre Gattiker, Elisabeth Gasteiger, Amos Bairoch development and is responsible and for the TrEMBL database Date received (in revised form): 25th June 2002 production at the EMBL Outstation – EBI. Alexandre Gattiker Abstract is undertaking a PhD in the SWISS-PROT is a curated protein sequence database which strives to provide a high level of SWISS-PROT group at the SIB. annotation (such as the description of the function of a protein, its domain structure, post- Elisabeth Gasteiger translational modifications, variants, etc.), a minimal level of redundancy and a high level of coordinates software integration with other databases. Together with its automatically annotated supplement development in the SWISS- PROT group at the SIB and is in TrEMBL, it provides a comprehensive and high-quality view of the current state of knowledge charge of the ExPASy server. about proteins. Ongoing developments include the further improvement of functional and Amos Bairoch automatic annotation in the databases including evidence attribution with particular emphasis heads the SWISS-PROT group on the human, archaeal and bacterial proteomes and the provision of additional resources such at the SIB and is a professor at as the International Protein Index (IPI) and XML format of SWISS-PROT and TrEMBL to the the Medical Biochemistry user community. Department of the University of Geneva. INTRODUCTION glycosylphosphatidylinositol (GPI)- Rolf Apweiler heads the SWISS-PROT, SWISS-PROT anchor; 1 TrEMBL and InterPro database SWISS-PROT is an annotated protein activities at the EMBL sequence database maintained by the • domains and sites, eg calcium-binding Outstation – EBI. Swiss Institute of Bioinformatics (SIB) and regions, ATP-binding sites, zinc the European Bioinformatics Institute fingers, homeoboxes, SH2 and SH3 Keywords: evidence (EBI). domains; attribution, protein sequence, The SWISS-PROT database functional annotation, distinguishes itself from other protein • secondary structure, eg alpha helix, beta automatic annotation sequence databases by three distinct sheet; criteria: (i) annotations, (ii) minimal redundancy and (iii) integration with • quaternary structure, eg homodimer, other databases. heterodimer;

• similarities to other proteins; Annotation In SWISS-PROT two classes of data can • disease(s) associated with deficiency(ies) be distinguished: the core data and the in the protein; annotation. For each sequence entry the core data consist of the sequence data, the • sequence conflicts, variants, etc. citation information and the taxonomic data, while the annotation consists of the Claire O’Donovan, EMBL Outstation – European description of the following items: As much annotation information as Bioinformatics Institute, possible is included in SWISS-PROT. To Wellcome Trust Genome Campus, • function(s) of the protein; obtain this information we use, in addition Hinxton, Cambridge CB10 1SD, UK to the publications reporting new sequence • post-translational modification(s), eg data, review articles to periodically update Tel: +44 (0) 1223 494 460 Fax: +44 (0) 1223 494 468 glycosylation, phosphorylation, the annotations of families or groups of E-mail: [email protected] acetylation, proteins. We also make use of external

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 2 7 5 O’Donovan et al.

experts who have been recruited to send us taxonomy and the PubMed literature their comments and updates concerning resource. specific groups of proteins.2 The systematic recourse, both to TrEMBL publications other than those reporting Owing to the increased data flow from the core data and to subject referees, genome projects to the sequence represents a unique and beneficial feature databases, SWISS-PROT faced a of SWISS-PROT. In SWISS-PROT, number of challenges to its time- and annotation is mainly found in the labour-intensive way of database comment lines (CC), in the feature table annotation. The rate-limiting step in the (FT) and in the keyword lines (KW). production of SWISS-PROT is the Comments are classified by ‘topics’ to careful and detailed annotation of every facilitate the easy retrieval of specific entry with information retrieved from categories of data from the database. the scientific literature and from rigorous sequence analysis. We do not wish to Minimal redundancy compromise on these standards but do Many sequence databases contain, for a wish to make new sequences available as given protein sequence, separate entries soon as possible. To address this, the which correspond to different literature EBI introduced TrEMBL (Translation of reports. In SWISS-PROT we try as much EMBL nucleotide sequence database) in as possible to merge all these data so as to 1996. TrEMBL1 consists of computer- minimise the redundancy of the database. annotated entries derived from the If conflicts exist between various translation of all coding sequences sequencing reports, they are indicated in (CDS) in the EMBL Nucleotide the feature table of the corresponding Sequence Database,4 which are not yet SWISS-PROT entry. integrated into SWISS-PROT. It is subdivided into two sections: SP- Integration with other databases TrEMBL, which contains those Each SWISS-PROT entry should be seen sequences that will eventually be as a central hub for the data available incorporated into SWISS-PROT, and Central hub of about each protein. It provides the core REM-TrEMBL, which contains the information data directly, but additionally links to all sequences that will not. These include relevant third party databases to provide immunoglobulins and access to the most comprehensive T-cell receptors, synthetic sequences, annotation for each protein. SWISS- patent application sequences, fragments PROT provides exhaustive cross- of fewer than eight amino acids and references to more than 43 external coding sequences where there is strong databases and is committed to increasing experimental evidence that the sequence 3 Exhaustive cross- this as more databases are developed. In does not code for a real protein. referencing addition to cross-references to the In addition, there is a weekly update to underlying DNA sequence database TrEMBL called TrEMBLnew. entries in the DDBJ/EMBL/GenBank TrEMBLnew is produced from new nucleotide sequence databases, cross- nucleotide sequences deposited in the references are derived from a number of EMBL nucleotide sequence database. At different resources. These include model each TrEMBL release, the annotation of organism databases, genome databases, the TrEMBLnew entries is upgraded, any signature databases, protein family entries redundant against TrEMBL/ characterisation databases, post- SWISS-PROT5 are merged and the translational modification (PTM) remainder then progress into TrEMBL. databases, 2D and 3D protein structure The same approach to extensive cross- databases, National Center for referencing described above for SWISS- Biotechnology Information (NCBI) PROT is implemented in TrEMBL.

276 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 High-quality protein knowledge resource

There is also a serious commitment to characterisation and annotation, which is enhancement of the annotation present in generated with limited human the TrEMBL entries through automatic interaction. To enhance the annotation of annotation, which is described in more uncharacterised protein sequences in detail below. TrEMBL, the SWISS-PROT/TrEMBL SWISS-PROT and TrEMBL are group at the EBI developed a novel available from the web sites.6 Figure 1 method for the prediction of functional shows how SWISS-PROT and TrEMBL information.7 This methodology for the have grown. automated large-scale functional annotation of proteins requires three components: ONGOING DEVELOPMENTS • A reference database must serve as the InterPro and automatic source of annotation. SWISS-PROT is functional annotation in used as the reference database because TrEMBL of its reliable, well-annotated and With the rapid growth of sequence standardised information. databases, there is an increasing need for Automatic functional reliable functional characterisation and • The highly diagnostic protein family information annotation of newly predicted proteins. signature database InterPro8 supplies To cope with such large data volumes, the means to assign proteins to groups. faster and more effective means of protein InterPro allows the reliable sequence characterisation and annotation classification of proteins into families are required. One promising approach is and the recognition of the domain automatic large-scale functional structure of multi-domain proteins.

Growth of TrEMBL and SWISS-PROT 900

800

700

600

500

400 Entries in 1000

300

200

100

0 Nov-96 May-97 Nov-97 May-98 Nov-98 May-99 Nov-99 May-00 Nov-00 May-01 Nov-01 May-02 Publication Date

SWISS-PROT TrEMBL

Figure 1: Growth of TrEMBL and SWISS-PROT

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 2 7 7 O’Donovan et al.

InterPro can classify currently around SWISS-PROT release, all annotation 70 per cent of all known protein present in TrEMBL applied by the sequences. This information is RuleBase system in the previous run is incorporated into SWISS-PROT and deleted and the rules are reapplied. TrEMBL in the form of database cross- Around 25 per cent of all TrEMBL references to InterPro and its member entries receive additional annotation by databases. applying RuleBase. We are committed to further extending the scope of and • The final component is a database improving the functional information called RuleBase that stores and manages provided by this methodology to give the annotation rules, their sources and users a comprehensive view over the their usage. output of sophisticated and well- maintained data-mining routines at a The actual flow of information during single glance. the automatic annotation can be divided into five steps: Human Initiative Flow of information (HPI) from known to • Use InterPro to extract the information The goal of the HPI project9 is to unknown proteins necessary to assign proteins to groups annotate all known human protein (‘conditions’) and store the conditions sequences according to the high-quality in RuleBase. standards of SWISS-PROT. This also includes concomitant annotation of • Group the proteins in SWISS-PROT orthologues of human proteins in other by the conditions. mammalian species, mainly rat and mouse.10 Each item of information, • Extract from SWISS-PROT the coming mostly from the scientific common annotation shared by all literature and predictions by sequence functionally characterised proteins of analysis software, is manually verified each group and store this common before integration into the database. The annotation together with its conditions work carried out currently in the in RuleBase. Now every rule consists framework of the HPI project is therefore of conditions and the annotation still highly dependent on the expert common to all proteins of this group manual annotation skills of SWISS- characterised by these conditions. PROT curators, who are assisted by external scientific experts, and sometimes • Group the unannotated TrEMBL by the authors of the original journal entries by the conditions stored in publications. RuleBase. For each known protein, a wealth of information is provided that includes the • Add the common annotation to the description of its function, domain unannotated TrEMBL entries. The structure, subcellular location, similarities Annotation of all known predicted annotation is flagged with to other proteins, etc. Particular emphasis human proteins evidence tags, which will allow users to is placed on PTMs, isoforms produced by recognise the predicted of the alternative splicing, polymorphisms and annotation as well as the original source disease-linked variants. The majority of of the inferred annotation. proteins are the target of PTMs, most of which are not efficiently predicted on the The RuleBase system used in TrEMBL basis of genomic information alone. More production includes more than 500 rules than a hundred different types of PTM are that are frequently applied on TrEMBL currently known. In SWISS-PROT proteins. At each run, all the rules are entries, the high level of diversity reviewed in the context of the current produced by PTM is mainly reflected at

278 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 High-quality protein knowledge resource

the level of feature tables where cleavage of microbial genome sequences. Today sites, modified residues and bonds are close to a hundred of these genomes are indicated. An additional level of available in public databases. Collectively complexity is introduced by differential they encode over 160,000 different alternative splicing. All known splice protein sequences. Such a large number of variants of a given protein are described. sequences makes classical manual As far as possible, the splice variant annotation an intractable task. The sequences are supplied and the effect on SWISS-PROT group at SIB therefore the protein function, subcellular location, initiated a project that aims to annotate etc. is discussed. This annotation is to be automatically a significant percentage of found in the feature table and in the proteins originating from microbial comment lines. The sequences of such genome sequencing projects. This project isoforms in SWISS-PROT and TrEMBL, is termed HAMAP (High-quality which can be reconstructed from the Automated Microbial Annotation of feature tables are extracted and made Proteomes). It differs from other currently available to the users in separate files.11 existing automatic annotation systems in While all human proteins are, in the that it does not try to hunt for distant context of this project, equally important, similarity or to annotate all potential a special effort has been made to annotate proteins originating from a microbial proteins encoded on chromosome 21, in a genome. Rather, it is being developed to fruitful collaboration with the group of deal specifically with two subsets of Prof. Stylianos Antonarakis (Division of bacterial and archaeal proteins: Medical Genetics of the University of Geneva Medical School) who has been • It will automatically annotate proteins involved in chromosome 21 sequencing12 that have no recognisable similarity to and is now pursuing the analysis of the any other microbial or non-microbial chromosome 21 transcriptome.13 At the proteins (these are generally called end of 2001, SWISS-PROT was ‘ORFans’). This task mainly implies completely synchronised with the current automatic recognition and annotation state of knowledge of proteins encoded of features such as signal sequences, on this chromosome. Currently, transmembrane domains, coiled-coil annotation efforts are focused on proteins regions, inteins and ATP/GTP-binding encoded by the other two chromosomes sites. that have been fully sequenced and whose encoded genes have been partially • The most challenging part of the annotated, namely chromosomes 20 and project is aimed at automatically 22, while keeping chromosome 21 up to annotating proteins that are part of date. SWISS-PROT release 40.16 of 2nd well-defined families or subfamilies. In May, 2002, contains 8,110 annotated most cases, these are well-characterised human sequences, and 2,216 additional protein families where it is possible, alternatively spliced isoforms. The using software tools, to automatically annotation of these proteins includes build a SWISS-PROT entry of a information about 13,174 variants and quality identical to that produced 20,832 PTMs. Up-to-date statistics are manually by an expert curator. In order available.14 to do this, a rule system is being built that describes, for each well-defined HAMAP (sub)family, the level and extent of Complete microbial In July 1995, the complete genomic annotation that can be assigned by proteomes sequence of a bacterium, Haemophilus similarity with a prototype manually influenzae, became available. That of an annotated entry. Such a rule system also archaeon, Methanoccocus jannaschii, quickly includes a carefully edited multiple followed it. It was the prelude to a flood alignment of the (sub)family.

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 2 7 9 O’Donovan et al.

In both cases described above, the idea genome paper and the ‘Complete is to annotate proteins to the highest level proteome’ keyword. Any annotation of quality. The programs in development improvements required to conform to are specifically designed to track down the SWISS-PROT standard for each ‘eccentric’ proteins. Among the proteome are implemented at this time. peculiarities recognised by the programs are: size discrepancy, absence or • Redundancy clean-up. For most divergence of regions involved in activity complete genomes, a number of or binding (to metals, nucleotides, etc.), protein sequences (or even a complete presence of paralogues, inconsistencies genome of a different strain) will have with the biological context (ie if a protein already been submitted to the databases. belongs to a pathway apparently absent in In this case, the entries are manually a particular organism), etc. Such merged, the appropriate references kept ‘problematic’ proteins are not annotated and possible conflicts in the sequence automatically and are flagged for further highlighted. In some cases where the analysis by SWISS-PROT expert evolutionary distance between curators. This allows curators in the sequences in two strains is too high, SWISS-PROT groups at SIB and EBI to they are not merged and are considered concentrate on the proteins that really to be different proteomes (eg need careful manual annotation. Helicobacter pylori strains J99 and 26695, Complete proteome While the manual preparation and Neisseria meningitidis serovars A and B). prioritisation and automatic annotation in the framework of processing HAMAP are limited to SWISS-PROT, • Similarity searches. At this point a there are a number of steps that need to number of similarity searches in the be carried out before a genome is run protein databases are necessary to assert through the HAMAP pipeline on the whether each protein is (i) an ORF level of the nucleotide sequence databases with no similarities except in very close and TrEMBL. In particular, the entry of species (ORFan), (ii) a member of a complete microbial genomes from characterised family or (iii) a protein TrEMBLnew into TrEMBL is prioritised. with similarities to proteins in the Although usually proteins move from database for which a function is known TrEMBLnew into TrEMBL only during or predicted, but that do not belong to quarterly TrEMBL releases, complete a family yet characterised in the context of microbial genomes are included in HAMAP. In case (i) no annotation can TrEMBL whenever their sequence be propagated to it, until further becomes available. It ensures that biochemical characterisation has given HAMAP curators and programs are able an insight into its function. However, a to work on the most recent data, with all number of prediction programs are run the intrinsic automated TrEMBL on the entry to detect putative signal annotation (accession numbers, sequences, transmembrane regions, and taxonomy, domain-related cross- other types of domains. In case (ii) the references, etc.) already present, and decision must be taken with caution, as several operations based on the individual it should not produce false positive analysis of a genome and its annotation in results if it is aimed to maintain the EMBL are then carried out. quality and consistency of the database. The method will take into account • TrEMBL processing. The nucleotide similarity to most or all members of a sequences are translated into family throughout the length of the TrEMBLnew entries, enter the protein. It will also use pattern-based TrEMBL protein database, and receive detection of family signatures to detect an accession number. At this point they the functional class of a protein. are recognisable by a reference to the However, not all families have a known

280 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 High-quality protein knowledge resource

function, hence the concept of UPFs databases usually combine data from a (uncharacterised protein families) that broad variety of sources. TrEMBL, in are interesting candidates for functional particular, contains data automatically or structural studies. In case (iii) the imported from the underlying DDBJ/ entry undergoes manual annotation. EMBL/GenBank coding sequences, The curators assert the relevance of the partial manual curation, data imported match, and if a set of reciprocally from other databases, data from specific homologous proteins is found, may use programs, and the results of automatic them as the seed of a new family. annotation systems. Although every effort is made to ensure correct and consistent • Automated sequence annotation. data, the data quality is often limited by The program PicoHamap has been the quality of the input data. Currently, it developed to carry out the annotation is often difficult for database users to automatically by reading instructions recognise where individual data items from a family rule, calling external come from. To address these issues, the prediction and characterisation EBI started, in June 2000, to add evidence programs, running multiple sequence tags to the internal version of TrEMBL alignments and writing a modified for two purposes: entry. • Data tracing to allow users to see where In addition to bacterial and archaeal each data item comes from and to assess proteomes, the use of HAMAP in the its reliability. context of organelles such as chloroplast and mitochondrion is envisaged. The • Data correction to allow database staff technologies and expertise that will be to automatically correct or update data. developed thanks to the HAMAP project will also be important for future Each evidence tag contains: automatic annotation projects in the framework of eukaryotic proteomes. • Type: the description of the data Already 570 rules containing alignments source. Examples: Curator (information and extensive annotation of protein added by a curator according to families have been created. Even though judgement), Similarity (added by a the HAMAP project is not yet fully curator due to sequence similarity to operational, it has facilitated the another entry), Rulebase (TrEMBL annotation of about 6,000 new microbial system for automatic annotation), sequence entries in 2001. Thus the EMBL (directly copied from the proportion of archaeal and bacterial annotation of the DDBJ/EMBL/ proteins in SWISS-PROT increased from GenBank coding sequence), 37 to 39 per cent. In 2002, it is planned to MGD_ADD (data imported from further improve the annotation of MGD). TrEMBL, to increase the number of protein family rules to 1,000 and to • Category: evidence types are grouped complete the automatic annotation into four major categories: curation, pipeline so that a significant fraction of data import, automatic annotation and every new microbial proteome can be predicted data from programs. annotated confidently by programs. For more information, please see the web • Initials: initials of the person who has site.15 last touched the data item. If a program has made the change, a dash is used. Evidence attribution Evidence tagging Evidence attribution is of growing • Attributes: one or more attributes, importance since large biomolecular which describe the data source, eg the

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 2 8 1 O’Donovan et al.

accession number of an entry in the between experimentally and external data source or the version computationally derived data. This is number of a program for already possible, in part, both in SWISS- transmembrane prediction. PROT and TrEMBL16 through the use of a number of qualifiers or status tags that • Date: The date of the last update or differentiate between experimental and confirmation of the data item. non-experimental data. However, the introduction of an evidence attribution This method allows each piece of system that allows the source of each data information to be easily traced back to its item to be easily traced while also easily source and, as each piece of data may have distinguishing between experimental and more than one evidence tag, this system computational data will increase the value caters for data items, which are derived of the SWISS-PROT/TrEMBL resource from multiple sources. This method will for external users and will also be of use to be further extended to allow tagging of us for effective error tracking and error known false information, eg if a keyword correction. It is intended that an generated by an automated annotation evidence-tagged version of the TrEMBL system is known to be wrong by curator database will be available within the year. judgment. This will allow the Please see the web site17 for more incorporation of feedback from users and information. We would welcome any curators to improve the rules for feedback from the user community. automatic annotation. All data items tagged with a ‘negative’ evidence tag will The International Protein be removed from the entry before Index (IPI) publication. IPI18 provides a top-level guide to the The use of evidence tags during the main databases that describe the human annotation process will allow users to proteome, namely SWISS-PROT, Distinguishing between trace the source of each data item added TrEMBL, RefSeq and Ensembl. IPI experimental and by a curator and to readily distinguish maintains a database of cross-references predicted data between experimental and predicted data. between the primary data sources with It will also prevent the overwriting by a the aim of providing a minimally program of any data, which has been redundant yet maximally complete set of edited by a curator. This means that it will human proteins (one sequence per be possible to use programs to add transcript). Stable identifiers (with information to a curated entry without incremental versioning) are maintained touching manually curated data items. within IPI facilitating the tracking of During the annotation process, curators sequences between IPI releases. will assess any information that has been IPI is produced automatically through added by a program, and if it is correct, mapping on the basis of protein similarity they will confirm it using the appropriate between the different data sets. Each IPI evidence tag. This will increase the entry consists of a cluster of related entries reliability of all program-added from the constituent databases, together information. It will also allow for with a sequence and a description line improvements to programs through taken from a master entry. The data are feedback from curators to programmers in presented in FastA format. IPI is also cases where curators disagree with the available in SWISS-PROT format. The results from a program. data in SWISS-PROT format contains The addition of evidence tags by additional cross-references linking IPI to computer programs and during the GO, HUGO, LocusLink and InterPro, literature-based curation process will and identifies the chromosome on which allow users to view the source of all data the gene encoding each IPI entry is items in each record and to distinguish found.

282 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 High-quality protein knowledge resource

SP-ML: the SWISS-PROT and SUMMARY TrEMBL XML format The combination of SWISS-PROT and SP-ML provides the users with an easily TrEMBL provides a comprehensive and parsable view on the rich data in these high-quality protein resource to the two databases. By using standards such as biological community. With the XML, XML Schema and Xlink, it continuing commitment to expert manual relieves developers from the task of functional curation in SWISS-PROT and creating their own parsers and conversion the expansion of automatic annotation in tools. The first draft release was made TrEMBL, the databases will continue to available to the bioinformatics community contribute significantly to the exploitation in February 2002. For more information, of the sequence avalanche. please see the web site.19 References Documentation files 1. Bairoch, A. and Apweiler, R. (2000), ‘The SWISS-PROT and TrEMBL are SWISS-PROT protein sequence database and distributed with a large number of its supplement TrEMBL in 2000’, Nucleic Acids documentation files. Some of these files Res., Vol. 28(1), pp. 45–48. have been available for a long time (the 2. URL: http://www.expasy.org/cgi-bin/ user manual, release notes, the various experts/ indices for authors, citations, keywords, 3. Gasteiger, E., Jung, E. and Bairoch, A. (2001), ‘SWISS-PROT: Connecting biological etc.), but many have been created knowledge via a protein database’, Curr. Issues recently and new files are always being Mol. Biol., Vol. 3, pp. 47–55. added. For a list of all documents that are 4. Stoesser, G., Baker, W., van der Broek, A. et currently available, please see the web al. (2001), ‘The EMBL Nucleotide Sequence sites.20 Database’, Nucleic Acids Res., Vol. 30(1), pp. 21–26. PRACTICAL 5. O’Donovan, C., Martin, M. J., Glemet, E. et al. (1999), ‘Removing redundancy in INFORMATION SWISS-PROT and TrEMBL’, Bioinformatics, Every week we also produce a complete Vol. 15, pp. 258–259. non-redundant protein sequence 6. URLs: http://www.ebi.ac.uk/swiss-prot/and collection SPTR by providing three http://www.expasy.org/sprot/ compressed files sprot.dat.gz, 7. Fleischmann, W., Moeller, S., Gateau, A. and trembl.dat.gz and trembl_new.dat.gz (in Apweiler, R. (1999), ‘A novel method for automatic functional annotation of proteins’, the directory /pub/databases/sp_tr_nrdb Bioinformatics, Vol. 15, pp. 228–233. on the EBI server and /databases/ 8. Apweiler, R., Attwood, T. K., Bairoch, A. sp_tr_nrdb on the ExPASy FTP). The et al. (2001), ‘The InterPro database, an files in this directory contain the weekly integrated documentation resource for protein release versions of SWISS-PROT and families, domains and functional sites’, Nucleic Acids Res., Vol. 29(1), pp. 37–40. TrEMBL. They immediately reflect the improvements that are continuously being 9. URL: http://www.expasy.org/sprot/hpi/ done to SWISS-PROT and TrEMBL. 10. O’Donovan, C., Apweiler, R. and Bairoch, A. We recommend using these versions (2001), ‘The human proteomics initiative’, Trends Biotechnol., Vol. 19(5), pp. 178–181. instead of SWISS-PROT/TrEMBL 11. Kersey, P., Hermjakob, H. and Apweiler, R. releases plus updates as they also reflect (2000), ‘VARSPLIC: alternatively-spliced merges between entries and deletions of protein sequences derived from SWISS- entries. In addition, if entries are moved PROT and TrEMBL’, Bioinformatics, Vol. from TrEMBL to SWISS-PROT, a 16(11), pp. 1048–1049. combination of the SWISS-PROT 12. Hattori, M., Fujiyama, A., Taylor, T. D. et al. (2000), ‘The DNA sequence of human release, SWISS-PROT updates and the chromosome 21’, Nature, Vol. 405, pp. TrEMBL release will contains these 311–319. entries twice, while they are appropriately 13. Reymond, A., Friedli, M., Neergaard removed from the sp_tr_nrdb versions. Henrichsen, C. et al. (2001), ‘From PREDs

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002 2 8 3 O’Donovan et al.

and open reading frames to cDNA isolation: environment’, J. Biotechnol., Vol. 78, pp. Revisiting the human chromosome 21 221–234. transcription map’, Genomics, Vol. 78, pp. 17. URL: ftp:»ftp.ebi.ac.uk/pub/databases/ 46–54. trembl/evidenceDocumentation.html 14. URL: http://www.expasy.org/sprot/hpi/ 18. URL: http://www.ebi.ac.uk/IPI/ hpi_stat.html 19. URL: http://www.ebi.ac.uk/swissprot/ 15. URL: http://www.expasy.org/sprot/hamap/ SP-ML/index.html 16. Junker, V., Contrino, S., Fleischmann, W. 20. URLs: http://www.expasy.org/sprot/ et al. (2000), ‘The role SWISS-PROT and sp_docu.html and http://www.ebi.ac.uk/ TrEMBL play in the genome research swissprot/Documents/

284 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 3. NO 3. 275–284. SEPTEMBER 2002