From: ISMB-97 Proceedings. Copyright © 1997, AAAI (www.aaai.org). All rights reserved.

Sequence Annotation in the Genome Era: The Annotation Concept of SWISS-PROT + TREMBL

Rolf Apweiler, Alain Gateau(*), Sergio Contrino, Maria Jesus Martin, Vivien Junker, Claire O’Donovan, Fiona Lang, Nicoletta Mitaritonna, Stephanie Kappus, (*)

The EMBL Outstation - The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK [email protected]

(*) Department of Medical Biochemistry, University of Geneva, Geneva, Switzerland

Abstract

SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. Ongoing genome sequencing projects have dramatically increased the number of protein sequences to be incorporated into SWISS-PROT. Since we do not want to dilute the quality standards of SWISS-PROT by incorporating sequences without proper sequence analysis and annotation, we cannot speed up the incorporation of new incoming data indefinitely. However, as we also want to make the sequences available as fast as possible, we introduced TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT. TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. While TREMBL is already of immense value, its computer-generated annotation does not match the quality of SWISS-PROT's. The main difference is in the protein functional information attached to sequences. With this in mind, we are dedicating substantial effort to develop and apply computer methods to enhance the functional information attached to TREMBL entries.

Introduction

SWISS-PROT (Bairoch and Apweiler 1997) is an annotated protein sequence database established in 1986 and maintained collaboratively, since 1987, by the Department of Medical Biochemistry of the University of Geneva and the EMBL Data Library (now the EMBL Outstation - The European Bioinformatics Institute (EBI) (Stoesser et al. 1997)).

The SWISS-PROT database distinguishes itself from other protein sequence databases by three distinct criteria:

Minimal Redundancy
Many sequence databases contain, for a given protein sequence, separate entries which correspond to different literature reports. In SWISS-PROT we try as much as possible to merge all these data so as to minimize the redundancy of the database. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry.

Integration with other databases
It is important to provide the users of biomolecular databases with a degree of integration between the three types of sequence-related databases (nucleic acid sequences, protein sequences and protein tertiary structures) as well as with specialized data collections. SWISS-PROT is currently cross-referenced with 26 different databases (Table 1). Cross-references are provided in the form of pointers to information related to SWISS-PROT entries and found in data collections other than SWISS-PROT.

Annotation
One of SWISS-PROT's leading concepts from the very beginning was to provide far more than a simple collection of protein sequences, but rather a critical view of what is known or postulated about each of these sequences.
In SWISS-PROT, as in most other sequence databases, two classes of data can be distinguished: the core data and the annotation. For each sequence entry the core data consists of the sequence data, the citation information (bibliographical references), and the taxonomic data (description of the biological source of the protein), while

Database               Description
EMBL                   EMBL Nucleotide Sequence Database
DICTYDB                Dictyostelium discoideum genome database
ECO2DBASE              Escherichia coli gene-protein database (2D gel spots)
ECOGENE                Escherichia coli K12 genome database (EcoGene)
ENZYME                 ENZYME data bank
FLYBASE                Drosophila genome database (FlyBase)
GCRDB                  G-protein-coupled receptor database (GCRDb)
HIV                    HIV sequence database
HSC-2DPAGE             Harefield Hospital 2D gel protein databases
HSSP                   Homology-derived secondary structure of proteins database (HSSP)
LISTA                  Yeast (Saccharomyces cerevisiae) genome database
MAIZEDB                Maize genome database (MaizeDB)
MAIZE-2DPAGE           Maize genome 2D Electrophoresis database
MEDLINE                Medline from the National Library of Medicine (NLM)
MIM                    Mendelian Inheritance in Man Database
PDB                    Brookhaven Protein Data Bank
PIR                    Protein sequence database of the Protein Information Resource
PROSITE                PROSITE dictionary of sites and patterns in proteins
REBASE                 Restriction enzyme database
AARHUS/GHENT-2DPAGE    Human keratinocyte 2D gel protein database from Aarhus and Ghent universities
SGD                    Saccharomyces Genome Database
STYGENE                Salmonella typhimurium LT2 genome database (StyGene)
SUBTILIST              Bacillus subtilis 168 genome database (SubtiList)
SWISS-2DPAGE           Human 2D Gel Protein Database from the University of Geneva
TRANSFAC               Transcription factor database (Transfac)
WORMPEP                Caenorhabditis elegans genome sequencing project protein database (Wormpep)
YEPD                   Yeast electrophoresis protein database

Table 1: List of the databases cross-referenced to SWISS-PROT

the annotation consists of the description of the following items:

o Function(s) of the protein
o Post-translational modification(s). For example carbohydrates, phosphorylation, acetylation, GPI-anchor, etc.
o Domains and sites. For example calcium binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
o Secondary structure
o Quaternary structure
o Similarities to other proteins
o Disease(s) associated with deficiency(ies) in the protein
o Sequence conflicts, variants, etc.

In SWISS-PROT, annotation is mainly found in the comment lines (CC), in the feature table (FT) and in keyword lines (KW). We use a controlled vocabulary whenever possible; this approach permits the easy retrieval of specific categories of data from the database. We include as much annotation as possible in SWISS-PROT. To obtain this information we use, in addition to the publications that report new sequence data, review articles to periodically update the annotations of families or groups of proteins. We also make use of external experts, who have been recruited to send us their comments and updates concerning specific groups of proteins. We believe that our having systematic recourse both to publications other than those reporting the core data and to subject referees represents a unique and beneficial feature of SWISS-PROT.

However, due to the increased data flow from genome projects to the sequence databases we face a number of challenges to our way of database curation. The attachment of biological knowledge abstracted from publications to the sequences is a skilled and labour-intensive task. Maintaining the high quality of sequence and annotation in SWISS-PROT requires careful sequence analysis and detailed curation of every entry. It is the rate-limiting step in the production of SWISS-PROT. The ever-increasing rate of determination of new sequences requires new approaches if SWISS-PROT is to keep up. While we do not wish to relax the high editorial standards of SWISS-PROT, it is clear that there is a limit to how much we can speed up such painstaking work. On the other hand, it is also vital that we make new sequences available as quickly as possible. To address this concern, we introduced, with SWISS-PROT release 33, TREMBL (TRanslation of EMBL nucleotide sequence database), a supplement to SWISS-PROT.

Figure 1: Production of TREMBL (flow diagram starting from the EMBL flat files; diagram not reproduced, legible labels include pseudo.dat and PROSITE pattern searching)

Recent Developments: TREMBL

The Production of TREMBL

Translation and Entry Creation
TREMBL consists of computer-annotated entries in SWISS-PROT format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for CDS already included in SWISS-PROT. The production of TREMBL is illustrated in Figure 1. All the EMBL nucleotide sequence database divisions are scanned for CDS features and these are translated to produce TREMBL division files containing TREMBL entries in SWISS-PROT format.
The program to produce TREMBL is written in C and makes use of the srs4_02 library (Etzold and Argos 1993), which provides the basis for a first level parsing of EMBL database entries. This level allows text data to fit in structures such as ordered lists of features or bibliographic references, to assemble the coding sequences and to translate them. It is not possible to rely on the /translation qualifier in the EMBL database entries, since the TREMBL production program has to report extra features like conflicts and variants at the amino acid level. Each CDS leading to a correct translation results in one entry whose ID is the PID of the CDS. In the next step the structures are scanned to extract relevant data, to filter it and eventually to insert it properly formatted into the TREMBL entry. Only bibliographic references relevant to the given CDS are kept in the TREMBL entry. This is achieved by scanning the RP (Reference Position) lines of the EMBL entry and matching with the CDS position in the sequence. The RC (Reference Comment) line is built by assigning the SWISS-PROT equivalent of the following EMBL qualifiers:

    "/plasmid",      "PLASMID=",
    "/strain",       "STRAIN=",
    "/isolate",      "STRAIN=",       (2nd choice)
    "/cultivar",     "STRAIN=CV. "
    "/tissue_type",  "TISSUE=",
    "/transposon",   "TRANSPOSON=",

The description line comes from the /product qualifier, when present, otherwise we make use of the EMBL DE line, the /gene and /note qualifiers. The EMBL DE line is only considered if the EMBL entry holds only one CDS and is stripped of non-pertinent information such as the organism name, or things like 'complete cds'. The /gene qualifier is also used for the TREMBL GN line.
At the moment, because the EMBL and SWISS-PROT taxonomies are slightly different, we use equivalence tables to assign OS and OC lines in the entries. Where no equivalent is found, the EMBL OS and OC lines are kept. Fortunately, in the near future, GenBank (Benson et al. 1997), EMBL, DDBJ (Tateno and Gojobori 1997) and SWISS-PROT are going to adopt a new common taxonomic scheme.
The EMBL keywords are included in the TREMBL entry, but only when they match a subset of SWISS-PROT keywords which have the same meaning. This occurs only in cases where the EMBL entry has just one CDS so that no ambiguity is possible. Some extra keywords derived from the features and description lines are added.
A subset of SWISS-PROT features can be derived from the EMBL entry features. These are:

o SIGNAL from sig_peptide
o TRANSIT from transit_peptide
o CHAIN from mat_peptide
o VARIANT from allele, variation, misc_difference and mutation
o CONFLICT from conflict

Two examples of TREMBL entries, created in the way described before, are shown in Figure 2. In addition to this information parsed into TREMBL entries, data is put in the annotator's section of the entry, which is not visible to the public. This is used for further analysis both by programs and by biologists and consists of:

o The EMBL entry description lines
o EMBL CC lines
o Bibliographic reference titles
o Full CDS feature text
o Full text of other relevant features within the CDS range
o Number of CDS in the EMBL entry
o The date of the last entry update
o Information if the organism already exists in SWISS-PROT

Sorting the Entries
In the process of building TREMBL, different types of entries are put into different output files:

o CDS with a /dbxref="SWISS-PROT" or a /dbxref="SPTREMBL" are not translated (already in SWISS-PROT + TREMBL)
o CDS from mhc genes -> mhc.dat
o CDS from patent data -> patent.dat
o CDS from immunoglobulins and t-cell receptors -> immuno.dat
o CDS smaller than 8 amino acids -> smalls.dat
o CDS from artificial, synthetic or chimeric genes -> synthetic.dat
o CDS from pseudogenes -> pseudo.dat
o remaining CDS -> stay in their relative taxonomic TREMBL divisions
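The sorting rules listed above can be illustrated with a short Python sketch. It is not the production program, which the text describes as written in C. The `Cds` record, its boolean flags and the function name `route_cds` are hypothetical, and applying the rules in the order in which they are listed is an assumption, since the text does not state a precedence.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Cds:
    """Hypothetical, simplified view of one EMBL CDS feature."""
    division: str                      # EMBL division code, e.g. "hum"
    length_aa: int                     # length of the translation in amino acids
    dbxrefs: Set[str] = field(default_factory=set)
    is_mhc: bool = False
    is_patent: bool = False
    is_immuno: bool = False            # immunoglobulin or T-cell receptor
    is_synthetic: bool = False         # artificial, synthetic or chimeric gene
    is_pseudogene: bool = False


def route_cds(cds: Cds) -> Optional[str]:
    """Return the output file for a CDS, or None if it is not translated.

    Assumption: the rules are tested in the order given in the text.
    """
    # Already present in SWISS-PROT + TREMBL: skip translation entirely.
    if cds.dbxrefs & {"SWISS-PROT", "SPTREMBL"}:
        return None
    if cds.is_mhc:
        return "mhc.dat"
    if cds.is_patent:
        return "patent.dat"
    if cds.is_immuno:
        return "immuno.dat"
    if cds.length_aa < 8:
        return "smalls.dat"
    if cds.is_synthetic:
        return "synthetic.dat"
    if cds.is_pseudogene:
        return "pseudo.dat"
    # Everything else stays in its taxonomic TREMBL division file.
    return f"{cds.division}.dat"
```

For example, `route_cds(Cds("hum", 332))` yields `hum.dat`, while a CDS carrying a `SWISS-PROT` cross-reference yields `None`.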

ID   G34313     PRELIMINARY;      PRT;   332 AA.
AC   X02152_1;
DT   23-DEC-1996 (EMBLREL. 49, CREATED)
DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
DE   LACTATE DEHYDROGENASE.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
DR   EMBL; X02152; G34313; -.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//

ID   G780261    PRELIMINARY;      PRT;   332 AA.
AC   X03077_1;
DT   23-DEC-1996 (EMBLREL. 49, CREATED)
DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
DE   LACTATE DEHYDROGENASE.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S., LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//

Figure 2: First level TREMBL entries (after translation and entry creation, sequence not shown)
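The qualifier and feature-key mappings used during entry creation can be written down as two small lookup tables. The Python sketch below is only illustrative: the names `RC_TOKENS`, `FEATURE_KEYS` and `build_rc_line` are hypothetical, and two behaviours are assumptions rather than statements from the text, namely that only the first available STRAIN source is used (reading "2nd choice" as a fallback) and that values are upper-cased to match the entries shown in the figures.

```python
# EMBL source qualifier -> SWISS-PROT RC token prefix, in the order listed in the text.
RC_TOKENS = [
    ("plasmid", "PLASMID="),
    ("strain", "STRAIN="),
    ("isolate", "STRAIN="),        # 2nd choice for STRAIN
    ("cultivar", "STRAIN=CV. "),
    ("tissue_type", "TISSUE="),
    ("transposon", "TRANSPOSON="),
]

# EMBL feature key -> SWISS-PROT feature key, as listed in the text.
FEATURE_KEYS = {
    "sig_peptide": "SIGNAL",
    "transit_peptide": "TRANSIT",
    "mat_peptide": "CHAIN",
    "allele": "VARIANT",
    "variation": "VARIANT",
    "misc_difference": "VARIANT",
    "mutation": "VARIANT",
    "conflict": "CONFLICT",
}


def build_rc_line(qualifiers):
    """Build an RC line from a dict of EMBL qualifier values.

    Assumptions: one token per RC topic (first match wins) and
    upper-cased values. Returns "" when no qualifier applies.
    """
    tokens = []
    used_topics = set()
    for name, prefix in RC_TOKENS:
        value = qualifiers.get(name)
        if not value:
            continue
        topic = prefix.split("=")[0]   # e.g. "STRAIN"
        if topic in used_topics:
            continue                   # a higher-priority qualifier already filled it
        used_topics.add(topic)
        tokens.append(f"{prefix}{value.upper()};")
    return "RC   " + " ".join(tokens) if tokens else ""
```

With these tables, `build_rc_line({"strain": "MG1655"})` gives `RC   STRAIN=MG1655;`, and `FEATURE_KEYS["mat_peptide"]` gives `CHAIN`.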

At this stage the entries from the composite divisions of the EMBL database (STS, EST, and UNC) are added to their relative taxonomic TREMBL divisions.
Then all files are searched for entries that have recently been added to SWISS-PROT but which do not yet have a /dbxref="SWISS-PROT" qualifier in EMBL. These entries are removed and TREMBL is split into two different sections: SP-TREMBL (SWISS-PROT TREMBL), which contains entries that will be added, after complete annotation, to SWISS-PROT, and REM-TREMBL (REMaining TREMBL), which contains entries not for inclusion in SWISS-PROT.
REM-TREMBL consists of 5 files (patent.dat, immuno.dat, smalls.dat, synthetic.dat, and pseudo.dat). SP-TREMBL consists of 12 files (fun.dat, inv.dat, hum.dat, mam.dat, mhc.dat, org.dat, phg.dat, pln.dat, pro.dat, rod.dat, vrl.dat, and vrt.dat) which will undergo further post-processing.

Post-processing the SP-TREMBL Entries
To post-process the SP-TREMBL entries, a collection of shell scripts and C programs are used.

ID   G34313     PRELIMINARY;      PRT;   332 AA.
AC   X02152_1;
DT   23-DEC-1996 (EMBLREL. 49, CREATED)
DT   23-DEC-1996 (EMBLREL. 49, LAST SEQUENCE UPDATE)
DT   23-DEC-1996 (EMBLREL. 49, LAST ANNOTATION UPDATE)
DE   LACTATE DEHYDROGENASE.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S., LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
DR   EMBL; X02152; G34313; -.
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//

Figure 3: Second level TREMBL entry (after merging, sequence not shown)
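A merge of the kind that produces the entry in Figure 3 can be sketched in a few lines of Python. This is a minimal illustration for full-length entries only, not the actual post-processing code. The dictionary layout of an entry and the function names are hypothetical, and `zlib.crc32` is used as a stand-in checksum: it is an assumption that it need not reproduce the CRC32 values printed in the SQ lines.

```python
import zlib
from collections import defaultdict


def checksum(sequence):
    """Stand-in CRC32 of a protein sequence as 8 upper-case hex digits."""
    return format(zlib.crc32(sequence.encode("ascii")) & 0xFFFFFFFF, "08X")


def merge_identical(entries):
    """Merge entries that carry exactly the same sequence.

    Each entry is a dict with the hypothetical keys 'id', 'sequence',
    'references' and 'cross_refs'. Entries are grouped by checksum and
    by the sequence itself, so a checksum collision cannot cause a
    wrong merge. The merged entry keeps the ID of the first member.
    """
    groups = defaultdict(list)
    for entry in entries:
        key = (checksum(entry["sequence"]), entry["sequence"])
        groups[key].append(entry)

    merged = []
    for (_, sequence), members in groups.items():
        combined = {
            "id": members[0]["id"],
            "sequence": sequence,
            "references": [],
            "cross_refs": [],
        }
        for member in members:
            for field_name in ("references", "cross_refs"):
                for item in member[field_name]:
                    if item not in combined[field_name]:
                        combined[field_name].append(item)
        merged.append(combined)
    return merged
```

Applied to the two entries of Figure 2, which share one sequence, this would yield a single entry under the first ID with both references and all EMBL cross-references, as in Figure 3. Comparing checksums first keeps the comparison cheap; the extra equality test on the sequence is a design choice of this sketch.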

The first step is the reduction of redundancy. All full-length proteins in SP-TREMBL with the same sequence are merged into one entry. All fragment proteins with the same sequence from the same organism are merged provided they do not belong to a highly variable category of proteins like MHC proteins or viral proteins. For all SWISS-PROT entries, the CRC32 checksums of all the different annotated sequence reports are calculated and compared with the checksums of all SP-TREMBL entries. Identified matches are removed from SP-TREMBL and integrated into the corresponding SWISS-PROT entries. Figure 3 shows an example of an automatically merged TREMBL entry, created by merging of the two TREMBL entries shown in Figure 2.
Currently we are working on a further reduction of redundancy by establishing rules to automatically merge sub-fragments with full-length sequences and on the identification of sequence differences due to polymorphisms, strain variations and sequencing errors with the goal of establishing rules to automatically merge conflicting sequence reports about the same sequence into one entry.
This work is done in collaboration with Jean-Jacques Codani from INRIA, France. His group developed LASSAP (LArge Scale Sequence comparison Package), a programmable, high performance system designed to overcome current limitations of sequence comparison programs in order to fit the needs of large scale analysis (Glemet and Codani 1997). LASSAP provides an Application Programming Interface allowing the integration of any generic pairwise-based algorithm. Whichever pairwise algorithm is used in LASSAP, it shares numerous enhancements with all other algorithms such as:

o intra and inter data bank comparisons
o computational requests (selections and computations are achieved on the fly)
o frame translations on queries and data banks
o structured results allowing easy and powerful post-analysis
o performance improvements by parallelization and the driving of specialised hardware.

LASSAP allows the use of several sequence comparison methods: BLAST (Altschul et al. 1990), FASTA (Pearson and Lipman 1988), dynamic programming with local (Smith and Waterman 1981) and global (Needleman and Wunsch 1970) similarity searches, string matching with

ID   P00338     PRELIMINARY;      PRT;   332 AA.
AC   P00338;
DT   01-FEB-1997 (TREMBLREL. 02, CREATED)
DT   01-FEB-1997 (TREMBLREL. 02, LAST SEQUENCE UPDATE)
DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
DE   L-LACTATE DEHYDROGENASE (EC 1.1.1.27).
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S., LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
CC   -!- CATALYTIC ACTIVITY: L-LACTATE + NAD(+) = PYRUVATE + NADH.
CC   -!- SUBUNIT: HOMOTETRAMER (BY SIMILARITY).
CC   -!- PATHWAY: FINAL STEP IN ANAEROBIC GLYCOLYSIS.
DR   EMBL; X02152; G34313; -.
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
DR   PROSITE; PS00064; L_LDH.
KW   OXIDOREDUCTASE; NAD; GLYCOLYSIS.
FT   ACT_SITE    193    193       BY SIMILARITY.
SQ   SEQUENCE   332 AA;  36689 MW;  FF7595E2 CRC32;
//

Figure 4: Third level TREMBL entry (after complete post-processing, sequence not shown)

(Landau and Vishkin 1986, Hunt and Szymanski 1977) or without errors (Boyer and Moore 1977) and pattern matching with (Baeza-Yates and Gonnet 1989, Wu and Manber 1991) or without errors. We already use LASSAP to find sequences in SP-TREMBL which are sub-fragments of SWISS-PROT or other SP-TREMBL entries, and to find sequences in SP-TREMBL which are the full-length sequences of sequence fragments in SWISS-PROT. The comparisons are done with the Boyer-Moore algorithm. Identified matches are removed from SP-TREMBL and integrated into the corresponding SWISS-PROT or SP-TREMBL entries.
Although we have already identified thousands of SP-TREMBL entries with exact matches in SWISS-PROT or SP-TREMBL, there are tens of thousands of subtle redundancies due to polymorphisms, strain variations and sequencing errors which we would like to eliminate. Also, we need to establish further rules to automatically merge differing SP-TREMBL sequence reports about the same sequence into one entry.

The second post-processing step is the information enhancing process. For SP-TREMBL to act as a computer-annotated supplement to SWISS-PROT, new procedures have been introduced whereby valuable annotation can be added automatically. Firstly, all SP-TREMBL entries are scanned for PROSITE (Bairoch, Bucher, and Hofmann 1997) patterns compatible with their taxonomic range. The results are added to the annotator's section of the SP-TREMBL entry which is not visible to the public. Some of the patterns are known to be very reliable (i.e. no known false positive). These are used to enhance the information content of the DE, CC, DR, and KW fields by adding information about the potential function of the protein, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location, and

ID   LDHM_HUMAN     STANDARD;      PRT;   331 AA.
AC   P00338;
DE   L-LACTATE DEHYDROGENASE M CHAIN (EC 1.1.1.27) (LDH-A).
GN   LDHA.
OS   HOMO SAPIENS (HUMAN).
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85127030.
RA   TSUJIBO H., TIANO H.F., LI S.S.-L.;
RL   EUR. J. BIOCHEM. 147:9-15(1985).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86076881.
RA   CHUNG F.Z., TSUJIBO H., BHATTACHARYYA U., SHARIEF F.S., LI S.S.-L.;
RL   BIOCHEM. J. 231:537-541(1985).
RN   [3]
RP   VARIANT CYS-314.
RX   MEDLINE; 93075246.
RA   SUDO K., MAEKAWA M., SHIOYA M., IKEDA K., TAKAHASHI N., ISOGAI Y.,
RA   LI S.S.-L., KANNO T., MACHIDA K., TORIUMI J.;
RL   BIOCHEM. INT. 27:1051-1057(1992).
RN   [4]
RP   VARIANT GLU-221.
RX   MEDLINE; 94199831.
RA   MAEKAWA M., SUDO K., KOBAYASHI A., SUGIYAMA E., LI S.S.-L., KANNO T.;
RL   CLIN. CHEM. 40:665-668(1994).
CC   -!- CATALYTIC ACTIVITY: L-LACTATE + NAD(+) = PYRUVATE + NADH.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- PATHWAY: FINAL STEP IN ANAEROBIC GLYCOLYSIS.
CC   -!- THERE ARE THREE TYPES OF LDH CHAINS: M (LDH-A) FOUND PREDOMINANTLY
CC       IN MUSCLE TISSUES, H (LDH-B) FOUND IN HEART MUSCLE AND X (LDH-C)
CC       WHICH IS PRESENT IN THE SPERMATOZOA OF MAMMALS, IN THE COLUMBIDAE
CC       OF BIRDS AND IN ACTINOPTERYGIAN FISH.
CC   -!- DISEASE: EXERTIONAL MYOGLOBINURIA IS DUE TO A DEFECT IN LDH-A.
DR   EMBL; X02152; G34313; -.
DR   EMBL; X03077; G780261; -.
DR   EMBL; X03078; G780261; JOINED.
DR   EMBL; X03079; G780261; JOINED.
DR   EMBL; X03080; G780261; JOINED.
DR   EMBL; X03081; G780261; JOINED.
DR   EMBL; X03082; G780261; JOINED.
DR   EMBL; X03083; G780261; JOINED.
DR   PIR; A00347; DEHULM.
DR   HSSP; P00344; 1LDB.
DR   AARHUS/GHENT-2DPAGE; 2207; NEPHGE.
DR   MIM; 150000; -.
DR   PROSITE; PS00064; L_LDH.
KW   OXIDOREDUCTASE; NAD; GLYCOLYSIS; MULTIGENE FAMILY; DISEASE MUTATION;
KW   POLYMORPHISM.
FT   INIT_MET      0      0
FT   ACT_SITE    192    192       ACCEPTS A PROTON DURING CATALYSIS.
FT   VARIANT     221    221       K -> E.
FT   VARIANT     314    314       R -> C (IN LDHA DEFICIENCY).
SQ   SEQUENCE   331 AA;  36557 MW;  DF367487 CRC32;
//

Figure 5: Fully annotated SWISS-PROT entry (DT lines, OC lines and sequence not shown)

other annotation to the entry wherever appropriate. We also use the ENZYME database (Bairoch 1996), using the EC number as a reference point, to generate standardised description lines for enzyme entries and to allow information such as catalytic activity, cofactors and relevant keywords to be taken from ENZYME and to be added automatically to SP-TREMBL entries. Furthermore we use specialised databases like FlyBase (The FlyBase Consortium 1997), SGD, GDB (Fasman et al. 1997), MGD (Blake et al. 1997) to parse information like the correct gene nomenclature and cross-references to these databases into SP-TREMBL entries.
The now fully post-processed TREMBL entry, already used as an example before, is shown in Figure 4. Although this computer-generated annotation is already enhancing the information about the sequence drastically, it is still a long way from the quality of the corresponding SWISS-PROT entry (shown in Figure 5), fully annotated by biologists. Also, this is not a representative example of a TREMBL entry. A (fortunately also not representative) example of a TREMBL entry with low annotation quality is shown in Figure 6. In this case, a badly annotated nucleotide sequence resulted in a badly annotated TREMBL entry, which we could not improve with our current, limited range of automated annotation tools.

The current status of TREMBL

The TREMBL release created from the EMBL Nucleotide Sequence Database release 49 contains (February 1997) 116,379 sequence entries, comprising 31,293,053 amino acids. This TREMBL release was distributed with SWISS-PROT release 34 (containing 59,021 sequence entries with 21,210,389 amino acids).
Most of the 96,757 sequence entries currently in SP-TREMBL are additional sequence reports of entries already in SWISS-PROT and will lead to updates of those SWISS-PROT entries. However, some 30,000 to 40,000 entries now in SP-TREMBL will eventually be included as new sequence entries in SWISS-PROT. Approximately 20% of the SP-TREMBL entries have been post-processed.
The majority of REM-TREMBL entries (currently approximately 13,000 of 19,620) are immunoglobulins and T-cell receptors. In SWISS-PROT, we have translations of the germ line genes for immunoglobulins and T-cell receptors but we do not wish to add all known somatic recombined variant sequences as this would bias database-wide searches. Such entries will be placed in IMGT-TREMBL (ImMunoGeneTics-TREMBL). We will, in collaboration with IMGT (Guidicelli et al. 1997), develop IMGT-TREMBL as a specialist protein database for immunoglobulins and T-cell receptors. This supplement to SWISS-PROT will be presented in SWISS-PROT format and cross-referenced to SWISS-PROT, the EMBL Nucleotide Sequence Database, and IMGT.
Another category of data which will not be included in SWISS-PROT is synthetic sequences (SWISS-PROT represents only naturally occurring sequences). Again, we do not wish to leave these entries in TREMBL. Ideally one should build a specialized database for artificial sequences as a further supplement to SWISS-PROT. The remainder of the REM-TREMBL entries are patents, pseudogenes (SWISS-PROT does not represent genes known not to be expressed), and sequence fragments of only a few amino acid residues (CDS smaller than 8 amino acids are sorted into smalls.dat).
The EMBL Nucleotide Sequence Database release 50 contained 20,000 new CDS, leading to the creation of a corresponding number of new TREMBL entries in only 3 months (EMBL releases are created every 3 months). However, in the same period we integrated 10,000 SP-TREMBL sequence reports into SWISS-PROT or merged them with existing SWISS-PROT + TREMBL entries. In 1997, we expect approximately 80,000 new CDS in the EMBL Nucleotide Sequence Database. Although we hope to increase the number of sequence entries integrated into SWISS-PROT from 59,000 at the end of 1996 to 75,000 at the end of 1997, the number of entries in TREMBL will probably increase (after removal of redundancy) from 105,000 at the end of 1996 to 150,000 at the end of 1997. This prediction underlines the fact that the ever-increasing automation of SP-TREMBL annotation methods is the only long-term viable approach to the constantly increasing data flow.

The future of annotation in TREMBL

Most of the sequence data nowadays is coming from genome projects and lacks biochemical evidence to provide hard data on the function of the protein. The prediction of functional information from primary sequence information is a comparative problem based on a set of general rules and relationships derived from the current set of known proteins. Sequence similarity searches, pattern and profile searches, and clustering of sequences are currently helping us to take advantage, in the annotation process, of the relationship between primary sequence and function.
Modern sensitive database search algorithms find already characterised sequences similar to new sequences and enable us to annotate new sequences by analogy to old sequences. Secondary pattern and profile databases are used to enhance SP-TREMBL entries by adding information about the potential functions of proteins, metabolic pathways, active sites, cofactors, binding sites, domains, subcellular location, and other annotation. We are automating the similarity and motif searches to accelerate the upgrading of SP-TREMBL entries to SWISS-PROT standard. The annotation task, whether automated or carried out by database curators, can proceed far more quickly if large groups of related proteins, such as families of sequences sharing a similar

ID   P77135     PRELIMINARY;      PRT;   127 AA.
AC   P77135;
DT   01-FEB-1997 (TREMBLREL. 02, CREATED)
DT   01-FEB-1997 (TREMBLREL. 02, LAST SEQUENCE UPDATE)
DT   01-FEB-1997 (TREMBLREL. 02, LAST ANNOTATION UPDATE)
DE   094 AND HYPOTHETICAL PROTEIN GENES, PARTIAL CDS
DE   (FRAGMENT).
GN   094.
OS   ESCHERICHIA COLI.
OC   PROKARYOTA; GRACILICUTES; SCOTOBACTERIA; FACULTATIVELY ANAEROBIC RODS;
OC   ENTEROBACTERIACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=MG1655;
RA   ROBERTS D., ALLEN E., ARAUJO R., APARICIO A., CHUNG E., DAVIS K.,
RA   DUNCAN M., FEDERSPIEL N., HYMAN R., KALMAN S., KOMP C., KURDI O.,
RA   LEW H., LIN D., NAMATH A., OEFNER P., SCHRAMM S., DAVIS R.W.;
RL   SUBMITTED (JAN-1997) TO EMBL/GENBANK/DDBJ DATA BANKS.
DR   EMBL; U83187; G1773217; -.
FT   NON_TER       1      1
SQ   SEQUENCE   127 AA;  14752 MW;  883882DD CRC32;
//

Figure 6: TREMBL entry with low annotation quality (sequence not shown)

motif, can be annotated together. We are attempting to do this by creation of alignments of all proteins and building of clusters of proteins with high percentage identities, indicating which sequence reports belong to the same group of proteins or which proteins share similar domains. These clusters will also be used for automatic motif detection and as a starting point to develop profile related algorithms.
However, currently most of these results are not included in the entries as seen by the public. These results are only accessible internally by the curators. We are developing a rule-based system which considers all the results obtained by the different methods to add annotation to SP-TREMBL entries. This system consists of growing numbers of sequence analysis methods, rules, and hierarchical classifications of the annotation content of SWISS-PROT entries, where all nodes in these hierarchical trees are linked to certain annotation. The rules consider the sequence analysis results to decide which node(s) in the classification tree(s) are sufficiently similar to the query sequence and lead subsequently to the incorporation of the appropriate annotation (linked to the node) in the SP-TREMBL entry. The incorporated annotation is flagged as annotation based on sequence analysis methods. The results that are accessible internally by the curators only are used by the SWISS-PROT team to establish additional rules to enhance the annotation of SP-TREMBL entries. We only add information based on our automatic analysis to SP-TREMBL entries if we are convinced that the computer-generation creates correct annotation in more than 90% of the cases.
With this annotation concept of SWISS-PROT + TREMBL we try to combine the strengths of annotation carefully done by human experts, with biological knowledge and after consultation of the relevant literature and thorough sequence analysis, with the power of automation of sequence analysis and computer-generation of annotation. Since predicted annotation assignments and assignments based on hard experimental evidence are clearly distinguishable, we will present in TREMBL highly reliable although putative functional predictions, without lowering the high editorial standards of the standard SWISS-PROT entries.

References

Altschul S.F., Gish W., Miller W., Myers E.W., and Lipman D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403-410.

Baeza-Yates R.A., and Gonnet G.H. 1989. Efficient text searching of regular expressions. Proceedings ICALP 16:46-62.

Bairoch A. 1996. The ENZYME data bank in 1995. Nucleic Acids Res. 24:221-222.

Bairoch A., Bucher P., and Hofmann K. 1997. The PROSITE database, its status in 1997. Nucleic Acids Res. 25:217-221.

Bairoch A., and Apweiler R. 1997. The SWISS-PROT protein sequence data bank and its supplement TREMBL. Nucleic Acids Res. 25:31-36.

Benson D.A., Boguski M., Lipman D.J., and Ostell J. 1997. GenBank. Nucleic Acids Res. 25:1-6.

Blake J.A., Richardson J.E., Davisson M.T., Eppig J.T., and the Mouse Genome Informatics Group 1997. The Mouse Genome Database (MGD). A comprehensive public resource of genetic, phenotypic and genomic data. Nucleic Acids Res. 25:85-91.

Boyer R., and Moore S. 1977. A fast string searching algorithm. C.A.C.M. 20:762-772.

Etzold T., and Argos P. 1993. SRS, an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci. 9:49-57.

Fasman K.H., Letovsky S.I., Li P., Cottingham R.W., and Kingsbury D.T. 1997. The GDB Human Genome Database Anno 1997. Nucleic Acids Res. 25:72-80.

FlyBase Consortium 1997. FlyBase: a Drosophila database. Nucleic Acids Res. 25:63-66.

Glemet E., and Codani J.-J. 1997. LASSAP, a Large Scale Sequence comparison Package. CABIOS: In Press.

Guidicelli V., Chaume D., Bodmer J., Mueller W., Busin C., Marsh S., Bontrop R., Marc L., Malik A., and Lefranc M.-P. 1997. IMGT, the international ImMunoGeneTics database. Nucleic Acids Res. 25:206-211.

Hunt J., and Szymanski T.G. 1977. A fast algorithm for computing longest common subsequences. Theoretical Computer Science 20:350-353.

Landau G.M., and Vishkin U. 1986. Efficient string matching with k mismatches. Theoretical Computer Science 43:239-249.

Needleman S.B., and Wunsch C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48:443-453.

Pearson W.R., and Lipman D.J. 1988. Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85:2444-2448.

Smith T.F., and Waterman M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.

Stoesser G., Sterk P., Tuli M.A., Stoehr P.J., and Cameron G.N. 1997. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 25:7-13.

Tateno Y., and Gojobori T. 1997. DNA Data Bank of Japan in the age of information biology. Nucleic Acids Res. 25:14-17.

Wu S., and Manber U. 1991. Fast text searching with errors. Technical Report TR91-11, University of Arizona.