
Article

Swiss-Prot: juggling between evolution and stability

BAIROCH, Amos Marc, et al.

Abstract

We describe some of the aspects of Swiss-Prot that make it unique, explain what are the developments we believe to be necessary for the database to continue to play its role as a focal point of protein knowledge, and provide advice pertinent to the development of high-quality knowledge resources on one aspect or the other of the life sciences.

Reference

BAIROCH, Amos Marc, et al. Swiss-Prot: juggling between evolution and stability. Briefings in Bioinformatics, 2004, vol. 5, no. 1, p. 39-55

DOI : 10.1093/bib/5.1.39 PMID : 15153305

Available at: http://archive-ouverte.unige.ch/unige:38278


Swiss-Prot: Juggling between evolution and stability

Amos Bairoch, Brigitte Boeckmann, Serenella Ferro and Elisabeth Gasteiger

Date received (in revised form): 22nd December 2003

Amos Bairoch heads the Swiss-Prot group at the SIB and is a professor at the Department of Structural Biology and Bioinformatics of the University of Geneva.

Brigitte Boeckmann has been working in the Swiss-Prot group for 16 years. She has been involved in annotation and tool development and is now coordinating automatic annotation in Swiss-Prot.

Serenella Ferro has a background in biochemistry and chemistry and has worked as a Swiss-Prot head annotator for 15 years.

Elisabeth Gasteiger coordinates software development in the SIB Swiss-Prot group and is in charge of the ExPASy server.

Amos Bairoch, Swiss Institute of Bioinformatics, Centre Médical Universitaire, 1 Rue Michel Servet, 1211 Geneva 4, Switzerland
Tel: +41 22 379 50 50 Fax: +41 22 379 58 58 E-mail: swiss-prot@.org

Keywords: protein sequence, database, functional annotation, automatic annotation, sequence analysis, user feedback

INTRODUCTION

The goal of this article is not to depict the history of Swiss-Prot, as this has already been done elsewhere,1,2 but rather to explore some of the consequences of decisions taken about 20 years ago, to discuss how the database has constantly evolved and to describe the challenges that it currently faces.

To say that the past 20 years have been exciting would be a major understatement. Most young scientists now starting a career in the life science fields are not aware of how much the combined technological revolutions that led to high-throughput sequencing and the WWW have quantitatively and qualitatively changed the universe of knowledge on proteins. Yet, while we now have to cater in the Swiss-Prot and TrEMBL sections of the UniProt knowledgebase3 for more than 1 million protein sequences, there is a continuously widening chasm between truly characterised proteins and those that have been solely predicted by genome-sequencing projects. For us, in Swiss-Prot, the ultimate in terms of a well-characterised protein is one for which not only the exact sequence, post-translational modifications, subcellular location, tissue specificity, interaction partners and 3D structure are known, but more crucially for which a functional role can be assigned.

What we hope to convey in this paper are the particular aspects of Swiss-Prot that make it unique, and hopefully derive some advice that would be pertinent to someone embarking on the development of a high-quality knowledge resource on one aspect or the other of the life sciences. But before we do so, we want to enumerate six observations that we believe are important to communicate to any would-be developers of such databases:

• Your task will be much more complex and far bigger than you ever thought it could be.
• If your database is successful and useful to the user community, then you will have to dedicate all your efforts to develop it for a much longer period of time than you would have thought possible.
• You will always wonder why life scientists abhor complying with nomenclature guidelines or standardisation efforts that would simplify both your life and theirs.
• You will have to continually fight to obtain a minimal amount of funding.
• As with any service efforts, you will be told far more what you do wrong rather than what you do right.
• But when you see how useful your efforts are to your users, all the above drawbacks will lose their importance!

& HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 5. NO 1. 39–55. MARCH 2004 3 9 Bairoch et al.

A SMALL BIT OF HISTORICAL INTROSPECTION

How Swiss-Prot started and how it institutionally evolved

In 1965, the late Margaret Dayhoff published the first edition of the ‘Atlas of Protein Sequence and Structure’.4 It contained information on 65 protein sequences. In the introduction she expressed the mission of the Atlas as

‘locating all of the relevant publications; critically reviewing the data and resolving conflicting reports; transforming the data into a uniform format to reflect those aspects of the structure that have been experimentally determined and those that could reasonably be inferred by homology; identifying the material with regard to chemical function, biological source, genetic control, and evolutionary origin...’

This ambitious and still highly pertinent mission statement is a tribute to the vision shown by Margaret Dayhoff. She pursued her task until her untimely death in 1983. At that time the Atlas had evolved into a protein sequence data bank known as the Protein Identification Resource (PIR) of the National Biomedical Research Foundation (NBRF). When in 1985, one of us (Amos Bairoch) was, in the context of a PhD thesis, developing a software package (PC/Gene5) to analyse protein sequences, he was faced with some deficiencies and omissions in the PIR database. As he did not receive satisfactory feedback from PIR, he resolved to develop a version of PIR in the format of the European Molecular Biology Laboratory (EMBL) nucleotide sequence database that would contain additional sequences and, more crucially, additional annotations on various aspects of the protein universe.

In mid-1986, the first release of Swiss-Prot came out. Almost immediately we approached the EMBL to see if they were interested in distributing and helping with the maintenance of the database. With foresight they immediately accepted. The collaboration that grew from this early decision gave rise to the current situation: Swiss-Prot is a fully collaborative endeavour of what has become the Swiss-Prot group at the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI), an outstation of EMBL. The last institutional development was the decision, in late 2003, of the NIH to award a major grant to a consortium composed of the EBI, the SIB and PIR to produce a universal resource on proteins, known as UniProt.

Margin note: Swiss-Prot contains mostly manually annotated entries

Today, in 2004, more than 120 people directly work on Swiss-Prot and TrEMBL (see below) or on resources that evolved out of Swiss-Prot. While the first reaction to this figure can be ‘that’s a lot of people’, it pales when compared with the amount of work to be carried out. In fact this is a major issue shared by all life sciences information resources: long-term, high-quality curation of information is not cheap. It is not as glamorous as whole genome sequencing projects or any such well-defined scientific and technological efforts, yet it needs to be adequately and stably funded. Sadly, this is not yet widely recognised by funding bodies.

Why TrEMBL was developed

Margin note: TrEMBL consists of computer-annotated entries, which are not yet in Swiss-Prot

In the mid-1990s it was already clear that the increased data flow from genome projects was going to be a major challenge for Swiss-Prot. As will be explained further on, maintaining the high quality of the database requires careful sequence analysis and detailed annotation of every entry. This was, and still is, a major rate-limiting step. We did not wish to relax the editorial standards of Swiss-Prot and there was a limit to how much the annotation procedures could be accelerated. Yet it was vital to make new sequences available as quickly as possible. To address this concern, we introduced in 1996 TrEMBL (Translation of EMBL). TrEMBL consists of computer-annotated entries derived from the translation of all


coding sequences in the EMBL database, except for those already included in Swiss-Prot. TrEMBL is therefore a complement to Swiss-Prot and sequence entries only move out from TrEMBL and enter Swiss-Prot after having been manually curated by an annotator.

From 1996 to the end of 2003, Swiss-Prot grew by 83,000 sequences to reach a total of 140,000 entries. In this period of time, TrEMBL grew from the 86,000 entries in its first release to about 1.1 million entries!

WHAT MAKES SWISS-PROT SPECIAL

Aiming for the perfect sequence

Margin note: The correct protein sequence is the basis for high-quality annotation

Even if it may be obvious to many of its users, it is important to restate that Swiss-Prot is a corpus of knowledge centred on protein sequences. As will become apparent in the following sections of this paper, we add many layers of information around the sequence data, yet most of that information is in one way or another dependent on the sequence. It is therefore important to capture and to represent the most correct sequence. This is an important aspect of the work of Swiss-Prot that escapes the notice of most of its users.

The overwhelming majority (>99 per cent) of the sequence data represented in Swiss-Prot originates from the translation of nucleotide sequences submitted to the EMBL/Genbank/DDBJ database. Only a very small proportion of the sequences are obtained directly at the amino acid level using Edman degradation or mass spectrometry. This situation already existed in 1986. What has happened since was obviously an enormous quantitative increase in the amount of nucleotide sequence data, but also, more relevant to our quest toward quality, a significant increase in nucleotide sequence quality and a sociological change in the breakdown of the originators of sequence data. The increase in sequence quality is mainly due to the growing use of very sophisticated automated sequencing machines. In 1986, most nucleotide sequences submitted to the DNA databases originated from individual laboratories that were sequencing a single gene or a small region of a genome. Today, the biggest (in terms of quantity) contributors are major sequencing centres that either provide complete genomic sequences or massive amounts of data from full-length cDNAs.

Margin note: Redundancy removal: Merging entries and point out sequence discrepancies

As we depend on primary sequence data that have been submitted to the nucleotide sequence databases, it would seem at first glance that there is not really anything we can do to improve the quality of the derived protein sequences. This is far from being true, and in fact there are many things we can do by comparing sequences. Sequence comparison is essential to the process of creating or updating a Swiss-Prot entry. One needs to remember that Swiss-Prot is a non-redundant database. What this means is that we took the decision from the very beginning to merge the protein sequences from the same organism originating from the same gene. Thus we are often faced with many complete or partial sequences that need to be merged and whose discrepancies have to be taken into account. Sequence discrepancies are annotated with the feature (FT) keys CONFLICT, VARIANT, MUTAGEN or VARSPLIC. The FT key VARIANT is used to describe polymorphisms and disease mutations, MUTAGEN for experimentally altered sites and CONFLICT for sequence differences of any other reason.

Margin note: Splice isoforms

Insertions or gaps within alignments of otherwise identical sequences are usually due to alternative splicing events, which are annotated using the FT key VARSPLIC.

Thus sequence comparisons can already help us in determining what is the most correct sequence. This is especially true in organisms that are the focus of many sequencing efforts. For example, we currently have an average of 3.7 independent sequence reports (cDNA or genomic DNA) for each human protein. Such a redundancy in the nucleotide


sequence database helps to flag potential sequencing errors. Further errors can be found when comparing orthologous and paralogous sequences across species. The relevance of such approaches is increasing as more and more full genome sequences are becoming available.

Margin note: Frameshifts

One of the advantages of comparing many sequences is the detection of probable frameshift errors. They stand out in multiple protein sequence alignments as locally divergent regions. If the divergence can be explained at the nucleotide level by the insertion or deletion of a single nucleotide, it is likely (but not certain) that it is due to a sequencing error. The total number of potential frameshift errors that were corrected by Swiss-Prot annotators is difficult to estimate as it often happens that incorrect DNA sequences are later resubmitted by the original authors, correcting sequencing errors, generally by taking into account the correction made in the corresponding Swiss-Prot entries. In the current release, 1 per cent of the entries are flagged with at least one potential frameshift error in one of the cross-referenced nucleotide sequence entries.

Margin note: Initiation sites and exon boundaries

In many cases, the N-terminal initiation sites of bacterial or archaeal genes or the exon/intron boundaries of eukaryotic genes are incorrectly predicted. It is important to note that these predictions are of a very heterogeneous quality and to recognise that not all sequencing centres produce the same level of quality in terms of both sequences and of protein-coding gene predictions. Swiss-Prot annotators are aware of this heterogeneity and know what data can be more or less trusted. We currently observe that in 7.1 per cent of our entries we disagree with the translation provided by the submitter.

Margin note: Annotation of CDs not annotated in the nucleotide sequence databases

It often happens that annotators have to translate, from a nucleotide entry, protein sequences that have been overlooked by the original submitters. Currently, 2.5 per cent of our entries contain such translations.

Finally, the work of the Swiss-Prot annotators is also to reject putative protein sequences that are obviously bogus, either because they originate from a pseudogene or because they were incorrectly predicted either from non-coding DNA or a wrong open reading frame (ORF).

If you take all the above factors and tasks into consideration, you can see why we believe that the correction of amino acid sequences is an important part of the annotation process, and that it is far from trivial to achieve. This is not necessarily apparent to the user, but it is one of the reasons why Swiss-Prot has always been considered as the reference database for protein sequences. Of course the drawback of such an approach is that it is time consuming and can be applied only to manually annotated entries. Such an approach can consequently not be applied to TrEMBL, where the represented protein sequences are those that have been indicated by the submitters of the original nucleotide sequence entry. It would therefore be important to develop semi-automatic systems that allow some aspects of sequence correction to be applied to TrEMBL.

Extracting information from the literature

Margin note: Access to published information before the internet era

Fifteen years ago, Swiss-Prot annotators typically went through the following process: they photocopied all relevant papers from the reference list of the entry they were annotating. The publications were read and important information was marked in the paper copy. Information was then added to the entry in either free text (comments lines) or structured feature lines. Access to reference databases and computing tools considerably facilitated the above procedures, but also brought along a higher level of complexity. Being an annotator in the early 1990s was already not a trivial job, but it has since become a much more demanding task.

When Medline became available at the workplace first on CD-ROMs, and later via the internet, most journal abstracts


could immediately be read – or discarded if not relevant – and information was retrieved directly from there, which was particularly helpful when the journal was not available from local libraries.

Margin note: The primary source of protein knowledge are journal articles

But it is online access to full text articles that has completely changed the life of annotators. They can look at many more relevant papers than they used to do when they needed to go to the library. This is particularly useful nowadays as information on a given protein is generally spread between many different reports in a wide variety of journals. Such a trend is exemplified by the journal citation statistics of Swiss-Prot: in 1993, 461 different journals were cited in the database, while today the number has risen to about 1,400. Although some journals (such as J. Biol. Chem. and Proc. Natl Acad. Sci USA) were and still are major sources of articles useful for the annotation process, there has been a clear trend toward a ‘decentralisation’ of the sources of protein-related publications. Of course, journal articles are not the only source of information, and we also make use of electronic journals, book articles, theses, patent applications and external information resources, but overwhelmingly the primary source of experimental information remains published journal articles.

We are often asked whether annotators are ‘really sitting there and reading publications’. Yes, they are. Knowledge extracted from the articles is mostly added to the appropriate topics of the comment (CC) lines, and to the feature table (FT), whenever a description concerns a defined region or site within the sequence. But we also add new synonyms for protein names (DE line), gene names (GN line), compare or complete author names with the ones given in a reference block (RA line), annotate a reference block (RP and RC lines), add additional relevant references to an entry, and much more. All experimental findings and authors’ conclusions are compared with the knowledge available on related proteins and the results from various protein sequence analysis tools. When contradictory results have been published and there is not enough information to prefer one hypothesis over the others, the annotation is performed in a way that draws the user’s attention to the contradictory conclusions. Finally the content of an entry is summarised in the form of a list of keywords (KW line) from a controlled vocabulary.

Margin note: Text mining tools will guide annotators through the wealth of publications

Both abstracts and full text articles are the target of text-mining tools, which will soon become an indispensable help for annotators to quickly find the publications of interest from the wealth of information available. We believe that efforts to build efficient software tools allowing the semi-automated extraction of information from repositories of full text articles will be essential to anyone trying to build comprehensive information resources for life scientists. The fact that we will rely on such tools to hunt and extract information is paradoxical. Anyone outside the life sciences field would believe that such important information would be immediately made available in a structured way by the experimentalists to the relevant databases. As we will see in the next section, this is unfortunately not the case.

User submissions and updates

We have always strongly encouraged user feedback, as well as the submission of updates and corrections, initially by asking people to contact us by e-mail. Also, very early on, a list of ‘on-line experts’ was compiled, ie a list of e-mail addresses of scientists working with specific protein families or domains, who agreed to review protein sequences in Swiss-Prot relevant to their field of research. This list is regularly updated and the 150 experts’ e-mail addresses, grouped by fields of expertise, are listed in the document.6

However, it does not seem clear to most users – who have grown accustomed to the repository of nucleotide sequence databases, where only the original authors are allowed to correct and update existing entries – that


Swiss-Prot is extremely different in that respect, and that we do have an ongoing editorial policy.

Margin note: User-submitted updates are highly valued

We do indeed highly value our users’ expertise, and we believe that it is only with the assistance of our user community that we can do our job of being comprehensive and up to date. We are therefore actively seeking any type of updates and/or corrections, whether they have been published or not, and would like to be notified about annotations to be updated, eg if the function of a protein has been clarified, or if new post-translational modification information has become available. In order to increase the visibility of these aspects, and to encourage our users to let us know about outdated protein entries or errors, we have implemented update forms on the ExPASy server (see the section below, ‘Making Swiss-Prot available to the users’).

Margin note: Web submission forms for updates

The forms, accessible from the bottom of every Swiss-Prot entry, prompt users to provide their corrections and updates in any format. Update requests are treated with a very high priority by annotators. We currently receive about 300 update requests for Swiss-Prot entries per year, a number that we would very much like to see growing in the future!

On the other hand, annotators send newly annotated entries to the original authors of reports cited in these entries so as to check the validity of the annotations. We generally get useful feedback, but not as much as we would like!

Margin note: Direct protein sequence submission

Another point of interaction with users is sequence submission directly to Swiss-Prot and TrEMBL. We accept submission of sequences that have been obtained only as amino acid sequence. A web submission tool (SPIN) has just been made available, which guides the submitter through the process, and prompts for all required pieces of information. There are about 300 such sequence submissions per year. It is interesting to note that 10 per cent of the proteins originate from venomous animals. This is explained by the fact that toxins can easily be purified in large quantity from venom and are generally quite small, thus they are easily sequenced at protein level.

We have to admit that we are disappointed by the low level of input from users in the updating of the database. We may have been insufficiently efficient in publicising our willingness and eagerness to welcome any type of help. Yet, after years of discussions with researchers, we believe that the root of the problem is of a sociological nature. The career of life scientists is driven by the famous ‘publish or perish’ injunction and submitting data to a database does not get any credit points on a CV. So we have to rely on the altruism of some individuals. We are indeed indebted to those persons who take the time to make sure that we adequately represent the results of their research in our database. However, we believe it is time that the community as a whole addresses this issue and initiates a process of responsibility toward the biomolecular databases.

Tools for annotation

The basic data organisation, the editor and the syntax checker

The working copy of Swiss-Prot is arranged in flat files, grouping proteins by family or other functional criteria. Although it was apparent from the beginning that the complexity of protein relationships could not be simulated simply by grouping entries one-dimensionally into separate files, this system allows curators to immediately find orthologues, which can all be updated when new findings become available for at least one protein, or when a review article summarises relevant knowledge on a protein family or subfamily and comes to new conclusions. The quick availability of all related entries (all in the same file) also ensures consistent annotation of all relevant entries. The 140,000 entries in the current release are thus split into 3,000 files.

Most of the annotation is done manually with the help of a continuously growing number of tools. We currently


use a text editor, Crisp (from Vital, Inc.), that is easy to use and comes with a powerful C-like macro language that we use extensively both for literature-driven textual annotation and as a platform to launch sequence analysis programs (see next section). An extensive series of macro-commands have been developed to reformat references, comment lines, feature lines or sequences, to check controlled vocabulary or syntax, and to retrieve entries from other databases. Analysis tools are also run directly from the editor with the help of macro-commands that send the sequence and other relevant information to the analysis program, and then retrieve the result and format it in the annotation platform. All commands are available both from keyboard shortcuts (which are preferred by experienced annotators) and from menus and dialogue boxes that are fully integrated in the editor’s graphical user interface (GUI) environment.

Margin note: Only well-structured data is easily accessible

Swiss-Prot annotation has always been subjected to very strict rules and guidelines. All entries are reviewed before they enter the database, which guarantees the homogeneity of the annotation. We developed a ‘syntax checker’ so as to make sure that our annotation and format rules are enforced. This syntax checker, implemented in Perl, is much more than a program that verifies the basic syntax of a Swiss-Prot entry. It also enforces the use of controlled vocabularies (see section below, ‘Standardisation and controlled vocabularies’) and checks for dependencies and consistencies between different portions of an entry. In December 2003, the syntax checker contained almost 1,100 different rules, each of which can lead to the detection of errors or inconsistencies.

Margin note: From a single text editor to an adapted annotation platform

Many people are surprised to hear that Swiss-Prot annotation is done from within a text editor. However, those same people are usually even more surprised once they see how powerful the annotation platform developed around that text editor is, and that almost every command can be launched, and its results treated, from within the editor, with remarkable speed. One major disadvantage of this environment is that it relies heavily on the flat file format. We are now developing a Swiss-Prot specific editor, which will work with the version of the databases formatted in extensible mark-up language (XML), and will include many consistency checks and context-specific menus. The new annotation platform will also include many graphical features, eg visualisation of domain and site predictions along the sequence. We believe that such a development is highly desirable, as it will allow the implementation of consistency checks directly at the level of the annotation platform while we now have to rely on a regular post-processing check of the data, using the syntax checker to enforce consistency.

Sequence analysis tools

The task of annotating Swiss-Prot entries has always relied on the use of the most appropriate sequence analysis programs so as to predict important sequence features. Over the years we have implemented many different methods and programs in our annotation platform. We have also spent a considerable amount of time testing new methods and selecting the most appropriate ones. In some cases, when no existing program could satisfy our needs, we have developed our own set of predictive methods.7,8 All these activities are carried out by a small research component within the Swiss-Prot group whose missions are to carry out technological watch and to develop new methodologies for protein sequence analysis.

Margin note: The number of prediction methods used in Swiss-Prot steadily increases

Currently we use software tools (a full list with references is available in the Swiss-Prot document annbioch.txt) to predict the following sequence features:

• signal sequences of type 1, type 2 (lipoprotein) and type 3;
• mitochondrial and plastid targeting sequences;
• transmembrane domains;
• coiled coil domains;


• specific repeats (leucine-rich repeat (LRR), tetratricopeptide repeat (TPR), WD (Trp-Asp) repeat, etc);
• statistically significant runs of amino acids and regions enriched in particular amino acids;
• N-glycosylation sites;
• glycosylphosphatidylinositol (GPI) anchors;
• sulphation sites;
• N-terminal myristoylation sites.

Margin note: PROSITE was created for the annotation of conserved domains and functional sites

In addition to the above list, we make extensive use of domain/family databases to annotate specific domains. In fact the development of the PROSITE database,9 which was first released in 1990, was specifically driven by the need to detect and annotate protein domains. The combined usage of profiles and patterns allows the detection of domains (profile) and the functional sites within domains (pattern). As mentioned in the section on cross-references below, there are now many other protein domain databases and we occasionally make use of most of them to annotate specific domains not yet covered by PROSITE. The reasons for our preference for PROSITE over other similar databases are pragmatic: PROSITE domain descriptors are specifically tailored for their use in the context of protein sequence annotation in order not to predict overlapping domains. Cut-off values are selected conservatively to minimise the number of false positives: we prefer to miss the occurrence of a domain rather than to over-predict its existence.

We believe that the use of the most up-to-date sequence analysis tools is essential to any protein sequence annotation effort. In addition anyone considering applying such methods on a large scale needs to develop internal benchmarks so as to objectively judge the validity and the scope of the methods. In many instances we have observed that the claims of developers of sequence analysis methods are slightly overblown and that one obtains unexpected results when using such methods on large and highly heterogeneous sets of sequences.

Automation: Trying to simulate the expertise of annotators

Thanks to genome sequencing efforts, there has been a tremendous rise in the number of available protein sequences. Yet clearly this is only the beginning and what exists now will represent only a drop in an ocean of uncharacterised sequences. And there lies both the problem and a possible solution: on one hand the overwhelming majority of genome-derived sequences are currently not the target of experimental characterisation and are probably not going to be so in the next decade. On the other hand, we have encapsulated in Swiss-Prot a tremendous amount of knowledge, some of which is specific to a given protein, while the majority can be carefully propagated to well-defined orthologous sequences. Automatic annotation is far from being a novel concept. But what we want to achieve in Swiss-Prot differs from what others expect from such systems. Their aim is to analyse new genomic sequences and predict a maximum of potential information items so as to be able to infer hypotheses on the potential biological processes present in the organism. Our aim is to make sure that we produce high-quality annotation with a minimal amount of incorrect inferences.

Margin note: HAMAP proved that automated annotation is not necessarily accompanied by a decrease in quality

Our first automatic annotation project is called HAMAP,10 which stands for High-quality Automated and Manual Annotation of microbial Proteomes. In the context of this project, proteins from complete bacterial and archaeal proteomes, together with the related plastid proteins, are automatically annotated based on manually created family rules for complete protein annotation, with template-based feature propagation. Proteins with no similarity to other proteins in Swiss-Prot, which we call ORFans, undergo an automated protein sequence analysis procedure that looks for many of the sequence features described in the preceding section. These features are then automatically annotated according to rules of consistency and dependency. A paper with further

46 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 5. NO 1. 39–55. MARCH 2004 Swiss-Prot: Juggling between evolution and stability

statistics on HAMAP is currently under review by another journal.

(Margin note: Anabelle extends the scope of automated annotation to proteins with a complex domain architecture.)

We have just developed a second system called Anabelle that strives to annotate not only ORFans and well-defined proteins, but also any protein with one or more conserved or functional domains or sites detected by one of the methods carefully selected for their accuracy by the Swiss-Prot team. The information retrieved from all results is logically combined according to selection rules and logical rules, thus coming to more trustworthy conclusions than possible when just looking at one result at a time. Anabelle is integrated in the annotator’s workbench: the automatically pre-selected analysis results are visualised in a graphical system, from which the annotator can choose the true positive results and easily generate annotation based on sequence similarity and sequence analysis. Not only does this speed up annotation, but it also promotes the consistent transfer of entire information blocks that logically group together, ensuring the usage of standardised vocabulary and minimising the probability of errors and typos.

We believe that careful application of rules to produce automatically or semi-automatically annotated protein entries brings about many advantages for users of Swiss-Prot. We know that many are apprehensive of the word ‘automation’ and are afraid that we will drown high-quality manually annotated entries with lower-quality ‘automated’ entries. We are very aware of this danger and are almost paranoid in our effort to ensure that automatic annotation will produce data of a quality up to that of manual curation. Finally it must be noted that one of the important changes planned in the Swiss-Prot format (see section on ‘Evolution of entry structure and format’ below) is very pertinent to this issue: the introduction of ‘evidence tags’ should allow us to unambiguously flag whether an information item has been derived manually or automatically.

Standardisation and controlled vocabularies

A long tradition of using controlled vocabularies in Swiss-Prot

To allow effective and precise database retrieval and searches, the same concepts need to be described with the same terms everywhere in the database. Controlled vocabularies or indexing terms can serve this purpose. A controlled vocabulary is defined as ‘an organized list of words and phrases, or notation system, that is used to initially tag content, and then to find it through navigation or search’.11

Since its creation, Swiss-Prot has stored information under specific line types, many of which are structured in such a way as to facilitate text searches in the database. Even the fields that appear to contain unstructured text are often written according to strict guidelines to ensure consistency. In some cases, lists are made where ‘preferred’ terms are associated with synonyms, spelling differences, abbreviations, or yet other terms considered as equivalents.

(Margin note: Controlled vocabularies demand continuous attention.)

Table 1 provides a partial description of where and how Swiss-Prot either makes use of existing controlled vocabularies or has developed such corpora. This list, even if incomplete, is impressive; yet it does not capture the whole complexity of issues surrounding the use of nomenclature and controlled vocabularies in the life sciences. We need to state here that if physicists or chemists behaved like biologists, we would probably live in a world without computers or plastic (this may sound like an attractive proposition to some!). Life scientists do not receive, during their training, the perception of the importance of following nomenclature rules. Yet, they are the first to complain when they look for specific information across one or many databases and fail to obtain a comprehensive answer because that information is heterogeneously described. Therefore we always felt that Swiss-Prot had a mission to fulfil in enforcing existing rules and more and more, as time passed by, to actively participate in the development of


Table 1: Standardisation efforts and use of existing or in-house controlled vocabularies in Swiss-Prot, listed by line type. (Note: Refer to the Swiss-Prot user manual for further information on all the information present in a Swiss-Prot record)

Protein names (DE line): We use as primary name the one that seems to be the most appropriate according to the function of a protein, to the nomenclature adopted by the specialists in that field or to the gene name, etc. We keep all synonyms used in publications and authors’ submissions except if they are misleading. Furthermore we transfer the same name to the orthologues of related organisms.
Gene names (GN line): Whenever a nomenclature committee (HUGO, FlyBase, etc.) provides ‘official’ gene names for a given organism, we try to enforce their choice of gene names, yet keeping what authors originally provided as synonyms.
Species names (OS line): The species names used in Swiss-Prot are listed in a document (speclist.txt). From the very beginning, care has been taken to store not only the official (scientific) name, but also the most useful common names and synonyms.
Species taxonomy (OC and OX lines): We make use of the taxonomy compiled by NCBI which is used by most major biomolecular sequence databases.
Organelle (OG line): We standardise plasmid name usage and list them in a Swiss-Prot document (plasmid.txt).
Reference comments (RC line): Among other uses, the RC line allows us to indicate the tissue from which a protein originates (TISSUE), or the strain (STRAIN). The tissues are reported in the file tisslist.txt and the strains in strains.txt. Both lists contain indications on synonyms.
Reference authors (RA line): As far as possible, the names of authors are stored according to consistent rules. For example the German umlaut is replaced by an ‘e’ following the vowel on which the umlaut was perched, the hyphen is retained between two initials (which is removed in Medline/PubMed), we keep all the initials (even where PubMed only keeps two) and we often correct misspelling in author names!
Reference location (RL line): Journal abbreviations in Swiss-Prot follow whenever possible those used by the National Library of Medicine (NLM). We provide a journal list (jourlist.txt) that, in addition to the journal names and abbreviations, also provides ISSN (International Standard Serial Number), CODEN number, publishers and journal home page web addresses.
Comments (CC line): The CC lines mainly contain free text comments classified under 24 different topics. If a piece of information cannot be classified under a specific topic, it is put under ‘MISCELLANEOUS’. However, with time, the information in the CC lines is becoming less ‘free’ so to speak, and more and more CC line topics are subjected to controlled vocabularies. For example, this is the case of the ‘CATALYTIC ACTIVITY’ topic whose text is taken from the ENZYME database12 for all known enzymes, referred to by their EC (Enzyme Commission) numbers in the DE lines. We are currently standardising the use of the ‘COFACTOR’, ‘PATHWAY’ and ‘SUBCELLULAR LOCATION’ topics.
Keywords (KW line): Keywords were one of the first sets of controlled vocabulary in Swiss-Prot. They were introduced to summarise the content of an entry and to group entries according to different aspects related to biological processes, molecular function, subcellular location, domains, ligands, sequence modifications and diseases. We provide a keyword list (keywlist.txt) that is being superseded by a dictionary that provides the precise definition of the usage of a keyword in the context of Swiss-Prot. The dictionary also includes synonyms, groups keywords into categories and provides a mapping between Swiss-Prot keywords and GO terms (see section ‘Going ahead with GO in Swiss-Prot’).
Feature table (FT line): We are currently establishing a controlled vocabulary for the features describing post-translational modifications (PTMs).13 We are also building a PTM database to store, for each type of modification, information such as the general description, target(s), chemical formula, subcellular localisation of modified site, enzyme(s) carrying out the PTM, etc. Domain-type (DOMAIN, REPEAT, DNA_BIND, ZN_FING, etc.) feature descriptions are also standardised across all of Swiss-Prot.
Sequence: The sequences are stored in the one-letter code adopted by the commission on Biochemical Nomenclature of the IUPAC-IUBMB.
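
The author-name convention listed for the RA line in Table 1 (an umlaut is replaced by an ‘e’ following the vowel, and the hyphen between two initials is retained) can be illustrated with a minimal Python sketch. This is not the Swiss-Prot implementation: the function name, the mapping table and the handling of capital letters are assumptions made purely for illustration.

```python
import unicodedata

# Hypothetical mapping for illustration only: each umlauted vowel becomes
# the plain vowel followed by 'e'. The treatment of capitals is an assumption.
_UMLAUT_MAP = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}


def normalise_author_name(name: str) -> str:
    """Return an author name with umlauts expanded to vowel + 'e'.

    Hyphens between initials are left untouched, as the RA-line rule asks.
    """
    # Compose characters first so that 'u' + combining diaeresis is also caught.
    composed = unicodedata.normalize("NFC", name)
    return "".join(_UMLAUT_MAP.get(ch, ch) for ch in composed)


if __name__ == "__main__":
    print(normalise_author_name("Müller H.-J."))
```

Keeping the rule in one small lookup table makes it easy to review against a written guideline, which is in the spirit of the consistency checks described in the table.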

new nomenclature and controlled vocabularies. Anecdotally such an active role can have some unexpected consequence: we were once threatened with a lawsuit because we did not accept to use as a valid gene symbol the one proposed by an author.

All of this leads us to give the following advice to would-be developers of databases:

• Try to follow as much as possible existing controlled vocabularies and nomenclatures.
• Do not hesitate to contact the groups maintaining these resources and to point out inconsistencies and/or errors.
• Do not be afraid to take a firm stand toward your users when they request the representation in your database of terms that do not follow a specific guideline. You can always (and you should!) store this information as a synonym.

Going ahead with GO in Swiss-Prot

If we assume, as mentioned above, that ‘users and database should agree on the meaning of the term being used’, given the large number of biomolecular databases available, this indirectly implies


that all databases should agree on the meaning of a term! In an attempt to achieve this ambitious goal, maintainers of FlyBase, MGD and SGD joined forces and formed the GeneOntology (GO) Consortium.14 They established three ontologies, gathering key terms for cellular components, biological process and molecular function, thus catering for a large need for standardisation that could be observed all across the scientific community.

(Margin note: Swiss-Prot introduces Gene Ontology terms very carefully.)

From the beginning of the GO activities, we were repeatedly approached by users wondering when we would introduce GO terms to Swiss-Prot and TrEMBL. However, while clearly welcoming the effort made by the GO Consortium, we were reluctant to add links to GO at that time: given the initially small scope (GO specialised in three major organism groups, whereas Swiss-Prot has to deal with thousands of different species), and the fact that many mappings had been created automatically and were thus likely to assign GO terms to unrelated proteins, we considered it dangerous to mislead users into incorrect assumptions. We did not want to risk the situation where someone would happily accept a GO assignment indicating a function for an otherwise uncharacterised protein, without further questioning the assignment because they trust the judgment of Swiss-Prot annotators and the high quality of the manual annotations.

It was only in 2003 that we felt it had become ‘safe’ to start introducing GO terms in Swiss-Prot. We felt that GO had indeed considerably matured and had increased its coverage. What is more, several species-specific databases have established manually curated mappings between GO terms and their gene catalogues. The EBI GO team has mapped Swiss-Prot keywords to GO terms. Evidence tags are available in GO to indicate whether an assignment has been done automatically or by manual curation. The time had come to follow the demands, and to introduce cross-references (see section on ‘Cross-references in Swiss-Prot’ below) from Swiss-Prot to GO. We added them in all cases where they originated from manual annotation efforts. We are also in the process of introducing GO terms for all members of microbial protein families that fall under the scope of the HAMAP annotation project.

Evolution of entry structure and format

Since its creation in 1986, the basic structure of a Swiss-Prot entry has not changed significantly. The distinct line types defined by a two-letter code are generally relevant to all entries and cover the core data, while the actual protein information is given in the comment (CC) lines and in the feature table (FT). While the general framework has been very stable, we have carried out many changes over the years. New line types were introduced, the structure of existing line types was constantly refined and new sub-fields (comments topics, feature keys) were added. Such changes are always documented (in release notes and other documents) and users are warned in advance of pending changes so that they can adapt their software tools. While the general stability of the Swiss-Prot flat file format may be seen as a proof of foresight, careful planning and experience, one can also say that in some respect Swiss-Prot had become a victim of its own success: even the smallest modification to the flat file format, or the introduction of new fields, needs to be considered carefully, and it happens that ideas are discarded for the sole reason that ‘this will cause the crash of thousands of programs out there...’.

Swiss-Prot and TrEMBL have traditionally been maintained and distributed as flat files. An inherent problem of flat file databanks is that their maintenance becomes increasingly difficult when they grow in size and many people are involved in the production of the data. Since 2002, Swiss-Prot and TrEMBL have also been distributed in


XML,15 the extensible mark-up language that makes it possible to define the content of a document separately from its formatting, making it easy to reuse that content in other applications or for other presentation environments. XML allows, in contrast to HTML, the authors of a document to create their own mark-up tags suiting their needs and allowing the best structure for the data. But what is more, XML allows implementing rules that are not limited to formatting, but can be used to formulate dependencies. We are also in the process of porting the production of Swiss-Prot and TrEMBL to a relational database management system. In order to develop the relational and XML schema, we have designed conceptual data models, using the Unified Modelling Language (UML) notation, to represent the structure and constraints present in the data.

In the meantime, until the production copy of Swiss-Prot is managed in a relational database management system, we still need to introduce certain format changes to the flat file in order to accommodate more complex concepts. Such changes can be quite substantial and time-consuming, as they are always introduced in a way that not only new annotation is performed according to the new format, but all existing entries need to be converted. As a consequence, this can involve, in addition to the creation of conversion software, and to the modification of documentation and annotation tools, a lot of manual cleaning. That we need to embark on such manual cleaning steps is not due to the structure or the format of the database, but rather to our pathological urge to make sure that all aspects of Swiss-Prot are self-consistent. Therefore, whenever we introduce a new type of data, we try as much as possible to update all the entries where such data have some relevance.

There are many changes we plan to make to the flat file format. For example, in the near future, we plan to overhaul the format of the GN (gene) line so that it will allow a more structured representation of the information concerning gene names. The new format will allow distinguishing official gene name, synonyms, ordered locus name and ORF names. This change allows a better representation of the complexity of gene and locus naming schemes.

As we described in the section on automatic annotation, it is important to provide users with a means to track down the origin of all information items in a Swiss-Prot entry. Such a need was not apparent in the early days of Swiss-Prot as most information was derived from a single paper that both reported the sequence and its characterisation. This is no longer true and some entries contain information originating from up to 110 references as well as the results of many sequence analysis tools. It is therefore necessary to provide ‘evidence tags’. These are links between an information item and its source, whether a reference, the judgment of an annotator or the result of a program. Such evidence tags already exist in TrEMBL. We have been very slow in the process of providing them in Swiss-Prot, partly because they are difficult to implement in the current annotation platform and because they are very cumbersome in the current flat file format. Evidence tags are therefore going to be implemented in the XML and relational versions of Swiss-Prot and will probably not be available in the flat file distribution.

Cross-references

Cross-references in Swiss-Prot

Cross-references as a way to access related information in other databases have been an integral part of Swiss-Prot almost since the beginning (they were introduced in release 4 of April 1987). Navigating between databases is much less of a challenge now, thanks to the web, than it was back in the late 1980s. The early presence of DR (Database cross-Reference) lines in Swiss-Prot shows how anticipatory we were in conceiving the database in a way that facilitates data integration. One of the first important


software applications that made use of Swiss-Prot cross-references was the Sequence Retrieval System (SRS),16 developed by Thure Etzold at EMBL, from 1990 on. In addition to providing a search interface for multiple databases with a single query, an important feature of SRS is its ability to combine all indexed databanks into a network, where new ways of linking information from different sources can be explored. One of the main reasons why this became possible was the fact that Swiss-Prot, one of the first databases indexed under SRS, was so highly cross-referenced. SRS documentation contained in 1990, and still contains in 2003, an image showing biological databases linked to each other in the form of a network, the centre of which is Swiss-Prot, connected with practically all the other databases indexed under SRS.

The first databases cross-referenced in Swiss-Prot were the primary DNA and protein sequence databases EMBL and PIR, and the PDB protein structure database. New links were regularly added at each of the major Swiss-Prot releases. Currently Swiss-Prot is linked to 55 different databases and each entry contains an average of 9.1 links. One would naively assume that an entry does not contain more than a single cross-reference to a given external database. This is not always true, for a variety of reasons that generally depend on the structure of the external database. For example, there is an average of 1.92 cross-references to the EMBL DNA sequence database per Swiss-Prot entry. This reflects the redundant archival nature of the nucleotide databases. However, this overall average does not convey the true nature of the situation: 58 per cent of all Swiss-Prot entries contain only one cross-reference to EMBL, while 6.2 per cent contain more than five such cross-references.

A special emphasis should be given to the cross-references to family/domain databases. PROSITE was the first of these databases to be created and accordingly the first to be cross-referenced in Swiss-Prot. When cross-references to PROSITE were introduced in 1990, there was an average of 0.42 per Swiss-Prot entry. In 2003, this number was more than twice as high, an increase that can be explained by improved methods to detect domains, but also by the fact that PROSITE increasingly reacts to the demands from Swiss-Prot annotators: whenever a newly annotated protein family carries a particular domain that is not yet present in PROSITE, the PROSITE staff creates a discriminator (pattern or profile) for that domain. Many other family/domain databases were created in the past ten years, most of which are cross-referenced in Swiss-Prot and also incorporated in the InterPro17 resource which unites these databases ‘under one roof’. Today a Swiss-Prot entry contains an average of 5.2 links to family/domain databases. These cross-references can also be seen as a pointer to the existence of a specific domain in a given protein sequence.

As mentioned above, in 2003, we have added cross-references to the three GO ontologies. These cross-references have a dual purpose: they allow navigation toward an external resource (here GO), and they also serve as information items. This may be better explained by the following example:

DR   GO; GO:0012501; P:programmed cell death; TAS.

In the above line, the GO accession number ‘GO:0012501’ provides a handle to access the GO database (navigation), the ‘P:programmed cell death’ indicates that the protein is involved in the biological process (‘P’) of programmed cell death and the ‘TAS’ stands for ‘Traceable Author Statement’.

Cross-referencing versus integrating

Over the years, it became clear that our strategy to ‘delegate’ specialist tasks to the specialists (and establish reciprocal links), while concentrating on the more ‘generalist’ annotation was satisfactory.


This was facilitated and influenced by the appearance of more and more databases: the WWW made it a lot easier to publish expert knowledge. Existing and well-established databases (eg FlyBase) took advantage of the increased visibility offered by the web, and many additional new information resources burgeoned. A number of these databases were constructed around the primary sequence or organism-specific gene nomenclature databases, and used the accession numbers of the sequence databases (or the primary gene names) as their set of unique identifiers. An example is GeneCards, a database of ‘information cards’ on every human protein in Swiss-Prot and TrEMBL. Such databases are usually cross-referenced to Swiss-Prot via ‘implicit’ links, created on the fly by the NiceProt tool (see section below, ‘Making Swiss-Prot available to the users’) that displays a Swiss-Prot entry on ExPASy. In addition to the explicit cross-references ‘hard-coded’ in the Swiss-Prot DR lines, the concept of implicit links reinforces the role of Swiss-Prot as a central hub for molecular biology information.18

There may seem to be certain drawbacks related to the strategy of establishing extensive cross-links v. the idea of integration of all data locally:

• ‘loss of control’;
• cross-references create a certain dependency (when free public access to the Yeast Proteome Database (YPD) was discontinued, expectations grew again for Swiss-Prot to provide more extensive annotation for Saccharomyces cerevisiae);
• necessity to rely on the willingness to collaborate of providers of the specialised cross-referenced databases (eg use of standard nomenclature and common identifiers, provide or at least help with mappings between Swiss-Prot accession numbers and their database);
• some foresight and knowledge of the related field is necessary, in order not to make the effort of adding links to a resource that will not be updated or that is likely to lose funding, with the consequence of being forced to remove those links after a short while.

However, these disadvantages are easily outweighed by a gain in time and the relief not to ‘have to be an expert in every field’, as well as the reward of fruitful collaborations and exchanges. Procedures have been established to obtain mappings between Swiss-Prot sequences on one side, and relatively heterogeneous information on the other: nucleotide sequences, gene names, modification sites, domain descriptors, ontologies, etc. Many cross-references, in particular those that are based on sequence searches, ie domain and family classification, are now already applied to TrEMBL. This means that an entry comes with a certain number of DR lines before manual annotation even starts. Some other DR lines, however, require careful checking by an annotator, and yet others have to be added completely ‘manually’ as they can only be established after perusal of literature and other sources (eg MIM). While the list of cross-referenced databases keeps growing, it does happen that we are obliged to remove links to certain databases. This can have several different reasons, the most frequent ones being a lack of funding and subsequent discontinuation of a database, or the decision of a database maintainer to commercialise a resource and discontinue free web access even for academic users.

Some thoughts on unique and stable identifiers

There are some important observations to make about cross-referencing in general. To implement cross-referencing to a database, that database needs to provide unique and stable identifiers (USI) for each of its entries. These USI are often known as accession numbers. Such a requirement may seem obvious, but it is still often the case that databases do not see the need for stable identifiers. For example, a species-specific database may


use gene names as their unique identifiers. The problem is that such identifiers may be unique but are certainly not stable as it is most probable that some of the gene names will change over time. Far more important for future developments is our belief that major objects in a database require their own independent sets of USI. We became aware of this when we saw the need to add USI to a number of objects in Swiss-Prot, thus allowing external databases to seamlessly implement cross-references to a specific object in Swiss-Prot rather than at the level of the entire entry. A good example of such developments is the creation of feature identifiers (FTId) for all human protein sequence variants in Swiss-Prot. These identifiers allow specialised databases that report mutations concerning a specific set of genes to make a cross-reference to the representation of that mutation in Swiss-Prot.

MAKING SWISS-PROT AVAILABLE TO THE USERS

In prehistoric times (ie before the Web), Swiss-Prot reached its users by a variety of means. It was sent on computer tapes by the EMBL, it was distributed on floppy disks by companies selling sequence analysis software and, in 1989, it became the first major biomolecular database to be distributed on CD-ROM. In parallel to the physical distribution of Swiss-Prot, the database was made available by anonymous ftp and was searchable from a number of on-line resources such as BIONET and the NCBI IRX database retrieval software.

When the World-Wide Web began in 1993, Swiss-Prot became available on the ExPASy19 server,20 which was born on 1st August, 1993. At that date there were fewer than 150 web servers worldwide. To the best of our knowledge ExPASy was the first web server for the life science community. We were very pleased to see that it was accessed 7,295 times during its first month of activity. We never imagined that a few years later it would be accessed at a rate of 8–10 million hits per month. It has now been accessed more than 300 million times by a total of more than three million computer hosts from 200 countries. Seven mirror sites, ie exact copies of the main site in Switzerland, have been established in Australia, Bolivia, Canada, China, Korea, Taiwan and the USA. It is also worth mentioning that ExPASy and the EBI server21 are far from being the only web servers that redistribute Swiss-Prot and TrEMBL; we estimate that there are about 50 such sites worldwide.

ExPASy has constantly evolved in its ten years of existence. It is outside the scope of this paper to describe all of what is available on the server, yet we want to point out two significant developments that reflect our response to the needs of users.

In autumn 1998, we initiated ‘NiceProt’, with the intention to provide scientists with a more user-friendly way of looking at Swiss-Prot and TrEMBL entries. Instead of showing the raw Swiss-Prot data format (with its two-letter line types), we decided to make use of HTML tables to group certain fields under common headings, and to replace the line type by a more explicit key (eg ‘Cross-references’ instead of ‘DR’). This was initially targeted at users who are not familiar with the Swiss-Prot data format, but rapidly caught on in the scientific community. Gradually, more and more functionalities were added, including many implicit cross-references, and links to context-specific documentation. During the first eight months of 2003, ExPASy treated about 1 million requests for individual Swiss-Prot or TrEMBL entries on average per month. An overwhelming majority of these hits (85 per cent) are for NiceProt, whereas the remaining 15 per cent account for accesses to the raw text version, or the ‘htmlised’ view that was prevalent prior to September 1998.

The NEWT22 taxonomy browser23 is a service introduced in 2002 that serves as an entry point into Swiss-Prot and TrEMBL using taxonomic search criteria.


The core of NEWT consists in the integration of Swiss-Prot specific taxonomy information with the NCBI taxonomy data in a relational database. Taxonomic nodes are stored in a hierarchical tree; this allows easy navigation through the taxonomy lineage from every taxon. The web interface to NEWT allows users to search and browse the daily updated taxonomy data. Users can navigate through the taxonomy tree and access corresponding Swiss-Prot and TrEMBL protein entries. Additionally, a manually curated selection of over 24,000 external links (including more than 13,000 photographs) provides specific information on selected species.

Both NiceProt and NEWT are representatives of the trend toward a ‘customisation’ of the representation of knowledge. We believe that this trend will not abate; there are many specific communities of life scientists that require information on proteins, yet want them to be represented in a style or perspective specific to their field of research. We are in the process of developing new types of views.

We also believe that the ExPASy server access log files are a valuable source of information as to the most frequently consulted TrEMBL entries (ie unannotated entries that will greatly benefit from manual annotation), scientists’ use of search engines, the context in which certain entries are consulted, etc. We therefore plan to mine the ExPASy log files and expect to be able to draw enlightening conclusions!

CONCLUSIONS

Being a well-established database, we can say that the tireless effort of juggling between evolution and stability has been an exhausting but suitable strategy for the development of the Swiss-Prot protein knowledgebase. Early design features of the database such as the detailed structuring of the entry format, the standardisation of nomenclature, the regular review of the annotation of protein families have been shown to be indispensable. The explosive growth in uncharacterised sequence data has led us to the implementation of automatic and semi-automatic processes. They are designed to ensure the same high-quality standards that have always been the hallmark of Swiss-Prot. Automation has to go in parallel with the introduction of evidence tags that will allow distinguishing data sources and inferences.

We strongly believe that the future of Swiss-Prot and of any similar curated information resource relies on the active participation of the life sciences community. This will require an increased educational effort on our part. It is also dependent on the commitment of scientific societies, publishers and funding agencies to provide a framework to facilitate community efforts and give due credit to the participating scientists.

As a closing remark, we would like to thank all the persons involved in the development of Swiss-Prot at the SIB and EBI as well as all the funding agencies and companies that have financially contributed to the continuous evolution of the Swiss-Prot knowledgebase.

Acknowledgments

The work described in this paper covers activities funded by various sources including NIH:1 U01 HG02712–01, EU:BioMinT; QLRT-2001–02770, EU:Temblor; QLRT-2001–00015, EU:BioBabel; QLRI-CT-2001–00981, SNF:3100–063879.

References

1. Boeckmann, B., Bairoch, A., Apweiler, R. et al. (2003), ‘The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003’, Nucleic Acids Res., Vol. 31, pp. 354–370.
2. Bairoch, A. (2000), ‘Serendipity in bioinformatics, the tribulations of a Swiss bioinformatician through exciting times!’, Bioinformatics, Vol. 16, pp. 48–64.
3. Apweiler, R., Bairoch, A., Wu, C. H. et al. (2004), ‘UniProt: The universal protein knowledgebase’, Nucleic Acids Res., Vol. 32, pp. D115–D119.
4. Dayhoff, M. O., Eck, R. V., Chang, M. A. and Sochard, M. R. (1965), ‘Atlas of Protein Sequence and Structure’, Vol. 1, National

54 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 5. NO 1. 39–55. MARCH 2004 Swiss-Prot: Juggling between evolution and stability

Biomedical Research Foundation, Silver post-translational modifications in the Swiss- Spring, MD. Prot knowledgebase’, , in press. 5. Moore, J., Engelberg, A. and Bairoch, A. 14. Ashburner, M., Ball, C. A., Blake, J. A. et al. (1988), ‘Using PC/GENE for protein and (2000), ‘Gene ontology: Tool for the nucleic acid analysis’, Biotechniques, Vol. 6, unification of biology. The Gene Ontology pp. 566–572. Consortium’, Nature Genet., Vol. 25, pp. 25–29. 6. URL: http://www.expasy.org/cgi-bin/ experts 15. URL: http://www.ebi.ac.uk/swissprot/SP- ML 7. Monigatti, F., Gasteiger, E., Bairoch, A. et al. (2002), ‘The Sulfinator: Predicting tyrosine 16. Etzold, T. and Argos, P. (1993), ‘SRS – an sulfation sites in protein sequences’, indexing and retrieval tool for flat file data Bioinformatics, Vol. 18, pp. 769–770. libraries’, Comput. Appl. Biosci., Vol. 9, pp. 49–57. 8. Bologna, G., Veuthey, A.-L., Yvon, C. et al. (2004), ‘N-terminal myristoylation predictions 17. Mulder, N. J., Apweiler, R., Attwood, T. K. by ensembles of neural networks’, Proteomics, et al. (2003), ‘The InterPro Database, 2003 in press. brings increased coverage and new features’, Nucleic Acids Res., Vol. 31, pp. 315–318. 9. Hulo, N., Sigrist, C., LeSaux, V. et al. (2004), ‘Recent improvements to the PROSITE 18. Gasteiger, E., Jung, E. and Bairoch, A. (2001), database’, Nucleic Acids Res., Vol. 32, pp. ‘SWISS-PROT: Connecting biological D134–D137. knowledge via a protein database’, Curr. Issues Mol. Biol., Vol. 3, pp. 47–55. 10. Gattiker, A., Michoud, K., Rivoire, C. et al. (2003), ‘Automated annotation of microbial 19. Gasteiger, E., Gattiker, A., Hoogland, C. et al. proteomes in Swiss-Prot’, Comput. Biol. (2003), ‘ExPASy – the proteomics server for Chem., Vol. 27, pp. 49–58. in-depth protein knowledge and analysis’. Nucleic Acids Res., Vol. 31, pp. 3784–3788. 11. Warner, A. (URL: http:// www.lexonomy.com/publications/ 20. URL: http://www.expasy.org aTaxonomyPrimer.html). 21. URL: http://www.ebi.ac.uk 12. 
Bairoch, A. (2000), ‘The ENZYME database 22. Phan, I. Q., Pilbout, S. F., Fleischmann, W. in 2000’, Nucleic Acids Res., Vol. 28, pp. 304– and Bairoch, A. (2003) ‘NEWT, a new 305. taxonomy portal’. Nucleic Acids Res., Vol. 31, pp. 3822–3823. 13. Farriol-Mathis, N., Garavelli, J. S., Boeckmann B. et al. (2004), ‘Annotation of 23. URL: http://www.ebi.ac.uk/newt/
