Exploring the Data in SWISS-PROT and TrEMBL

SERGIO CONTRINO, YOULA KARAVIDOPOULOU European Bioinformatics Institute Genome Campus, Hinxton, Cambridge CB10 1SD UNITED KINGDOM [email protected] http://www.ebi.ac.uk/~contrino

Abstract: - Researchers need well-organized, easy access to the vast amount of information that is now available to them. We developed a search facility that exploits the quality and the relational implementation of the SWISS-PROT and TrEMBL databases. Our system offers a useful and immediate entry point for human genome data, with a specific interest in gene-disease links. This 'portal' approach allows easy integration of a remarkable number of sources of relevant information, and we think it will be valuable to the researcher.

Key-Words: - human, proteome, , disease, gene, database

1 Data integration models

The Internet age has brought an unprecedented wealth of information to the researcher's fingertips. The problem of organizing this information and taking advantage of its enormous potential is certainly one of the more practical and pressing ones of our times. In biology the deluge of information from genomics projects accentuates this fact.

We can roughly identify two approaches to this problem of data management: a 'deep' data integration model and a 'loose' one. The 'deep' approach tries to achieve actual data integration, with many disparate sources used to build a comprehensive and uniform repository of information, which is then used to gain knowledge. It is the 'data warehouse' model. The 'loose' approach in a way surrenders to the variability, inconsistency and errors of data, data formats and supporting software, and merely tries to put relevant information as close as possible to a certain entry point. It is the 'portal' model.

In our gene search facility we try to give an easy access point to human genome data, with some possibilities for mining and a remarkable collection of related information. This portal approach has its strong basis in the relational representation of the SWISS-PROT and TrEMBL databases [1, 2].

2 System components

Our search engine can be found in the context of the Human Proteomics Initiative (HPI) [3, 4]. It relies for its basic functions on a relational implementation of the SWISS-PROT and TrEMBL databases. This implementation is maintained at the European Bioinformatics Institute (EBI) [5]. We will refer to these databases collectively as 'SPTr'. Short documentation on the schema is publicly available [6]. The system also makes use of the SRS [7] installation at EBI and links to resources pooled in the Proteome Analysis databases [8, 9]. Among them, InterPro [10, 11] and CluSTr [12, 13] are also implemented in a relational format and maintained at EBI.

Direct access to the database is provided through Java Servlets [14] using a small pool of Java programs. The user interface is HTML based and runs in a web browser. The chromosome tables take advantage of HTML 4.0 features for a faster display. A brief description of these components is given below.

2.1 SWISS-PROT and TrEMBL databases

SWISS-PROT [1, 2] is a non-redundant, manually curated sequence database, which has been successfully maintained and constantly enriched and improved since it was first established in 1986. However, the number of protein sequences awaiting incorporation into the database has increased dramatically in the last few years, due to advances in gene sequencing and whole genome mapping projects. As this newly produced information should be available to the public as quickly as possible, but without diluting the quality standards of SWISS-PROT, a computer-annotated protein sequence database supplementing SWISS-PROT was created in November 1996. This computer-annotated sequence databank, TrEMBL [1, 2], contains the translations of all coding sequences (CDS) present in the EMBL-Bank nucleotide sequence database [15, 16] which have not yet been integrated into SWISS-PROT.

The latest SWISS-PROT release (release 39.22 of 20-Jun-2001) contains 98739 sequence entries, whereas there are 473505 entries in the current TrEMBL release (release 17.0 of June 2001). There are currently 7340 annotated human sequences in the two databases (7138 in SWISS-PROT and 202 in TrEMBL).

2.2 Human Proteomics Initiative

In addition to the above-mentioned 7340 annotated human entries in SPTr, there are many more known human protein sequences awaiting high-level annotation and integration into the database. Therefore, in order to enrich the biological knowledge of the human genome, the Swiss Institute of Bioinformatics (SIB) [17] and EBI initiated a project to annotate, describe and distribute to the scientific community all known human protein sequences as soon as possible. This is known as the 'Human Proteomics Initiative' (HPI) [3, 4].

2.3 Proteome Analysis

The Proteome Analysis Database [8, 9] was created in March 2000 by the SWISS-PROT and TrEMBL group at EBI and integrates information from a variety of sources to facilitate the functional classification of the proteins encoded in complete genomes (the proteome).

The database provides a perspective on domain structure and function, gene duplication and protein families in different genomes by offering statistical, structural, functional and comparative analysis of the predicted protein coding sequences for the completely sequenced organisms present in SPTr, spanning archaea, bacteria and eukaryotes. Complete proteome sets have been built for each of the organisms, providing reliable, well-annotated data as the basis for the analysis.

2.3.1 InterPro and CluSTr databases

The InterPro and CluSTr databases are the main resources used in Proteome Analysis. InterPro and CluSTr give a new perspective on families, domains and sites and cover 31% to 67% (InterPro statistics) of the proteins from each of the complete genomes.

InterPro [10, 11] is an integrated documentation resource of protein families, domains and functional sites that was initially developed as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam, ProDom and SMART database projects. The CluSTr (Clusters of SPTr proteins) database [12, 13] offers an automatic classification of SPTr proteins into groups of related proteins. Structural information is presented in the Proteome Analysis Database and includes primary, secondary and tertiary structure information for each proteome. Functional classification of each proteome using Gene Ontology (GO) [18, 19] is also available. A program has been constructed to carry out InterPro comparisons of any one proteome against one or more of the other proteomes in the database [20].

Analysis is carried out for all the completely sequenced organisms present in the SPTr database (currently 47 organisms; this will be 55 organisms in early August 2001 and many more are in the pipeline), and a preliminary proteome analysis is also produced for the incomplete human genome.

3 Human Chromosome tables

As part of the Human Proteome Analysis, mappings have been created between human entries in SPTr and the corresponding official HUGO gene symbol or NCBI LocusLink provisional symbol. Starting with data available from the LocusLink database [21], this was achieved by tracking protein identifiers. Up to now, 13811 gene mappings have been made, 8094 of which have an assigned official HUGO gene name. 12393 of the total gene mappings have a known chromosome location.

This information is stored in a relational format and is available through the Proteome Analysis pages as gene sets for each of the human chromosomes [22]. In addition, there is a special set for mitochondrial genes and one for genes that are not yet mapped. The chromosome tables aim to provide a comprehensive reference of the human genome data in SPTr for which the gene name and chromosome location are known. Each gene set is an alphabetic listing of the genes with either a HUGO approved gene symbol or an NCBI LocusLink provisional symbol encoded on that chromosome, together with each gene's chromosomal position, information about the protein it encodes and useful links to other databases. Links are provided to the respective Human Gene Nomenclature Database (HGNC), The Genome Database (GDB), GeneCards, SPTr, OMIM, EMBL-Bank, Ensembl, InterPro and CluSTr entries.

There are two global views of this data: one lists all the genes on a specific chromosome, the other lists only those that are annotated as linked to one or more diseases.

4 Gene search

A more agile interface to this information is given by the gene search facility. It gives access to the system information using either a gene name or a keyword, and it can be found at the same Internet location as the Human Chromosome tables [22].

Using the gene name, a synoptic gene page is produced. It contains, as in the chromosome tables, the location of the gene, a link to the protein that the gene encodes and information concerning this protein. This information comprises the accession number, entry name, description, diseases linked to the gene (if any), cross references to nucleotide sequences where the coding region of the protein can be found, associated entries in the SPTr database and cross references to OMIM and Ensembl.

The protein accession number is also used to link to the InterPro and CluSTr databases, while the gene name provides a link to GDB and GeneCards. It is also possible to extract from MEDLINE all the references to the gene in the scientific literature.

The system can also do an approximate search, looking for the genes whose name contains a certain string (e.g. querying with this option using 'reg', the system will retrieve AREG, CREG, EREG, REG1A, REG1B). It will display an intermediate page with some summary information and links to the relevant genes.

Another way to interrogate the system is using a keyword. In this case too, two search options are offered: exact match and a more general text search. For the exact match the system will look for SPTr entries related to human genes sharing a certain keyword. The keyword is in this case a SWISS-PROT keyword. It belongs to a controlled list of approximately 850 keywords that can be accessed from the search form. These keywords appear in SPTr KW lines and can consist of more than one word. The system will display an intermediate page with summary information about the relevant genes: gene name, location, protein accession number, description and disease (if relevant).

If no record has been retrieved, a direct link to the alternative keyword search is given. This will look for the submitted word in the keywords and description (DE lines) of the relevant SPTr entries. This is a simple and rather powerful way of associating related genes, which can be further exploited using the InterPro link. In fact, InterPro entries link all proteins of the same protein family or containing the same domain.

4.1 An example

For example, let us imagine we are interested in GPCRs (G-protein-coupled receptors). More than half of the current prescription drugs are targeted to GPCRs [23]. A first set of questions would be how many such proteins there are in the human genome, where they are located and which of them are linked to some sort of disease. All this can be easily accomplished by the keyword query with our system. We may then be interested in the uncharacterized potential GPCRs in the database. Those too are easily detected and can be further explored.

As pointed out by Boyer et al. [24] regarding the identification of potential oncogenes, the availability of the "unique sequence and chromosomal address of each and every gene should expedite the rapid identification of the relative few that undergo mutagenics alteration in association with specific cancer". Our tool provides a useful and immediate collection point for this kind of information.

References:

[1] Bairoch A., Apweiler R., The SWISS-PROT protein sequence database and its supplement TrEMBL, Nucleic Acids Res., Vol.28, 2000, pp.45-48.
[2] http://www.ebi.ac.uk/swissprot/
[3] http://www.expasy.ch/sprot/hpi/
[4] O'Donovan C., Apweiler R., Bairoch A., The human proteomics initiative (HPI), Trends Biotechnol., Vol.19, No.5, 2001, pp.178-181.
[5] http://www.ebi.ac.uk/
[6] http://www.ebi.ac.uk/~contrino/sp/
[7] http://srs6.ebi.ac.uk/
[8] Apweiler R., Biswas M., Fleischmann W., Kanapin A., Karavidopoulou Y., Kersey P., Kriventseva E.V., Mittard V., Mulder N., Phan I., Zdobnov E., Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes, Nucleic Acids Res., Vol.29, No.1, 2001, pp.44-48.
[9] http://www.ebi.ac.uk/proteome/
[10] Apweiler R., Attwood T.K., Bairoch A., Bateman A., Birney E., Biswas M., Bucher P., Cerutti L., Corpet F., Croning M.D., Durbin R., Falquet L., Fleischmann W., Gouzy J., Hermjakob H., Hulo N., Jonassen I., Kahn D., Kanapin A., Karavidopoulou Y., Lopez R., Marx B., Mulder N.J., Oinn T.M., Pagni M., Servant F., Sigrist C.J., Zdobnov E.M., The InterPro database, an integrated documentation resource for protein families, domains and functional sites, Nucleic Acids Res., Vol.29, No.1, 2001, pp.37-40.
[11] http://www.ebi.ac.uk/interpro/
[12] Kriventseva E.V., Fleischmann W., Zdobnov E.M., Apweiler R., CluSTr: a database of clusters of SWISS-PROT+TrEMBL proteins, Nucleic Acids Res., Vol.29, No.1, 2001, pp.33-36.
[13] http://www.ebi.ac.uk/clustr/
[14] http://java.sun.com/products/servelet/
[15] Stoesser G., Baker W., van den Broek A., Camon E., Garcia-Pastor M., Kanz C., Kulikova T., Lombard V., Lopez R., Parkinson H., Redaschi N., Sterk P., Stoehr P., Tuli M.A., The EMBL Nucleotide Sequence Database, Nucleic Acids Res., Vol.29, No.1, 2001, pp.17-21.
[16] http://www.ebi.ac.uk/embl/
[17] http://www.isb-sib.ch/
[18] Ashburner M., Ball C.A., Blake J.A., Botstein D., Butler H., Cherry J.M., Davis A.P., Dolinski K., Dwight S.S., Eppig J.T., Harris M.A., Hill D.P., Issel-Tarver L., Kasarskis A., Lewis S., Matese J.C., Richardson J.E., Ringwald M., Rubin G.M., Sherlock G., Gene Ontology: tool for the unification of biology, Nature Genetics, Vol.25, 2000, pp.25-29.
[19] http://www.geneontology.org/
[20] http://www.ebi.ac.uk/proteome/comparisons.html
[21] http://www.ncbi.nlm.nih.gov:80/LocusLink/
[22] http://www.ebi.ac.uk/proteome/HUMAN/
[23] Moeller S., Vilo J., Croning M.D.R., Prediction of the coupling specificity of GPCRs to their G proteins, Bioinformatics, 2001, in press.
[24] Boyer T.G., Chen Phang-Lang, Lee Wen-Hwa, Genome mining for human cancer genes: wherefore art thou?, TRENDS in Molecular Medicine, Vol.7, No.5, 2001, pp.187-189.
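
Illustrative sketch. The search modes described in Section 4 (exact and approximate gene-name search, and keyword search with a fallback to text search over keywords and descriptions) can be sketched as below. This is a minimal in-memory illustration in Python, not the actual system, which runs Java Servlets against a relational database. The names GeneRecord, RECORDS, search_gene and search_keyword are hypothetical. The gene names are those of the 'reg' example in the text, plus BRCA1 as a non-matching entry; all descriptions and keywords are made-up placeholders, not curated annotation. The sketch also falls back to the text search automatically, whereas the system described here offers a link to it.

```python
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """Hypothetical stand-in for one gene-to-protein mapping row."""
    gene: str
    description: str = ""                        # plays the role of the DE line
    keywords: set = field(default_factory=set)   # plays the role of the KW line


# Gene names follow the 'reg' example in the text; descriptions and
# keywords are placeholders invented for this sketch.
RECORDS = [
    GeneRecord("AREG", "placeholder growth factor", {"Growth factor"}),
    GeneRecord("CREG", "placeholder repressor", {"Repressor"}),
    GeneRecord("EREG", "placeholder growth factor", {"Growth factor"}),
    GeneRecord("REG1A", "placeholder lectin", {"Lectin"}),
    GeneRecord("REG1B", "placeholder lectin", {"Lectin"}),
    GeneRecord("BRCA1", "placeholder description", {"Placeholder"}),
]


def search_gene(name, records, approximate=False):
    """Return sorted gene names matching `name`.

    Exact mode compares whole names; approximate mode keeps every
    gene whose name contains the query string (case-insensitive).
    """
    query = name.upper()
    if approximate:
        return sorted(r.gene for r in records if query in r.gene.upper())
    return sorted(r.gene for r in records if r.gene.upper() == query)


def search_keyword(word, records):
    """Try an exact keyword match first, then a general text search.

    Returns a (mode, gene_names) pair, where mode is "exact" or "text".
    """
    query = word.lower()
    exact = sorted(
        r.gene for r in records
        if query in {k.lower() for k in r.keywords}
    )
    if exact:
        return "exact", exact
    # Fallback: look for the word inside keywords and descriptions.
    text = sorted(
        r.gene for r in records
        if query in r.description.lower()
        or any(query in k.lower() for k in r.keywords)
    )
    return "text", text
```

With these placeholder records, search_gene("reg", RECORDS, approximate=True) returns the five gene names listed in the text, while search_keyword("growth", RECORDS) finds no exact keyword and falls back to the text search.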