Exploring the Human Genome Data in SWISS-PROT and TrEMBL

SERGIO CONTRINO, YOULA KARAVIDOPOULOU
European Bioinformatics Institute
Genome Campus, Hinxton, Cambridge CB10 1SD
UNITED KINGDOM
[email protected] http://www.ebi.ac.uk/~contrino

Abstract: - It is very important to be able to organize and allow easy access to the vast amount of information that is now available to the researcher. We developed a gene search facility that makes use of the quality and relational implementation of the SWISS-PROT and TrEMBL databases. Our system offers a useful and immediate entry point for human genome data, with a specific interest in gene-disease links. This 'portal' approach allows an easy integration of a remarkable number of sources of relevant information and we think it will be valuable to the researcher.

Key-Words: - human, proteome, chromosome, disease, gene, database

1 Data integration models
The Internet age has brought an unprecedented wealth of information to the researcher's fingertips. The problem of organizing this enormous potential and being able to take advantage of it is certainly one of the more practical and pressing ones of our times. In biology the deluge of information from genomics projects accentuates this fact.

We can roughly identify two approaches to this problem of data management: a 'deep' data integration model and a 'loose' one. The 'deep' approach tries to achieve actual data integration, with many disparate sources utilized to build a comprehensive and uniform repository of information. This is then used to gain knowledge. It is the 'data warehouse' model. The 'loose' approach in a way surrenders to the variability, inconsistency and errors of data, data formats and supporting software, and merely tries to put relevant information as close as possible to a certain entry point. It is the 'portal' model.

In our gene search facility we are trying to give an easy access point to human genome data, with some possibilities of mining and a remarkable collection of related information. This portal approach has its strong basis in the relational representation of the SWISS-PROT and TrEMBL databases [1,2].

2 System components
Our search engine can be found in the context of the human proteomics initiative (HPI) [3,4]. It relies for its basic functions on a relational implementation of the SWISS-PROT and TrEMBL databases. This implementation is maintained at the European Bioinformatics Institute (EBI) [5]. We will refer to these databases collectively as 'SPTr'. A short documentation on the schema is publicly available [6]. The system also makes use of the SRS [7] installation at EBI and links to resources pooled in the Proteome Analysis databases [8,9]. Among them, InterPro [10,11] and CluSTr [12, 13] are also implemented in a relational format and maintained at EBI.

Direct access to the database is provided via Java Servlets [14] using a small pool of Java programs. The user interface is HTML based and is accessed through a web browser. The chromosome tables take advantage of HTML 4.0 features for a faster display. A brief description of these components is given below.

2.1 SWISS-PROT and TrEMBL databases
SWISS-PROT [1,2] is a non-redundant, manually curated protein sequence database, which has been successfully maintained and constantly enriched and improved since it was first established in 1986. However, the number of protein sequences awaiting incorporation into the database has dramatically increased in the last few years, due to advances in gene sequencing and whole genome mapping projects. As this newly produced information should be available to the public as quickly as possible but without diluting the quality standards of SWISS-PROT, a computer-annotated protein sequence database supplementing SWISS-PROT was created in November 1996. This computer-annotated sequence databank, TrEMBL [1,2], contains the translations of all coding sequences (CDS) present in the EMBL-Bank nucleotide sequence database [15, 16] which have not yet been integrated into SWISS-PROT.

The latest SWISS-PROT release (release 39.22 of 20-Jun-2001) contains 98739 sequence entries whereas there are 473505 entries in the current TrEMBL release (release 17.0 of June 2001). There are currently 7340 annotated human sequences in the two databases (7138 in SWISS-PROT and 202 in TrEMBL).

2.2 Human Proteomics Initiative
In addition to the above-mentioned 7340 annotated human entries in SPTr, there are many more known human genes awaiting high level annotation and integration into the database. Therefore, in order to enrich the biological knowledge of the human genome, the Swiss Institute of Bioinformatics (SIB) [17] and EBI initiated a project to annotate, describe and distribute to the scientific community all known human protein sequences as soon as possible. This is known as the 'Human Proteomics Initiative' (HPI) [3, 4].

2.3 Proteome Analysis
The Proteome Analysis Database [8, 9] was created in March 2000 by the SWISS-PROT and TrEMBL group at EBI and integrates information from a variety of sources to facilitate the functional classification of the proteins encoded in complete genomes (the proteome).

The database provides a perspective on domain structure and function, gene duplication and protein families in different genomes by offering statistical, structural, functional and comparative analysis of the predicted protein coding sequences for the completely sequenced organisms present in SPTr, spanning archaea, bacteria and eukaryotes. Complete proteome sets have been built for each of the organisms, providing reliable, well-annotated data as the basis for the analysis.

2.3.1 InterPro and CluSTr databases
The InterPro and CluSTr databases are the main resources used in Proteome Analysis. InterPro and CluSTr give a new perspective on families, domains and sites and cover 31% to 67% (InterPro statistics) of the proteins from each of the complete genomes.

InterPro [10, 11] is an integrated documentation resource of protein families, domains and functional sites that was initially developed as a means of rationalising the complementary efforts of the PROSITE, PRINTS, Pfam, ProDom and SMART database projects. The CluSTr (Clusters of SPTr proteins) database [12, 13] offers an automatic classification of SPTr proteins into groups of related proteins. Structural information is presented in the Proteome Analysis Database and includes primary, secondary and tertiary structure information for each proteome. Functional classification of each proteome using Gene Ontology (GO) [18, 19] is also available. A program has been constructed that carries out InterPro comparisons of any one proteome against one or more of the other proteomes in the database [20].

Analysis is carried out for all the completely sequenced organisms present in the SPTr database (currently 47 organisms; this will be 55 organisms in early August 2001 and many more are in the pipeline), and a preliminary proteome analysis is also produced for the incomplete human genome.

3 Human Chromosome tables
As part of the Human Proteome Analysis, mappings have been created between human entries in SPTr and the corresponding official HUGO gene symbol or NCBI LocusLink provisional symbol. Starting with data available from the LocusLink database [21], this was achieved by tracking protein identifiers. Up to now, 13811 gene mappings have been made, 8094 of which have an assigned official HUGO gene name. 12393 of the total gene mappings have a known chromosome location. This information is stored in a relational format and is available through the Proteome Analysis pages as gene sets for each of the human chromosomes [22]. In addition, there is a special set for mitochondrial genes and one for genes that are not yet mapped. The chromosome tables aim to provide a comprehensive reference of the human genome data in SPTr for which the gene name and chromosome location are known. Each gene set is an alphabetic listing of the genes with either a HUGO approved gene symbol or an NCBI LocusLink provisional symbol encoded on that chromosome, each together with its chromosomal position, information about the protein it encodes and useful links to other databases. Links are provided to the respective Human Gene Nomenclature Database (HGNC), The Genome Database (GDB), GeneCard, SPTr, OMIM, EMBL-Bank, Ensembl, InterPro and CluSTr entries.

There are 2 different global views of this data, one listing all the genes in a specific chromosome, another one listing only the ones that are annotated as linked to one or more diseases.

4 Gene search
A more agile interface to this information is given by the gene search facility. This allows accessing the system information using either a gene name or a keyword and it can be found at the same Internet location as the Human Chromosome tables [22].

Using the gene name, a synoptic gene page is produced. It contains, as in the chromosome tables, the location of the gene, a link to the protein that the gene encodes and information concerning this protein. These are accession number, entry name, description, diseases linked to the gene (if any), cross references to nucleotide sequences where the coding region of the protein can be found, associated entries in the SPTr database and cross references to OMIM and Ensembl.

4.1 An example
For example, let's imagine we are interested in GPCR (G-protein-coupled receptor) proteins. More than half of the current prescription drugs are targeted to GPCRs [23]. A first set of questions would be to know how many such proteins there are in the human genome, where they are located and which of those are linked to some sort of disease. All this can be easily accomplished by the keyword query with our system. We may then be interested in the uncharacterized potential GPCRs in the database. Those too are easily detected and can be further explored.

As pointed out by Boyer et al. [24] regarding the identification of potential oncogenes, the availability
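The three questions of the keyword query in Section 4.1 (how many genes match, where they are located, and which are linked to a disease) can be illustrated with a minimal Python sketch. This is not the system's implementation, which is described above as Java Servlets over a relational database. The record fields, the function names and the placeholder records are all assumptions made for this illustration, and the records are not real annotations.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class GeneRecord:
    """Hypothetical, simplified view of one row of a chromosome gene set."""
    symbol: str
    chromosome: str
    keywords: set = field(default_factory=set)
    diseases: list = field(default_factory=list)


def search_by_keyword(records, keyword):
    """Return the records annotated with the keyword (case-insensitive)."""
    wanted = keyword.lower()
    return [r for r in records if wanted in {k.lower() for k in r.keywords}]


def group_by_chromosome(records):
    """Map each chromosome to the sorted gene symbols located on it."""
    groups = defaultdict(list)
    for r in records:
        groups[r.chromosome].append(r.symbol)
    return {chrom: sorted(symbols) for chrom, symbols in groups.items()}


def disease_linked(records):
    """Keep only the records with at least one linked disease."""
    return [r for r in records if r.diseases]


# Made-up placeholder records (assumption): symbols, locations and
# disease names do not correspond to real database entries.
RECORDS = [
    GeneRecord("GENE_A", "1", {"G-protein coupled receptor"}, ["placeholder disease"]),
    GeneRecord("GENE_B", "1", {"G-protein coupled receptor"}),
    GeneRecord("GENE_C", "X", {"G-protein coupled receptor", "Hypothetical protein"}),
    GeneRecord("GENE_D", "7", {"Kinase"}),
]

hits = search_by_keyword(RECORDS, "g-protein coupled receptor")
print(len(hits))                                  # how many matching genes
print(group_by_chromosome(hits))                  # where they are located
print([r.symbol for r in disease_linked(hits)])   # which are disease-linked
```

In a relational setting the same three steps would more naturally be a filter, a grouping and a join against a disease table; the in-memory lists here only keep the sketch self-contained.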
