Application of InterPro for the functional classification of the proteins of fish origin in SWISS-PROT and TrEMBL

MARGARET BISWAS*, ALEX KANAPIN and ROLF APWEILER

EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

*Corresponding author (Fax, +44-1223 494 468; Email, [email protected]).

InterPro (http://www.ebi.ac.uk/interpro/) is an integrated documentation resource for protein families, domains and sites, developed initially as a means of rationalizing the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. It is a useful resource that aids the functional classification of proteins. Almost 90% of the actinopterygii protein sequences from SWISS-PROT and TrEMBL can be classified using InterPro. Over 30% of the actinopterygii protein sequences currently in SWISS-PROT and TrEMBL are of mitochondrial origin, the majority of which belong to the cytochrome b/b6 family. InterPro also gives insights into the domain composition of the classified proteins and has applications in the functional classification of newly determined sequences lacking biochemical characterization, and in comparative genome analysis. A comparison of the actinopterygii protein sequences against the sequences of other eukaryotes confirms the high representation of eukaryotic protein kinase in the organisms studied. The comparisons also show that, based on InterPro families, the trans-species evolution of MHC class I and II molecules in mammals and teleost fish can be recognized.

Keywords. Actinopterygii (ray-finned fish); fish proteins; functional classification; InterPro; SWISS-PROT; TrEMBL

Abbreviations used: GPCRs, G-protein-coupled receptors; PKD, polycystic kidney disease; MHC, major histocompatibility complex.

J. Biosci. | Vol. 26 | No. 2 | June 2001 | 277–284 | © Indian Academy of Sciences

1. Introduction

The current SWISS-PROT and TrEMBL protein sequence databases (October 2000) have 462,453 entries: 88,753 in SWISS-PROT and 373,700 in TrEMBL. SWISS-PROT contains high quality annotation, is non-redundant and is cross-referenced to many other databases. TrEMBL is a computer-annotated supplement to SWISS-PROT that contains translations of all coding sequences present in the EMBL Nucleotide Sequence Database that are not yet integrated into SWISS-PROT (Bairoch and Apweiler 2000). There are 23,984 entries for protein sequences from non-mammalian vertebrates. Of these, the sequences of fish origin make up just over 40%. By far the largest number of fish sequences are for the actinopterygii (ray-finned fishes), with all other groups being much less well represented (figure 1). Two of the organisms at the focus of the genome projects, Fugu rubripes (pufferfish) and Danio rerio (zebrafish), belong to this group. Currently, sequences from the pufferfish make up about 3% and those from the zebrafish account for more than 12% of the actinopterygii sequences in SWISS-PROT and TrEMBL.

In comparison with the genomes of other vertebrate species, the Fugu rubripes genome comprises less repetitive DNA sequence, smaller intergenic regions and smaller introns (Brenner et al 1993). As a result, the genome of Fugu rubripes serves as a model of a minimalist vertebrate genome and is currently being employed in several comparative sequence studies. The zebrafish (D. rerio) too offers special advantages for genetic and developmental analysis at single cell resolution. The zebrafish, nearly unique among commonly studied vertebrates, contains large numbers of identifiable cells, some of which have already been characterized in terms of their detailed anatomical, physiological, developmental and genetic properties (Lekven et al 2000). Clearly, a large number of sequences from these two organisms will begin to find their way into the databases. It is, therefore, an interesting time to take stock of what is currently available and to attempt to classify the information that is available in the sequence databases. The application of InterPro for such an analysis is demonstrated in this paper. A comparative analysis of the three complete eukaryotic proteomes was the first application of InterPro (Rubin et al 2000). The InterPro analysis plus manual inspection of the data enabled the successful classification of over 50% of the proteins of the proteomes of Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae.

Figure 1. Distribution of fish sequences in SWISS-PROT and TrEMBL.

2. Source databases and methods

2.1 Database of actinopterygii sequences

A subset of the SWISS-PROT and TrEMBL databases containing only the sequences of actinopterygii has been assembled. This subset was selected using SRS (Etzold et al 1996) to query the two databases for these sequences. The list describing the subset membership was stored in Oracle and has been used in this analysis.

2.2 InterPro

InterPro (http://www.ebi.ac.uk/interpro/) (Apweiler et al 2001) is an integrated documentation resource of protein families, domains and sites that has been developed as a means of rationalising the complementary efforts of the PROSITE (Hofmann et al 1999), PRINTS (Attwood et al 2000), Pfam (Bateman et al 2000) and ProDom (Corpet et al 2000) database projects. These four databases are currently the member databases of InterPro.

InterPro is implemented as a relational database in Oracle and users have direct access via Java servlets. The InterPro database is distributed as XML-formatted flat files and as exports of the relational database. The InterPro database provides an integrated layer on top of the most commonly used signature databases to provide a user-friendly interface for text-based searches and sequence scans.

InterPro contains manually curated documentation, combined with diagnostic signatures from different databases to create a unique, non-redundant characterization of a given protein family, domain or functional site. A sample InterPro entry is shown in figure 2.

Figure 2. Sample InterPro entry and graphical view of cell division protein kinase 2 (EC 2.7.1.-) from Carassius auratus (goldfish).

An InterPro-based statistical analysis of the actinopterygii sequences has been carried out and includes information under the following categories:

· General statistics: all InterPro entries with matches to the actinopterygii subset. The total number of matches of each of the signatures and the number of proteins matched for each InterPro entry are available.
· InterPro entries with the highest number of actinopterygii protein matches.
· Actinopterygii proteins with the highest occurrence of a given domain.
· Actinopterygii proteins with the highest occurrence of different signatures.

Information about protein matches for member database signatures is stored in the InterPro Oracle tables. These data are extracted from the Oracle database by a set of Perl scripts, which generate static HTML pages filed according to the type of analysis. Match information includes the protein sequence accession number, the accession number of the method (PROSITE, PRINTS, Pfam or ProDom), the position of the signature on the protein sequence and the status of the match [true (T), false positive (F), false negative (N) or unknown (?)].

The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. The InterPro flatfile may be retrieved from the EBI anonymous-ftp server ftp://ftp.ebi.ac.uk/pub/databases/interpro.

3. Results and discussion

Almost 90% of the actinopterygii protein sequences can be classified using InterPro (Release 2.0, October 2000), with 757 of the 3204 InterPro entries having matches to the actinopterygii protein sequences.

3.1 InterPro entries with the highest number of matches

The top twenty InterPro entries with the highest number of matches to proteins from the subset of actinopterygii protein sequences are shown in figure 3. More than 30% of the actinopterygii protein sequences with InterPro matches are of mitochondrial origin, reflecting the interest in using these sequences, in particular cytochrome b, in evolutionary analysis (see, for example, Wang et al 2000). Over 20% of the actinopterygii protein sequences belong to the cytochrome b/b6 family (IPR000179). Cytochrome b is the central redox catalytic subunit of the quinol:cytochrome c or plastocyanin oxidoreductases. It is a mitochondrial protein for which the evolution and structure have been widely studied. Other mitochondrial enzymes that have a high number of matches to InterPro families

Figure 3. Top twenty InterPro entries with the highest number of matches to proteins from the subset of actinopterygii protein sequences.

3.3 Actinopterygii proteins with the highest occurrence of different signatures

3.3a Multi-domain proteins: InterPro can give some insights into the domain composition of the classified proteins. Some multi-domain proteins can be especially complex. For example, the protein from the F. rubripes polycystic kidney disease 1 (PKD1) gene (O42181) is an integral membrane glycoprotein with multiple evolutionarily conserved domains and with hits to 7 different InterPro