Application of InterPro for the functional classification of the proteins of fish origin in SWISS-PROT and TrEMBL

MARGARET BISWAS*, ALEX KANAPIN and ROLF APWEILER

EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
*Corresponding author (Fax, +44-1223 494 468; Email, [email protected]).

InterPro (http://www.ebi.ac.uk/interpro/) is an integrated documentation resource for protein families, domains and sites, developed initially as a means of rationalizing the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. It is a useful resource that aids the functional classification of proteins. Almost 90% of the actinopterygii protein sequences from SWISS-PROT and TrEMBL can be classified using InterPro. Over 30% of the actinopterygii protein sequences currently in SWISS-PROT and TrEMBL are of mitochondrial origin, the majority of which belong to the cytochrome b/b6 family. InterPro also gives insights into the domain composition of the classified proteins and has applications in the functional classification of newly determined sequences lacking biochemical characterization, and in comparative genome analysis. A comparison of the actinopterygii protein sequences against the sequences of other eukaryotes confirms the high representation of eukaryotic protein kinases in the organisms studied. The comparisons also show that, based on InterPro families, the trans-species evolution of MHC class I and II molecules in mammals and teleost fish can be recognized.

Keywords. Actinopterygii (ray-finned fish); fish proteins; functional classification; InterPro; SWISS-PROT; TrEMBL

Abbreviations used: GPCRs, G-protein-coupled receptors; MHC, major histocompatibility complex; PKD, polycystic kidney disease.

J. Biosci. | Vol. 26 | No. 2 | June 2001 | 277–284 | © Indian Academy of Sciences

1. Introduction

The current SWISS-PROT and TrEMBL protein sequence databases (October 2000) have 462,453 entries: 88,753 in SWISS-PROT and 373,700 in TrEMBL. SWISS-PROT contains high quality annotation, is non-redundant and is cross-referenced to many other databases. TrEMBL is a computer-annotated supplement to SWISS-PROT that contains translations of all coding sequences present in the EMBL Nucleotide Sequence Database that are not yet integrated into SWISS-PROT (Bairoch and Apweiler 2000). There are 23,984 entries for protein sequences from non-mammalian vertebrates. Of these, the sequences of fish origin make up just over 40%. By far the largest number of fish sequences are from the actinopterygii (ray-finned fishes), with all others being much less well represented (figure 1). Two of the organisms at the focus of the genome projects, Fugu rubripes (pufferfish) and Danio rerio (zebrafish), belong to this group. Currently, sequences from the pufferfish make up about 3% and those from the zebrafish account for more than 12% of the actinopterygii sequences in SWISS-PROT and TrEMBL.

In comparison with the genomes of other vertebrate species, the Fugu rubripes genome comprises less repetitive DNA sequence, smaller intergenic regions and smaller introns (Brenner et al 1993). Consequently, the genome of Fugu rubripes serves as a model of a minimalist vertebrate genome and is currently being employed in several comparative sequence studies. The zebrafish (D. rerio) too offers special advantages for genetic and developmental analysis at single cell resolution. The zebrafish, nearly unique among commonly studied vertebrates, contains large numbers of identifiable cells, some of which have already been characterized in terms of their detailed anatomical, physiological, developmental and genetic properties (Lekven et al 2000). Clearly, a large number of sequences from these two
organisms will begin to find their way into the databases. It is, therefore, an interesting time to take stock of what is currently available and to attempt to classify the information in the sequence databases. The application of InterPro to such an analysis is demonstrated in this paper. The first application of InterPro was a comparative analysis of the three complete eukaryotic proteomes (Rubin et al 2000). The InterPro analysis, together with manual inspection of the data, enabled the successful classification of over 50% of the proteins in the proteomes of Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae.

Figure 1. Distribution of fish sequences in SWISS-PROT and TrEMBL.

2. Source databases and methods

2.1 Database of actinopterygii sequences

A subset of the SWISS-PROT and TrEMBL databases containing only the sequences of actinopterygii was assembled. This subset was selected by using SRS (Etzold et al 1996) to query the two databases for these sequences. The list describing the subset membership was stored in Oracle and used in this analysis.

2.2 InterPro

InterPro (http://www.ebi.ac.uk/interpro/) (Apweiler et al 2001) is an integrated documentation resource of protein families, domains and sites that has been developed as a means of rationalising the complementary efforts of the PROSITE (Hofmann et al 1999), PRINTS (Attwood et al 2000), Pfam (Bateman et al 2000) and ProDom (Corpet et al 2000) database projects. These four databases are currently the member databases of InterPro.

InterPro is implemented as a relational database in Oracle and users have direct access via Java servlets. The InterPro database is distributed as XML-formatted flat files and as exports of the relational database. It provides an integrated layer on top of the most commonly used signature databases, offering a user-friendly interface for text-based searches and sequence scans. InterPro contains manually curated documentation, combined with diagnostic signatures from different databases, to create a unique, non-redundant characterization of a given protein family, domain or functional site. A sample InterPro entry is shown in figure 2.

Figure 2. Sample InterPro entry and graphical view of cell division protein kinase 2 (EC 2.7.1.-) from Carassius auratus (goldfish).

An InterPro-based statistical analysis of the actinopterygii sequences has been carried out and includes information under the following categories:

· General statistics – all InterPro entries with matches to the actinopterygii subset. The total number of matches of each of the signatures and the number of proteins matched for each InterPro entry are available.
· InterPro entries with the highest number of actinopterygii protein matches.
· Actinopterygii proteins with the highest occurrence of a given domain.
· Actinopterygii proteins with the highest occurrence of different signatures.

Information about protein matches for member database signatures is stored in the InterPro Oracle tables. These data are compiled by a set of Perl scripts that generate static HTML pages from the data extracted from the Oracle database, filed according to the type of analysis. Match information includes the protein sequence accession number, the accession number of the method (PROSITE, PRINTS, Pfam or ProDom), the position of the signature on the protein sequence and the status of the match [true (T), false positive (F), false negative (N) or unknown (?)].

The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. The InterPro flat file may be retrieved from the EBI anonymous FTP server at ftp://ftp.ebi.ac.uk/pub/databases/interpro.

3. Results and discussion

Almost 90% of the actinopterygii protein sequences can be classified using InterPro (Release 2.0, October 2000), with 757 of the 3204 InterPro entries having matches to the actinopterygii protein sequences.

3.1 InterPro entries with the highest number of matches

Figure 3. Top twenty InterPro entries with the highest number of matches to proteins from the subset of actinopterygii protein sequences.

The top twenty InterPro entries with the highest number of matches to proteins from the subset of actinopterygii protein sequences are shown in figure 3. More than 30%
of the actinopterygii protein sequences with InterPro matches are of mitochondrial origin, reflecting the interest in using these sequences, in particular cytochrome b, in evolutionary analysis (see, for example, Wang et al 2000). Over 20% of the actinopterygii protein sequences belong to the cytochrome b/b6 family (IPR000179). Cytochrome b is the central redox catalytic subunit of the quinol:cytochrome c or plastocyanin oxidoreductases. It is a mitochondrial protein whose evolution and structure have been widely studied. Other mitochondrial enzymes that have a high number of matches to InterPro families are various chains of the NADH:ubiquinone oxidoreductase (complex I), subunit I of cytochrome c oxidase, subunit A of ATP synthase and .

InterPro domains that are highly represented in this subset of actinopterygii proteins are the immunoglobulin and various major histocompatibility complex related domains. The homeobox domain is present in just over 4% of the proteins that belong to the subset of actinopterygii protein sequences. Proteins containing homeobox domains are likely to play an important role in development and most are known to be sequence-specific DNA-binding transcription factors. A protein family that is highly represented among the subset of actinopterygii protein sequences is the rhodopsin-like G-protein-coupled receptors (GPCRs). Of the non-mitochondrial enzymes in the top twenty, the eukaryotic protein kinases have a high number of matches. Protein kinases belong to a very extensive family of proteins that share a conserved catalytic core common to both serine/threonine and tyrosine protein kinases. It has been noted earlier that these kinases account for around 2% of the proteomes of the three completely sequenced eukaryotes, D. melanogaster, C. elegans and S. cerevisiae (Rubin et al 2000).

3.2 Actinopterygii proteins with the highest occurrence of a given domain

The same domain may be repeated a number of times across a protein sequence. In the subset of actinopterygii proteins the EGF-like domain defined by the Pfam signature PF00008 (IPR000561) is repeated 36 times in the Brachydanio rerio (zebrafish) homologue of the Drosophila neurogenic gene Notch (P46530), and 35 and 16 times in two other proteins from Fugu rubripes (O13149) and Brachydanio rerio (O42374) respectively. The type III fibronectin domain (IPR001777) is repeated 17 times in zebrafish fibronectin (O93406) and the Sushi domain (IPR000436) is repeated 16 times in a complement-regulatory plasma protein (Q91275) from barred sand bass (Paralabrax nebulifer). Sushi domains, also called complement control protein modules or short consensus repeats, exist in a wide variety of complement and adhesion proteins.

3.3 Actinopterygii proteins with the highest occurrence of different signatures

3.3a Multi-domain proteins: InterPro can give some insights into the domain composition of the classified proteins. Some multi-domain proteins can be especially complex. For example, the protein from the F. rubripes polycystic kidney disease 1 (PKD1) gene (O42181) is an integral membrane glycoprotein with multiple evolutionarily conserved domains and with hits to 7 different InterPro entries (table 1).

Table 1. List of 7 different InterPro entries that have hits to the protein from the F. rubripes PKD1 gene (O42181).

InterPro entry   Description
IPR000203        Neutral zinc metallopeptidases, zinc-binding region
IPR000483        Leucine rich repeat C-terminal domain
IPR000601        PKD domain
IPR001024        , region 2
IPR001304        C-type lectin domain
IPR002859        REJ (receptor for egg jelly) domain
IPR002889        WSC domain

3.3b Protein family/subfamily relationships: InterPro is constructed so that protein family/subfamily relationships can be described. For example, the acetylcholinesterase catalytic subunit of Electrophorus electricus (electric eel) (O42275) has hits to 3 different InterPro entries and belongs to the larger family that has the esterase/lipase/thioesterase active site (IPR000379). A graphical representation of these signatures is shown in figure 4 and a description of the family/subfamily relationships for this family is given in figure 5.

3.4 Comparative analysis

It is interesting to compare the subset of actinopterygii protein sequences against the sequences of other eukaryotes. Currently the complete proteome sets (http://www.ebi.ac.uk/proteome/) for C. elegans, D. melanogaster and S. cerevisiae and an incomplete, redundant set of human sequences are available. Figure 6 shows the comparison of the top ten InterPro families of the actinopterygii protein sequences against the sequences of the other four eukaryotes. As mentioned earlier, the proteins of mitochondrial origin (from IPR000179, IPR001750, IPR000883 and IPR000568) are over-represented in our redundant subset of actinopterygii protein sequences and are not considered in this comparison. Of the six remaining InterPro families the eukaryotic protein kinase is
highly represented in all the eukaryotic sequences studied. The immunoglobulin/major histocompatibility complex domain (IPR003006) and the homeobox domain (IPR001356) are the only other InterPro entries to be represented in all five eukaryotes shown in the figure. The rhodopsin-like GPCR superfamily occurs in all except the proteome of S. cerevisiae. The class I and class II major histocompatibility complex (MHC) proteins (IPR001039 and IPR000353) occur only in the subset of actinopterygii protein sequences and in the subset of human sequences. These two families of proteins are absent from the proteomes of C. elegans, D. melanogaster and S. cerevisiae. Trans-species evolution has been reported for MHC class I and II molecules in mammals and teleost fish (Nonaka et al 2000).

Figure 4. Graphical representation of the InterPro hits to the acetylcholinesterase catalytic subunit of E. electricus (electric eel) (O42275).

Figure 5. An example of a true protein family/subfamily relationship in InterPro.

Figure 6. Comparison of the top ten InterPro families of the actinopterygii protein sequences against the sequences of four other eukaryotes.

3.5 Classification

Each InterPro entry has been assigned a functional classification in the form of a three-letter code. This basic classification has been used as an indication of the types of proteins currently represented in the subset of actinopterygii protein sequences. As already mentioned, more than 30% of the actinopterygii protein sequences having InterPro matches are of mitochondrial origin and most of these are involved, as would be expected, in electron
transfer. More than 10% of the proteins in the subset play a role in immunology while close to 7% are involved in DNA/RNA binding or regulation. More than 4% of the proteins in this subset of actinopterygii protein sequences are GPCRs, another 4% are involved in transport and , around 3% in signal transduction and about 2% are hormones.

3.6 Structural information

Almost 23% of the actinopterygii proteins have HSSP links (Holm and Sander 1999) whereas only 25 proteins (0.3%) have links to 82 PDB structures (Berman et al 2000). One of the most studied proteins has been the type III antifreeze protein from Macrozoarces americanus (North-Atlantic ocean pout) (P19614), for which there are 29 different PDB entries. Interest in this protein arises from the fact that some cold water marine fishes avoid cellular damage from freezing by expressing antifreeze proteins that bind to ice and inhibit its growth. The mechanism of ice binding remains unclear because of the difficulty in modelling the protein-ice interaction. Recently, Graether and coworkers (Graether et al 1999) determined the X-ray crystallographic structures of 10 type III antifreeze protein mutants and combined that information with 7 previously determined structures to investigate specific protein-ice interactions such as hydrogen bonds.

4. Conclusion

InterPro provides a perspective on domain structure and function, gene duplication and protein families in sets of proteins. Because it provides a list of protein matches for each InterPro entry, the functions of many newly predicted or previously uncharacterized proteins can be deciphered with confidence. Information about the functions of proteins is another step towards understanding the overall mechanisms operating and the biological processes that they perform. One aim would be to match functional information to the protein structure, and InterPro is currently being developed to facilitate the exploration of these connections. The tools to query and compare protein subsets or entire proteomes of organisms by domain and/or protein family distributions and combinations provide the means to identify, for example, systematically conserved proteins that are likely to have orthologs across species and be involved in a shared core biology, conserved families that are missing in a given genome, or proteins unique to a particular species that may well define the species.

Acknowledgements

The authors wish to thank Sandrine Pilbout for her help with , Virginie Mittard for providing the structure-related information and Nicola Mulder who made available the data on the functional classification of InterPro entries.

References

Apweiler R, Attwood T K, Bairoch A, Bateman A, Birney E, Biswas M, Bucher P, Cerutti L, Corpet F, Croning M D R, Durbin R, Falquet L, Fleischmann W, Gouzy J, Hermjakob H, Hulo N, Jonassen I, Kahn D, Kanapin A, Karavidopoulou Y, Lopez R, Marx B, Mulder N J, Oinn T M, Pagni M, Servant F, Sigrist C J A and Zdobnov E M 2001 The InterPro database, an integrated documentation resource for protein families, domains and functional sites; Nucleic Acids Res. 29 37–40
Attwood T K, Croning M D R, Flower D R, Lewis A P, Mabey J E, Scordis P, Selley J N and Wright W 2000 PRINTS-S: the database formerly known as PRINTS; Nucleic Acids Res. 28 225–227
Bairoch A and Apweiler R 2000 The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000; Nucleic Acids Res. 28 45–48
Bateman A, Birney E, Durbin R, Eddy S R, Howe K L and Sonnhammer E L L 2000 The Pfam Protein Families Database; Nucleic Acids Res. 28 263–266
Berman H M, Westbrook J, Feng Z, Gilliland G, Bhat T N, Weissig H, Shindyalov I N and Bourne P E 2000 The Protein Data Bank; Nucleic Acids Res. 28 235–242
Brenner S, Elgar G, Sandford R, Macrae A, Venkatesh B and Aparicio S 1993 Characterisation of the pufferfish (Fugu) genome as a compact model vertebrate genome; Nature (London) 366 265–268
Corpet F, Servant F, Gouzy J and Kahn D 2000 ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons; Nucleic Acids Res. 28 267–269
Etzold T, Ulyanov A and Argos P 1996 SRS: information retrieval system for molecular biology data banks; Methods Enzymol. 266 114–128
Graether S P, Deluca C I, Baardsnes J, Hill G A, Davies P L and Jia Z 1999 Quantitative and qualitative analysis of type III antifreeze protein structure and function; J. Biol. Chem. 274 11842–11847
Hofmann K, Bucher P, Falquet L and Bairoch A 1999 The PROSITE database, its status in 1999; Nucleic Acids Res. 27 215–219
Holm L and Sander C 1999 Protein folds and families: sequence and structure alignments; Nucleic Acids Res. 27 244–247
Lekven A C, Helde K A, Thorpe C J, Rooke R and Moon R T 2000 Reverse genetics in zebrafish; Physiol. Genomics 2 37–48
Nonaka M, Yamada-Namikawa C, Flajnik M F and Du Pasquier L 2000 Trans-species polymorphism of the major histocompatibility complex-encoded proteasome subunit LMP7 in an amphibian genus, Xenopus; Immunogenetics 51 186–192
Rubin G M, Yandell M D, Wortman J R, Gabor Miklos G L, Nelson C R, Hariharan I K, Fortini M E, Li P W, Apweiler R, Fleischmann W et al 2000 Comparative genomics of the eukaryotes; Science 287 2204–2215
Wang J P, Hsu K C and Chiang T Y 2000 Mitochondrial DNA phylogeography of Acrossocheilus paradoxus (Cyprinidae) in Taiwan; Mol. Ecol. 9 1483–1494

MS received 20 November 2000; accepted 2 February 2001

Corresponding editor: VIDYANAND NANJUNDIAH
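
Supplementary sketch. The tallies reported in sections 3.1 and 3.2 were produced with Perl scripts over the InterPro Oracle tables, which are not reproduced in the paper. As a rough illustration only, the Python sketch below shows how counts of this kind might be derived from match records carrying the fields listed in section 2.2 (protein accession, method accession, position and status). The record layout, the toy data, every function name and the decision to count only matches with status T are assumptions made for this sketch and are not taken from the paper.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class Match(NamedTuple):
    """One signature match (hypothetical layout modelled on section 2.2)."""
    protein: str  # protein sequence accession number
    entry: str    # InterPro entry that integrates the signature
    method: str   # accession number of the member-database signature
    start: int    # start position of the signature on the sequence
    end: int      # end position of the signature on the sequence
    status: str   # "T", "F", "N" or "?"


# Toy records invented for illustration; they are not database content.
MATCHES = [
    Match("PROT_A", "IPR_1", "PF_1", 10, 40, "T"),
    Match("PROT_A", "IPR_1", "PF_1", 50, 80, "T"),
    Match("PROT_A", "IPR_2", "PS_9", 100, 120, "T"),
    Match("PROT_B", "IPR_1", "PF_1", 5, 35, "T"),
    Match("PROT_B", "IPR_2", "PS_9", 60, 80, "F"),
    Match("PROT_C", "IPR_3", "PR_4", 1, 90, "?"),
]
ALL_PROTEINS = {"PROT_A", "PROT_B", "PROT_C", "PROT_D"}


def true_matches(matches):
    """Keep only matches flagged as true (an assumption of this sketch)."""
    return [m for m in matches if m.status == "T"]


def proteins_per_entry(matches):
    """Number of distinct proteins matched by each InterPro entry."""
    seen = defaultdict(set)
    for m in true_matches(matches):
        seen[m.entry].add(m.protein)
    return Counter({entry: len(proteins) for entry, proteins in seen.items()})


def repeats_per_protein(matches, method):
    """How many times one signature occurs in each protein."""
    return Counter(m.protein for m in true_matches(matches) if m.method == method)


def classified_fraction(matches, all_proteins):
    """Fraction of the subset with at least one true match."""
    classified = {m.protein for m in true_matches(matches)}
    return len(classified & all_proteins) / len(all_proteins)


if __name__ == "__main__":
    print(proteins_per_entry(MATCHES).most_common(2))
    print(repeats_per_protein(MATCHES, "PF_1").most_common(1))
    print(classified_fraction(MATCHES, ALL_PROTEINS))
```

Counting distinct proteins per entry, rather than raw matches, keeps a protein with many repeats of one domain from inflating the ranking of that entry; the repeat counts are reported separately, as in section 3.2.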
