BMC Genomics BioMed Central

Research article Open Access Cross genome comparisons of serine in Arabidopsis and rice Lokesh P Tripathi and R Sowdhamini*

Address: National Centre for Biological Sciences, Tata Institute of Fundamental Research, GKVK Campus, Bellary Road, Bangalore 560 065, India Email: Lokesh P Tripathi - [email protected]; R Sowdhamini* - [email protected] * Corresponding author

Published: 09 August 2006 Received: 11 April 2006 Accepted: 09 August 2006 BMC Genomics 2006, 7:200 doi:10.1186/1471-2164-7-200 This article is available from: http://www.biomedcentral.com/1471-2164/7/200 © 2006 Tripathi and Sowdhamini; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract Background: Serine proteases are one of the largest groups of proteolytic enzymes found across all kingdoms of life and are associated with several essential physiological pathways. The availability of Arabidopsis thaliana and rice (Oryza sativa) genome sequences has permitted the identification and comparison of the repertoire of serine -like in the two species. Results: Despite the differences in genome sizes between Arabidopsis and rice, we identified a very similar number of -like proteins in the two plant species (206 and 222, respectively). Nearly 40% of the above sequences were identified as potential orthologues. Atypical members could be identified in the plant genomes for Deg, Clp, Lon, rhomboid proteases and species-specific members were observed for the highly populated subtilisin and serine carboxypeptidase families suggesting multiple lateral gene transfers. DegP proteases, prolyl oligopeptidases, Clp proteases and rhomboids share a significantly higher percentage orthology between the two genomes indicating substantial evolutionary divergence was set prior to speciation. Single domain architectures and paralogues for several putative subtilisins, serine carboxypeptidases and rhomboids suggest they may have been recruited for additional roles in secondary metabolism with spatial and temporal regulation. The analysis reveals some domain architectures unique to either or both of the plant species and some inactive proteases, like in rhomboids and Clp proteases, which could be involved in chaperone function. Conclusion: The systematic analysis of the serine protease-like proteins in the two plant species has provided some insight into the possible functional associations of previously uncharacterised serine protease-like proteins. Further investigation of these aspects may prove beneficial in our understanding of similar processes in commercially significant crop plant species.

Background olysis plays a major role in diverse cellular processes by The proper functioning of a cell is ensured by the precise regulating biochemical activities of proteins in concert regulation of levels that in turn are regulated by a with other cellular mechanisms such as post-translational balance between the rates of protein synthesis and degra- modification[1,2]. Serine proteases are one of the largest dation. Protein degradation is mediated by proteolysis, groups of proteolytic enzymes involved in numerous reg- which paves the way for recycling of amino acids into the ulatory processes. They catalyse the hydrolysis of specific cellular pool. In addition to degradation, regulated prote- peptide bonds in their substrates and this activity depends

Page 1 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

on a set of amino acids in the active site of the enzyme, This paper reports the first bioinformatics genome-wide one of which is always a serine. They include both survey of plant serine proteases and cross-comparisons of exopeptidases that act on the termini of polypeptide several serine protease families across monocots and chains and endopeptidases that act in the interior of dicots. Species-specific proteases and existence of evolu- polypeptide chains and belong to many different protein tionary changes like lateral gene transfer and gene shuf- families that are grouped into clans[3,4]. A clan is defined fling were observed at the catalytic domain. Further as a group of families, the members of which are likely to experimental characterization of Arabidopsis and rice ser- have a common ancestor. Currently there are over 50 ser- ine proteases that are likely to be involved in basic func- ine-protease families known as classified by MEROPS tions associated with various physiological processes in database[5], an information resource for peptidases and would be useful in expanding the current under- their inhibitors. standing of these processes in regulating plant growth and development. This would prove beneficial in understand- Serine proteases appear to be the largest class of proteases ing protease processes in plants and application of this in plants[2]. Consistent with their regulatory role, plant knowledge to important crop species. serine proteases have been shown to be involved in diverse processes regulating plant development and Results and discussion defense responses[2,6-8]. In addition, some of plant ser- Despite the differences in genome sizes between Arabidop- ine carboxypeptidases (Family S10) function as acyltrans- sis and rice (125 Mb and 389 Mb respectively) and ferases rather than hydrolases, suggesting diversification encoded number of genes (25,498 and 37,544 non-trans- of function to regulate secondary metabolism in plants [9- posable-element-related-protein-coding[14,15] and 11]. However, the function and regulation of plant pro- 55,890 in total as in TIGR[20] rice database), the two teases is poorly understood primarily due to lack of iden- plant species appear to encode a very similar number of tification of their physiological substrates. While there is serine protease-like proteins (206 and 222 putative num- extensive literature available on plant serine proteases, bers, respectively; Table 1; see additional files 1.2: Table particularly in Arabidopsis thaliana, most of these studies S1.pdf, Table S2.pdf). These include several putative ser- are either restricted to specific families [11-13] or specific ine protease homologues that are apparently catalytically physiological aspects (please see the references above). inactive, since they either lack the residues Arabidopsis thaliana and Oryza sativa constitute model sys- that are believed to be essential for catalysis or carry tems for Dicotyledons and Monocotyledons respectively, amino acid substitutions at positions corresponding to the major evolutionary lineages in angiosperms[14,15]. catalytically active sites. Such inactive-enzyme homo- The availability of the complete genomic sequence of logues are not restricted to serine proteases, but have been these two species renders it possible to carry out a compre- identified in diverse enzyme families, chiefly among pro- hensive analysis of serine-protease families to gain teins involved in signaling pathways and proteins that are insights into their function and to ascertain their evolu- apparently extracellular in function and are conserved tionary relationships. across metazoan species. They are believed to have acquired newer and more distinct functions, in the course The current analysis aims at the identification and analysis of evolution, thereby adding to the complexity of various of serine protease families encoded in the genomes of the regulatory networks in the cell[21]. We discuss below the two plant species. The domain organization of the puta- occurrence, phylogenetic patterns of serine protease tive serine proteases in the two plant genomes have been domains in the plant genomes by grouping according to analysed employing several sensitive sequence search MEROPS classification[5]. Only serine protease families methods [16-18] that have provided insights into differ- with non-zero occurrence of putative members in the two ent domain architectures. To understand the evolutionary genomes are discussed. relationships among the Arabidopsis and rice serine pro- tease-like proteins, phylogenetic analysis was performed DegP proteases (Family S1) based at their protease domains. Cross genome compari- DegP proteases are a group of ATP-independent serine son of the serine proteases across the two genomes has led proteases that belong to MEROPS[5] family S1 (trypsin to the identification of members likely to be conserved family). E.Coli DegP is a peripheral membrane protein across the two species. Predictions of the sub-cellular located in the periplasmic side of the plasma membrane. localizations of serine protease-like proteins to obtain fur- It is a heat-shock protein that combines both chaperone ther insight into their possible functional associations and proteolytic activities that switch in a temperature- were carried out using TargetP[19]. Comparison of plant dependent manner. The chaperone activity is predomi- serine proteases with other major eukaryotic lineages nant at low temperatures, while the proteolytic activity reveals selective expansion of serine protease families in takes over at higher temperatures[22]. Crystal structure of plants and identification of plant specific serine proteases. E.coli DegP reveals that the proteolytic site is inaccessible

Page 2 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Table 1: Serine proteases identified in the two plant genomes, by sensitive sequence search methods

CLAN FAMILY [37] Accession Arabidopsis Rice

PA(S) S1 (Chymotrypsin) PF00089 16 15 SB S8 (Subtilisin) PF00082 56 63 SC S9 (Prolyl oligopeptidase) PF00326 23 23 S10 (Carboxypeptidase) PF00450 54 66 S28 (Lys. Pro-x Carboxypeptidase) PF05577 7 5 SE S12 (D-Ala-D-Ala carboxypeptidase B) PF00144 1 1 SF S16 (Lon protease) PF05362 4 4 S26 (Signal peptidase I) PF00717 9 7 SK S14 (Clp Endopeptidase) PF00574 9 13 SM S41 (C-terminal processing peptidase) PF03572 3 3 S49 (Protease IV) PF01343 1 1 S54 (Rhomboid) PF01694 20 18 SP S59 (Nucleoporin autopeptidase) PF04096 3 3 at low temperatures and thus, enzyme exists in chaperone gous pairs of DegP-like proteins across the two genomes conformation (See additional file 3: Table S3.pdf)[22]. (See additional file 4: Table S4.pdf). Arabidopsis DegP1 PDZ domains are known to modulate the function and/or (At3g27925) was found to be localised and localization of their associated proteins and it has been was rapidly upregulated in response to elevated tempera- shown that PDZ domains are involved in substrate recog- tures[30]. We identified an At3g27925 orthologue in the nition and binding in certain proteases [23-26]. rice proteome (LOC_Os05g49380) nearly identical in size that shares a high sequence similarity with At3g27925 DegP-like proteins have been identified in and and may function in a similar manner. The significant but appear to be absent in , and are number of predicted DegP-like proteins in the two plant believed to have spread to eukaryotic lineages via hori- species suggests the presence of a conserved repertoire of zontal gene transfer events [27-29]. The exact functions of proteins that are likely to be upregulated in response to DegP-like proteins in plants remain unknown. A plant elevated temperatures in thylakoid lumen and mitochon- DegP homolog, DegP1, was first identified in Arabidopsis drial matrix in plants. Some putative gene products in and was found to be strongly associated with the lumenal both the genomes are likely to be proteolytically inactive side of the thylakoid membrane. It was found to be rap- due to either mutations or deletions in the active site resi- idly upregulated in response to elevated temperatures and dues (See additional file 5: Figure SF1.pdf). It remains DegP homologue associated membrane fraction was unclear whether these proteolytically inactive members found to have serine-type proteolytic activity, suggesting a retain their chaperone activity at low temperatures, they possible role in chloroplast heat response[30,31]. More may modulate the function of active gene products by reg- recently, a chloroplast DegP was shown to function in the ulating substrate association and complex formation. initial cleavage of D1 protein of PSII (Photosystem II) subsequent to photoinhibition[32]. Phylogenetic analysis of the protease domains of Arabi- dopsis and rice DegP-like proteins showed that a majority 16 DegP-like proteins could be identified in Arabidopsis of sequences cluster into three major clades, supported by proteome, higher than an earlier estimate of 14[6]. Like- high bootstrap values, with additional members less wise, 15 DegP-like proteins were identified in rice. A related to the three clusters (See additional file 5: Figure majority of these are predicted to localise to either chloro- SF1.pdf). Clade I, the largest of three clusters consists of plast or mitochondria (See additional file 2: Table 14 DegP-like proteins that are more varied in sequence S2.pdf). Of these, four gene products (At3g03380, and share between 35 and 95% pairwise amino acid At3g27925, At5g27660, At5g39830) from Arabidopsis and sequence identity, Clade II consists of six closely related four (LOC_Os02g48180, LOC_Os04g38640, sequences that share between 50 and 93% pairwise LOC_Os05g49380, LOC_0s11g14170) from rice are pre- sequence identity with each other. Clade II includes two dicted to contain PDZ-like domains C-terminal to the sequence pairs At3g27925:LOC_Os05g49380 and protease domain, while the rest appear to be a single pro- At5g39830:LOC_Os04g38640 that carry a PDZ domain tease domain containing proteins. The absence of PDZ- each C-terminal to the protease domain and share 93% like domains in majority of plant DegP-like proteins has and 87% pairwise sequence identity respectively. Clade been suggested to have as of yet unclear functional impli- III, the smallest of three clades, consists of five sequences cations for these proteins[6]. We identified seven ortholo- that share between 51 and 94% pairwise sequence iden-

Page 3 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

tity with each other. In addition, there are two pairs of strate recognition by peptidases and is believed to have PDZ domain containing proteins, been associated with ancestral subtilases[35,36]. In addi- At5g27660:LOC_Os11g14170 and tion, most subtilisin-like proteins identified in the two At3g03380:LOC_Os02g48180, that share 58 and 50% plant species, carry short insertions ranging from 15–45 pairwise sequence identity respectively. amino acid residues in length in the N-terminal region of their predicted protease domains. These insertions do not Subtilisins (Family S8) share any obvious sequence similarities with any other The subtilisin family is the second largest family of serine known module and are likely to be unique in their associ- proteases identified till date with over 200 known mem- ation with Arabidopsis and rice subtilisin-like proteins bers across lineages in eubacteria, archaebacteria, eukary- (data not shown). However, the presence of gene products otes and . Crystal structures reveal that subtilisins in both the proteomes with only the protease domain sug- utilize a highly conserved similar to the gests that this domain alone is sufficient for the activity of members of chymotrypsin and carboxypeptidase clans these enzymes (See additional files 1, 2: Table S1.pdf, but have a different order of Asp, His and Ser residues in Table S2.pdf). The Arabidopsis proteome contains two the sequence (D137, H168, S325) with no other struc- gene products At2g19170 and At4g30020 that appear to tural similarity[33] posing a case of convergent evolution. possess an additional module, classified as DUF1034 MEROPS database[5] subdivides the subtilisin family into (Pfam[37] accession: PF06280), C-terminal to the PA two subfamilies: S8A (subtilisin subfamily) consisting of domain, while a single similar gene product true subtilisins and S8B (kexin subfamily) consisting of LOC_Os06g48650 was identified in rice proteome, sug- proprotein-processing enzymes. An interesting feature of gesting a duplication of this gene in Arabidopsis subse- the subtilisin family is that some members appear to be quent to dicot-monocot divergence (Figure 2; see mosaic with little or no sequence similarity to any other additional files 1, 2: Table S1.pdf, Table S2.pdf). The lack known proteins[33] and with large N- and C-terminal of diverse domain architecture in plant subtilisin-like pro- extensions. teins, suggests that specific expansion of subtilisin gene family in the two plant species has been facilitated by A large number of subtilisin-like serine proteases have extensive gene duplication. This is further supported by been identified in various plant species, where they have the observation that only 16 orthologous pairs were iden- been implicated in diverse processes (See additional file 3: tified across the two species, suggesting that the majority Table S3.pdf) and all appear to correspond to S8A sub- of subtilisin-like proteins in the two species arose after family, while kexin-type proteins appear to be absent dicot-monocot divergence, possible due to gene duplica- from plants[2,12,34,35]. It has been reported that both tion (See additional file 4: Table S4.pdf). It is likely that Arabidopsis and rice are characterised by significantly spatial and temporal regulation of pat- larger number of subtilases in comparison to other organ- terns may play a major role in regulating the activity of isms such as human, Caenorahabditis and Drosophila, sug- subtilisin-like proteins in plants. The presence of a large gesting that expansion of the subtilase gene family in the number of subtilisin-like proteins in the two plant species two plant species is due to multiple duplication suggests possible functional redundancy, while on the events[35]. other hand it may be indicative of functional diversifica- tion. A significant number of plant subtilisins may have We have identified 56 subtilisin-like proteins in Arabidop- been recruited for functions in plant secondary metabo- sis proteome, similar to the figures reported else- lism, in a manner similar to some serine carboxypepti- where[12,35] and 63 in rice. A majority of these gene dase-like proteins. The subtilisin-like proteins identified products, as expected, are predicted to be secretory path- here in Arabidopsis and rice as well as those reported from way proteins, though a few are predicted to localise to other plant species show maximum similarity to and mitochondria (See additional files 1, 2: MEROPS[5] S8A subfamily of subtilisins. A few gene Table S1.pdf, Table S2.pdf). The analysis of the domain products are likely to be proteolytically inactive due to architectures of Arabidopsis and rice subtilisin-like proteins mutations in the catalytic residues and are most likely to reveals that most of these gene products consist of three be evolutionary intermediates (See additional file 6: Fig- domains: an N-terminal Subtilisin_N domain that is ure SF2.pdf). removed prior to activation of the enzyme, Peptidase_S8 domain and the protease associated (PA) domain, that Phylogenetic analysis of Arabidopsis and rice subtilisin- occurs as a C-terminal insertion in the Peptidase_S8 like proteins was carried out with the S8 protease domain domain of a majority of subtilisin-like proteins identified minus the insertions. The analysis reveals the presence of in these two plant species (See additional files 1, 2: Table five major gene clusters of which four consist of subtilisin- S1.pdf, Table S2.pdf). The PA domain is believed to be like proteins from both Arabidopsis and rice including involved in protein-protein interactions or mediate sub- orthologous pairs (Figure 1). Clade I consists of 13

Page 4 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

At5g44530 Os10g25450 Clade II At2g05920 At4g20430

Os09g26920 Os03g40830 Os03g55350 Clade III Os04g47150 Os02g44590 At1g30600 At2g19170 Os03g02750 Os10g38080

At5g67360 Os01g56320

Os04g10360 At4g30020

Os12g23980 Os07g48650 Os04g48420 Os03g31630

At5g51750 Os08g35090 Os04g47160

Os05g30580

Os06g48650 At3g14240

* At4g34980 At5g45650

Os09g30250 At5g45640 Os04g45960 Clade I At5g59810 Os03g13930

At1g62340 At2g04160 * Os02g53850 Os02g10520 Os01g52750 Os02g53970

Os06g40700

At1g01900 Os03g04950 At3g14067 Os02g53910 Os01g50680 * At1g04110 Os07g39020 Os02g53860 At5g67090 Os08g23740 Os04g35140 At1g20150 Os01g64850 At5g59090 At1g20160 Os05g36010 At4g00230 Os01g17160 At5g03620 Os01g64860 At5g59190 At5g19660 Os06g06800 At4g15040 At4g20850

Os02g44520 At5g59120

At5g11940 At1g32970 * At4g10530

Os06g41880 * * * At5g58840 At4g10520 Os09g36110

At2g39850 At5g58820

At1g32940 At5g58830 Os03g06290 Os11g15520 At4g26330

Os04g03800

At1g66220 At4g21323 At4g21326 At1g32960 * Os01g58290 Os04g02960 Os01g58280 Os04g03810 Os02g16940 At3g46850 Os04g03850 Os01g58270 * At1g66210 Os02g17080 At4g10550

Os01g58260 Os02g17090 At5g59130 Os02g17150 At1g32950 Os01g58240 Os04g03100

At5g59100

At3g46840 Os02g17000 At4g21650 Os02g17060 At4g10540 IVC Os04g03060 Clade V At4g10510

Os04g02980

Os04g03710 IVB At4g21630 IVA At4g21640 Clade IV

FigureUnrooted 1 N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) subtilisin domains Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) subtilisin domains. Subtili- sin-like protease domains were aligned using ClustalW [95] program and the alignments were exported to Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors and circles represent different evolutionary clades iden- tified in the analysis (see text for details). Clade I is represented in purple, Clade II is shaded orange, Clade III in green, Clade IV in brown and Clade V in yellow. For clarity, bootstrap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50–60% are represented by an asterix, circles represent bootstrap values from 60%–80% while bootstrap values >80% are represented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. A few species specific gene clusters were also identified in the analysis (see text for details).

Page 5 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Serine Protease Domain Architecture Abundance in Occurrence in family Arabidopsis other lineages andRice S1 12/16 ; 11/15 B, A, E S1 (Trypsin family) S1 PDZ 4/16 ; 4/15 B, E

S8 B, A, E S8 (Subtilisin 3/56 ; 3/63 family) S8 PA 1/56 ; 7/63 B, E

SUB N S8 3/56 ; - B, E

SUB N S8 PA 46/56 ; 48/63 B, E

SUB N S8 PA DUF1034 2/56 ; 1/63 B, E

SUB N S8 PA Arf C2 -;1/63 E (Os)

SUB N S8 PA zf-CCHC -;1/63 E (Os)

S8 zf-CCHC rve -;1/63 E (Os)

Extensin 2 S8 zf-CCHC rve -;1/63 E (Os)

S9 (Prolyl S9 18/23 ; 18/23 B, A, E oligopeptidase B, E family) S9 N S9 4/23 ; 3/23

PD40 S9 -;1/23 B, A, E(Os)

DPPIV N S9 1/23 ; 1/23 B, E

S10 (Serine Carboxypeptidase) S10 54/54 ; 64/66 B, A, E

Transposase 21 S10 - ; 1/66 E (Os)

S10 Retrotrans gag -;1/66 E (Os)

S12 (Serine Beta ABC1 S12 1/1 ; 1/1 E(At&Os) Lactamase)

S14 (Clp protease) S14 9/9 ; 13/13 B, A, E

LON AAA 4/4 ; 3/4 B, A, E S16 (Lon S16 Protease) AAA S16 -;1/4 B, A, E

S26 (Signal S26 9/9 ; 7/7 B, A, E Peptidase) S28 (Lysosomal Pro-X S28 7/7 ; 5/5 B, E Carboxypeptidase) S41 (C-terminal processing PDZ S41 3/3 ; 3/3 B, A, E peptidase) S49 (Protease IV) S49 S49 1/1 ; 1/1 B, E

S54 17/21 ; 16/18 B, A, E S54 (Rhomboid) S54 UBA 3/21 ; 2/18 E(At&Os)

S54 zf-RanBP 1/21 ; - E (At)

S59 (Nuceloporin S59 Autopeptidase) 3/3 ; 3/3 E DomainFigure 2Architectures identified in Arabidopsis and rice serine Protease-like proteins Domain Architectures identified in Arabidopsis and rice serine Protease-like proteins. At -Arabidopsis thaliana; Os – Rice (Oryza sativa); B- Bacteria; A- Archaea; E- Eukaryota; Sxx- Serine protease family Sxx domain, where Sxx refers to the serine protease family as per MEROPS [5] classification (see text for details). PDZ- PDZ domain (Pfam [37] accession: PF00595); PA- Protease associated domain (Pfam [37] accession: PF02225); SUB N- Subtilisin N-terminal region (Pfam [37] accession: PF005922); DUF1034- Domain of unknown function (Pfam [37] accession: PF06280); Arf- ADP-ribosylation factor family (Pfam [37] acces- sion: PF00025); C2- C2 domain (Pfam [37] accession: PF00168); zf-CCHC- Zinc knuckle (Pfam [37] accession: PF00098); rve- Integrase core domain (Pfam accession:PF00665); Extensin 2- Extensin-like region (Pfam [37] accession: PF04554); S9 N- Prolyl oligopeptidase, N-terminal beta-propeller domain (Pfam [37] accession: PF02897); PD40- WD40-like beta propeller repeat (Pfam [37] accession: PF07676); DPPIV N- Dipeptidyl peptidase (DPP IV) N-terminal region (Pfam [37] accession: PF00930); Transposase 21- Transposase family tnp2 (Pfam [37] accession: PF02992); Retrotrans gag- Retrotransposon gag protein (Pfam [37] accession: PF03732); ABC1- ABC1 family (Pfam [37] accession: PF03109); LON- ATP-dependent protease La (LON) domain (Pfam [37] accession: PF02190); AAA- ATPase family associated with various cellular activities (Pfam [37] accession: PF00004); UBA- UBA/TN-S (ubiquitin associated) domain (Pfam [37] accession: PF000627); zf-RanBP- Zinc finger in Ran bind- ing protein and others (Pfam [37] accession: PF00641).

Page 6 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

sequences that share between 44 and 76% pairwise POPs have been implicated in degradation of biologically sequence identity with each other. Clade II, the smallest of important peptides such as peptide hormones and neu- five clusters consists of six sequences from Arabidopsis ropeptides associated with learning and memory and and three from rice that share between 48 and 84% therefore have become significant targets for drug sequence identity with each other. Clade III consists of 31 design[38]. They have been identified in organisms from sequences from both the plant species that share between all forms of life, though members of this family display 40 and 80% sequence identity with each other. The mem- differences in mutation rates and appear to have under- bers of Clade III may be grouped into two sub-clusters. gone several changes in their localization over the course Clade IV appears to be the largest and most diverse cluster of evolution[39,40]. of Arabidopsis and rice subtilisin-like proteins. It includes 40 sequences that share between 35 and 92% pairwise We identified 23 genes encoding prolyl oligopeptidase- percentage identity with each other. On the basis of pair- like proteins in Arabidopsis and 23 in rice. A majority of wise percentage identities between the constituent these either lack a predicted signal sequence or are pre- sequences, Clade IV may be further subdivided into three dicted to localise to chloroplast and mitochondria. Only subclusters: Clade IVA, IVB and IVC. Clade IVA consists few gene products identified here are predicted to carry a exclusively of Arabidopsis sequences that share between β-propeller N-terminal to the protease domain (See addi- 47 and 91% sequence identity with each other and tional files 1, 2: Table S1.pdf, Table S2.pdf). The absence between 35 and 60% sequence identity with the rest of the of the N-terminal domain in a majority of plant POP-like Clade IV sequences. In contrast, Clade IVB and Clade IVC, proteins is intriguing since it is likely to have a functional consist exclusively of rice subtilisin-like proteins. Clade implication for these proteins. Following a careful inspec- IVB the smallest of three subclusters, consists of sequences tion of the amino acid residues in the vicinity of the active that share between 45 and 81% sequence identity with site serine residue, we were able to unambiguously assign each other, while sharing between 35 and 60% sequence a few POP-like proteins in the two plant species to differ- identity with the remaining members of Clade IV. Clade ent subfamilies of POP-like proteins as classified by IVC sequences share between 54 and 92% sequence iden- MEROPS[5]. These include At1g76140, tity with each other and between 44 and 61% sequence LOC_Os01g01830 (GGSNGGLL; S9A); At5g24260 identity with the rest of the constituents of Clade IV. Clade (GWSYGGY; S9B); At2g47390, LOC_Os03g19410, V is an exclusive cluster of Arabidopsis subtilisin-like pro- LOC_Os07g48970 (GGHSYGAFMT; S9D) (See addi- teins that are between 41 and 85% identical with each tional file 7: Figure SF3.pdf). We identified 14 ortholo- other and most likely represent an Arabidopsis specific gous pairs of POP-like proteins across the two plant subfamily of subtilisin-like proteins. Our observations species, suggesting high degree of conservation in the appear consistent with a recent report that identifies the function of POP-like proteins across the two species (See presence of four major clusters of orthologous groups and additional file 4: Table S4.pdf). It also suggests that major- a fifth cluster of Arabidopsis sequences within the phylo- ity of these enzymes were present prior to dicot-monocot genetic tree constructed with Arabidopsis and rice sutbili- divergence, consistent with the ancient evolutionary ori- sin-like proteins[35]. We identified a few gene sub- gin of these proteins. Phylogenetic analysis, however, clusters that consist exclusively of Arabidopsis or rice sub- reveals presence of several gene clusters of Arabidopsis and tilisin-like gene products indicating the existence of Arabi- rice POP-like proteins that share low pairwise sequence dopsis or rice subtilisin-like genes and by extension identity with each other and significant bootstrap values putative dicot- and monocot-specific subtilisin-like genes (Figure 3).The largest of these clusters Clade I consists of (Figure 1). 12 POP-like gene products from the two plant species that are closely related and the pairwise percent sequence iden- Prolyl oligopeptidases (Family S9) tity between any two sequences within the cluster ranges The members of Prolyl oligopeptidases (POPs) family of from 60 to 96%. The second major cluster Clade II is more serine peptidases belong to the α/β hydrolase fold varied in composition and consists of 10 sequences that grouped into SC clan of serine peptidases. The family may further be grouped into two subclusters Clade IIA includes members of different types and with distinct spe- and Clade IIB based or pairwise percent sequence identity cificities. For example, prolyl oligopeptidase is an between any two sequences falling into the cluster. Clade endopeptidase localised to the cytosol, while dipeptidyl IIA includes four sequences (At1g20380, At1g76140, peptidase IV (DPPIV) is a membrane bound exopepti- LOC_Os01g01830 and LOC_Os04g47360) that are dase. According to MEROPS database[5], POPs may be between 72 and 89% identical with each other. Clade IIB classified into four subfamilies with prolyl oligopeptidase includes six sequences that share between 32 and 85% (S9A), dipeptidyl-peptidase IV (S9B), aminoacyl-pepti- pairwise sequence identity with each other. The members dase (S9C) and glutamyl endopeptidase (S9D) being the of two subclusters share between 22 and 32% pairwise typical examples of their respective subfamilies. sequence identity with each other. Clade IIA includes two

Page 7 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Os06g06770 At5g25770

At5g15860 At3g47560

Os01g39800 Os05g46210 Os02g09770 At1g26120

At5g19630

Os01g57770

Os01g42690 Os06g42730

At3g30380 At4g14290 At1g13610 Os10g04620

* At3g01690 Os03g24450 At4g24760 Clade I

Os07g41730 Os12g18860

At1g32190 At5g20520 *

At2g24320

Os02g18850

At5g24260 (S9B) At1g66900 Os02g55330

At5g36210

Os06g11190 Os01g49510 Os06g11180

* At1g50380

At2g47390 (S9D)

Os06g51410 Os07g48970 Os03g19410 (S9D) (S9D) Os09g28040 IIB At4g14570 Os10g28030 At5g66960 Os04g47360 Clade II Os10g28020

At1g76140 At1g69020 Os01g01830 (S9A) Os09g29950 (S9A) At1g20380 IIA domainsFigureUnrooted 3 N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) prolyl oligopeptidase Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) prolyl oligopeptidase domains. Prolyl oligopeptidase-like domains were aligned using ClustalW [95] program and the alignments were exported to Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors and circles represent the two evolu- tionary clades identified in the analysis (see text for details). Clade I is represented in Orange, Clade II is shaded green. For clarity, bootstrap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50– 60% are represented by an asterix, circles represent bootstrap values from 60%–80% while bootstrap values >80% are repre- sented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. Subfamily assignments where possible are indicated in parentheses below the gene name (see Figure SF3 and text for details).

Page 8 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Table 2: A list of most recent serine protease gene duplications in Arabidopsis thaliana genome. The most recent gene duplicates identified in the segmentally duplicated regions of the Arabidopsis genome are suffixed with (S). See text for details

S. No. Arabidopsis serine protease-like Biological function if Most recent protein known duplicate

Family S1 (Deg protease family) 1. At1g51150 At5g54745 2. At1g65640 (S) At5g36950 (S) 3. At2g47940 At5g40200 4. At3g16540 At3g16550 5. At3g16550 At3g16540 6. At3g27925 At5g39830

Family S8 (Subtilisin family) 1. At1g20150 At1g20160 2. At1g32940 At1g32960 3. At1g66210 At1g66220 4. At2g04160 At5g59810 5. At2g19170 (S) At4g30020 (S) 6. At3g14240 At4g34980 7. At3g46840 At3g46850 8. At4g00230 At5g03620 9. At4g10520 At4g10530 10. At4g10540 At4g10550 11. At4g20430 (S) At5g44530 (S) 12. At4g21630 At4g21640 13. At5g45640 At5g45650 14. At5g51750 At5g67360 15. At5g58820 At5g58830 16. At5g59090 At5g59120

Family S9 (Prolyl oligopeptidase family) 1. At1g20380 (S) At1g76140 (S) 2. At1g52700 At3g15650 3. At1g69020 At5g66960 4. At3g02410 (S) At5g15860 (S) 5. At3g23540 (S) At4g14290 (S) 6. At4g14570 At5g36210

Family S10 (Serine carboxypeptidase family) 1. At1g11080 (S) At1g61130 (S) 2. At1g28110 At2g33530 3. At1g61130 At1g11080 4. At1g73300 At5g36180 5. At2g22920 At2g22970 6. At2g24000 (S) At4g30610 (S) 7. At2g35780 At3g07990 8. At3g12230 At3g12240 9. At3g25420 (S) At4g12910 (S) 10. At3g45010 At5g22980 11. At3g52000 At3g52010 12. At3g52020 At3g63470 13. At5g42230 At5g42240

Family S14 (Clp protease family) 1. At1g09130 At1g49970 2. At1g66670 At5g45390

Family S16 (Lon protease family) 1. At3g05790 At5g26860

Family S26 (Signal Peptidase family) 1. At1g06870 At2g30440

Page 9 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Table 2: A list of most recent serine protease gene duplications in Arabidopsis thaliana genome. The most recent gene duplicates identified in the segmentally duplicated regions of the Arabidopsis genome are suffixed with (S). See text for details (Continued) 2. At1g23465 At1g29960 3. At1g52600 (S) At3g15710 (S)

Family S28 (Lysosomal Pro-X Carboxypeptidase family) 1. At2g24280 At5g65760 2. At4g36190 At4g36195

Family S41 (C-terminal processing peptidase family) 1. At3g57680 At4g17740

Family S54 (Rhomboid family) 1. At1g12750 (S) At1g63120 (S) 2. At1g74130 At1g74140 3. At2g41160 (S) At3g56740 (S)

Family S59 (Nucleoporin autopeptidase family) 1. At1g10390 At1g59660

sequences (At1g76140, LOC_Os01g01830) identified as like proteins reveals that nearly all gene products identi- members of the POP (S9A) subfamily of Prolyl oli- fied here are single domain proteins containing a single gopeptidase-like proteins (see above) suggesting that the protease domain, with the exception of other members of this subcluster may possibly belong to LOC_Os01g22970 that carries a single zinc binding mod- the S9A subfamily (Figure 3). No species-specific clusters ule (zf-CCHC; Pfam[37] accession: PF00098) N-terminal were identified in this family. to the protease domain and LOC_Os06g32740 that car- ries a conserved transposable element module Serine carboxypeptidases (Family S10) (Transposase_21; Pfam[37] accession: PF01359) C-termi- Serine carboxypeptidases catalyze the hydrolysis of the C- nal to the protease domain (Figure 2). This clearly indi- terminal bond in proteins and peptides. Crystal structures cates that the predicted serine carboxypeptidase domain is show that serine carboxypeptidases belong to the α/β usually sufficient for biological activity of the majority of hydrolase fold and possess a catalytic triad similar to gene products identified in the two plant species. The lack members of chymotrypsin and subtilisin families in the of diverse domain architecture suggests that spatial and order Ser, Asp and His (S257, D449, H508)[5]. temporal regulation of gene expression patterns may play a major role in regulating the activity of SCP-like proteins Serine carboxypeptidase-like proteins (SCPLs) have been in Arabidopsis and rice. We identified 17 orthologous pairs identified in several plant species, where they form large across the two species suggesting, as in the case of subtili- and diverse gene families in contrast with prokaryotes and sin-like proteins, a majority of SCP-like proteins in Arabi- other eukaryotes. Plant SCPLs have been implicated in dopsis and rice probably arose due to extensive gene several biochemical processes including secondary metab- duplication subsequent to dicot-monocot divergence and olism (See additional file 3: Table S3.pdf) and are that has contributed to their forming a large gene family believed to be essential to plant growth and develop- in the two species (Tables 2, 3; see additionals files 4, 8: ment[2,9-11], [41-43]. Table S4.pdf, Figure SF4.pdf). The presence of a large number of SCP-like proteins in the two plant species sug- We identified 54 genes coding for serine-carboxypepti- gests a possible functional redundancy, while on the other dase (SCP)-like proteins in Arabidopsis and 66 in rice. A hand it may indicate of functional diversification, where a majority of these gene products are predicted to be secre- significant number of these proteins may have been tory pathway proteins, though a number of rice SCP-like recruited for functions in plant secondary metabolism to proteins are either predicted to localise to mitochondria catalyze novel reactions (See additional file 3: Table and chloroplasts or lack a predictable signal sequence (See S3.pdf). SCP-like enzymes have been shown to function additional files 1, 2: Table S1.pdf, Table S2.pdf). The anal- as acyltransferases in several plant species including Arabi- ysis of domain architectures of Arabidopsis and rice SCP- dopsis. Arabidopsis SCP-like proteins SCT/SNG2

Page 10 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Table 3: A list of most recent serine protease gene duplications in rice genome. The most recent gene duplicates identified in the segmentally duplicated regions of the rice genome are suffixed with (S). See text for details

S. No. Rice serine protease-like protein Gene duplicate

Family S1 (DegP protease family) 1. LOC_Os02g50880 (S) LOC_Os06g12780 (S) 2. LOC_Os04g38640 LOC_Os05g49380 3. LOC_Os12g04740 LOC_Os12g04740

Family S8 (Subtilisin family) 1. LOC_Os01g56320 LOC_Os06g48650 2. LOC_Os01g58240 LOC_Os01g58260 3. LOC_Os01g58280 LOC_Os01g58290 4. LOC_Os01g64860 (S) LOC_Os05g36010 (S) 5. LOC_Os02g10520 (S) LOC_Os06g40700 (S) 6. LOC_Os02g16940 LOC_Os02g17090 7. LOC_Os02g17000 LOC_Os02g17060 8. LOC_Os02g17090 LOC_Os02g16940 9. LOC_Os02g44590 (S) LOC_Os04g47150 (S) 10. LOC_Os02g53910 LOC_Os02g53970 11. LOC_Os03g02750 (S) LOC_Os10g38080 (S) 12. LOC_Os03g31630 LOC_Os07g48650 13. LOC_Os03g40830 LOC_Os03g55350 14. LOC_Os04g02980 LOC_Os04g03060 15. LOC_Os04g03800 LOC_Os04g03810 16. LOC_Os09g26920 LOC_Os10g25450

Family S9 (Prolyl oligopeptidase family) 1. LOC_Os01g01830 LOC_Os04g47360 2. LOC_Os04g47360 LOC_Os01g01830 3. LOC_Os06g11180 LOC_Os06g11190 4. LOC_Os07g48970 LOC_Os07g48970 5. LOC_Os09g28040 LOC_Os09g29950 6. LOC_Os10g28020 LOC_Os10g28030

Family S10 (Serine carboxypeptidase family) 1. LOC_Os01g06490 LOC_Os01g61690 2. LOC_Os01g11670 LOC_Os05g50600 3. LOC_Os01g22980 LOC_Os05g06660 4. LOC_Os02g02320 LOC_Os07g29620 5. LOC_Os02g42310 (S) LOC_Os04g44410 (S) 6. LOC_Os03g26920 LOC_Os03g26930 7. LOC_Os03g52040 LOC_Os03g52080 8. LOC_Os04g25560 LOC_Os12g15470 9. LOC_Os04g32540 LOC_Os11g31980 10. LOC_Os05g50570 LOC_Os05g50580 11. LOC_Os08g44640 LOC_Os08g44640 12. LOC_Os10g01110 LOC_Os11g42390 13. LOC_Os11g24290 LOC_Os11g24410 14. LOC_Os11g24510 LOC_Os12g39170 15. LOC_Os11g27270 LOC_Os11g27350

Family S14 (Clp protease family) 1. LOC_Os01g32350 LOC_Os03g19510 2. LOC_Os02g42290 (S) LOC_Os04g44400 (S) 3. LOC_Os03g22430 LOC_Os05g51450 4. LOC_Os08g15270 LOC_Os12g10590

Family S16 (Lon protease family) 1. LOC_Os01g54040 (S) LOC_Os05g44590 (S) 2. LOC_Os03g19350 LOC_Os07g48960 3. LOC_Os03g29540 LOC_Os07g32560

Page 11 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Table 3: A list of most recent serine protease gene duplications in rice genome. The most recent gene duplicates identified in the segmentally duplicated regions of the rice genome are suffixed with (S). See text for details (Continued)

Family S24 (Signal peptidase family) 1. LOC_Os03g55640 LOC_Os09g28000 2. LOC_Os05g23260 LOC_Os06g16260

Family S28 (Lysosomal Pro-X Carboxypeptidase family) 1. LOC_Os01g56150 LOC_Os06g43930 2. LOC_Os10g36780 LOC_Os10g36780

Family S41 (C-terminal processing peptidase family) 1. LOC_Os01g47450 LOC_Os06g21380

Family S54 (Rhomboid family) 1. LOC_Os01g18100 LOC_Os03g44830 2. LOC_Os03g24390 LOC_Os07g46170 (S) 3. LOC_Os08g43320 (S) LOC_Os09g35730 (S)

Family S59 (Nucleoporin autopeptidase family) 1. LOC_Os12g06870 LOC_Os12g06890

(At2g22920) and SMT/SNG1 (At5g09640) have been other with the exception of a small number of rice shown to function as acyltransferases in phenylpropanoid sequences that form an exclusive subcluster. The members metabolism[41,43]. Studies have shown that Arabidopsis of the two clades mostly share between 20 and 30% pair- SMT is a heavily glycosylated protein that localises to cen- wise sequence identity. However, of the three subclusters tral vacoules of the leaf tissue, suggesting that some SCP- identified among Clade I proteins, the members of Clade like proteins identified in the two plant species may also IC appear to be closer to Clade II sequences than members be vacoule localised[44]. of Clade IA and Clade IB and share between 25 and 35% pairwise sequence identity with the members of Clade II. Phylogenetic analysis indicates that the majority of Arabi- dopsis and rice SCPL proteins cluster into two major Serine beta-lactamases (Family S12) clades, but there are some additional SCPL proteins that Serine beta-lactamases are a group of hydrolases that cat- are less closely related to members of the clades (See addi- alyze the hydrolysis of the beta-lactam ring of β-lactam tional file 8: Figure SF4.pdf). Clade I may further be sub- antibiotics such as penicillins. They were first identified in divided into three subgroups, Clade IA (includes only penicillin-resistant bacterial isolates and have been found Arabidopsis SCPL proteins), Clade IB (includes only rice to be widely distributed in eubacterial genomes. On the SCPL proteins) and Clade IC (includes a pair of Arabidop- basis of sequence similarity, they have been classified into sis and rice SCPL proteins). Clade IA includes known three classes A, C and D, where class B consists of metallo- SCPL acyltransferases such as SMT (At2g22990) and SCT beta-lactamases. Class A (penicillinase type) proteins were (At5g09640). Clade IA and IB consist exclusively of Arabi- the first to be identified and are the most common beta- dopsis and rice sequences respectively that have no iden- lactamases. All these proteins share sequence similarity tifiable orthologues in either subcluster. Therefore these within the class but not with members of other classes. two subclades may represent early diversification of cer- However, proteins of different classes share significant tain SCPL proteins followed by their lineage specific structural similarity and are believed to be homologues delineation and expansion subsequent to speciation (See that have undergone divergent evolution. Structure-based additional file 8: Figure SF4.pdf). phylogeny studies suggest that beta-lactamases are ancient enzymes that probably evolved as means of protection Clade II, the larger of two major clades, consists of against beta-lactam antibiotics[46]. Site-directed muta- sequences that share between 33 and 82% pairwise genesis and structural analyses have helped to unravel the sequence identity with each other. Earlier observations catalytic mechanism of β-lactamases and led to the iden- have suggested that Clade II consists of more varied tification of their active site residues (S93, K96, sequences in terms of pairwise sequence identity, conser- Y190)[47,48]. vation of key residues and exon/intron splice posi- tions[45]. In contrast to Clade I, Arabidopsis and rice Serine beta-lactamases are widespread in bacterial sequences falling into Clade II, are interspersed with each genomes, and their corresponding genes may occur on

Page 12 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

bacterial chromosomes or on plasmids. This allows for We have identified nine different ClpP genes encoding the their transfer to distant species and may account for their proteolytic subunit of the Clp protease in the Arabidopsis distribution and diversity[48]. Serine beta-lactamase nuclear proteome and 13 in rice nuclear proteome. Eight homologues have been identified in eukaryotes, mainly as of the nine Arabidopsis ClpP-like proteins are predicted to a consequence of ongoing genome sequencing programs. localise to chloroplast, while At5g23140 gene product is However, little function information is available about predicted to be mitochondria localised, though experi- these proteins in eukaryotes. mental evidence suggests that At5g23140 gene product appears to localise to both chloroplasts and mitochon- We have identified a single serine beta-lactamase like gene dria[52]. Likewise, in rice, 11 ClpP-like proteins are pre- product each in Arabidopsis and rice proteomes, both of dicted to be chloroplast localised, while which are predicted to localise to mitochondria. An inter- LOC_Os08g15270 gene product is predicted to localise to esting feature is the presence of a single ABC1 domain mitochondria, while subcellular localization for (Pfam[37] accession: PF03109) N-terminal to the serine LOC_Os12g1059 could not be determined (See addi- β-lactamase domain (Figure 2; see additional files 1, 2: tional files 1, 2: Table S1.pdf, Table S2.pdf). We observed Table S1.pdf, Table S2.pdf). ABC1 domains are associated that some ClpP gene products identified in the two species with a family of proteins that include members from yeast have lost the catalytic protease site and therefore do not and bacteria that display a nuclear or mitochondrial sub- appear to take part in proteolysis. However, they are cellular location in eukaryotes. Some members of the believed to play a significant role in proper assembly of family are believed to regulate mRNA translation and are functional ClpP protein complex and potentially influ- essential for electron transfer in the bc1 complex[37] (See ence chaperone interaction (Figure 4). Interestingly, we additional file 9: Figure SF5.pdf). identified orthologues for all nine Arabidopsis ClpP gene products in rice, suggesting high degree of evolutionary Clp ATP-dependent proteases (Family S14) conservation of Clp function in the two species (See addi- Clp proteases are a group of ATP-dependent serine tional file 4: Table S4.pdf). Recently, a 325-kDa complex endopeptidases that belong to MEROPS[5] family S14. consisting of 10 Clp isoforms including the chloroplast E.coli ClpP is an ATP-dependent serine protease that con- encoded pClpP and nuclear encoded Clp proteins was sists of a smaller protease subunit ClpP, and a larger chap- identified in Arabidopsis chloroplasts and a 320-kDa erone regulatory ATPase subunit (either ClpA or ClpX). homotetradecameric complex consisting of only ClpP2 Though the protease domain is capable of proteolysis on gene products was identified in mitochondria. These pro- its own, the ATPase subunits are essential for effective lev- tein complexes are believed to be involved in degradation els of proteolysis. Structural analysis (See additional file 3: of misfolded or unassembled peptides in an ATP-depend- Table S3.pdf) reveals that the catalytic triad residues Ser- ent manner[52]. High evolutionary conservation of ClpP- His-Asp (S111, H136, D185) are enclosed in a single cav- like proteins across the two species suggests that some rice ity that allows for degradation of small peptides but pre- ClpP homologues may participate in the assembly of a cludes the entry of large folded polypeptides (See similar regulatory complex in rice chloroplasts and mito- additional file 3: Table S3.pdf)[6,7,49]. chondria.

Clp proteases have been identified in bacteria and eukary- Phylogenetic analysis of Arabidopsis and rice Clp-like pro- otes but appear to be absent from archaea. They have been teins reveals the presence of several gene clusters (Clades implicated in protein quality control and degradation of I-VIII) that share comparatively low pairwise sequence misfolded nascent peptides[50]. Plant Clp proteases were identity with each other, as opposed to the members first identified with the discovery of the plastid-encoded within a cluster that share significant sequence similarity gene clpP1, orthologous to E.coli clpP and in most higher with each other (Figure 5). We identified eight Clp-like plants it is transcribed by genes rps12 and rps120[6,7]. proteins from both Arabidopsis and rice that are likely to Though the exact function for Clp proteases remains to be have lost their catalytic protease activity due to mutations worked out, they appear to be essential for chloroplast of key residues in protease active site (Figure 4). Six of function and viability. Clp proteases are likely to play a these gene products (At1g09130, At1g49970, At4g17040 role in protein degradation in stroma and thylakoid mem- from Arabidopsis and LOC_Os01g16530, brane under particular stress conditions thereby ensuring LOC_Os03g22430, LOC_Os05g51450 from rice) fall into removal of aberrant polypeptides and recycling of the a single gene cluster (Clade I) and share between 36 and nutrient pool[7,13]. They have also been implicated in 72% pairwise sequence identity with each other (Figure shoot development[51] suggesting a broad regulatory role 5). The above sequences have a 9–10 amino acid insert for Clp proteases in plants. (L1 inserts) that are believed to affect the presentation of the substrate to the neighboring catalytic triads in ClpP proteins and also known to contribute residues to puta-

Page 13 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

At1g09130 ------PRTPPPDLPSMLLDGRIVYIGMPLVPAVTELVVAELMYLQWLDPKEPIYIYINSTGTTRDDGETVGMESE Os03g22430 ------RRPPPDLPSLLLHGRIVYIGMPLVPAVTELVVAQLMYLEWMNSKEPVYIYINSTGTARDDGEPVGMESE At1g49970 -----GGSERPRTAPPDLPSLLLDARICYLGMPIVPAVTELLVAQFMWLDYDNPTKPIYLYINSPGTQNEKMETVGSETE Os05g51450 ------SSAPDLPSLLLDSRIIFLGMPIVPAVTELIAAQFLWLDYDDRTKPIYLYINSTGTMDENNELVASETD At4g17040 -----SKGSAHEQPPPDLASYLFKNRIVYLGMSLVPSVTELILAEFLYLQYEDEEKPIYLYINSTGTTK-NGEKLGYDTE Os01g16530 -----LRGTAWQQPPPDLASFLYKNRIVYLGMCLVPSVTELMLAEFLYLQYDDAEKPIYLYINSTGTTK-NGEKLGYETE At1g02560 ------MVQERFQSIISQLFQYRIIRCGGAVDDDMANIIVAQLLYLDAVDPTKDIVMYVNSP------GGSVTA Os03g19510 ------MVMERFQSVVSQLFQHRIIRCGGPVEDDMANIIVAQLLYLDAIDPNKDIIMYVNSP------GGSVTA At1g66670 ------SFEELDTTNMLLRQRIVFLGSQVDDMTADLVISQLLLLDAEDSERDITLFINSP------GGSITA Os01g32350 ------RLEELDTTNMLLRQRIVFLGSPVDDMSADLIISQLLLLDAEDKTKDIKLFINSP------GGSITA At5g45390 ------RGAESDVMGLLLRERIVFLGSSIDDFVADAIMSQLLLLDAKDPKKDIKLFINSP------GGSLSA Os10g43050 ------PGAGAGGDVMGLLLRERIVFLGNEIEDFLADAVVSQLLLLDAVDPNSDIRLFVNSP------GGSLSA Os02g42290 ------SRGERAYDIFSRLLKERIVLIHGPIADETASLVVAQLLFLESENPLKPVHLYINSP------GGVVTA Os04g44400 ------SRGERAYDIFSRLLKERIVCIHGPITDDTASLVVAQLLFLESENPAKPVHLYINSP------GGVVTA At5g23140 ------SRGERAYDIFSRLLKERIICINGPINDDTSHVVVAQLLYLESENPSKPIHMYLNSP------GGHVTA At1g11750 ------PGGPLDLSSVLFRNRIIFIGQPINAQVAQRVISQLVTLASIDDKSDILMYLNCP------GGSTYS Os03g29810 ------PAGALDLATVLLGNRIIFIGQYINSQVAQRVISQLVTLAAVDEEADILIYLNCP------GGSLYS At1g12410 ------NREEGTWQWVDIWNALYRERVIFIGQNIDEEFSNQILATMLYLDTLDDSRRIYMYLNGP------GGDLTP Os06g04530 ------PGEGTWQWLDIWNALYRERIIFIGDSIDEEFSNQVLASMLYLDSVDNTKKILLYINGP------GGDLTP Os08g15270 ------Os12g10590 ------PGDEEVTWVDLYNVMYRERTLFLGQEIRCEVTNHITGLMVYLSIEDGISDIFLFINSP------GGWLIS Os11g11210 DLPLLPFQPAEVLIPSECKTLHLYEARYLALLEEALYRTNNSFVHLVLDPVVSGSPKASFAVERLDIGALVSIRGVCRVN

At1g09130 GFAIYDSLMQLKNEVHTVCVGAAIGQACLLLSAGTKGKRFMMPHAK------AMIQQPRVPSSGLMPASDVLIRAKEVIT Os03g22430 GFAIYDAMMRMKTEIHTLCIGAAAGHACLVLAAGKKGKRYMFPHAK------AMIQQPRIPSYGMMQASDVVIRAKEVVH At1g49970 AYAIADTISYCKSDVYTINCGMAFGQAAMLLSLGKKGYRAVQPHSS------TKLYLPKVNRS-SGAAIDMWIKAKELDA Os05g51450 AFAIADFINRSKSKVYTINLSMAYGQAAMLLSLGVKGKRGVLPNSI------TKLYLPKVHKS-GGAAIDMWIKAKELDT At4g17040 AFAIYDVMGYVKPPIFTLCVGNAWGEAALLLTAGAKGNRSALPSST------IMIKQPIARFQGQATDVEIARKEIKH Os01g16530 AFAIYDAMRYVKVPIFTLCVGNAWGEAALLLAAGAKVSVVFASKKYEDNDLILLTFNVQPIGRFQGQATDVDIARKEIRN At1g02560 GMAIFDTMRHIRPDVSTVCVGLAASMGAFLLSAGTKGKRYSLPNSR------IMIHQPLGGAQ-GGQTDIDIQANEMLH Os03g19510 GMAIFDTMKHIRPDVSTVCIGLAASMGAFLLSAGTKGKRYSLPNSR------IMIHQPLGGAQ-GQETDLEIQANEMLH At1g66670 GMGIYDAMKQCKADVSTVCLGLAASMGAFLLASGSKGKRYCMPNSK------VMIHQPLGTAG-GKATEMSIRIREMMY Os01g32350 GMGVYDAMKFCKADISTVCFGLAASMGAFLLAAGTKGKRFCMPNAR------IMIHQPSGGAG-GKVTEMGLQIREMMY At5g45390 TMAIYDVVQLVRADVSTIALGIAASTASIILGAGTKGKRFAMPNTR------IMIHQPLGGAS-GQAIDVEIQAKEVMH Os10g43050 TMAIYDVMQLVRADVSTIGLGIAGSTASIILGGGTKGKRFAMPNTR------IMIHQPVGGAS-GQALDVEVQAKEILT Os02g42290 GLAIYDTMQYIRCPVTTLCIGQAASMGSLLLAAGARGERRALPNAR------VMIHQPSGGAQ-GQATDIAIQAKEILK Os04g44400 GLAIYDTMQYIRSPVTTLCIGQAASMASLLLAAGARGERRALPNAR------VMIHQPSGGAS-GQASDIAIHAKEILK At5g23140 GLAIYDTMQYIRSPISTICLGQAASMASLLLAAGAKGQRRSLPNAT------VMIHQPSGGYS-GQAKDITIHTKQIVR At1g11750 VLAIYDCMSWIKPKVGTVAFGVAASQGALLLAGGEKGMRYAMPNTR------VMIHQPQTGCG-GHVEDVRRQVNEAIE Os03g29810 ILAIYDCMSWIKPKVGTVCFGVVASQAAIILAGGEKGMRYAMPNAR------VMIHQPQGVSE-GNVEEVRRQVGETIY At1g12410 SLAIYDTMKSLKSPVGTHCVGLAYNLAGFLLAAGEKGHRFAMPLSR------IALQSPAGAAR-GQADDIQNEAKELSR Os06g04530 CMALYDTMLSLKSPIGTHCLGFAFNLAGFILAAGEKGSRTGMPLCR------ISLQSPAGAAR-GQADDIENEANELIR Os08g15270 ------MAS---FILLGGEPTKRIAFPHAR------IMLHQPASAYYRARTPEFLLEVEELHK Os12g10590 GMAIFDTMQTVTPDIYTICLGIAASMVSFILLGGEPTKRIAFPHAR------IMLHQPASAYYRARTPEFLLEVEELHK Os11g11210 IINLLQMEPYLRGDVSPIMDISSESIELGLRISKLRESMCNLHSLQMK-----LKVPEDEPLQTNIKASLLWSEKEIFEE

. : .: . At1g09130 NRDILVELLSKHTGNSVETVANVMRRPYYMDAPKAKEFGVIDRILWRG------Os03g22430 NRNTLVRLLARHTGNPPEKIDKVMRGPFYMDSLKAKEFGVIDKILWRG------At1g49970 NTEYYIELLAKGTGKSKEQINEDIKRPKYLQAQAAIDYGIADKIADSQ------Os05g51450 NTDYYLELLSKGVGKPKEELAEFLKGPRYFRAQEAIDYGLADTILHSL------At4g17040 IKTEMVKLYSKHIGKSPEQIEADMKRPKYFSPTEAVEYGIIDKVVYNE------Os01g16530 VKIEMIKLLSRHIGKSVEEIAQDIKRPKYFSPSEAVDYGIIDKVLYNE------At1g02560 HKANLNGYLAYHTGQSLEKINQDTDRDFFMSAKEAKEYGLIDGVIMNP------Os03g19510 HKANLNGYLAYHTGQPLDKINVDTDRDYFMSAKEAKEYGLIDGVIMNP------At1g66670 HKIKLNKIFSRITGKPESEIESDTDRDNFLNPWEAKEYGLIDAVIDDG------Os01g32350 EKIKINKILSRITGKPEEQIDEDTKFDYFMSPWEAKDYGIVDSVIDEG------At5g45390 NKNNVTSIIAGCTSRSFEQVLKDIDRDRYMSPIEAVEYGLIDGVIDGD------Os10g43050 NKRNVIRLISGFTGRTPEQVEKDIDRDRYMGPLEAVDYGLIDSVIDGD------Os02g42290 LRDRLNKIYQKHTGQEIDKIEQCMERDLFMDPEEARDWGLIDEVIENR------Os04g44400 VRDRLNKIYAKHTSQAIDRIEQCMERDMFMDPEEAHDWGLIDEVIEHR------At5g23140 VWDALNELYVKHTGQPLDVVANNMDRDHFMTPEEAKAFGIIDEVIDERS------At1g11750 ARQKIDRMYAAFTGQPLEKVQQYTERDRFLSASEALEFGLIDGLLETE------Os03g29810 ARDKVDKMFAAFTGQTLDMVQQWTERDRFMSSSEAMDFGLVDALLETR------At1g12410 IRDYLFNELAKNTGQPAERVFKDLSRVKRFNAEEAIEYGLIDKIVRPP------Os06g04530 IKNYLYSKLSEHTGHPVDKIHEDLSRVKRFDAEGALEYGIIDRIIRPS------Os08g15270 VREMITRVYALRTGKPFWVVSEDMERDVFMSADEAKAYGLVDIVGDEM------Os12g10590 VHEMITRVYALRTGKPFWVVSEDMERDVFMSADKAKAYGLVDIVGDEM------Os11g11210 YNEGFIPALPERLSFAAYQTVSGMSEAELLSLQKYKIQAMDSTNTLERLNSGIEYVEH MultipleteinsFigure 4sequence alignment of the Clp protease domain region of the annotated Arabidopsis and rice Clp protease-like pro- Multiple sequence alignment of the Clp protease domain region of the annotated Arabidopsis and rice Clp protease-like pro- teins. The catalytic triad residues are indicated. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. Several gene products that display mutation in one or more catalytic triad residues can be visualised here (see text for details).

Page 14 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Clade II Os01g16530 At4g17040 Os06g04530 Os02g42290 At1g12410 Clade I At1g49970 Clade III

Os04g44400 Os05g51450

*

At5g23140

At1g09130 * At1g02560 Clade IV

Os03g19510 *

Os03g29810 Os03g22430 Os11g11210

Clade V At1g11750

At1g66670 Os12g10590 Os01g32350 At5g45390Os10g43050 Os08g15270 Clade VI Clade VIII Clade VII FigureUnrooted 5 N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) Clp protease domains Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) Clp protease domains. Clp protease-like domains were aligned using ClustalW [95] program and the alignments were exported to Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors and circles represent different evolutionary clades identified in the analysis (see text for details). Clade I is represented in orange, while clades II-VIII are shaded in black. For clarity, boot- strap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50–60% are represented by an asterix, circles represent bootstrap values from 60%–80% while bootstrap values >80% are represented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene.

tive lateral openings likely to stimulate or inhibit exit of Lon proteases (Family S16) peptide fragments from the core[52] (Figure 4). The other Lon proteases are also a group of ATP-dependent serine ClpP-like gene products identified in the two species proteases. Unlike the Clp proteases, the catalytic and mostly cluster into independent gene clusters, Clades II- ATPase domains of the Lon proteases reside in the same VIII (Figure 5). The sequences constituting the gene clus- polypeptide. They are conserved in prokaryotes and ters share high pairwise sequence identity with each other, eukaryotic organelles such as chloroplast and mitochon- but are between 25 and 45% identical to the rest of ClpP- dria. E.coli Lon protease was the first ATP-dependent pro- like proteases identified across the two species. tease to be described and consists of three functional

Page 15 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

domains: the N-terminal domain (LON), a central ATPase At5g47040 is predicted to localise to mitochondria and domain (AAA+ module) and a C-terminal proteolytic LOC_Os09g36300 is predicted to localise to chloroplasts. domain (Lon_C). The N-terminal LON domain along The members of the two clusters display between 40 and with the AAA module is believed to impart substrate spe- 45% pairwise sequence identity with each other. A Lon cificity to Lon proteases[6,53]. Structure and sequence protease like gene product from rice, LOC_Os06g05820, analysis along with site-directed mutagenesis indicate that is less closely related and shares between 19 and 25% pair- Lon proteases employ a Ser-Lys (S679, K/R722) catalytic wise sequence identity with the rest of the Lon protease dyad instead of a canonical serine protease catalytic like gene products from the two species. triad[53]. Based on the conservation of residues around the catalytic residues, the Lon proteases have recently Signal peptidases (Family S26) been proposed to be divided into two subfamilies: LonA Signal peptidases (SPases or leader peptidases) are a and LonB[53]. diverse group of serine endopeptidases, that belong to MEROPS[5] family S26 and are responsible for the Unlike Clp proteases, Lon proteases appear to be present removal of signal peptides from preproteins within the in organisms from all forms of life. They are believed to cell[57]. Signal peptides comprise the N-terminal part of play a significant role in removal of aberrant proteins and the polypeptide chain. They are essential for targeting var- thereby maintain cellular stability. Interestingly, an ious proteins to their correct destination and are cleaved archaeal membrane bound Lon protease was found to off subsequent to the transfer of the protein across the have both ATP-dependent and ATP-independent proteo- membrane. The failure to remove signal peptides often lytic activities towards folded and unfolded proteins leads to protein inactivation and/or mislocalisation[57]. respectively[54]. Little is known of their function in plants, but Lon1 from Arabidopsis has been implicated in The Type I SPases are membrane-bound serine endopepti- the control of cytoplasmic male sterility[55]. dases that were first identified in bacteria (E.coli lepB) and homologues have subsequently been identified in We have identified four protein sequences containing the archaea, mitochondrial inner membrane, the thylakoid Lon proteolytic domain in Arabidopsis and an equal membrane of chloroplasts and the endoplasmic reticu- number in rice. In addition, we also identified five protein lum membranes of yeast and higher eukaryotes[57,58]. sequences containing only the Lon N-terminal domain in The Type I SPase family is further divided into two sub- Arabidopsis and four in rice (data not shown). Of the Ara- families: members of S26A subfamily employ a serine- bidopsis Lon protease-like proteins, At3g05790 and lysine dyad for catalysis (S91, K146) and include prokary- At5g26860 gene products are predicted to localise to chlo- otic, chloroplast and mitochondrial peptidases, while the roplast, while At3g05780 and At5g47040 gene products members of S26B subfamily that include ER SPC and are predicted to be mitochondria localised. All four Lon archaeal peptidases, employ a serine-histidine dyad (S44, protease-like proteins identified in rice are predicted to H83) based on the mechanism for catalysis[57,58]. localise to chloroplasts. The Lon-protease like proteins identified in the two plant species share higher similarities In bacteria, Type I SPases appear to be essential for cell via- with the LonA subfamily (See additional file 10: Figure bility and their inactivation invariably leads to cell death. SF6.pdf). The presence of LON-only sequences has been Therefore they have been attractive targets for antibacte- known in several prokaryotic species and its presence in rial drugs[57,58]. In higher plants, homologues of Type I Arabidopsis and rice proteomes suggests that LON domain SPases, known as thylakoidal processing peptidases have may have been recruited as a general module for regulat- been identified and are involved in processing of protein ing protein-protein interactions, particularly for targeted presequences required for integration into or transloca- protein degradation[56]. tion across thylakoid membranes. TPP, an Arabidopsis homologue of E.coli LepB is a thylakoid membrane pro- Sequence analysis reveals that Arabidopsis and rice Lon tein with its active site situated at the lumenal side[59,60]. protease like sequences fall into two clusters on the basis An Arabidopsis plastidic SPase I (Plsp1, At24590) has been of pairwise sequence identity (data not shown). The larg- found to be required for complete maturation of the plas- est of these clusters consists of five sequences (At3g05790, tid protein translocation channel Toc75. It is proposed At5g26860, At3g05780, LOC_Os03g19350, that Plsp1 may have multiple functions and is believed to LOC_Os07g48960), all of which, except At3g05780 are be essential for biogenesis of plastid internal mem- predicted to localise to chloroplasts and share between 76 branes[61]. and 91% pairwise sequence identity with each other. The second cluster consists of two sequences (At5g47040 and We have identified nine genes coding for signal peptidase- LOC_Os09g36300) that are 82% identical to each other like proteins in Arabidopsis and seven in rice. A majority of but predicted to have different subcellular localizations. these gene products are predicted to localise to mitochon-

Page 16 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

. :. : :. : : At1g06870 SIPSTSMLPTLDVGDR------VIAEKVSYFFRKPEVSDIVIFKAPPILVEH---GYSCADVFIKRIVASEGDWVEVCD At2g30440 SIPSTSMYPTLDKGDR------VMAEKVSYFFRKPEVSDIVIFKAPPILLEYPEYGYSSNDVFIKRIVASEGDWVEVRD At3g24590 YIPSLSMYPTFDVGDR------LVAEKVSYYFRKPCANDIVIFKSPPVLQEVG---YTDADVFIKRIVAKEGDLVEVHN Os03g55640 SIPSKSMYPTFDVGDR------ILADKVSYVFREPNILDIVIFRAPPVLQALG---CSSGDVFIKRIVAKGGDTVEVRD Os09g28000 SIPSKSMYPTFDVGDR------ILAEKVSYIFREPEILDIVIFRAPPALQDWG---YSSGDVFIKRVVAKAGDYVEVRD At3g08980 PVRGDSMSPTFNPQRNSYLDDYVLVDKFCLKD-YKFARGDVVVFSSP------THFGDRYIKRIVGMPGEWISS-- Os04g08340 TVRGTSMNPTLESQQG----DRALVSRLCLDARYGLSRGDVVVFRSP------TEHRSLLVKRLIALPGDWIQVPA At1g23465 YAYGPSMIPTLHPSGN------MLLAERISKRYQKPSRGDIVVIRSP------ENPNKTPIKRVVGVEGDCISFVI At1g29960 YAYGPSMTPTLHPSGN------VLLAERISKRYQKPSRGDIVVIRSP------ENPNKTPIKRVIGIEGDCISFVI At1g53530 HVHGPSMLPTLNLTGD------VILAEHLSHRFGKIGLGDVVLVRSP------RDPKRMVTKRILGLEGDRLTFSA Os11g40500 LVMGPSMLPAMNLAGD------VVAVDLVSARLGRVASGDAVLLVSP------ENPRKAVVKRVVGMEGDAVTFLV Os05g23260 VVLSGSMEPGFKRGDI------LFLHMSKDPIRTGEIVVFNID------GREIPIVHRVIKVHEREESAEV Os06g16260 VVLSGSMEPGFKRGDI------LFLHMSKDPIRTGEIVVFNVD------GREIPIVHRVIKVHERQESAEV At1g52600 VVLSGSMEPGFKRGDI------LFLHMSKDPIRAGEIVVFNVD------GRDIPIVHRVIKVHERENTGEV Os02g58140 VVLSESMELGFERGDI------LFLQMSKHPIRTGDIVVFND------GREIPIVHRVIEVHERRDNAQV At3g15710 VVLSESMEPGFQRGDI------LFLRMTDEPIRAGEIVVFSVD------GREIPIVHRAIKVHERGDTKAV

At1g06870 GKLLVNDT------At2g30440 GKLFVNDI------At3g24590 GKLMVNGV------Os03g55640 GKLLVNGV------Os09g28000 GKLIVNGV------At3g08980 ------SRDVIRVPEG----- Os04g08340 ------AQEIRQIPVG----- At1g23465 DPVKSDESQTIVVPKG----- At1g29960 DSRKSDESQTIVVPKG----- At1g53530 DPLVGDAS------Os11g40500 DPGNSDASKTVVVPKG----- Os05g23260 DILTK------Os06g16260 DILTK------At1g52600 DVLTKGDNNYGDDRLLYAEG- Os02g58140 DFLTKGDNNPMDDRILYTHGQ At3g15710 DVLTKGDNNDIDDI------

MultipleFigure 6sequence alignment of the Type I Spase domain region of the annotated Arabidopsis and rice Type I Spase-like proteins Multiple sequence alignment of the Type I Spase domain region of the annotated Arabidopsis and rice Type I Spase-like proteins. The catalytic dyad residues are indicated. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. The variations in the second residue (K/H) of catalytic dyad can be identified here (see text for details). dria, though some are predicted to be chloroplast local- signal-peptidase function in the two species (See addi- ised, membrane bound and even secreted (See additional tional file 4: Table S4.pdf). files 1, 2: Table S1.pdf, Table S2.pdf). Arabidopsis TPP (At2g30440), though predicted by TargetP[19] to be Phylogenetic analysis indicates that Arabidopsis and rice mitochondria localised, has been experimentally deter- Type I SPase-like proteins cluster into two major clades mined to be associated with chloroplast fraction. The supported by high bootstrap values and high pairwise enzyme when overexpressed in vitro, was found to pos- sequence identity (See additional file 11: Figure SF7.pdf). sess catalytic activity against the lumenal targeting prese- Clade I consists of 5 sequences (At1g06870, At2g30440, quence of the 23-kDa component of photosystem II[59], At3g24590, LOC_Os09g28000, LOC_Os03g55640) that suggesting a possible role for TPP and similar proteins in are between 67 and 86% identical to each other. Clade II processing of polypeptides during their transport through consists of 5 sequences (At1g52600, At3g15710, cellular membranes. Most of the gene products analysed LOC_Os05g23260, LOC_Os06g16260, here possess Ser and Lys as active site residues and are LOC_Os02g58140), that share between 67 and 96% pair- likely to employ a serine-lysine catalytic dyad for proteol- wise sequence identity with each other. The members of ysis. Some gene products (At1g52600, At3g15710, either cluster share low sequence identity with the rest of LOC_Os05g23260, LOC_Os06g16260, the gene products. Interestingly, all the members of clade LOC_Os02g58140) appear to possess Ser and His as II carry histidine as an active site residue instead of Lysine active site residues and consequently may employ a ser- and consequently are believed to employ a Ser-His dyad ine-histidine catalytic dyad for proteolysis (Figure 6). We for proteolysis and possibly correspond to S26B sub- identified five orthologous pairs of signal peptidase-like family of Type I Spases (Figure 6; see above; see additional proteins across the two species, suggesting conservation of file 11: Figure SF7.pdf). The remaining sequences carry

Page 17 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

lysine as an active site residue and are consequently with each other. These observations suggest the presence believed to employ a Ser-Lys dyad for proteolysis likely to of two evolutionary clades among Arabidopsis and rice correspond to S26A subfamily of Type I Spases (Figure 6; S28-like proteins prior to speciation. see above; see additional file 11: Figure SF7.pdf). Clade I sequences share between 8 and 10% pairwise sequence C-terminal processing peptidases (Family S41) identity with members of Clade II but share between 21 C-terminal processing peptidases (Ctps) belong to and 43% pairwise sequence identity with the remaining MEROPS[5] family S41. E.coli Tsp, a periplasmic endopro- Type I SPase-like gene products. Clade II members appear tease, was one of the first members of the family to be to be less closely related and share between 8 and 15% identified (See additional file 3: Table S3.pdf). It was sub- pairwise sequence identity with the rest of Type I SPase- sequently determined that Most family members are asso- like gene products. Taken together these observations sug- ciated with PDZ domains that are protein-protein gest that functional clusters of Type I SPases employing interaction modules and are important for substrate rec- slightly varied catalytic mechanism were established early ognition[24]. The C-terminal processing peptidase family during evolution and were subsequently retained, possi- consists of two subfamilies, which differ greatly in activity, bly for adaptation to different physiological processes. sensitivity to inhibitors and molecular structure[5]. The two subfamilies employ different mechanisms for cataly- Lysosomal Pro-X carboxypeptidase family (Family S28) sis. While, the C-terminal processing peptidase subfamily This family includes several eukaryotic enzymes such as (S41A) is found chiefly in bacteria and some eukaryotes lysosomal Pro-X carboxypeptidase (PCP), dipeptidyl- employing a Ser-Lys catalytic dyad (S452, K477), the tri- peptidase II and thymus specific serine peptidase and is con core protease subfamily (S41B) is largely confined to grouped in MEROPS[5] clan SC along with MEROPS[5] archaea with few bacterial representatives employing a peptidase families such as S9, S10, S15 and S33. The pre- catalytic tetrad consisting of Ser, His, Ser and Glu residues dicted active site residues for the members of this family (S745, H746, S965, E1023). (S179, D430, H455) occur in the same order in the sequence as that of family S10. In bacteria, CPPs (such as E. coli Tsp) are believed to play an important role in degradation of improperly synthe- Members of this family have been identified exclusively in sised proteins. For instance, proteins translated from dam- eukaryotes and have been implicated in several processes aged mRNA are modified by addition of a C-terminal tag, (See additional file 3: Table S3.pdf)[4,62,63]. No func- which enables their recognition and degradation by tional information is available regarding their role in CPPs[64]. In plants, Ctps have been implicated in process- plants. ing D1 protein of photosystem II, which is essential for correct assembly and functioning of the plant photosyn- Analysis of the Arabidopsis proteome revealed the presence thetic apparatus[60,65,66]. of seven genes encoding for the members of S28 family, while five genes were identified in rice. Of these, a major- We identified three genes each coding for Ctp-like pro- ity are predicted to be the members of the secretory path- teins in Arabidopsis and rice proteomes. While the Arabi- way, while two rice gene products LOC_Os01g56150 and dopsis proteins are all predicted to be soluble and have LOC_Os10g36780 are predicted to be mitochondria been shown to be translocated into thylakoid lumen, two localised (See additional files 1, 2, 12: Table S1.pdf, Table of the three rice gene products (LOC_Os01g47450 and S2.pdf, Figure SF8.pdf). We further identified three orthol- LOC_Os02g57060) are predicted to be mitochondria ogous pairs across the two species. S28 protease-like pro- localised, while a single gene product LOC_Os06g21380 teins identified here, like their eukaryotic counterparts are is predicted to be translocated to chloroplasts (See addi- believed to be involved in processing of penultimate Pro- tional files 1, 2: Table S1.pdf, Table S2.pdf). All Ctp-like line containing polypeptides in the two plant species (See proteins identified in the two plant species carry a PDZ additional file 4: Table S4.pdf). domain N-terminal to the protease domain (Figure 2; see additional files 1, 2: Table S1.pdf, Table S2.pdf). PDZ Phyolgenetic analysis reveals that Arabidopsis and rice pro- domains are protein-protein interaction modules that are teins belonging to S28 family cluster into two clades (Fig- likely to mediate substrate recognition for plant Ctp-like ure 7). Clade I appears to be more diverse of the two proteins[24]. Sequence analysis suggests that the Ctp-like clades and consists of six sequences that share between 40 proteins identified here may employ a serine-lysine cata- and 69% pairwise sequence identities among themselves. lytic dyad for proteolysis and possibly belong to S41A Clade II consists of five sequences that are more closely subfamily of C-terminal processing peptidases (See addi- related to each other than Clade I sequences and between tional file 13: Figure SF9.pdf). We identified three orthol- 61 and 92% identical. The members of the two clusters ogous pairs (At3g57680:LOC_Os06g21380, share between 19 and 26% pairwise sequence identity At4g17740:LOC_Os02g57060,

Page 18 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Os01g56150

At2g24280 At5g65760

Os06g43930

Os11g05760

Clade I

At5g22860

At2g18080

Os10g36760 *

At4g36195 Os10g36780 Clade II At4g36190 domainsFigureUnrooted 7 N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) family S28 protease Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) family S28 protease domains. S28-like protease domains were aligned using ClustalW [95] program and the alignments were exported to Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors represent the two evolutionary clades identified in the analysis (see text for details). Clade I is represented in Orange, Clade II is shaded green. For clarity, bootstrap values were replaced with symbols representing bootstrap percentages >50%. Bootstrap values between 50–60% are repre- sented by an asterix, circles represent bootstrap values from 60%–80% while bootstrap values >80% are represented by rec- tangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene.

Page 19 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

At5g46390:LOC_Os01g47450) of six Ctp-like proteins LOC_Os02g49570 each carry two protease IV domains across the two plant species (See additional file 4: Table (Figure 2; see additional files 1, 2, 14: Table S1.pdf, Table S4.pdf). Each gene product displays a higher sequence S2.pdf, Figure SF10.pdf). Interestingly the N-terminal pro- identity with its orthologue in the corresponding genome tease IV domain of At1g73390 is 63% identical to the C- than similar gene products from the same species. This terminal protease IV domain of the rice counterpart suggests that these groups were most likely established LOC_Os02g49570, while sharing only 17% identity with early in evolution and disseminated in the two plant spe- At1g73390 C-terminal protease IV domain and 14% iden- cies through speciation. Like prokaryotic Ctps, plant Ctp- tity with LOC_Os02g49570 N-terminal protease IV like proteins are believed to play a role in degradation of domain. Likewise LOC_Os02g49570 N-terminal protease incorrectly synthesised proteins in chloroplasts and mito- IV domain is 75% identical to At1g73390 C-terminal pro- chondria. tease IV domain and only 18% identical to LOC_Os02g49570 C-terminal protease IV domain. This SppA/protease IV family (Family S49) clearly suggests that independent gene shuffling events These groups of serine peptidases belong to MEROPS[5] subsequent to acquisition of SppA homologues in the two family S49 and were first identified on the basis of their plant species may be responsible for the current domain ability to degrade signal peptides that accumulate in the architecture. Similar to their prokaryotic counterparts, cytosol subsequent to their removal from precursor they may function in association with signal-peptidases to polypeptides by signal peptidases. Signal peptides are degrade leader peptides removed from precursor proteins responsible for targeting various proteins to their correct similar to their prokaryotic counterparts. subcellular localizations and are subsequently removed from the preproteins. The signal peptides generated fol- Rhomboids (Family S54) lowing this cleavage need to be rapidly degraded since Rhomboid proteins are a family of serine proteases that they may be harmful to cells by interfering with protein cleave substrates within transmembrane domains[69,70]. translocation or may accumulate in the membrane lead- Rhomboids are one of the most widespread membrane ing to cell lysis. E.coli SppA (protease IV), an inner mem- protein families and are found in archaea, bacteria and brane protein was the first protein identified with signal eukaryotes including plants[70,71] suggesting that while peptide peptidase activity (See additional file 3: Table their signaling mechanism particularly protease activity S3.pdf)[57,67]. The predicted active site serine residue for may be conserved across lineages, they may have addi- members of Protease IV family and Clp protease family tional roles (See additional file 3: Table S3.pdf). A pro- occur in the same order in the sequence: S, R/H, D and posed catalytic triad comprising Asn169, Ser217 and forms the basis for their assignment to SK clan[5]. His281 has been identified using site directed mutagene- sis and each of proposed active site residues is located Members of the protease IV family have been identified in within a transmembrane domain[72]. However, a recent viruses, archaea and bacteria and among the higher report suggests that while Ser217 and His281 along with eukaryotes; they have been identified only in plants. Two a glycine residue (two residues N-terminal to Ser217) are proteins, SppA1 and SppA2 have been identified in Syne- essential for catalysis, Asn169 is not required for catalytic chosystis while a single homologue SppA has been identi- activity and that rhomboids likely function as endopepti- fied in Arabidopsis that lacks the putative transmembrane dases with a serine-histidine dyad[73]. spanning segments predicted from the E.coli sequence[68]. Rhomboid like proteins have been identified in plants, however, there was no experimental information availa- Protease IV family appears to be represented by single ble about them till recently. A rhomboid homologue from members in Arabidopsis and rice proteomes and are not Arabidopsis AtRBL2 was recently isolated and shown to represented in any other eukaryotes. Though predicted by have proteolytic activity and substrate specificity[74]. TargetP[19] to be mitochondria localised, experimental AtRBL2 was shown to cleave Drosophila ligands Spitz and evidence indicates that Arabidopsis SppA (At1g73390) is a Keren but not similar proteins like TGFα, when expressed light-inducible membrane-bound protein with an unu- in mammalian cells, thereby releasing soluble ligands sual monotropic arrangement, associated with thylakoid from the cell into the media. AtRBL2 along with AtRBL1, membranes and has been proposed to be involved in light another Arabidopsis rhomboid homologue was found to dependent degradation of photosystem II and LHCII localise to Golgi apparatus in plant cells consistent with antenna proteins[68]. A single SppA homologue in rice previous observations where several eukaryotic rhomboid (LOC_Os2g49570) is also predicted to be mitochondria proteins have been predicted to localise within Golgi localised and in absence of experimental evidence, it apparatus[72,74]. Interestingly, plants appear to lack any would be tempting to speculate that the rice protein may component of the EGF signaling pathway other than also localise to chloroplasts. At1g73390 and rhomboid and it has been suggested that rhomboid-like

Page 20 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

proteins in plants may be involved in processing as yet an extra TMH C-terminal to the 6-TMH core typical of uncharacterised substrates that may be conserved in other most rhomboid-like proteins[71]. Since a majority of organisms as well[74]. sequences that cluster into Clade I carry seven TMHs (data not shown), it is likely that they correspond to the RHO We identified 20 genes coding for rhomboid-like proteins subfamily of rhomboid-like proteins. Clade II is the sec- in Arabidopsis and 18 in rice. Subsequent to multiple ond largest cluster supported by significant bootstrap val- sequence alignment, it became evident that several pro- ues and consists of five sequences (At2g39060, teins appeared to be catalytically inactive, owing to loss or At3g07950, LOC_Os03g24390, LOC_Os04g01300 and mutation(s) in residues believed to be essential for cataly- LOC_Os07g46170). Arabidopsis genome encodes for four sis. While the role of these catalytically compromised pro- gene products (see below in bold) that were earlier iden- teins remains unclear, it is believed that they may function tified as members of the PARL subfamily, the other major as regulators of active rhomboid proteases[71] (See addi- subfamily of rhomboid-like proteins[71]. The four gene tional files 1, 2, 15: Table S1.pdf, Table S2.pdf, Figure products cluster into four independent groups, Clade III, SF11.pdf). Of all the rhomboid-like proteins identified in consisting of four (At1g18600, At1g74130, At1g74140 the two plant species, only a few are likely to carry addi- and LOC_Os01g55740; 29–56% pairwise sequence iden- tional domains. Three Arabidopsis gene products tity), Clade IV, consisting of two (At3g58460, (At2g41160, At3g56740, At3g58460) and two gene prod- LOC_Os03g44830; 72% identical), Clade V, consisting of ucts from rice (LOC_Os01g16330, LOC_Os03g44830) two (At3g17611, LOC_Os01g18100; 66% identical) and are predicted to possess a ubiquitin associated (UBA) Clade VI consisting of two (At3g59520, domain, C-terminal to the rhomboid protease domain. LOC_0s01g67040; 80% identical) sequences respectively An Arabidopsis gene product At3g17611 is predicted to (Figure 8). A majority of rhomboid-like proteins identi- have a zinc finger-like module C-terminal to the predicted fied in the two plant species were found to be single rhomboid protease domain (Figure 2; see additional files domain proteins, containing a single rhomboid protease 1, 2: Table S1.pdf, Table S2.pdf). We identified 10 orthol- domain. However, three Arabidopsis gene products ogous pairs across Arabidopsis and rice, suggesting a high (At2g41160, At3g56740 and At3g58460) were found to degree of conservation of rhomboid mechanisms across carry an additional UBA domain and At3g17611 was the two plant species (See additional file 4: Table S4.pdf). found to carry a zf-RAN domain C-terminal to the rhom- TargetP[19] predictions suggest that a few rhomboid-like boid protease domain (Figure 2; see additional file 1: proteins from both the plant species may localise to mito- Table S1.pdf). Two gene products in the rice genome, chondria or chloroplasts (See additional files 1, 2: Table LOC_Os03g44830 and LOC_Os01g16330 were found to S1.pdf, Table S2.pdf). Recently, a rhomboid protease was carry a UBA-domain C-terminal to the rhomboid protease identified in yeast mitochondria, where it is involved in domain (Figure 2; see additional file 2: Table S2.pdf). processing of substrates localised to the mitochondrial Interestingly, the UBA domain containing rhomboid-like intermembrane space[75]. Although rhomboid-like pro- proteins were observed to fall into two clusters Clade IV teins are yet to be identified in chloroplasts, identification (At3g58460 and LOC_Os03g44830) and Clade VII of a yeast mitochondrial rhomboid protease coupled with (At2g41160, At3g56740 and LOC_Os01g16330; 69–85% the possibility that a few rhomboid-like proteins in the pairwise sequence identity), suggesting that the rhom- two plant species may localise to mitochondria and/or boid-UBA domain architecture probably evolved early in chloroplasts suggests that rhomboid-like proteins may be evolution, prior to divergence of dicot and monocot line- involved in regulatory processes in mitochondria and ages in higher plants. chloroplasts of higher plants. The members of individual gene clusters (except Clade II) Phylogenetic analysis of Arabidopsis and rice rhomboid- share high pairwise sequence identities with each other like proteins reveals presence of several gene clusters that (see above) but share low sequence identity with other share low pairwise sequence identity with each other, rhomboid-like proteins identified in the two plant though the members within a cluster share significant genomes and are likely to have diverged early in evolu- sequence similarity with each other (Figure 8). Clade I is tion. Our results may in part be explained by previous the largest of the clusters supported by high bootstrap val- observations by Koonin and coworkers that reveal a ues and consists of 15 sequences, most of which, carry polyphyletic origin for eukaryotic rhomboids and the pos- seven transmembrane helices (TMHs) and share greater sibility of a bacterial origin for rhomboid-like proteins than 40% pairwise sequence identity with each other. and their subsequent dissemination in other lineages Three sequences in Clade I: At1g77860, At1g25290 and through multiple horizontal gene transfer events followed At5g07250 have been previously designated as members by extensive gene duplication[71]. of RHO subfamily, one of the two major rhomboid sub- families. The members of RHO subfamily typically have

Page 21 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

At5g25752 Os05g13370 Clade III Os01g05430 Os09g28100 Os01g55740 At1g18600 Os04g48130 At4g23070 At1g52580 At1g25290 At1g74130 At2g29050 Clade I At1g63120 At3g53780 At1g12750

Os03g02530 * At1g74140 Os10g37760 *

Os01g16330

Os09g35730 Os11g47840 At1g77860 At5g38510 At2g41160 Clade VII At5g07250 At3g56740 Os08g43320 *

At2g39060 At3g58460

Os03g24390

Os07g46170 Os01g67040 Os03g44830 At3g07950 Clade IV At3g59520 Os04g01300 At3g17611 Clade II Clade VI Os01g18100 Clade V domainsFigureUnrooted 8 N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) rhomboid protease Unrooted N-J tree computed from multiple sequence alignments of Arabidopsis (red) and rice (blue) rhomboid protease domains. Rhomboid protease-like domains were aligned using ClustalW [95] program and the alignments were exported to Phylip package [96] for representing the Neighbor-Joining tree (see methods). The colors and circles represent different evolu- tionary clades identified in the analysis (see text for details). Clade I is represented in orange, Clade II is shaded brown, Clade III in green, Clade IV in purple. Clades V-VIII are shaded in black. For clarity, bootstrap values were replaced with symbols rep- resenting bootstrap percentages >50%. Bootstrap values between 50–60% are represented by an asterix, circles represent bootstrap values from 60%–80% while bootstrap values >80% are represented by rectangles. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene.

Nucleoporin autopeptidases (Family S59) MOS3 with high sequence similarity to human nucleop- Nucleoporin autopeptidases are a group of autocatalytic orin 96 has been identified in Arabidopsis and shown to be serine endopeptidases belonging to MEROPS[5] family required for activation of disease resistance response S59. Most nucleoporins are synthesised as precursors and against bacterial pathogens. It was experimentally shown processed by autoproteolysis prior to their association in to localise to the nuclear envelope, suggesting that nucle- the Nuclear Pore Complex (NPC). Studies have indicated ocytoplasmic trafficking may play a significant role in that the C-terminal domain of the nucleoporins is essen- plant disease resistance[78]. tial for their localization to the NPC. Mutations that block proteolytic processing within the C-terminal domain pre- Analysis of Arabidopsis and rice proteomes led to identifi- vent their association with the pore and inhibit their abil- cation of three genes each coding for nucleoporin-like ity to form NPCs[76,77]. Structural analysis of the proteins (See additional files 1, 2, 16: Table S1.pdf, Table proteolytic cleavage site confirms the presence of a cata- S2.pdf, Figure SF12.pdf). As mentioned earlier, MOS3 lytic dyad (H879, S881) within HisPheSer motif and an (At1g80680), a likely homologue of human nucleoporin intein-like mechanism for autoproteolysis (See additional 96, plays a significant role in Arabidopsis resistance to bac- file 3: Table S3.pdf)[77]. terial pathogen[78]. We identified a MOS3 orthologue in rice, LOC_Os03g07580, which is of similar size and may Members of the S59 family have been identified across play a similar role in defense response to bacterial patho- eukaryotes, from yeast to humans. Recently, a protein gens in rice (See additional file 4: Table S4.pdf). In addi-

Page 22 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

tion, Arabidopsis proteome also contains two gene 2. LOC_Os04g02960 (Subtilisin family S8)- Domain products At1g10390 and At1g59660 that are 76% identi- architecture Ex2-S8-zfCCHC-rve (Figure 2; see additional cal and show high similarity to human Nup98. In file 2: Table S2.pdf). The putative gene product carries an humans, Nup96 and Nup98, are encoded by the same Extensin_2 domain N-terminal to the subtilisin protease gene and synthesised as a part of the same precursor domain. This domain is associated with a family of plant polypeptide while, in Arabidopsis, they are encoded at dif- hydroxyproline-rich glycoproteins (HRGPs) found in ferent chromosomal locations. Rice proteome contains plant extracellular matrix that play a major role in cell wall two gene products LOC_Os12g06870 and self-assembly and cell extension. The gene product also LOC_Os12g06890 that are 95% identical and are very contains a zhCCHC domain and an RVE domain C-termi- similar to At1g10390 and may be involved in similar nal to the protease domain. The RVE (integrase core) functional pathways in rice. domain is associated with viral proteins that mediate the integration of a DNA copy of the viral genome into the Domain architectures host chromosome. A protein domain maybe defined as an independent evo- lutionary unit that may either occur on its own in a single 3. LOC_Os06g06800 (Subtilisin family S8)- Domain domain protein or alongside other similar units in a architecture S8-zfCCHC-rve. The putative gene product multi-domain protein. A domain may have an independ- carries a zhCCHC domain and an rve domain C-terminal ent function or may contribute to the function of a multi- to the protease domain (Figure 2; see additional file 2: domain protein in cooperation with other domains. Table S2.pdf). Some of the major mechanisms that lead to multi-domain proteins and novel combinations are domain shuffling 4. LOC_Os06g40700 (Subtilisin family S8)- Domain and gene fusion[79,80]. architecture SN-S8-PA-Arf2-C2. The putative gene product carries an Arf2 and a C2 domain C-terminal to the subtili- The serine protease-like proteins belonging to different sin protease domain (Figure 2; see additional file 2: Table families identified in the two plant species exhibit 31 dif- S2.pdf). Arf2 domain is associated with members ADP- ferent domain architectures (Figure 2; see additional files ribosylation factor family that includes small GTP-bind- 1, 2: Table S1.pdf, Table S2.pdf). The different serine pro- ing proteins that are major regulators of vesicle biogenesis tease-like proteins display a preference for co-existing in intracellular traffic. C2 domain is a calcium-dependent domain(s), when present in multi-domain polypeptides, membrane-targeting module found in many cellular pro- where the co-existing domains are believed to either facil- teins associated with signal transduction or membrane itate protein-protein interactions or help evolve newer trafficking. and more diverse functions for serine protease-like pro- teins in the cellular networks. We observed that a majority 5. LOC_Os06g32740 (serine carboxypeptidase family of serine protease-like proteins belonging to a specific S10)- Domain architecture S10-Transposase_21 (Figure 2; family exhibit the same domain architectures with a few see additional file 2: Table S2.pdf). The putative gene members exhibiting different domain architectures (Fig- product is associated with Transposase_21 domain C-ter- ure 2; see additional files 1, 2: Table S1.pdf, Table S2.pdf). minal to the serine carboxpeptidase domain. The TP21 We identified some gene products that contain serine pro- domain is associated with a group of proteins that include tease-like domains and exhibit novel domain architec- a transposable element. tures specific to either of the two plant species. The information on co-existing domains described below was 6. LOC_Os11g27170 (serine carboxypeptidase family obtained from Pfam[37] (See Figure 2 for Pfam[37] acces- S10)- Domain architecture S10-Retrotrans_gag (Figure 2; sions). see additional file 2: Table S2.pdf). The putative gene product is associated with a Retrotrans_gag domain C-ter- 1. LOC_Os01g17160 (Subtilisin family S8)- Domain minal to the serine carboxypeptidase protease domain. architecture SN-S8-PA-zfCCHC (Figure 2; see additional The Retrotrans_gag domain is associated with groups of file 2: Table S2.pdf). The putative gene product is associ- proteins that include eukaryotic Gag or capsid-related ret- ated with a Zinc knuckle motif C -terminal to the subtili- rotranposon-related proteins. sin protease domain, which is associated with nucleocaspid protein of some retroviruses such as HIV. It 7. At5g24810, LOC_Os06g48770 (serine β-lactamse fam- is required for viral genome packaging and is also found ily S12)- Domain architecture ABC1-S12. The putative in eukaryotic proteins involved in RNA binding or single gene products are associated with an ABC1 domain N-ter- stranded DNA binding. minal to the serine β-lactamase domain (Figure 2; see additional files 1, 2: Table S1.pdf, Table S2.pdf). ABC1 domains are associated with a family of proteins that

Page 23 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

include members from yeast and bacteria that display a relationships. One such instance is Arabidopsis AIR3 nuclear or mitochondrial subcellular location in eukaryo- (At2g04160), which has been associated with lateral root tes. Some members of the family are believed to regulate emergence[12] and has a putative orthologue mRNA translation and are essential for electron transfer in (LOC_Os02g10520) in rice (See additional file 4: Table the bc1 complex. S4.pdf). However, AIR3 also appears to be an orthologue of LOC_Os06g40700, which is most closely related to 8. At2g41160, At3g56740, At3g58460, LOC_Os02g10520 and LOC_Os08g23740, which is also LOC_Os01g16330, LOC_Os03g44830 (rhomboid family closely related to LOC_Os02g10520 in terms of sequence S54)- Domain architecture S54-UBA. The putative gene similarity. Therefore, the aforementioned rice genes repre- products are associated with an UBA domain C-terminal sent paralogous genes resulting from recent duplication to the rhomboid protease domain (Figure 2; see addi- events and it would be interesting to investigate whether tional files 1, 2: Table S1.pdf, Table S2.pdf). UBA-domain they are also associated with lateral root emergence like is a novel sequence motif found in diverse proteins asso- their Arabidopsis counterpart and whether they display ciated with ubiquitin and the ubiquitination pathway. redundancy in function. Similarly, Arabidopsis ARA12 [12] (At5g67360) has a putative orthologue 9. At3g17611 (rhomboid family S54)- Domain architec- LOC_Os03g55350 in rice (See additional file 4: Table ture S54-zfRanBP. The putative gene product is associated S4.pdf) and is likely to be associated with cell wall metab- with a zf-RanBP domain C-terminal to the rhomboid pro- olism. ARA12 also appears to be an orthologue of tease domain (Figure 2; see additional files 1, 2: Table LOC_Os03g40830 which, is a likely paralogue of S1.pdf, Table S2.pdf). The zf-RanBP domain is associated LOC_Os03g55350 (Table 3), suggesting a recent duplica- with a group of proteins that include Ran binding proteins tion event in rice genome with respect to At5g67360 that play a role in transport between the nucleus and the (ARA12) in Arabidopsis. cytoplasm. Gene duplication, resulting from regional events or The gene products above, unlike other members of their genome-wide polyploidization is a significant aspect of respective serine protease families, are associated with plant genome evolution. It is generally believed that domains forming novel domain combinations. It is likely extensive gene duplication is responsible for functional that the protease domains may be responsible for regulat- divergence, increased genomic and phenotypic complex- ing the activities of the associated domains for some of the ity and may constitute a major factor in speciation. Under- gene products. standing the mechanisms that underlie gene duplication is vital since these provide greater insights into genome- Orthologue sequence analysis and gene duplications wide aspects of evolutionary processes that shape genome Orthologues are genes in different organisms that have organizations, evolutionary relationships and interac- evolved from a common ancestor gene by speciation. tions[82,83]. Comparison of plant serine proteases with They normally retain the same function in the course of those of other major eukaryotic lineages reveals identifica- evolution and the identification of orthologues is impor- tion of plant specific serine proteases and selective expan- tant for reliable prediction of gene functions and for com- sion of serine protease families in plants (data not paring genome organizations[81]. Nearly 40% of genes shown). In order to explore possible functional redun- encoding for serine protease-like proteins in the two plant dancy in plant serine protease-like proteins, we investi- species were identified as orthologues, suggesting that ser- gated chromosomal locations of serine protease-like ine protease functions may have been established prior to proteins from the two plant species and inferred probable dicot-monocot divergence (See additional file 4: Table gene duplication events that may have facilitated this dis- S4.pdf). Comparison of Arabidopsis thaliana serine pro- tribution. We observed that plant serine protease-like pro- tease-like genes of known functions with similar gene teins are distributed on nearly on all the chromosomes in products encoded in the rice genome has enabled the the two plant species. Several serine protease-like proteins identification of rice serine-protease like gene products in the two plant species were found to occur in tandem likely to be involved in similar physiological processes in repeats, suggesting that local duplication events may have rice (See additional file 4: Table S4.pdf). contributed to expansion of these gene families (See addi- tional files 1, 2: Table S1.pdf, Table S2.pdf). It was also When comparing multi-gene families between species it is noted that several sequences identified as probable most frequently observed that several genes in one species may recent gene duplicates in the two plant species were collectively be orthologues of a single gene in the other, located on different chromosomes (Tables 2, 3). In order indicating recent duplications exclusive to the former. In to analyse the role of segmental duplications in generating such cases, knowledge of gene function of certain mem- these gene duplicates, we compared our results on most bers allows confirmation of orthologous and paralogous recent gene duplicates with the TIGR[20] dataset of genes

Page 24 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

found within segmentally duplicated regions of the Arabi- like proteins of known functions with similar gene prod- dopsis and rice genomes and were able to identify ucts encoded in the rice genome has enabled the identifi- instances of segmentally duplicated genes corresponding cation of rice serine-protease like proteins likely to be to certain families of serine proteases in Arabidopsis and involved in similar physiological processes associated rice genomes (Tables 2, 3). Therefore, some of the with growth and development in rice. In addition, identi- observed most recent gene duplications on different chro- fication of serine protease-like proteins in either plant spe- mosomes in Arabidopsis and rice genomes may be attrib- cies with no apparent counterpart in the other species uted to segmental duplications. It seems likely that the indicates the existence of Arabidopsis and rice specific ser- expansion of certain serine protease families in the two ine protease-like proteins and by extension, dicot- and plant species may be a consequence of extensive segmen- monocot-specific serine protease-like proteins. The sys- tal and tandem gene duplication events subsequent to tematic analysis of the serine protease-like proteins in the speciation (Tables 2, 3). Our analysis appears to be con- two plant species has provided some insight into the func- sistent with recent observations that segmental and tan- tional associations of previously uncharacterised serine dem gene duplication events may have led to selective protease-like proteins. Further investigation of these expansion of several gene families in Arabidopsis and rice aspects may help unravel more specific functional roles genomes[84,85]. It has been suggested that, in general, for several serine protease-like proteins in various cellular duplicated genes have a tendency to acquire newer and and physiological processes in Arabidopsis and rice and more distinct functions. This is possible mainly due to dif- may prove beneficial in our understanding of similar ferential sequence evolution and divergent expression processes in commercially significant crop plant species. patterns thereby adding to the complexity of cellular net- works[15,84-87]. Methods Sequences encoded by the complete Arabidopsis and rice Conclusion (Oryza sativa L.ssp. japonica) proteomes were obtained The current analysis provides an overview and a compari- from The Institute for Genomic research (TIGR[20]). son of serine protease families of Arabidopsis and rice, the two plant species with completely sequenced genomes Search for serine proteases in the two plant genomes that constitute model plant systems for dicots and mono- a. A preliminary search for serine-proteases was per- cots, respectively. The genomes of Arabidopsis thaliana and formed using BLASTP[88] on Arabidopsis and rice pro- rice encode very similar number of serine protease-like teomes, using Type sequences for each serine-protease gene products despite large differences in gene size and family as classified by MEROPS[5] as queries. MEROPS[5] gene content. Similar results have been obtained from employs a hierarchical structure based classification of comparison of other gene families such as ABC transport- proteins and groups the peptidases together on the basis ers and P-ATPases though unlike serine proteases, a differ- of statistical similarities in amino acid sequence and terti- ential distribution of families and subfamilies has been ary structure. Each family has a type example around observed in the above two cases. Nearly 40% of genes which the family is built. All the members from the result- encoding for serine protease-like proteins in the two plant ing list of similar sequences were retrieved for another species were identified as orthologues, suggesting that round of searching and new accession numbers added to some essential plant serine protease functions may have the original list of sequences. The procedure was repeated been established prior to dicot-monocot divergence. This for five iterations with an E-value threshold of 10-3. is further supported by our observation that almost all the phylogenetic clades corresponding to different serine pro- b. The protein sequences thus obtained were scanned fur- tease families taken up for study in this analysis include ther against a dataset of hidden Markov models (HMMs) both Arabidopsis and rice gene products indicating that obtained from the Pfam A database[37], employing the divergence of these clades occurred prior to the divergence Hmmsearch of the HMMER suite[16], with the E-value of two species. However, several clades display an unequal thresholds set to 0.1[89]. representation of gene products from the two species indi- cating that a significant number of serine protease genes c. A complimentary approach of matching sequences to a in the two plant species may have evolved by gene dupli- database of annotated profiles was also employed. Each cation events subsequent to speciation. In addition, plants Arabidopsis and rice genome sequence was matched to sen- appear to share with prokaryotes, some specific sets of ser- sitive profiles obtained from the PfamA ine protease-like proteins, not identified in other eukary- database[37], using IMPALA[17]. A sequence aligning otes, which may have originated from ancestral with any of the serine-protease domains considered for chloroplast genomes and may be indicative of ancient study with an E-value of <10-5 by IMPALA[17] was consid- endosymbiotic events leading to evolution of chloro- ered as a potential member of that family. plasts. Comparison of Arabidopsis thaliana serine protease-

Page 25 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

d. A similar approach was carried out where each of Arabi- Multiple sequence alignment and phylogenetic analysis dopsis and rice genome sequences was scanned against Multiple sequence alignments of the serine-protease protein family profiles obtained from the NCBI (National domains were performed using CLUSTALW program[95]. Centre for Biotechnology Information) conserved In order to compare equivalent regions, the domain domain database (NCBI-CDD[18]), a compilation of regions were retrieved employing HMMALIGN[16], multiple sequence alignments representing protein sequence to profile matching method against the PfamA domains. It has been populated with alignment data from database[37]. Proteins lacking a significant portion of the Pfam[37] and SMART [90], plus contributions from NCBI protease-like domain were not included in alignments. A such as COG[91] using RPS-BLAST[18]. The E-value Blosum 30 matrix, an open gap penalty of 10 and an threshold was the same as employed for IMPALA[17]. extension penalty of 0.05 were the parameters employed for multiple sequence alignment. An overall phylogenetic e. The hits obtained from all the above methods were tree was inferred from the multiple sequence alignment pooled together and a non-redundant dataset was with PHYLIP (Phylogeny Inference Package) 3.65[96]. derived. Proteins in this dataset were clustered using CD- Bootstrapping was performed 100 times using SEQ- HIT program[92] and MALIGN[93] and the hits sharing BOOT[96] to obtain support values for each internal 100% sequence identity with other proteins were classi- branch (to reduce the sampling error, bootstrapping is a fied as redundant and not considered for further analysis. method of testing the reliability of a dataset by the crea- Proteins appearing as fragments or identical to larger pro- tion of pseudo replicate datasets by resampling. Boot- teins were also purged from the dataset. strapping assesses whether stochastic effects have influenced the distribution of amino acids). Pairwise dis- Sequence properties of Arabidopsis and rice serine tances were determined with PROTDIST[96]. Neighbor- proteases joining phylogenetic trees were calculated with NEIGH- a. The cellular localizations of serine protease-like pro- BOR[96] using standard parameters. The majority-rule teins identified in the two plant species was predicted consensus trees of all bootstrapped sequences were using TargetP[19]. obtained with the program CONSENSE[96]. Representa- tions of the calculated trees were constructed using b. Co-existing domains were predicted employing TreeView[97]. Clusters with bootstrap values greater than HMMPFAM[16] and RPS-BLAST[18] sequence to profile 50% were defined as confirmed subgroups, and sequences matching methods. Each serine-protease sequence was with lower values added to these subgroups according to matched to protein family profiles corresponding to their sequence similarity in the alignment as judged by PfamA[37] and NCBI-CDD[18]. Conservation of domain visual inspection. The pairwise percentage identity architectures across various lineages (Figure 2) was deter- between the serine protease-like domain regions of any mined using NCBI Conserved domain architecture two sequences belonging to the same serine protease fam- retrieval tool (CDART)[94] and Pfam database[37]. ily was determined by MALFORM, a constituent of MALIGN multiple alignment program[93]. Orthologue sequence analysis Each of the Arabidopsis and rice serine protease-like pro- Abbreviations teins was searched against the other proteome and con- AGI- Arabidopsis Genome Initiative; IRGSP- International versely, the nearest homologue was searched back in first Rice Genome Sequencing Project; TIGR- The Institute for proteome. Two sequences were defined as orthologues Genomic Research when each of them was the best hit of the other and if they exhibit the same domain architecture (See additional file Authors' contributions 4: Table S4.pdf). LT carried out the computational sequence analysis. LT and RS conceived of the study and participated in its Chromosomal locations and recent duplications design and coordination. LT authored the first draft of this The chromosomal locations for all Arabidopsis and rice ser- manuscript and RS provided comments and revisions to ine protease sequences were retrieved from TIGR[20]. the final version of this text. Both authors read and Subsequently, the Arabidopsis and rice proteomes were approved the final manuscript. searched for gene paralogues using a BLAST[18] based approach similar to the one employed for orthologue sequence analysis and two sequences were defined as most recent paralogues when each of them was the best non-self hit of the other (Tables 2, 3).

Page 26 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Additional material Additional file 5 Figure SF1. Multiple sequence alignment and phylogenetic analysis of Additional file 1 Arabidopsis and rice DegP protease-like proteins. A: Multiple sequence Table S1. An inventory of Arabidopsis thaliana serine protease-like pro- alignment of Arabidopsis and rice DegP protease-like proteins. Multiple teins. An inventory of Arabidopsis thaliana serine protease-like proteins sequence alignment of the DegP domain region of the annotated Arabi- identified by multifold approach (see methods for details). The list dopsis and rice DegP protease-like proteins. The catalytic triad residues includes gene identifiers, predicted subcellular localization, chromosome are indicated. Gene names correspond to those in Additional files 1 and location, chromosomal nucleotide position and domain architectures of 2. For brevity, rice gene names have been shortened to OsXXg##### serine proteases identified in current analysis instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a Click here for file 5 digit number assigned to each gene. B: Phylogenetic analysis of Arabi- [http://www.biomedcentral.com/content/supplementary/1471- dopsis and rice DegP protease-like proteins. Unrooted N-J tree computed 2164-7-200-S1.pdf] from multiple sequence alignments of Arabidopsis (red) and rice (blue) DegP protease domains. DegP protease-like protease domains were Additional file 2 aligned using ClustalW[95] program and the alignments were exported Table S2. An inventory of rice serine protease-like proteins. An inventory to Phylip package[96] for representing the Neighbor-Joining tree (see of rice serine protease-like proteins identified by multifold approach (see methods). The colors and circles represent different evolutionary clades methods for details). The list includes gene identifiers, predicted subcellu- identified in the analysis (see text for details). Clade I is represented in lar localization, chromosome location, chromosomal nucleotide position orange, Clade II is shaded green while Clade III is labelled in purple. For and domain architectures of serine proteases identified in current analysis clarity, bootstrap values were replaced with symbols representing bootstrap Click here for file percentages >50%. Bootstrap values between 50–60% are represented by [http://www.biomedcentral.com/content/supplementary/1471- an asterix, circles represent bootstrap values from 60%–80% while boot- 2164-7-200-S2.pdf] strap values >80% are represented by rectangles. Gene names correspond to those listed in Tables 2 and 3. For brevity, rice gene names have been Additional file 3 shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. Table S3. Background information on serine proteases. Additional litera- Click here for file ture information on serine protease families taken up for study in current [http://www.biomedcentral.com/content/supplementary/1471- analysis. The information is categorized into three parts namely a brief 2164-7-200-S5.pdf] structural overview, enzyme characteristics and functional information where known. Additional references for the material contained in the file have been provided at the end. Additional file 6 Click here for file Figure SF2. Multiple sequence alignment of Arabidopsis and rice subtili- [http://www.biomedcentral.com/content/supplementary/1471- sin-like proteins. Multiple sequence alignment of the subtilisin domain 2164-7-200-S3.pdf] region of the annotated Arabidopsis and rice subtilisin protease-like pro- teins. The catalytic triad residues are indicated. Gene names correspond Additional file 4 to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to Table S4. Orthologous serine protease-like proteins identified in the two chromosome 1–12 and a 5 digit number assigned to each gene. plant species. A list of Arabidopsis thaliana serine protease-like proteins Click here for file and their putative orthologues in rice genome (see methods for details). [http://www.biomedcentral.com/content/supplementary/1471- Functional information gathered from literature has been provided for 2164-7-200-S6.pdf] Arabidopsis gene products where possible Click here for file [http://www.biomedcentral.com/content/supplementary/1471- Additional file 7 2164-7-200-S4.pdf] Figure SF3. Multiple sequence alignment of Arabidopsis and rice prolyl oligopeptidase -like proteins Multiple sequence alignment of the Prolyl oli- gopeptidase domain region of the annotated Arabidopsis and rice prolyl oligopeptidase-like proteins. The catalytic triad residues are indicated. Gene names correspond to those in Additional files 1 and 2. For brevity, rice gene names have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit number assigned to each gene. Based on the conservation of residues around the catalytic Ser residue, a few gene products were unambiguously assigned to their respective subfamilies prefixed by figures in parentheses (See text for details) Click here for file [http://www.biomedcentral.com/content/supplementary/1471- 2164-7-200-S7.pdf]

Page 27 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

Additional file 8 Additional file 11 Figure SF4. Multiple sequence alignment and phylogenetic analysis of Figure SF7. Multiple sequence alignment of Arabidopsis and rice serine Arabidopsis and rice serine carboxypeptidase-like proteins A: Multiple Type I SPase-like proteins. Unrooted N-J tree computed from multiple sequence alignment of Arabidopsis and rice serine carboxypeptidase-like sequence alignments of Arabidopsis (red) and rice (blue) Type I SPase proteins. Multiple sequence alignment of the serine carboxypeptidase domains. Type I SPase domains were aligned using ClustalW[95] pro- domain region of the annotated Arabidopsis and rice serine carboxypepti- gram and the alignments were exported to Phylip package[96] for repre- dase-like proteins. The catalytic triad residues are indicated. Gene names senting the Neighbor-Joining tree (see methods). The colors represent the correspond to those in Additional files 1 and 2. For brevity, rice gene two evolutionary clades identified in the analysis (see text for details). names have been shortened to OsXXg##### instead of Clade I is represented in green Clade II is shaded orange. For clarity, boot- LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit strap values were replaced with symbols representing bootstrap percentages number assigned to each gene. B: Phylogenetic analysis of Arabidopsis >50%. Bootstrap values between 50–60% are represented by an asterix, and rice serine carboxypeptidase-like proteins. Unrooted N-J tree com- circles represent bootstrap values from 60%–80% while bootstrap values puted from multiple sequence alignments of Arabidopsis (red) and rice >80% are represented by rectangles. Gene names correspond to those (blue) serine carboxypeptidase domains. Serine carboxypeptidase-like pro- listed in Tables 2 and 3. For brevity, rice gene names have been shortened tease domains were aligned using ClustalW[95] program and the align- to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromo- ments were exported to Phylip package[96] for representing the Neighbor- some 1–12 and a 5 digit number assigned to each gene. Joining tree (see methods). The colors and circles represent different evo- Click here for file lutionary clades identified in the analysis (see text for details). Clade I is [http://www.biomedcentral.com/content/supplementary/1471- represented in orange, Clade II is shaded green while the branches con- 2164-7-200-S11.pdf] necting the remaining sequences are labelled in purple. For clarity, boot- strap values were replaced with symbols representing bootstrap percentages Additional file 12 >50%. Bootstrap values between 50–60% are represented by an asterix, Figure SF8. Multiple sequence alignment of Arabidopsis and rice serine circles represent bootstrap values from 60%–80% while bootstrap values S28 protease-like proteins. Multiple sequence alignment of the family S28 >80% are represented by rectangles. Gene names correspond to those protease domain region of the annotated Arabidopsis and rice family S28 listed in Tables 2 and 3. For brevity, rice gene names have been shortened protease-like proteins. The catalytic triad residues are indicated. Gene to OsXXg##### instead of LOC_OsXXg#####, XX referring to chromo- names correspond to those in Additional files 1 and 2. For brevity, rice some 1–12 and a 5 digit number assigned to each gene. gene names have been shortened to OsXXg##### instead of Click here for file LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit [http://www.biomedcentral.com/content/supplementary/1471- number assigned to each gene. 2164-7-200-S8.pdf] Click here for file [http://www.biomedcentral.com/content/supplementary/1471- Additional file 9 2164-7-200-S12.pdf] Figure SF5. Pairwise sequence alignment of Arabidopsis and rice serine beta-lactamase -like proteins. Pairwise sequence alignment of the serine Additional file 13 beta-lactamase domain region of the annotated Arabidopsis and rice ser- Figure SF9. Multiple sequence alignment of Arabidopsis and rice Ctpase- ine beta-lactamase-like proteins. The catalytic residues are indicated. like proteins. Multiple sequence alignment of the Ctpase domain region of Gene names correspond to those in Additional files 1 and 2. For brevity, the annotated Arabidopsis and rice Ctpase-like proteins. The catalytic rice gene names have been shortened to OsXXg##### instead of dyad residues are indicated. Gene names correspond to those in Additional LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit files 1 and 2. For brevity, rice gene names have been shortened to number assigned to each gene. OsXXg##### instead of LOC_OsXXg#####, XX referring to chromo- Click here for file some 1–12 and a 5 digit number assigned to each gene. [http://www.biomedcentral.com/content/supplementary/1471- Click here for file 2164-7-200-S9.pdf] [http://www.biomedcentral.com/content/supplementary/1471- 2164-7-200-S13.pdf] Additional file 10 Figure SF6. Multiple sequence alignment of Arabidopsis and rice serine Additional file 14 Lon protease-like proteins. Multiple sequence alignment of the Lon_C pro- Figure SF10. Multiple sequence alignment of Arabidopsis and rice pro- tease domain region of the annotated Arabidopsis and rice Lon protease- tease IV-like proteins. Multiple sequence alignment of the protease IV like proteins. The catalytic dyad residues are indicated. Gene names cor- domain region of the annotated Arabidopsis and rice protease IV-like respond to those in Additional files 1 and 2. For brevity, rice gene names proteins. The active site serine residue is indicated. Gene names are have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX derived from those in Additional files 1 and 2 as follows: A1a73990 referring to chromosome 1–12 and a 5 digit number assigned to each (At1g73390 N-terminal domain); A1b73990 (At1g73390 C-terminal gene. domain); O2a49570 (Os02g49570 N-terminal domain); O2b49570 Click here for file (Os02g49570 C-terminal domain). For brevity, rice gene names have [http://www.biomedcentral.com/content/supplementary/1471- been shortened to OsXXg##### instead of LOC_OsXXg#####, XX refer- 2164-7-200-S10.pdf] ring to chromosome 1–12 and a 5 digit number assigned to each gene. Click here for file [http://www.biomedcentral.com/content/supplementary/1471- 2164-7-200-S14.pdf]

Page 28 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

12. Beers EP, Jones AM, Dickerman AW: The S8 serine, C1A cysteine Additional file 15 and A1 aspartic protease families in Arabidopsis. Phytochemis- try 2004, 65:43-58. Figure SF11. Multiple sequence alignment of Arabidopsis and rice rhom- 13. Sinvany-Villalobo G, Davydov O, Ben-Ari G, Zaltsman A, Raskind A, boid-like proteins. Multiple sequence alignment of the rhomboid protease Adam Z: Expression in multigene families. Analysis of chloro- domain region of the annotated Arabidopsis and rice rhomboid protease- plast and mitochondrial proteases. Plant Physiol 2004, like proteins. The catalytic triad residues are indicated. Gene names cor- 135:1336-1345. respond to those in Additional files 1 and 2. For brevity, rice gene names 14. AGI: Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 2000, 408:796-815. have been shortened to OsXXg##### instead of LOC_OsXXg#####, XX 15. IRSGP: The map-based sequence of the rice genome. Nature referring to chromosome 1–12 and a 5 digit number assigned to each 2005, 436:793-800. gene. Several sequences identified here display mutations in key catalytic 16. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, residues and may possibly have evolved a regulatory function (See text for 14:755-763. details). 17. Schaffer AA, Wolf YI, Ponting CP, Koonin EV, Aravind L, Altschul SF: Click here for file IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. [http://www.biomedcentral.com/content/supplementary/1471- Bioinformatics 1999, 15:1000-1011. 2164-7-200-S15.pdf] 18. Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Additional file 16 Liebert CA, Liu C, Lu F, Marchler GH, Mullokandov M, Shoemaker BA, Simonyan V, Song JS, Thiessen PA, Yamashita RA, Yin JJ, Zhang D, Figure SF12. Multiple sequence alignment of Arabidopsis and rice nucle- Bryant SH: CDD: a Conserved Domain Database for protein oporin autopeptidase-like proteins. Multiple sequence alignment of the classification. Nucleic Acids Res 2005, 33:D192-6. nucleoporin autopeptidase domain region of the annotated Arabidopsis 19. Emanuelsson O, Nielsen H, Brunak S, von Heijne G: Predicting sub- and rice nucleoporin autopeptidase-like proteins. The catalytic residues are cellular localization of proteins based on their N-terminal indicated. Gene names correspond to those in Additional files 1 and 2. For amino acid sequence. J Mol Biol 2000, 300:1005-1016. 20. TIGR: The Institute for Genomic Research [http:// brevity, rice gene names have been shortened to OsXXg##### instead of www.tigr.org]. . LOC_OsXXg#####, XX referring to chromosome 1–12 and a 5 digit 21. Pils B, Schultz J: Inactive enzyme-homologues find new func- number assigned to each gene. tion in regulatory processes. J Mol Biol 2004, 340:399-404. Click here for file 22. Krojer T, Garrido-Franco M, Huber R, Ehrmann M, Clausen T: Crys- [http://www.biomedcentral.com/content/supplementary/1471- tal structure of DegP (HtrA) reveals a new protease-chaper- 2164-7-200-S16.pdf] one machine. Nature 2002, 416:455-459. 23. Hung AY, Sheng M: PDZ domains: structural modules for pro- tein complex assembly. J Biol Chem 2002, 277:5699-5702. 24. Spiers A, Lamb HK, Cocklin S, Wheeler KA, Budworth J, Dodds AL, Pallen MJ, Maskell DJ, Charles IG, Hawkins AR: PDZ domains facil- itate binding of high temperature requirement protease A Acknowledgements (HtrA) and tail-specific protease (Tsp) to heterologous sub- L.T. is supported by Senior Research Fellowship from the Council of Scien- strates through recognition of the small stable RNA A tific and Industrial Research (CSIR) Govt. of India and R.S. was a Senior (ssrA)-encoded peptide. J Biol Chem 2002, 277:39443-39449. Research Fellow of the Wellcome Trust, U.K. We also thank NCBS (TIFR) 25. Wilken C, Kitzing K, Kurzbauer R, Ehrmann M, Clausen T: Crystal structure of the DegS stress sensor: How a PDZ domain rec- for infrastructural support. No explicit funding was required for the collec- ognizes misfolded protein and activates a protease. Cell 2004, tion of data and the preparation of the manuscript. 117:483-494. 26. Murwantoko, Yano M, Ueta Y, Murasaki A, Kanda H, Oka C, Kawaichi M: Binding of proteins to the PDZ domain regulates proteo- References lytic activity of HtrA1 serine protease. Biochem J 2004, 1. Callis J: Regulation of Protein Degradation. Plant Cell 1995, 381:895-904. 7:845-857. 27. Spiess C, Beil A, Ehrmann M: A temperature-dependent switch 2. Schaller A: A cut above the rest: the regulatory function of from chaperone to protease in a widely conserved heat plant proteases. Planta 2004, 220:183-197. shock protein. Cell 1999, 97:339-347. 3. Barrett AJ, Rawlings ND: Families and clans of serine pepti- 28. Koonin EV, Aravind L: Origin and evolution of eukaryotic apop- dases. Arch Biochem Biophys 1995, 318:247-250. tosis: the bacterial connection. Cell Death Differ 2002, 9:394-404. 4. Rawlings ND, Barrett AJ: Dipeptidyl-peptidase II is related to 29. Clausen T, Southan C, Ehrmann M: The HtrA family of proteases: lysosomal Pro-X carboxypeptidase. Biochim Biophys Acta 1996, implications for protein composition and cell fate. Mol Cell 1298:1-3. 2002, 10:443-455. 5. Rawlings ND, Morton FR, Barrett AJ: MEROPS: the peptidase 30. Itzhaki H, Naveh L, Lindahl M, Cook M, Adam Z: Identification and database. Nucleic Acids Res 2006, 34:D270-2. characterization of DegP, a serine protease associated with 6. Adam Z, Adamska I, Nakabayashi K, Ostersetzer O, Haussuhl K, the luminal side of the thylakoid membrane. J Biol Chem 1998, Manuell A, Zheng B, Vallon O, Rodermel SR, Shinozaki K, Clarke AK: 273:7094-7098. Chloroplast and mitochondrial proteases in Arabidopsis. A 31. Chassin Y, Kapri-Pardes E, Sinvany G, Arad T, Adam Z: Expression proposed nomenclature. Plant Physiol 2001, 125:1912-1918. and characterization of the thylakoid lumen protease DegP1 7. Adam Z, Clarke AK: Cutting edge of chloroplast proteolysis. from Arabidopsis. Plant Physiol 2002, 130:857-864. Trends Plant Sci 2002, 7:451-456. 32. Kanervo E, Spetea C, Nishiyama Y, Murata N, Andersson B, Aro EM: 8. Palma JMSLMCFJRPMMCIRLA: Plant proteases, protein degra- Dissecting a cyanobacterial proteolytic system: efficiency in dation, and oxidative stress: role of peroxisomes. Plant Physiol inducing degradation of the D1 protein of photosystem II in Biochem 2002, 40:521-530. cyanobacteria and plants. Biochim Biophys Acta 2003, 9. Li AX, Steffens JC: An acyltransferase catalyzing the formation 1607:131-140. of diacylglucose is a serine carboxypeptidase-like protein. 33. Siezen RJ, Leunissen JA: Subtilases: the superfamily of subtilisin- Proc Natl Acad Sci U S A 2000, 97:6902-6907. like serine proteases. Protein Sci 1997, 6:501-523. 10. Steffens JC: Acyltransferases in protease's clothing. Plant Cell 34. Meichtry J, Amrhein N, Schaller A: Characterization of the subti- 2000, 12:1253-1256. lase gene family in tomato (Lycopersicon esculentum Mill.). 11. Milkowski C, Strack D: Serine carboxypeptidase-like acyltrans- Plant Mol Biol 1999, 39:749-760. ferases. Phytochemistry 2004, 65:517-524. 35. Rautengarten C, Steinhauser D, Bussis D, Stintzi A, Schaller A, Kopka J, Altmann T: Inferring Hypotheses on Functional Relation-

Page 29 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

ships of Genes: Analysis of the Arabidopsis thaliana Subtilase 59. Chaal BK, Mould RM, Barbrook AC, Gray JC, Howe CJ: Character- Gene Family. PLoS Comput Biol 2005, 1:e40. ization of a cDNA encoding the thylakoidal processing pepti- 36. Mahon P, Bateman A: The PA domain: a protease-associated dase from Arabidopsis thaliana. Implications for the origin domain. Protein Sci 2000, 9:1930-1934. and catalytic mechanism of the enzyme. J Biol Chem 1998, 37. Bateman A, Coin L, Durbin R, Finn RD, Hollich V, Griffiths-Jones S, 273:689-692. Khanna A, Marshall M, Moxon S, Sonnhammer EL, Studholme DJ, 60. Sokolenko A, Pojidaeva E, Zinchenko V, Panichkin V, Glaser VM, Her- Yeats C, Eddy SR: The Pfam protein families database. Nucleic rmann RG, Shestakov SV: The gene complement for proteolysis Acids Res 2004, 32:D138-41. in the cyanobacterium Synechocystis sp. PCC 6803 and Ara- 38. Rosenblum JS, Kozarich JW: Prolyl peptidases: a serine protease bidopsis thaliana chloroplasts. Curr Genet 2002, 41:291-310. subfamily with high potential for drug discovery. Curr Opin 61. Inoue K, Baldwin AJ, Shipman RL, Matsui K, Theg SM, Ohme-Takagi Chem Biol 2003, 7:496-504. M: Complete maturation of the plastid protein translocation 39. Polgar L: The prolyl oligopeptidase family. Cell Mol Life Sci 2002, channel requires a type I signal peptidase. J Cell Biol 2005, 59:349-362. 171:425-430. 40. Venalainen JI, Juvonen RO, Mannisto PT: Evolutionary relation- 62. Carrier A, Wurbel MA, Mattei MG, Kissenpfennig A, Malissen M, Mal- ships of the prolyl oligopeptidase family enzymes. Eur J Bio- issen B: Chromosomal localization of two mouse genes chem 2004, 271:2705-2715. encoding thymus-specific serine peptidase and thymus- 41. Lehfeldt C, Shirley AM, Meyer K, Ruegger MO, Cusumano JC, Vii- expressed acidic protein. Immunogenetics 2000, 51:984-986. tanen PV, Strack D, Chapple C: Cloning of the SNG1 gene of 63. Shariat-Madar Z, Mahdi F, Schmaier AH: Identification and char- Arabidopsis reveals a role for a serine carboxypeptidase-like acterization of prolylcarboxypeptidase as an endothelial cell protein as an acyltransferase in secondary metabolism. Plant prekallikrein activator. J Biol Chem 2002, 277:17962-17969. Cell 2000, 12:1295-1306. 64. Keiler KC, Sauer RT: Sequence determinants of C-terminal 42. Cercos M, Urbez C, Carbonell J: A serine carboxypeptidase gene substrate recognition by the Tsp protease. J Biol Chem 1996, (PsCP), expressed in early steps of reproductive and vegeta- 271:2589-2593. tive development in Pisum sativum, is induced by gibberel- 65. Diner BA, Ries DF, Cohen BN, Metz JG: COOH-terminal lins. Plant Mol Biol 2003, 51:165-174. processing of polypeptide D1 of the photosystem II reaction 43. Shirley AM, Chapple C: Biochemical characterization of center of Scenedesmus obliquus is necessary for the assem- sinapoylglucose:choline sinapoyltransferase, a serine carbox- bly of the oxygen-evolving complex. J Biol Chem 1988, ypeptidase-like protein that functions as an acyltransferase 263:8972-8980. in plant secondary metabolism. J Biol Chem 2003, 66. Liao DI, Qian J, Chisholm DA, Jordan DB, Diner BA: Crystal struc- 278:19870-19877. tures of the photosystem II D1 C-terminal processing pro- 44. Hause B, Meyer K, Viitanen PV, Chapple C, Strack D: Immunolocal- tease. Nat Struct Biol 2000, 7:749-753. ization of 1- O-sinapoylglucose:malate sinapoyltransferase in 67. Suzuki T, Itoh A, Ichihara S, Mizushima S: Characterization of the Arabidopsis thaliana. Planta 2002, 215:26-32. sppA gene coding for protease IV, a signal peptide peptidase 45. Fraser CM, Rider LW, Chapple C: An expression and bioinfor- of Escherichia coli. J Bacteriol 1987, 169:2523-2528. matics analysis of the Arabidopsis serine carboxypeptidase- 68. Lensch M, Herrmann RG, Sokolenko A: Identification and charac- like gene family. Plant Physiol 2005, 138:1136-1148. terization of SppA, a novel light-inducible chloroplast pro- 46. Hall BG, Barlow M: Evolution of the serine beta-lactamases: tease complex associated with thylakoid membranes. J Biol past, present and future. Drug Resist Updat 2004, 7:111-123. Chem 2001, 276:33645-33651. 47. Galleni M, Lamotte-Brasseur J, Raquet X, Dubus A, Monnaie D, Knox 69. Freeman M: Rhomboids. Curr Biol 2003, 13:R586. JR, Frere JM: The enigmatic catalytic mechanism of active-site 70. Freeman M: Proteolysis within the membrane: rhomboids serine beta-lactamases. Biochem Pharmacol 1995, 49:1171-1178. revealed. Nat Rev Mol Cell Biol 2004, 5:188-197. 48. Petrosino J, Cantu C, Palzkill T: beta-Lactamases: protein evolu- 71. Koonin EV, Makarova KS, Rogozin IB, Davidovic L, Letellier MC, Pel- tion in real time. Trends Microbiol 1998, 6:323-327. legrini L: The rhomboids: a nearly ubiquitous family of intram- 49. Chandu D, Nandi D: Comparative genomics and functional embrane serine proteases that probably evolved by multiple roles of the ATP-dependent proteases Lon and Clp during ancient horizontal gene transfers. Genome Biol 2003, 4:R19. cytosolic protein degradation. Res Microbiol 2004, 155:710-719. 72. Urban S, Lee JR, Freeman M: Drosophila rhomboid-1 defines a 50. Wickner S, Maurizi MR: Here's the hook: similar substrate bind- family of putative intramembrane serine proteases. Cell 2001, ing sites in the chaperone domains of Clp and Lon. Proc Natl 107:173-182. Acad Sci U S A 1999, 96:8318-8320. 73. Lemberg MK, Menendez J, Misik A, Garcia M, Koth CM, Freeman M: 51. Kuroda H, Maliga P: The plastid clpP1 protease gene is essential Mechanism of intramembrane proteolysis investigated with for plant development. Nature 2003, 425:86-89. purified rhomboid proteases. Embo J 2005, 24:464-472. 52. Peltier JB, Ripoll DR, Friso G, Rudella A, Cai Y, Ytterberg J, Giacomelli 74. Kanaoka MM, Urban S, Freeman M, Okada K: An Arabidopsis L, Pillardy J, van Wijk KJ: Clp protease complexes from photo- Rhomboid homolog is an intramembrane protease in plants. synthetic and non-photosynthetic plastids and mitochondria FEBS Lett 2005, 579:5723-5728. of plants, their predicted three-dimensional structures, and 75. van der Bliek AM, Koehler CM: A mitochondrial rhomboid pro- functional implications. J Biol Chem 2004, 279:4768-4781. tease. Dev Cell 2003, 4:769-770. 53. Rotanova TV, Melnikov EE, Khalatova AG, Makhovskaya OV, Botos I, 76. Teixeira MT, Fabre E, Dujon B: Self-catalyzed cleavage of the Wlodawer A, Gustchina A: Classification of ATP-dependent yeast nucleoporin Nup145p precursor. J Biol Chem 1999, proteases Lon and comparison of the active sites of their 274:32439-32444. proteolytic domains. Eur J Biochem 2004, 271:4865-4871. 77. Hodel AE, Hodel MR, Griffis ER, Hennig KA, Ratner GA, Xu S, Pow- 54. Fukui T, Eguchi T, Atomi H, Imanaka T: A membrane-bound ers MA: The three-dimensional structure of the autoproteo- archaeal Lon protease displays ATP-independent proteo- lytic, nuclear pore-targeting domain of the human lytic activity towards unfolded proteins and ATP-dependent nucleoporin Nup98. Mol Cell 2002, 10:347-358. activity for folded proteins. J Bacteriol 2002, 184:3689-3698. 78. Zhang Y, Li X: A Putative Nucleoporin 96 Is Required for Both 55. Sarria R, Lyznik A, Vallejos CE, Mackenzie SA: A cytoplasmic male Basal Defense and Constitutive Resistance Responses Medi- sterility-associated mitochondrial peptide in common bean ated by suppressor of npr1-1, constitutive 1. Plant Cell 2005. is post-translationally regulated. Plant Cell 1998, 10:1217-1228. 79. Ponting CP, Russell RR: The natural history of protein domains. 56. Iyer LM, Leipe DD, Koonin EV, Aravind L: Evolutionary history Annu Rev Biophys Biomol Struct 2002, 31:45-71. and higher order classification of AAA+ ATPases. J Struct Biol 80. Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA: Struc- 2004, 146:11-31. ture, function and evolution of multidomain proteins. Curr 57. Paetzel M, Karla A, Strynadka NC, Dalbey RE: Signal peptidases. Opin Struct Biol 2004, 14:208-216. Chem Rev 2002, 102:4549-4580. 81. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on 58. van Roosmalen ML, Geukens N, Jongbloed JD, Tjalsma H, Dubois JY, protein families. Science 1997, 278:631-637. Bron S, van Dijl JM, Anne J: Type I signal peptidases of Gram- 82. Raes J, Van de Peer Y: Gene duplication, the evolution of novel positive bacteria. Biochim Biophys Acta 2004, 1694:279-297. gene functions, and detecting functional divergence of dupli- cates in silico. Appl Bioinformatics 2003, 2:91-101.

Page 30 of 31 (page number not for citation purposes) BMC Genomics 2006, 7:200 http://www.biomedcentral.com/1471-2164/7/200

83. Lawton-Rauh A: Evolutionary dynamics of duplicated genes in plants. Mol Phylogenet Evol 2003, 29:396-409. 84. Cannon SB, Mitra A, Baumgarten A, Young ND, May G: The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC Plant Biol 2004, 4:10. 85. Yu J, Wang J, Lin W, Li S, Li H, Zhou J, Ni P, Dong W, Hu S, Zeng C, Zhang J, Zhang Y, Li R, Xu Z, Li X, Zheng H, Cong L, Lin L, Yin J, Geng J, Li G, Shi J, Liu J, Lv H, Li J, Deng Y, Ran L, Shi X, Wang X, Wu Q, Li C, Ren X, Li D, Liu D, Zhang X, Ji Z, Zhao W, Sun Y, Zhang Z, Bao J, Han Y, Dong L, Ji J, Chen P, Wu S, Xiao Y, Bu D, Tan J, Yang L, Ye C, Xu J, Zhou Y, Yu Y, Zhang B, Zhuang S, Wei H, Liu B, Lei M, Yu H, Li Y, Xu H, Wei S, He X, Fang L, Huang X, Su Z, Tong W, Tong Z, Ye J, Wang L, Lei T, Chen C, Chen H, Huang H, Zhang F, Li N, Zhao C, Huang Y, Li L, Xi Y, Qi Q, Li W, Hu W, Tian X, Jiao Y, Liang X, Jin J, Gao L, Zheng W, Hao B, Liu S, Wang W, Yuan L, Cao M, McDermott J, Samudrala R, Wong GK, Yang H: The Genomes of Oryza sativa: a history of duplications. PLoS Biol 2005, 3:e38. 86. Blanc G, Wolfe KH: Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 2004, 16:1679-1691. 87. Seoighe C, Gehring C: Genome duplication led to highly selec- tive expansion of the Arabidopsis thaliana proteome. Trends Genet 2004, 20:461-464. 88. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lip- man DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402. 89. Bhaduri A, Sowdhamini R: A genome-wide survey of human tyrosine phosphatases. Protein Eng 2003, 16:881-888. 90. Letunic I, Copley RR, Schmidt S, Ciccarelli FD, Doerks T, Schultz J, Ponting CP, Bork P: SMART 4.0: towards genomic data integra- tion. Nucleic Acids Res 2004, 32:D142-4. 91. Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S, Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale DA: The COG database: an updated version includes eukaryotes. BMC Bioinformatics 2003, 4:41. 92. Li W, Jaroszewski L, Godzik A: Clustering of highly homologous sequences to reduce the size of large protein databases. Bio- informatics 2001, 17:282-283. 93. Johnson MS, Overington JP, Blundell TL: Alignment and searching for common protein folds using a data bank of structural templates. J Mol Biol 1993, 231:735-752. 94. Geer LY, Domrachev M, Lipman DJ, Bryant SH: CDART: protein homology by domain architecture. Genome Res 2002, 12:1619-1623. 95. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-4680. 96. Felsenstein J: PHYLIP (Phylogeny Inference Package) version 3.6. Distributed by the author Department of Genetics, University of Washington, Seattle 2005. 97. Page RD: TreeView: an application to display phylogenetic trees on personal computers. Comput Appl Biosci 1996, 12:357-358.

Publish with BioMed Central and every scientist can read your work free of charge "BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime." Sir Paul Nurse, Cancer Research UK Your research papers will be: available free of charge to the entire biomedical community peer reviewed and published immediately upon acceptance cited in PubMed and archived on PubMed Central yours — you keep the copyright

Submit your manuscript here: BioMedcentral http://www.biomedcentral.com/info/publishing_adv.asp

Page 31 of 31 (page number not for citation purposes)