<<

Biomolecules 2014, 4, 268-288; doi:10.3390/biom4010268 OPEN ACCESS biomolecules ISSN 2218-273X www.mdpi.com/journal/biomolecules/

Article Similar Structures to the E-to-H Helix Unit in the Globin-Like Fold are Found in Other Helical Folds

Masanari Matsuoka 1,2, Aoi Fujita 1, Yosuke Kawai 1,† and Takeshi Kikuchi 1,*

1 Department of Bioinformatics, College of Life Sciences, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan; E-Mails: [email protected] (M.M.); [email protected] (A.F.); [email protected] (Y.K.) 2 Japan Society for the Promotion of Science (JSPS), Ichibancho, Chiyoda-ku, Tokyo 102-8471, Japan

† Present address: Division of Biomedical Information Analysis, Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, Sendai, Miyagi 980-8575, Japan

* Author to whom correspondence should be addressed; E-Mail: [email protected]; Tel.: +81-77-561-5909; Fax: +81-77-561-5203.

Received: 6 December 2013; in revised form: 11 February 2014 / Accepted: 13 February 2014 / Published: 27 February 2014

Abstract: A protein in the globin-like fold contains six alpha-helices, A, B, E, F, G and H. Among them, the E-to-H helix unit (E, F, G and H helices) forms a compact structure. In this study, we searched similar structures to the E-to-H helix of leghomoglobin in the whole protein structure space using the Dali program. Several similar structures were found in other helical folds, such as KaiA/RbsU domain and Type III secretion system domain. These observations suggest that the E-to-H helix unit may be a common subunit in the whole protein 3D structure space. In addition, the common conserved hydrophobic residues were found among the similar structures to the E-to-H helix unit. Hydrophobic interactions between the conserved residues may stabilize the 3D structures of the unit. We also predicted the possible compact regions of the units using the average distance method.

Keywords: Globin-like fold; super-fold; keyword; dali search; average distance map

Biomolecules 2014, 4 269

1. Introduction

To elucidate how a protein folds into its unique 3D structure is a long-standing unsolved problem in molecular bioinformatics and molecular biophysics. To tackle this problem, we have to know how the information of a protein’s 3D structure construction is coded in its amino acid sequence. Protein folds are characterized by some combinations of secondary structural elements, and such structural characteristics of a protein are called its fold or topology. Some protein folds have neither sequence nor functional similarity. Proteins with these folds frequently appear in the structure database, that is, more than 30% of determined structures and such folds are called a “superfold” [1]. In such a case, one might ask if there is a common property in their amino acid sequences that leads to the same fold. Alternatively, is there a partial key 3D structure that leads to the same topology? If so, it should be a common 3D structural unit, which leads to a specific topology. It is well-recognized that some of the proteins classified as having the globin-like fold according to the Structural Classification of Protein Database (SCOP) show a rather low amino acid sequence identity of around 15% in spite of the high conservativeness of the 3D scaffold [1]. The globin-like fold is regarded as a superfold [1]. Figures 1 and 2 show the amino acid sequence alignment and 3D structures of leghemoglobin (soy bean) and myoglobin (sperm whale) as examples. The codes of the Protein Data Bank (PDB) are 1FSL and 1MBN, respectively.

Figure 1. Amino acid sequence alignment of soybean leghemoglobin (PDB: 1FSL) and sperm whale myoglobin (PDB: 1MBN). The amino acid sequence identity is 15%. White letters with a black background denotes a residue in the -helices of the E-to-H helix unit. The portions labeled by E, F, G and H refer to these -helical regions.

Biomolecules 2014, 4 270

Figure 2. 3D structures of (a) soybean leghemoglobin (PDB: 1FSL) and (b) sperm whale myoglobin (PDB: 1MBN). The portion in light gray is the E-to-H helix unit.

The amino acid sequence identity between them is 15%. As seen in Figure 2, a protein in the globin-like fold contains six helices A, B, E, F, G and H. Nakajima et al. [2] have revealed that the E-to-H helices is thought to be a folding initiation site, especially in the plant hemoglobins which are supposed to retain the ancestral characteristics in the predicted folding mechanism. This analysis corresponds well to the NMR experimental results [3] for leghemoglobin. Furthermore, a major part of the 3D structure of 2-on-2 hemoglobin (Mycobacterium tuberculosis), in which the N-terminal part in the peptide chain is truncated in comparison with leghemoglobin and myoglobin, is the E-to-H helix shown in Figure 3 [2].

Figure 3. 3D structure of 2-on-2 hemoglobin, i.e., Mycobacterium tuberculosis hemoglobin (PDB: 1NGK). The portion in light gray is the E-to-H helix unit. The E-to-H helix unit of this protein is very similar to that in 1FSL or 1MBN.

These observations lead us to speculate that the E-to-H helix part forms an evolutionary conserved 3D scaffold, that is, this part is considered as a structure formation unit. In addition, proteins regarded as members of a superfold sometimes appear in several SCOP folds [1]. Thus, we have come to believe that the fold of the E-to-H helix unit may be widely observed in protein 3D structure space.

Biomolecules 2014, 4 271

The purpose of the present study is to search for 3D structures similar to the E-to-H helix unit as independent folding units in SCOP folds besides the globin-like fold. Namely, we investigate the commonality of the 3D scaffold of the E-to-H helix unit in the protein 3D structure space. One technique employed in this work is a 3D structure alignment technique for the aforementioned purpose, the Dali algorithm [4]. Another technique employed involves a contact map constructed from only the amino acid sequence of a protein based on inter-residue average distance statistics (average distance map, ADM) [5,6]. This technique is used to predict a possible compact region in the 3D structure of a protein. In this study, a heme is not explicitly considered because a protein in the globin-like fold, myoglobin, has been confirmed to fold into its native structure without a heme [3]. A similar concept to the present study has been proposed as “protein lego” which refers to similar partial 3D structures observed in the protein 3D structure space [7,8]. Some analyses of 3D structural properties and stability of -helix bundle were carried out by means of energy calculations [9–19], Wenxiang diagram and so on [20–24]. In the present work, we focus on the E-to-H helix unit predicting whether a corresponding part will form a compact region as a possible folding unit. The organization of the present paper and the way of thinking are as follows. We first show the results of the DALI searches with the 3D structure of the E-to-H helix unit in soy bean leghemoglobin as a query followed by the ADM analyses for all hit proteins to examine whether the partial sequence of the E-to-H helix part in a hit protein exhibits a property to form a compact structure based on the ADM for this protein (Section 2.1). The details of the 3D structures of the proteins obtained in Section 2.1 are summarized in Section 2.2. In the Sections 2.3–2.5, the conservation of hydrophobic residues in homologues of a protein currently considered is presented and the hydrophobic packing formed by conserved hydrophobic residues is analyzed to examin whether a specific residue pattern can be observed in the E-to-H helix unit of a protein treated in the present study. Finally, the commonality of the E-to-H helix unit in the whole protein structure space is discussed.

2. Results

2.1. Dali Search and ADM Analyses

About 7500 hits were obtained by the Dali algorithm with the E-to-H helix unit of soybean leghemoglobin as a query. We picked up proteins with Z ≥ 2.0 so that the number of the false positive is less than about 5%. The hit proteins could be classified into the 690 SCOP families, and the representative protein from each family was selected based on the criterion described in the method section, Section 4.2. The 690 families are summarized in Table S1 of the Supplementary Material. It is somewhat hard to pick up folding units by analyzing compact parts defined in 3D structures. Some proteins contain several compact parts, which may be probable folding units. In this study, in order to detect the property of partial peptide sequences, which tend to form compact structures, we use the ADM method. The ADM method predicts the possible compact regions in a protein from its amino acid sequence and predicted regions correspond well to structural domains [5,6,25], and experimentally observed folding regions including highly protected regions measured by NMR during folding [26–28].

Biomolecules 2014, 4 272

For every representative protein, average distance map (ADM) analysis was performed and we took each protein in which the corresponding E-to-H unit was predicted as a compact region. An average distance map, ADM, is a kind of predicted contact maps and the details on ADM are described in the Section 4.2 (the method section). The ADMs for soybean leghemoglobin (PDB: 1FSL) and sperm whale myoglobin (PDB: 1MBN) are presented in Figure 4. In the ADM for soybean leghemoglobin, the regions 9–34 and 65–140 are predicted as compact regions with  values of 0.228 and 0.324 respectively. A  value denotes an index of the strength of the compactness of a predicted compact region by ADM (see method section for details). The respective predicted regions include helices A–B and E-to-H. From the  values, the E-to-H part can be regarded as the main compact unit in soybean leghemoglobin. On the other hand, the compact units 7–33 and 99–151 are predicted by the ADM for sperm whale myoglobin with the  values of 0.273 and 0.276 for these regions, respectively, with these regions corresponding to helices A–B and G–H as shown in Figure 4b. From the ADM for sperm whale myoglobin, the regions 7–33 and 99–151 are predicted to be equally compact. This prediction reflects the experimental results of folding of myoglobin by NMR [3,26]. This fact suggests that the G-H part sometimes strongly form in the E-to-H helix unit. Therefore, we also took each protein hit by the Dali search in which only the G–H part is predicted to be a compact region, because in sperm whale myoglobin only the G–H part is predicted to be a compact region.

Figure 4. Average distance maps (ADM) for 1FSL (a), and 1MBN (b). The locations of the -helices are labeled by A, B and so on. The label “9–34” denotes the compact region predicted by ADM. A numeral in parentheses indicates the  value of a compact region predicted by ADM.

The ADM predictions were applied for the representative proteins obtained above and finally, five representative proteins were obtained. We confirmed that these five proteins are also all obtained when the E-to-H helix unit from sperm whale myoglobin is used as a query instead of that from leghemoglobin (data not shown). The hit proteins and their details are summarized in Table 1. The amino acid sequences from FASTA files of these five proteins and the positions of the E-to-H helix parts are shown in Figure 5 (a helix is presented by white letters with a black background).

Biomolecules 2014, 4 273

Table 1. Proteins obtained in the present work. Region Hit by DALI Search with the Following Query(Z Score, rmsd(a)) Protein(Source, PDB ID) Family Fold Leghemoglobin E-to-H Myoglobin E-to-H (Soy Bean, 1fsl) (Sperm Whale, 1mbn) circadian clock protein Kai A Circadian clock protein KaiA/RbsU domain E-to-H helices (4.4, 3.8) E-to-H helices (7.0, 3.3) (Synechococcus, 1R8J) KaiA, C-terminal domain Type III secretion system secretion control protein A LcrE-like EGH helices (2.0, 4.5) FGH helices (4.1, 3.9) domain chain(Yersinia, 1XL3) SipA N-terminal cell invasion protein SipA SipA N-terminal domain-like E-to-H helices (2.9, 9.3) H helix (3.0, 8.0) domain-like (Salmonella, 2FM9) transcriptional regulator Tetracyclin repressor-like, Tetracyclin repressor-like, RHA1_ro04179 GH helices (4.4, 9.7) GH helices (2.1, 4.7) C-terminal domain C-terminal domain (Rodococcus, 2NP5) hypothetical protein AF0060 all-alpha NTP GH helices with a part of the E helix AF0060-like GH helices (3.5, 3.5) (E. coli, 2P06) pyrophosphatases (3.2, 4.8) (a): rms distance (Å) between the 3D structures of the aligned regions of the query structure and a hit protein.

Biomolecules 2014, 4 274

Figure 5. The amino acid sequences from FASTA files of (a) Circadian clock protein KaiA (Synechococcus, PDBID:1R8J), (b) Secretion control protein A chain (Yersinia, 1XL3), (c) Cell invasion protein SipA (Salmonella, PDBID: 2FM9), (d) Transcriptional regulator RHA1_ro04179 (Rodococcus, PDBID: 2NP5), and (e) Hypothetical protein AF0060 (E. coli, PDBID: 2P06). White letters with a black background denotes a residue in the -helices of the corresponding E-to-H helix unit.

Biomolecules 2014, 4 275

2.2. Detailed Comparisons of the Folding Units Predicted by ADMs with a Region Hit by the Dali Search

The regions predicted by ADMs for the respective proteins with the  values are summarized in Table 2. The details are as follows. We may use a PDB ID to specify each protein.

Table 2. The summary of the results of the ADMs. Protein (PDB ID) Predicted Folding Unit  Value 9–34 0.228 1FSL 65–140 0.324 7–33 0.273 1MBN 99–151 0.276 5–34 0.218 1R8J 51–82 0.203 223–270 0.254 3–44 0.195 1XL3 77–99 0.202 124–201 0.297 1–51 0.292 2FM9 79–115 0.157 166–199 0.297 5–38 0.253 2NP5 76–103 0.175 128–186 0.372 2P06 3–83 0.370

2.2.1. Circadian Clock Protein KaiA (1R8J)

The 3D structure and ADM of the circadian clock protein KaiA (1R8J) are presented in Figure 6a and 6b. The predicted compact regions are 5–34,51–82 and 223–270 as shown in Figure 6b and the last predicted region with the highest  value corresponds to G and H helices in the E-to-H helix unit as presented in Table 2 and Figure 6b. The rest of this protein, that is, the region 1–179 contains a Flavodoxin-like fold domain, namely,  domain (see Figures 5a (regions enclosed by rectangles) and 6a). Incidentally, the first and second predicted regions form the Rossmann fold.

Biomolecules 2014, 4 276

Figure 6. 3D structures and ADMs of 1R8J (a) and (b), 1XL3(c) and (d), 2FM9 (e) and (f), 2NP5 (g) and (h), and 2P06 (i) and (j). A portion in light gray denotes the corresponding E-to-H helix part. The label “5–34” denotes the compact region predicted by ADM. The numeral in parentheses means the  value of a compact region predicted by ADM.

2.2.2. Secretion Control Protein (1XL3)

This protein consists of only helices. Figure 6c and 6d present the 3D structure and ADM of the secretion control protein (1XL3). The predicted compact regions are 3–44, 77–99 and 124–201 and the corresponding E-to-H helices are included in the last predicted region with the highest  value (see Figure 6d and Table 2). The DALI search picked up the corresponding E, G and H helix parts in this protein, as the parts are structurally similar to the query structure. However, there is a helix between the corresponding E and G helices in this protein, and thus we regard this helix as the helix corresponding to the F helix (Figure 6d).

Biomolecules 2014, 4 277

2.2.3. Cell Invasion Protein SipA (2FM9)

This protein consists of only helices. Figures 6e and 6 f show the 3D structure and ADM of the cell invasion protein SipA (2FM9). The predicted compact regions are 1–51, 79–115 and 166–199 as presented in Table 2 and the region 1–51 corresponds to the E-to-H unit with one of the highest  values (0.292 for 1–51 and 0.297 for 166–199). The Dali search hit the segment corresponding to the E-to-G helices as the part structurally similar to the query structure (see also Figure S1 in Supplementary Material). The helix corresponding to the H helix is outside the predicted compact region, but the proximity of this helix to the E-to-G helix unit can be confirmed by visual inspection as seen in Figure 6e.

2.2.4. Transcriptional Regulator RHA1_ro04179 (2NP5)

This protein consists of only helices. The 3D structure and ADM of the transcriptional regulator RHA1_ro04179 (2NP5) are presented in Figure 6g,h. The predicted compact regions are 5–38, 76–103, and 128–186 as shown in Table 2 with the last predicted regions corresponding to the E (short part), G and H helices with the highest value. The Dali search hit only the part corresponding to the G-H helices as the part structurally similar to the query structure (Figure S1 in the Supplementary Material). The other helix is perpendicularly close to these two helices, and this can be regarded as the E helix (see Figure 6g). There is no helix corresponding to the F helix.

2.2.5. Hypothetical Protein AF0060 (2P06)

The 3D structure and ADM of the hypothetical protein AF0060 (2P06) are presented in Figure 6i and 6j. The ADM predicts the almost whole region to be the compact region (3–83) as seen in Table 2. The Dali search hit portions corresponding to the G-H helices and a small portion of the corresponding E helix as the part with a 3D structure similar to the query structure (Figure S1 in Supplementary Material). The structural similarity of this part to the E-to-H unit in leghemoglobin can also be confirmed by visual inspection (Figure 6i). The configurations of the four helices corresponding to the E-to-H helix unit in the proteins obtained by the present method are illustrated in Figure 7. In particular, 1R8J and 1XL3 have the same helix configuration as the helices in 1FSL and 1MBN. However, the helix configuration in 2FM9 and 2NP5 is almost a mirror image of that in 1FSL. This is because the Dali search picks up a protein with a 3D structure similar to that of a query protein by comparing the inter-C distances, and thus a protein with a mirror image structure with a query can be hit. 2P06 contains a helix configuration similar to that of 1FSL, but the orientation of the helix E is different. Thus, one should be careful about the helix configuration in each E-to-H helix unit. We call these helix configurations Configuration A (the configuration in 1FSL, 1MBN, 1R8J, and 1XL3), Configuration B (the configuration in 2FM9 and 2NP5), and Configuration C (configuration in 2P06), respectively, as illustrated in Figure 7.

Biomolecules 2014, 4 278

Figure 7. The configurations of helices E, F, G and H. (a) Configuration A. 1FSL, 1MBN, 1R8J and 1XL3 belong to this category. (b) Configuration B. Mirror image of configuration A. 2FM9 and 2NP5 belong to this category. (c) Configuration C. Variant of Configuration A. 2P06 belongs to this category.

2.3. Conserved Residues in the E-to-H Unit

We performed BLAST searches for amino acid sequences of the E-to-H helix parts in five proteins with leghemoglobin and myoglobin followed by multiple alignments with MUSCLE (see method section, Section 4.4, for details). The conservation in a helix corresponding to the F helix in 1FSL or 1MBN was not taken into account because this helix is not indispensable for structure formations [3]. The statistics of the searched homologs are summarized in Table 3. The accession codes of homologues and their multiple aliments are presented in Figure S2 of the Supplementary Material. Because there is no sufficient amino acid sequence diversity among the homologous sequences of 2FM9 and 2NP5, we cannot find any residues, which could be identified as the conserved residue in terms of the number of amino acid substitutions. The conservation in the amino acid sequence of 2P06 could not be evaluated since there is no significant hit with the amino acid sequence of this protein as a query on UniprotKB databases by a BLAST search. The configurations of the helices in these proteins, 2FM9, 2NP5, and 2P06, are shown in Figure 7b and 7c. Thus, we discuss the conservation of residues for the following four proteins: 1FSL, 2MBN, 1R8J, and 1XL3. All of these proteins contain the E-to-H helix unit with Configuration A (Figure 7a). Focusing on the conservativeness of hydrophobic residues, that is, A, F, I, L, M, V, and W, we present the conserved hydrophobic residues in each protein in Figure 8, indicating the conserved residues with the mark “˅”.

Biomolecules 2014, 4 279

Table 3. Homologues of the proteins in the present analyses. Number of Residues Protein Number of Number of Containing E-to-H (Source, PDB) Homologs Conserved Residue Helices Leghemoglobin (soybean 1FSl) 45 34 88 Myoglobin (sperm whale 1MBN) 82 38 96 Circadian clock protein KaiA 49 50 98 (Synechococcus, 1R8J) Secretion control protein) A chain 29 25 76 (Yersinia, 1XL3) Cell invasion protein SipA 6 0 79 (Salmonella, 2FM9) Transcriptional regulator RHA1_ro04179 3 0 74 (Rodococcus, 2NP5) Hypothetical protein AF0060 0 0 81 (E. coli, 2P06)

Figure 8. Conserved hydrophobic residues in the E-to-H helix unit of (a) 1FSL, (b) 1MBN, (c) 1R8J, (d) 1XL3, (e) 2FM9, (f) 2NP5, and (g) 2P06. The conserved residues are labeled with the mark “˅”. Any two residues with the same mark (one of the marks, #, %, ‡, †, ▲, ▼, ■, □, ○, ◊, and ∆) in different helices denote that this residue pair makes a hydrophobic packing detected by the buried surface. Residues with “*” in a helix constitute a common residue pattern specific for the E-to-H helix unit. The residue with “*” actually does not form hydrophobic packing. For (e) 2FM9, (f) 2NP5 and (g) 2P06, a residue with a mark “+” indicates a residue pattern similar to one of the present common residue patters in Table 4.

Biomolecules 2014, 4 280

Figure 8. Cont.

2.4. Residues Involved in Hydrophobic Packing Assigned Based on Buried Surface

The packing residue pairs in the E-to-H unit in each protein detected based on buried surface (see method section, Section 4.3, for details) are depicted in Figure 8. In this figure, any two residues with the same mark (one of the marks, #, %, ‡, †, ▲, ▼, ■, □, ○, ◊, and ∆) in different helices denote that this residue pair makes hydrophobic packing. For the analyses of the packing residue pairs, the F helix was not considered in the present study because it has been confirmed that the F helix in leghemoglobin does not play a significant role in its folding as already mentioned [3]. Relatively many packing pairs are observed between the G and H helices in every protein while a few packing residues are observed between the E and G or H helices as seen in Figure 8. From this figure, we can confirm that the almost all the packing hydrophobic residue pairs are distributed in the conserved residues. Only 1 of the 58 packing residues, 181-I in 1XL3, are not conserved residues in the present definition (Figure 8, see also Figure 5b).

Biomolecules 2014, 4 281

2.5. Common Residue Patterns Specific to the E-to-H Helix Unit Defined from the Packing Patterns of Conserved Hydrophobic Residues

A common residue pattern might be defined from the information of both conserved hydrophobic residues and the packing residues of the E-to-H helix unit in 1FSL, 1MBN, 1R8J and 1XL3. The common residue patterns defined by visual inspection are presented in Figure 8. A residue constituting a common pattern is labeled by the mark “*”. These residue patterns are also summarized in Table 4. The common residue patterns are expressed as x(3)x(3)orx(2)x(4)for the E helix, x(2,3)(2)x(2, 3) for the G helix, x(2)(2)x(2)for the H helix. The symbol  denotes a hydrophobic residue. Only 1of the 44 residues in these patterns, 267-R in 1R8J (a residue with the mark “*” in Figure 8), does not form hydrophobic packing. The packing of the residues constituting common residue patterns in the E-to-H helix unit in 1FSL, 1R8J and 1XL3 is presented in Figure 9a,c respectively. For 2FM9, 2NP5, and 2P06, the similar residue patterns are also inferred by visual inspection based on only the packing in Figures 8e,g. A residue in such a pattern in 2FM9, 2NP5, and 2P06 is labeled by “+” in Figure 8e,g. These residue patterns are also summarized in Table 4. The residue patterns for 2FM9, 2NP5, and 2P06 deviate somewhat from those in the four proteins discussed above. This is probably due to the different configurations of the E-to-H helix units [Figure 7b,c] compared with the helix configuration of the four proteins 1FSL, 1MBN, 1R8J, and 1XL3. Thus, the rule of common residue patterns cannot be applied strictly to 2FM9, 2NP5, and 2P06 as presented in Table 4.

Table 4. Common residue patterns. Protein E Helix G Helix H Helix 1FSL x(2)x(4) x(3)(2)x(2) x(2)(2)x(2) 1MBN x(3)x(3) x(3)(2)x(2) x(2)(2)x(2) 1R8J x(3)x(3) x(3)(2)x(2) x(2)x(3) 1XL3 x(3)x(3) x(2)(2)x(3) x(2)(2)x(3) 2FM9 — x(6) x(10) 2NP5 x(2)x(3) x(3)(2)x x(3)(2)x(2) 2P06 — x(3) x(3)(2)x(1)

The location of these residue patterns for 1R8J, 1XL3, 2FM9, 2NP5, and 2P06 correspond well to that for 1FSL. Figure S1 in Supplementary Material shows the amino acid sequence alignments based on the Dali algorithm of these proteins with 1FSL. In this figure, a residue with the mark “†” constitutes a common residue pattern.

3. Discussion

The results of the present study show that the common topology of the E-to-H helix unit can be observed beyond the globin-like fold when the Dali search is performed followed by ADM screening to check the tendency of a partial amino acid sequence of the hit regions to form a compact structure. This finding demonstrates that the E-to-H helix unit is a common structural unit in the structural space of proteins. Thus, E-to-H structures appearing in several protein folds must not necessarily have evolved from one common ancestor. Instead, the 3D structures of the E-to-H unit should be energetically stable and so this unit is observed widely in the 3D structure space. As far as the globin-like fold is concerned,

Biomolecules 2014, 4 282 as mentioned in the introduction section, there are truncated hemoglobins as primitive hemoglobins and the main structural unit of these proteins is the E-to-H helix unit part. It is speculated that E-to-H unit must be the basic structure in the early stage of the evolution of globin-like fold proteins, and this unit might grow to various globin-fold proteins during evolution. Proteins in other folds, for example, 1R8J, 1XL3 and so on, might also have grew from each ancestor protein with a structure similar to the E-to-H helix unit during evolution. Among the hit proteins, 1R8J is interesting because this protein is composed of two domains, one is the E-to-H unit and the other is the domain, while the other hit proteins are all proteins. This also indicates that the E–to-H helix unit widely exists in the structural space of proteins. Examining the conservation of packing residues with respect to the homologues of each protein reveals the common patterns of packing residue locations on the helices. These residue patterns seem to be typical patterns in -helices, but the present specific combination of these residue patterns must stabilize the 3D structure of the E-to-H helix unit. In other words, we speculate that the helix bundle formed by the G and H helices is stabilized and the E helix interacts with this bundle as shown in Figure 9. It might be possible to define a structural motif specific for E-to-H helix unit by sophisticating the present amino acid sequence patterns.

Figure 9. The hydrophobic packing formed by residues of the common patterns in (a) 1FSL; (b) 1R8J and (c) 1XL3.

Biomolecules 2014, 4 283

For the other hit proteins with the Configurations B and C, only loosely defined common residue pattern are observed as seen in Table 4. Thus, proteins with Configurations B and C show the irregularity of the patterns. Proteins with the Configurations B and C may be excluded from hit proteins by Dalilite by tuning of a Z-score. The analyses in the present study may be extended to other protein structural units and may lead us to find various common structural units beyond folds. We are currently studying along this direction.

4. Method

4.1. Search Protein 3D Structures Similar to that of the E-to-H Helix Unit in Leghemoglobin

In order to search for a 3D structure similar to that of the E-to-H helix unit, a 3D structure comparison program, Dalilite (http://ekhidna.biocenter.helsinki.fi/dali/star), was used [4] with the PDB structures. The 3D coordinates of the C atoms in the E-to-H units in soy bean leghemoglobin (PDB: 1FSL) were used as a query because the E-to-H unit in this protein has been confirmed as a folding core [2,3,26]. To do this, we prepared a 3D structure database, with Dalilite searches for the E-to-H-unit-like structures, by excluding the globin-like fold proteins from the whole PDB structures. Because all the globin-like fold proteins are expected to potentially possess an E-to-H unit, any PDB structures annotated as “globin-like fold” at SCOP were discarded from the Dali search. The protein in which the longest region aligned with the query structure produced by the Dalilite search in comparison with the other proteins within a SCOP family was taken as the representative of this family.

4.2. Prediction of Compact Regions Based on the Average Distance Map Method

To examine whether a polypeptide chain with the E-to-H-unit-like structure in a hit protein by Dalilite itself tends to form a compact structure, we employ the average distance map method. The summary of this method is presented as follows. The readers can also refer to [5,6] for more details. For a given amino acid sequence with an unknown 3D protein structure, a kind of predicted contact maps is defined by plotting a possible contact on a map when the average distance of a pair of residues in a “range” is less than a cutoff value determined beforehand. A map defined with inter-residue average distance statistics is referred to as Average Distance Map (ADM) in this paper. The procedure to construct ADM is presented as follows. (1) Calculations of average distances. The average distances between a pair of Cρatoms of residues were calculated in each range using proteins with known structures. A range is defined by the length between two residues along the amino acid sequence of a given protein. The range M = 1 is defined when 1 ≦ k ≦ 8, where k = |i − j| for i-th and j-th residues along the amino acid sequence. In the same way, the respective ranges M = 2, 3, 4 were defined by 9 ≦ k ≦ 20, 21 ≦ k ≦ 30, 31 ≦ k ≦40 and so on. For every range, the inter Cα atomic distances between two residues were calculated using proteins with known 3D structures [5]. (2) Construction of ADM. To construct a map for a given amino acid sequence, a cutoff distance for each range should be defined. These values are defined so that the contact density of the whole real distance map (RDM) of

Biomolecules 2014, 4 284 the protein is reproduced [5]. The RDM refers to a typical contact map, in other words, a map constructed based on the determined 3D structure of a protein. In the present study, an RDM corresponds to a map constructed with a cutoff of inter-residue Cα atomic distance of 15 Å. The formula, , where N is the total number of residues of a protein and C is an adjustable constant, approximately predicts the average values of the contact density of the entire region of an RDM [5]. C = 36.12, which corresponds to the 15Å threshold for the construction of an RDM, is used in the present work [5]. The cutoff distances for the construction of an ADM from the amino acid sequence of a protein are determined to reproduce a value of . A different cutoff distance is found for a different range to construct an ADM, whereas in the case of an RDM construction just one cutoff distance is required. In the construction of an ADM, it is assumed that the number of residue pairs that make contacts (and should therefore be plots on a map) obeys the following Equation in a range M [5]:

( )

Here, P(M)C is the number of amino acid pairs whose average distances in the range M is less than a given cutoff distance (i.e., residue pairs to be plotted on the ADM), and P(M)t is the total number of residue pairs in a given range (i.e., the number of the pairs with statistical significance among the whole 210 pairs of residues [5].

D is an adjustable parameter chosen to adjust the overall average density ρav of the ADM to be around the predicted value of ρav on an RDM. For an amino acid sequence of a protein, when a pair of residues has an average distance less than the cutoff distance in the range of the residue pair, a plot is made on a contact map. In this way, based on the inter-residue average distances, a kind of predicted contact map is produced from only the amino acid sequence of a given protein. A constructed map is analyzed by the following procedure. (1) Calculation of contact density differences.

Suppose that and ̃ represent the contact density of the triangular and trapezoidal parts, respectively, when the whole area of a map is divided into two parts by a line parallel to the abscissa at the i-th residue or by a line parallel to the ordinate at the i-th residue as illustrated in Figure 10a and 10b.

The contact density difference value is defined as ̃ .

Figure 10. Schematic drawing of a map divided by a line parallel to the ordinate at the i-th residue (a), and divided by a line parallel to the abscissa at the i-th residue (b). The density of

plots in the trapezoidal part and the triangular parts are denoted by ̃ and , respectively.

Biomolecules 2014, 4 285

The values of the contact density difference, are calculated from residues 1 to N, providing a scanning plot of . The scanning plot produced by the division using the line parallel to the ordinate is called horizontal scanning, and the plot produced by the line parallel to the abscissa is called vertical scanning. h of and v of denote the horizontal and vertical divisions of a map, respectively. (2) Detection of boundaries of a compact region. The maximum (peak) or minimum (valley) in the scanning plot reflects a large change in contact density values on a map. Figure 11a shows a schematic example of a horizontal scanning plot of from 1 to N, and at the bottom of the figure, a peak and a valley appear at c and d, respectively indicating a large change in contact density values. The same situation is observed in the vertical scanning in Figure 11a, indicating a peak and a valley at a and b, respectively (shown the left of the figure). The boundary of a compact region on a map can be detected by a peak and a valley appearing in the horizontal and vertical scanning plots of contact density differences.

Figure 11. (a) Schematic drawing of a map with some plots. A peak and a valley appear at the boundaries of a highly dense region of plots. This map suggests the interaction between the segments a–b and c–d. (b) Hypothetical contact map with two compact areas near the diagonal along with the horizontal and vertical scanning plots. This map suggests the existence of two domains at the regions p–q and m–n. We define  as a measure of the  v  v compactness of the region, namely,   pq    or   mn    .

(3) Prediction of location of compact regions. The following explains how to define a possible compact region on a map using the positions of the scanning plot peaks. Figure 11b shows a hypothetical contact map with two compact areas near the diagonal with the horizontal and vertical scanning plots. We recognize the existence of two domains by the peaks at residues m and n in the horizontal scanning plot and residues p and q in the vertical scanning plot. Thus, the regions m–p and n–q on the map are predicted as possible compact regions or domains in the protein. In the ADM method, the strength of the compactness of a region m–p is measured by values defined by (Figure 11b) [5]. The procedure described above predicts the regions of possible compact regions in a protein from only its amino acid sequence. The region with the highest value can be defined a most probable compact region and such a region can be interpreted as a folding unit in a protein.

Biomolecules 2014, 4 286

4.3. Identification of Residues Forming Hydrophobic Packing

The residues forming the hydrophobic packing were identified by the decrease of solvent accessible surface area (SASA) upon folding. If the decreases in SASA exceed 10 Å2 in both of a given two residues, this pair of residues is regarded as hydrophobic packing residues. The reduction in the surface area is given by the difference between the surface area of the side chain atoms in the presence of a contact with another residue upon folding and that in the absence of a contact with another residue before folding. The SASAs were numerically calculated by enumerating the grid points dropped onto the mesh surface equidistant from each atom of a side chain [29]. The distance of a grid point from the center of an atom is the van der Waals radius of each atom plus 1.4 Å, which is the probe radius corresponding to van der Waals radius of a solvent molecule (i.e., water). We confirmed that this criterion residue pair packing corresponds to the contact of two respective atoms in each residue within approximately 5 Å.

4.4. Identification of Conserved Residues in the E-to-H Helix Unit in Each Protein

In order to identify the conserved residues across the E-to-H helix unit of each protein hit by the Dali search, homologous amino acid sequences of the helix unit were collected from the amino acid sequence database, UniProt (http://www.uniprot.org/), by means of BLAST. The amino acid sequence of the E-to-H helix unit of a representative protein for each family as well as that of leghemoglobin (1FSL) were queried against the UniRef100 which is non-redundant set of the UniProt database. A threshold of the expected value of a BLAST search for a hit amino acid sequence retrieval was taken sufficiently low value, e = 0.001, to ensure that homologous sequences were obtained. We made a multiple alignment of the obtained amino acid sequences for each query protein using MUSCLE [30]. Because, in general, the cause of the high similarity among homologous amino acid sequences cannot be distinguished either by the structural conservation or by a phylogenetic closeness, we conducted the phylogenetic analysis to find conserved residues across the homologous amino acid sequence caused by 3D structural restraints by comparing the number of amino acid substitutions in each residue. The phylogenetic trees of homologous amino acid sequences were reconstructed to obtain their ancestral amino acid sequences. The phylogenetic reconstruction was carried out by the NJ method implemented in Seaview [31]. The ancestral amino acid sequence estimation was carried out by PAML4.0 [32] where the reconstructed tree was used as the input tree and the JTT model was used as the amino acid substitution model. The number of amino acid substitutions through the tree was determined for every residue from the estimated ancestral amino acid sequences. Since the total length of the phylogenetic tree corresponds to the expected number of amino acid substitutions averaged over all residues and the distribution of the number of amino acid substitutions can be approximated using the Poisson distribution, the residue at a position in a multiple alignment, which shows that the number of amino acid substitutions at this position is within 5% from the lower level expected from Poisson distribution with mean values estimated from tree length, can be regarded as the genuine conserved residue.

Biomolecules 2014, 4 287

Acknowledgments

This work is supported by JSPS KAKENHI to M.M. Grant-in-Aid for JSPS Fellows (Grant No 259198). One of the authors (Takeshi Kikuchi) also expresses his gratitude to the Ministry of Education, Culture, Sports, Science and Technology for the support of the present work through a program for strategic research foundations at private universities, 2010–2014 (Grant No S10010).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Orengo, C.A.; Jones, D.T.; Thornton, J.M. Protein superfamilies and domain superfolds. Nature 1994, 372, 631–634. 2. Nakajima, S.; Álvarez-Salgado, E.; Kikuchi, T.; Arredondo-Peter, R. Prediction of folding pathway and kinetics among plant hemoglobins by using an average distance map method. Proteins 2005, 61, 500–506. 3. Nishimura, C.; Prytulla, S.; Dyson, H.J.; Wright, P.E. Conservation of folding pathways in evolutionarily distant globin sequences. Nature Struct. Biol. 2000, 7, 679–686. 4. Holm, L.; Park, J. DaliLite workbench for protein structure comparison. Bioinformatics 2000, 16, 566–567. 5. Kikuchi, T.; Némethy, G.; Scheraga, H.A. Prediction of the location of structural domains in globular proteins. J. Protein Chem. 1988, 7, 427–471. 6. Kikuchi, T. Decoding s of proteins using inter-residue average distance statistics to extract information on protein folding mechanisms. In Protein Folding; Walters, E.C., Ed.; Nova Science Publishers. Inc.: New York, NY, USA, 2011; pp. 465–487, ISBN 978-1-61728-990-32011. 7. Tai, C.H.; Sam, V.; Gibrat, J.F.; Garnier, J.; Munson, P.J.; Lee, B. Protein domain assignment from the recurrence of locally similar structures. Proteins 2011, 79, 853–866. 8. Szustakowski, J.D.; Kasif, S.; Weng, Z. Less is more: Towards an optimal universal description of protein folds. Bioinformatics 2005, 21, ii66–ii71. 9. Carlacci, L.; Chou, K.C. Energetic approach to the folding of four a-helices connected sequentially. Protein Eng. 1990, 3, 509–514. 10. Chou, K.C.; Maggiora, G.M.; Némethy, G.; Scheraga, H.A. Energetics of the structure of the four-alpha-helix bundle in proteins. Proc. Natl. Acad. Sci. USA 1988, 8, 4295–4299. 11. Chou, K.C.; Maggiora, G.M.; Scheraga, H.A. The role of loop-helix interactions in stabilizing four-helix bundle proteins. Proc. Natl. Acad. Sci. USA 1992, 89, 7315–7319. 12. Chou, K.C.; Némethy, G.; Pottle, M.; Scheraga, H.A. Energy of stabilization of the right-handed beta-alpha-beta crossover in proteins. J. Mol. Biol. 1989, 205, 241–249. 13. Chou, K.C.; Némethy, G.; Scheraga, H.A. Energetic approach to packing of a-helices: 2. General treatment of nonequivalent and nonregular helices. J. Amer. Chem. Soc. 1984, 106, 3161–3170. 14. Chou, K.C. Review: Applications of graph theory to kinetics and protein folding kinetics. Steady and non-steady state systems. Biophys. Chem. 1990, 35, 1–24.

Biomolecules 2014, 4 288

15. Chou, K.C. Does the folding type of a protein depend on its amino acid composition? FEBS Lett. 1995, 363, 127–131. 16. Chou, K.C.; Carlacci, L. Energetic approach to the folding of alpha/beta barrels. Proteins 1991, 9. 280–295. 17. Chou, K.C.; Zhang, C.T. Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem. 1994, 269, 22014–22020. 18. Zhang, C.T.; Chou, K.C. An eigenvalue-eigenvector approach to predicting protein folding types. J. Protein Chem. 1995, 14, 309–326. 19. Gerritsen, M.; Chou, K.C.; Némethy, G.; Scheraga, H.A. Energetics of multi-helix interactions in protein folding: Application to myoglobin. Biopolymers 1985, 24, 1271–1293. 20. Chou, K.C.; Lin, W.Z.; Xiao, X. Wenxiang: A web-server for drawing wenxiang diagrams. Natur. Sci. 2011, 3, 862–865. 21. Chou, K.C.; Zhang, C.T.; Maggiora, G.M. Disposition of amphiphilic helices in heteropolar environments. Proteins 1997, 28, 99–108. 22. Zhou, G.P. The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein interaction mechanism. J. Theor. Biol. 2011, 284, 142–148. 23. Zhou, G.P. The Structural Determinations of the leucine zipper coiled-coil domains of the cgmp-dependent protein kinase i alpha and its interaction with the myosin binding subunit of the myosin light chains phosphase. Protein Peptide Lett. 2011, 18, 966–978. 24. Zhou, G.P.; Huang, R.B. The pH-triggered conversion of the PrP(c) to PrP(sc.). Curr. Top Med. Chem. 2013, 13, 1152–1163. 25. Kawai, Y.; Matsuoka, M.; Kikuchi, T. Analyses of Protein Sequences Using Inter-Residue Average Distance Statistics to Study Folding Processes and the Significance of Their Partial Sequences.Protein Peptide Lett. 2011, 18, 979–990. 26. Ichimaru, T.; Kikuchi, T. Analysis of the differences in the folding kinetics of structurally homologous proteins based on predictions of the gross features of residue contacts. Proteins 2003, 51, 515–530. 27. Ishizuka, Y.; Kikuchi, T. Analysis of the Local Sequences of Folding Sites in Sandwich Proteins with Inter-Residue Average Distance Statistics. Open Bioinforma. J. 2011, 5, 59–68. 28. Nakajima, S.; Kikuchi, T. Analysis of the differences in the folding mechanisms of c-type lysozymes based on contact maps constructed with interresidue average distances. J. Mol. Model. 2007, 13, 587–594. 29. Shrake, A; Rupley, J.A. Environment and exposure to solvent of protein atoms. Lysozyme and insulin. J. Mol. Biol. 1973, 79, 351–371. 30. Edgar, R.C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res.2004, 32, 1792–1797. 31. Gouy, M.; Guindon, S.; Gascuel, O. SeaView version 4: A multiplatform graphical user interface for sequence alignment and phylogenetic tree building. Mol. Biol. Evol. 2010, 27:221–224. 32. Yang, Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 2007, 24, 1586– 1591.

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Table S1: All Families hit by the present Dali search

14-3-3 protein ALDH-like

1-deoxy-D-xylulose-5-phosphate reductoisomerase, Allosteric chorismate mutase

C-terminal domain alpha-Amylases, C-terminal beta-sheet domain

2Fe-2S ferredoxin domains from multidomain proteins alpha-catenin/vinculin

3-Methyladenine DNA glycosylase III (MagIII) alpha-D-glucuronidasE-Hyaluronidase catalytic domain

5-carboxymethyl-2-hydroxymuconate (CHMI) alpha-D-glucuronidase, N-terminal domain

5' nucleotidase-like Alpha-hemoglobin stabilizing protein AHSP

5' to 3' exonuclease catalytic domain alpha-Subunit of urease

5' to 3' exonuclease, C-terminal subdomain alpha-subunit of urease, catalytic domain

6-phosphogluconate dE-Hydrogenase-like, N-terminal Amidase signature (AS)

domain A middle domain of Talin 1

ABC transporter ATPase domain-like Aminoacid dE-Hydrogenase-like, C-terminal domain

Aconitase B, N-terminal domain Aminoacid dE-Hydrogenases

Aconitase iron-sulfur domain Aminoglycoside phosphotransferases

Actin/HSP70 AmyC C-terminal domain-like

Acyl-CoA binding protein AmyC N-terminal domain-like

acyl-CoA oxidase C-terminal domains Amylase, catalytic domain

acyl-CoA oxidase N-terminal domains Animal lipoxigenases

Adapter protein APS, dimerisation domain Anticodon-binding domain of a subclass of class I

Adenylation domain of NAD+-dependent DNA aminoacyl-tRNA synthetases

Adenylyl and guanylyl cyclase catalytic domain Antioxidant defense protein AhpD

Adenylylcyclase toxin (the edema factor) Anti-sigma factor AsiA

a domain/subunit of cytochrome bc1 complex Apolipophorin-III

(Ubiquinol-cytochrome c reductase) Apolipoprotein

Aerobic respiration control sensor protein, ArcB Aquaporin-like

AF0060-like Archaeal DNA-binding protein

AF0941-like Arfaptin, Rac-binding fragment

AF1104-like Arginyl-tRNA synthetase (ArgRS), N-terminal 'additional'

AgoG-like domain

Alcohol dE-Hydrogenase-like, C-terminal domain Aristolochene/

Alcohol dE-Hydrogenase-like, N-terminal domain Armadillo repeat

AldE-Hyde ferredoxin , C-terminal domains Arp2/3 complex 16 kDa subunit ARPC5

AldE-Hyde ferredoxin oxidoreductase, N-terminal domain Arylsulfatase Aspartate/ornithine carbamoyltransferase Calmodulin-like

Aspartokinase allosteric domain-like Calpain large subunit, catalytic domain (domain II)

ATP synthase Calpain large subunit, middle domain (domain III)

ATP synthase (F1-ATPase), gamma subunit cAMP-binding domain a tRNA synthase domain CAPPD, an extracellular domain of amyloid beta A4 protein

Atu0492-like Cell cycle transcription factor e2f-dp

Avirulence protein AvrPto Cell division protein ZapA-like

B12-dependent (class II) ribonucleotide reductase Cellulases catalytic domain

Bacterial dinuclear zinc exopeptidases Cellulose-binding domain family III

Bacterial exopeptidase dimerisation domain Centromere-binding

Bacterial GAP domain Cfr10I/Bse634I

Bacterial glucoamylase C-terminal domain-like Chaperone J-domain

Bacterial glucoamylase N-terminal domain-like Chemotaxis phosphatase CheZ

Bacterial photosystem II reaction centre, L and M subunits Chemotaxis protein CheA P1 domain

Bacterial tryptophan 2,3-dioxygenase CheW-like

Bacteriophage CII protein Cholesterol oxidase

Bacteriorhodopsin-like Choline/Carnitine O-acyltransferase

BAG domain Circadian clock protein KaiA, C-terminal domain

Band 7/SPFH domain Class I alpha-1;2-mannosidase, catalytic domain

BAR domain Class I aminoacyl-tRNA synthetases (RS), catalytic domain

Barrier-to-autointegration factor, BAF Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic

BAS1536-like domain

BC ATP-binding domain-like Class-II DAHP synthetase

BC C-terminal domain-like Class II glutamine amidotransferases

Bcl-2 inhibitors of programmed cell death Class III anaerobic ribonucleotide reductase NRDD subunit

BC N-terminal domain-like clathrin assemblies

Beta-D-glucan exohydrolase, C-terminal domain Clostridium neurotoxins, catalytic domain beta-Lactamase/D-ala carboxypeptidase Clostridium neurotoxins, "coiled-coil" domain beta-Phosphoglucomutase-like Clostridium neurotoxins, C-terminal domain

Biotin repressor-like Clostridium neurotoxins, the second last domain

BRCA2 helical domain CMD-like

BRCA2 tower domain Cobalamin adenosyltransferase

Bromodomain Cobalamin (vitamin B12)-binding domain

Calcium ATPase, transduction domain A CO dE-Hydrogenase flavoprotein C-terminal domain-like

Calcium ATPase, transmembrane domain M CO dE-Hydrogenase flavoprotein N-terminal domain-like CO dE-Hydrogenase ISP C-domain like Cytochrome c'-like

CO dE-Hydrogenase molybdoprotein N-domain-like Cytochrome c oxidase subunit III-like

Coiled-coil domain of nucleotide exchange factor GrpE Cytochrome c oxidase subunit II-like, transmembrane

Cold shock DNA-binding domain-like region

Colicin E3 receptor domain Cytochrome c oxidase subunit I-like

Computational models partly based on NMR data Cytochrome P450

Conserved hypothetical protein MTH1747 Cytochrome p450 reductase N-terminal domain-like

CorA soluble domain-like Cytoplasmic domain of inward rectifier potassium channel

Coronavirus NSP8-like D-aminoacid aminotransferase-like PLP-dependent

Coronavirus S2 glycoprotein enzymes

COX17-like DBL homology domain (DH-domain)

Creatinase/aminopeptidase Dcp2 box A domain

Creatinase/prolidase N-terminal domain Decarboxylase

Crotonase-like DE-Hydroquinate synthase, DHQS

Cryptochrome/photolyase FAD-binding domain delta-Endotoxin, C-terminal domain

Cryptochrome/photolyase, N-terminal domain delta-Endotoxin (insectocide), middle domain

CSE2-like delta-Endotoxin (insectocide), N-terminal domain

C-terminal domain of alpha and beta subunits of F1 ATP Designed four-helix bundle protein synthase Designed single chain threE-Helix bundle

C-terminal domain of class I lysyl-tRNA synthetase DHN aldolase/epimerase

C-terminal domain of DFF45/ICAD (DFF-C domain) Dihydrodipicolinate reductase-like

C-terminal domain of eukaryotic peptide chain release Dimeric chorismate mutase factor subunit 1, ERF1 Dimerisation domain of CENP-B

C-terminal domain of Ku80 Dimerization-anchoring domain of cAMP-dependent PK

C-terminal fragment of elongation factor SelB regulatory subunit

C-terminal fragment of thermolysin DinB-like

C-terminal UvrC-binding domain of UvrB Diol dE-Hydratase, gamma subunit

CUE domain DLC

Cullin homology domain DNA-binding N-terminal domain of transcription activators

Cullin repeat DNA-binding protein Mj223

Cyclin DNA damage-inducible protein DinI

Cyclophilin (peptidylprolyl isomerase) DnaK suppressor protein DksA, alpha-hairpin domain

Cytochrome b562 DNA ligase/mRNA capping enzyme postcatalytic domain

Cytochrome b of cytochrome bc1 complex DNA polymerase I

(Ubiquinol-cytochrome c reductase) DNA polymerase III clamp loader subunits, C-terminal domain Exocyst complex component

DnaQ-like 3'-5' exonuclease Exodeoxyribonuclease V beta chain (RecC), C-terminal

DNA repair protein MutS, domain I domain

DNA repair protein MutS, domain II EXOSC10 HRDC domain-like

DNA repair protein MutS, domain III Extended AAA-ATPase domain

DNA topoisomerase IV, alpha subunit F1F0 ATP synthase subunit A

Domain of poly(ADP-ribose) polymerase F1F0 ATP synthase subunit C

Domain of the SRP/SRP receptor G-proteins FAD-linked oxidases, N-terminal domain

Double Clp-N motif FAD-linked reductases, N-terminal domain double-SIS domain FAD/NAD-linked reductases, dimerisation (C-terminal)

E2 regulatory, transactivation domain domain

EF2458-like FAD/NAD-linked reductases, N-terminal and central

EF-hand modules in multidomain proteins domains eIF-2-alpha, C-terminal domain FAT domain of focal adhesion kinase eIF2alpha middle domain-like Fe,Mn superoxide dismutase (SOD), C-terminal domain

Electron transfer flavoprotein-ubiquinone Fe,Mn superoxide dismutase (SOD), N-terminal domain oxidoreductase-like FemXAB nonribosomal peptidyltransferases

Elongation factor Ts (EF-Ts), dimerisation domain Ferredoxin domains from multidomain proteins

EntA-Im Ferritin

Enzyme IIa from lactose specific PTS, IIa-lac Fertilization protein

Enzyme I of the PEP:sugar phosphotransferase system FHV B2 protein-like

HPr-binding (sub)domain Fibrinogen coiled-coil and central regions

EpoxidE- FIS-like

Epsilon subunit of F1F0-ATP synthase C-terminal domain FKBP12-rapamycin-binding domain of

Epsilon subunit of F1F0-ATP synthase N-terminal domain FKBP-rapamycin-associated protein (FRAP)

ERP29 C domain-like FKBP immunophilin/proline isomerase

ERP29 N domain-like Flagellar export chaperone FliS

ESAT-6 like FMN-linked

E-set domains of sugar-utilizing enzymes Formate dE-Hydrogenase N, cytochrome (gamma) subunit

ETF-QO domain-like Formin homology 2 domain (FH2 domain)

Eukaryotic DNA topoisomerase I, catalytic core Fumarate reductase respiratory complex cytochrome b

Eukaryotic DNA topoisomerase I, dispensable insert domain subunit, FrdC

Eukaryotic DNA topoisomerase I, N-terminal DNA-binding Fungal zinc peptidase fragment GABA-aminotransferase-like

Eukaryotic type KH-domain (KH-domain type I) GAPDH-like GAT domain Heat-inducible transcription repressor HrcA, N-terminal

Gated mechanosensitive channel domain

G domain-linked domain Heat shock protein 70kD (HSP70), C-terminal subdomain

Glutamate-cysteine ligase Heat shock protein 70kD (HSP70), peptide-binding domain

Glutamyl tRNA-reductase catalytic, N-terminal domain Hect, E3 ligase catalytic domain

Glutamyl tRNA-reductase dimerization domain Helical scaffold and wing domains of SecA

Glutathione peroxidase-like Hemerythrin

Glutathione S- (GST), C-terminal domain Hepatitis B viral capsid (hbcag)

Glutathione S-transferase (GST), N-terminal domain Hermes dimerisation domain

GlyceraldE-Hyde-3-phosphate dE-Hydrogenase-like, Hermes transposase-like

N-terminal domain HisE-like (PRA-PH)

Glycerol-3-phosphate (1)-acyltransferase Histidine kinase

Glycerol-3-phosphate dE-Hydrogenase HlyD-like secretion proteins

(Glycosyl)asparaginase HMA, heavy metal-associated domain

Glycosyltransferase family 36 C-terminal domain HMD dimerization domain-like

Glycosyltransferase family 36 N-terminal domain HMG-box

GMC oxidoreductases Homeodomain

G proteins Homodimeric domain of signal transducing histidine kinase

GreA transcript cleavage factor, C-terminal domain HP1531-like

GreA transcript cleavage protein, N-terminal domain HR1 repeat

GRIP domain HrcA C-terminal domain-like

GroEL chaperone, ATPase domain HRDC domain from helicases

GroEL-like chaperone, apical domain HSC20 (HSCB), C-terminal oligomerisation domain

GroEL-like chaperone, intermediate domain Hsp90 co-chaperone CDC37

Group II chaperonin (CCT, TRIC), apical domain HSP90 C-terminal domain

Group II chaperonin (CCT, TRIC), ATPase domain Hut operon positive regulatory protein HutP

Group II chaperonin (CCT, TRIC), intermediate domain Hyaluronate -like catalytic, N-terminal domain

Group I mobile intron endonuclease -like, central domain

Group V grass pollen allergen Hyaluronate lyase-like, C-terminal domain

GTP cyclohydrolase I Hyaluronidase N-terminal domain-like half-ferritin Hyaluronidase post-catalytic domain-like

Haloperoxidase (bromoperoxidase) Hybrid and chimeric proteins

HAL/PAL-like Hybrid cluster protein (prismane protein)

HCDH C-domain-like Hypothetical protein D-63

Head domain of nucleotide exchange factor GrpE Hypothetical protein MTH393 Hypothetical protein MTH677 domain

Hypothetical protein Ta1238 LeuD-like

Hypothetical protein TM1457 Lipoxigenase N-terminal domain

Hypothetical protein VC0424 Long-chain cytokines

Hypothetical protein YfhH L-sulfolactate dE-Hydrogenase-like

Hypothetical protein YhaI Magnesium transport protein CorA, transmembrane region

Hypothetical Protein YjhP Mago nashi protein

IF2B-like Malate synthase G

I/LWEQ domain Malic enzyme N-domain

IMD domain MarR-like transcriptional regulators

Influenza hemagglutinin (stalk) Mason-Pfizer monkey virus matrix protein

Inositol MazG-like monophosphatase/fructose-1,6-bisphosphatase-like Mechanosensitive channel protein MscS (YggB), C-terminal

Inositol polyphosphate 5-phosphatase (IPP5) domain

Insect pheromone/odorant-binding proteins Mechanosensitive channel protein MscS (YggB), middle

Insert subdomain of RNA polymerase alpha subunit domain

Interferon-induced guanylate-binding protein 1 (GBP1), Mechanosensitive channel protein MscS (YggB),

C-terminal domain transmembrane region

Interferons/interleukin-10 (IL-10) MED7 hinge region

IpaD-like Medium chain acyl-CoA dE-Hydrogenase-like, C-terminal

Iron-containing alcohol dE-Hydrogenase domain

Isoprenyl diphosphate synthases Medium chain acyl-CoA dE-Hydrogenase, NM (N-terminal

ISY1 N-terminal domain-like and middle) domains

KaiB-like Membrane ion ATPase

L27 domain Meta-cation ATPase, catalytic domain P

Lactate & malate dE-Hydrogenases, C-terminal domain Metal cation-transporting ATPase, ATP-binding domain N

Lambda integrase-like, catalytic core Methane monooxygenasE-Hydrolase, gamma subunit lambda integrase N-terminal domain Methicillin resistance protein FemA probable tRNA-binding

Large subunit arm

L-aspartase/ Methionine synthase domain

LcrE-like Methionine synthase SAM-binding domain

LDH N-terminal domain-like Methionyl-tRNA synthetase (MetRS), Zn-domain

LemA-like Methyl-accepting chemotaxis protein (MCP) signaling

Lesion bypass DNA polymerase (Y-family), catalytic domain domain

Lesion bypass DNA polymerase (Y-family), little finger Methyl-coenzyme M reductase alpha and beta chain C-terminal domain NadC N-terminal domain-like

Methyl-coenzyme M reductase alpha and beta chain NAD+-dependent DNA ligase, domain 3

N-terminal domain NADH oxidase/flavin reductase

Methylmalonyl-CoA mutase, N-terminal (CoA-binding) NADPH-cytochrome p450 reductase FAD-binding domain domain-like

Middle domain of eukaryotic peptide chain release factor NADPH-cytochrome p450 reductase-like subunit 1, ERF1 NagZ-like

MIF4G domain-like Neurotransmitter-gated ion-channel transmembrane pore

MIF-related NF-kappa-B/REL/DORSAL transcription factors,

Mismatch glycosylase C-terminal domain

MIT domain Nickel-containing superoxide dismutase, NiSOD

Mitochondrial ATP synthase coupling factor 6 Nickel-iron hydrogenase, large subunit

Mitochondrial glycoprotein MAM33-like NifU C-terminal domain-like

Mob1/phocein NifU/IscU domain modified HD domain Nitric oxide (NO) synthase oxygenase domain

Molybdenum -binding domain Nitrogenase iron protein-like

Molybdopterin synthase subunit MoaE NKL-like monodomain cytochrome c Non-heme 11 kDa protein of cytochrome bc1 complex monomeric chorismate mutase (Ubiquinol-cytochrome c reductase)

Motor proteins Nonstructural protein ns2, Nep, M1-binding domain

MPP-like Nop domain mRNA decapping enzyme-like NSP3 homodimer

MTH1187-like N-terminal domain of adenylylcyclase associated protein,

Multidrug efflux transporter AcrB pore domain; PN1, PN2, CAP

PC1 and PC2 subdomains N-terminal domain of alpha and beta subunits of F1 ATP

Multidrug efflux transporter AcrB TolC docking domain; DN synthase and DC subdomains N-terminal domain of cbl (N-cbl)

Multidrug efflux transporter AcrB transmembrane domain N-terminal domain of enzyme I of the PEP:sugar

MxiH-like phosphotransferase system

Myb/SANT domain N-terminal domain of eukaryotic peptide chain release

Myosin S1 fragment, N-terminal domain factor subunit 1, ERF1

N-acetyl transferase, NAT N-terminal domain of the circadian clock protein KaiA

N-acylglucosamine (NAG) epimerase N-terminal domain of the delta subunit of the F1F0-ATP

NAD-binding domain of HMG-CoA reductase synthase

NadC C-terminal domain-like N-terminal, RNA-binding domain of nonstructural protein NS1 PHBH-like

Nuclear receptor ligand-binding domain Phenylalanyl-tRNA synthetase (PheRS)

Nucleotide and nucleoside kinases PhoB-like

NusA extra C-terminal domains Phoshoinositide 3-kinase (PI3K), catalytic domain

Occludin/ELL domain Phoshoinositide 3-kinase (PI3K) helical domain

Ohr/OsmC resistance proteins Phosphate binding protein-like

Oligosaccharide phosphorylase Phosphoenolpyruvate carboxylase

OmpH-like PhosphonoacetaldE-HydE-Hydrolase-like

Orbivirus capsid Phosphorelay protein-like

Orbivirus core Phosphorelay protein luxU

Outer membrane efflux proteins (OEP) Phosphoserine phosphatase RsbU, N-terminal domain

Outer surface protein C (OspC) Photosystem I reaction center subunit XI, PsaL

Oxygen-evolving enhancer protein 3, Photosystems

P40 nucleoprotein PhoU-like

PadR-like PIN domain

Pantothenate synthetase (Pantoate-beta-alanine ligase, Plant invertase/pectin methylesterase inhibitor

PanC) Plant lipoxigenases

PAPS sulfotransferase PLC-like (P variant)

PDEase Pleckstrin-homology domain (PH domain)

PE Poly(ADP-ribose) polymerase, C-terminal domain

Penicillin-binding protein 2x (pbp-2x), c-terminal domain Poly A polymerase C-terminal region-like

Penicillin binding protein dimerisation domain Poly A polymerasE-Head domain-like

Penta-EF-hand proteins Polynucleotide phosphorylase/guanosine pentaphosphate

PEP carboxykinase C-terminal domain synthase (PNPase/GPSI), domain 3

PEP carboxykinase N-terminal domain Polyphosphate kinase C-terminal domain

Pepsin-like Potassium channel NAD-binding domain

PepX catalytic domain-like POU-specific domain

PepX C-terminal domain-like PPE

Periplasmic domain of cytochrome c oxidase subunit II PPK middle domain-like

PF1790-like PPK N-terminal domain-like

PfkB-like kinase Predicted Cof

PFOR PP module Prefoldin

PFOR Pyr module Pre-protein crosslinking domain of SecA

PG0816-like PriB N-terminal domain-like

Phase 1 flagellin Prion-like Prokaryotic DksA/TraR C4-type zinc finger RecG "wedge" domain

Prokaryotic phospholipase A2 Regulator of G-protein signaling, RGS

Prokaryotic type I DNA topoisomerase RelA/SpoT domain

Prokaryotic type KH domain (KH-domain type II) Rel/Dorsal transcription factors, DNA-binding domain

Proteasome subunits Release factor

Proteinase/alpha-amylase inhibitors Reovirus components

Protein HNS-dependent expression A; HdeA Restriction endonuclease EcoO109IR

Protein kinases, catalytic subunit Retrovirus capsid protein C-terminal domain

Protein-L-isoaspartyl O-methyltransferase Retrovirus capsid protein, N-terminal core domain

Protein prenylyltransferase Ribonucleotide reductase-like

Pseudo ankyrin repeat Ribosomal protein L19 (L19e)

PTS-regulatory domain, PRD Ribosomal protein L29 (L29p)

Putative anticodon-binding domain of alanyl-tRNA Ribosomal protein L7/12, C-terminal domain synthetase (AlaRS) Ribosomal protein L7/12, oligomerisation (N-terminal)

Putative thiamin/HMP-binding protein YkoF domain

Putative transcriptional regulator TM1602, C-terminal Ribosomal protein S15 domain Ribosomal protein S20

PyrH-like Ribosome-binding factor A, RbfA

Pyrimidine 5'-nucleotidase (UMPH-1) Ribosome complexes

Pyruvate-ferredoxin oxidoreductase, PFOR, domain II Ribosome recycling factor, RRF

Pyruvate-ferredoxin oxidoreductase, PFOR, domain III RING finger domain, C3HC4

Pyruvate phosphate dikinase, central domain RNA polymerase

Pyruvate phosphate dikinase, C-terminal domain RNA polymerase alpha subunit dimerisation domain

Pyruvate phosphate dikinase, N-terminal domain RNA-polymerase beta

R1 subunit of ribonucleotide reductase, C-terminal domain RNA-polymerase beta-prime

R1 subunit of ribonucleotide reductase, N-terminal domain RNase III catalytic domain-like

Rabenosyn-5 Rab-binding domain-like Roadblock/LC7 domain

Rab geranylgeranyltransferase alpha-subunit, C-terminal ROK domain ROK associated domain

Rab geranylgeranyltransferase alpha-subunit, insert domain ROP protein

Rad50 coiled-coil Zn hook RuBisCo LSMT catalytic domain

RAP domain RuBisCo LSMT C-terminal, -binding domain

Ras-binding domain, RBD Rubredoxin

RecA protein-like (ATPase-domain) RWD domain

RecG, N-terminal domain S100 proteins Saposin B Succinate dE-Hydrogenase/fumarate reductase

SCF ubiquitin ligase complex WHB domain flavoprotein C-terminal domain

ScpB/YpuH-like Succinate dE-Hydrogenase/fumarate reductase

Sec1/munc18-like (SM) proteins flavoprotein N-terminal domain

Secreted chorismate mutase-like Succinate dE-Hydrogenase/Fumarate reductase

Serine acetyltransferase transmembrane subunits (SdhC/FrdC and SdhD/FrdD)

Serum albumin-like Superantigen MAM

Seryl-tRNA synthetase (SerRS) Swaposin

SH2 domain Synatpobrevin N-terminal domain

SH3-domain T7 RNA polymerase

Short-chain cytokines Tandem AAA-ATPase domain

Siah interacting protein N terminal domain-like TAZ domain

Sigma2 domain of RNA polymerase sigma factors TBP-associated factors, TAFs

Sigma3 domain TENA/THI-4

Sigma4 domain Terpene synthases

Signal peptide-binding domain Terpenoid cyclase C-terminal domain

Single strand DNA-binding domain, SSB Terpenoid cyclase N-terminal domain

SipA N-terminal domain-like Tetracyclin repressor-like, C-terminal domain

Siroheme synthase middle domains-like Tetracyclin repressor-like, N-terminal domain

Siroheme synthase N-terminal domain-like Tetratricopeptide repeat (TPR)

Small-conductance potassium channel Thermolysin-like

Small subunit Tim10/DDP

Smc hinge domain TK-like PP module

SopE-like GEF domain TK-like Pyr module

Spectrin repeat Top domain of virus capsid protein

SPy1572-like Topoisomerase V catalytic domain-like

SRP alpha N-terminal domain-like Topoisomerase V repeat domain

Stabilizer of iron transporter SufD TorD-like

Staphylocoagulase Transcriptional repressor TraM

STAT Transcription factor IIA (TFIIA), alpha-helical domain

STAT DNA-binding domain Transcription factor IIA (TFIIA), beta-barrel domain

Substrate-binding domain of HMG-CoA reductase Transcription factor IIB (TFIIB), core domain

Subtilases Transcription factor NusA, N-terminal domain

Succinate dE-Hydrogenase/fumarate reductase Transcription factor STAT-4 N-domain flavoprotein, catalytic domain Transducin (alpha subunit), insertion domain Transglutaminase core Virus ectodomain

Transglutaminase N-terminal domain Voltage-gated potassium channels

Transglutaminase, two C-terminal domains VPR protein fragments

Transketolase C-terminal domain-like VPS23 C-terminal domain

Translationally controlled tumor protein TCTP VPS28 C-terminal domain-like

(histamine-releasing factor) VPS28 N-terminal domain

Translin VPS37 C-terminal domain-like

Translocated intimin receptor (tir) intimin-binding domain VPS9 domain

TransmembranE-Helical fragments V-type ATP synthase subunit C

Transposase IS200-like XseB-like

Trigger factor ribosome-binding domain YciF-like

TrkA C-terminal domain-like YfiT-like putative metal-dependent hydrolases

TrmB-like YfmB-like

Tryptophan synthase beta subunit-like PLP-dependent YggX-like enzymes YhfA-like t-snare proteins YihX-like

TS-N domain YlxM/p13-like

Tumor suppressor gene Apc YppE-like

TyeA-like Zn metallo-beta-lactamase

Type II restriction endonuclease catalytic domain 14-3-3 protein

Type II restriction endonuclease effector domain 1-deoxy-D-xylulose-5-phosphate reductoisomerase,

Typo IV secretion system protein TraC C-terminal domain

TyrA dimerization domain-like 2Fe-2S ferredoxin domains from multidomain proteins

Tyrosine-dependent oxidoreductases 3-Methyladenine DNA glycosylase III (MagIII)

Ubiquitin activating enzymes (UBA) 5-carboxymethyl-2-hydroxymuconate isomerase (CHMI)

U-box 5' nucleotidase-like

Upper collar protein gp10 (connector protein) 5' to 3' exonuclease catalytic domain

Urocanase 5' to 3' exonuclease, C-terminal subdomain

Vacuolar ATP synthase subunit C 6-phosphogluconate dE-Hydrogenase-like, N-terminal

Vanabin-like domain

Vanillyl-alcohol oxidase-like ABC transporter ATPase domain-like

Variant surface glycoprotein (N-terminal domain) Aconitase B, N-terminal domain

VBS domain Aconitase iron-sulfur domain

Vertebrate phospholipase A2 Actin/HSP70

Viral DNA-binding domain Acyl-CoA binding protein acyl-CoA oxidase C-terminal domains Anticodon-binding domain of a subclass of class I acyl-CoA oxidase N-terminal domains aminoacyl-tRNA synthetases

Adapter protein APS, dimerisation domain Antioxidant defense protein AhpD

Adenylation domain of NAD+-dependent DNA ligase Anti-sigma factor AsiA

Adenylyl and guanylyl cyclase catalytic domain Apolipophorin-III

Adenylylcyclase toxin (the edema factor) Apolipoprotein a domain/subunit of cytochrome bc1 complex Aquaporin-like

(Ubiquinol-cytochrome c reductase) Archaeal DNA-binding protein

Aerobic respiration control sensor protein, ArcB Arfaptin, Rac-binding fragment

AF0060-like Arginyl-tRNA synthetase (ArgRS), N-terminal 'additional'

AF0941-like domain

AF1104-like Aristolochene/pentalenene synthase

AgoG-like Armadillo repeat

Alcohol dE-Hydrogenase-like, C-terminal domain Arp2/3 complex 16 kDa subunit ARPC5

Alcohol dE-Hydrogenase-like, N-terminal domain Arylsulfatase

AldE-Hyde ferredoxin oxidoreductase, C-terminal domains Aspartate/ornithine carbamoyltransferase

AldE-Hyde ferredoxin oxidoreductase, N-terminal domain Aspartokinase allosteric domain-like

ALDH-like ATP synthase

Allosteric chorismate mutase ATP synthase (F1-ATPase), gamma subunit alpha-Amylases, C-terminal beta-sheet domain a tRNA synthase domain alpha-catenin/vinculin Atu0492-like alpha-D-glucuronidasE-Hyaluronidase catalytic domain Avirulence protein AvrPto alpha-D-glucuronidase, N-terminal domain B12-dependent (class II) ribonucleotide reductase

Alpha-hemoglobin stabilizing protein AHSP Bacterial dinuclear zinc exopeptidases alpha-Subunit of urease Bacterial exopeptidase dimerisation domain alpha-subunit of urease, catalytic domain Bacterial GAP domain

Amidase signature (AS) enzymes Bacterial glucoamylase C-terminal domain-like

A middle domain of Talin 1 Bacterial glucoamylase N-terminal domain-like

Aminoacid dE-Hydrogenase-like, C-terminal domain Bacterial photosystem II reaction centre, L and M subunits

Aminoacid dE-Hydrogenases Bacterial tryptophan 2,3-dioxygenase

Aminoglycoside phosphotransferases Bacteriophage CII protein

AmyC C-terminal domain-like Bacteriorhodopsin-like

AmyC N-terminal domain-like BAG domain

Amylase, catalytic domain Band 7/SPFH domain

Animal lipoxigenases BAR domain Barrier-to-autointegration factor, BAF Class II aminoacyl-tRNA synthetase (aaRS)-like, catalytic

BAS1536-like domain

BC ATP-binding domain-like Class-II DAHP synthetase

BC C-terminal domain-like Class II glutamine amidotransferases

Bcl-2 inhibitors of programmed cell death Class III anaerobic ribonucleotide reductase NRDD subunit

BC N-terminal domain-like clathrin assemblies

Beta-D-glucan exohydrolase, C-terminal domain Clostridium neurotoxins, catalytic domain beta-Lactamase/D-ala carboxypeptidase Clostridium neurotoxins, "coiled-coil" domain beta-Phosphoglucomutase-like Clostridium neurotoxins, C-terminal domain

Biotin repressor-like Clostridium neurotoxins, the second last domain

BRCA2 helical domain CMD-like

BRCA2 tower domain Cobalamin adenosyltransferase

Bromodomain Cobalamin (vitamin B12)-binding domain

Calcium ATPase, transduction domain A CO dE-Hydrogenase flavoprotein C-terminal domain-like

Calcium ATPase, transmembrane domain M CO dE-Hydrogenase flavoprotein N-terminal domain-like

Calmodulin-like CO dE-Hydrogenase ISP C-domain like

Calpain large subunit, catalytic domain (domain II) CO dE-Hydrogenase molybdoprotein N-domain-like

Calpain large subunit, middle domain (domain III) Coiled-coil domain of nucleotide exchange factor GrpE cAMP-binding domain Cold shock DNA-binding domain-like

CAPPD, an extracellular domain of amyloid beta A4 protein Colicin E3 receptor domain

Cell cycle transcription factor e2f-dp Computational models partly based on NMR data

Cell division protein ZapA-like Conserved hypothetical protein MTH1747

Cellulases catalytic domain CorA soluble domain-like

Cellulose-binding domain family III Coronavirus NSP8-like

Centromere-binding Coronavirus S2 glycoprotein

Cfr10I/Bse634I COX17-like

Chaperone J-domain Creatinase/aminopeptidase

Chemotaxis phosphatase CheZ Creatinase/prolidase N-terminal domain

Chemotaxis protein CheA P1 domain Crotonase-like

CheW-like Cryptochrome/photolyase FAD-binding domain

Cholesterol oxidase Cryptochrome/photolyase, N-terminal domain

Choline/Carnitine O-acyltransferase CSE2-like

Circadian clock protein KaiA, C-terminal domain C-terminal domain of alpha and beta subunits of F1 ATP

Class I alpha-1;2-mannosidase, catalytic domain synthase

Class I aminoacyl-tRNA synthetases (RS), catalytic domain C-terminal domain of class I lysyl-tRNA synthetase C-terminal domain of DFF45/ICAD (DFF-C domain) Dihydrodipicolinate reductase-like

C-terminal domain of eukaryotic peptide chain release Dimeric chorismate mutase factor subunit 1, ERF1 Dimerisation domain of CENP-B

C-terminal domain of Ku80 Dimerization-anchoring domain of cAMP-dependent PK

C-terminal fragment of elongation factor SelB regulatory subunit

C-terminal fragment of thermolysin DinB-like

C-terminal UvrC-binding domain of UvrB Diol dE-Hydratase, gamma subunit

CUE domain DLC

Cullin homology domain DNA-binding N-terminal domain of transcription activators

Cullin repeat DNA-binding protein Mj223

Cyclin DNA damage-inducible protein DinI

Cyclophilin (peptidylprolyl isomerase) DnaK suppressor protein DksA, alpha-hairpin domain

Cytochrome b562 DNA ligase/mRNA capping enzyme postcatalytic domain

Cytochrome b of cytochrome bc1 complex DNA polymerase I

(Ubiquinol-cytochrome c reductase) DNA polymerase III clamp loader subunits, C-terminal

Cytochrome c'-like domain

Cytochrome c oxidase subunit III-like DnaQ-like 3'-5' exonuclease

Cytochrome c oxidase subunit II-like, transmembrane DNA repair protein MutS, domain I region DNA repair protein MutS, domain II

Cytochrome c oxidase subunit I-like DNA repair protein MutS, domain III

Cytochrome P450 DNA topoisomerase IV, alpha subunit

Cytochrome p450 reductase N-terminal domain-like Domain of poly(ADP-ribose) polymerase

Cytoplasmic domain of inward rectifier potassium channel Domain of the SRP/SRP receptor G-proteins

D-aminoacid aminotransferase-like PLP-dependent Double Clp-N motif enzymes double-SIS domain

DBL homology domain (DH-domain) E2 regulatory, transactivation domain

Dcp2 box A domain EF2458-like

Decarboxylase EF-hand modules in multidomain proteins

DE-Hydroquinate synthase, DHQS eIF-2-alpha, C-terminal domain delta-Endotoxin, C-terminal domain eIF2alpha middle domain-like delta-Endotoxin (insectocide), middle domain Electron transfer flavoprotein-ubiquinone delta-Endotoxin (insectocide), N-terminal domain oxidoreductase-like

Designed four-helix bundle protein Elongation factor Ts (EF-Ts), dimerisation domain

Designed single chain threE-Helix bundle EntA-Im

DHN aldolase/epimerase Enzyme IIa from lactose specific PTS, IIa-lac Enzyme I of the PEP:sugar phosphotransferase system FHV B2 protein-like

HPr-binding (sub)domain Fibrinogen coiled-coil and central regions

EpoxidE-Hydrolase FIS-like

Epsilon subunit of F1F0-ATP synthase C-terminal domain FKBP12-rapamycin-binding domain of

Epsilon subunit of F1F0-ATP synthase N-terminal domain FKBP-rapamycin-associated protein (FRAP)

ERP29 C domain-like FKBP immunophilin/proline isomerase

ERP29 N domain-like Flagellar export chaperone FliS

ESAT-6 like FMN-linked oxidoreductases

E-set domains of sugar-utilizing enzymes Formate dE-Hydrogenase N, cytochrome (gamma) subunit

ETF-QO domain-like Formin homology 2 domain (FH2 domain)

Eukaryotic DNA topoisomerase I, catalytic core Fumarate reductase respiratory complex cytochrome b

Eukaryotic DNA topoisomerase I, dispensable insert domain subunit, FrdC

Eukaryotic DNA topoisomerase I, N-terminal DNA-binding Fungal zinc peptidase fragment GABA-aminotransferase-like

Eukaryotic type KH-domain (KH-domain type I) GAPDH-like

Exocyst complex component GAT domain

Exodeoxyribonuclease V beta chain (RecC), C-terminal Gated mechanosensitive channel domain G domain-linked domain

EXOSC10 HRDC domain-like Glutamate-cysteine ligase

Extended AAA-ATPase domain Glutamyl tRNA-reductase catalytic, N-terminal domain

F1F0 ATP synthase subunit A Glutamyl tRNA-reductase dimerization domain

F1F0 ATP synthase subunit C Glutathione peroxidase-like

FAD-linked oxidases, N-terminal domain Glutathione S-transferase (GST), C-terminal domain

FAD-linked reductases, N-terminal domain Glutathione S-transferase (GST), N-terminal domain

FAD/NAD-linked reductases, dimerisation (C-terminal) GlyceraldE-Hyde-3-phosphate dE-Hydrogenase-like, domain N-terminal domain

FAD/NAD-linked reductases, N-terminal and central Glycerol-3-phosphate (1)-acyltransferase domains Glycerol-3-phosphate dE-Hydrogenase

FAT domain of focal adhesion kinase (Glycosyl)asparaginase

Fe,Mn superoxide dismutase (SOD), C-terminal domain Glycosyltransferase family 36 C-terminal domain

Fe,Mn superoxide dismutase (SOD), N-terminal domain Glycosyltransferase family 36 N-terminal domain

FemXAB nonribosomal peptidyltransferases GMC oxidoreductases

Ferredoxin domains from multidomain proteins G proteins

Ferritin GreA transcript cleavage factor, C-terminal domain

Fertilization protein GreA transcript cleavage protein, N-terminal domain GRIP domain HrcA C-terminal domain-like

GroEL chaperone, ATPase domain HRDC domain from helicases

GroEL-like chaperone, apical domain HSC20 (HSCB), C-terminal oligomerisation domain

GroEL-like chaperone, intermediate domain Hsp90 co-chaperone CDC37

Group II chaperonin (CCT, TRIC), apical domain HSP90 C-terminal domain

Group II chaperonin (CCT, TRIC), ATPase domain Hut operon positive regulatory protein HutP

Group II chaperonin (CCT, TRIC), intermediate domain Hyaluronate lyase-like catalytic, N-terminal domain

Group I mobile intron endonuclease Hyaluronate lyase-like, central domain

Group V grass pollen allergen Hyaluronate lyase-like, C-terminal domain

GTP cyclohydrolase I Hyaluronidase N-terminal domain-like half-ferritin Hyaluronidase post-catalytic domain-like

Haloperoxidase (bromoperoxidase) Hybrid and chimeric proteins

HAL/PAL-like Hybrid cluster protein (prismane protein)

HCDH C-domain-like Hypothetical protein D-63

Head domain of nucleotide exchange factor GrpE Hypothetical protein MTH393

Heat-inducible transcription repressor HrcA, N-terminal Hypothetical protein MTH677 domain Hypothetical protein Ta1238

Heat shock protein 70kD (HSP70), C-terminal subdomain Hypothetical protein TM1457

Heat shock protein 70kD (HSP70), peptide-binding domain Hypothetical protein VC0424

Hect, E3 ligase catalytic domain Hypothetical protein YfhH

Helical scaffold and wing domains of SecA Hypothetical protein YhaI

Hemerythrin Hypothetical Protein YjhP

Hepatitis B viral capsid (hbcag) IF2B-like

Hermes dimerisation domain I/LWEQ domain

Hermes transposase-like IMD domain

HisE-like (PRA-PH) Influenza hemagglutinin (stalk)

Histidine kinase Inositol

HlyD-like secretion proteins monophosphatase/fructose-1,6-bisphosphatase-like

HMA, heavy metal-associated domain Inositol polyphosphate 5-phosphatase (IPP5)

HMD dimerization domain-like Insect pheromone/odorant-binding proteins

HMG-box Insert subdomain of RNA polymerase alpha subunit

Homeodomain Interferon-induced guanylate-binding protein 1 (GBP1),

Homodimeric domain of signal transducing histidine kinase C-terminal domain

HP1531-like Interferons/interleukin-10 (IL-10)

HR1 repeat IpaD-like Iron-containing alcohol dE-Hydrogenase domain

Isoprenyl diphosphate synthases Medium chain acyl-CoA dE-Hydrogenase, NM (N-terminal

ISY1 N-terminal domain-like and middle) domains

KaiB-like Membrane ion ATPase

L27 domain Meta-cation ATPase, catalytic domain P

Lactate & malate dE-Hydrogenases, C-terminal domain Metal cation-transporting ATPase, ATP-binding domain N

Lambda integrase-like, catalytic core Methane monooxygenasE-Hydrolase, gamma subunit lambda integrase N-terminal domain Methicillin resistance protein FemA probable tRNA-binding

Large subunit arm

L-aspartase/fumarase Methionine synthase domain

LcrE-like Methionine synthase SAM-binding domain

LDH N-terminal domain-like Methionyl-tRNA synthetase (MetRS), Zn-domain

LemA-like Methyl-accepting chemotaxis protein (MCP) signaling

Lesion bypass DNA polymerase (Y-family), catalytic domain domain

Lesion bypass DNA polymerase (Y-family), little finger Methyl-coenzyme M reductase alpha and beta chain domain C-terminal domain

LeuD-like Methyl-coenzyme M reductase alpha and beta chain

Lipoxigenase N-terminal domain N-terminal domain

Long-chain cytokines Methylmalonyl-CoA mutase, N-terminal (CoA-binding)

L-sulfolactate dE-Hydrogenase-like domain

Magnesium transport protein CorA, transmembrane region Middle domain of eukaryotic peptide chain release factor

Mago nashi protein subunit 1, ERF1

Malate synthase G MIF4G domain-like

Malic enzyme N-domain MIF-related

MarR-like transcriptional regulators Mismatch glycosylase

Mason-Pfizer monkey virus matrix protein MIT domain

MazG-like Mitochondrial ATP synthase coupling factor 6

Mechanosensitive channel protein MscS (YggB), C-terminal Mitochondrial glycoprotein MAM33-like domain Mob1/phocein

Mechanosensitive channel protein MscS (YggB), middle modified HD domain domain Molybdenum cofactor-binding domain

Mechanosensitive channel protein MscS (YggB), Molybdopterin synthase subunit MoaE transmembrane region monodomain cytochrome c

MED7 hinge region monomeric chorismate mutase

Medium chain acyl-CoA dE-Hydrogenase-like, C-terminal Motor proteins MPP-like Nop domain mRNA decapping enzyme-like NSP3 homodimer

MTH1187-like N-terminal domain of adenylylcyclase associated protein,

Multidrug efflux transporter AcrB pore domain; PN1, PN2, CAP

PC1 and PC2 subdomains N-terminal domain of alpha and beta subunits of F1 ATP

Multidrug efflux transporter AcrB TolC docking domain; DN synthase and DC subdomains N-terminal domain of cbl (N-cbl)

Multidrug efflux transporter AcrB transmembrane domain N-terminal domain of enzyme I of the PEP:sugar

MxiH-like phosphotransferase system

Myb/SANT domain N-terminal domain of eukaryotic peptide chain release

Myosin S1 fragment, N-terminal domain factor subunit 1, ERF1

N-acetyl transferase, NAT N-terminal domain of the circadian clock protein KaiA

N-acylglucosamine (NAG) epimerase N-terminal domain of the delta subunit of the F1F0-ATP

NAD-binding domain of HMG-CoA reductase synthase

NadC C-terminal domain-like N-terminal, RNA-binding domain of nonstructural protein

NadC N-terminal domain-like NS1

NAD+-dependent DNA ligase, domain 3 Nuclear receptor ligand-binding domain

NADH oxidase/flavin reductase Nucleotide and nucleoside kinases

NADPH-cytochrome p450 reductase FAD-binding NusA extra C-terminal domains domain-like Occludin/ELL domain

NADPH-cytochrome p450 reductase-like Ohr/OsmC resistance proteins

NagZ-like Oligosaccharide phosphorylase

Neurotransmitter-gated ion-channel transmembrane pore OmpH-like

NF-kappa-B/REL/DORSAL transcription factors, Orbivirus capsid

C-terminal domain Orbivirus core

Nickel-containing superoxide dismutase, NiSOD Outer membrane efflux proteins (OEP)

Nickel-iron hydrogenase, large subunit Outer surface protein C (OspC)

NifU C-terminal domain-like Oxygen-evolving enhancer protein 3,

NifU/IscU domain P40 nucleoprotein

Nitric oxide (NO) synthase oxygenase domain PadR-like

Nitrogenase iron protein-like Pantothenate synthetase (Pantoate-beta-alanine ligase,

NKL-like PanC)

Non-heme 11 kDa protein of cytochrome bc1 complex PAPS sulfotransferase

(Ubiquinol-cytochrome c reductase) PDEase

Nonstructural protein ns2, Nep, M1-binding domain PE Penicillin-binding protein 2x (pbp-2x), c-terminal domain Poly A polymerase C-terminal region-like

Penicillin binding protein dimerisation domain Poly A polymerasE-Head domain-like

Penta-EF-hand proteins Polynucleotide phosphorylase/guanosine pentaphosphate

PEP carboxykinase C-terminal domain synthase (PNPase/GPSI), domain 3

PEP carboxykinase N-terminal domain Polyphosphate kinase C-terminal domain

Pepsin-like Potassium channel NAD-binding domain

PepX catalytic domain-like POU-specific domain

PepX C-terminal domain-like PPE

Periplasmic domain of cytochrome c oxidase subunit II PPK middle domain-like

PF1790-like PPK N-terminal domain-like

PfkB-like kinase Predicted hydrolases Cof

PFOR PP module Prefoldin

PFOR Pyr module Pre-protein crosslinking domain of SecA

PG0816-like PriB N-terminal domain-like

Phase 1 flagellin Prion-like

PHBH-like Prokaryotic DksA/TraR C4-type zinc finger

Phenylalanyl-tRNA synthetase (PheRS) Prokaryotic phospholipase A2

PhoB-like Prokaryotic type I DNA topoisomerase

Phoshoinositide 3-kinase (PI3K), catalytic domain Prokaryotic type KH domain (KH-domain type II)

Phoshoinositide 3-kinase (PI3K) helical domain Proteasome subunits

Phosphate binding protein-like Proteinase/alpha-amylase inhibitors

Phosphoenolpyruvate carboxylase Protein HNS-dependent expression A; HdeA

PhosphonoacetaldE-HydE-Hydrolase-like Protein kinases, catalytic subunit

Phosphorelay protein-like Protein-L-isoaspartyl O-methyltransferase

Phosphorelay protein luxU Protein prenylyltransferase

Phosphoserine phosphatase RsbU, N-terminal domain Pseudo ankyrin repeat

Photosystem I reaction center subunit XI, PsaL PTS-regulatory domain, PRD

Photosystems Putative anticodon-binding domain of alanyl-tRNA

PhoU-like synthetase (AlaRS)

PIN domain Putative thiamin/HMP-binding protein YkoF

Plant invertase/pectin methylesterase inhibitor Putative transcriptional regulator TM1602, C-terminal

Plant lipoxigenases domain

PLC-like (P variant) PyrH-like

Pleckstrin-homology domain (PH domain) Pyrimidine 5'-nucleotidase (UMPH-1)

Poly(ADP-ribose) polymerase, C-terminal domain Pyruvate-ferredoxin oxidoreductase, PFOR, domain II Pyruvate-ferredoxin oxidoreductase, PFOR, domain III RING finger domain, C3HC4

Pyruvate phosphate dikinase, central domain RNA polymerase

Pyruvate phosphate dikinase, C-terminal domain RNA polymerase alpha subunit dimerisation domain

Pyruvate phosphate dikinase, N-terminal domain RNA-polymerase beta

R1 subunit of ribonucleotide reductase, C-terminal domain RNA-polymerase beta-prime

R1 subunit of ribonucleotide reductase, N-terminal domain RNase III catalytic domain-like

Rabenosyn-5 Rab-binding domain-like Roadblock/LC7 domain

Rab geranylgeranyltransferase alpha-subunit, C-terminal ROK domain ROK associated domain

Rab geranylgeranyltransferase alpha-subunit, insert domain ROP protein

Rad50 coiled-coil Zn hook RuBisCo LSMT catalytic domain

RAP domain RuBisCo LSMT C-terminal, substrate-binding domain

Ras-binding domain, RBD Rubredoxin

RecA protein-like (ATPase-domain) RWD domain

RecG, N-terminal domain S100 proteins

RecG "wedge" domain Saposin B

Regulator of G-protein signaling, RGS SCF ubiquitin ligase complex WHB domain

RelA/SpoT domain ScpB/YpuH-like

Rel/Dorsal transcription factors, DNA-binding domain Sec1/munc18-like (SM) proteins

Release factor Secreted chorismate mutase-like

Reovirus components Serine acetyltransferase

Restriction endonuclease EcoO109IR Serum albumin-like

Retrovirus capsid protein C-terminal domain Seryl-tRNA synthetase (SerRS)

Retrovirus capsid protein, N-terminal core domain SH2 domain

Ribonucleotide reductase-like SH3-domain

Ribosomal protein L19 (L19e) Short-chain cytokines

Ribosomal protein L29 (L29p) Siah interacting protein N terminal domain-like

Ribosomal protein L7/12, C-terminal domain Sigma2 domain of RNA polymerase sigma factors

Ribosomal protein L7/12, oligomerisation (N-terminal) Sigma3 domain domain Sigma4 domain

Ribosomal protein S15 Signal peptide-binding domain

Ribosomal protein S20 Single strand DNA-binding domain, SSB

Ribosome-binding factor A, RbfA SipA N-terminal domain-like

Ribosome complexes Siroheme synthase middle domains-like

Ribosome recycling factor, RRF Siroheme synthase N-terminal domain-like Small-conductance potassium channel Thermolysin-like

Small subunit Tim10/DDP

Smc hinge domain TK-like PP module

SopE-like GEF domain TK-like Pyr module

Spectrin repeat Top domain of virus capsid protein

SPy1572-like Topoisomerase V catalytic domain-like

SRP alpha N-terminal domain-like Topoisomerase V repeat domain

Stabilizer of iron transporter SufD TorD-like

Staphylocoagulase Transcriptional repressor TraM

STAT Transcription factor IIA (TFIIA), alpha-helical domain

STAT DNA-binding domain Transcription factor IIA (TFIIA), beta-barrel domain

Substrate-binding domain of HMG-CoA reductase Transcription factor IIB (TFIIB), core domain

Subtilases Transcription factor NusA, N-terminal domain

Succinate dE-Hydrogenase/fumarate reductase Transcription factor STAT-4 N-domain flavoprotein, catalytic domain Transducin (alpha subunit), insertion domain

Succinate dE-Hydrogenase/fumarate reductase Transglutaminase core flavoprotein C-terminal domain Transglutaminase N-terminal domain

Succinate dE-Hydrogenase/fumarate reductase Transglutaminase, two C-terminal domains flavoprotein N-terminal domain Transketolase C-terminal domain-like

Succinate dE-Hydrogenase/Fumarate reductase Translationally controlled tumor protein TCTP transmembrane subunits (SdhC/FrdC and SdhD/FrdD) (histamine-releasing factor)

Superantigen MAM Translin

Swaposin Translocated intimin receptor (tir) intimin-binding domain

Synatpobrevin N-terminal domain TransmembranE-Helical fragments

T7 RNA polymerase Transposase IS200-like

Tandem AAA-ATPase domain Trigger factor ribosome-binding domain

TAZ domain TrkA C-terminal domain-like

TBP-associated factors, TAFs TrmB-like

TENA/THI-4 beta subunit-like PLP-dependent

Terpene synthases enzymes

Terpenoid cyclase C-terminal domain t-snare proteins

Terpenoid cyclase N-terminal domain TS-N domain

Tetracyclin repressor-like, C-terminal domain Tumor suppressor gene product Apc

Tetracyclin repressor-like, N-terminal domain TyeA-like

Tetratricopeptide repeat (TPR) Type II restriction endonuclease catalytic domain Type II restriction endonuclease effector domain VPR protein fragments

Typo IV secretion system protein TraC VPS23 C-terminal domain

TyrA dimerization domain-like VPS28 C-terminal domain-like

Tyrosine-dependent oxidoreductases VPS28 N-terminal domain

Ubiquitin activating enzymes (UBA) VPS37 C-terminal domain-like

U-box VPS9 domain

Upper collar protein gp10 (connector protein) V-type ATP synthase subunit C

Urocanase XseB-like

Vacuolar ATP synthase subunit C YciF-like

Vanabin-like YfiT-like putative metal-dependent hydrolases

Vanillyl-alcohol oxidase-like YfmB-like

Variant surface glycoprotein (N-terminal domain) YggX-like

VBS domain YhfA-like

Vertebrate phospholipase A2 YihX-like

Viral DNA-binding domain YlxM/p13-like

Virus ectodomain YppE-like

Voltage-gated potassium channels Zn metallo-beta-lactamase

Figure S1: Alignment of the sequences of 1FSL, and those of 1R8J, 1XL3, 2FM9, 2NP5, and 2P06, respectively, based on the 3D structure alignment produced by Dalilite. A segment with capital letters denotes that the 3D structure of this segment is similar to that of the aligned segment of the query protein. A residue with the mark “†” constitutes a common residue pattern.

Figure S2: Multiple alignment of each protein and its homologues. The accession numbers of homologues are presented in the right side of each figure. The histogram at the bottom of each figure denotes the number of substitutions of each residue. A red bar means a helix. (a) Multiple alignment of the sequences of 1FSL and its homologues.

(b) Multiple alignment of the sequences of 1MBN and its homologues.

(c) Multiple alignment of the sequences of 1R8J and its homologues.

(d) Multiple alignment of the sequences of 1XL3 and its homologues.

Copyright of Biomolecules (2218-273X) is the property of MDPI Publishing and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.