<<

View metadata, citation and similar papers at core.ac.uk brought to you by CORE

provided by Elsevier - Publisher Connector

Research Paper 209

A database of globular structural domains: clustering of representative family members into similar folds R Sowdhamini, Stephen D Rufino and Tom L Blundell

Background: A database of globular domains, derived from a non-redundant set Address: Imperial Cancer Research Fund Unit of of , is useful for the sequence analysis of aligned domains, for structural Structural Molecular Biology, Department of Crystallography, Birkbeck College, Malet Street, comparisons, for understanding domain stability and flexibility and for fold London WC1E 7HX, UK. recognition procedures. Domains are defined by the program DIAL and classified structurally using the procedure SEA. Correspondence: Tom L Blundell e-mail: [email protected]

Results: The DIAL-derived domain database (DDBASE) consists of 436 protein Key words: clustering, comparison, chains involving 695 protein domains. Of these, 206 are -class, 191 are - protein domains, protein folds, structural similarity, class and 294 and class. The domains, 63% from multidomain proteins and superfamilies 73% less than 150 residues in length, were clustered automatically using both Received: 19 Jan 1996 single-link cluster analysis and hierarchical clustering to give a quantitative Revisions requested: 13 Feb 1996 estimate of similarity in the domain-fold space. Revisions received: 01 Mar 1996 Accepted: 01 Mar 1996 Conclusions: Highly populated and well described folds (doubly wound / , Published: 13 May 1996 singly wound / barrels, , large Greek-key and flavin-binding /) Electronic identifier: 1359-0278-001-00209 are recognized at a SEA cut-off score of 0.55 in single-link clustering and at 0.65 Folding & Design 13 May 1996, 1:209–220 in hierarchical clustering, although functionally related families are usually clearly distinguished at more stringent values. © Current Biology Ltd ISSN 1359-0278

Introduction tural features [2,26–28] or secondary structure arrange- In the past, most comparative analyses have been at the level ments [29–31]. A simplified description at the level of sec- of whole proteins. Proteins have been grouped first into ondary structures is particularly useful in the comparison homologous families with closely related three-dimensional of proteins that share a similar fold but have considerable structures and significant sequence similarities. Although the structural variations [29,32–37]. The procedure SEA, aligned sequences usually have small insertions and dele- developed earlier to enable the comparison of proteins tions, the local structural environment — accessibil- sharing similar folds [31], has been employed to group ity, secondary structure and sidechain hydrogen bonding — domains into structural classes. is highly conserved for residues at topologically equivalent positions [1–3]. Proteins can also be grouped into superfami- A non-redundant set of globular proteins, for which struc- lies; these share a common fold, have similar packing of sec- tures have been deposited in the Brookhaven Protein ondary structural elements and often have similar functions, Data Bank, has been examined and used to derive the but very little sequence similarity [1,3,4–12]. database. The structural domains and their superfamilies can be used in the areas of fold recog- As more crystal structures of larger proteins have been nition, structure comparisons and understanding domain solved, it has become clear that similarities in topology flexibility and stability. The database can be obtained as a occur most often at the level of globular domains rather ‘tar’ file by anonymous ftp from the /pub/ddbase of than whole proteins. Domains are three-dimensional sub- ftp.cryst.bbk.ac.uk. DDBASE complements ‘domain structures with compact hydrophobic cores [13]. The sequence databases’ such as PRODOM [12]. It also pro- identification of protein domains from three-dimensional vides an automated approach which complements data- structures has been approached in different ways (e.g. bases, such as SCOP [9], that depend on special expertise [6,14–24]). Most methods use the principle that interdo- and intervention for each entry, both in the identification main interresidue distances are higher than interresidue of domains and in their classification. distances within a domain. We have used a method, implemented in the program DIAL [25], that identifies Results and discussion domains in a protein by clustering secondary structures on 436 protein chains have been used for the construction of the basis of their average inter-C distances. the domain database comprising a non-redundant set of proteins. Figure 1 shows the results of DIAL for a repre- Several attempts have been made to align the sequences sentative two-domain protein, -transducin (PDB code: of members of protein families on the basis of local struc- 1tad, A chain). 802 domains were initially identified by 210 Folding & Design Vol 1 No 3

DIAL from the 436 protein chains which, after merging conjoint pairs of domains, 168 to interacting pairs and 189 some conjoint clusters, leads to 695 domains (for merging, to disjoint pairs; 17 pairs have pDf values less than 1.0, see Materials and methods and Supplementary material — although the disjoint factor (Df) for the entire protein published with this paper on the internet). Of these, 206 chain is higher than 1.0 (Fig. 2d). A direct but weak corre- domains belong to the -class, 191 to the -class and 294 lation (data not shown) is found between the loss in to the and class. Figure 2a shows the distribution of solvent-accessible surface area upon aggregation and pDf the number of domains within individual protein chains for the 517 domain pairs in multidomain chains of considered in the domain database. More than half of the DDBASE. Domain pairs with low pDf values (1.0–1.25) protein chains are defined as single domains (260 out of have a higher slope (–0.05) compared to poorly interacting

436 protein chains; 59.6%) and these single domains com- domains (pDf values between 1.25 and 1.5; slope –0.005). prise nearly one-third of the database. Most domains are less than 150 residues in length (509 examples; ∼73%; Principal components analysis Fig. 2b). Many large single-domain folds have extra ele- 428 relatively large domains in the database (with seven or ments that are responsible for interdomain interactions more secondary structures) were compared using the pro- and do not contribute to the core. The central jelly-roll cedure SEA ([31]; also see Materials and methods). fold of the viral coat protein in canine parvovirus (1cas), Smaller domains were avoided because they tend to give for example, is surrounded by long inserts. Roughly half of rise to chance short matches. The results of the scoring the small domains (with 50 or less residues) are single- matrix are projected after resolving by principal compo- domain protein chains (78 out of 168 small domains; nents analysis (Fig. 3). ∼46%; Fig. 2c). Single-link cluster analysis The disjoint factor derived from DIAL [25], a useful Figure 4 shows structural similarity at various levels (cut- measure of the extent of domain–domain interactions off values of 0.40, 0.45, 0.50 and 0.55 with the highest dis- within protein chains, has been exploited to classify the similarity at 1.0) where clustering has been performed identified domains into conjoint (highly interacting), using single-link cluster analysis (SCLA; see Materials and interacting (with intermediate interactions) and disjoint methods). Clusters representing structurally similar (very poorly interacting; see Materials and methods for domain folds are evident. A cut-off of 0.55 identifies 203 more details). In multidomain protein chains, we calcu- protein domains which are grouped into one of the well lated disjoint factors for all possible pairs of domains known folds (see Table 1). The other 225 domains either within a chain. In the present domain database, amongst belong to folds with unique representatives or resemble

517 pairwise disjoint factor (pDf) values, 188 correspond to one of the identified folds but are not picked up either

Figure 1

The results of DIAL for a representative protein chain. (a) A ribbon representation of -transducin A chain (1tadA) using RASMOL [46]. The protein has been coloured from the N and C termini using a gradient in which the residues at the N terminus are blue and those at the C terminus are red. The chain folds into two domains: the N-terminal domain is discontinuous, alternating /, and the second domain is -helical. (b) The secondary structural dendrogram of 1tadA obtained from DIAL. Two clusters, corresponding to the two domains, are evident from the tree diagram. (c) Ribbon drawing using RASMOL [46] with the two domains shown in two different colours. The helical domain is shown in red. Connecting loops which act as domain linkers are not shown. The disjoint factor for the two domains is 1.37. Research Paper Database of globular protein domains Sowdhamini et al. 211

Figure 2

Distribution of the number of domains, residues within domains and pairwise disjoint factors in (a) (b) the 436 protein chains used in the construction of DDBASE. This figure is generated using (a) SETOR [53]. Number of domains identified in 400 individual protein chains. (b) Number of residues 200 in 695 domains. (c) Length distribution in 257 300 single-domain proteins. (d) Pairwise disjoint factors in 507 pairs of domains. 10 outliers are 200 not shown. 100 100 Number of examples Number of examples

1 2 3 4 5 6 7 8 100 200 300 400 500 Number of domains Number of residues in domains

(c) (d)

200 150

100 100

Number of examples 50 Number of examples 100 200 300 400 500 0.50 1.00 1.50 2.00 2.50 Number of residues in domains Pairwise disjoint factor

because their dissimilarity score is greater than 0.55 or for / doubly wound folds contoured at 0.55, contains 55 technical reasons such as ill defined secondary structures. domains but includes many closely related subclusters contouring at lower levels. The tightest subcluster corre- The most highly populated folds recognized at a 0.55 cut- sponds to the type I periplasmic-binding proteins (0.40 off score are the / doubly wound folds with 55 domains, cluster including 1gca1 of -strand order 213456 with - the immunoglobulin-like Greek-key folds with 25 strand 6 antiparallel to the others) with the two domains domains, singly wound / barrels with 21 domains, the of leucine/isoleucine/valine-binding protein being identi- fold with 10 domains and the flavin-binding fold fied at greater cut-offs (0.45 for 2liv1 of order 2134567 and also with 10 domains. The globin fold, which is the most 0.50 for 2liv2 of order 21345). Several other tight / sub- populated -rich cluster, has fewer members than the clusters involve the NADP-binding Rossmann-fold, most populated and and -rich clusters. These obser- which consists of a six-stranded parallel -sheet of order vations, based on a non-redundant dataset of protein 321456. The first of these includes lactate (1ldm1 of order domains, confirm the observation that not all folds are 32145678) and malate (1bmdA1 of order 32145678) dehy- equally populated [8,9]. drogenases at a cut-off of 0.40, and 6-phosphogluconate dehydrogenase (2pgd2 of order 3215687) at 0.50. - Amongst the -rich domain clusters identified at the cut- strand 7 of lactate and malate dehydrogenases is weakly off of 0.55, those of the globins and calcium-binding bonded to -strand 6 and is anti-parallel to the other domains are the most populated. The 10 members of the strands. Similarly, -strand 8 of 6-phosphogluconate globin cluster include the globins (cluster involving 1ash dehydrogenase is anti-parallel. The second Rossmann- at 0.40), phycocyanin (1cpcA) and colicin A (1colA). The fold subcluster is that of alcohol dehydrogenase (2ohxA1 helical bundles include two clusters, the long-chain four- of order 321456) and D-glycerate dehydrogenase (1gdhA1 helical cytokine bundles and the helical domain of protein of order 3214567), which has an additional parallel - kinases. strand, at a cut-off of 0.45. The third subcluster (including 1dhr of order 32145678) is that of the tyrosine-dependent The domains that belong to the and class have three oxidoreductases which possess two additional C-terminal highly populated clusters. The first large cluster, that of -strands with -strand 8 being anti-parallel to the others. 212 Folding & Design Vol 1 No 3

Figure 3 binding domains whose common fold consists of a three- stranded -meander -sheet packed against a mixed - sheet of order 3214 whose crossover helices form a third layer.

Among the -rich proteins, the large Greek-key cluster contains a total of 25 domains including 18 of the immunoglobulin fold, six of the plastocyanin/azurin family and one receptor- from adenovirus (1knb). The immunoglobulin fold domains include immunoglobulin domains, telokin (1tlk), C-terminal

domains of the cd2 (1hngA2) and cd4 (3cd41) cell surface receptors, domains (1fcb), the penultimate

domain of cyclodextrin glycosyltransferase (1cdg3), super- oxide dismutase (1cob) and the C-terminal domain of

galactose oxidase (1gof3). A related Greek-key cluster contains the N-terminal domains of the cd2 (1hngA1) and cd4 (1cid1 and 3cd42) cell surface receptors. The double Greek-key cluster of the crystallin fold includes -crys- tallin B (4gcr) domains and the spore coat protein S (1prs) domains and reflects a structural similarity first predicted from sequence by Wistow et al. [38]. Other Greek-key clusters include those related to the FAD-linked domain of ferredoxin reductase and the serine proteinase barrel.

Amongst the -rich folds are several jelly-roll clusters, Principal components analysis of 428 protein domains. SEA scores for including the viral coat jelly-rolls, the double -helix jelly- all pairs of domains have been used and the results are represented using MOLSCRIPT [54]. The colours of the points are based on their rolls and the lectin-like jelly-rolls. The three proteins with secondary structural composition: -rich domains are blue, -rich lectin-like jelly-rolls share the same fold but have a circu- domains are red and and domains are green. lar permutation of their N and C termini [39]. The sequences of lentil lectin (1lenA) and 1-3,1-4--glucanase (2ayh) have been rearranged (1lenAr and 2ayhr) so as to Several of the domains of the flavodoxin-like fold of order resemble that of serum amyloid P-component (1sac). The 21345 are found to cluster at 0.45 (including 3chy). orthogonal barrels include the lipocalins (1ifc, 1mdc,

Among the G protein family (cluster including 1eft1 at 1mup, 1epaB, 1erb and 1bbpA) and profilin (2btfP), which 0.45), the N-terminal discontinuous domain of transducin does not form a closed barrel but has a second layer (1tadA1) has a fold similar to that of ras-P21 (5p21) and formed by a -hairpin. Another interesting cluster the N-terminal domain of elongation factor Tu (1eft1), includes the membrane-spanning porins, whose fold con- despite poor sequence identity (Fig. 5). Both the connec- sists of a large barrel formed of up–down -strands. The tivity of the -sheets (of order 231456 with -strand 2 -trefoil folds are represented by the plant toxin abrin anti-parallel to the rest) and the location of the substrate- domains (1abrB1 and 1abrB2), interleukin-1 (1ilr), Kunitz binding site at the C-terminal end of the -strands are inhibitor (1tie), histidine-rich -binding protein (1hcd) similar in all the three domains. The metzincin family and acidic fibroblast growth factor (1barA). The pepsin-

(1hfc, 1iag1 and 1lae1), which cluster together at 0.53, is like and retroviral aspartic proteinase domains cluster into the most dissimilar subcluster amongst the / doubly three groups: the N terminus (cluster including 1apvE2) wound folds, possessing only one unit and two addi- and C terminus (cluster including 1apvE1) of the pepsin- tional parallel -strands in common with other members like aspartic proteinases and the retroviral aspartic pro- of the cluster. teinases (cluster including 2hpeA). Although there is a two-fold symmetry that links the two pepsin-like domains

The type II periplasmic-binding protein domains (1pda1 in the full protein [40], neither this distant similarity nor of order 21354 and 1pda2 of order 21304), which form a the similarity between the pepsin-like and viral domains is second / doubly wound cluster, have -strand 5 of picked up at a cut-off of 0.55. The conservation between 1pda1 and -strand 0 of 1pda2 anti-parallel to the rest. The these domains is essentially restricted to the -strands of second large cluster is that of the singly wound / barrels. two orthogonally packed -sheets that are decorated in The third large and cluster is that of the flavin- different ways in different domains [41]. Research Paper Database of globular protein domains Sowdhamini et al. 213

Figure 4

Domains with structurally similar folds. The result of SEA [31] has been examined using single-link cluster analysis (SLCA) and similar domains are identified at different cut-off scores (0.40, 0.45, 0.50 and 0.55). Each domain is denoted by its Brookhaven Protein Data Bank code, chain identifier and the domain number denoted as a subscript (in the order used in the Supplementary material — published with this paper on the internet). The domains within the three structural classes are demarcated using dashed lines. Highly similar protein domains are identified at low cut-off scores. The clusters corresponding to the globin fold, the calcium-binding domains, the TIM barrels, the doubly wound / domains, the jelly-roll folds, the trefoils, the immunoglobulin (Ig) folds and the aspartic proteinase domains are identified at different cut-off scores.

Hierarchical clustering Among the and domains (Fig. 6b), all the folds identi- Domains identified within the SLCA cut-off of 0.55 were fied by SLCA are clearly identifiable at a distance cut-off segregated according to their structural class, -rich, and within the 0.626–0.658 range with the exception of the or -rich, and further clustered using the hierarchical pro- / doubly wound domains which are further subdivided. cedure KITSCH. From the dendrogram of -rich domains The clustering of the / doubly wound domains is made (Fig. 6a), the four -folds previously identified by SLCA complex by the repeat of similar / units. So, for are visible within the 0.5640–0.630 range. Colicin A (1colA) example, the different NADP-binding Rossmann-fold is again found to be a distant member of the globin fold. domains are found in a small cluster (g) proximal to the

Table 1

The results of single-link cluster analysis.

Name of cluster No. examples Name of cluster No. examples a-rich folds b-rich folds

Globins (cluster involving 1ash; see Fig. 4) 10 Immunoglobulin-like Greek-keys (cluster involving 1hngA2)25 EF-hand calcium-binding (cluster involving 2scpA1)6Cell-surface receptors (cluster involving 1hngA1)3 Long-chain four-helical bundles (cluster involving 1lki) 3 Crystallin Greek-keys (cluster involving 1prs1)4 Protein kinase helical bundles (cluster involving 1irk1)2FAD-linked Greek-key domain of 3 ferredoxin reductase (involving 1fnc2) a and b folds Lipocalin-like orthogonal barrel (cluster involving 1erb) 7 Singly wound / barrels (cluster involving 1fbaA) 21 Viral coat jelly-roll (clusters involving 2stv and 1bbt2) 8 / doubly wound -sheets (cluster involving 1wsyB2)55Double -helix jelly-roll (cluster involving 1cauA) 2 Type II periplasmic-binding (cluster involving 1pda1)2Lectin-like jelly-roll (cluster involving 2ayhr) 3 Cytidine deaminase (cluster involving 1ctt1)2-trefoils (cluster involving 1hcd) 6 Tyrosine phosphatase (cluster involving 2hnq) 2 DNA-clamp (cluster involving 1plq1)2 Flavin-binding reductases (cluster involving 3cox1)10-propellers (cluster involving 2bbkH2)3 PLP-dependent transferase (involving 1dge2)2Serine proteinase (cluster involving 1hneE1)2 / -hydrolases (cluster involving 1ede1)2Aspartic proteinase (clusters involving 1smrA2,6 Zn-dependent exopeptidases (cluster involving 1amp) 2 1smrA1 and 2hpeA) + domain of enolase (cluster involving 4enl2)3Small barrels (cluster involving 1gpr1)2 P450 (cluster involving 2hpdA2)3Porins (cluster involving 1pho) 2 214 Folding & Design Vol 1 No 3

Figure 5

Proteins with a common GTP-binding domain. Three proteins in this dataset, ras-P21 (5p21), elongation factor Tu (1eft) and -transducin (1tadA), which belong to this family and also clustered together (see Fig. 4), are shown. The ribbon drawings have been produced using SETOR [53] and the proteins have been superposed using MNYFIT [55]. In each case, the GTP-binding domain is shown with the -strands in bright red, helices in green and the connecting loops in purple. The strand topology is identical in the three proteins. Compared to the single-domain fold of 5p21, two extra helices are present in this domain in the other proteins. Elongation factor Tu has two further domains (coloured light blue and pink) which are -rich, whereas transducin has a helix-rich domain (light blue) as an insert.

adenylate/guanylate kinases (h) but also as isolated out- class I (c), telokin type I domain (d), type E domains from liers (e, f and i). The metzincins (cluster C) and / cyclodextrin glycosyltransferase and galactose oxidase (e), doubly wound domains of the PLP-dependent trans- immunoglobulin variable type 1 domains (f), and variable ferases (H) which belong to the SLCA / doubly wound type 2 domains from CD2 and CD4 (g). The crystallin cluster are found to form their own independent clusters. Greek-key fold, which includes -crystallin B domains Domains that share the singly wound / barrel folds are and the spore coat protein S domains, clusters closely with often characterized by the number of residues that the N- the immunoglobulin-like Greek-keys. The two closed six- terminal -strand buries in the three-layer core and by the stranded Greek-key barrel folds of the FAD-linked cross-section of the barrel, which may vary from circular to domain of ferredoxin reductase (A) and the serine pro- elliptical. This classification is further complicated by the teinase (B) are found to cluster close to each other. The - addition of extra secondary structures and the distortions propeller fold, which consists of two packed meander they cause to the core of the fold. Nevertheless, several smaller clusters can be identified within the singly wound Figure 6 / barrel cluster, including some of the type II chitinase (a) Dendrograms of domains of the -class selected at a 0.55 cut-off and -amylase families (a), PRA isomerase/IPG synthase in SLCA (corresponding to Fig. 4). Structural dissimilarity scores have (b), the muconate lactonizing like family (c), and been derived using the program SEA [31]. Domains corresponding to some of the FMN-linked oxidoreductases (d). the globin and calcium-binding folds occur at distinct nodes. (b) As for (a), but for and domain folds. Clusters corresponding to the singly wound parallel / barrels and the dominant cluster of doubly wound Among the -rich domains (Fig. 6c), hierarchical cluster- parallel / folds can be seen. Smaller clusters are marked A–K and ing (HC) of immunoglobulin-like Greek-key folds results refer to the following fold families: A, cytidine deaminase; B, + in slightly different clusters than those given by SLCA. In domain of enolase; C, metzincins; D, Zn-dependent exopeptidases; E, HC, the plastocyanin/azurin domains cluster separately -hydrolases; F, porphobilinogen deaminase; G, NADP-binding domain of ferredoxin; H, PLP-dependent transferases large domain; I, from the rest, which are divided into seven distinct groups tyrosine phosphatase; J, PLP-dependent transferases small domain; K, containing adenovirus receptor binding domain and super- cytochrome P450. (c) As for (a), but for -class domains. Clusters oxide dismutase (cluster a), fibronectin type III domains include jelly-roll folds, the aspartic proteinase domains and the from neuroglian and growth receptor and con- retroviral proteases, the trefoils and the orthogonal barrels in lipocalins. stant type 2 domains from cd2 and cd4 cell surface recep- A–G denote smaller clusters: A, FAD-linked domain of ferredoxin reductase; B, serine proteinase; C, propellers; D, double helix jelly-roll; tors (b), constant type 1 domains from immunoglobulins, E, porin; F, small barrel; G, DNA-clamp. neonatal Fc receptor and 2- from MHC Research Paper Database of globular protein domains Sowdhamini et al. 215 216 Folding & Design Vol 1 No 3 Research Paper Database of globular protein domains Sowdhamini et al. 217 218 Folding & Design Vol 1 No 3

-sheets (C), is found within the large two-packed -sheet step in the clustering of domain folds, we have performed Greek-key cluster. Of the three jelly-roll folds described an all-against-all SEA comparison of the 428 domains above, two cluster together: the viral coat and the double which have seven or more secondary structures. Further, -helix (D). The lipocalin-like orthogonal barrel fold we have used a combination of SLCA and HC to group clusters in two groups, those with a single helix near their the domains into structurally similar folds. SLCA at a cut- N terminus (h) and those with a helical hairpin near their off of 0.55 SEA score ensures that the chosen domains C terminus (i), profilin (2btfP) clustering closest to this belong to one of the pre-existing folds. As the domains latter group. The proliferating cell nuclear antigen (G) selected by SLCA are used for HC, the dendrograms thus domains cluster close to the lipocalins and consist of two derived contain less noise, giving rise to an overall simpli- roughly orthogonally packed -sheets that do not form a fied and quantitative estimate. closed barrel as do the lipocalins. The three aspartic pro- teinase domains cluster together as one would expect, In this paper, we present a database of protein structural albeit very distantly. At a distance cut-off within the domains. The presence of more than 400 protein entries 0.648–0.658 range, only three of the mentioned folds are makes it possible to address several interesting issues not identified as clusters. These are the FAD-linked relating to protein fold similarities, protein stability, analy- domain of ferredoxin reductases (A) and the serine pro- sis of domain movements and fold prediction as suggested teinases (B), which are merged into a six-stranded Greek- in Table 2. Secondary structural dendrograms are available key barrel cluster, and the aspartic proteinases, which for individual protein chains in the database. Disjoint are split into pepsin-like N-terminal domains (k) and factors for different combinations of nodes in the dendro- retroviral domains (j) on one hand and pepsin-like C-ter- gram are available and the domain boundaries can be minal domain (i) on the other. obtained for individual combination of clusters.

Distant relationships With the rapid increase in the availability of protein Many distant structural similarities are evident at higher sequences from studies, there is an overwhelming ranges of SEA scores such as that between the globins, number of protein sequences for which structures are not phycocyanin and colicin A [8]. The domains of abrin

(1abrB1 and 1abrB2) merge with the trefoil-fold cluster at 0.55 cut-off. The domains of plant toxins like abrin and Table 2 are known to have a trefoil fold [8,9]. Many domains Scope of the domain database DDBASE. sharing a distant structural similarity to the immunoglobu- lins are identified. The pepsin-like N-terminal and C-ter- Stability of multidomain proteins minal domains of the aspartic proteinases and the Residue contacts at domain interfaces for conjoint retroviral domains are found to have similar topologies. As and disjoint domains the structural similarity is a continuum in the fold space, it Ligand-binding induced domain movements is difficult to provide on objective cut-off value, but values Movements in single protein structures complexed with different of 0.55 for SLCA and 0.63–0.65 for HC seem useful. ligands (e.g. CAMP kinase) Movements in different members of a homologous family Conclusions (e.g. aspartic proteinases) The results agree largely with the SCOP database [9] and Differences in structure environments for a protein domain alone the SSAP-derived superfolds [8]. For example, the clus- and considered as part of full protein ters identified are analogous to the / doubly wound, Molecular weight versus solvent accessibility for conjoint, disjoint TIM barrel, Greek-key immunoglobulin, -up–down, and interacting domains globin, jelly-roll and trefoil superfolds reported by Orengo To understand structural invariants in domains with a et al. [8]. An explicit set of protein domains derived from common fold an automatic method has been used for structure compari- Suggestions for experiments from dendrograms son for the first time. The protein chains considered in our Folding intermediates from dendrograms non-redundant dataset include multidomain systems such Protein engineering experiments from dendrograms as the periplasmic-binding proteins. Further, objective Influence of domain folds in substitution tables and in turn methods have been employed for the identification of secondary structure prediction, stability and model validation domain boundaries and for the comparison of domain Fold recognition structures. Hence, it should be possible to update the database as more protein structures become available. In Template generation and threading with specific examples the structure comparison of protein domains, we have sys- Prediction of domain boundaries from analysis of sequence length tematically analyzed the arrangement of secondary struc- and domain linker propensities tures and differences in packing of domain folds where DDBASE is useful for the examination of protein domain flexibility, the the clusters are less well defined or unexpected. As a first study of ligand binding induced motions and structure prediction. Research Paper Database of globular protein domains Sowdhamini et al. 219

yet determined, despite the tremendous upsurge in the RASMOL [46]. Other parameters such as pairwise disjoint factors in number of proteins whose structures are available. the case of multidomain proteins, the number of residues within each domain and the loss in solvent-accessibility of individual domains upon Further, studies of pathways clearly aggregation are also calculated for multidomain proteins. Solvent- indicate that particular domains are shared in a wide range accessibility calculations [47] are performed using the PSA program. of sequences. This strongly suggests the need for fold databases at the level of protein domains if they are to be Comparison of domain folds useful for prediction. Structural domain databases, such as The procedure SEA [31] compares protein structures by considering them at the level of their secondary structure elements. For each sec- the DDBASE reported here, can be used where the simi- ondary structure a set of structural properties were determined, an larities are not evident in terms of high sequence identity associated vector was least-squares fitted using the subroutine HELAX or where signature sequences are not present. [48,49] and a local environment was defined which included all neigh- bouring elements with inter-vector mid-point distances of 20 Å or less. Materials and methods Two secondary structures are compared by identifying the largest set of secondary structural equivalences within their environments by dataset means of a maximal common subgraph algorithm [50,51]. The score For the construction of the domain database, we have used 436 associated with the comparison of a pair of secondary structure envi- protein chains that are largely non-redundant and whose structures ronments is derived from the likelihood of the variations in the proper- have been deposited in the Brookhaven Protein Data Bank [42]. The ties of the equivalenced secondary structures. Once all compatible non-homologous dataset server of Hobohm et al. [43] corresponding secondary structures have been compared, an alignment between pro- to the June 1995 set has been used as a guideline in determining the teins and its dissimilarity score, with a range of zero to one, are protein dataset, in which no two chains have more than 25% identity obtained using a dynamic programming procedure [52]. This compari- (for alignments of 80 or more residues). son procedure allows for considerable overall structural flexibility while requiring conservation of local fold topology. Identification of domain boundaries The procedure DIAL [25] has been applied and domain boundaries Clustering of domains identified (see the extensive Supplementary material published with this Only domains containing more than seven secondary structures were paper on the internet). Secondary structures are first identified as considered for all-against-all structural comparison. This avoided the defined by the Kabsch & Sander algorithm [44] using SSTRUC difficulties associated with small domains which are found to be (D Smith, personal communication). Clustering of secondary structures included in the folds of many larger domains. The clustering of domains is performed using the KITSCH algorithm in the PHYLIP package [45] is analyzed using three different procedures: principal components on the basis of ‘proximity indices’ and the results are projected in the analysis, SLCA and HC. form of a dendrogram using DRAWTREE (Z-Y Zhu, personal communi- cation). A parameter termed the ‘disjoint factor’ (D ) is calculated to f In the first procedure, protein domains are considered in a multidimen- ensure that the clusters in the dendrogram represent structural sional space where the coordinates of a domain are its dissimilarity domains. This is related to the ratio between the average proximity scores with all other domains. Orthogonal vectors or eigenvectors are index of all possible pairs of secondary structures in a protein to the then successively identified such that the scatter of domains along average proximity index after ignoring interclusteral indices which have them is maximal. relatively high values. Whilst a Df of 1.0 is the lowest value that can rep- resent a recognizable organization as domains, higher values denote In the second procedure, all protein domains that can be linked through domains that are well separated and even clearly defined. Protein a chain of domain comparisons with dissimilarity scores less than a domains identified with D values between 1.0 and 1.15 are usually f given threshold are grouped together. SLCA has been performed with border-line cases and the domain organization is questionable. Hence, cut-off distances of 0.40, 0.45, 0.50 and 0.55 and the identified clus- these cases have been inspected using graphics, and where domains ters represented in a contour map. are not known to exist in isolation, the proteins are classed as single- domain folds (also see Supplementary material). Proteins with D values f In the third procedure, the KITSCH program is used to obtain the dendro- less than 1.25 have ‘conjoint’ domains with elaborate interface regions, gram that best fits the dissimilarity values between protein domains [45]. those with Df values in the range 1.25–1.5 have ‘interacting’ domains and those with values more than 1.5 have distinct, well separated ‘dis- joint’ domains. Domain database The clustering of secondary structures in individual protein chains is represented as a dendrogram. The coordinates of the Brookhaven Selection of optimal domain boundaries Protein Data Bank files are appropriately grouped and re-written such Further to our previous report on the identification procedure [25], we that each domain is considered as a separate chain and hence can be have modified the program so that it can consider multiple chains and coloured differently. Further, the loss in solvent accessibility for domain automatically identify all possible combinations of clusters from the pairs upon aggregation, number of residues within each domain, pair- dendrogram derived from the KITSCH procedure. As a result, we now wise disjoint factors and a list of secondary structures in the entire also consider situations where the list of clusters representing domains protein chain and within identified domains are also available for each may omit up to three secondary structures. Disjoint factors and domain chain. For those proteins with D values slightly greater than 1.0 boundaries for all possible situations are recorded. As far as possible, f (1.0–1.15) but whose domains have been merged after a graphical we choose the highest scoring situation where the domain organization inspection, the above details are present before and after merging. The involves only domains containing at least 25 residues and no sec- present domain database is 42 Mb in size. ondary structures are ignored; failing this, other situations with the highest Df where one cluster is small (domains less than 25 residues) or where one/two/three secondary structures have been ignored are Acknowledgements considered. The situation with the highest disjoint factor, as described We thank Dr. Mark Johnson for kindly making the programs PCA and above, is chosen automatically; however, there is a provision to KITSCH available for use, Dr. Srinivasan for HELAX and Dr. Zhan-Yang Zhu examine or select an alternative situation of the user’s choice. The coor- for providing the program DRAWTREE. We thank Drs , Alex dinates of individual domains are written in Brookhaven Protein Data Murzin, Tim Hubbard, Stephen Brenner, N Srinivasan, and Mr Alex May for Bank format for visualization using graphics packages such as helpful discussions. R Sowdhamini and SD Rufino are funded by the ICRF. 220 Folding & Design Vol 1 No 3

References Mol. Des. 8, 5–27. 1. Overington, J.P., Johnson, M.S., Sali, A. & Blundell, T.L. (1990). Ter- 32. Murthy, M.R.N. (1984). A fast method of comparing protein structure. tiary structural constraints on protein evolutionary diversity: templates, FEBS Lett. 168, 97–102. key residues and structure prediction. Proc. Roy. Soc. Lond. B241, 33. Artymiuk, P.J., Grindley, H.M., Rice, D.W., Ujah, E.C. & Willett, P. 132–145. (1991). Searching techniques for tertiary structures of proteins in the 2. Overington, J.P., et al., & Blundell, T.L. (1993). Molecular recognition in protein data bank. In Proceedings of the 1991 Chemical Information protein families: a database of aligned three-dimensional structures of Conference (Colier, H., ed.), pp. 91–106, Infonortics Ltd., Calne. related proteins. Biochem. Soc. Trans. 21, 597–604. 34. Artymiuk, P.J., Grindley, H.M., Park, J.E., Rice, D.W. & Willett, P. 3. Rossmann, M.G. & Argos, P. (1977). The taxonomy of protein struc- (1992). 3-Dimensional structural resemblance between leucine ture. J. Mol. Biol. 109, 99–129. aminopeptidase and carboxypeptidase—as revealed by graph theoreti- 4. Richardson, J.S. (1977). -sheet topology and the relatedness of pro- cal techniques. FEBS Lett. 303, 48–52. teins. 268, 495–500. 35. Orengo, C.A. & Thornton, J.M. (1993). Alpha-fold plus beta-fold revis- 5. Yee, D.P. & Dill, K.A. (1993). Families and the structural relatedness ited – some favored motifs. Structure 1, 105–120. among globular proteins. Protein Sci. 2, 884–899. 36. Orengo, C.A., Flores. T.P., Jones, D.T., Taylor, W.R. & Thornton, J.M. 6. Holm, L. & Sander C. (1994). Parser for units. Proteins (1993). Recurring structural motifs in proteins with different functions. 19, 256–268. Curr. Biol. 3, 131–139. 7. Alexandrov, N.N. & Go, N. (1994). Biological meaning, statistical sig- 37. Orengo, C.A., Flores, T.P., Taylor, W.R. & Thornton, J.M. (1993). Identi- nificance, and classification of local spatial similarities in non-homolo- fication and classification of protein fold families. Protein Eng. 6, gous proteins. Protein Sci. 3, 866–875. 485–500. 8. Orengo, C.A., Jones, D.T. & Thornton, J.M. (1994). Protein superfami- 38. Wistow, G., Summers, L. & Blundell, T.L. (1985). Myxococcus-xanthus lies and domain superfolds. Nature 372, 631–634. spore coat protein-s may have a similar structure to vertebrate lens 9. Murzin, A.G., Brenner, S.E., Hubbard, T. & Chothia, C. (1995). SCOP: beta-gamma-crystallins. Nature 315, 771–773. a structural classification of proteins database for the investigation of 39. Emsley, J., et al., & Wood, S.P. (1994). Structure of pentameric human sequences and structures. J. Mol. Biol. 247, 536–540. serum amyloid-P component. Nature 367, 338–345. 10. Holm, L. & Sander, C. (1993). Protein structure comparison by align- 40. Tang, J., James, M.N.G., Hsu, I.N., Jenkins, J.A. & Blundell, T.L. (1978). ment of distance matrices. J. Mol. Biol. 233, 123–138. Structural evidence for gene duplication in the evolution of the acid 11. Sowdhamini, R., Rufino, S.D. & Blundell, T.L. (1994). The construction proteases. Nature 271, 618–621. and use of databases of protein domain topologies and templates. 41. Šali, A., Veerapandian, B., Cooper, J.B., Moss, D.S., Hofmann, T. & CHEMTRACTS – Biochem. Mol. Biol. 5, 291–306. Blundell, T.L. (1992). Domain flexibility in aspartic proteinases. Pro- 12. Sonnhammer, E. & Kahn, D. (1994). Modular arrangement of proteins teins 12, 158–170. as inferred from analysis of homology. Protein Sci. 3, 482–492. 42. Bernstein, F.C., et al., & Tasumi, M. (1977). Protein Data Bank: a com- 13. Wetlaufer, D.B. (1973). Nucleation, rapid folding and globular intra- puter based archival file for macromolecular structures. J. Mol. Biol. chain regions in proteins. Proc. Natl. Acad. Sci. USA 70, 697–701. 112, 535–542. 14. Schulz, G.E. (1977). Structural rules for globular proteins. Angew. 43. Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Selection Chem. Int. Edit. 16, 23–33. of a representative set of structures from the Brookhaven Protein Data 15. Argos, P. (1990). An investigation of oligopeptides linking domains in Bank. Protein Sci. 1, 409–417. protein tertiary structures and possible candidates for general gene 44. Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary fusion. J. Mol. Biol. 211, 943–958. structure. Pattern recognition of hydrogen-bonded and geometrical 16. Crippen, G.M. (1978). The tree structural organisation of proteins. J. features. Biopolymers 22, 2577–2637. Mol. Biol. 126, 315–332. 45. Felsenstein, J. (1985). Confidence-limits on phylogenies – an 17. Rose, G.D. (1979). Hierarchical organisation of domains in globular approach using the bootstrap. Evolution 39, 783–791. proteins. J. Mol. Biol. 134, 447–470. 46. Sayle, R.A. & Milner-White, E.J. (1995). RASMOL – biomolecular 18. Go, M. (1981). Correlation of DNA exonic regions with protein struc- graphics for all. Trends Biochem. Sci. 20, 374–376. tural units in . Nature 291, 90–92. 47. Lee, B. & Richards, F.M. (1971). The interpretation of protein struc- 19. Wodak, S.J. & Janin, J. (1981). Location of structural domains in pro- tures: estimation of static accessibility. J. Mol. Biol. 55, 379–400. teins. Biochemistry 20, 6544–6553. 48. Chou, K.-C., Nemethy, G. & Scheraga, H.A. (1984). Energetic 20. Zehfus, M.H. & Rose, G.D. (1986). Compact units in proteins. Bio- approach to the packing of alpha–helices. 2. General treatment of chemistry 25, 5759–5765. nonequivalent and nonregular helices. J. Am. Chem. Soc. 106, 21. Swindells, M.B. (1995). A procedure for detecting structural domains 3161–3170. in proteins. Protein Sci. 4, 103–112. 49. Sowdhamini, R., Srinivasan, N., Ramakrishnan, C. & Balaram, P. 22. Islam, S.A., Luo, J. & Sternberg, J.E. (1995). Identification and analysis (1992). Orthogonal – motifs in proteins. J. Mol. Biol. 223, of domains in proteins. Protein Eng. 8, 513–525. 845–851. 23. Siddiqui, A.S. & Barton, G.J. (1995). Continuous and discontinuous 50. Bron, C. & Kerbosch, J. (1973). Algorithm 457: finding all cliques of an domains – an algorithm for the automatic generation of reliable protein undirected graph [H]. J. Commun. Assoc. Comput. Machinery 16, domain definitions. Protein Sci. 4, 872–884. 575–577. 24. Nichols, W.L., Rose, G.D., Eyck, L.F.T. & Zimm, B.H. (1995). Rigid 51. Grindley, H.M., Artymiuk, P.J., Rice, D.W., & Willett, P. (1993). Identifi- domains in proteins: an algorithmic approach to their identification. cation of tertiary structure resemblance in proteins using a maximal Proteins 23, 38–48. common subgraph isomorphism algorithm. J. Mol. Biol. 229, 25. Sowdhamini, R. & Blundell, T.L. (1995). An automatic method involving 707–721. cluster analysis of secondary structures for the identification of 52. Fredman, M.L. (1984). Algorithms for computing evolutionary similarity domains in proteins. Protein Sci. 4, 506–520. measures with length independent gap penalties. Bull. Math. Biol. 46, 26. Rossmann, M.G., Moras, D. & Olsen, K.W. (1974). Chemical and bio- 553–566. logical evolution of a nucleotide-binding protein. Nature 250, 53. Evans, S.V. (1993). SETOR – hardware-lighted 3-dimensional solid 194–199. model representations of macromolecules. J. Mol. Graphics 11, 27. Orengo, C.A. & Taylor, W.R. (1990). A rapid method of protein struc- 134–138. ture alignment. J. Theor. Biol. 147, 517–551. 54. Kraulis, P.J. (1991). MOLSCRIPT: a program to produce both detailed 28. Sali, A. & Blundell, T.L. (1990). The definition of topological equiva- and schematic plots of protein structures. J. Appl. Crystallogr. 24, lence in homologous and analogous structures: a procedure involving 946–950. a comparison of local properties and relationships. J. Mol. Biol. 212, 55. Sutcliffe, M.J., Haneef, I., Carney, D. & Blundell, T.L. (1987). Knowl- 403–428. edge-based modelling of homologous proteins. Part I: Three-dimen- 29. Mitchell, E.M., Artymiuk, P.J., Rice, D.W. & Willett, P. (1989). Use of sional frameworks derived from simultaneous superposition of multiple techniques derived from graph theory to compare secondary structure structures. Protein Eng. 1, 377–384. motifs in proteins. J. Mol. Biol. 212, 151–166. 30. Orengo, C.A., Brown, N.P. & Taylor, W.R. (1992). Fast structure align- ment for protein databank searching. Proteins 14, 139–167. 31. Rufino, S.D. & Blundell, T.L. (1994). Structure-based identification and clustering of protein families and superfamilies. J. Computer-Aided