doi:10.1006/jmbi.2000.4414 available online at http://www.idealibrary.com on J. Mol. Biol. (2001) 306, 591±605

Automated Discovery of Structural Signatures of Fold and Function

Marcel Turcotte1, Stephen H. Muggleton2 and Michael J. E. Sternberg1*

1Imperial Cancer Research There are constraints on a protein sequence/structure for it to adopt a Fund, Biomolecular Modelling particular fold. These constraints could be either a local signature invol- Laboratory, P.O. Box 123 ving particular sequences or arrangements of secondary structure or a London, WC2A 3PX, UK global signature involving features along the entire chain. To search sys- tematically for protein fold signatures, we have explored the use of 2University of York Inductive Logic Programming (ILP). ILP is a machine learning technique Department of Computer which derives rules from observation and encoded principles. The Science, Heslington, York derived rules are readily interpreted in terms of concepts used by YO1 5DD UK experts. For 20 populated folds in SCOP, 59 rules were found automati- cally. The accuracy of these rules, which is de®ned as the number of true positive plus true negative over the total number of examples, is 74 % (cross-validated value). Further analysis was carried out for 23 signatures covering 30 % or more positive examples of a particular fold. The work showed that signatures of protein folds exist, about half of rules discov- ered automatically coincide with the level of fold in the SCOP classi®- cation. Other signatures correspond to homologous family and may be the consequence of a functional requirement. Examination of the rules shows that many correspond to established principles published in speci®c literature. However, in general, the list of signatures is not part of standard biological databases of protein patterns. We ®nd that the length of the loops makes an important contribution to the signatures, suggesting that this is an important determinant of the identity of protein folds. With the expansion in the number of determined protein struc- tures, stimulated by structural genomics initiatives, there will be an increased need for automated methods to extract principles of from coordinates. # 2001 Academic Press Keywords: ; protein function; knowledge discovery; *Corresponding author folding

Introduction to extract structural features that are common to a particular tertiary fold or to a function. This paper Within the collection of determined protein explores the use of Inductive Logic Programming structures, there is a wealth of principles govern- (ILP),3 a form of machine learning to extract auto- ing the complex sequence-conformation-function matically local three-dimensional signatures that relationships. Historically, many of these principles are speci®c to a particular protein fold due to a have been identi®ed by extensive human examin- stereochemical or a functional requirement. ation. However, there is now a rapid expansion in Protein domains adopt a common fold if there is the number of determined protein structures due a similar sequential and three-dimensional arrange- to structural genomics projects1 combined with ment of their regular a- and b-secondary struc- improvements in the experimental methodology.2 tures. Recently several classi®cation of protein Computational methods will therefore be required structures such as SCOP (Structural Classi®cation of ),4 CATH (which clusters proteins at four major levels, Class, Architecture, Topology Abbreviations used: ILP, inductive logic 5 programming; SCOP, structural classi®cation of and Homologous superfamily) and FSSP (Fold proteins. classi®cation based on Structure-Structure align- E-mail address of the corresponding author: ment of Proteins)6 have been developed that ident- [email protected] ify proteins with common folds. In SCOP, which is

0022-2836/01/030591±15 $35.00/0 # 2001 Academic Press 592 Automated Discovery of Structural Signatures the scheme considered here, proteins of the same diction). However, knowledge of these structural fold are subdivided into superfamilies which rep- signatures primarily relies on human expertise resent a grouping of all proteins that are homolo- which is based on in-depth analysis together with gous (i.e. diverged from a common ancestor). knowledge of the relevant literature. Indeed, to our Proteins from different superfamilies within the knowledge, there is no single source that docu- same fold are presumed to be analogues that con- ments these structural signatures. With the increase verged to a stable folding arrangement. The basis in the number of protein structures, automated for this classi®cation scheme is the protein expert methods are required therefore to identify systema- A. Murzin who considers the evolutionary evi- tically structural signatures. dence provided by the sequences, structures and Machine learning techniques, such as arti®cial functions of proteins. neural networks and hidden Markov models, have There are several constraints on a protein been applied to study several problems of molecu- sequence/structure in order for it to adopt a par- lar biology.28 Those techniques have been particu- ticular fold. Firstly, there is the thermodynamic larly successful in analysing sequence data but stability of the ®nal structure. For example, there methods to study the three-dimensional structure could be a requirement for certain lengths of sec- are much less developed. One factor limiting ondary structures to form a stable protein core.7±9 their application is the ability to model long range There can also be constraints on the folding path- interactions. way and certain local substructures might act as We present an application of Inductive Logic ef®cient nucleation sites.10± 12 In addition to struc- Programming (ILP) to learn rules relating local tural needs, the required function can dictate a structures to the concept of fold de®ned by SCOP. common structural feature, particularly for pro- ILP algorithms automatically derive rules from teins from the same superfamily.13 These con- examples and background knowledge. Previously straints can be considered as global or local. A ILP has been applied to several structural molecu- global signature should identify a limited number lar biology problems including protein secondary of important residues or structural arrangements. structure prediction,29 drug design,30,31 packing of Global constraints should be more focussed than beta-strands,32 and chemical mutagenesis.33 Several just a sequence pro®le that is generated from a features suggest it may be particularly well suited multiple alignment. Local features, which are the to study problems encountered in protein struc- focus of this paper, relate to a short region that ture. Firstly, structures are the result of complex may involve a particular sequence or arrangement interactions between sub-structures, and the ability of secondary structures. of ILP algorithms to learn relations may prove to Sequence signatures are a well characterised fea- be a key feature, for example to model the inter- ture of proteins (e.g. Prosite14 and Prints15). These action of two consecutive secondary structures. sequence patterns were primarily derived from Secondly, ILP systems can make use of problem- visual inspection of aligned sequences. Automated speci®c background knowledge. Vast amounts of methods of discovery can also be used to identify knowledge have been accumulated over the years sequence patterns using a variety of approaches of research on protein structure and can be used including unsupervised searching for commonly effectively. Thirdly, ILP is a logic-based approach occurring sequence words,16± 18 Gibbs sampling19 to machine learning and the formalism of logic is and the use of pattern graphs.20 very expressive. The hypotheses (rules) are easily Structural patterns are harder to classify but amenable to human interpretation. Thus, we exam- have also been identi®ed. Kasuya et al.13 started ine here whether ILP can discover principles of with Prosite sequence patterns and mapped them protein fold and function. onto the protein coordinates to identify the local structure corresponding to the sequence motif. Other groups21 ± 23 have focussed on conserved Approach arrangements of sequentially distant residues, par- ticularly of their side-chain atoms, to discover In this application, sets of rules were learnt sep- automatically global signatures of residues that arately for each fold. A machine learning task form protein active sites. Mirny et al.24 performed therefore consist in learning rules for a fold from structural superpositions to identify global patterns examples and background knowledge (see of key residues in the ®ve most populated protein Figure 1). Herein, positive and negative examples folds. There are, however, many structural local were used. A list of positive examples was derived signatures that have been discovered primarily by from SCOP by selecting representative domains for manual inspection, e.g. features of the fold25 the fold under study. A list of counter (negative) and of the Rossmann fold.26 examples was also derived from SCOP, but this A powerful demonstration of the presence of time by selecting domains from different folds of such local structural signatures was the success of the same structural class (all-a, all-b, a/b and Murzin et al.27 in using these signatures to predict a ‡ b), see below for details of the selection pro- successfully protein folds from sequence during cess. The background knowledge contained struc- the blind trial known as CASP2 (the second Critical tural information about the positive and negative Assessment of techniques for protein Structure Pre- examples and also general principles, such as how Automated Discovery of Structural Signatures 593

Our study was restricted to the ®ve most popu- lated folds of each of the four main classes (see Table 1). This ensured that each fold contained a suf®cient number of examples. Since the folds are well populated this also gives a good coverage of the domains found in SCOP. Indeed, the subset comprises 381 domains and therefore represents 30 % of the total number of domains found in the four main classes.

Results

Figure 1. Flow of information. ILP uses background Protein signatures knowledge and examples to derive new rules. A total of 59 signatures were derived automati- cally for 20 populated folds in SCOP (see Table 1). The complete list of rules can also be found from our web site www.bmm.icnet.uk/ilp/fold. This implies that on average three rules per fold to calculate the hydrophobic moment of a were generated to match all the examples. For the secondary structure element. (TIM)-barrels as many as ten rules were needed. In For instance, in the globin fold, the list of posi- general, the number of rules correlates well with tive examples comprised 13 domains which the number of positive examples (correlation coef®- includes haemoglobin I, myoglobin and phycocya- cient 0.875). nin. Although Progol can learn from positive The predicted overall accuracy, de®ned as the examples only, negative examples were included, number of correct assignments over the total num- since they were readily available; the list of counter ber of examples, was estimated using standard examples for the includes cytochrome c, cross-validation tests. The estimated accuracy is four-helical cytokines and cyclin. For each 74.35(Æ1.31) %, which is statistically better at 99 % example, positive and negative, structural infor- level of con®dence than random assignments mation was derived. This includes attributes, such (t-test); see later for details. The accuracy for folds as the total number of residues, but also relational of the all-a and all-b classes is slightly better. This information, such as the adjacency of the second- might be because proteins of these two classes are ary structures, and local information, such as the in general smaller and less complex than those of average hydrophobicity of each secondary struc- the two remaining classes, a/b and a ‡ b. ture element and the presence of proline residues. Recall is de®ned as the number of true positives See below for a complete description. over the total number of positives whilst precision To learn a rule, Progol selects a positive example is the number of true positives over true positives and extracts all the related information from the plus false positives. The rules with recall of 30 % or background knowledge. This information is used more are presented in Table 2, those rules will be in a combinatorial search that incrementally builds referred to as ``power rules''. In our study, it was more speci®c rules until it ®nds one that maximises impossible to derive rules with recall greater than its measure of compression. Speci®cally, the 17 % for the (TIM)-barrel fold. This fold has also measure of compression, equation (1), seeks to the largest number of superfamilies and families in maximise the number of positive examples covered our data set, 17 and 28, respectively. The combined (p), minimises the number of negative examples effects of diversity and complexity of the fold has covered (n), while minimising the length of the prevented Progol from ®nding general signatures. rule (c): As mentioned above, ten rules were produced for the (TIM)-barrels, at the other end of the scale, a single rule per fold was generated for three folds, f ˆ p n c 1† lipocalins, P-loop and SH2. Since there were some outliers (false negative) and some domains do not Given two rules covering the same number of posi- match the rule that was found, the recall ranges tive and negative examples, this measure favours from 86 % to 100 % for those folds. The precision is the shortest one, i.e. the one that obeys the prin- high, ranging from 80 to 100 %. This is in agree- ciple of parsimony or Occam's razor. Once an opti- ment with the prede®ned parameter noise which mal rule has been found, the positive examples it was set to 20 %. matches are removed from the database and the In the following sections ®ve signatures are algorithm resumes with the next available positive detailed. Those rules were selected because of their example. The process continues until no positive particular biological interest. For example, in the example remains. At this point, Progol has found a Rossmann fold, the local signature is characteristic set of rules that represents all the positive of binding site. The Prolog represen- examples. tation of these signatures and their English trans- 594 Automated Discovery of Structural Signatures

Table 1. Cross-validation results

Fold Examples Families Superfamilies Rules FN % Accuracy All-a DNA 3-helical 30 17 4 4 1 82 Æ 4 EF hand-like 14 722169Æ 7 Globin-like 13 212195Æ 4 4-Helical cytokines 10 312273Æ 8 Lambda repressor 10 312063Æ 9 Other folds (92) 210 139 111 - - - 78 Æ 3 All-b Ig beta-sandwich 45 12 8 6 3 71 Æ 4 Tryp ser proteases 21 413182Æ 5 OB-fold 20 11 4 3 4 77 Æ 5 SH3-like barrel 16 762277Æ 6 Lipocalins 14 211279Æ 6 Other folds (56) 220 123 90 - - - 76 Æ 2 a/b b/a (TIM)-barrel 55 28 17 10 5 66 Æ 4 Rossmann-fold 21 712378Æ 5 P-loop 14 411281Æ 6 Periplasmic II 13 213164Æ 8 a/b-Hydrolases 12 10 1 3 0 75 Æ 7 Other folds (70) 200 131 88 - - - 71 Æ 2 a ‡ b Ferredoxin-like 26 21 17 3 2 80 Æ 5 Zincin-like 13 824156Æ 8 SH2-like 13 111079Æ 6 beta-Grasp 12 663164Æ 8 Interleukin 8 9112085Æ 7 Other folds (96) 240 158 113 - - - 73 Æ 3 74 Æ 1 For each fold, the Table lists the number of positive examples, the number of families and superfamilies, the total number of rules automatically derived, the number of false negative examples (FN), i.e. positive examples which are not represented by any of the rules, and the accuracy.

lation are listed in Table 3. This is followed by a the problems of secondary structure assignment discussion of the relationship between the signa- are corrected, the three domains match the general tures and the SCOP hierarchy. rule as well.

Globin Lambda repressor The globin fold comprises diverse sequences The l repressor-like fold contains small domains such as myoglobin, hemoglobin and phycocyanins. that bind DNA in a similar way. The second helix Yet the three-dimensional structures of these pro- of a helix-turn-helix motif (called the recognition teins are similar. One hallmark of this fold is the helix) makes sequence-speci®c contacts with the presence of a conserved proline in helix B. This edge of the base-pairs situated in the major groove observation has been reported previously by Bash- of the DNA molecule.34 Although very important, ford et al.25 and was here discovered automatically. the network of hydrogen bonds between the recog- This is illustrated in Figure 2(a) where the proline nition helix and DNA is not the only determinant residues (represented as ball-and-stricks) coincide of the speci®city. For the well studied proteins 434 with a sharp bend in the main chain. The signature repressor35 and Cro,36 it has been proposed that is present in the two families, globins and phyco- the af®nity for the central base-pairs of the oper- cyanins, that constitute the globin fold. Progol has ator depends on the extensive van der Waals inter- also produced a second rule because three domains actions of the loop that follows the helix-turn-helix appeared to be exceptions. Looking back at the motif. It is thought that the proteins also recognise data revealed that in the case of myoglobin (PDB the conformation of the sugar-phosphate backbone 1myg) the ®rst and second helix were merged which, in turn, depends on a speci®c DNA together and prevented the match. However, in the sequence. A-T base-pairs favour bending toward case of glycera globin (PDB 2hbg) and leghemoglo- the minor groove while G-C base-pairs favour bin (PDB 1bin) the second helix, which is reported bending toward the major groove.37 This mechan- as a contiguous yet bent helix in all our examples, ism is referred to as ``indirect readout''.38 The was separated in two and precluded the match. If structure of the eukaryotic Oct-1 POU-speci®c Automated Discovery of Structural Signatures 595

Table 2. Protein signatures with recall greater than 30 %

All-a: DNA-binding three-helical bundle: (recall ˆ 60 %, precision ˆ 81 %, n ˆ 30, fold). The length is 34 to 105 residues, there are exactly three helices, the loop between the 1st and 2nd helix is two to four residues long. EF hand-like: (recall ˆ 64 %, precision ˆ 90 %, n ˆ 14, fold). The 1st strand is followed by a helix, the 2nd strand is immediately followed by a helix. Globin-like: (recall ˆ 69 %, precision ˆ 100 %, n ˆ 13, superfamily/unique). The 1st helix is followed by a 2nd one that contains a proline. Four-helical cytokines: (recall ˆ 40 %, precision ˆ 80 %, n ˆ 10, family). The 2nd strand is immediately followed by a helix (i.e. no coil). Four-helical cytokines: (recall ˆ 40 %, precision ˆ 80 %, n ˆ 10, family). The 1st helix is long and followed by another helix. l Repressor-like DNA-binding domains: (recall ˆ 70 %, precision ˆ 88 %, n ˆ 10, superfamily/unique). The length varies 53 to 88 residues, the loop between the 3rd and 4th helix is three to nine residues long. All-b: Immunoglobulin b-sandwich: (recall ˆ 49 %, precision ˆ 79 %, n ˆ 45, fold). There is at most one helix, the loop between the 5th and 6th strands is three to seven residues long. Trypsin serine proteases: (recall ˆ 57 %, precision ˆ 93 %, n ˆ 21 superfamily/unique). The loop between the 12th and 13th strand is four to 12 residues long. OB-fold: (recall ˆ 35 %, precision ˆ 100 %, n ˆ 20, fold). The 1st strand is long and followed by an other strand. The 1st helix is followed by a strand. OB-fold: (recall ˆ 30 %, precision ˆ 86 %, n ˆ 20, fold). There are five or six strands, the loop between the 1st and 2nd strand is two to four residues long, the 4th strand is followed by another strand. SH3-like barrel: (recall ˆ 69 %, precision ˆ 92 %, n ˆ 16, fold). There are four to six strands, the loop between the 3rd and 4th strand is one to three residues long. Lipocalins: (recall ˆ 86 %, precision ˆ 86 %, n ˆ 14, superfamily/unique). The length varies from 110 to 179 residues. The loop between the 7th and following strand is two to four residues long. a/b class: NAD(P)-binding Rossmann-fold domains: (recall ˆ 62 %, precision ˆ 100 %, n ˆ 21, superfamily/unique). The 1st strand is followed by a helix, the two structures are separated by a short connection, about one residue. Also, the 6th strand is followed by a helix. P-loop containing nucleotide triphosphate hydrolases: (recall ˆ 86 %, precision ˆ 80 %, n ˆ 14, superfamily/unique). The loop which connects 1st strand and following helix is three to seven residues long. Periplasmic binding protein-like II: (recall ˆ 38 %, precision ˆ 100 %, n ˆ 13, superfamily/unique). The loop between the 10th strand and the helix that follows is one residue long. Periplasmic binding protein-like II: (recall ˆ 31 %, precision ˆ 80 %, n ˆ 13, superfamily/unique). The 4th strand contains a proline and is followed by a helix. a/b-Hydrolases: (recall ˆ 58 %, precision ˆ 88 %, n ˆ 12, superfamily/unique). The loop connecting the 1st strand and the one that follows is one to three residues long and the 7th helix is followed by a strand. a ‡ b class: Ferredoxin-like: (recall ˆ 54 %, precision ˆ 100 %, n ˆ 26, fold). There are three to five b-strands, the 1st strand is followed by a helix and the loop between the 2nd helix the strand that follows it is two to four residues long. Zincin-like: (recall ˆ 31 %, precision ˆ 100 %, n ˆ 13, superfamily) The loop between the 2nd and 3rd strand is 2 to 4 residues long, the 2nd helix is followed by a helix. Zincin-like: (recall ˆ 31 %, precision ˆ 100 %, n ˆ 13, superfamily). The loop between the 1st helix and the following strand is four to ten residues long, the 7th strand is followed by a strand. SH2-like: (recall ˆ 100 %, precision ˆ 81 %, n ˆ 13, family/unique). The length varies from 97 to 116 residues. The loop between the 2nd and 3rd strand is two to four residues long. b-Grasp: (recall ˆ 42 %, precision ˆ 100 %, n ˆ 12, fold). The domain contains the following motif: b2-a1-b3 and the loop between a1 and b3 is two to four residues long. Interleukin 8-like chemokines: (recall ˆ 78 %, precision ˆ 100 %, n ˆ 9, family/unique). The length range from 62 to 78 residues long, the loop which separates the 2nd and 3rd strand is one to three residues long. The statistics in parentheses are: recall, de®ned as true positive/total number of positive examples, precision, de®ned as true posi- tive/total number of prediction, n is the total number of positive examples, a ®nal comment indicates if the rule coincides with a fold, a superfamily or a family. Superfamily/unique denotes the fact that a rule coincides with a superfamily but the fold has only one superfamily therefore it is impossible at this time to know if the rule represents the fold or the superfamily.

domain also shows similar contacts in the loop chain cytokines, the short chain cytokines and the region.39 The signature generated by Progol shows interferons/interleukin-10. In our data-set, the that the length of this loop is conserved amongst interferons/interleukin-10 family contained only the bacteriophage domains: lambda C1 repressor one example and it did not match any of the two (1lmb), 434 C1 repressor (2r63), cro 434 (2cro) and rules that were found. For the majority of the P22 C2 repressor (1adr), and also the eukaryotic folds, Progol produced more than one rule to cover Oct-1 POU-speci®c domain (1pou). This region is all the positive examples. In the cytokines, the shown in red in Figure 2(b). mapping of the rules onto the examples coincides with the classi®cation of SCOP into families. One Cytokines rule describes the long-chain family and its signa- Cytokines regulate the development and activi- ture highlights the fact that the domains start with ties of many cell types. The four-helical cytokines a long helix (see Figure 2(c)). The second rule fold, studied here, contains three families: the long covers the domains of the short-chain family. The 596 Automated Discovery of Structural Signatures

Figure 2. The Figure illustrates the structural elements making up the rules. Elements involved in structural signa- tures are shown in red. All the ®gures were prepared using MOLSCRIPT version 2.156, except that of the lambda repressor which was produced using PREPI (S. Islam unpublished).

distinctive signature in this case is the absence of a helix also participates in the hydrogen bonding coil between the last strand-helix pair (see network of the sheet, except for one domain where Figure 2(d)). This fact was used by Rozwarski the sheet is distorted. In IL-4, Kruse et al.41 showed et al.40 to tether a structural alignment in a com- that residues from the C terminus of this helix parison of the shot-chain helical cytokines. Looking are involved in either receptor binding or receptor at the structures reveals that the ®rst residue of the activation. Automated Discovery of Structural Signatures 597

Table 3. Prolog representation for the six rules pre- tains a large insertion more than 50 residues in sented in the detailed analysis together with the English length, including two helices and two strands. translation

Rule 1 (Globin fold) Helix A at position 1 is followed by helix B. B contains a proline. Composition of the signatures fold('Globin-like', X) :- adjacent(X, A, B, 1, h, h), We now look at the composition of the signa- has_pro(B). tures. In particular, we investigate the frequency of Rule 2 (lambda repressor) The protein is between 53 and 88 residues long. Helix A at position 3 is followed by helix B. The coil between A use of each predicate with a view to understand and B is about six residues long. the fold signatures; see later for a detailed descrip- fold('lambda repressor', X) :- tion the exact de®nitions. The frequency of occur- total_length(53 ˆ

P-loop Table 4. Composition of the signatures The P-loop fold comprises three families: the nucleotide and nucleoside kinases, the G proteins Feature All rules Power rules and nitrogenase iron protein-like. The ®rst loop is adjacent 80 32 necessary for the proper binding of the guanine length_loop 42 19 nucleotide and gives its name to the fold: the length_sec_structure 15 2 diphosphate-binding loop or P-loop. This region, total_length 13 5 number_strands 5 3 strand-loop-helix, contains the conserved sequence number_helices 4 2 motif, [AG]-x(4)-G-K-[ST], which is used in the has_pro 3 2 Prosite database14 to characterise several ATP/ average_hydrophobicity 3 0 GTP-binding proteins (Figure 2(f)). The signature hydrophobic_moment 2 0 found by Progol matches 12 of the 14 domains The Table indicates the number of occurrence of each feature found in SCOP, the exceptions are PDB 1dar, in the rules, we distinguish between two subsets: all rules and which has a shorter loop and PDB 2reb, which con- power rules, the later refers to rules with 30 % or more recall. 598 Automated Discovery of Structural Signatures niques, such as decision trees, relationships of what is considered most favourable according to three structures would have to be listed explicitly, equation (1). Since Progol produced on average otherwise they would not be discovered. Usage of three rules per fold, it is possible that the signa- a higher number of adjacent predicates was pre- tures coincide with superfamilies and families in vented in order to speed up calculations and SCOP. Table 5 shows the number of signatures ensure that Progol could completely sample the that correspond to each level. A distinction is space of possible rules. made between (1) signatures that coincide with a The length of connecting loops is the second superfamily because the fold has only one super- most frequently used predicate. Surprisingly, the family and (2) the genuine case where the fold has length of the loop appears almost as frequently in more than one superfamily but the signature signatures of the folds as it does in the superfami- occurs only in one. Of course, in the former case, lies; seven and nine occurrences, respectively, in the signatures may represent characteristics which the power rules. The length of a loop affects the are speci®c to the superfamily but there are no packing of the secondary structures it connects and data to support it. The same distinction is made for this might explain their occurrences in fold rules. families. For superfamily and family signatures, the conser- About half of the 59 rules represent fold signa- vation of the length of connecting loops can also be tures. This number goes down to about a third explained by the requirements for binding sites when only the power rules are considered. In both and functional regions (see Rossmann and P-loop cases only a small number of family rules were folds, above). The length of the loops used to con- produced. The four-helical cytokines provide struct signatures varies from one structural class to examples of family rules. In our data-set this fold another, with the loop of the all-b class being the has two families, the short and the long chains, longest. and Progol produced two rules each one covering Of all four classes, the signatures of the all-a exactly one family. class make more use of total_length which The signatures often indicate commonality indicates that length is conserved amongst these between superfamilies without always obeying the folds. rigid structure of SCOP. This is seen in the three- Lastly, to our surprise, the hydrophobicity as helical bundle fold where some of the rules involve well as the presence of proline residues do not strands and occur in all but the homeodomains, make signi®cant contribution to the signatures. In which contain only helices. the work of King et al.32 the hydrophobicity was Fold signatures are likely to be the result of found to be a main determinant of the topology of stereochemical constraints. For example, the length sheets. However, this study suggests that hydro- of a loop affects the packing of the ¯anking phobicity is not important in dictating how a secondary structures. Proteins from different sequence will adopt a particular fold. Further ana- superfamilies of the same fold are presumed to be lyses are required to investigate this surprising analogues that have converged toward a stable result. In contrast, our study suggests that the folding arrangement. However, as Russell et al.42 length of the loops plays an important role in suggest, certain folds demonstrate binding site determining protein folds identity, perhaps this is similarity in the absence of homology. For those because the loop affects the packing of the ¯anking folds, the signatures might indicate functional elements. constraints.

Relationship between the signatures and SCOP hierarchy Discussion The protocol used for the analysis does not force This analysis is currently limited by the absence rules to be learnt at the fold level. The system has of structure superposition. It is dif®cult to align the ability to learn more than one rule to cover all reliably and systematically all the structures of a the positive examples. The choice will depend on given fold and identify the common secondary

Table 5. Relationship between the signatures and SCOP hierarchy

All rules Power rules Level Multiple Unique Total Multiple Unique Total Fold 28 - 28 8-8 Superfamilies 9 15 24 2911 Family 4 3 7 224 The Table presents the number of rules that coincide with a particular level of SCOP. Two subsets of the rules are presented: all rules and power rules, the latter refers to rules with 30 % or more recall. We also distinguish between rules which coincide with a fold and characterise domains from more than one superfamily, indicated by multiple, from those who coincide with a fold but have one superfamily, indicated by unique, in that case further data is needed to decide if the rule represents a genuine feature of a fold or a superfamily. Similar distinctions apply to the other two levels, superfamilies and families. Automated Discovery of Structural Signatures 599 structure elements.43 This causes problems for Kasuya et al.13 who studied the three-dimensional folds with large insertions, although several mech- motifs corresponding to Prosite patterns. Their anisms have been put in place to alleviate these analysis also showed that structure can help to dis- problems. In some proteins the secondary struc- tinguish false positive matches. However, we have tures matched by a rule are not structurally equiv- analysed the relationship between Prosite motifs alent. Such large insertions mean that for the and SCOP and found that Prosite motifs mainly (TIM)-barrel and the immunoglobulin folds the coincide with families (12 %) or even sub-families algorithm was prevented from learning general (40 %) of SCOP and a mere 8 % coincide with folds. rules. As a further measure, preliminary studies In the globins, only leghaemoglobin has a Prosite suggest that a simple solution such as numbering motif. In l-repressor, there are three motifs but the secondary structures from C- as well as N- those are speci®c to a family or even a subset of a terminal region can be quite effective. SCOP family. Similarly, in the Rossmann fold, This study shows that signatures of protein folds and superfamilies can be automatically and sys- there are two Prosite motifs: one that represents tematically discovered. Once the background members of the tyrosine-dependent oxido- knowledge has been de®ned, this approach pro- reductases family, and another one that vides an unbiased way to test hypotheses. Once a represents formate/glycerate dehydrogenases and rule has been found, it can be tested experimen- NAD-domains. Finally, in the P-loop fold, the tally, for example, by varying the length of the ATA_GTP_A motif matches seven out of 14 loops in engineered proteins. domains distributed over three families. Thus, in Our key conclusions are: general, the signatures discovered by ILP are not described in Prosite. (1) Protein fold signatures exist. Indeed, half the The rules discovered automatically in this work rules produced represent fold signatures. The other can be classi®ed into four groups. Firstly, there are half characterises aspects of protein function and signatures which are easily available to the com- the signatures were probably conserved for that munity via the biological databases of protein pat- reason. terns. The Prosite motif ATA_GTP_A matches half (2) The length of loops makes a major contri- of the domains in the P-loop fold and is probably bution to the identity of protein folds. Surprisingly, the best example of this kind. However, the signa- the length of loops is used almost as frequently in tures found in the patterns databases contain pri- signatures of folds as in signatures of superfami- marily sequence rather than structural motifs. As a lies. However, the reasons for this might differ. In consequence, these databases focus mainly on folds, the length of the loops may be critical to induce the correct packing of the ¯anking families of proteins. Secondly, there are signatures elements. While for superfamilies, signatures invol- which can be found in the literature. The signa- ving loops often correspond to functionally import- tures for the globin, l-repressor, cytokines, Ross- ant regions. mann-fold and P-loop are examples of this kind. (3) Hydrophobicity and the presence of proline These signatures are known but often require seem to play a less important role than we pre- extensive literature search to be found. Thirdly, viously anticipated. there are signatures which are not found in the lit- (4) The length of the domains is a conserved erature nor in the public databases but are known feature of the all-a class. to experts. The power rules are an example of this The majority of the principles of protein struc- kind, 15 of the 23 signatures from Table 2 were ture have been discovered by extensive human presented to the expert developing SCOP, A. Mur- analyses. New initiatives and improvement of the zin, in the form of a questionnaire with multiple methods for structure determination will lead to an choices. Murzin identi®ed correctly virtually all the explosion of the number of structures. Bourne rules, which illustrates that the rules are genuine et al.44 predicts that over the next ®ve years the signatures that are known to the world's expert in number of structures could grow by as much as a protein structure. Finally, some of the rules are dif- factor of 3. ®cult to analyse and it is possible that further lit- Increasingly, protein structure is involved in the erature search or detailed analysis would reveal process of genome annotation. Sophisticated their meaning. The work presented here used a methods are developed to detect remote homology restricted background knowledge and yet pro- in the hope that knowledge can be transfered to the protein under study. However, high-through- duced rules which are known to the world experts put structure determination projects by nature will on protein structure, which shows the potential of populate the databases with proteins for which this approach. The next step will be to incorporate very little is known about their biology. Therefore additional information in the background knowl- there is a need for tools that automatically extract edge to see how the rules can be improved. principles and assist the construction of classi®- This study shows that ILP can learn expert cation schemes. type rules from complex biological data. Other An interesting step towards a structural charac- areas of bioinformatics may well be amenable to terisation of protein signatures was taken by knowledge discovery using ILP. 600 Automated Discovery of Structural Signatures

Materials and Methods Table 6. Background knowledge, i.e. the building blocks of the protein fold signatures Data set total_length(Lo ˆ

Background knowledge but more importantly, for any real world application, the number of input variables therefore created would In preparation for a machine learning task, the user exceed the capacity of the machine learning algorithm. must de®ne the predicates that can be used to construct Local information makes up the third category. The new rules and how they can be combined with one predicates describe properties of the secondary struc- another. This effectively de®nes the space of possible tures and connecting loops. We have included the aver- rules. age hydrophobicity, the hydrophobic moment, the The background knowledge for the protein fold appli- length and the presence of a proline. For the loops, only cation comprises nine predicates, see Table 6. They can the length is accounted for. be classi®ed into three categories: global, relational and Several mechanisms have been incorporated to cir- local. The global predicates include the domain length, cumvent problems caused by variations of the attributes the number of helices and the number of strands. Other such as length but also caused by the insertions and machine learning techniques, such as neural network deletions. Intervals of values are used by the global pre- and decision trees, only use this type of representation. dicates and the values for the boundaries are learnt by Relational information comes from two different the algorithm. For the predicates adjacent and sources in this application. The ®rst source is provided length_loop, we have used prede®ned intervals. by the predicate adjacent, which describes the Another measure was to number the helices and strands relationship between two consecutive secondary struc- ture elements. The other source comes from the associ- separately to allow the learning algorithm to rely on the ation of variables, as exempli®ed by the rule for the most conserved numbering scheme. Finally, for the pre- b-Grasp fold (see Table 2), whose signature involves two dicates modelling local information, the actual values adjacent predicates sharing a secondary structure have been replaced by symbolic constants. The distri- bution of the length of the secondary structure, average which effectively de®nes a triple, b2-a1-b3, although no triple relationship was explicitly encoded in the back- hydrophobicity and hydrophobic moment were ground knowledge (see Table 6). The ability to manip- calculated for the four main classes. If the actual ulate relational information is a distinctive feature of ILP value was lower or equal to the mean minus (or plus) systems. To incorporate this information into a neural two standard deviations the value was replaced by network for example, all the possible pairs for very_lo (very_hi), if it was lower or equal to the adjacent and all possible associations would need to mean minus (or plus) one standard deviation the value be prede®ned. Not only would this be a tedious process was assigned lo (hi). Automated Discovery of Structural Signatures 601

The background knowledge also contains two types of This constitutes a reservoir of information that can be constraints. Integrity constraints are used to ensure that used to construct new rules. In the second step, Progol every rule considered contains at least one of the follow- searches the space of all possible rules, starting with the ing predicates: length_sec_structure, average_- most general one, which is ``everything is a Globin'': hydrophobicity, hydrophobic_moment, length_loop or has_pro. In a preliminary study,48,49 [C:-8,13,20,0 fold('Globin', X).] we compared the rules obtained in the presence or absence of integrity constraints. We concluded that for- The search uses a branch-and-bound-like algorithm cing the rules to contain local features adds complexity guided by a measure of compression. This measure and richness to the rules as judged by our knowledge of depends on the number of positive and negative protein structure. The second form of constraint prevents examples covered as well as the length of the clause. The rules from having more than two adjacent predicates rule is specialised: ``every domain such that the ®rst and allows Progol to search the entire hypothesis space. helix is followed by another helix'':

Machine learning algorithm [C:-6,13,17,0 fold('Globin', X) :- adja- cent(X,A,B,1,h,h).] The results were obtained with the ILP system Pro- gol version 4.4.50 A detailed presentation of the algor- which leads to a new value for the compression measure. ithm can be found by Muggleton et al.51,52 and only The clause is further specialised: ``every domain such the main steps will be outlined here. The learning that the ®rst helix is followed by another helix and procedure involves two steps. First, a positive example another helix''. is randomly selected and the most speci®c clause is calculated; it consists in ®nding all related predicates [C:-2,13,12,0 fold('Globin', X) :- adja- 52 from the background knowledge. The second step cent(X,A,B,1,h,h), consists in using the predicates found in the ®rst step adjacent(X,B,C,2,h,h).] to construct a hypothesis which maximises com- ... pression (see equation (1)). The best hypothesis is selected and the examples it matches are removed. The procedure resumes with the ®rst step and At the end of the search, Progol has found the rule that continues until no more examples remain. maximises its measure of compression: f ˆ 8,p ˆ 13,n ˆ 1,h ˆ 0 [Result of search is] Example of the execution of Progol fold('Globin', X) :- adjacent(X,A,B,1,h,h), The learning algorithm performs two main steps: (i) the construction of the most speci®c clause and (ii) a adjacent(X,B,C,2,h,h), general to speci®c search to ®nd an optimal rule. To total_length(135 ˆ

Cross-validation 14. Hofmann, K., Bucher, P., Falquet, L. & Bairoch, A. (1999). The Prosite database, its status in 1999. Nucl. Performance analyses were carried out over cross-vali- Acids Res. 27, 215-219. dation test sets. Two different tests were applied 15. Attwood, T. K., Croning, M. D. R., Flower, D. R., depending on how much data were available. If the total Lewis, A. P., Mabey, J. E., Scordis, P., Selley, J. N. & number of examples (positive ‡ negative) was greater Wright, W. (2000). PRINTS-S: the database formerly than 60, a tenfold cross-validation test was applied, known as PRINTS. Nucl. Acids Res. 28, 225-227. otherwise it was leave-one-out. This made it possible to 16. Smith, R. F. & Smith, T. F. (1990). Automatic gener- run the entire cross-validation under two days using six ation of primary sequence patterns from sets of of the 12 processors of our SiliconGraphics Power Chal- related protein sequences. Proc. Natl Acad. Sci. USA, lenge computer. The ®nal rules were learnt on the com- 97, 118-122. plete data sets. 17. Saqi, M. A. & Sternberg, M. J. E. (1994). Identi®- cation of sequence motifs from a set of proteins with related function. Protein Eng. 7, 165-71. 18. Rigoutsos, I., Floratos, A., Ouzounis, C., Gao, Y. & Parida, L. (1999). Dictionary building via unsuper- vised hierarchical motif disacovery in the sequence space of natural proteins. Proteins: Struct. Funct. Genet. 37, 264-277. Acknowledgements 19. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. & Wootton, J. C. (1993). This work is supported by a BBSRC/EPSRC Bioinfor- Detecting subtle sequence signals: a Gibbs sampling matics grant. strategy for multiple alignment. Science, 262, 208- 214. 20. Johassen, I. (1997). Ef®cient discovery of conserved References patterns using a pattern graph. CABIOS, 13, 509-522. 1. Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., 21. Artymiuk, P., Poirrette, A. R., Grindley, H. M., Rice, Chance, M. R., Gaasterland, T., Lin, D. W., SÏali, A., D. W. & Willett, P. (1994). A graph-theoretic Studier, F. W. & Swaminathan, S. (1999). Structural approach to the identi®cation of three-dimensional genomics: beyond the human genome project. patterns of amino acid side-chains in protein struc- Nature Genet. 23, 151-157. tures. J. Mol. Biol. 243, 327-344. 2. Berman, H. M., Westbrook, J., Feng, Z., Gilliland, 22. Russell, R. B. (1998). Detection of protein three- G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & dimensional side-chain patterns - new examples of Bourne, P. E. (2000). The . Nucl. convergent evolution. J. Mol. Biol. 279, 1211-1227. Acids Res. 28, 235-242. 23. Jonassen, I., Eidhammer, I. & Taylor, W. R. (1999). 3. Muggleton, S. & De Raedt, L. D. (1994). Inductive Discovery of local packing motifs in protein struc- logic programming: theory and methods. J. Logic tures. Proteins: Struct. Funct. Genet. 34, 206-219. Programming, 19/20, 629-679. 24. Mirny, L. A. & Shakhnovich, E. I. (1999). Universally 4. LoConte, L., Ailey, B., Hubbard, T. J. P., Brenner, conserved positions in protein folds: reading evol- S. E., Murzin, A. G. & Chothia, C. (2000). SCOP: a utionary signals about stability, folding kinetics and structural classi®cation of proteins database. Nucl. function. J. Mol. Biol. 291, 177-196. Acids Res. 28, 257-259. 25. Bashford, D., Chothia, C. & Lesk, A. M. (1987). 5. Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, Determinants of a protein fold. Unique features of A. E., Harrison, A. P., Thornton, J. M. & Orengo, the globin amino acid sequences. J. Mol. Biol. 196, C. A. (2000). Assigning genomic sequences to 199-216. CATH. Nucl. Acids Res. 28, 277-282. 26. Wierenga, R. K., Terpstra, P. & Hol, W. G. J. (1986). 6. Holm, L. & Sander, C. (1998). Touring protein fold Prediction of the occurence of the ADP-binding b-a- space with DALI/FSSP. Nucl. Acids Res. 26, 316-319. b-fold in proteins, using and amino acid sequence 7. Chothia, C. & Finkelstein, A. V. (1990). The classi®- ®ngerprint. J. Mol. Biol. 187, 101-107. cation and origins of protein folding patterns. Annu. 27. Murzin, A. G. & Bateman, A. (1997). Distant hom- Rev. Biochem. 59, 1007-1039. ology recognition using structural classi®cation of 8. Holm, L. & Sander, C. (1998). Dictionary of recur- proteins. Proteins: Struct. Funct. Genet. Supplement rent domains in protein structures. Proteins: Struct. 1, 105-112. Funct. Genet. 33, 88-96. 28. Baldi, P. & Brunak, S. (1998). Bioinformatics: The 9. Salem, G. M., Hutchinson, E. G., Orengo, C. A. & Machine Learning Approach, MIT Press. Thornton, J. M. (1999). Correlation of observed fold 29. Muggleton, S., King, R. & Sternberg, M. J. E. (1992). frequency with the occurrence of local structural Protein secondary structure prediction using logic- motifs. J. Mol. Biol. 287, 969-981. based machine learning. Protein Eng. 5, 647-657. 10. Shakhnovich, E. (1997). Theoretical studies of pro- 30. King, R. D., Muggleton, S., Lewis, R. A. & tein folding thermodynamics and kinetics. Curr. Sternberg, M. J. (1992). Drug design by machine Opin. Struct. Biol. 7, 29-40. learning: the use of inductive logic programming to 11. Fersht, A. (1997). Nucleation mechanism in protein model the structure-activity relationships of tri- folding. Curr. Opin. Struct. Biol. 7, 3-9. methoprim analogues binding to dihydrofolate 12. Dobson, C. M. & Karplus, M. (1999). The fundamen- reductase. Proc. Natl Acad. Sci. USA, 89, 11322-11326. tals of protein folding: bringing together theory and 31. Hirst, J. D., King, R. D. & Sternberg, M. J. E. (1994). experiment. Curr. Opin. Struct. Biol. 9, 92-101. Quantitative structure-activity relationships by 13. Kasuya, A. & Thornton, J. M. (1999). Three-dimen- neural networks and inductive logic programming. sional structure analysis of prosite patterns. J. Mol. I. The inhibition of dihydrofolate reductase by pyri- Biol. 286, 1673-1691. midines. J. Comput. Aided Mol. Des. 8, 405-420. Automated Discovery of Structural Signatures 603

32. King, R. D., Clark, D. A., Shirazi, J. & Sternberg, learning of protein three-dimensional fold signa- M. J. (1994). On the use of machine learning to tures. Machine Learn. 43, 81-95. identify topological rules in the packing of beta- 50. Muggleton, S. & Firth, J. (1999). CProgol4.4: theory strands. Protein Eng. 7, 1295-1303. and use. In Inductive Logic Programing and Knowledge 33. King, R. D., Muggleton, S. H., Srinivasan, A. & Discovery in Databases (DzÏeroski, S. & Lavrac, N., Sternberg, M. J. E. (1996). Structure-activity relation- eds), in the press. ships derived by machine learning: the use of atoms 51. Muggleton, S. (1992). Inductive Logic Programming, and their bond connectivities to predict mutageni- Academic Press, London. city by inductive logic programming. Proc. Natl 52. Muggleton, S. (1995). Inverser Entailment and Pro- Acad. Sci. USA, 93, 438-442. gol. New Gen. Comput. J. 13, 245-286. 34. Harrison, S. C. & Aggarwal, A. K. (1990). DNA rec- 53. Pearson, W. R. & Lipman, D. J. (1988). Improved ognition by proteins with the helix-turn-helix motif. tools for biological sequence analysis. Proc. Natl Annu. Rev. Biochem. 59, 933-969. Acad. Sci. USA, 85, 2444-2448. 35. Aggarwal, A. K., Rodgers, D. W., Drottar, M., 54. Pesce, A., Couture, M., Dewilde, S., Guertin, M., Ptashne, M. & Harrison, S. C. (1988). Rocognition of Yamauchi, K., Ascenzi, P., Moens, L. & Bolognesi, a DNA operator by the repressor 434: a view at M. (2000). A novel two-over-two a-helical sandwich high resolution. Science, 242, 899-907. fold is characteristic of the truncated hemoglobin 36. MondragoÂn, A. & Harrison, S. C. (1991). The phage family. EMBO J. 19, 2424-2434. Ê 55. Eisenberg, D., Weiss, R. M. & Terwilliger, T. C. 434 Cro/OR1 complex at 2.5 A resolution. J. Mol. Biol. 219, 321-334. (1984). The hydrophobic moment detects periodicity 37. Calladine, C. R. & Drew, H. R. (1986). Principles of in protein hydrophobicity. Proc. Natl Acad. Sci. USA, sequence-dependent ¯exure of DNA. J. Mol. Biol. 81, 140-144. 192, 907-918. 56. Kraulis, P. J. (1991). MOLSCRIPT: a program to pro- 38. Otwinowski, Z., Schevitz, R. W., Zhang, R. G., duce both detailed and schematic plots of protein Lawson, C. L., Joachimiak, A., Marmorstein, R. Q., structures. J. Appl. Crystallog. 24, 946-950. Luisi, B. F. & Sigler, P. B. (1988). Crystal structure of trp repressor/operator complex at atomic resolution. Nature, 335, 321-329. Appendix 39. Klemm, J. D., Rould, M. A., Aurora, R., Herr, W. & Pabo, C. O. (1994). Crystal structure of the Oct-1 A Testing the rules on the latest release of POU domain bound to an octamer site: DNA recog- SCOP nition with tethered DNA-binding modules. Cell, 77, 21-32. Rules derived automatically from the version 40. Rozwarski, D. A., Gronenborn, A. M., Clore, G. M., 1.39 of SCOP, were tested on new entries only Bazan, J. F., Bohm, A., Wlodawer, A., Hatada, M. & found in the latest release, 1.50. This analysis sup- Karplus, P. A. (1994). Structural comparisons among ports the same conclusions as the cross-validation the short-chain helical cytokines. Structure, 2, 159- test and provides further evidence that the rules 173. represent genuine features of protein fold and 41. Kruse, N., Shen, B.-J., Arnold, S., Tony, H.-P., function. MuÈ ller, T. & Sebald, W. (1993). Two distinct func- tional sites of human interleukin 4 are identi®ed by To establish a list of new entries, we compared variants impaired in either receptor binding or all the sequences of SCOP 1.50 to those of version 53 5 receptor activation. EMBO J. 12, 5121-5129. 1.39 using fasta and a cutoff E value of 10 42. Russell, R. B., Sasieni, P. D. & Sternberg, M. J. E. (number of matches expected by chance). The (1998). Supersites within superfolds. Binding site resulting test set was made non-redundant at 40 % similarity in the absence of homology. J. Mol. Biol. identity. All entries with no match with the old 282, 903-918. SCOP were included in the study. To allow for a 43. Gerstein, M. & Levitt, M. (1998). Comprehensive direct comparison with the results reported in the assessment of automatic structural alignment against main text, the new test sets were constructed with a manual standard, the scop classi®cation of pro- the same ratio of positive and negative examples, teins. Protein Sci. 7, 445-456. i.e. 1/3 and 2/3, respectively. For the globins and 44. Bourne, P. E. (1999). Editorial. Bioinformatics, 15, 715- 716. interleukin-8 folds there were no new entries that 45. Kelley, L. A., MacCallum, R. M. & Sternberg, M. J. E. could not be detected by sequence comparison (2000). Enhanced genome annotation using struc- method alone, therefore they are not included. tural pro®les in the program 3D-PSSM. J. Mol. Biol. Table A1 summarises the results. The overall 299, 510-522. accuracy is 73(Æ2), compared to 74(Æ1) measured 46. Hutchinson, E. G. & Thornton, J. M. (1996). on cross-validation data-sets. Both accuracies are PROMOTIF - a program to identify and analyze higher than that of a random prediction (with 99 % structural motifs in proteins. Protein Sci. 5, 212-220. con®dence level) and the difference between the 47. Clocksin, W. & Mellish, C. (1981). Programming in two is not signi®cant (with 99 % con®dence level). Prolog, Springer-Verlag, Berlin. Since the number of new entries is low, it is dif®- 48. Turcotte, M., Muggleton, S. & Sternberg, M. (1998). cult to analyse the rules individually; Table A2 lists Protein fold recognition. In Proc. of the 8th Inter- national Workshop on Inductive Logic Programming the rules for folds with more than three positive (ILP-98) (Page, C., ed.), pp. 53-64, LNAI 1446, examples. The rules for DNA-binding three-helical Springer-Verlag, Berlin. bundle, immunoglobulin, P-loop, a/b-hydrolases 49. Turcotte, M., Muggleton, S. & Sternberg, M. (2000). and ferredoxin performed well on both statistics, The effect of relational background knowledge on recall and precision. However, the rules for OB- 604 Automated Discovery of Structural Signatures

Table A1. A test of time, applying the rules learnt from SCOP 1.39 to new entries only found in the latest release, 1.50

Fold Examples Families Superfamilies FN % Accuracy All-a: DNA 3-helical 11 11 3 4 85 Æ 6 EF hand-like 5 4 2 2 60 Æ 13 4-Helical cytokines 2 2 1 1 67 Æ 19 Lambda repressor 1 1 1 0 100 Æ 0 All-b: Ig beta-sandwich 19 9 7 9 81 Æ 5 Tryp ser proteases 2 2 1 2 67 Æ 19 OB-fold 12 5 3 10 68 Æ 8 SH3-like barrel 4 4 3 2 57 Æ 13 Lipocalins 2 2 1 1 83 Æ 15 a/b: b/a (TIM)-barrel 21 18 14 11 72 Æ 6 Rossmann-fold 3 2 6 22 76 Æ 7 P-loop 16 10 1 9 70 Æ 6 Periplasmic II 2 1 1 1 83 Æ 15 a/b-Hydrolases 5 5 1 2 67 Æ 12 a ‡ b: Ferredoxin-like 16 13 10 7 77 Æ 6 Zincin-like 2 2 1 1 50 Æ 20 SH2-like 2 2 1 2 75 Æ 15 Beta-Grasp 9 5 5 6 67 Æ 9 73 Æ 2 For each fold, the Table lists the number of positive examples, the number of families and superfamilies, the number of false negative examples (FN), i.e. positive examples which are not represented by any of the rules, and the accuracy. fold and Rossmann-fold have not, although the secondary structures speci®ed by the rule, but number of new examples is large, 12 and nine, failed because the length of the ®rst strand is out respectively. Looking at the positive examples of range. In three cases, the ®rst strand is too long, missed by the ®rst rule of the OB-fold, we ®nd that it is classi®ed as very_long. For example, the seven out of 12 have the correct ordering of the ®rst strand of the N-terminal domain of aspartyl-

Table A2. Protein signatures learnt on SCOP 1.39 applied to new entries only found in SCOP 1.50

All-a: DNA-binding 3-helical bundle: (recall ˆ 36 %, precision ˆ 100 %, n ˆ 11). The length is 34 to 105 residues, there are exactly three helices, the loop between the 1st and 2nd helix is two to four residues long. EF Hand-like: (recall ˆ 20 %, precision ˆ 100 %, n ˆ 5). The 1st strand is followed by a helix, the 2nd strand is immediately followed by a helix. All-b: Immunoglobulin b-sandwich: (recall ˆ 24 %, precision ˆ 63 %, n ˆ 21). There is at most one helix, the loop between the 5th and 6th strands is three to seven residues long. Trypsin serine proteases: (recall ˆ 0 %, precision ˆ N/A, n ˆ 2). The loop between the 12th and 13th strand is four to 12 residues long. OB-fold: (recall ˆ 0 %, precision ˆ 0%, n ˆ 12). The 1st strand is long and followed by another strand. The 1st helix is followed by a strand. OB-fold: (recall ˆ 17 %, precision ˆ 67 %, n ˆ 12). There are five or six strands, the loop between the 1st and 2nd strand is two to four residues long, the 4th strand is followed by another strand. SH3-like barrel: (recall ˆ 25 %, precision ˆ 50 %, n ˆ 4). There are four to six strands, the loop between the 3rd and 4th strand is one to three residues long. a/b class: NAD(P)-binding Rossmann-fold domains: (recall ˆ 11 %, precision ˆ 33 %, n ˆ 9). The 1st strand is followed by a helix, the two structures are separated by a short connection, about one residue. Also, the 6th strand is followed by a helix. P-loop containing nucleotide triphosphate hydrolases: (recall ˆ 44 %, precision ˆ 54 %, n ˆ 16). The loop which connects 1st strand and following helix is three to seven residues long. a/b-Hydrolases: (recall ˆ 60 %, precision ˆ 75 %, n ˆ 5). The loop connecting the 1st strand and the one that follows is one to three residues long and the 7th helix is followed by a strand. a ‡ b class: Ferredoxin-like: (recall ˆ 44 %, precision ˆ 88 %, n ˆ 16). There are 3 to 5 b-strands, the 1st strand is followed by a helix and the loop between the 2nd helix the strand that follows it is two to four residues long. b-Grasp: (recall ˆ 22 %, precision ˆ 50 %, n ˆ 9). The domain contains the following motif: b2-a1-b3 and the loop between a1 and b3 is two to four residues long. The statistics in parentheses are: recall, de®ned as true positive/total number of positive examples, precision, de®ned as true posi- tive/total number of prediction, n is the total number of positive examples. Automated Discovery of Structural Signatures 605 tRNA synthetase (PDB 1b8a) is 15 amino acid resi- one domain satis®es the second part of the rule dues long and our de®nition for long requires 11 which requires the sixth strand to be followed by a to 13 amino acid residues. This suggests that the helix. For two domains, a sheet is inserted and dis- categories should perhaps be de®ned hierarchi- rupts the numbering scheme. For the other two cally, that is very_long secondary structures domains the topology of the sheet is different from should also be classi®ed as long as well. For the the usual 321456: for L-alanine dehydrogenase remaining ®ve examples, the strands are too short. (PDB 1pjc) the topology is 3214576 and for S-ade- Looking at the structures reveals a problem with nosylhomocystein hydrolase (PDB 1a7a) the top- the de®nition of the boundaries of the secondary ology is 32145876, here even structure structures. In aspartyl-tRNA synthetase the ®rst superposition would not help. strand bridges both sides of the open barrel, whilst The recent determination of two new globin in TIMP-2 (PDB 1bqq) the equivalent region structures reminds us that even well established contains two strands, which when considered rules are sometimes violated. Those are the ®rst equivalent to the same region in 1b8a, would crystal structures representative of the truncated amount to 14 amino acid residues. Such problems 54 can only be solved by including superposition hemoglobin (trHb) family. Unlike the classical information in the background knowledge. The fold, forming a three-over-three a-helical sandwich, situation for the Rossmann fold is slightly different. the trHbs form a two-over-two a-helical sandwich. Ê The requirements for the ®rst part of the rule, i.e. The two structures are very similar, 1.3 A the ®rst strand followed by a helix separated by a un-weighted r.m.s. over 115 residues and share short coil, are satis®ed by most domains. In all 37 % identity. The protist structure (PDB 1dlw, positive examples, the ®rst strand is followed by a Paramecium caudatum) has no proline residue in the helix, and for ®ve out of nine domains the length BC corner but the green algae structure (PDB 1dly, of the loop is in the correct range. However, only Chlamydomonas eugametos) does.

Edited by J. Thornton

(Received 6 April 2000; received in revised form 20 December 2000; accepted 22 December 2000)