research papers

Acta Crystallographica Section D: Biological Crystallography, ISSN 0907-4449

The SUPERFAMILY database in structural genomics

Julian Gough

MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, England

Correspondence e-mail: [email protected]

Received 31 January 2002
Accepted 21 August 2002

The SUPERFAMILY library representing all proteins of known structure predicts the domain architecture of protein sequences and classifies them at the SCOP superfamily level. This analysis has been carried out on all completely sequenced genomes. The ways in which the database can be useful to crystallographers are discussed, in particular with a view to high-throughput structure determination. The application of the SUPERFAMILY database to different target-selection strategies is suggested: novel folds, novel domain combinations and targeted attacks on genomes. Use of the database for more general inquiry in the context of structural studies is also explained. The database provides evolutionary relationships between target proteins and other proteins of known structure through the SCOP database, genome assignments and multiple sequence alignments.

1. Introduction

As a result of the increase in the rate of experimental determination of DNA sequences and the consequent success of the genome-sequencing projects, there are now (publicly available) over 60 completely sequenced genomes spanning all kingdoms of life. More recently, there have arisen structural genomics projects (Smith, 2000), which are expected to begin reaching their production phases during 2002–2003. These projects will accelerate the production of new protein structures and hence significantly increase the available structural data. Because of the difficulty and cost of solving three-dimensional structures, it is not possible to attempt to solve the structure of every protein. A more targeted approach than that used by the sequencing projects is needed to maximize the return on the projects' efforts. Using experimental and computational tools, targets can be selected which are expected to be in some way novel. Some of the ways in which the SUPERFAMILY (Gough & Chothia, 2002) database can contribute to targeting strategies are discussed here.

Most targeting strategies aim to achieve as complete as possible coverage of something, for example a proteome or a functional pathway. There are a limited number of common structural folds (Chothia, 1992) and some targeting strategies of structural genomics projects will significantly increase the proportion of this limited number for which we have a structural representative. The improved completeness of fold coverage will change the current view of protein-structure space.

2. SUPERFAMILY

The SUPERFAMILY database (Gough & Chothia, 2002; Gough et al., 2001) is a library of hidden Markov models (HMMs; Eddy, 1996; Hughey & Krogh, 1996; Krogh et al., 1994) of domains of known structure created using the SAM (Karplus et al., 1998) software package. Services and data are available at http://supfam.org.

© 2002 International Union of Crystallography. Printed in Denmark – all rights reserved.
Acta Cryst. (2002). D58, 1897–1900

2.1. What it does

The purpose of SUPERFAMILY is to detect and classify in protein sequences evolutionary domains for which there is a known structural representative. Given a protein about which nothing is known other than the amino-acid sequence, the object is to assign known structural domains or, more specifically, domains at the SCOP (Murzin et al., 1995) superfamily level.

2.1.1. SCOP. The SCOP database classifies all proteins in the PDB (Berman et al., 2000) into domains which are hierarchically organized at levels of similarity. Small proteins consist of a single domain, whereas medium-sized proteins may consist of one or more domains. Large proteins consist of multiple domains. A domain is defined as the minimum evolutionary unit, so a protein will only have parts classified into separate domains if those parts are observed independently in nature, either on their own or in combination with other domains.

Structural, functional and sequence information is used to group together in the hierarchy at the superfamily level domains for which there is evidence for a common evolutionary ancestor. Domains belonging to the same superfamily have very similar structure and hence usually the same or a related function.

2.2. How it does it

SUPERFAMILY uses a library of HMMs representing all superfamilies in SCOP. HMMs are sequence profiles very similar to PSI-BLAST (Altschul et al., 1997) profiles. Profiles are built from multiple sequence alignments and represent a group of sequences (in this case a superfamily) rather than a single sequence. The profile should embody the features that characterize the superfamily which it is supposed to represent.

By comparing a sequence to a profile, far more distant relationships can be detected than by comparing two sequences, which is what pairwise methods such as BLAST (Altschul et al., 1990) and FASTA (Pearson & Lipman, 1988) do.
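The difference between profile and pairwise comparison can be illustrated with a toy calculation. The sketch below is not the SAM software or a true HMM (it has no insertion or deletion states); it is a simple position-specific log-odds profile built from an invented six-residue alignment, with a uniform background distribution and a pseudocount of one assumed for illustration. All sequences and names (`FAMILY`, `DISTANT`, `UNRELATED`, `build_profile` and so on) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
BACKGROUND = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (bits) per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    total = len(alignment) + pseudocount * len(ALPHABET)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / BACKGROUND)
            for aa in ALPHABET
        })
    return profile


def profile_score(profile, query):
    """Sum the column scores of an ungapped query of the same length."""
    if len(query) != len(profile):
        raise ValueError("query length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, query))


def pairwise_identity(a, b):
    """Number of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b))


def best_identity(family, query):
    """Best pairwise identity of the query against any single family member."""
    return max(pairwise_identity(member, query) for member in family)


# Invented family: columns 0 (G) and 3 (L) are conserved, the rest vary.
FAMILY = ["GKTLAV", "GRSLCI", "GHTLMV", "GKALFL"]
DISTANT = "GEDLWY"    # keeps only the two conserved positions
UNRELATED = "AKTWYE"  # matches one member at two variable positions

PROFILE = build_profile(FAMILY)

for name, seq in (("distant", DISTANT), ("unrelated", UNRELATED)):
    print(name,
          "best pairwise identity:", best_identity(FAMILY, seq),
          "profile score: %.2f bits" % profile_score(PROFILE, seq))
```

In this toy case both queries share the same number of identical residues with their closest family member, so a pairwise identity count cannot separate them. The profile gives more weight to the conserved columns and therefore scores the sequence that retains them more highly.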

2.3. Why use it?

To compare the ability of different methods to detect and classify domains in SCOP (Murzin et al., 1995), a test was carried out, the results of which are shown in Fig. 1. The test comprises an all-against-all search of sequences of structures in SCOP filtered to 40% sequence identity (Brenner et al., 2000). The SCOP classification is then used to decide whether the relationships detected are true or false.

Figure 1. The number of true and false positives found by four different methods when scoring all sequences of structures in SCOP filtered to 40% sequence identity.

It is clear from Fig. 1 that profile methods (SAM-T99, PSI-BLAST) perform far better in this test of remote homology detection than pairwise methods (WU-BLAST). SUPERFAMILY is based upon SAM-T99 (Karplus et al., 1998), but improves upon and adds to it, specifically aiming at SCOP superfamily classification. For more information, please refer to Gough et al. (2001).

2.3.1. Accuracy and reliability. The E-value scores provided by the SAM software which are used for the assignments provide a theoretical value for the expected error rate. Large-scale tests show that these theoretical expectations are very close to the observed error rates (Gough et al., 2001). Close examination of hundreds of assignments where the results are checked (Bujnicki et al., 2001) indicates a 1% error rate.

Examination of the assignments made by SUPERFAMILY at different error rates (E-value thresholds) in Fig. 2 shows that in the critical region the number of assignments is not highly sensitive to the threshold chosen.

Figure 2. The number of domains assigned by SUPERFAMILY to the Escherichia coli genome at different error rates (using different E-value thresholds). In this case, the E values are calculated such that at 0.01 the expected error rate is 1%. The curve shows that in the critical region around the 1% error rate the number of domains assigned is not very sensitive to the threshold.

3. Applications to structural genomics

The HMM library was designed for genome analysis leading to evolutionary studies (Apic et al., 2001a; Teichmann et al., 2001), but also has applications in structural genomics.

3.1. Target selection

The aim of structural genomics projects is to solve new structures which give us a more complete view of the world of [...]. There are several different views of the structural world, which lead to alternative approaches to the selection of target proteins for the projects for structure determination. Some approaches which may be aided by the use of the SUPERFAMILY database are described here.

3.1.1. Novel folds. There are a limited number of common protein folds in nature (Chothia, 1992), many of which have at least one representative structure solved. There are, however, many folds which have not yet been determined and it is the aim of some projects to determine the structure of proteins with novel folds, with a view to generating complete coverage of fold space. Any target sets which have been designed to find novel folds can be scanned against the HMM library to see if they might belong to a superfamily for which there is already a structural representative. The effect of this is to remove distant homologues to known folds, but since a negative result is inconclusive, the aim is not to identify novel folds directly.

Of the new structures which are deposited in the PDB every week, half of those with no sequence homology to any other PDB sequence using BLAST (E value > 0.1) can be assigned by the HMM library to a superfamily with a known structural representative. Of those which cannot be assigned, very roughly half have novel folds. This test was independently carried out by LiveBench (Bujnicki et al., 2001).

3.1.2. Domain combinations. The HMM library has been used to assign structural domains to sequences in all of the completely sequenced genomes (see Fig. 3). Assignments currently cover approximately 40% of the amino acids in eukaryote genomes (50% of the sequences) and 45% of the prokaryote genomes (55% of the sequences). Genome sequences may have had some of their domains assigned but not all, hence the discrepancy between the coverage of amino acids and number of sequences.

Figure 3. The SUPERFAMILY coverage of 58 complete genomes. The coverage of genes is the percentage of genes with at least one domain assignment; the coverage of sequence is the percentage of amino acids in all genes covered by domain assignments. The two-letter codes for the genomes shown in this graph are those defined by the SUPERFAMILY database. The full names are available at http://supfam.org/SUPERFAMILY/cgi-bin/genome_names.cgi.

In the PDB there are structures of proteins with different combinations of domains. By examining the assignments of domains to genome sequences, it is possible to observe protein sequences with combinations of domains for which a structure has not yet been solved (Apic et al., 2001a,b). These novel combinations provide targets for structural genomics projects and are available at http://www.mrc-lmb.cam.ac.uk/genomes/DomCombs. Although the novel aspect of these targets is not the fold but the combination of folds, these targets offer certainty in this aspect. It is not possible (using any method) to find targets which are certain to have a novel fold.

3.1.3. Targeted attacks. The aim of some projects is to achieve maximum structural coverage of a particular genome, e.g. Mycobacterium tuberculosis (http://www.doe-mbi.ucla.edu/TB/pubs.php). As explained above, approximately half of all genome sequences are assigned to a superfamily for which there is a known structural representative. The genome analysis provides a good starting point for target selection on any project targeted at a particular genome.

3.2. General inquiry

As well as target selection, the model library can be used more generally to obtain information about a protein which could be relevant to structural studies.

3.2.1. Evolutionary relationships. If the structure of a protein is known or if there is a similarity to a protein of known structure, the SCOP database provides the classification at the superfamily level which links the domains of proteins to others with a common ancestor. As mentioned before, if no relationship to a structure is known the model library may be able to detect one. SCOP also subgroups superfamily domains into families which are more closely related and usually have the same function.

3.2.2. Genome occurrence. Once the superfamily is known, it is possible through the genome analysis to see the occurrence of the members in the different genomes. For a given superfamily, all of the members in every genome that have been assigned by SUPERFAMILY are listed. The distribution across genomes may reveal interesting features such as particular species which are missing the superfamily in question or which for some reason have a much greater number of members than others. The distribution across the different kingdoms of life may also be of interest.

3.2.3. Sequence alignments. One of the services provided on the World Wide Web is multiple sequence alignment. There are alignments of PDB sequences belonging to the same superfamily and it is possible for the user to add their own sequences to the alignment. All of the genome sequences from the different organisms which have been assigned to the same superfamily can also be aligned. Multiple sequence alignments of homologous proteins reveal patterns of evolutionary conservation which represent the structural and functional constraints on the protein. There is an automatic statistical analysis of the multiple alignments designed to detect features and aid such analysis.

Thanks to [...] for discussions and to the Medical Research Council for funding.

References

Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403–410.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Nucleic Acids Res. 25, 3389–3402.
Apic, G., Gough, J. & Teichmann, S. A. (2001a). J. Mol. Biol. 310, 311–325.
Apic, G., Gough, J. & Teichmann, S. A. (2001b). Bioinformatics, 17, Suppl. 1, S83–S89.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Brenner, S. E., Koehl, P. & Levitt, M. (2000). Nucleic Acids Res. 28, 254–256.
Bujnicki, J. M., Elofsson, A., Fischer, D. & Rychlewski, L. (2001). Bioinformatics, 17, 750–751.
Chothia, C. (1992). Nature (London), 357, 543–544.
Eddy, S. R. (1996). Curr. Opin. Struct. Biol. 6, 361–365.
Gough, J. & Chothia, C. (2002). Nucleic Acids Res. 30, 268–272.
Gough, J., Karplus, K., Hughey, R. & Chothia, C. (2001). J. Mol. Biol. 313, 903–919.
Hughey, R. & Krogh, A. (1996). Comput. Appl. Biosci. 12, 95–107.
Karplus, K., Barrett, C. & Hughey, R. (1998). Bioinformatics, 14, 846–856.
Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler, D. (1994). J. Mol. Biol. 235, 1501–1531.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). J. Mol. Biol. 247, 536–540.
Pearson, W. R. & Lipman, D. J. (1988). Proc. Natl Acad. Sci. USA, 85, 2444–2448.
Smith, T. (2000). Nature Struct. Biol. 7, 927.
Teichmann, S. A., Rison, S. C., Thornton, J. M., Riley, M., Gough, J. & Chothia, C. (2001). J. Mol. Biol. 311, 693–708.
