Exploring the Structure and Function Paradigm Oliver C Redfern, Benoit Dessailly and Christine a Orengo
Total Page:16
File Type:pdf, Size:1020Kb
Available online at www.sciencedirect.com Exploring the structure and function paradigm Oliver C Redfern, Benoit Dessailly and Christine A Orengo Advances in protein structure determination, led by the Figure 1 shows that as the international genomics initiat- structural genomics initiatives have increased the proportion of ives gather pace, both the number of sequences and novel folds deposited in the Protein Data Bank. However, these protein families are still growing at an exponential rate, structures are often not accompanied by functional annotations although the rate of expansion of protein families is with experimental confirmation. In this review, we reassess the substantially less. This trend is also observed among meaning of structural novelty and examine its relevance domain families, which are 10-fold fewer (<10 000) than to the complexity of the structure-function paradigm. the number of protein families. By targeting these, the Recent advances in the prediction of protein function from structural genomics initiatives can aim to characterise the structure are discussed, as well as new sequence-based major building blocks of whole proteins and since these methods for partitioning large, diverse superfamilies into domains recur in different combination in the genomes, it biologically meaningful clusters. Obtaining structural data for will be an important step towards understanding the these functionally coherent groups of proteins will allow us to complete structural repertoire in nature. Furthermore, better understand the relationship between structure and the complement of molecular functions found within function. an organism is likely to be even fewer. For example, 97% of proteins in yeast can be assigned one or more of Address 4000 unique GO terms. Therefore, target selection Department of Structural and Molecular Biology, University College strategies that attempt to sample structurally uncharac- London, London WC1E 6BT, United Kingdom terised domain families/subfamilies with distinct func- Corresponding author: Orengo, Christine A. ([email protected]) tions (known or predicted) will help to increase our knowledge of both structure and function space. Current Opinion in Structural Biology 2008, 18:394–402 In this review, we first consider how approaches to This review comes from a themed issue on defining the structural novelty of newly solved structures Sequence and Topology are changing and how this might impact on protein Edited by Nick Grishin and Sarah Teichmann function. Similarly, we explore the ways in which analyses of diverse protein structural families have increased our understanding of how protein functions 0959-440X/$ – see front matter have evolved and how this knowledge can be exploited to Published by Elsevier Ltd. improve methods for predicting function from structure. DOI 10.1016/j.sbi.2008.05.007 We then look at recent approaches to predicting function from sequence, focussing on those that are suitable for annotating genes at the whole genome scale, involving thousands or even tens of thousands of proteins. Such Introduction broad coverage of genomes with functional information is Over the past 10 years several international structural essential not only for interpreting functional genomics genomics initiatives have been funded [1,2] with diverse data but also for dividing sequence space in a more themes. Some are targeting specific biological and biologically focussed way, allowing targeted selection medical systems, whereas others, like the Protein Struc- of structurally uncharacterised functional families for ture Initiative (PSI) in the United States, whilst incorpor- structure determination. ating some biological themes also aim to extend our ability to provide structural data to the increasing number of genomic sequences. By boosting the structural reper- How do we define structural novelty? toire, it is hoped that we will be able to improve our There have been several recent attempts to define a understanding of fold space and how proteins evolve new protein fold [6] or question whether it is even realistic functions. Consequently, the four major PSI Centres have to seek to partition the protein universe at the fold level targeted large sequence families that are most likely to [7,77]. However, it appears that regardless of how struc- adopt novel structures [1](http://www.structuralgenomics. tural novelty is defined, significant structural change org), although these often have little or no functional often mediates divergence in functional properties, annotation [3]. To maximise the biomedical benefit of particularly in the very large, diverse superfamilies that PSI structures, recent reviews have proposed broadening tend to dominate the sequenced genomes [8]. Currently, selection criteria to explicitly focus on the relevance to the precise mechanisms by which this occurs are difficult human disease [4] or to provide structural characterisation to fathom given the sparsity of data. For example, in the of families with known functions [5]. largest of these diverse superfamilies, less than 10% of Current Opinion in Structural Biology 2008, 18:394–402 www.sciencedirect.com Structure and function paradigm Redfern, Dessailly and Orengo 395 Figure 1 This figure shows the numbers (as a log scale) of genome sequences (PROTEIN), OFAMs (putative orthologous gene families), CATH-S30 (30% non-redundant sequence families), PFAM families, and CATH domain superfamilies, as genomes are added to the database in increasing order of size. sequence diverse relatives (<30% sequence identity) structural similarities correlate with function, especially have a close homologue with known structure [9]. for the large structural families comprising hundreds of structures, we need fast and sensitive global structure The fact that homologues can change their fold has been comparison methods. A number of benchmark studies commented on previously by Grishin [10] and others. have been performed recently [13,14] which demonstrate The extent of this is becoming clearer as the new gener- the value of using normalised RMSD measures to assess ation of powerful sequence profile (e.g. HMM-HMM) structural similarity. Perhaps the most appropriate means methods have detected some very remote homologues of assessing global similarity, SIMAX [14], normalises by with different folds (see [11] and references therein). the number of aligned residues as a fraction of the number Furthermore, extreme examples of chameleon sequences of residues in the larger protein, and thus decreases the that can adopt several alternative three-dimensional con- significance of small common structural motifs. figurations [12] have complicated our view of the sequence to structure paradigm. Applying this approach to compare structures classified in CATH (version 3.1) and assuming that two structures have The traditional notion of a fold (as defined by SCOP and ‘similar folds’ or belong to structurally similar groups CATH) is to cluster domains based on a similar arrange- (SSGs) if they superpose with a SIMAX score <5A˚ (see ment and connectivity of core secondary structures. How- Figure 2a), we find that there are approximately 3000 ever, these common elements may account for less than different SSGs classified to date. This contrasts with the half of the domain size and in some fold groups relatives 1000 folds currently identified by SCOP and CATH exhibit as much as a threefold size variation in terms of using rather subjective criteria. In fact, Figure 3a shows their numbers of secondary structures [8]. At what point that the largest 300 CATH superfamilies that are highly these extra secondary structure elements can be thought recurrent in the genomes [8,9], contain nearly 40% of to represent a change in fold is subjective and some have SSGs in CATH and some contain more than 10 different suggested that we can only view structure space as a SSGs (see Figure 3b). Analyses of these very large, diverse continuum [7]. The lack of a clear definition of a protein families in CATH [8] have revealed that relatives with fold becomes especially pertinent when assessing the significant divergence in structure or fold tend to have power of structure prediction methods and the structural different functional roles and protein interactions (Figures novelty that the PSI structures bring to the PDB. 2b and 4). Hence, by solving additional structures from these very large superfamilies, structural genomics could In order to more objectively gauge the novelty of newly enrich our understanding of structure–function relation- deposited structures and determine the extent to which ships within these superfamilies, most of which are struc- www.sciencedirect.com Current Opinion in Structural Biology 2008, 18:394–402 396 Sequence and Topology Figure 2 (a) Two domains which differ by less than 5 A˚ (SIMAX) can vary structurally but share the same fold. (b) Two domains from the galectin-type carbohydrate recognition domain superfamily. The domains are coloured so that residues having the same a-helix or b-sheet conformation in 75% of domains appear red. Domains 1bkzA0 and 1dypA0 are in the similar orientations so that secondary structure embellishments to the core of 1dypA0 can be seen. 1dypA00 has a significant number of extra secondary structure elements, many of which are located at the binding site and are modifying the geometry of the active site. turally under-represented and highly recurrent in the sensu lato, a sensible approach is to study these relationships genomes