Comparative and Functional Genomics Comp Funct Genom 2004; 5: 480–490. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/cfg.414 Feature Workshop — Predicting the Structure of Biological Molecules Isaac Newton Institute, Centre for Mathematical Sciences, Cambridge University, UK, 26–30 April 2004

Damian Counsell* Rosalind Franklin Centre for Genomics Research, Genome Campus, Cambridge CB10 1SB, UK

*Correspondence to: Abstract Damian Counsell, RFCGR, Wellcome Trust Genome This April, in Cambridge (UK), principal investigators from the Mathematical Campus,Cambridge,CB101SB, Biology Group of the Medical Research Council’s National Institute of Medical UK. Research organized a workshop in structural at the Centre for E-mail: Mathematical Sciences. Bioinformatics researchers of several nationalities from [email protected] labs around the country presented and discussed their computational work in biomolecular structure prediction and analysis, and in protein . The meeting was intensive and lively and gave attendees an overview of the healthy state of protein Received: 10 June 2004 bioinformatics in the UK. Copyright  2004 John Wiley & Sons, Ltd. Revised: 18 June 2004 Keywords: protein bioinformatics; structural genomics; predic- Accepted: 21 June 2004 tion; molecular evolution; molecular simulation; structural bioinformatics

This workshop was organized by members of Cambridge, should open proceedings. In his talk, the Group at the UK ‘Structural constraints on protein mutations’, he Medical Research Council’s National Institute for described his work with Rajkumar Sasidharan Medical Research (http://mb2.nimr.mrc.ac.uk/), (http://www.mrc-lmb.cam.ac.uk/genomes/sraj/). Franca Fraternali, Richard Goldstein and Willie It was good for the rest of the meeting that, regard- Taylor. It was a low-key affair, organized late, yet it less of Chothia’s standing, there was no reluctance was probably the best scientific meeting I have ever to challenge his arguments and his contribution attended; I was interested in advance in the content provoked the first of many stimulating discussions of practically every session. Most of the seminars that took place both during and after presenta- were well-prepared, clear, relevant and refreshingly tions. concise. Even allowing for usually well-informed When questioned, Chothia admitted to an inten- questions and interruptions, sessions rarely over- tional looseness with the term ‘positive selection’ ran (or if they did, it didn’t feel that way). Unfor- in his description of the degree and type of residue tunately, because I heard about the meeting only type conservation in different locations in protein shortly before it took place, I was unable to attend structures. He outlined how residue conservation every presentation in full. Although the speakers varied with degree of site exposure and summa- and attendees were of many nationalities, they are rized the residue properties most likely to be shared all currently working in the UK. in the same sites across homologues. Among the After Willie Taylor’s introduction it was appro- intriguing statistics he presented, Chothia noted priate that (Laboratory of Molec- that the normalized frequency of changes in sur- ular Biology, Cambridge, UK; http://www.mrc- face residues was five to six times higher than core lmb.cam.ac.uk/genomes/Cyrus.html), one of the residues. The most ‘selected’ (conserved) residue most prominent and longstanding researchers in positions were least likely to vary in their size the field of protein structure bioinformatics in first, followed by their physicochemical character.

Copyright  2004 John Wiley & Sons, Ltd. Predicting the structure of biological molecules 481

For the least ‘selected’ positions, the priorities were the other uses for the output of his docking sim- reversed. ulations — they can be used, for example, in the In summary: average selectivity values for given prediction of binding patches on proteins. sites in proteins are calculable, the frequency of Everyone I spoke to was especially impressed variation can be explained in terms of the prop- with the volume and depth of analysis that had been erties and locations of the analysed sites, and the performed by Sanne Abeln (http://www.stats.ox. frequency with which residues vary at given sites ac.uk/people/students.htm), still a first-year stu- had a medium correlation with the overall under- dent in Charlotte Deane’s bioinformatics group in lying frequency of random mutations. Richard the Statistics Department at Oxford University. In Goldstein asked if Sasidharan and Chothia’s study ‘Fold usage on genomes and protein structure evo- showed that proteins tended towards robustness lution‘ she described her huge survey of protein and Chothia admitted not. Willie Taylor asked structures across species. She compared the num- about possible resemblances between the substitu- ber of distinct folds with genome size, examined tion matrix derived from Chothia’s structure-based the number of occurrences of folds, ‘duplications’ alignments and the Dayhoff matrix. Chothia said of folds, and families per fold and related them. that the two were similar. She had asked what these data could say about Juan Fernandez-Recio (Crystallography and the ‘ages’ of folds, evolutionary mechanisms and evolutionary relationships between folds. By tak- Biocomputing Unit, Department of , + ; http://www-cryst.bioc. ing large sequence sets (150 genomes from all cam.ac.uk/∼juan/) next presented ‘Protein kingdoms) and widely used bioinformatics tools –protein docking by global energy minimization’, (PSI-BLAST and SCOP), and applying them on a large scale, she not only made too many interest- work that he began in Ruben Abagyan’s lab at the ing observations to list here, but had already begun Scripps Institute [9]. He aims to find general meth- to devise plausible explanations for many of the ods for predicting the structures of protein–protein phenomena she observed. complexes based solely on the structures of the It seems that distributions of the popularity of members of those complexes. Useful because struc- folds are often described by power laws. Some tures of complexes are hard to determine, these folds at least appear to be missing in certain have increased in importance as lower-resolution genomes. The data she collected for αβ proteins structure determination methods have become more are different from folds in the other fold classes powerful and generated more data. (similar comparisons against αβ proteins were While cheaper analyses treating docking part- made at several points over the course of the ners as rigid bodies are easier to calculate, they meeting). Abeln cautioned that it is very difficult produce unrealistic energy landscapes, unlikely to to make phylogenetic trees from this kind of data lead to even approximately correct solutions. Mod- since: els including fully flexible protein structure require the exploration of huge conformational spaces. • There are no clear relations between the different Fernandez-Recio and co-workers seek a compro- measures of fold usage (i.e. occurrences of mise: the first step is to treat structures as rigid folds across genomes, duplications of folds on bodies with ‘soft’ van der Waals’ radii permitting a genome, and families per fold). • atomic overlap; the second step is to permit flex- When a fold diverges to a new fold on one ibility elsewhere. Other efficiencies come through genome, occurrence and duplications are set representing molecules in terms of their internal, back to one, and it is therefore difficult to obtain rather than Cartesian, coordinates. This combina- evolutionary relations between folds from these tion resulted in one of the top ‘blind’ performers in measures. the (CASP-like) CAPRI protein docking prediction Interesting power law-based relations also emer- competition (http://capri.ebi.ac.uk/). Unlike many ged from their analyses of fold distributions across reviews of bioinformatics methods by their devel- families and superfamilies. Just as there had been opers, Fernandez-Recio went on to give examples discussion of Chothia’s use of the term ‘positive of both the successful and unsuccessful applica- selection’, there was some debate over Abeln’s tions of his approach. He also discussed some of allusions to ‘old folds’ in her discussion of the

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. 482 D. Counsell possible evolution of folds. The idea of ‘trapped optimised surfaces (POPS) [10], has already been folds’ having difficulty evolving was another theme integrated into GROMOS96, and demonstrated to which re-emerged later in the week, when Ben be only about 30% slower than vacuum meth- Blackburne described his hugely simplified in silico ods — orders of magnitude cheaper than explicit minimal proteins. water molecular dynamic simulations. The second day was chaired by Richard Gold- In order to obtain an energy term to add the sol- stein and the first speaker, Kenji Mizuguchi vent contribution to the force field, one needs to (Department of Biochemistry, University of Cam- have solvation parameters that, multiplied with the bridge, UK, http://www-cryst.bioc.cam.ac.uk/∼ surface terms, give the free energy of solvation. kenji/), addressed ‘Sequence–structure homology So far, theoreticians have used experimentally- recognition’. Mizuguchi first clearly described the obtained solvation energies of transfer of atoms central problems of homology modelling: identify- from water to vapour. Fraternali sketched out a ing the best structural templates against which to new approach to the calculation of these param- model the sequence of an unknown fold and find- eters that makes use of explicit water simulations ing the best alignment between that sequence and on a selected number of conformations of differ- its target. He was classically biological in his use of ent peptides and proteins. From solute-restrained terminology, distinguishing between the identifica- MD simulations of these conformers, calculated tion of analogous (corresponding, but not related) in explicit water, it is possible to obtain distribu- and truly homologous (corresponding and related) tions of the atomic forces exerted by that water and folds. thereby parametrize the POPS forces accordingly. After an overview of existing methods for fold For the second part of her talk, Fraternali con- recognition and alignment, he outlined FUGUE centrated on more bioinformatic analysis of struc- (http://www-cryst.bioc.cam.ac.uk/fugue/), a sys- tural data using POPS. The method has been tem he developed along with Jiye Shi and Tom parametrized in order to reproduce solvent accessi- Blundell [18]. FUGUE exploits structural data bilities at atomic level (POPSA) and at the residue in the form of environment-specific substitution level (POPSR), based on a training set of about tables — 64 of them — and gap penalties. These 100 proteins of different sizes and topologies. The are applied alongside modern sequence align- formula reproduces accessibilities calculated with ment techniques and refined by testing to see the program NACESS with less than 10% error. how the environment definitions affect perfor- Fraternali has shown how the formula proved mance. Mizuguchi claimed 70–100 hits/day on useful in identifying protein–protein and pro- the FUGUE Website and that the method out- tein–RNA interactions in large macromolecular performs other blind prediction servers in align- assemblies like the ribosome — even based on low ment/assignment. Unfortunately, Mizuguchi’s clear resolution structures (C-α and P atoms only) like explanation of the problems and approach didn’t the 70S ribosome. Differences between the 30S as leave him time to discuss other applications, but a separate subunit and as part of the 70S complex I look forward to reading about them elsewhere (with the 50S subunit) have been highlighted in [19,20,22]. It was also satisfying during question- this way. Because of the presence of the P-tRNA ing afterwards to hear him be sensibly dismissive of in the 70S ribosome, localized conformational rear- any attempt to attach statistical confidence values rangements occurring within the subunits, exposing to FUGUE’s output, given the absence of an under- Arg and Lys residues to negatively charged binding lying mathematical model. For a wider view of the sites of P-tRNA, can be identified. POPSR can also importance of fold recognition, he recommended be used to estimate the loss of free energy of sol- his review in Drug Discovery Today [17]. vation upon complex formation, particularly useful Franca Fraternali (http://mathbio.nimr.mrc. in designing new protein–RNA complexes and in ac.uk/taylor/members/ffranca/) was the first of suggesting more focused experimental work. the organizers to lead a seminar. She described the Like many of the most effective bioinformatic parametrization of a simple and easy-to-derive ana- approaches, POPS is an approximation to make lytical formula for taking account of solvent effects large-scale problems tractable. In this case, Frater- in molecular dynamics simulations, using acces- nali used it to tackle the problem of the large multi- sible surface areas. The method, parametrically component ribosome structures and to produce

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. Predicting the structure of biological molecules 483 illuminating data. A POPS web server has been between pairs of residues). Although denatured made available at http://ibivu.cs.vu.nl/programs/ ACBP molecules are highly heterogeneous, Ven- popswww/ [6]. druscolo claimed that the sensitivity of the compu- Michele Vendruscolo’s (http://www.ch.cam.ac. tational technique allowed him and his co-workers uk/CUCL/staff/mv.html) group, in Cambridge’s to identify long-range conformational tendencies. Chemistry Department, studies non-native struc- He also gave other example applications: the tures of proteins and uses molecular dynamics identification of rare (e.g. once a day) but large to translate experimental measurements into struc- structural fluctuations from the native state [26], tures. Vendruscolo made the important point that based on hydrogen exchange with solvent; the we know far less about the cellular states of pro- investigation of transition states too short-lived to teins than about their crystal states, as determined be investigated properly experimentally; and the by X-ray crystallography. We urgently need to modelling of amyloid fibres using solid-state NMR- understand the forms proteins take when they form derived distance restraints. aggregates, intermediates, assemblies, or when they Jose´ Saldanha (http://mathbio.nimr.mrc.ac. are the nuclei of misfolded forms. uk/taylor/members/jsaldan/) of Willie Taylor’s Vendruscolo outlined his group’s use of rest- lab then led us through a rich case history of the rained simulations to investigate such problems. application of comparative modelling to the anal- The approach generates an ensemble of struc- ysis of a therapeutic target molecule. Although a tures for study for which specific experimentally- useful technique, comparative modelling can be measured restraints are satisfied. Various exper- difficult to present scientifically because its applica- imental techniques can be used to obtain the tion rarely makes a good ‘story’. It is often a step restraints. Vendruscolo outlined the technique with in a larger process or a door to a wider biologi- an example of three amino acids for which a dozen cal question. Saldanha had worked in collaboration or so interactions and specific bonds had to be satis- with Daruka Mahadevan, a consultant oncologist at fied. Once an experimental technique and a struc- the University of Arizona. Saldanha did bioinfor- tural interpretation of the derived data have been matics to analyse targets proposed by his collabo- chosen, the model for the interactions emerges and rator; Mahadevan performed expression studies. a pseudo-energy function penalizes deviations from Saldanha first provided some background on the experimentally derived restraints. Vendruscolo prostate cancer, the second most common form of argued that these were essential because molecu- death in males, and on prostate-specific membrane lar dynamics simulations cannot entirely replace antigen (PSMA), the main target for his investi- experiments in structure determination problems. gations, giving reasons why it might well be a He then detailed some specific case studies of better marker for prostate disease than the widely- published applications of the restrained simula- known prostate-specific antigen (PSA). PSMA is tion technique, beginning with a 2004 JACS paper a 750 amino acid protein, implicated in many [15] using data from site-directed spin-labelling body functions — questions were later asked about of acyl co-enzyme A binding protein (ACBP) the wisdom of choosing such a widely-used tar- to investigate the residual structure present in get. Saldanha’s choice rested on several bases: the unfolded protein. Restraints were imposed there are several isoforms of PSMA, and the form on the average over a set of copies (replicas) expressed in prostate cancer is distinct from the of the molecule and the technique was imple- others; tumour endothelial cells express it, but mented through 25 different non-interacting models not normal endothelial cells; and other researchers of the molecule — multiple simulations increased are targeting PSMA in prostate cancer. There the accuracy of the back-calculation of non- is also good clinical evidence from early trials restrained values. Not all of several hundred pos- that PSMA can be manipulated specifically and sible restraints are used in any given model, but safely. those used have to be mutually consistent. Saldanha ran through the range of bioinformatics Vendruscolo showed contact maps of the native programs that were applied to the problem, includ- and denatured states, maps of the average dis- ing BLAST (sequence search), PSIPRED (sec- tances between pairs of residues (these were, in ondary structure prediction), THREADER (fold fact, based on the probabilities of the interactions recognition), SAP (a structure-based sequence

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. 484 D. Counsell alignment program) and QUANTA (a commer- protein backbones that are commonly found to sta- cial modelling suite). This process of bioin- bilize some secondary structures and can also stabi- formatic characterization ran from determining lize ligand binding. The structural alignments come domain boundaries to alignment to structure predic- from firstly centring on the 3-D template match tion. It turned out that the transferrin receptor was (e.g. enzyme active site) then expanding the align- likely to be the best template. Although distantly ment based on sections considered ‘fittable’ (within related to PSMA, it has a similar domain structure. an RMSD cut-off) that consist of at least seven The two molecules may share similar properties consecutive residues. of dimerization and a similar binding–recycling Sadly, I was only able to catch the end of David ∼ model. Burke’s (http://www-cryst.bioc.cam.ac.uk/ Saldanha’s model(s) proved consistent with mut- dave/) presentation, ‘Ab initio structure predic- tion’ [2,4], and the subsequent discussion. When agenesis data and suggested an apical domain that I arrived, Burke was addressing the question of might be involved in substrate binding. Docking of how to filter tens of thousands of models of loops. the natural dipeptide substrate, NAAG, hinted that Currently, van der Waals’ overlap was the main cri- the specificity pocket might be distinctive enough terion, but he suggested that molecular dynamics to help in the design of inhibitors, but a full 3-D force fields, solvent accessibility and comparison structure is yet to be experimentally determined. with known structures could all be applied to win- Workers in ’s large group at now the output from modelling programs. Burke the European Bioinformatics Institute (EBI) have also summarized the questions that still concerned been seeking to infer function from structural him — and concern many structural bioinformati- information for some time now. James Watson cians: (http://www.ebi.ac.uk/Information/Staff/person • maint.php?person id=345) outlined their efforts Is it best to separate the selection of the models from the generation of models? to obtain functional assignments within struc- • Has the majority of the reasonable peptide con- tural genomics work, particularly in collaboration formations in the protein universe been observed with the Midwest Center for Structural Genomics in the structures deposited in the PDB to date? (http://www.mcsg.anl.gov/). • How can distantly related molecules be mod- Watson pointed out that, when it works, func- elled? tional assignment from three-dimensional structure is more appropriate to the identification of bio- Many of us had heard Willie Taylor (http://math chemical rather than biological function. Currently bio.nimr.mrc.ac.uk/taylor/members/wtaylor/) sequence methods are the most successful way to talk before, but he promised us that ‘Folds, knots assign function, but structure-based methods can and tangles’ would include both ‘something old provide additional functional information. There and something new’ amongst a collection of meth- are still plenty of occasions when no bioinformatic ods which, although apparently disconnected, all methods work and function can only be identified could contribute towards ab initio structure pre- diction. He began by describing the universe of by direct experiment. non-redundant folds by type (α, αβ and β) and Watson described ProFunc, a bioinformatics pointed out that this division of foldspace, while pipeline combining a variety of methods [13]. superficially illuminating, says less about deep sim- The structural contributions come from match- ilarities between fold classes, than about how we ing homologous folds, a variety of 3-D template look at proteins. methods, binding site identification and structure Now that Taylor and his co-workers are actively motif (for example helix–turn–helix) conservation. interested in model ‘proteins’ (i.e. non-biological Databases of 3-D templates describe enzyme active structures devised in silico), he has found that they sites, ligand binding sites and DNA binding sites. are difficult to classify by eye and they have used Hits to these templates are ranked by comparing Ptitisyn and Finkelstein’s concept of structural the surrounding environment of the match and cal- layers to find a way to compare them without the culating a similarity score. He also described the perennial problems of using, say, RMS deviations use of ‘nests’, small structural motifs involving between α and β proteins.

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. Predicting the structure of biological molecules 485

Taylor’s talks benefit from being supported by • Random walks generated with RAMBLE. live demos of actual programs running on a • Filtering performed using radius of gyration. Linux laptop, rather than static computer slides. • Filtering for knots. He first used RasMol to show the cell matrices • Filtering for complexity. he plots from the distribution of his fold types • Folds scored (of the order of 105 in number) with along axes of complexity, and ‘curl and stag- CAO (Contact Accepted MutatiOn) [14]. ger’. He has described this classification and its • POPS (the solvent accessibility algorithm des- sub-classifications as a ‘’ of protein cribed by Fraternali) and SPREK. structures [25]. In his demonstration this repre- sentation was completely dynamic, with individual Alternative structures produced using his group’s spheres being clickable to give the SAP represen- ab initio methods can be ranked in order by fold tation of each protein fold’s superimposed struc- and clustered. He hoped to have a comprehensive tures — colour-coded by their strength of mutual system using these or similar techniques up and correspondence [23]. running in time for the next CASP meeting. He now uses this scheme for the classification Another local speaker, Vijayalakshmi Chelliah of model proteins. When asked about the RasMol of Cambridge University’s Biochemistry Depart- renderings of such elements, Taylor pointed out ment, moved us on from protein structure determi- that these projections represent the architecture of nation to protein function determination with her the protein, failing to discriminate, for example, talk on ‘The identification of interacting sites in between parallel and anti-parallel β-strands, but protein families’. She started from the reasonable the full topology for each protein is recorded in premise that critically important residues tend to a ‘topology-string’ and can be used if needed be conserved by the members of protein fami- [11]. Taylor then moved on to questions of ab lies. She had used HOMSTRAD to generate 96 initio protein structure prediction and contrasted environment-specific substitution tables for pro- his whole-structure interests with the loop-focused tein residues and taken these as a background work of David Burke, who had preceded him. against which to detect important sites, those where Taylor used a constrained random walk to gen- residues are more conserved in families than would erate structures, along the way occasionally gen- be expected from the tables. erating secondary structure elements — sometimes The method is simple and logical: domains. A random walk combined with a sys- tem for the generation of layers produces struc- • Make a structure-based alignment of family tures which are more protein-like. Occasionally members. this approach results in the production of knots. • Compare the observed and expected substitution This behaviour had to be suppressed with ‘smooth- patterns. ing’. Some real proteins in the PDB could not be • Measure the informational difference between smoothed down to a line. It turned out that these the two. special cases are knotted. This curious, almost- incidental discovery led to a publication in Nature The higher the score, the more distant the two [24]. distributions are. High-scoring positions identified Smoothing can be used to compare the complex- in this way are those considered most likely to be ity of proteins. According to the number of self-hits functional. These scores can then be mapped onto of smoothed proteins, TIM barrels are simpler than structures to find high-scoring clusters. For this last Rossman folds, for example. It is possible to grow stage, Chelliah used Kin3Dcont, part of the kin- protein traces in silico through the building of local contour program (http://kinemage.biochem.duke. contacts and plot the ease of building a given fold edu/index.php) produced by the Richardson Lab making only local connections from a specified at Duke University, North Carolina. point in its structure. Chelliah was careful to ignore large gaps when Finally, Taylor ran through some of the elements making alignments and to restrict her analysis used in his ab initio folding experiments: to sequences with less than 80% mutual identity in order to minimize the noise from ‘briefly’ • Secondary structure predicted with PSI-PRED. conserved, but functionally unimportant, residues.

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. 486 D. Counsell

In most of around 250 families the ‘averaged Because small deviations from ideal geometry are out’ active site predicted was between 0 A˚ and 9 A˚ allowed in the real world and flexibility comes at from the true active site, but the method missed computational cost, RAPPER fixes many param- functional sites that were indirectly involved in eters (bond lengths, angles) and samples residue- the activity of proteins and sites that were buried. specific propensity tables and hand-curated con- Along the way to these results she made some formation libraries. The algorithm constructs rea- interesting observations: sonable 3-D models consistent with prior structural • constraints and additional arbitrary ones, and pro- Critical residues that were also structurally gresses from the N- to C-terminus of a structure, important did not score as highly as might have pruning additions in the wrong conformation. been expected by this method. • RAPPER has been applied to loop modelling [2], Even inaccessible residues turned out to be very (re)construction of native ensembles [7], compar- highly conserved — Chelliah put this down to ative modelling, and crystallographic model gen- their being important to the structural integrity eration [8]. More details of the program and its of active sites in the molecule. • variants are available from the RAPPER Website: She felt that this might have been countered by http://raven.bioc.cam.ac.uk/index.php looking for sites retained in both orthologues David Jones (Department of Computer Sci- and paralogues and tested this by adding in ence, Bioinformatics Unit, University College Lon- phylogenetic information. As it turned out, the don, http://www.cs.ucl.ac.uk/staff/D.Jones/index. addition of close homologues generated more html) spoke on the ‘Detection of native disorder noise. in proteins’. To begin, he joked about the irony of She observed, as people often do with methods his spending years trying to predict structure from like this, that the predictions were best when sequence before trying to predict ‘non-structure’ residues were in truly equivalent positions within from sequence. He also graciously credited similar structures. Jon Ward (http://www.cs.ucl.ac.uk/staff/J.Ward/) Returning to structure prediction, ‘Conforma- with having done most of the work. After running tional sampling for protein structure determination through the basic assumptions of sequence–stru- and prediction’ was the title of Mark DePristo’s cture interdependency, he discussed the various talk. DePristo is another member of Cambridge’s kinds of disordered proteins that were known to Biochemistry Department (http://raven.bioc.cam. exist. Some proteins are partially or completely ac.uk/∼mdepristo/). He described a method devel- unfolded yet remain functional, and we assume that oped (and now used) to check protein models, this is because their molecules form an ensemble of but which turns out to have a range of useful states, rather than a unique structure. These disor- structure-related applications. He introduced his dered states could be compact or extended molten hybrid approach by summarizing the problem in globules or random coils and, interestingly, can a series of simple figures. If the solution of a pro- fold fully on binding. tein structure is a global minimum on an energy Jones talked about the blurry line between true (or other scoring function) landscape, then our aim disorder and experimental uncertainty in determin- should be to smooth out that landscape to avoid ing protein structures as well as the experimental local minima and sample enough of it to find the methods that can be used to detect disorder. He pro- true minimum. Since there is no definitive solu- posed functional classes of disordered regions in tion, we must carefully choose heuristics. DePristo proteins: ‘springs and linkers’, modification sites, explained the advantages of molecular dynam- regions important to the timing of complex assem- ics/simulated annealing approaches over conjugate bly, and molecular recognition sites. Functional gradient/steepest descent ones. importance is often assumed to correlate with evo- His framework for such investigations, RAP- lutionary conservation and the work on predicting PER, avoids optimizing a non-linear function. disorder seems to produce results consistent with Instead it chooses many starting points and applies this. He also outlined some previous work to iden- local minimum-finding methods. Once a general tify signals of disorder in proteins. class of structures has been specified, the poten- Ward and Jones had trained a support vec- tial energies of those structures can be compared. tor machine (SVM) on a non-redundant set of

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. Predicting the structure of biological molecules 487 crystal structures and found that they could use phylogenies of minimalist proteins [3]. Blackburne it to identify 40% of disordered residues with a had explored the relationships between hypotheti- 1% error rate. The performance was better for cal 2-D proteins catalogued in the sort of protein longer regions — over 30 amino acid residues database the inhabitants of ‘Flatland’ [1] might rec- in length — for which the detection fraction and ognize. In Blackburne’s planar protein universe, error rates were 80% and 0.1%, respectively. The residues are of only two types, hydrophobic or SVM was then applied across genomes and detec- hydrophilic. Proteins fold when strings of such tion rates compared with biological function (as residues arranged on a square or tetrahedral lat- assigned by gene ontology classifications) [27]. tice of available points turn in on themselves in He believed other workers’ predictions of dis- a plausible way. Folds that arrange those residues order in prokaryotic proteins were likely to be with the lowest energy are ‘native’. A ‘fit’ pro- overestimates. In eukaryotes, molecules associ- tein is one which has a pocket — i.e. two external ated with the actin cytoskeleton scored highly, residues around a hole that could be ‘functional’. while the bacteria-like environment of mitochon- With so few degrees of freedom, all sequences dria seemed to contain few disordered protein of given short lengths and all structures derived components. There was also high correlation with from them can be known. The proteins can be DNA-transposition and development and mor- arranged in families, where a family is a group phogenesis. Molecular functions more likely to in which all the possible relatives can be generated be associated with protein disorder predictions from another by mutation and yet still meet the included transcription regulators, protein kinases rules for the formation of viable structures; the and transcription factors. Metabolic and biosyn- relationships between the model structures can be thetic protein functions scored low. The disor- visualized in graphs, whose nodes are the structures der prediction server, DISOPRED, is available at and whose edges are point mutations between http://bioinf.cs.ucl.ac.uk/disopred/disopred.html them. There are outliers, and some families are Another Chothia group member, Martin Mad- more weakly connected to related families than era (http://stash.mrc-lmb.cam.ac.uk/mm238/) others. There are ‘bottlenecks’ where there are few talked about his work on ‘Comparisons of sequence evolutionary routes from one family to another. families’ and his responsibility for the Chothia ‘Hubs’ bridge multiple families. ‘Funnels’ form group’s ‘Superfamily’ database at the LMB [16]. when the structures are arranged such that the This is a library of HMM models for all proteins nodes radiate out to variants of decreasing stability. of known 3-D structure. He recounted a history Some phenomena can be compared in an illu- of protein sequence comparison methods, of the minating way with the evolution of real proteins. problems of characterizing more distantly related For example, in Blackburne’s world neutral evo- protein groupings, and he detailed more recent lution seems necessary for minimal proteins to improvements in this resource. He gave a clear reach functional states and longer chains offer more overview of pairwise vs. sequence profile vs. HMM potential for such noisy change. Other characteris- methods and, having made the case for HMMs, he tics of these artificial proteins are more problem- discussed the refinements implemented in Super- atic: their sequences are not directional and inser- family, which relies on the segmentation of PDB tions and deletions cannot have the same meaning structures into domains and the combination of when there are so few residue types. multiple HMMs to represent its groupings. The The subsequent discussion addressed the rele- domain-based analysis of Superfamily can now be vance of such evolutionary landscapes to real pro- used to compare whole genomes for their domain teins, whether the graphs had scale-free properties, composition. other aspects of real protein behaviour which ought We moved from better models of real, stable, to be modelled (Cyrus Chothia), and the corre- folded proteins, to predictions of disordered pro- spondence between Blackburne’s neutral mutation- teins to completely imaginary proteins. Benjamin tolerant proteins and Chothia’s stable-to-mutation Blackburne (http://slater.chem.nott.ac.uk/∼ proteins (Willie Taylor). bpb/), formerly of Jonathan Hirst’s group at Not- Richard Goldstein’s (http://mathbio.nimr. tingham University and now a member of Richard mrc.ac.uk/goldstein/members/rgoldstein/) talk, Goldstein’s group, talked about the properties of his ‘Modelling molecular evolution’, covered an area

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. 488 D. Counsell of growing interest, the effort to combine sequence structure, or the extent to which a position con- and structural analysis to investigate the evolution strained [21]. and function of proteins. He described methods Goldstein then focused on the application of aimed at increasing our understanding of the struc- this general approach to the specific question of tural basis for variations in amino acid residue the GPCRs. Despite representing only 1% of the substitution rates, identifying functional sites and, genome, they are estimated to be the target of in particular, for characterizing members of the almost half all drugs and only one signalling large and pharmacologically important family of process does not involve a member of this family. G protein-coupled receptors (GPCRs). Although only one known high-resolution structure First, he highlighted a central flaw in com- is available, Goldstein’s group worked with a parative sequence analysis: most approaches are dataset of about 200 GPCRs, and analysed them based on a model that assumes positions in to produce patchworks of model assignments along sequences represent independent samplings from the lengths of sequences. all possible sequences and ignore the phyloge- Some properties of these molecules gave a strong netic relationships between related proteins. He signal. It is harder, for example, to identify the also reminded us — as molecular phylogeneticists inner and outer surface of transmembrane (TM) often have to remind biochemists and molecu- helices, such as those in the 7-TM structure of lar biologists — that residues ‘conserved’ between the GPCRs, than it is to identify the inner and closely related sequences are not as significant as outer faces of ‘normal’ protein structure helices. investigators often believe. Goldstein et al.’s site classes correlate with the Rather than ignore these problems or devise ‘innerness’ and ‘outerness’ of these helices. Also, ad hoc fixes, Goldstein, Goldman and others have a propensity to involvement in dimerization seems more recently attempted to model evolution explic- to correlate with slowly varying sites. The European Bioinformatics Institute’s Hugh itly. To begin, Goldstein developed substitution ∼ matrices for different types of local structure, but Shanahan (http://www.biochem.ucl.ac.uk/ shanahan/) described more function-from-structure has since devised a more general approach. Each work, this time targeted at predicting DNA-binding protein can be divided up into zones, without mak- proteins from 3-D motifs and electrostatic infor- ing assumptions about which models apply where; mation. There is no shortage of important DNA- the probability of any given location belonging binding proteins and a huge and growing inter- to a particular site class is a parameter which is est in the regulation of transcription. Shanahan itself optimized by an expectation maximization quoted estimates of up to 7% of eukaryotic and algorithm. 3% of prokaryotic genes coding for DNA bind- Once a set of environment categories has emer- ing proteins. Equally, structural genomics projects ged, Goldstein and co-workers assign qualitative will generate many uncharacterized structures. labels to them (e.g. ‘hydrophilic’), and the a pos- Although he acknowledged the importance and teriori probabilities of each position belonging to utility of sequence-based approaches, he argued class can be estimated. By applying this approach that function varies significantly as sequence iden- to large enough families of aligned sequences with tity between unknown and known (template) pro- structural information, he claimed, it is possible to tein sequences falls below 40%. He pointed out identify locations where different types of selective that, although at least one neural net-based method pressure have been operating and obtain insights exists for identifying DNA binding proteins, it into the underlying basis of such selective pres- has a high false-positive rate and requires high- sure, e.g. how physicochemical properties such as resolution atomic data, and claimed that homology- size and hydrophobicity are differentially important based modelling produces lower false-positive in different classes of site. scores. This approach can be used to identify function- Shanahan further contended that, of the four ally important locations — sites belonging to the main known classes of : slowest evolving rate classes — and different over- all probabilities that a position is involved in gen- • Helix–turn–helix. eral function, stabilization, dimerization, packing, • Helix–hairpin–helix.

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. Predicting the structure of biological molecules 489

• Helix–loop–helix. homology, might in future not require full electro- • Zinc finger. static calculations to be performed and that it might be possible to use data from homology models to the middle two are more easily identified with provide a cross-check for HMM searches. (HMM) methods; zinc fin- The final talk of the meeting rounded the ger proteins are too structurally variable. Shanahan event off perfectly. Chris Calladine (http://www- concentrated on the first, helix–turn–helix (H–T- civ.eng.cam.ac.uk/crc/crc web.htm), who retired H) structures. He began by summarizing the pro- only a couple of years ago from the Cambridge cedure to identify structural templates: University Department of Structural Engineering, • Search the literature for H-T-H motifs. dazzled us with a multidisciplinary, multimedia • Identify HMMs in Pfam or SMART. presentation on the ‘Mechanics of interfaces in α- • Identify structural templates from domains using helical supercoils’. He used overheads, animation the CATH super-structural family (the H-level of and a succession of cork-and-cardboard models to that database). show how juxtaposed helices could abut in diverse • Scan the Protein DataBank with templates. ways, interlocking the ‘knobs’ of their respective • Add any new H–T–H DNA-binding proteins to sidechains. The knobs of one helix fit into the the list. ‘holes’ between the knobs of the other when they • Repeat until no other structures are found. pack. For simple superhelices and four-helix bun- dles — as distinct from the helix-built cylinders The group obtained 90 non-redundant structures Calladine later touched on [5] — there were three in the PDB and generated seven structural tem- standard modes of knobs-into-holes packing, which plates to cover that set, applying an accessibility he illustrated with overlaid interface figures pro- criterion. At first the results didn’t seem much bet- duced as overheads, as simple figures and as clev- ter than those obtained with HMMs: 0.5% false erly constructed three-dimensional models. positives. Then they refined the method by inte- One of the most pleasing things about structural grating the potential over a region close to the bioinformatics is that its practitioners collaborate accessible surface of motifs and tested this by using across specialisms to tackle difficult, interesting the electrostatic data to attempt to identify the bind- and messy problems out of both curiosity and ing region in known DNA-binding proteins [12]. necessity — not merely to meet the conditions of A method to detect DNA-binding sites on the interdisciplinary funding programmes. Calladine’s surface of a protein structure is important for func- work exemplified this beautifully. He has worked tional annotation. They analysed residue patches in this area in collaboration with Charlie Laughton on the surface of DNA-binding proteins and pre- (molecular dynamics) at Nottingham University dicted DNA-binding sites using a single feature and Ben Luisi and Venkatash Pratap (structural of these surface patches. They first surveyed sur- biology) at Cambridge. Pratap wrote software that face patches and DNA-binding sites for accessi- finds α-helices and their neighbours, identifies the bility, electrostatic potential, residue propensity, local superhelical angle of their arrangement and hydrophobicity and residue conservation. From categorizes those arrangements according to those this, they observed that the DNA-binding sites usu- angles. Pratap’s animation of a bistable ‘switch’ in ally fell in the top 10% of patches with the largest positive electrostatic scores. This knowledge led to the packing of a right-handed, four-helix bundle their development of a prediction method in which of α-helices in one of the three main classes patches of surface residues were selected such that of arrangement formed the finale of Calladine’s they excluded residues with negative electrostatic presentation. scores. They used this method to make predictions for a dataset of 56 non-homologous DNA-binding pro- Acknowledgements teins and identified 68% of the dataset correctly. Using this data, they improved the false-positive I would like to thank the speakers-especially Franca score to 0.02%. Shanahan added that the hybrid Fraternali- for their contributions, clarifications, and cor- method involves fewer parameters than sequence rections to this article.

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490. 490 D. Counsell

References based Markov model of protein evolution. Comput Biol Chem 27: 93–102. 15. Lindorff-Larsen K, Kristjansdottir S, Teilum K, et al. 2004. 1. Abbott EA. 1884. Flatland: A Romance of Many Dimensions. Determination of an ensemble of structures representing the Shambhala: Boston, MA. denatured state of ACBP. J Am Chem Soc 126: 3291–3299. 2. de Bakker PI, DePristo MA, Burke DF, Blundell TL. 2003. 16. Madera M, Vogel C, Kummerfeld SK, Chothia C, Gough J. Ab initio construction of polypeptide fragments: accuracy of 2004. The in 2004: additions loop decoy discrimination by an all-atom statistical potential and improvements. Nucleic Acids Res 32: (Database issue): and the AMBER force field with the Generalized Born D235–239. solvation model. Proteins 51: 21–40. 17. Mizuguchi K. 2004. Fold recognition for drug discovery. Drug 3. Blackburne BP, Hirst JD. 2001. Evolution of functional model Discovery Today: Targets 3: 18–23. proteins. J Chem Phys 115: 1935–1942. 18. Shi J, Blundell TL, Mizuguchi K. 2001. FUGUE: 4. Burke DF, Deane CM. 2001. Improved protein loop prediction sequence–structure homology recognition using environment- from sequence alone. Protein Eng 14: 473–478. specific substitution tables and structure-dependent gap 5. Calladine CR, Sharff A, Luisi BF. 2001. How to untwist an penalties. J Mol Biol 310: 243–257. α-helix: structural principles of an α-helical barrel. J Mol Biol 19. Shirai H, Blundell TL, Mizuguchi K. 2001. A novel super- 305: 603–618. family of enzymes that catalyze the modification of guanidino 6. Cavallo L, Kleinjung J, Fraternali F. 2003. POPS: a fast groups. Trends Biochem Sci 26: 465–468. algorithm for solvent accessible surface areas at atomic and 20. Shirai H, Mizuguchi K. 2003. Prediction of the structure and residue level. Nucleic Acids Res 31: 3364–3366. function of AstA and AstB, the first two enzymes of the 7. DePristo MA, de Bakker PIW, Lovell SC, Blundell TL. 2002. arginine succinyltransferase pathway of arginine catabolism. Ab initio construction of polypeptide fragments: efficient FEBS Lett 555: 505–510. generation of accurate, representative ensembles. Proteins 21. Soyer OS, Dimmic MW, Neubig RR, Goldstein RA. 2003. Struct Funct Genet 51: 41–55. Dimerization in aminergic G-protein-coupled receptors: 8. DePristo MA, de Bakker PIW, Lovell SC, Blundell TL. 2004. application of a hidden-site class model of evolution. Heterogeneity and inaccuracy in protein structures solved by Biochemistry 42: 14 522–14 531. X-ray crystallography. Structure 12: 831–838. 22. Stebbings LA, Mizuguchi K. 2004. HOMSTRAD: recent 9. Fernandez-Recio J, Totrov M, Abagyan R. 2004. Identifica- developments of the Homologous Protein Structure Alignment tion of protein–protein interaction sites from docking energy Database. Nucleic Acids Res 32: (Database issue): D203–207. landscapes. J Mol Biol 335: 843–865. 23. Taylor WR. 2000. Protein structure comparison using SAP. 10. Fraternali F, Cavallo L. 2002. Parameter optimized surfaces Methods Mol Biol 416: 657–660. (POPS): analysis of key interactions and conformational 24. Taylor WR. 2000. A deeply knotted protein structure and how changes in the ribosome. Nucleic Acids Res 30: 2950–2960. it might fold. Nature 406: 916–919. 11. Johannissen LO, Taylor WR. 2004. Protein fold comparison 25. Taylor WR. 2002. A ‘periodic table’ for protein structures. by the alignment of topological strings. Protein Eng 16: Nature 416: 657–660. 949–955. 26. Vendruscolo M, Paci E, Dobson CM, Karplus M. 2003. Rare 12. Jones S, Shanahan HP, Berman HM, Thornton JM. 2003. fluctuations of native proteins sampled during equilibrium Using electrostatic potentials to predict DNA-binding sites on hydrogen exchange. J Am Chem Soc 125: 15 686–15 687. DNA-binding proteins. Nucleic Acids Res 31: 7189–7198. 27. Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT. 2004. 13. Laskowski RA, Watson JD, Thornton JM. 2003. From protein Prediction and functional analysis of native disorder in structure to biochemical function? J Struct Funct Genomics 4: proteins from the three kingdoms of life. J Mol Biol 337: 167–177. 635–645. 14. Lin K, Kleinjung J, Taylor WR, Heringa J. 2003. Testing homology with Contact Accepted mutatiOn (CAO): a contact-

Copyright  2004 John Wiley & Sons, Ltd. Comp Funct Genom 2004; 5: 480–490.