<<

G ENOME:UNLOCKING B IOLOGY’ S S TOREHOUSE VIEWPOINT Structure Prediction and Structural David Baker1 and Andrej Sali2

Genome sequencing projects are producing linear amino acid sequences, the protein chain flicker between different but full understanding of the biological role of these will require local structures consistent with their local knowledge of their structure and function. Although experimental struc- sequence, and folding to the native state oc- ture determination methods are providing high-resolution structure infor- curs when these local segments are oriented mation about a subset of the proteins, computational structure prediction such that low free energy interactions are methods will provide valuable information for the large fraction of se- made throughout the protein (17). In simu- quences whose structures will not be determined experimentally. The first lating this process, each short segment is class of protein structure prediction methods, including threading and allowed to sample the local structures adopt- comparative modeling, rely on detectable similarity spanning most of the ed by the sequence segment in known protein modeled sequence and at least one known structure. The second class of structures, and a search is carried out through methods, de novo or ab initio methods, predict the structure from se- the combinations of these local structures for quence alone, without relying on similarity at the fold level between the compact tertiary structures that bury the hy- modeled sequence and any of the known structures. In this Viewpoint, we drophobic residues and pair the ␤-strands. begin by describing the essential features of the methods, the accuracy of This strategy resolves some of the problems the models, and their application to the prediction and understanding of with both the conformational search and the protein function, both for single proteins and on the scale of whole free energy function: The search is greatly . We then discuss the important role that protein structure accelerated because switching between dif- prediction methods play in the growing worldwide effort in structural ferent possible local structures can occur in a genomics. single step, and fewer demands are placed on the free energy function because the use of Modeling of a sequence based on known positions of conserved atoms from the tem- fragments of known structures ensures that structures consists of four steps: finding plates to calculate the coordinates of other the local interactions are close to optimal. known structures related to the sequence to atoms (7). A third group of methods uses be modeled (i.e., templates), aligning the se- either distance geometry or optimization Accuracy and Applications of Models quence with the templates, building a model, techniques to satisfy spatial restraints ob- The accuracy of a comparative model is re- and assessing the model (1). tained from the sequence-template alignment lated to the percentage sequence identity on The templates for modeling may be found (8–10). There are also many methods that which it is based, correlating with the rela- by sequence comparison methods, such as specialize in the modeling of loops (11) and tionship between the structural and sequence PSI-BLAST (2), or by sequence-structure side chains (12) within the restrained envi- similarity of two proteins (Fig. 1) (1, 18, 19). threading methods (3) that can sometimes ronment provided by the rest of the structure. High-accuracy comparative models are based reveal more distant relationships than purely on more than 50% sequence identity to their sequence-based methods. In the latter case, De novo Structure Prediction templates. They tend to have about 1 Å root fold assignment and alignment are achieved Although comparative modeling is limited to mean square (RMS) error for the main-chain by threading the sequence through each of the protein families with at least one known struc- atoms, which is comparable to the accuracy structures in a library of all known folds. ture, de novo structure prediction has no such of a medium-resolution nuclear magnetic res- Each sequence-structure alignment is as- limitation. De novo methods start from the as- onance (NMR) structure or a low-resolution sessed by the energy of a corresponding sumption that the native state of a protein is at x-ray structure. The errors are mostly mis- coarse model, not by sequence similarity as the global free energy minimum and carry out a takes in side-chain packing, small shifts or in sequence comparison methods. large-scale search of conformational space for distortions of the core main-chain regions, Comparative structure prediction produc- protein tertiary structures that are particularly and occasionally larger errors in loops. Me- es an all-atom model of a sequence, based on low in free energy for the given amino acid dium-accuracy comparative models are based its alignment to one or more related protein sequence. The two key components of such on 30 to 50% sequence identity. They tend to structures. Comparative model building in- methods are the procedure for efficiently carry- have about 90% of the main-chain modeled cludes either sequential or simultaneous mod- ing out the conformational search and the free with 1.5 Å RMS error. There are more fre- eling of the core of the protein, loops, and energy function used for evaluating possible quent side-chain packing, core distortion, and side chains. In the original comparative ap- conformations. To allow rapid and efficient loop modeling errors, and there are occasion- proach, a model is constructed from a few searching of conformational space, often only a al alignment mistakes (18). Finally, low-ac- template core regions and from loops and subset of the atoms in the protein chain is curacy comparative models are based on less side chains obtained from either aligned or represented explicitly; the potential functions than 30% sequence identity. The alignment unrelated structures (4–6). Another family of must then include terms that reflect the aver- errors increase rapidly below 30% sequence comparative methods relies on approximate aged-out effects of the omitted atoms and sol- identity and become the most substantial or- vent molecules. igin of errors in comparative models. In ad- 1Howard Hughes Medical Institute, University of Recently, there have been a number of dition, when a model is based on an almost Washington, Seattle, WA 98195, USA. E-mail: promising advances in de novo structure pre- insignificant alignment to a known structure, 2 [email protected]. Laboratory of Molecular diction (13–16). A particularly successful it may also have an entirely incorrect fold. Biophysics, Pels Family Center for Biochemistry and , The Rockefeller University, New method, called Rosetta, is based on a picture Accuracies of the best model building meth- York, NY 10021, USA. E-mail: [email protected]. of protein folding in which short segments of ods are relatively similar when used optimal-

www.sciencemag.org SCIENCE VOL 294 5 OCTOBER 2001 93 G ENOME:UNLOCKING B IOLOGY’ S S TOREHOUSE ly (19, 20). Other factors such as template tagenesis and affinity chromatography exper- diction, may contribute substantially to ef- selection and alignment accuracy usually iments, even though the ligand specificity ficient structural characterization of large have a larger impact on the model accuracy, profile of this protein is different from that of macromolecular assemblies. especially for models based on less than 40% the template structure. Another example is The accuracy and reliability of models sequence identity to the templates. prediction of a binding site for a charged produced by de novo methods is much lower There is a wide range of applications of ligand based on a cluster of charged residues than that of comparative models based on protein structure models (Figs. 1 and 2). For on the protein, as was done for mouse mast alignments with more than 30% sequence example, high- and medium-accuracy com- cell protease 7 (Fig. 2B) (22). The prediction identity, but the basic topology of a protein or parative models frequently are helpful in re- of a proteoglycan binding patch was con- domain can in some cases be predicted rea- fining functional predictions that have been firmed by site-directed mutagenesis and hep- sonably well (Fig. 1, D and E). For roughly based on a sequence match alone because arin-affinity chromatography experiments. 40% of proteins shorter than 150 amino acids ligand binding is more directly determined by Fortunately, errors in the functionally impor- that have been examined, one of the five most the structure of the binding site than by its tant regions in comparative models are many commonly recurring models generated by sequence. It is often possible to correctly times relatively low because the functional Rosetta has sufficient global similarity to the predict features of the target protein that do regions, such as active sites, tend to be more true structure to recognize it in a search of the not occur in the template structure. The size conserved in evolution than the rest protein structure database. Reasonable mod- of a ligand may be predicted from the volume of the fold. The utility of low-accuracy com- els can in some cases be produced for do- of the binding site cleft (Fig. 2A). For exam- parative models can be illustrated by a mo- mains of even very large proteins by using ple, the complex between docosahexaenoic lecular model of the whole yeast ribosome, multiple sequence alignments to identify do- fatty acid and brain lipid-binding protein was whose construction was facilitated by fit- main boundaries (Fig. 1D). modeled on the basis of its 62% sequence ting comparative models of many ribosom- The accuracy of de novo models is too identity to the crystallographic structure of al proteins into the electron microscopy low for problems requiring high-resolution adipocyte lipid-binding protein (PDB code map of the ribosomal particle (23). This structure information. Instead, the low-reso- 1ADL) (21). A number of fatty acids were example also suggests that structural lution models produced by these methods can ranked for their affinity to brain lipid-binding genomics of single proteins or their do- reveal structural and functional relationships protein consistently with site-directed mu- mains, combined with protein structure pre- between proteins not apparent from their ami- no acid sequences and provide a framework Fig. 1. Accuracy and for analyzing spatial relationships between application of protein evolutionarily conserved residues or between structure models. residues shown experimentally to be func- Shown are the differ- tionally important. These applications are il- ent ranges of applica- lustrated by examples from the recent CASP4 bility of comparative protein structure mod- blind protein structure prediction experiment eling, threading, and de (24, 25). The predicted structure of a protein novo structure predic- involved in cell lysis (26) was found to be tion; the corresponding structurally related to a protein with a similar accuracy of protein function but no significant sequence similar- structure models; and ity (Fig. 2B). The predicted structure of a their sample applica- tions. (A through C). domain of the mismatch repair protein MutS Sample comparative (27) (Fig. 1D) has structural similarity to models based on about proteins with related functions (28). Func- 60% (A), 40% (B), and tionally important residues of the signaling 30% (C) sequence protein Frizzled (29) were clustered in the identity to their tem- predicted structure in a surface patch likely to plate structure. (D and E) Examples of Rosetta be involved in a key protein-protein interac- de novo structure pre- tion (Fig. 2C). Thus, in favorable cases de dictions for the CASP4 novo predictions can provide some of the structure prediction most important functional insights obtainable experiment. Predicted from experimentally determined structures. structures are in red, and actual structures Modeling on a Genomic Scale are in blue. The accura- cy of the models de- Threading and comparative modeling methods crease significantly in have already been applied on a genomic scale going from (A) to (E), (18, 30, 31). In total, domains in 58% of all but the overall struc- 600,000 known protein sequences were mod- ture is still roughly cor- eled with ModPipe (18) and MODELLER (9) rect. (D) A domain from the 811-residue and deposited into a comprehensive database of MUtS protein which comparative models, ModBase (32–34). The was recognized as an Web interface to the database allows flexible autonomous unit from querying for fold assignments, sequence-struc- an alignment of ho- ture alignments, models, and model assessments mologous sequences; of interest. An integrated sequence/structure such parsing of large proteins into domains viewer, ModView, allows inspection and anal- can make structure ysis of the query results. ModBase will be in- prediction more tractable. creasingly interlinked with other applications

94 5 OCTOBER 2001 VOL 294 SCIENCE www.sciencemag.org G ENOME:UNLOCKING B IOLOGY’ S S TOREHOUSE and databases such that structures and other ety of target selection schemes (38), ranging Large-scale de novo prediction can guide tar- types of information can be easily used for from focusing on only novel folds to select- get selection by focusing experimental struc- functional annotation. Although the current ing all proteins in a model . A model- ture determination on proteins likely to adopt number of modeled proteins may look impres- centric view requires that targets be selected novel folds. De novo methods should also be sive given the early stage of , such that most of the remaining sequences useful in complementing comparative model- usually only one domain per protein is modeled can be modeled with useful accuracy by com- ing methods by building portions of proteins (on the average, proteins have slightly more than parative modeling. Even with structural not present in template structures. In addition, two domains), and two-thirds of the models are genomics, the structure of most of the pro- de novo methods supplemented by restraints based on less than 30% sequence identity to the teins will be modeled, not determined by from cross linking or other experiments can closest template. experiment. As discussed above, the accu- provide models for proteins not readily ame- Automation and large-scale modeling racy of comparative models and corre- nable to x-ray crystallographic or NMR anal- with de novo methods have lagged behind spondingly the variety of their applications ysis. Finally, large-scale de novo modeling those of comparative modeling methods, be- decrease sharply below the 30% sequence may allow coarse structure-based insights cause of the relatively poor quality of the identity cutoff, mainly as a result of a rapid into protein function of a large number of models produced and the relatively large increase in alignment errors. Thus, we will proteins well in advance of experimentally amount of computer time required. However, need to determine protein structures so that determined structures. inspired by the potential for functional in- most of the remaining sequences are related sights, large-scale modeling calculations to at least one known structure at higher Conclusions have been initiated with Rosetta. In the first than 30% sequence identity (36, 37). It was Improvement in the accuracy of models pro- such project, models for representatives of all recently estimated that this cutoff requires a duced by both de novo and comparative mod- PFAM families with less than 150 amino minimum of 16,000 targets to cover 90% of eling approaches will require methods that acids and currently not linked to proteins of all protein domain families, including those finely sample protein conformational space known structure have been produced. Strong of membrane proteins (36). These 16,000 using a free energy or scoring function that structural similarities of these models to structures will allow the modeling of a very has sufficient accuracy to distinguish the na- structures of previously determined proteins much larger number of proteins. For exam- tive structure from the nonnative conforma- can indicate previously unidentified relation- ple, New York Structural Genomics Re- tions. Despite many years of development of ships that may provide functional insights. It search Consortium measured the impact of molecular simulation methods, attempts to should soon be possible to extend these large- its structures by documenting the number refine models that are already relatively close scale calculations to cover most of the do- and quality of the corresponding models for to the native structure have met with relative- mains not represented in ModBase. detectably related proteins in the nonredun- ly little success. This failure is likely to be dant sequence database. For each new due to inaccuracies in the potential functions The Role of Protein Structure structure, on average, ϳ100 protein se- used in the simulations, particularly in the Prediction in Structural Genomics quences without any prior structural char- treatment of electrostatics and solvation ef- Structural genomics aims to structurally char- acterization could be modeled at least at the fects. Improvements in sampling strategies acterize most protein sequences by an effi- fold level (39). This large leverage of struc- may also be necessary, given the relatively cient combination of experiment and predic- ture determination by protein structure long time scale of protein folding (millisec- tion (35–37). This aim will be achieved by modeling illustrates and justifies the onds to seconds). Combination of physical careful selection of target proteins and their premise of structural genomics. chemistry with the vast amount of informa- structure determination by x-ray crystallogra- De novo structure prediction will contrib- tion in known protein structures may provide phy or NMR spectroscopy. There are a vari- ute to structural genomics in several ways. a route to development of improved potential

Fig. 2. Sample applica- tions of protein struc- ture models. (A)A comparative model of a complex between docosahexaenoic fatty acid (violet) and brain lipid-binding protein. Such models for a number of fatty acid li- gands were used to rank their binding af- finities (21). (B) A com- parative model of mouse mast cell protease 7, color-coded by the surface electrostatic potential. This model facilitated identification of a proteoglycan binding site (22). (C) The de novo predicted structure of a protein that lyses bacteria (left) was found to be similar to the structure of a protein with a similar function (nk-lysin; right) despite a lack of significant sequence similarity between the two proteins. Such similarity between predicted structures and previously determined structures can provide clues about protein function in the absence of strong sequence homology. (D) The de novo predicted structure (left) of the signaling protein Frizzled is compared with the experimentally determined structure (right) (26), with the residues identified by mutagenesis to be involved in binding the WNT signaling protein, indicated by gray spheres. The clustering of the putative functional residues at the right in the model and the true structure suggests that they form a contiguous protein- protein interaction site despite their lack of adjacency along the linear amino acid sequence. (C and D) Rosetta de novo predictions from CASP4; to faciliate comparison, the colors indicate position along the the chain from the NH2 terminus (blue) to the COOH terminus (red).

www.sciencemag.org SCIENCE VOL 294 5 OCTOBER 2001 95 G ENOME:UNLOCKING B IOLOGY’ S S TOREHOUSE

functions. The refinement of de novo and 3. A. E. Torda, Curr. Opin. Struct. Biol. 7, 200 (1997). 28. R. Bonneau, J. Tsai, I. Ruczinski, D. Baker, J. Struct. comparative models provides a good test and 4. W. J. Browne, A. C. North, D. C. Phillips, J. Mol. Biol. Biol. 134, 186 (2001). application of the molecular dynamics meth- 42, 65 (1969). 29. A. E. Dann et al., Nature 412, 86 (2001). 5. J. Greer, J. Mol. Biol. 153, 1027 (1981). 30. D. Fischer, D. Eisenberg, Proc. Natl. Acad. Sci. U.S.A. ods widely used to simulate biological mac- 6. T. L. Blundell, B. L. Sibanda, M. J. Sternberg, J. M. 94, 11929 (1997). romolecules (40). Thornton, Nature 326, 347 (1987). 31. N. Guex, A. Diemand, M. C. Peitsch, Trends. Biochem. Automated methods for deducing function 7. M. Levitt, J. Mol. Biol. 226, 507 (1992). Sci. 24, 364 (1999). 8. T. F. Havel, M. E. Snow, J. Mol. Biol. 217, 1 (1991). from structure will be critical to obtaining func- 32. See http://guitar.rockefeller.edu/modbase/. 9. A. Sali, T. L. Blundell, J. Mol. Biol. 234, 779 (1993). 33. R. Sanchez et al., Nature Struct. Biol. 7 (suppl.), 986 tional insights from both predicted and experi- 10. A. Kolinski, M. R. Betancourt, D. Kihara, P. Rotkiewicz, (2000). mentally determined structures. Considerable J. Skolnick, Proteins 44, 133 (2001). 34. R. Sa´nchez et al., Nucleic Acids Res. 28, 250 (2000). insight can be gained from structural compari- 11. A. Fiser, R. K. G. Do, A. Sali, Protein Sci. 9, 1753 35. S. K. Burley et al., Nature Genet. 23, 151 (1999). (2000). son of a given structure with all other known 36. D. Vitkup, E. Melamud, J. Moult, C. Sander, Nature 12. M. J. Bower, F. E. Cohen, R. L. Dunbrack Jr., J. Mol. Biol. Struct. Biol. 8, 559 (2001). 267, 1268 (1997). protein structures using methods such as Dali 37. A. Sali, Nature Struct. Biol. 5, 1029 (1998). 13. R. Samudrala, Y. Xia, E. Huang, M. Levitt, Proteins 3 (41), which can frequently detect structural re- 38. S. E. Brenner, Nature Struct. Biol. 7 (suppl.), 967 (suppl.), 197 (1999). lationships with functional significance that are 14. A. R. Ortiz, A. Kolinski, P. Rotkiewicz, B. Ilkowski, J. (2000). not evident from sequence comparisons. Also Skolnick, Proteins 37, 177 (1999). 39. See http://nysgrc.org/. promising are methods that match a structure 15. J. Pillardy et al., Proc. Natl. Acad. Sci. U.S.A. 98, 2329 40. A. L. Brooks, M. Karplus, B. M. Pettit (Wiley, New (2001). York, 1988). against a library of structural motifs associated 16. D. T. Jones, Proteins 1 (suppl.), 185 (1997). 41. L. Holm, C. Sander, Trends Biochem. Sci. 20, 478 with different functions (42–44). For higher res- 17. K. Simons, C. Strauss, D.Baker, J. Mol. Biol. 306, 1191 (1995). olution models produced by comparative mod- (2000). 42. A. Zhang et al., Protein Sci. 8, 1104 (1999). 18. R. Sanchez, A. Sali, Proc. Natl. Acad. Sci. U.S.A. 95, 43. A. C. Wallace, N. Borkakoti, J. M. Thornton, Protein eling methods, functional sites on proteins can 13597 (1998). Sci. 6, 2308 (1997). potentially be identified and characterized by 19. P. Koehl, M. Levitt, Nature Struct. Biol. 6, 108 (1999). 44. P. Aloy, E. Querol, F. X. Aviles, M. J. Sternberg, J. Mol. explicit ligand docking calculations. Finally, 20. M. A. Marti-Renom, M. S. Madhusudhan, A. Fiser, B. Biol. 311, 395 (2001). 45. R. Bonneau, D. Baker, Annu. Rev. Biophys. Biomol. large-scale protein-protein docking calculations Rost, Structure, in press. 21. L. Z. Xu, R. Sanchez, A. Sali, N. Heintz, J. Biol. Chem. Struct. 30, 173 (2001). in years to come may contribute to the identifi- 271, 24711 (1996). 46. We are grateful to J. Frank and R. Beckmann for the cation and characterization of protein interaction 22. R. Matsumoto, A. Sali, N. Ghildyal, M. Karplus, R. L. picture of the ribosomal particle; M. A. Marti- networks. Stevens, J. Biol. Chem. 270, 19524 (1995). Renom, R. Bonneau, and N. Eswar for help in 23. C. M. T. Spahn et al., Cell, in press. preparing the figures; and members of our groups 24. See http://predictioncenter.llnl.gov/casp4. for many discussions about protein structure pre- References 25. R. Bonneau et al., Proteins: Structure, Function and diction. Supported by NIH/GM 54762 (A.S.), the 1. M. A. Marti-Renom et al., Annu. Rev. Biophys. Biomol. Genetics, in press. Mathers Foundation (A.S.), a Merck Genome Re- Struct. 29, 291 (2000). 26. A. Gonzalez et al., Proc. Natl. Acad. Sci. U.S.A. 97, search Award (A.S.), and the Howard Hughes Med- 2. S. F. Altschul et al., Nucleic Acids Res. 25, 3389 11221 (2000). ical Institute (D.B.). A.S. is an Irma T. Hirschl Trust (1997). 27. G. Obmolova et al., Nature 407, 703 (2000). Career Scientist.

REVIEW Making Sense of Eukaryotic DNA Replication Origins David M. Gilbert

DNA replication is the process by which cells make one complete copy of although derived from prokaryotic and viral their genetic information before cell division. In bacteria, readily identifi- systems, there is no compelling reason to able DNA sequences constitute the start sites or origins of DNA replica- doubt that it will apply to all eukaryotic tion. In eukaryotes, replication origins have been difficult to identify. In organisms. In fact, the proteins that regulate some systems, any DNA sequence can promote replication, but other replication are highly conserved from yeast to systems require specific DNA sequences. Despite these disparities, the humans, including the origin recognition proteins that regulate replication are highly conserved from yeast to complex (ORC), which binds directly to rep- humans. The resolution may lie in a current model for once-per-cell-cycle lication origin sequences in budding yeast (1, regulation of eukaryotic replication that does not require defined origin 2). However, in several eukaryotic replication sequences. This model implies that the specification of precise origins is a systems, it appears that any DNA sequence response to selective pressures that transcend those of once-per-cell-cycle can function as a replicator. Those outside the replication, such as the coordination of replication with other chromo- field are often perplexed as to how investiga- somal functions. Viewed in this context, the locations of origins may be an tors of different eukaryotic systems can work integral part of the functional organization of eukaryotic chromosomes. with assumptions that range from very spe- cific to completely random origin sequence Transmission of genetic information from division. Typically, this process begins with recognition, yet all agree on the basic mech- one cell generation to the next requires the the binding of an “initiator” protein to a anism regulating DNA replication. This re- accurate and complete duplication of each specific DNA sequence or “replicator.” In view summarizes our current understanding DNA strand exactly once before each cell response to the appropriate cellular signals, of eukaryotic replication origins and then pre- the initiator directs a local unwinding of the sents some simple guidelines to help demys- DNA double helix and recruits additional tify these seemingly disparate observations, Department of Biochemistry and Molecular Biology, factors to initiate the process of DNA repli- providing a framework for understanding eu- SUNY Upstate Medical University, 750 East Adams Street, Syracuse, NY 13210, USA. E-mail: gilbert cation. This paradigm describes most of the karyotic origins that includes all existing [email protected] currently tractable replication systems and, data.

96 5 OCTOBER 2001 VOL 294 SCIENCE www.sciencemag.org