Protein Structure Prediction and Structural Genomics David Baker1 and Andrej Sali2

G ENOME:UNLOCKING B IOLOGY’ S S TOREHOUSE VIEWPOINT Protein Structure Prediction and Structural Genomics David Baker1 and Andrej Sali2 Genome sequencing projects are producing linear amino acid sequences, the protein chain flicker between different but full understanding of the biological role of these proteins will require local structures consistent with their local knowledge of their structure and function. Although experimental struc- sequence, and folding to the native state oc- ture determination methods are providing high-resolution structure infor- curs when these local segments are oriented mation about a subset of the proteins, computational structure prediction such that low free energy interactions are methods will provide valuable information for the large fraction of se- made throughout the protein (17). In simu- quences whose structures will not be determined experimentally. The ﬁrst lating this process, each short segment is class of protein structure prediction methods, including threading and allowed to sample the local structures adopt- comparative modeling, rely on detectable similarity spanning most of the ed by the sequence segment in known protein modeled sequence and at least one known structure. The second class of structures, and a search is carried out through methods, de novo or ab initio methods, predict the structure from se- the combinations of these local structures for quence alone, without relying on similarity at the fold level between the compact tertiary structures that bury the hy- modeled sequence and any of the known structures. In this Viewpoint, we drophobic residues and pair the ␤-strands. begin by describing the essential features of the methods, the accuracy of This strategy resolves some of the problems the models, and their application to the prediction and understanding of with both the conformational search and the protein function, both for single proteins and on the scale of whole free energy function: The search is greatly genomes. We then discuss the important role that protein structure accelerated because switching between dif- prediction methods play in the growing worldwide effort in structural ferent possible local structures can occur in a genomics. single step, and fewer demands are placed on the free energy function because the use of Modeling of a sequence based on known positions of conserved atoms from the tem- fragments of known structures ensures that structures consists of four steps: finding plates to calculate the coordinates of other the local interactions are close to optimal. known structures related to the sequence to atoms (7). A third group of methods uses be modeled (i.e., templates), aligning the se- either distance geometry or optimization Accuracy and Applications of Models quence with the templates, building a model, techniques to satisfy spatial restraints ob- The accuracy of a comparative model is re- and assessing the model (1). tained from the sequence-template alignment lated to the percentage sequence identity on The templates for modeling may be found (8–10). There are also many methods that which it is based, correlating with the rela- by sequence comparison methods, such as specialize in the modeling of loops (11) and tionship between the structural and sequence PSI-BLAST (2), or by sequence-structure side chains (12) within the restrained envi- similarity of two proteins (Fig. 1) (1, 18, 19). threading methods (3) that can sometimes ronment provided by the rest of the structure. High-accuracy comparative models are based reveal more distant relationships than purely on more than 50% sequence identity to their sequence-based methods. In the latter case, De novo Structure Prediction templates. They tend to have about 1 Å root fold assignment and alignment are achieved Although comparative modeling is limited to mean square (RMS) error for the main-chain by threading the sequence through each of the protein families with at least one known struc- atoms, which is comparable to the accuracy structures in a library of all known folds. ture, de novo structure prediction has no such of a medium-resolution nuclear magnetic res- Each sequence-structure alignment is as- limitation. De novo methods start from the as- onance (NMR) structure or a low-resolution sessed by the energy of a corresponding sumption that the native state of a protein is at x-ray structure. The errors are mostly mis- coarse model, not by sequence similarity as the global free energy minimum and carry out a takes in side-chain packing, small shifts or in sequence comparison methods. large-scale search of conformational space for distortions of the core main-chain regions, Comparative structure prediction produc- protein tertiary structures that are particularly and occasionally larger errors in loops. Me- es an all-atom model of a sequence, based on low in free energy for the given amino acid dium-accuracy comparative models are based its alignment to one or more related protein sequence. The two key components of such on 30 to 50% sequence identity. They tend to structures. Comparative model building in- methods are the procedure for efficiently carry- have about 90% of the main-chain modeled cludes either sequential or simultaneous mod- ing out the conformational search and the free with 1.5 Å RMS error. There are more fre- eling of the core of the protein, loops, and energy function used for evaluating possible quent side-chain packing, core distortion, and side chains. In the original comparative ap- conformations. To allow rapid and efficient loop modeling errors, and there are occasion- proach, a model is constructed from a few searching of conformational space, often only a al alignment mistakes (18). Finally, low-ac- template core regions and from loops and subset of the atoms in the protein chain is curacy comparative models are based on less side chains obtained from either aligned or represented explicitly; the potential functions than 30% sequence identity. The alignment unrelated structures (4–6). Another family of must then include terms that reflect the aver- errors increase rapidly below 30% sequence comparative methods relies on approximate aged-out effects of the omitted atoms and sol- identity and become the most substantial or- vent molecules. igin of errors in comparative models. In ad- 1Howard Hughes Medical Institute, University of Recently, there have been a number of dition, when a model is based on an almost Washington, Seattle, WA 98195, USA. E-mail: promising advances in de novo structure pre- insignificant alignment to a known structure, 2 [email protected]. Laboratory of Molecular diction (13–16). A particularly successful it may also have an entirely incorrect fold. Biophysics, Pels Family Center for Biochemistry and Structural Biology, The Rockefeller University, New method, called Rosetta, is based on a picture Accuracies of the best model building meth- York, NY 10021, USA. E-mail: [email protected]. of protein folding in which short segments of ods are relatively similar when used optimal- www.sciencemag.org SCIENCE VOL 294 5 OCTOBER 2001 93 G ENOME:UNLOCKING B IOLOGY’ S S TOREHOUSE ly (19, 20). Other factors such as template tagenesis and affinity chromatography exper- diction, may contribute substantially to ef- selection and alignment accuracy usually iments, even though the ligand specificity ficient structural characterization of large have a larger impact on the model accuracy, profile of this protein is different from that of macromolecular assemblies. especially for models based on less than 40% the template structure. Another example is The accuracy and reliability of models sequence identity to the templates. prediction of a binding site for a charged produced by de novo methods is much lower There is a wide range of applications of ligand based on a cluster of charged residues than that of comparative models based on protein structure models (Figs. 1 and 2). For on the protein, as was done for mouse mast alignments with more than 30% sequence example, high- and medium-accuracy com- cell protease 7 (Fig. 2B) (22). The prediction identity, but the basic topology of a protein or parative models frequently are helpful in re- of a proteoglycan binding patch was con- domain can in some cases be predicted rea- fining functional predictions that have been firmed by site-directed mutagenesis and hep- sonably well (Fig. 1, D and E). For roughly based on a sequence match alone because arin-affinity chromatography experiments. 40% of proteins shorter than 150 amino acids ligand binding is more directly determined by Fortunately, errors in the functionally impor- that have been examined, one of the five most the structure of the binding site than by its tant regions in comparative models are many commonly recurring models generated by sequence. It is often possible to correctly times relatively low because the functional Rosetta has sufficient global similarity to the predict features of the target protein that do regions, such as active sites, tend to be more true structure to recognize it in a search of the not occur in the template structure. The size conserved in evolution than the rest protein structure database. Reasonable mod- of a ligand may be predicted from the volume of the fold. The utility of low-accuracy com- els can in some cases be produced for do- of the binding site cleft (Fig. 2A). For exam- parative models can be illustrated by a mo- mains of even very large proteins by using ple, the complex between docosahexaenoic lecular model of the whole yeast ribosome, multiple sequence alignments to identify do- fatty acid and brain lipid-binding protein was whose construction was facilitated by fit- main boundaries (Fig. 1D). modeled on the basis of its 62% sequence ting comparative models of many ribosom- The accuracy of de novo models is too identity to the crystallographic structure of al proteins into the electron microscopy low for problems requiring high-resolution adipocyte lipid-binding protein (PDB code map of the ribosomal particle (23).

Protein Structure Prediction and Structural Genomics David Baker1 and Andrej Sali2

Whole Genome and Segmental Duplications Underlie Glutamine Synthetase and Phosphoenolpyruvate Carboxylase Diversity in Narrow-Leafed Lupin (Lupinus Angustifolius L.)

A Roadmap for Metagenomic Enzyme Discovery

Structural Genomics: an Approach to the Protein Folding Problem

Syntenic Gene and Genome Duplication Drives Diversification of Plant Secondary Metabolism and Innate Immunity in Flowering Plants

The Role of Protein Structure in Genomics

Structural Genomics Lecture Notes

The Impact of Structural Genomics: Expectations and Outcomes John-Marc Chandonia and Steven E

Classical Genetics 3. the Beginnings of Genomic Biol

Interrogating Protein Interaction Networks Through Structural Biology

Unexpected Features of the Dark Proteome

National Science Foundation-Sponsored Workshop Report

Structural Genomics of the Thermotoga Maritima Proteome Implemented in a High-Throughput Structure Determination Pipeline