BIOENGINEERING AND BIOPHYSICS

COMPUTATIONAL STUDIES OF PROTEIN FOLDING

The authors describe the state of the art in the field of protein structure prediction. They also introduce Prospector, a newly developed, iterative threading algorithm for protein structure prediction that can also be applied to ab initio folding, and discuss the promising results of its large-scale application.

JEFFREY SKOLNICK
Danforth Plant Science Center, Missouri
ANDRZEJ KOLINSKI
Danforth Plant Science Center and the University of Warsaw, Poland

1521-9615/01/$10.00 © 2001 IEEE

Proteins are the workhorses of life. These polymers, composed of 20 naturally occurring amino acids, fold to a unique, biologically active conformation called the native state. Various genome-sequencing projects now list the parts of such protein sequences in a given organism, but unfortunately, this list is of little utility; the real need is to identify the functions of all these proteins, which range from molecular to physiological to phenotypical. For between 40 and 60 percent of the protein-coding regions (or open reading frames), sequence-based methods that exploit evolutionary information can provide insight into some aspect of biological function. Such alignment methods define the standard against which we must measure all alternative approaches [1,2], but such approaches increasingly fail as the protein families become more distant [3]. The remaining unassigned open reading frames represent an important challenge, and structure-based approaches to function prediction can play a significant role [3,4], especially in target selection for genomics projects [5]. The ultimate goal for most such projects is to experimentally determine the structure of all possible protein folds so that any newly found sequence is within modeling distance of an already solved structure. In this article, we examine the status of contemporary protein structure prediction approaches.

There are three classes of approaches to protein structure prediction: homology modeling, threading, and ab initio folding. In homology modeling [6], the query sequence and the already solved template structure are clearly evolutionarily related. The key challenge is to generate the best alignment on the template backbone, rebuild the protein's side chains, and fill in the alignment's gaps, typically in the loops between secondary structural elements. For threading, we attempt to find the closest matching structure in a library of already solved structures [7]. The structures can be analogous (two proteins need not be evolutionarily related), but they adopt similar structures by convergent evolution. Both threading and homology modeling have the disadvantage that a solved example of a related structure must already exist. With ab initio folding, we attempt to fold a protein from a random conformation [8]. It has the advantage that we don't need a previous example of the fold, but it

22 COMPUTING IN SCIENCE & ENGINEERING
is limited to relatively small proteins and generally produces low- to moderate-resolution predicted structures.

Ab initio folding poses two problems that we must solve simultaneously for ab initio protein-structure prediction to be truly successful. We must first develop an efficient conformational search scheme that addresses the multiple minima problem (if each residue has three degrees of freedom, then a 100-residue protein has on the order of 10^50 possible conformations), and then we must apply it to an energy function that has the global minimum in the protein's native conformation. Both parts are equally challenging, because a protein's energy landscape has many hills and valleys, and developing an energy function that identifies the native state as the global minimum among similar, but incorrect, protein-like states is nontrivial. These problems can be partly addressed by exploiting information from threading, such as predicted contacts between side chains. In this spirit, we have developed a unified approach to protein structure prediction that uses information from our new threading algorithm Prospector as restraints in an ab initio folding algorithm.

A historical perspective

A typical protein folds from the unfolded, random conformation state to the native state on the order of milliseconds to minutes. At full atomic detail, we would have to simulate both the protein and the water in which it is dissolved. Using contemporary computers, it is impossible to fold a protein by brute force. Classical molecular dynamics simulations of a protein surrounded by an appropriate number of water molecules typically access times on the order of tens to hundreds of nanoseconds, which is at least three orders of magnitude less than the fastest protein folding times. To simulate the requisite folding time scales, we typically reduce the number of the protein's degrees of freedom and treat the solvent implicitly by a potential of mean force, such as a generalized Born (which treats the electrostatics)/accessible surface area approach [9].

First steps

The first reduced protein folding models appeared 25 years ago. In their pioneering work, Michael Levitt and Arieh Warshel proposed a model that assumed two centers of interaction per residue, one on the backbone alpha carbon and the other at the center of mass of the side group [10]. Each amino acid had a single degree of freedom involving its rotation around the Cα-Cα virtual bond. A knowledge-based potential controlled the short-range interactions, while the interactions between the side groups were of the Lennard-Jones type. They handled sampling with classical molecular dynamics. Simulations of bovine pancreatic trypsin inhibitor sometimes produced structures resembling the native fold, with the best structures having a root-mean-square deviation from the native in the range of 6.5 Å. Later researchers studied similar models, with comparable results [11].

Some have developed continuous-space models with more structural details. Sun examined models that had an all-atom representation of the main chain and single united-atom side groups [12]. Knowledge-based statistical potentials described the interactions between the side groups, and a genetic algorithm (GA) searched conformational space. For small peptides (such as melittin, pancreatic polypeptide inhibitor, and apamin), he predicted structures whose accuracy ranged from a root-mean-square deviation of 1.66 Å to 4.5 Å from native, depending on size. Pedersen and John Moult assumed an all-heavy-atom protein representation and used knowledge-based potentials to describe intraprotein interactions [13]. A combination of Monte Carlo (MC) and GAs searches the conformational space. MC produces a set of structures for the GA starting population, with crossover points occurring in the regions of largest flexibility detected in the MC runs. Their method successfully predicted low- to moderate-resolution protein fragments and the approximate folds of small proteins, but it is limited to small proteins.

Lattices to simplify the conformational search

Although continuous-space, reduced models contain fewer degrees of freedom than detailed atomic models, effectively sampling the conformational space for larger proteins is extremely difficult. To further reduce the number of degrees of freedom, researchers have proposed discrete or lattice models. Early studies of lattice proteins did not focus on protein structure prediction but rather on understanding the fundamentals of the thermodynamics and kinetics of protein folding [14–21].

The first attempt to predict a protein's native

SEPTEMBER/OCTOBER 2001 23
structure in an ab initio fashion using a lattice representation of a protein came from Dashevskii [22]. He used a diamond lattice chain to approximate the polypeptide conformations and a chain growth algorithm to sample conformational space. A simple force field generated and identified compact structures resembling native folds of small polypeptides. Somewhat later, David Covell investigated a simple cubic lattice model of real proteins [23]. His model's force field consisted entirely of long-range interactions that included a pairwise, knowledge-based potential, a surface term, and a potential that corrected the model chain's local packing. The quality of the crude folds this method generated was comparable to those obtained from early continuous models. Covell later studied five small globular proteins by the enumeration of all possible compact conformations on a body-centered cubic lattice chain. He and his colleagues could always find the closest-to-native conformation within the top 2 percent of the lowest energy structures, as assessed by a knowledge-based interaction scheme.

Hinds and Michael Levitt developed a lattice model of proteins where a single diamond lattice vertex represented several residues of a real protein [24]. They used an elaborate statistical potential to mimic the mean interactions between the protein segments so defined and did an exhaustive search of compact space. They then obtained the actual identity of the residues from a dynamic programming procedure. Often, they found correct low-resolution structures among the compact structures.

Over the years, we (the authors) have developed a series of high-coordination lattice models of globular proteins [17,28,25,26]. We used lattices of various resolutions to mimic the Cα-trace of real proteins, ranging from 3D "chess-knight" type lattices to a high-coordination lattice with 90 lattice vectors to represent possible subsequent locations of Cα-Cα virtual bonds. The models had additional interaction centers to represent the side groups, described by a single-sphere, multiple-rotamer representation [26,27]. The force field contained terms mimicking short-range interactions that described local conformational preferences for helices and beta strands; explicitly cooperative hydrogen bonds; and one-body, pairwise, and multibody long-range interactions, with an implicit averaged effect of the water molecules. For several small globular proteins and simple multimeric coils, such models generated correct low- to moderate-resolution (high-resolution in the case of leucine zippers) structures obtained from simulated annealing simulations [26,28].

CASP

To assess the current status of protein structure prediction, John Moult proposed the CASP (Critical Assessment of Techniques for Protein Structure Prediction) community-wide protein structure prediction experiment. The idea is that experimentalists who are about to determine protein structures make the sequences of the proteins available, and then the protein structure prediction community makes predictions that are then assessed by independent reviewers. Attendees tested recently developed ab initio protein structure prediction methods during the CASP3 exercises, conducted in December 1998 in Asilomar, California [29]. They presented a number of new techniques that constitute qualitative progress in ab initio prediction with respect to the previous CASPs (held every two years).

Among the best performing ab initio methods was the Rosetta method developed by David Baker and coworkers [30]. This approach works as follows: First, its developers prepared a multiple sequence alignment for the sequence of interest and did secondary structure prediction. The combined secondary structure predictions and sequence alignments provide the most plausible three- to nine-residue structural fragments (25 fragments for each segment of the query sequence), extracted from the structural database. An algorithm that randomly inserts these three- and nine-residue fragments searches conformational space, and the resulting conformations are scored by a function that contains a hydrophobic burial term, elements of electrostatics, a disulfide bond bias, and a sequence-independent term that evaluates the packing of secondary structure elements. The top 25 (of 1,200 generated) structures frequently contained the proper fold. The best five structures that exhibited a single hydrophobic core are selected by "visual inspection." This could be a drawback because doing a manual evaluation of massive-scale predictions would be difficult. Nevertheless, of 18 targets in CASP3, four predictions proved globally correct (with a root-mean-square deviation range of 4 to 6 Å from native). Furthermore, the majority

of the predictions contained correct fragments [31].

Other groups also made good predictions for a number of difficult ab initio target proteins at CASP3. Ortiz and colleagues applied a high-coordination lattice model that we had first developed, which searched conformational space by an MC simulated annealing approach [27,32]. The model assumed a 90-basis vector representation of the alpha carbon trace that has a 1.2 Å resolution due to the underlying cubic lattice grid's spacing. Off-lattice single-sphere side chains could assume multiple orientations with respect to the backbone, thereby mimicking the distribution of rotamers for particular amino acids. The model's generic force field consisted of knowledge-based potentials (derived from the statistics of the regularities seen in known protein structures). Additionally, they implemented a weak bias toward predicted secondary structures and weak theoretically predicted long-range contact restraints from correlated mutation analysis in the interaction scheme [33]. They based contact prediction on the analysis of correlated mutations in sequences detected by multiple sequence alignments. For some targets, their approach correctly predicted the global fold or large fragments of the structure.

Osguthorpe employed a continuous-space model and sampled conformations with knowledge-based potentials [34]. He correctly predicted substantial fractions of his attempted targets, and his prediction was the best for one of the difficult targets.

Ram Samudrala and coworkers developed a hierarchical procedure that enumerated all compact conformations by using a diamond lattice model that had multiple residues per lattice vertex [35]. They then selected the best structures by fitting the predicted secondary structure fragments to the lattice models. These structures were energy minimized using an all-atom force field and spatial restraints from the lattice models. They scored the optimized structures by a combination of all-atom and residue-based, knowledge-based potentials. They then used distance geometry to generate possible "consensus" models and rebuilt the all-atom structures again (optimized and ranked by energy). This method correctly predicted a number of qualitatively correct protein fragments of significant size. This approach's major weakness was perhaps the small fraction of good structures in the initial pool of lattice models.

Harold Scheraga and coworkers developed their force field based on physical principles rather than evolutionary information, which distinguishes their approach from other participants in CASP3 [36]. Optimization is performed with conformational space annealing, which narrows the search regions and finds distinct families of low-energy conformations. Then, the lowest energy, reduced model conformations are converted to all-atom models and optimized by electrostatically driven MC simulations [37]. For some CASP3 targets, this method produced exceptionally good predictions. The method seemed to perform much better on helical proteins than on β or α/β proteins.

Choice of sampling scheme

In general, the choice of the simulation-optimization algorithm depends on the given study's aim. The study of protein dynamics and folding pathways requires different procedures (and to some extent, different force fields) than do studies designed to identify a protein's native conformation.

MC procedures use a wide spectrum of strategies for conformational updating. In some algorithms, there are global updates of the entire chain; chain growth algorithms are representative of this genre. Other algorithms involve local chain updates involving only a small portion of the chain or a small distance displacement of a larger part of it. Sometimes, the local and global modifications combine in the same algorithm.

If we want to study the kinetics of protein folding, then we need to reproduce the actual folding process. Is there a relationship between the molecular dynamics simulations of a continuous model and a trajectory of an otherwise similar but now discretized (or lattice) model? When a random scheme selects only small, local distance moves, then the dynamics is equivalent to coarse-grained Brownian dynamics, and a given trajectory is the numerical solution of a stochastic equation of motion. Of course, the short-time dynamics on the time scale of a single elementary move of the discrete model have no physical meaning. However, the long-time dynamics should be qualitatively correct, albeit with possible distortions of the time scale of various dynamic events. Recent studies show that the MC folding pathways observed in high-coordination lattice models re-

produce the qualitative picture of folding dynamics seen in experiment [38]. Thus, we can use lattice dynamics for meaningful studies of the nature of protein folding pathways and the mechanism of multimeric protein assembly. The validity of studies using discrete models depends more on the protein representation's accuracy and its attendant force field than on a particular sampling scheme. However, some oversimplified discrete models might face serious ergodicity problems, an aspect of simulation that we must carefully examine.

We need isothermal simulations at a range of temperatures above, at, and below the folding transition temperature (where 50 percent of the molecules are native and 50 percent are unfolded) to obtain the folding process's thermodynamics. Unfortunately, there is a serious problem associated with the extremely slow relaxation in the low-temperature, dense, globular state, where the local barriers are high and standard sampling thus becomes ineffective. This renders straightforward molecular dynamics or canonical MC algorithms prohibitively expensive. We can surmount such problems by using properly designed local moves that can "jump over" these high local energy barriers.

Multicanonical (or entropy sampling Monte Carlo, ESMC [20]) sampling can provide more complete data on folding thermodynamics [25,39,40]. Because these methods use differently defined transition probabilities, energy barriers are substituted by entropic barriers. These simulations offer the advantage of an objective means of establishing when the simulation has converged over a given energy range, and from a single series of simulations it is possible to obtain an estimate of all thermodynamic functions (energy, free energy, and entropy) over a wide range of temperatures. However, the cost of such computations grows rapidly with system size.

Rather than characterizing the full thermodynamics, a simpler task is to find the lowest energy state. This is important because the thermodynamic hypothesis postulates that native proteins are in the global minimum of the conformational energy [41]. Researchers have developed a variety of strategies to solve this global minimum problem, including the diffusion equation method, which deforms the energy surface until a single minimum remains. When traced back to the original energy surface, this corresponds to the global energy minimum on the nondeformed surface. For relatively simple but nontrivial systems, this method works well, but for more complex situations, existing methods do not guarantee that we can find the lowest energy conformation.

Simulated annealing, ESMC, minimization, GAs, and the combination of GAs with MC sampling have successfully found the near-native conformations of reduced models of small proteins [12,13,20,42,43]. Recently, many studies have compared the efficiency of various MC strategies for finding a protein model's global minimum [44]. The most straightforward approach is simulated annealing, where the system starts out at a relatively high temperature that is gradually lowered until it is below the folding transition temperature. If, on repeated runs (starting from different initial states), we recover the same conformation, we can assume that there is a good chance we have located the global minimum. However, for difficult problems, simulated annealing runs (or at least a substantial fraction of them) become trapped in local energy minima that could be far from the properly folded state. Unfortunately, there is no simple test of convergence in the simulated annealing method. Modifying the transition acceptance criteria could considerably improve the efficiency of simulated annealing. For instance, we could perform local minimization before and after the transition and then apply the Metropolis criterion to the pair of locally lowest energy conformations. This way, the sampling procedure can avoid visits to a large fraction of irrelevant local states.

In contrast to simulated annealing, sampling techniques that use the multicanonical ensemble have convergence tests. In ESMC [20], the estimate of the system entropy is constructed by a sampling process controlled by the density of states of particular discretized energy levels. When converged, all energy levels, including the lowest one, should be sampled with the same frequency. The ESMC method is "quasi-deterministic," meaning that the data from the preceding simulations could help improve the accuracy of successive runs. In principle, ESMC should find the lowest energy state, but in practice, the energy spectrum near the lowest energy state could have large entropy barriers, the lowest energy state might not be detected, and this region might not be converged. We could accelerate the convergence rate by artificially deforming the entropy

curve versus energy in the less important, high-energy range.

The Replica Exchange Monte Carlo (REMC) method [45] has a different philosophy. Here, we simulate many copies with a standard Metropolis scheme at various temperatures spanning from high to low. Occasionally, the replicas are randomly swapped according to a criterion that depends on the temperature difference and the energy difference. Thus, the low-energy conformations at a higher temperature could move to a lower temperature. At high temperatures, the energy barriers could be surmounted easily, while at low temperatures, the vicinities of the energy "valleys" are efficiently sampled.

Comparing the computational expense of finding the lowest energy state for a simple protein-like copolymer model shows that REMC is much more efficient than an MC-based simulated annealing protocol, despite the fact that we must simulate multiple copies of the system. The REMC method also finds the low-energy conformations many times faster than ESMC. Furthermore, due to REMC's efficient sampling, we could use samples at various temperatures for the "umbrella"-type estimation of the density of states as a function of energy, from which we can obtain all thermodynamic quantities.

Thus for protein structure prediction, we have a variety of tools for searching the conformational space. The key issue is how such tools can be exploited for successful protein structure prediction.

Prospector

We now turn to the practical question of how one goes about predicting protein structure. Given a protein sequence of unknown structure, most people typically run PSI-BLAST [46] over sequences from structures in the protein data bank [47–52]. Then, if this fails, a threading program is used in an attempt to identify a significant probe-template match. Even if this is successful, for nontrivial cases, query sequence alignments could be in error. Additionally, there could be gaps in the alignment as well as long unaligned regions. If both methods fail, then ab initio folding is the requisite structure prediction method. Ideally, we want a unified approach that automatically treats these possibilities. Let's look at our recently developed unified approach and then concentrate on the ab initio component.

First, we run our threading algorithm, Prospector [47], and establish if there is a significant query sequence-template structure match. If so, we softly bias the structure toward the template with a generalized comparative modeling approach that involves ab initio folding in the vicinity of the template in a reduced protein model, the SIde CHain Only Model (SICHO), where each residue is described by a single interaction center located at the center of mass of the side chain along with the backbone alpha carbon [48,49]. We use REMC to explore conformational space, but threading can also provide predicted secondary structure and tertiary contacts that are not restricted to the template structure but that we can extract from other structures. This allows for fold prediction in unaligned regions of the query sequence. Conversely, when there is no significant match to a template, the predicted secondary structure and tertiary contacts extracted from threading (onto templates that do not have the query sequence's global structure) are passed to an ab initio folding algorithm that uses the same reduced protein model, but now there are no templates. Then, we cluster the resulting structures [50], add atomic detail, use a pairwise atomic potential to better select structures (including low RMSD structures that do not cluster) [51], select the structures again, and then present the results.

Summary of CASP4 prediction results

Last year, the next CASP, CASP4, was held. We begin by describing how we did in CASP4. For difficult targets, classified by the CASP4 assessors as "new folds," our method failed to correctly predict the entire structure. Often, Prospector correctly predicted the fold's structure elements as well as supersecondary structure elements, but these elements had topological errors that led to a large overall root-mean-square deviation from the experimental structures. In other cases, we obtained an accurate fold corresponding to the native structure's mirror-image topology.

Sometimes we obtained accurate models in spite of the fact that our threading procedure did not recognize remotely similar folds present in the protein data bank [52]. For example, as shown in Figure 1a for target T0102 (a cyclic 70 amino acid protein), our procedure produced a good model with a coordinate root-mean-square deviation of 3.6 Å from native for the first model (of a maximum of five allowed) submitted. Other


Figure 1. Comparison of the (a) predicted and (b) experimental structures for the CASP4 targets A. T0102 with an RMSD of 3.6 Å from native, and B. T0110, where the predicted structure has an RMSD of 4.2 Å from native.
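The RMSD values quoted in the figure measure how far predicted Cα positions lie from the experimental ones. The following is a minimal Python sketch of two such measures, not the authors' evaluation code. The function names `coordinate_rmsd` and `distance_rmsd` and the four-point toy chain are hypothetical. The coordinate RMSD here assumes that the two structures have already been optimally superimposed, a step the sketch omits. The distance RMSD needs no superposition but, being built only from internal distances, cannot tell a structure from its mirror image, which is relevant to the mirror-image topologies mentioned in the text.

```python
import math

def coordinate_rmsd(a, b):
    """RMSD between paired (x, y, z) points.

    Assumption: the two structures are already superimposed;
    no rotation or translation is applied here.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty, equal-length coordinate lists")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(a, b))
    return math.sqrt(total / len(a))

def distance_rmsd(a, b):
    """RMS difference between corresponding intra-structure pair distances.

    Invariant to rotation, translation, and reflection.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least two points")
    n = len(a)
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
            total += diff * diff
            pairs += 1
    return math.sqrt(total / pairs)

# Hypothetical toy "chain": four points with 3.8 Å spacing, not planar.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
# Mirror image obtained by reflecting through the z = 0 plane.
mirror = [(x, y, -z) for x, y, z in native]

print(coordinate_rmsd(native, mirror))  # nonzero: the coordinates differ
print(distance_rmsd(native, mirror))    # zero: internal distances are identical
```

A coordinate RMSD computed after a proper (rotation-only) superposition does distinguish mirror images, which is one reason it is the measure usually quoted for predicted folds.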

groups produced models of comparable quality in the range of 4.0 to 4.3 Å from native.

For T0110, shown in Figure 1b (a 95-residue α/β protein of a complicated fold), our ab initio prediction produced the most accurate model, with a root-mean-square deviation of 4.2 Å from native, which was significantly better than those based on fold recognition or alternative ab initio techniques.

Application to a large benchmark

Subsequent to CASP4, we tuned the SICHO model to improve its performance and improved the contact prediction protocol by using additional protein-specific pair potentials. We also improved the sequence profile method, which now defines the score as the difference between the score of the sequence in the structure and that of the reversed sequence in the structure. The latter modification also makes Prospector more sensitive. We selected sequences of 65 representative small single-domain globular proteins as a test set for ab initio folding [53]. The set contained proteins of α/β, α+β, and β-type folds and 40 proteins randomly chosen from another work [54]. For 47 out of 65 proteins (72.3 percent), at least one cluster centroid in the top five had a root-mean-square deviation below 6.5 Å from native. When we used an atomic potential to select structures, Prospector successfully predicted 50 proteins [51]. If we count the best structure, 58 proteins (89.2 percent) had a structure with a root-mean-square deviation at or below 6.5 Å. Unfortunately, the lowest energy structures of only 36 proteins satisfy this criterion, which demonstrates the imperfections in our potentials as well as in the practicality of selecting structures by clustering. Often there are pairs of topological mirror-image structures among the obtained cluster centroids. When one of the centroids has the proper fold, we also obtain (in most cases) the topological mirror-image structure.

Feasibility of structural refinement

The reduced protein model that we use to assemble topologies has limited resolution. Typically, well-folded structures have a root-mean-square deviation of 2 to 6.5 Å from native. Can we improve such models using a more detailed protein representation and a more exact force field? It appears that sometimes we can refine the models to a resolution close to that of experimental structures. In previous work with similar low-resolution lattice models, researchers successfully refined several structures of leucine zippers to experimental resolution, with a root-mean-square deviation of 0.6 Å from native [28]. We subsequently extended these studies using ESMC to provide a treatment of the GCN4 leucine zipper

folding thermodynamics as well as the prediction of the native state [55]. Furthermore, the CHARMM force field, when supplemented by a generalized Born/surface area treatment, is highly correlated with the lattice-based force field. These studies are extremely encouraging, but it is unclear how soon refinement from low resolution to moderate or high resolution will become routine.

Although the methodology for protein structure prediction is partially successful, it needs further improvement. Prospector, which forms this approach's core, also needs improvement. For example, it currently uses a very simple sequence profile as a scoring function. Clearly, it needs to exploit more powerful and more sensitive sequence profiles [56]. Prospector also generates high-scoring local sequence fragments that are often quite accurate. This information should be incorporated into subsequent threading iterations and could serve as partial seed structures in ab initio folding, akin to Rosetta [30].

The scaling of various contributions to the interaction scheme is now to a large extent arbitrary and adjusted essentially by trial and error. To achieve more accurate scaling, we plan to employ an automated procedure targeted at generating as strong a correlation as possible between energy and root-mean-square deviation from native. Perhaps we could achieve a significant improvement in the model by introducing approximate electrostatics into the interaction scheme. This should include a more complete implicit treatment of the solvent, going beyond intra-main-chain hydrogen bonds. The goal here is to reduce the model's reliance on predicted tertiary restraints, which almost always dictate the folding method's success.

A variety of sparse but rapidly obtained experimental data could increase the accuracy and extend the range of applicability of our structure prediction method. Our ab initio folding procedure employs predicted secondary structure and predicted contacts between side groups. As demonstrated recently for an older version of the SICHO model, knowledge of secondary structure and as few as N/7 to N/5 side chain contacts (where N is the number of residues in the protein) enables the structure assembly for proteins up to 240 residues [57]. The larger the number of known contacts, the better the accuracy of the predicted structures. We could extract such fragmentary structural data from NMR experiments. Structural restraints could also originate from electron microscopy, fluorescence data, or cross-linking experiments. Sometimes mutation experiments can identify residues that are involved with ligand binding or that are in contact. We could easily incorporate information about the spatial arrangement of these residues into the folding algorithm.

Although techniques for the prediction of low-resolution structures have significantly improved, they still have a way to go before structure prediction becomes routine. Nevertheless, this is a laudable goal, as low-resolution structures are of considerable utility both in the identification of biochemical function and in ligand docking [58]. Such efforts will have to be applied on a genomic scale if structure-based approaches to function prediction are to play a role in the postgenomic era. A number of such efforts are underway, and as structure prediction continues to improve, the applications of protein structure prediction methods to entire genomes will become more prevalent.

Acknowledgment

This research was supported in part by NIH grants Nos. GM37408, GM48835, and RR-12255. We greatly appreciate the contributions of Marcos Betancourt, Hui Lu, Daisuke Kihara, Piotr Rotkiewicz, and Michal Boniecki to some of the research described in this review.

References

1. S.F. Altschul et al., "Basic Local Alignment Search Tool," J. Molecular Biology, vol. 215, 1990, pp. 403–410.
2. S. Henikoff and J.G. Henikoff, "Protein Family Classification Based on Searching a Database of Blocks," Genomics, vol. 19, 1994, pp. 97–107.
3. J.S. Fetrow and J. Skolnick, "Method for Prediction of Protein Function from Sequence Using the Sequence-to-Structure-to-Function Paradigm with Application to Glutaredoxins/Thioredoxins and T1 Ribonucleases," J. Molecular Biology, vol. 281, 1998, pp. 949–968.
4. J. Skolnick and J. Fetrow, "From Genes to Protein Structure and Function: Novel Applications of Computational Approaches in the Genomic Era," Tibtech, vol. 18, 2000, pp. 34–39.
5. J. Bonanno, "Structural Genomics," Current Biology, vol. 9, no. 23, 1999, pp. R871–872.

SEPTEMBER/OCTOBER 2001 29

6. R. Sanchez and A. Sali, “Evaluation of Comparative Protein Structure Modeling by MODELLER-3,” Proteins, vol. 1, 1997, pp. 50–58.
7. D.T. Jones, “GenTHREADER: An Efficient and Reliable Protein Fold Recognition Method for Genomic Sequences,” J. Molecular Biology, vol. 287, no. 4, 1999, pp. 797–815.
8. M.J. Sternberg et al., “Progress in Protein Structure Prediction: Assessment of CASP3,” Current Opinions in Structural Biology, vol. 9, no. 3, 1999, pp. 368–373.
9. D. Bashford and D.A. Case, “Generalized Born Models of Macromolecular Solvation Effects,” Ann. Rev. Physical Chemistry, vol. 51, 2000, pp. 129–152.
10. M. Levitt and A. Warshel, “Computer Simulation of Protein Folding,” Nature, vol. 253, 27 Feb. 1975, pp. 694–698.
11. C. Wilson and S. Doniach, “A Computer Model to Dynamically Simulate Protein Folding: Studies with Crambin,” Proteins, vol. 6, 1989, pp. 193–209.
12. S. Sun, “Reduced Representation Model of Protein Structure Prediction: Statistical Potential and Genetic Algorithms,” Protein Science, vol. 2, 1993, pp. 762–785.
13. J.T. Pedersen and J. Moult, “Ab Initio Protein Folding Simulations with Genetic Algorithms: Simulations on the Complete Sequence of Small Proteins,” Proteins, vol. 1, 1997, pp. 179–184.
14. N. Go and H. Taketomi, “Respective Roles of Short- and Long-Range Interactions in Protein Folding,” Proc. Nat’l Academy of Science, 1978, pp. 559–563.
15. W.R. Krigbaum and S.F. Lin, “Monte Carlo Simulation of Protein Folding Using a Lattice Model,” Macromolecules, vol. 15, 1982, pp. 1135–1145.
16. A. Kolinski and J. Skolnick, “Monte Carlo Simulations on an Equilibrium Globular Protein Folding Model,” Proc. Nat’l Academy of Science, 1986, pp. 7267–7271.
17. J. Skolnick and A. Kolinski, “Computer Simulations of Globular Protein Folding and Tertiary Structure,” Ann. Rev. Physical Chemistry, vol. 40, 1989, pp. 207–235.
18. J. Skolnick and A. Kolinski, “Simulations of the Folding of a Globular Protein,” Science, vol. 250, 1990, pp. 1121–1125.
19. H.S. Chan and K.A. Dill, “Polymer Principles in Protein Structure and Stability,” Ann. Rev. Biophysics and Biophysical Chemistry, vol. 20, 1991, pp. 447–490.
20. M.-H. Hao and H.A. Scheraga, “Monte Carlo Simulations of a First-Order Transition for Protein Folding,” J. Physical Chemistry, vol. 98, 1994, pp. 4940–4948.
21. L. Mirny and E. Shakhnovich, “Protein Folding Theory: From Lattice to All-Atom Models,” Ann. Rev. Biophysics and Biomolecular Structures, vol. 30, 2001, pp. 361–396.
22. V.G. Dashevskii, “Lattice Model of Three-Dimensional Structure of Globular Proteins,” Molekulyarnaya Biologiya, vol. 14, no. 1, 1980, pp. 105–117.
23. D.G. Covell, “Folding Protein α-Carbon Chains into Compact Forms by Monte Carlo Methods,” Proteins, vol. 14, 1992, pp. 409–420.
24. D.A. Hinds and M. Levitt, “A Lattice Model for Protein Structure Prediction at Low Resolution,” Proc. Nat’l Academy of Science, 1992, pp. 2536–2540.
25. A. Kolinski and J. Skolnick, Lattice Models of Protein Folding, Dynamics and Thermodynamics, 1996.
26. J. Skolnick et al., “A Method for Prediction of Protein Structure from Sequence,” Current Biology, vol. 3, 1993, pp. 414–423.
27. A. Kolinski and J. Skolnick, “Monte Carlo Simulations of Protein Folding, II: Application to Protein A, ROP, and Crambin,” Proteins, vol. 18, 1994, pp. 353–366.
28. M. Vieth et al., “Prediction of the Folding Pathways and Structure of the GCN4 Leucine Zipper,” J. Molecular Biology, 1994, pp. 361–367.
29. J. Moult et al., “Critical Assessment of Methods of Protein Structure Prediction (CASP): Round III,” Proteins, vol. 3, 1999, pp. 2–6.
30. K.T. Simons et al., “Ab Initio Protein Structure Prediction of CASP III Targets Using ROSETTA,” Proteins, vol. 3, 1999, pp. 171–176.
31. D.T. Jones, “Successful Ab Initio Prediction of the Tertiary Structure of NK-Lysin Using Multiple Sequences and Recognized Supersecondary Structural Motifs,” Proteins, vol. 1, 1997, pp. 185–191.
32. A.R. Ortiz et al., “Ab Initio Folding of Proteins Using Restraints Derived from Evolutionary Information,” Proteins, vol. 3, 1999, pp. 177–185.
33. A.R. Ortiz, A. Kolinski, and J. Skolnick, “Nativelike Topology Assembly of Small Proteins Using Predicted Restraints in Monte Carlo Folding Simulations,” Proc. Nat’l Academy Science, 1998, pp. 1020–1025.
34. D.J. Osguthorpe, “Improved Ab Initio Predictions with Simplified Flexible Geometry Model,” Proteins, vol. 3, 1999, pp. 186–193.
35. R. Samudrala et al., “Ab Initio Protein Structure Prediction Using a Combined Hierarchical Approach,” Proteins, vol. 3, 1999, pp. 194–198.
36. J. Lee et al., “Calculation of Protein Conformation by Global Optimization of a Potential Energy Function,” Proteins, vol. 3, 1999, pp. 204–208.
37. D.R. Ripoll, A. Liwo, and H.A. Scheraga, “New Developments of the Electrostatically Driven Monte Carlo Method: Test on the Membrane-Bound Portion of Melittin,” Biopolymers, vol. 46, 1988, pp. 117–126.
38. A. Rey and J. Skolnick, “Comparison of Lattice Monte Carlo Dynamics and Brownian Dynamics Folding Pathways of α-Helical Hairpins,” Chemical Physics, vol. 158, 1991, pp. 199–219.
39. U.H.E. Hansmann and Y. Okamoto, “Prediction of Peptide Conformation by Multicanonical Algorithm: New Approach to the Multiple Minima Problem,” J. Computational Chemistry, vol. 14, 1993, pp. 1333–1338.
40. A. Kolinski, W. Galazka, and J. Skolnick, “Monte Carlo Studies of the Thermodynamics and Kinetics of Reduced Protein Models: Application to Small Helical, β and α/β Proteins,” J. Chemical Physics, vol. 108, 1998, pp. 2608–2617.
41. C.B. Anfinsen, “Principles that Govern the Folding of Protein Chains,” Science, vol. 181, 1973, pp. 223–230.
42. H.A. Scheraga and M.-H. Hao, “Entropy Sampling Monte Carlo for Polypeptides and Proteins,” Advanced Chemical Physics, vol. 105, 1999, pp. 243–272.
43. T. Dandekar and P. Argos, “Identifying the Tertiary Fold of Small Proteins with Different Topologies from Sequence and Secondary Structure Using the Genetic Algorithm and Extended Criteria Specific for Strand Regions,” J. Molecular Biology, vol. 256, 1996, pp. 645–660.
44. U.H.E. Hansmann and Y. Okamoto, “Numerical Comparison of Three Recently Proposed Algorithms in the Protein Folding Problem,” J. Computational Chemistry, vol. 18, 1997, pp. 920–933.
45. R.H. Swendsen and J.S. Wang, “Replica Monte Carlo Simulation of Spin Glasses,” Physical Rev. Letters, vol. 57, no. 21, 1986, pp. 2607–2609.
46. S.F. Altschul and E.V. Koonin, “Iterated Profile Searches with PSI-BLAST: A Tool for Discovery in Protein Databases,” Trends in Biochemical Science, vol. 23, no. 11, 1998, pp. 444–447.
47. J. Skolnick and D. Kihara, “Defrosting the Frozen Approximation: A New Approach to Threading,” Proteins, vol. 42, 2001, pp. 319–331.
48. J. Skolnick et al., “Ab Initio Protein Structure Prediction via a Combination of Threading, Lattice Folding, Clustering, and Structure Refinement,” to appear in Proteins, 2001.
49. A. Kolinski et al., “Generalized Comparative Modeling (GENECOMP): A Combination of Sequence Comparison, Threading, and Lattice Modeling for Protein Structure Prediction and Refinement,” to appear in Proteins, 2001.

50. M.R. Betancourt and J. Skolnick, “Finding the Needle in a Haystack: Educing Native Folds from Ambiguous Ab Initio Protein Structure Predictions,” J. Computational Chemistry, vol. 22, 2001, pp. 339–353.
51. H. Lu and J. Skolnick, “A More Specific Distant Dependent Atomic Knowledge Based Potential for Protein Structure Prediction,” to appear in Proteins, 2001.
52. H.M. Berman, “The Past and Future of Structure Databases,” Current Opinions in Biotechnology, vol. 10, no. 1, 1999, pp. 76–80.
53. D. Kihara et al., “TOUCHSTONE: An Ab Initio Protein Structure Prediction Method that Uses Threading-Based Tertiary Restraints,” to appear in Proc. Nat’l Academy of Science, 2001.
54. K.T. Simons, C. Strauss, and D. Baker, “Prospects for Ab Initio Protein Structural Genomics,” J. Molecular Biology, vol. 306, no. 5, 2001, pp. 1191–1199.
55. D. Mohanty, A. Kolinski, and J. Skolnick, “De Novo Simulations of the Folding Thermodynamics of the GCN4 Leucine Zipper,” Biophysical J., vol. 77, no. 1, 1999, pp. 54–69.
56. L. Rychlewski et al., “Comparison of Sequence Profiles: Strategies for Structural Predictions Using Sequence Information,” Protein Science, vol. 9, no. 2, 2000, pp. 232–241.
57. A. Kolinski, “Assembly of Protein Structure from Sparse Experimental Data: An Efficient Monte Carlo Model,” Proteins, vol. 32, 1998, pp. 475–494.
58. M. Wojciechowski and J. Skolnick, “Docking of Small Ligands to Low-Resolution and Theoretically Predicted Receptor Structures,” to appear in J. Computational Chemistry, 2001.

Jeffrey Skolnick is the Director of Computational and Structural Biology at the Donald Danforth Plant Science Center. His research interests are in , protein structure and function prediction, lattice-based approaches to protein tertiary structure prediction, the simulation of membranes and membrane peptides, and . He received his PhD in polymer from Yale University. Contact him at the Laboratory of Computational Genomics, Danforth Plant Science Center, 893 N. Warson Rd., Creve Coeur, MO 63141; skolnick@danforthcenter.org.

Andrzej Kolinski is a Member at the Donald Danforth Plant Science Center and Head of the Theory of Biopolymers Laboratory at the University of Warsaw, Poland. He has received the International Scholar’s Award of the Howard Hughes Medical Institute and is a three-time recipient of the Prize of Polish Ministry of Higher Education. He received a PhD in chemistry from the University of Warsaw. Contact him at the Laboratory of Computational Genomics, Danforth Plant Science Center, 893 N. Warson Rd., Creve Coeur, MO 63141; [email protected].
