Ab Initio Construction of Protein Tertiary Structures Using a Hierarchical Approach

doi:10.1006/jmbi.2000.3835 available online at http://www.idealibrary.com on J. Mol. Biol. (2000) 300, 171±185 Ab Initio Construction of Protein Tertiary Structures Using a Hierarchical Approach Yu Xia1, Enoch S. Huang2, Michael Levitt1* and Ram Samudrala1* 1Department of Structural We present a hierarchical method to predict protein tertiary structure Biology, Stanford University models from sequence. We start with complete enumeration of confor- School of Medicine, Stanford mations using a simple tetrahedral lattice model. We then build confor- CA 94305, USA mations with increasing detail, and at each step select a subset of conformations using empirical energy functions with increasing complex- 2Cereon Genomics, 45 Sidney ity. After enumeration on lattice, we select a subset of low energy confor- Street, Cambridge mations using a statistical residue-residue contact energy function, and MA 02139, USA generate all-atom models using predicted secondary structure. A com- bined knowledge-based atomic level energy function is then used to select subsets of the all-atom models. The ®nal predictions are generated using a consensus distance geometry procedure. We test the feasibility of the procedure on a set of 12 small proteins covering a wide range of protein topologies. A rigorous double-blind test of our method was made under the auspices of the CASP3 experiment, where we did ab initio structure predictions for 12 proteins using this approach. The performance of our methodology at CASP3 is reasonably good and completely consistent with our initial tests. # 2000 Academic Press Keywords: protein structure prediction; lattice model; knowledge based; *Corresponding authors discriminatory function; decoy approach Introduction standing of the driving forces behind protein folding. Perturbations introduced by errors in the Ab initio protein structure prediction remains potential energy landscape may possibly result in a one of the most important unsolved problems in different folding pathway and a different folded molecular biophysics after 30 years of intensive structure. Even if we have an accurate enough research. This problem is in principle solvable: if potential energy function, we are still hampered by we know the exact formulation of the physical the huge search space (Levinthal, 1968). micro-environment within a cell where proteins Novel methods have recently been proposed for fold, we will be able to mimic the folding process ab initio protein structure prediction with impress- in nature by computing the molecular dynamics ive results (Simons et al., 1999; Ortiz et al., 1999; based on our knowledge of the physical laws Osguthorpe, 1999; Lomize et al., 1999; Lee et al., (McCarmmon & Harvey, 1987; van Gunsteren, 1999; Huang et al., 1999; Eyrich et al., 1999). Cur- 1998; Duan & Kollman, 1998). Complementarily, rent methods for structure prediction can be we can rely on the much-debated thermodynamic roughly grouped into two categories. The ®rst set hypothesis, i.e. that the native protein structure of methods include Monte Carlo and deterministic is thermodynamically stable and is located at the energy minimization (Hansmann & Okamoto, global free energy minimum (An®nsen, 1973). 1999; Scheraga, 1996; Levitt & Lifson, 1969) and However, we do not yet have a complete under- genetic algorithms (Pedersen & Moult, 1996), which generally start from either one or a small set of random starting points and attempt to drive the Abbreviations used: cRMS, Ca root mean square conformation to a low energy in an iterative man- deviation; dRMS, distance root mean square deviation; ner. The major advantage of these methods is that CASP, critical assessment of protein structure prediction methods; RAPDF, residue-speci®c all-atom conditional they more or less mimic the physical process of probability disriminatory function; SCOP, structural protein folding. Besides the folded structure, the classi®cation of proteins. pathway that leads to the folded structure may E-mail addresses of the corresponding authors: also be obtained. However, the success of these [email protected]; [email protected] methods depends crucially on the very high 0022-2836/00/010171±15 $35.00/0 # 2000 Academic Press 172 Hierarchical Construction of Protein Structures quality of energy function: it not only has to dis- puri®cation scheme, many non-native structures criminate the native structure from all the other are pruned out due to high energy even before all- possible structures along any possible simulation atom structures are built. Here, we describe our run, it also has to lead any random starting con- methodology in detail and provide a comprehen- ®guration toward the native structure. It is not sive analysis of its performance. clear whether current energy functions can satisfy the two requirements simultaneously (Moult, 1997). Results and Discussion The second set of methods use a sampling pro- Simplified lattice representation is able to cedure to produce trial structures, known as represent protein conformations well decoys (Chelvanayagam et al., 1998; Huang et al., 1999; Park & Levitt, 1996), that are subsequently Table 1 shows the parameters that we used in evaluated by an energy function. The structure the lattice prediction procedure for the 12 test pro- with the lowest energy is assumed to be the native teins. All the parameters are predetermined and structure. are only dependent on protein size. Our simple lat- We chose to work within the second paradigm tice model will only be useful if it can represent for the following reasons. First, the prediction pro- native protein features to a good approximation. cedure is separated into sampling and selection How well can the tetrahedral lattice model rep- processes. Each process is modular and can be resent native protein structures? To answer this developed and calibrated separately. Ineffective- question, we compute for each test protein the dis- ness of the whole procedure can be attributed to tance root mean square deviation (dRMS) of the one or both parts that can be corrected as needed. lattice structure with the highest number of native Second, we ignore pathway prediction and focus contacts (Table 2). This structure will be picked out our attention on structure prediction. Protein fold- if we have a perfect energy function, and is a ing is a process that involves hundreds of degrees measure of how well the lattice can represent pro- of freedom. Any single simulation can easily be tein structures. trapped in one of many local minima along the In general, larger proteins are represented less folding pathway, and the chances of overcoming a accurately. However, there is a tremendous degree local energy minimum decrease exponentially with of variation: the best dRMS ®t for 1aa2 with 108 the height of the free energy barrier. With the residues is less than 3 AÊ , whereas the best dRMS decoy approach, it is possible to explore millions of ®t for 1fgp with only 67 residues is slightly larger local energy minima of protein conformations in at 3.15 AÊ . In most cases, the best dRMS ®t ranges parallel, thereby sampling the protein confor- from 2.4 AÊ to 3.3 AÊ . This shows that our simpli®ed mational space effectively without the need to lattice model is able to represent the full variety of overcome high energy barriers. Third, the require- supersecondary structure topologies that occur in ments demanded of the energy function have been native proteins. The selection of these best struc- signi®cantly reduced: the only requirement is dis- tures depends entirely on the energy function. crimination between near-native and non-native structures. This allows for the use of powerful Simplified energy function is able to select statistical energy functions as discriminatory func- good subsets of lattice models tions, which may not perform as well as folding potentials. We use a simple statistical contact energy func- Structure sampling and evaluation have con¯ict- tion for both threading optimization and selection ing needs. We need simpli®ed models to reduce of low energy structures. Performance of different the dimensionality of the sampling space to make energy functions is characterized by how far the the computations tractable. At the same time, to dRMS distribution of the low energy population is make the best selections, we need structures with pushed away from that of all lattice structures full atomic detail to represent potential energy sur- towards native structure. Using energy criteria for face with enough accuracy to allow discrimination threading and selection pushes the structure popu- by energy functions. Unfortunately, generating and lation towards lower dRMS in all 12 test cases evaluating all-atom structures is a time-consuming (Table 2). In the case of protein 1dkt-A, energy process and cannot be done for huge numbers of selection improves the mean of the dRMS distri- conformations. bution by 2 AÊ . On average, the dRMS distribution We tackle this problem by sampling low resol- shifts 0.86 AÊ towards the lower end by applying ution structures exhaustively, and performing the energy criteria. A likely reason for the selection ®nal selection with a limited, yet promising, set of power of our energy function is that the function all-atom structures. Our approach starts with an we use is complementary to the geometry of the exhaustive enumeration of all possible folds using lattice scheme. Even though our simple lattice a highly simpli®ed tetrahedral lattice model. A set models ignore local geometrical information such of ®lters are then applied to these folds, primarily as side-chain orientation and secondary structure, in the form of discriminatory functions. As the ®l- they have well-formed interiors that can represent ters are applied, we add more detail to the models, the hydrophobic core of native structures. Since until one ®nal all-atom model remains. Using this our energy function captures the dominant hydro- Hierarchical Construction of Protein Structures 173 Table 1. Proteins and parameters used in lattice structure enumeration and selection Bounding box a b c Ê d e f Protein Size Class Walk length Edge size (A) vertex count Rg cutoff A.

Load more