Protein Structure Prediction: Model Building and Quality Assessment

BJÖRN WALLNER

Stockholm Bioinformatics Center
Department of Biochemistry and Biophysics
Stockholm University, Sweden

Stockholm 2005

Björn Wallner
Stockholm University, AlbaNova University Center
Stockholm Bioinformatics Center
S-106 91 Stockholm, Sweden
e-mail: [email protected]
Tel: +46-8-55 37 85 66
Fax: +46-8-55 37 82 14

Academic dissertation which, by permission of Stockholm University, will be presented for public examination for the degree of Doctor of Philosophy on Friday 30 September 2005 at 10.00 in Svedbergssalen, AlbaNova University Center, Roslagstullsbacken 21, Stockholm. Faculty opponent: Michael Levitt, Stanford University School of Medicine.

Doctoral Thesis © Björn Wallner, 2005
ISBN 91-7155-121-2
Printed by Universitetsservice US AB, Stockholm 2005

To my family

Protein Structure Prediction: Model Building and Quality Assessment
Björn Wallner, [email protected]
Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden

Abstract

Proteins play a crucial role in all biological processes. The wide range of protein functions is made possible through the many different conformations that the protein chain can adopt. The structure of a protein is extremely important for its function, but determining the structure of a protein experimentally is both difficult and time consuming. In fact, with current methods it is not possible to study all the billions of proteins in the world by experiment. Hence, for the vast majority of proteins the only way to get structural information is through the use of a method that predicts the structure of a protein based on the amino acid sequence. This thesis focuses on improving the current protein structure prediction methods by combining different prediction approaches together with machine-learning techniques. This work has resulted in some of the best automatic servers in the world – Pcons and Pmodeller. As a part of the improvement of our automatic servers, I have also developed one of the best methods for predicting the quality of a protein model – ProQ. In addition, I have developed methods to predict the local quality of a protein model, based on the structure – ProQres – and based on evolutionary information – ProQprof. Finally, I have also performed the first large-scale benchmark of publicly available homology modeling programs.

Contents

1 Publications
  1.1 Papers included in this thesis
  1.2 Other publications

2 Introduction

3 Protein Structure Prediction
  3.1 De Novo
  3.2 Comparative modeling
    3.2.1 Finding related structures
    3.2.2 Aligning related structures
    3.2.3 Model building
    3.2.4 Model evaluation
  3.3 Meta prediction

4 Techniques and databases
  4.1 SCOP – Structural Classification Of Proteins
  4.2 Neural networks
    4.2.1 Input data encoding
    4.2.2 Network architecture
    4.2.3 Neural network training
    4.2.4 Cross-validation
    4.2.5 Other issues when using neural networks
    4.2.6 The neural networks in this study
  4.3 Benchmarks
    4.3.1 SCOP based
    4.3.2 CASP
    4.3.3 CAFASP
    4.3.4 LiveBench

5 Present investigation
  5.1 Development of a method that identifies correct protein models (Paper I)
  5.2 Pcons, ProQ and Pmodeller at CASP5/CAFASP3 (Paper II)
  5.3 Detecting correct residues with ProQres and ProQprof (Paper III)
  5.4 Pcons5: the fifth generation of Pcons (Paper IV)
  5.5 Are all homology modeling programs equal? (Paper V)

6 Conclusions and reflections

7 Acknowledgments

1 Publications

1.1 Papers included in this thesis

Paper I
Björn Wallner and Arne Elofsson. Can correct protein models be identified? Protein Science, 12(5):1073-1086, 2003.

Paper II
Björn Wallner, Huisheng Fang and Arne Elofsson. Automatic consensus based fold recognition using Pcons, ProQ and Pmodeller. Proteins, 53 Suppl 6:534-541, 2003.

Paper III
Björn Wallner and Arne Elofsson. Identification of correct regions in protein models using structural, evolutionary and consensus information. Submitted, 2005.

Paper IV
Björn Wallner and Arne Elofsson. Pcons5: combining consensus, structural evaluation and fold recognition scores. Submitted, 2005.

Paper V
Björn Wallner and Arne Elofsson. All are not equal: A benchmark of different homology modeling programs. Protein Science, 14(5):1315-1327, 2005.

Papers are reproduced with permission.

1.2 Other publications

Björn Wallner, Huisheng Fang, Tomas Ohlson, Johannes Frey-Skött and Arne Elofsson. Using evolutionary information for the query and target improves fold recognition. Proteins, 54(2):342-350, 2004.

Tomas Ohlson, Björn Wallner and Arne Elofsson. Profile-profile methods provide improved fold-recognition. A study of different profile-profile alignment methods. Proteins, 57(1):188-197, 2004.

Ann-Charlotte Berglund, Björn Wallner, Arne Elofsson and David A. Liberles. Tertiary Windowing to Detect Positive Diversifying Selection. J. Mol. Evol., 60(4):499-504, 2005.

Huisheng Fang, Björn Wallner, Jesper Lundström, Christer von Wowern and Arne Elofsson. Improved fold recognition by using the Pcons consensus approach. Chapter 16 in Protein Structure Prediction: Bioinformatic Approach, IUL, 2002.

2 Introduction

This work is about the prediction of the structure of proteins from the amino acid sequence. Proteins play a crucial role in all biological processes and constitute a central part of the cell machinery. They are the active components that catalyze various processes, and in that way they control and direct the chemical pathways underlying all processes of the living cell. They can act as transporters, transporting oxygen in the blood or controlling the flow of ions across the cell membrane. They also provide mechanical support for the cell and the whole body. This wide range of functions is made possible through the many different conformations that proteins can adopt. Thus, the structure of a protein is extremely important for its function. Proteins are built from 20 different amino acids forming a chain of variable length. The different amino acids have different properties, and when put together they interact to form a three-dimensional structure. The order of the amino acids, also called the sequence, uniquely defines this structure (Anfinsen, 1973). Obtaining the sequence of a protein is today relatively easy; however, determining the structure of a protein experimentally using X-ray crystallography or NMR spectroscopy is both difficult and very time consuming. The high-throughput methods developed in the last few years have increased the rate at which structures are determined experimentally, but it is still unreasonable to believe that the structure of more than a tiny fraction of all the billions of proteins in the world will be studied by experimental methods in the foreseeable future. Hence, for the vast majority of proteins the only way to get structural information is through the use of methods that predict the structure of a protein based on the amino acid sequence. The structure of a protein depends solely on the amino acid sequence.
This property is extremely important when predicting the structure based on the sequence, because similar sequences will have similar structures. Similar structures most often also have similar functions, which means that by comparing sequences of proteins it is possible to say that two proteins not only have the same structure but also the same function. This forms the basis for homology based structure and function prediction, which today is routinely used to infer function from sequences with known function to sequences with unknown

function. However, how to compare sequences is by no means trivial, and many different methods have been developed throughout the years. This thesis focuses on improving the current protein structure prediction methods by combining different prediction approaches together with machine-learning techniques. In particular, we have developed methods to predict the quality of a predicted structure using neural networks. These predictions have successfully been combined with consensus analysis of predicted structures from existing methods. In the following, the protein structure prediction problem will be described and different approaches to solving the problem will be addressed (section Protein Structure Prediction). Some techniques and databases relevant for the development (section Techniques and databases) and benchmarking (section Benchmarks) of our prediction methods will also be introduced. Finally, my contribution to the protein structure prediction field will be summarized (section Present investigation) and also presented in full (Papers I-V).

3 Protein Structure Prediction

The three-dimensional (3D) structure of a protein is guided by two distinct sets of principles operating on vastly different time scales: the laws of physics and the theory of evolution (Figure 1). From a physical point of view the protein molecule is a consequence of a variety of forces, such as entropy, chemical bonds, Coulomb and van der Waals interactions. Under native conditions these forces fold a protein into a stable, well-defined 3D structure in less than a second. From an evolutionary point of view the protein molecule is the result of evolution over millions of years. Proteins evolve through duplication, speciation, horizontal transfer and mutations. During the course of evolution a protein changes gradually because its function is usually retained, which requires the conservation of its structure and therefore also its sequence (Chothia & Lesk, 1986). These two viewpoints give rise to two different approaches to the protein structure prediction problem (Baker & Sali, 2001): de novo and comparative modeling. The first approach predicts the structure from

the sequence alone, while the second approach relies on finding a similar sequence with known structure. Both these approaches will be described in more detail below. Let me just first mention that any protein structure prediction method can be seen as an optimization of a protein structure model with respect to a certain objective function (Sánchez & Sali, 1997). De novo methods attempt to find the most likely structure of a protein sequence given the physical forces between the atoms in the protein, while comparative modeling methods attempt to find the most likely structure of a protein sequence given its relationship to known similar protein structures.

[Figure 1: Proteins obey two sets of principles, the laws of physics and the theory of evolution. These two principles give rise to two different approaches to the protein structure prediction problem, de novo prediction and comparative modeling. The figure contrasts the physical folding of a sequence (de novo prediction) with its evolution across species such as baboon, mouse, human and cow (comparative modeling).]

3.1 De Novo

De novo methods predict the structure from the sequence alone, without relying on similarity between the modeled sequence and any known structure. These methods assume that the native structure corresponds to the global free energy minimum of the protein and attempt to find this minimum by an exploration of many possible protein conformations. Two key steps in de novo methods are the procedure for efficiently carrying out the conformational search and the free energy function used for evaluating possible conformations. This class of methods was earlier called ab initio, because they predicted the protein structure from scratch without prior knowledge. However, because most of the so-called ab initio methods in fact used prior knowledge, the name de novo is nowadays used for this category, de novo referring to the prediction of structures with new folds. The most successful method in the de novo category is the Rosetta method (Simons et al., 1999; Bonneau et al., 2001) developed by David Baker and co-workers. It uses 3- and 9-residue fragments of known structures with local sequences similar to the sequence to be modeled, and assembles these fragments into complete tertiary structures using a Monte Carlo simulated annealing procedure. The scoring function used in the simulated annealing procedure consists of sequence-dependent terms representing hydrophobic burial and specific pair interactions such as electrostatics and disulfide bonding, and sequence-independent terms representing hard sphere packing, α-helix and β-strand packing, and the collection of β-strands into β-sheets.

The de novo methods have improved in the last few years. In CASP1 and CASP2 they showed little success; however, in CASP3 some de novo methods outperformed fold recognition methods for certain proteins in the fold recognition category (Bonneau & Baker, 2001) (for a description of CASP see the Benchmarks section). At CASP4 the de novo predictions were more consistent and more accurate than at previous CASPs (Bonneau et al., 2001). However, at CASP5 there was no significant increase in performance compared to CASP4 (Venclovas et al., 2003). This is probably because the main improvement in CASP3 was the introduction of the fragment approach, which was then further optimized and led to an improvement also at CASP4. At CASP5 there were no really new ideas in the de novo field and consequently no performance increase. Thus, despite the initial progress, many issues must be resolved if a consistent and reliable de novo prediction scheme is to be developed. For instance, no method performs consistently across all classes of proteins, most methods perform worse on all-β proteins, and all methods fail completely on sequences longer than 150 residues.
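The Monte Carlo simulated annealing search mentioned above can be sketched in a few lines. This toy illustrates only the generic accept/reject strategy; the quadratic "energy" and the proposal move below are invented stand-ins, not anything from Rosetta:

```python
import math
import random

# Toy Metropolis simulated-annealing loop (generic search strategy only):
# propose a random move, always accept improvements, and accept worse
# moves with a probability that shrinks as the temperature is lowered.
def anneal(score, propose, state, t_start=1.0, t_end=0.01, steps=2000):
    t = t_start
    cooling = (t_end / t_start) ** (1.0 / steps)  # geometric cooling schedule
    current = score(state)
    for _ in range(steps):
        candidate = propose(state)
        delta = score(candidate) - current
        if delta < 0 or random.random() < math.exp(-delta / t):
            state, current = candidate, score(candidate)
        t *= cooling
    return state, current

random.seed(0)
# Invented one-dimensional "energy" with its minimum at x = 3.0.
best, energy = anneal(score=lambda x: (x - 3.0) ** 2,
                      propose=lambda x: x + random.uniform(-0.5, 0.5),
                      state=10.0)
```

In a fragment-assembly setting, the proposal would instead replace a short stretch of the backbone with a fragment from a known structure, and the score would be the full energy function described above.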

3.2 Comparative modeling

The aim of all comparative modeling methods is to build a 3D model for a protein of unknown structure (the target) on the basis of sequence similarity to proteins of known structure (the templates). This is possible because small changes in the protein sequence usually result in small changes in its 3D structure (Chothia & Lesk, 1986). Despite the progress in the de novo prediction field (see above), the comparative modeling approach is by far the more accurate of the two. The overall accuracy of comparative models spans a wide range, from low resolution models with only a correct fold to more accurate models comparable to medium resolution structures determined by X-ray crystallography or NMR spectroscopy. Even low resolution models can be useful in biology, as some aspects of function can sometimes be predicted from only the coarse structural features of a model. Today (Mar-2005), it is possible to find structures with similar sequences for one third of all proteins (Ekman et al., 2005). Because the number of known protein sequences is approximately 2,000,000 (Bairoch et al., 2005) (05-Jul-2005), comparative modeling could be applied to more than 650,000 proteins. This number should be compared to approximately 30,000 protein structures determined experimentally (Berman et al., 2000). The usefulness of comparative modeling is also steadily increasing, because the number of different structural folds that proteins adopt is limited and the number of experimentally determined new structures is increasing rapidly (Holm & Sander, 1996). This trend is also accentuated by the recently initiated structural genomics project that aims at determining at least one structure for most protein families (Vitkup et al., 2001). It is conceivable that structural genomics will achieve its aim in 5-10 years, making comparative modeling applicable to most protein sequences.

[Figure 2: The different steps in comparative modeling. Flowchart: START → identify related structures (templates) → align target sequence to template structures (example alignment: TARGET ...SCDKLLDDELDDDIACAKKILAIKGID..., TEMPLATE ...SCDKFLDDDITDDIMCAKKILDIKGID...) → build a model for the target sequence using information from the template structures → evaluate the model (per-residue quality profile) → if the model is not OK, iterate; otherwise END.]

Most comparative modeling methods consist of four steps (Figure 2): finding known structures related to the target sequence to be modeled (i.e. templates); aligning the sequence to the templates; building a 3D model; and assessing the model (Marti-Renom et al., 2000). All four steps can of course be repeated until a satisfactory model is obtained. In the following, each of these steps will be described in more detail.

3.2.1 Finding related structures

The starting point in comparative modeling is the identification of protein structures related to the target sequence (homologs), and the selection of a suitable template structure. The knowledge that two or more proteins are homologous is, besides being the basis for protein structure modeling, the most commonly used means of transferring functional knowledge from one protein to another, and it can also be used for evolutionary studies. Many bioinformatic methods and tools are also based on these relationships; for instance, the reliable detection of related proteins might actually improve secondary structure prediction (Rost & Sander, 1993) and other prediction methods. This is also the reason why the identification of related sequences has been the subject of intensive research during the last 10 years, resulting in a wide range of different methods and approaches to the problem. Many of these methods are available as web servers on the Internet. An extremely useful site is the Meta-server at http://bioinfo.pl/meta, where it is possible to submit a sequence to many different servers using a simple interface.

The basic approach to finding relationships between proteins is to compare their sequences. This can be performed in many different ways, ranging from simple pairwise comparisons to more elaborate comparisons using evolutionary information for the target, the template, or both. The output from all of these approaches is the same: a list of which amino acids in the two sequences are equivalent; this list is called the alignment. It has been demonstrated that methods using multiple sequences, i.e. evolutionary information, are superior to methods that only use single sequences (Lindahl & Elofsson, 2000), and more recently that methods using evolutionary information for both the target and template sequences are even more efficient (Wallner et al., 2004), probably because they better describe which parts of the sequences are conserved and which are variable throughout evolution. However, simpler methods might have computational advantages. The most commonly used program to find related structures is psi-blast (Altschul et al., 1997).
This program uses evolutionary information for the target sequence by using the initial database hits to refine the search to find more homologs in an iterative way. For a given sequence the first database hits are used to build a multiple sequence alignment, a position-specific scoring matrix is constructed from the alignment, and this matrix is then used to search the database for new homologs. These new homologs can then be used to construct a new matrix and do the search again until no more homologs are detected. psi-blast

is quite fast, making it possible to search a database with 2,000,000 sequences in just a few minutes. In most cases psi-blast detects a sufficient number of homologs; however, there are slower methods that are significantly better, especially for the really hard targets. These methods use evolutionary information for both the target and template sequences. The evolutionary information is encoded in a profile, which is constructed from a multiple sequence alignment (for instance using psi-blast). A profile is a set of vectors, where each vector contains the frequency of each type of amino acid in a particular position of the multiple sequence alignment. In this way it is possible to detect which parts of a sequence are conserved and which are variable. Methods using evolutionary information for both the target and template compare two profiles and are therefore often called profile-profile methods (Rychlewski et al., 2000; von Öhsen & Zimmer, 2001; Yona & Levitt, 2000; Sadreyev & Grishin, 2003; Ohlson et al., 2004). Profile-profile comparisons can be done in several different ways, including calculating the sum of pairs, the dot product, or a correlation coefficient between the two vectors. Once a list of related structures is obtained, the templates that are appropriate for the given modeling problem should be selected. Usually, a higher overall sequence similarity between the target and template sequence yields a better template. However, there are also other factors that should be taken into account in the final template selection, such as ligands, quaternary interactions, resolution and the R-factor of a crystal structure. Ultimately the criteria for template selection depend on the purpose of the comparative model.
For instance, when modeling a protein-ligand complex, selecting a template with a similar ligand is probably more important than the resolution of the template, while the modeling of an active site of an enzyme might require a high resolution template. In many cases, several different templates are used and the best models are selected using other criteria (see the Model evaluation section).
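The profile representation and the dot-product comparison described above can be illustrated with a minimal sketch; the alignment columns below are invented, and real profile-profile methods use considerably more elaborate scoring:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column):
    """Amino acid frequency vector for one column of a multiple sequence alignment."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for aa in column:
        if aa in counts:
            counts[aa] += 1
    total = sum(counts.values()) or 1
    return [counts[aa] / total for aa in AMINO_ACIDS]

def dot_product_score(col_a, col_b):
    """Score two profile columns by the dot product of their frequency vectors."""
    return sum(a * b for a, b in zip(col_a, col_b))

# Invented alignment columns: a conserved leucine column from the target
# profile against a more variable hydrophobic column from the template.
target_col = column_profile("LLLLIL")
template_col = column_profile("LLIVLM")
print(round(dot_product_score(target_col, template_col), 3))  # prints 0.444
```

Two strongly conserved columns with the same dominant residue would score close to 1, while unrelated columns score close to 0; a full profile-profile method computes such column scores for every pair of positions and feeds them into an alignment algorithm.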

3.2.2 Aligning related structures

Most methods for finding related structures also produce an alignment between the target sequence and the sequence of the template structure. For closely related protein sequences with identities over 40%, the alignment is most often close to optimal. As the sequence similarity decreases, the alignment becomes more difficult and will contain an increasingly large number of gaps and alignment errors (Rost, 1999; Marti-Renom et al., 2000; Elofsson, 2002). It is extremely important that the alignment is correct, since no modeling program can recover from an incorrect alignment. Since the search methods are optimized for the detection of remote relationships and not for optimal alignments, the alignment from the database search is often not the optimal target-template alignment for comparative modeling. Therefore, a specialized method could be used to align the target sequence with the template structures (Elofsson, 2002). In difficult alignment cases, evolutionary information together with structural information from alternative templates should be used. The structural information can help avoid placing gaps in secondary structure elements, in buried regions or between residues that are far apart in space. Secondary structure prediction for the target sequence can also be used to obtain a more accurate alignment (Elofsson, 2002). However, in many cases the evaluation of a model is more reliable than the evaluation of an alignment. Thus, a good strategy is to build 3D models for all alternative alignments, evaluate the corresponding models and select the best model according to the 3D model evaluation instead of the alignment score (Jones, 1999a) (Paper I).
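The strategy of building models for all alternative alignments and selecting by model evaluation can be sketched as follows; `build_model` and `evaluate_model` are placeholders for a real modeling program and a real evaluation method:

```python
# Build a model for every alternative alignment and keep the one preferred
# by the model evaluation; the callables are placeholders for real tools.
def select_best_model(alignments, build_model, evaluate_model):
    best_model, best_score = None, float("-inf")
    for alignment in alignments:
        model = build_model(alignment)   # e.g. run a comparative modeling program
        score = evaluate_model(model)    # e.g. a structural evaluation score
        if score > best_score:
            best_model, best_score = model, score
    return best_model

# Invented stand-ins: "models" are numbers and the evaluator prefers
# values close to 1.0.
best = select_best_model([0.2, 0.9, 0.5],
                         build_model=lambda a: a,
                         evaluate_model=lambda m: -abs(m - 1.0))
print(best)  # prints 0.9
```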

3.2.3 Model building

Given a target-template alignment there exists a variety of methods that can build a 3D model for the target protein. In principle they can be grouped into three groups: rigid-body assembly (Browne et al., 1969; Blundell et al., 1987; Greer, 1990; Koehl & Delarue, 1994a; Koehl & Delarue, 1995; Bates et al., 2001; Petrey et al., 2003; Schwede et al., 2004), segment matching (Jones & Thirup, 1986; Claessens et al., 1989; Unger et al., 1989; Levitt, 1992) and modeling by satisfaction of spatial restraints (Havel & Snow, 1991; Sali & Blundell, 1993; Srinivasan et al., 1993; Aszodi & Taylor, 1996). The accuracy of the different modeling approaches is similar when used optimally (Marti-Renom et al., 2000). Other factors, like template selection and alignment accuracy, usually have a larger impact on model quality than the choice of modeling method. Still, there are differences between the models produced by the different methods (see Paper V).

Modeling by rigid-body assembly. The first modeling programs were based on rigid-body assembly methods. This is still a widely used approach, e.g. in the following programs: swiss-model (Schwede et al., 2004), nest (Petrey et al., 2003), 3d-jigsaw (Bates et al., 2001) and Builder (Koehl & Delarue, 1994a; Koehl & Delarue, 1995). The model is assembled from a small number of rigid bodies obtained from the core of the aligned regions. The assembly starts with building a framework from the coordinates of the aligned Cα atoms, and involves fitting the rigid bodies onto the framework and rebuilding the non-conserved parts, i.e. loops and side-chains. The main difference between the rigid-body assembly programs lies in how side-chains and loops are built. nest uses a stepwise approach, changing one evolutionary event from the template at a time, while 3d-jigsaw and Builder use mean-field minimization methods (Koehl & Delarue, 1996).

Modeling by segment matching. The basis for modeling by segment matching is the fact that most hexamers (oligopeptides of six amino acid residues) occurring in protein structures can be clustered into 100 structural classes (Unger et al., 1989). This means that a 3D model can be constructed by using a subset of atomic positions derived from the target-template alignment as a guide to find matching segments in a representative database of all known protein structures (Jones & Thirup, 1986; Claessens et al., 1989; Levitt, 1992). This approach can construct both main-chain and side-chain atoms and can also model insertions and deletions. The idea is implemented in the program SegMod (Levitt, 1992).

Modeling by satisfaction of spatial restraints. Methods in this class derive restraints from the target-template alignment. This is usually done by assuming that the corresponding distances and angles between aligned residues in the template and the target structure are similar. These restraints are then supplemented by stereochemical restraints on bond lengths, bond angles, dihedral angles and non-bonded atom-atom contacts using a molecular mechanics force field. The model is then obtained by minimizing the violations of all restraints. One of the most frequently used modeling programs, Modeller (Sali & Blundell, 1993), uses this approach. A further benefit of this approach is that it is easy to add restraints derived from a number of different sources to the restraints derived from the alignment, for instance predicted secondary structure or restraints obtained from NMR experiments or fluorescence spectroscopy. In this way, the 3D models can be improved by making use of available experimental data.
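The idea of obtaining a model by minimizing restraint violations can be illustrated with a one-dimensional toy; the three "atoms", the distance restraints and the plain gradient descent below are invented stand-ins for what a real program such as Modeller does in 3D with many more restraint types:

```python
# One-dimensional toy of restraint satisfaction: place three points so
# their pairwise distances match "template-derived" restraints, by
# minimizing the summed squared violations with plain gradient descent.
restraints = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}  # target distances

def violation(x):
    return sum((abs(x[i] - x[j]) - d) ** 2 for (i, j), d in restraints.items())

def minimize(x, lr=0.02, steps=5000):
    for _ in range(steps):
        grad = [0.0] * len(x)
        for (i, j), d in restraints.items():
            diff = x[i] - x[j]
            dist = abs(diff) or 1e-9              # guard against division by zero
            g = 2.0 * (dist - d) * (diff / dist)  # d/dx_i of (|x_i - x_j| - d)^2
            grad[i] += g
            grad[j] -= g
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

# Start from a distorted conformation; the distances converge to 1, 1 and 2.
model = minimize([0.0, 0.3, 0.6])
```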

3.2.4 Model evaluation

The quality of a predicted model determines the information that can be extracted from it. Thus, it is important to estimate the accuracy of a model before it is used. The model can be evaluated as a whole or in individual regions. There are many different kinds of errors that can occur in a protein model, including mistakes in side-chain packing, relatively small shifts and distortions in correctly aligned regions, errors in the unaligned regions (often loops), alignment errors and fold assignment mistakes. The first step in the evaluation process is usually to assess whether the protein model has the correct fold. This will be true if the correct template is chosen and if the target-template alignment is approximately correct. The score from the database search is usually a good measure for deciding whether the correct fold has been found or not; conserved key functional or structural residues in the target sequence can also be used. Once the fold has been established, a more detailed estimate of the model quality can be obtained from the similarity between the target and template sequences. The sequence identity between the target and template sequence is a fairly good measure of model quality if the identity is above 30%. If the sequence identity is below 30% it becomes unreliable as a quality measure, and in these cases model evaluation methods are especially useful. A basic requirement for a model is to have good stereochemistry. This is usually not a problem, not even for totally incorrect models, since most modeling programs use a force field to enforce proper stereochemistry.
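As an illustration of sequence identity as a quality measure, percent identity over the non-gapped columns of an alignment can be computed as follows (the sequences are the example target-template alignment from Figure 2):

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over the non-gapped columns of two aligned sequences."""
    assert len(aln_a) == len(aln_b)
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)

# The example target-template alignment from Figure 2.
target   = "SCDKLLDDELDDDIACAKKILAIKGID"
template = "SCDKFLDDDITDDIMCAKKILDIKGID"
print(round(percent_identity(target, template), 1))  # prints 77.8
```

At such high identity the model can be expected to be quite accurate; it is in the below-30% regime that dedicated model evaluation methods become essential.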

Distributions of structural features from known protein structures can also be used. Large deviations from the most likely values are usually a strong indicator of errors in the model. Such features include packing (Gregoret & Cohen, 1991), formation of a hydrophobic core (Bryant & Amzel, 1987), residue and atomic solvent accessibility (Koehl & Delarue, 1994b), spatial distribution of charged groups (Bryant & Lawrence, 1991), distribution of atom-atom contacts (Colovos & Yeates, 1993) and main-chain hydrogen bonding (Laskowski et al., 1993). There are also methods that combine the different criteria listed above. ProsaII (Sippl, 1993) uses the probability of finding two residues at a specific distance from each other. Verify3d (Bowie et al., 1991; Lüthy et al., 1992) uses 3D profiles derived from the local environment around each residue. There has been some progress in the more physics-based approaches to model evaluation (Lazaridis & Karplus, 1999; Lazaridis & Karplus, 2000). It has recently been shown that, using the generalized Born solvation model (Still et al., 1990) as a description of solvent effects, physical energy functions can be used to identify the native-like conformation (Felts et al., 2002; Dominy & Brooks, 2002). However, for model evaluation purposes they have not reached the same accuracy as the methods based on statistics from known protein structures.

3.3 Meta prediction

The increasing variety of protein structure prediction methods has led to a series of different meta-predictors or consensus predictors. These methods do not make new predictions; instead they combine information from different methods in order to increase performance. The basic idea is that if several independent methods produce similar models, it is a strong indication that the models are correct. The first consensus predictor was Pcons (Lundström et al., 2001); since then a number of other predictors have been developed, e.g. 3d-shotgun (Fischer, 2003) and 3d-jury (Ginalski et al., 2003). In a way the meta-predictors try to mimic the work of an expert that looks at many different alternative models before making the final selection, see Figure 3. By using this approach around 10% more correct predictions are generated, but perhaps even more important is the increased specificity. Meta-prediction

is also an excellent way to evaluate different alternative models, and in combination with structural evaluation the performance can be increased even further (Paper I and II).

[Figure 3: Meta-predictors mimic the work of an expert that looks at many alternative models (here, ranked hits from the servers FFAS03, ORFEUS and PDB-BLAST) before making the final selection. Two examples are illustrated, one easy and one hard; in both cases the meta-predictor makes the correct choice. In the easy case it does not really matter which model is selected, since all of them are more or less correct. In the hard case, on the other hand, most of the selections would be incorrect. It is also here that the meta-servers have proven to be most useful.]
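The consensus idea can be sketched as follows: score each candidate model by its average similarity to all the others, so that models many methods agree on rank highest. The one-dimensional "models" and the similarity function below are invented stand-ins for real models compared with a structural similarity score:

```python
# Score each candidate model by its average similarity to all the others;
# models that many independent methods agree on get high consensus scores.
def consensus_scores(models, similarity):
    scores = {}
    for name, model in models.items():
        others = [m for other, m in models.items() if other != name]
        scores[name] = sum(similarity(model, m) for m in others) / len(others)
    return scores

# Invented stand-ins: "models" are numbers, and similarity decays with
# distance; a real meta-predictor would compare 3D structures instead.
models = {"server_a": 1.0, "server_b": 1.1, "server_c": 0.9, "outlier": 5.0}
scores = consensus_scores(models, lambda a, b: 1.0 / (1.0 + abs(a - b)))
best = max(scores, key=scores.get)
print(best)  # prints server_a
```

The outlier prediction receives a low consensus score even though nothing about it individually flags it as wrong, which is exactly why agreement between independent methods improves specificity.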

4 Techniques and databases

In this section, the main techniques and resources that were used in the development of the predictors in Paper I-IV will be described. This involves a description of the protein classification database SCOP, a description of neural networks, as well as a description of the different types of benchmarks that can be used to assess the results.

4.1 SCOP – Structural Classification Of Proteins

The aim of the SCOP database (Murzin et al., 1995) is to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins in the Protein Data Bank (PDB). The quality of the database is very high, since the classification is constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality. On the negative side, it is not very frequently updated. Each protein is classified on four hierarchical levels: family, superfamily, fold and class. Proteins within a family have a clear common evolutionary relationship; they either have residue identities of 30% and greater, or lower sequence identity but very similar structure and function. That two families share the same superfamily means that the proteins, based on their structural and functional features, most likely have a common evolutionary origin. Proteins with the same fold have the same major secondary structures in the same arrangement and with the same topological connections. Different folds are grouped into seven pre-defined classes, such as all-α helices, all-β sheets, or α and β. The latest release of the database (release 1.69, July 2005) contains 70,859 protein domains from 25,973 PDB entries grouped into 2,845 families, 1,539 superfamilies and 945 folds.

4.2 Neural networks

Here, the machine learning technique that was used in the development of the predictors in Paper I-IV will be described. Neural networks are a standard technique in pattern recognition and classification that has been used for many years in a wide variety of applications, ranging from investment analysis to engine management and medical phenomena. Neural networks are inspired by how the brain is built up of interconnected neurons, even though the number of neurons and connections is much smaller than in the brain. The most common type of neural network is the feed-forward network with sigmoidal nodes and error back-propagation for training. The feed-forward network provides a general framework for representing non-linear functional mappings from a set of input variables to a set of output variables. In biology, neural networks have been applied successfully to the prediction of secondary structure (Qian & Sejnowski, 1988; Rost et al., 1994; Jones, 1999b), subcellular localization (Emanuelsson et al., 1999), the recognition of post-translational modifications (Blom et al., 1999) and many more. Below, the feed-forward networks will be introduced in more detail; for a more thorough introduction see (Bishop, 1995).

4.2.1 Input data encoding

A very important issue when using neural networks is how to encode the input data. The input data should capture general features and should also be normalized, allowing for comparison between different examples. In our case we have performed prediction on protein structure models. The models consist in principle of a list of 3D coordinates for all atoms in the protein structure. There are numerous ways to translate a list of coordinates for atoms (with different properties) to numbers that the neural networks can use. On top of this, since our goal is to decide the quality of a protein model, there will always be a question of what to consider as correct and what to consider as incorrect. A number of protein quality measures have been developed throughout the years; for a review see (Cristobal et al., 2001). All of them compare the model to the correct structure, and if the model and the correct structure are similar, the quality should be high. This is also true for most measures, when the model and the correct structure are similar enough. However, most disagreements occur when the model and the correct structure are not so similar, but still contain similar “features”. Large-scale benchmarks like LiveBench (Bujnicki et al., 2001a; Bujnicki et al., 2001b; Rychlewski et al., 2003) and CASP (Moult et al., 2001) (see Benchmark section) have pushed the development of measures to a set that most researchers in the field can agree on. The focus of this thesis has not been the development of protein quality measures. Instead we have used established measures such as MaxSub (Siew et al., 2000) and LGscore (Cristobal et al., 2001), and focused on the encoding of the input parameters.

Figure 4: Input parameters. The protein model is described by structural features that can be calculated based on the positions of the atoms. These features include atom-atom contacts, residue-residue contacts, surface accessibility and correspondence with predicted secondary structure.

As stated above, the encoding of the input parameters can be done in many different ways. We have focused on simple features, such as contacts between different types of atoms or residues and where different types of amino acids are situated in the protein, on the surface or in the interior (Figure 4). For more details see Paper I and III.
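As an illustration of this kind of encoding, a normalized atom-atom contact count can be computed directly from the coordinate list. This is only a minimal sketch: the atom typing, distance cutoff and normalization used in Paper I and III are more elaborate than shown here.

```python
import math

def contact_features(atoms, cutoff=4.0):
    """atoms: list of (atom_type, (x, y, z)).  Count contacts between
    pairs of atom types within a distance cutoff, then normalize by the
    total number of contacts so that models of different size can be
    compared -- the kind of simple, normalized feature fed to the
    neural networks."""
    counts, total = {}, 0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (t1, p1), (t2, p2) = atoms[i], atoms[j]
            if math.dist(p1, p2) <= cutoff:
                key = tuple(sorted((t1, t2)))
                counts[key] = counts.get(key, 0) + 1
                total += 1
    return {k: v / total for k, v in counts.items()} if total else {}

# Toy coordinate list: only the C-N pair lies within the cutoff.
atoms = [("C", (0.0, 0.0, 0.0)), ("N", (1.3, 0.0, 0.0)), ("O", (10.0, 0.0, 0.0))]
features = contact_features(atoms)
```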

4.2.2 Network architecture

All neural networks consist of nodes (neurons) that are connected to each other through synapses with different weights (Figure 5). The artificial neuron works in a similar way as a neuron in the brain: it calculates a weighted sum of the inputs connected to it and outputs a value, z_j, which depends on the sum.

Figure 5: A neuron.

The different weights on the synapses, w_ij, correspond to the fact that biological synapses also have different impact on the post-synaptic neuron (excitatory and inhibitory synapses). The weighted sum is processed through an activation function, g, before output. This function can be of many different types, e.g. linear, threshold or sigmoidal. The logistic function, belonging to the sigmoidal type, is perhaps the most common; it is defined as:

g(a_j) = z_j = 1 / (1 + exp(−a_j))    (1)

where a_j is the weighted sum

a_j = Σ_i w_ij y_i    (2)

Note that the output of the logistic function lies in the range (0,1) and that the function is both continuous and differentiable. All these properties are useful in the derivation of the training algorithm. For the prediction of real values (in contrast to classification) it is quite common to use a linear activation function at the output node, so as not to limit the range of possible outputs. The neurons in a feed-forward network are organized in layers; there are three types of layers: input, hidden and output (Figure 6). The network input is presented to the input layer, processed through the synapses and the activation function at the hidden layer, before it reaches the output neuron(s). There are no feedback loops or connections between neurons in the same layer; there is only one direction of data flow, and this is also the reason why the network is called feed-forward. Further, this also ensures that the network outputs can be calculated as explicit functions of the inputs and the weights. Unfortunately, it is not possible to decide which network architecture is optimal for a given problem. Thus, to achieve optimal performance, many different networks with different architectures need to be trained and optimized.

Figure 6: A feed-forward neural network. The input data is presented to the input nodes in the input layer, processed through the synapses and the hidden nodes in the hidden layer, to the output node in the output layer.
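The data flow just described can be written out directly: the weighted sum and logistic activation correspond to equations (2) and (1) above. A minimal sketch of a one-hidden-layer network with a linear output node, the architecture used throughout this work:

```python
import math

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass through a feed-forward network with one hidden
    layer: logistic hidden nodes, linear output node.  Because there
    are no feedback loops, the output is an explicit function of the
    inputs and the weights."""
    hidden = []
    for wj, bj in zip(W_hidden, b_hidden):
        a = sum(w * xi for w, xi in zip(wj, x)) + bj   # weighted sum, eq. (2)
        hidden.append(1.0 / (1.0 + math.exp(-a)))      # logistic activation, eq. (1)
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```

With all hidden weights zero, each hidden node outputs g(0) = 0.5, so the network output is simply the weighted sum of 0.5s plus the output bias.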

4.2.3 Neural network training

The training of a neural network involves an iterative procedure of minimization of an error function, with adjustments of the weights being made in a sequence of steps. The first of these steps is to initialize all weights to random values. Then, the input data of the training examples are presented at the input nodes and fed forward in the network to the output node, where the result is compared to the desired output using an error function. Based on the difference between the actual and the desired output, the weights are adjusted to capture the main features in the data. The algorithm for adjusting the weights is called the error back-propagation algorithm (Rumelhart et al., 1986); the name comes from the fact that the errors are propagated backwards in the network from the output, enabling adjustment also of the weights at the input nodes.

The error signal, e_j, at node j is

e_j = z_j − d_j    (3)

where d_j is the desired and z_j the actual output at the node. There exists a wide range of error functions, E, such as the sum-of-squares, city-block and cross-entropy error functions. The simplest and most commonly used is the sum-of-squares, defined as

E = (1/2) Σ_j e_j²    (4)

where j runs over all output nodes. The error function is a function of all the weights in the network; the idea is to impose a change Δw_ij on the weights that is proportional and opposite to the gradient of the error function:

Δw_ij = −η ∂E/∂w_ij    (5)

where η is the learning rate, controlling how fast the weight adjustment should be. Using the chain rule

∂E/∂w_ij = (∂E/∂a_j)(∂a_j/∂w_ij) = δ_j ∂a_j/∂w_ij    (6)

where δ_j denotes the local gradient associated with neuron j. From equation (2) we get

∂a_j/∂w_ij = y_i,  which inserted in eq. (5) gives  Δw_ij = −η δ_j y_i    (7)

This equation states that in order to get the weight adjustment for w_ij (the weight connecting node i with node j), δ_j should be multiplied by y_i (the value at the input end of the weight).
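For a single linear neuron, z = Σ_i w_i y_i, the local gradient is simply the error e (since z = a), and eq. (7) reduces to the classic delta rule w_i ← w_i − η e y_i. A minimal sketch with made-up training data that happens to follow z = 2y₁ − y₂:

```python
def train_neuron(samples, n_weights, eta=0.1, cycles=200):
    """Delta rule for one linear neuron: repeatedly apply
    w_i <- w_i - eta * e * y_i (eq. 7 with delta = e), where
    e = z - d is the error signal of eq. (3)."""
    w = [0.0] * n_weights
    for _ in range(cycles):
        for y, d in samples:                     # on-line learning
            z = sum(wi * yi for wi, yi in zip(w, y))
            e = z - d                            # eq. (3)
            w = [wi - eta * e * yi for wi, yi in zip(w, y)]
    return w

# Invented, self-consistent data generated by z = 2*y1 - y2; the
# weights converge towards [2, -1].
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = train_neuron(samples, 2)
```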

For the output nodes the evaluation of δ_j is straightforward:

δ_j^output = g′(a_j) ∂E/∂z_j = g′(a_j) e_j    (8)

and if j is a neuron in the hidden layer

δ_j^hidden = g′(a_j) Σ_k δ_k w_kj    (9)

where k runs over all output nodes. A presentation of all the input patterns once is called a training cycle. The weight adjustment is either done after the presentation of each pattern (on-line learning) or after first summing the derivatives over all the patterns in the training set (batch learning). The weights are updated in order to give minimal errors when the input training data is presented to the network. However, in order for the network to work on real problems, all different types of examples must be present in the training set; e.g. in the case of structure prediction, both good and bad models need to be present in the training data, not only good models or only native structures. The ultimate aim of neural network training is to capture general features in the data that can be used to make predictions on unseen data, i.e. the network should be able to generalize. To achieve this goal it is important not to over-train the network. Over-training occurs when the network is trained to such an extent that it performs very well on the training data, but rather poorly on a new test set that is not part of the training data, see Figure 7. To avoid over-training, the training must be stopped at an appropriate

training cycle before it occurs. This is usually done by monitoring the error on the test set during training and stopping the training before the error on the test set starts to increase. If possible, a separate test set should then be used to assess the true generalization ability of the network.

Figure 7: Over-training. The error on the training set continues to decrease over all 1,000 cycles, while the error on the test set starts to increase from 250 cycles. This means that the generalization ability of the network deteriorates.
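Putting equations (1)-(9) together, one on-line training step for a network with a single logistic hidden layer and a linear output node can be sketched as follows. The weights and the training example are made up for illustration; a full training run simply repeats this step over all patterns for many cycles while monitoring the test-set error as in Figure 7.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(x, W, b, v, c):
    """Forward pass (eqs 1-2): logistic hidden layer, linear output."""
    y = [logistic(sum(Wji * xi for Wji, xi in zip(Wj, x)) + bj)
         for Wj, bj in zip(W, b)]
    return sum(vj * yj for vj, yj in zip(v, y)) + c, y

def backprop_step(x, d, W, b, v, c, eta=0.5):
    """One on-line error back-propagation update (eqs 3-9)."""
    z, y = predict(x, W, b, v, c)
    e = z - d                                     # eq. (3)
    delta_out = e                                 # eq. (8): linear output, g' = 1
    # eq. (9): g'(a) = y*(1-y) for the logistic function
    delta_h = [yj * (1 - yj) * delta_out * vj for yj, vj in zip(y, v)]
    v = [vj - eta * delta_out * yj for vj, yj in zip(v, y)]            # eq. (7)
    c = c - eta * delta_out
    W = [[Wji - eta * dj * xi for Wji, xi in zip(Wj, x)]
         for Wj, dj in zip(W, delta_h)]
    b = [bj - eta * dj for bj, dj in zip(b, delta_h)]
    return W, b, v, c

# One step on a made-up example reduces the error on that example.
x, d = [1.0, 0.5], 1.0
W, b, v, c = [[0.1, -0.2], [0.3, 0.1]], [0.0, 0.0], [0.2, -0.1], 0.0
before = predict(x, W, b, v, c)[0]
W, b, v, c = backprop_step(x, d, W, b, v, c)
after = predict(x, W, b, v, c)[0]
```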

4.2.4 Cross-validation

To check the performance and generalization ability of the network, cross-validation is usually performed. In cross-validation the total amount of training data is split into n parts (typically n ∈ [5, 10]). The parts should not be too similar; thus, two similar examples should be placed in the same part. The training is then performed on n − 1 parts, and the remaining part is used as a test set to check the performance. This is repeated n times, using a different part as test set each time, to minimize the randomness. The cross-validation process allows efficient use of all training examples, while at the same time giving some confidence that the neural network is able to generalize, provided that the performances on the different cross-validation sets are similar.
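The splitting scheme can be sketched as follows. The round-robin partition below is only for illustration; in practice similar examples (e.g. all models of the same protein) must first be grouped so that they end up in the same part.

```python
def cross_validation_splits(examples, n=5):
    """Split the examples into n parts and yield (train, test) pairs,
    each part serving as the test set exactly once."""
    parts = [examples[i::n] for i in range(n)]
    for i in range(n):
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        yield train, parts[i]

data = list(range(10))
splits = list(cross_validation_splits(data, 5))
```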

4.2.5 Other issues when using neural networks

Neural networks are a powerful technique for providing a non-linear functional mapping from a set of input variables to a set of output variables, which is illustrated by the large number of successful applications. However, there are in particular two kinds of problems that have to be dealt with when using neural networks. First, a neural network consists of a large number of parameters (sometimes on the order of several thousands), and in order not to over-fit the parameters on the training data, the number of training examples should be of the same order of magnitude as the number of parameters. Thus, a sufficient amount of data is needed. Second, even though the network architecture and parameter updating are well defined in a mathematical way, there is no obvious way of extracting knowledge about what features are important for the neural network prediction. This makes it hard to understand which features are important, and it also limits the biological usefulness of neural networks in the sense of finding exactly which structural features are important for good models. However, neural networks are still useful for finding which features are important in a more general sense, and also for finding out if certain features are orthogonal to each other, i.e. if the prediction ability increases when different types of features are used in training.

4.2.6 The neural networks in this study

In this study we have used Matlab and the neural network package netlab (Bishop, 1995) to perform all network training. In all cases feed-forward neural networks were used, together with five-fold cross-validation and the error back-propagation algorithm for training, logistic hidden nodes and a linear output node. Only one layer of hidden nodes was used. The most important part of neural network training is the input data. The input data should capture general features and should also allow for comparisons between different examples. We have used features that can be calculated from a protein structure model, i.e. based on the 3D coordinates of all the different atoms in the protein (Figure 4). We have also performed dimensionality reduction by classifying the 20 amino acids into six different groups (Mirny & Shakhnovich, 1999).
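Such a reduction shrinks, for example, a 20x20 residue-residue contact matrix to 6x6. The partition below is a plausible six-class grouping by physico-chemical character, shown only for illustration; the exact partition of Mirny & Shakhnovich (1999) used in the papers may differ.

```python
# Hypothetical six-class reduction of the 20 amino acids (illustrative,
# not necessarily the exact grouping used in Paper I and III).
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic":  "FWYH",
    "polar":     "STNQ",
    "positive":  "KR",
    "negative":  "DE",
    "special":   "GP",
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}

def reduce_sequence(seq):
    """Map a one-letter amino acid sequence onto the reduced alphabet."""
    return [AA_TO_GROUP[aa] for aa in seq]
```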

4.3 Benchmarks

Once a predictor has been developed, it is important to benchmark its performance against other methods. It is extremely important that the data used for benchmarking has not been used for optimizing the parameters of the predictor. In this section, benchmark techniques will be described. This involves a description of benchmarks based on the structural database SCOP, a description of CASP and CAFASP, and finally the continuous benchmarking done through the LiveBench project.

4.3.1 SCOP based

Benchmarks based on SCOP are used frequently for the evaluation of the recognition of correct structures based on sequence similarity. As described above, SCOP is a hierarchical database with three main levels: fold, superfamily and family. Two structures related on the family level are more closely related (more similar) than two structures related on the fold level. This also means that two sequences related on the family level are more easily recognized than two sequences related on the fold level. The first step in this type of benchmark is to do an all-against-all SCOP search with the method to be benchmarked. For each sequence all hits are collected, and the result is typically presented in specificity-sensitivity plots, where the specificity and sensitivity are plotted against each other for all possible cutoffs. The specificity is defined as:

spec = TP / (TP + FP)    (10)

where TP is the number of True Positives and FP the number of False Positives, i.e. the specificity is the fraction of positive hits defined by the method that are also correct. The sensitivity is defined as:

sens = TP / (TP + FN)    (11)

where TP is the number of True Positives and FN the number of False Negatives, i.e. the sensitivity is the fraction of all possible positive hits that are found. The specificity and sensitivity are two competing properties: it is always possible to find all TP (100% sensitivity), but the specificity might then be low (a high number of FP). An example of a specificity-sensitivity plot is seen in Figure 8. The closer the method is to the upper right hand corner (1,1), the better. The figure contains three sets of curves, one for each level of the SCOP database. For the family curves all hits are used, for the superfamily curves hits to the same family are ignored, and for the fold curves hits to the same superfamily and family are ignored. In this way the detection ability at different difficulty levels can be assessed. As expected, it is also clearly seen that detecting sequences related on the fold level is much more difficult than detecting sequences related on the family level.
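Equations (10) and (11) translate directly into code; sweeping the cutoff over all scores produces the points of a specificity-sensitivity plot. The scores and labels below are invented for illustration:

```python
def specificity_sensitivity(scores, labels, cutoff):
    """Specificity and sensitivity (eqs 10-11) at one score cutoff:
    everything scoring at or above the cutoff is called positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < cutoff and l)
    spec = tp / (tp + fp) if tp + fp else 1.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    return spec, sens

# Invented method scores; True = hit to a correctly related protein.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, True, False, True]
curve = [specificity_sensitivity(scores, labels, t) for t in sorted(scores)]
```

Note the trade-off: lowering the cutoff raises the sensitivity but can only lower the specificity.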

4.3.2 CASP

CASP (Critical Assessment of Structure Prediction) is a large-scale community experiment, conducted every two years since 1994 (Moult et al., 2003). The key feature is that the participants make blind predictions of structures. Since the predictions are blind, no one can cheat (hopefully), and for the same reason methods cannot be optimized to do well in CASP. The participants get the sequences of protein structures that are about to be determined by X-ray crystallography or NMR spectroscopy. The experiment is conducted over the summer; the results are collected, and the most successful groups present their work at a conference in December the same year. Since the structures are not known before the predictions are made, the predictions are completely blind. The experiment has been a success in the predictor community, each time

Figure 8: Example of a specificity-sensitivity plot, from (Wallner et al., 2004).

attracting more and more participants and generating more and more predictions (Table 1). In contrast, the target collection, which relies on the generosity of experimentalists, has been fairly constant. Even though the number of participating groups has increased over the years, the largest increase is in the number of submitted models. This is mostly due to the development of automatic methods that routinely submit predictions for every target (see CAFASP). Predictors fall into two categories: teams of participants who devote considerable time and effort to modeling each target, usually having a period of several weeks to complete their work; and automatic servers, which must return a model within 48 hours, without human intervention. The predictions are evaluated using numerical criteria and by manual, independent assessors. There has been progress in the protein structure prediction field over the CASP years; for a recent review see (Moult, 2005). A couple of

CASP number   Year   No. targets   No. predictors   Total no. predictions
CASP1         1994       33             35                    135
CASP2         1996       42             72                    947
CASP3         1998       43             98                  3,807
CASP4         2000       43            163                 11,136
CASP5         2002       67            215                 28,728
CASP6         2004       64            208                 41,283

Table 1: The CASP experiments in numbers. Note the rapid increase in the total number of predictions.

highlights include the successful use of PSI-BLAST (Altschul et al., 1997) in CASP3, the performance of the Baker group in ab initio prediction in CASP4 (Bonneau et al., 2001), and the meta-servers (Paper II) in CASP5. However, one major bottleneck is still the refinement problem, i.e. improving the models beyond a simple backbone copy of the template. At CASP6, for the first time, there was a report of an initial model of a small protein (target 281) refined from a backbone RMSD of about 2.2 Å to about 1.6 Å, with many of the core side-chains correctly oriented. Still, this is for a small protein, and it remains to be seen if it can be scaled to larger structures in the future.

4.3.3 CAFASP

CAFASP (Critical Assessment of Fully Automated Structure Prediction) is the automatic counterpart of CASP (Fischer et al., 2003). The prediction in CAFASP follows the same scheme as in CASP and is also run in parallel, with one important difference: no human intervention is allowed. The goal is to evaluate the performance of fully automatic structure prediction servers. Like the predictions, the assessment is also fully automated, allowing fast evaluation of many servers. Another important difference from CASP is that all predictions have a score (from the server). This makes it possible to assess the reliability of the different methods.

4.3.4 LiveBench

The LiveBench project is a continuous benchmarking program (Bujnicki et al., 2001a; Bujnicki et al., 2001b; Rychlewski et al., 2003). The basic scheme is as follows: every week, new PDB proteins are submitted to the participating protein structure prediction servers, and the results are collected and evaluated using automated model assessment programs. This provides a potential user with an excellent tool for evaluating the confidence of obtained models, and from the server perspective, the server gets an independent, unbiased benchmark. The main advantages of LiveBench compared to CASP are the fast evaluation cycle and the larger number of targets. The assessment is conducted a week after the targets are released. Each week provides around 5 new targets, which can be used to check differences in the performance of the servers. The LiveBench project has been run for almost six years with a steadily increasing number of participating servers, with the exception of set 9 (Figure 9). However, it should also be stated that many of the “new” servers are new versions of an existing server, with the previous version kept as reference.

5 Present investigation

The work presented in this thesis aims at the development of methods that improve the current protein structure prediction methods. In the following, the methods and the results from the five papers included in this thesis (Paper I-V) will be summarized. The presentation will closely follow the development of the methods in relation to the biennial CASP meetings (see above). It has been clear since the first CASPs that human experts do better than automatic methods. However, at CASP4 the best servers started to compete with the best humans. Only 11 human groups were ranked above the best server (3D-PSSM). Furthermore, at rank 7 was the semi-automated method named CAFASP-CONSENSUS, which used the results from other servers to do the prediction (Fischer et al.,

2001). The CAFASP-CONSENSUS idea was then fully automated in Pcons (Lundström et al., 2001), which was the first consensus or meta-predictor. By using the results from six different servers, Pcons generated approximately 8-10% more correct predictions, with a significantly higher specificity. Thus, at that time, Pcons was the best automatic method for protein structure prediction. And this is where our story begins.

Figure 9: The number of participating servers in the LiveBench project.

5.1 Development of a method that identifies correct protein models (Paper I)

The main strength of the consensus analysis, as used in Pcons, is coupled to the structural similarity between the different models returned from the servers. However, there are of course other factors that can be used in the selection process as well, e.g. the score from the server

or an evaluation of the protein model using an energy function. Most energy functions to date seemed to be pretty good at picking the native structure from a set of misfolded decoys. However, protein models are usually relatively far from the native structure, and even a quite good model can contain extended regions and unfavorable interactions to which an energy function may be too sensitive. The aim of this project was to develop an energy function, ProQ, that would be able to handle the difficulties involved in finding the best possible protein model. We decided to use a neural network approach, as it would allow us to test many different types of input in a fairly straightforward manner. An important difference from earlier studies is that we used a more relaxed, and in our minds more realistic, definition of native-like models. We defined “correct” models in a similar way as used in CASP (Moult et al., 2001; Sippl et al., 2001), CAFASP (Fischer et al., 2001) and LiveBench (Bujnicki et al., 2001a; Bujnicki et al., 2001b), i.e. by finding similar fragments between the native structure and a model. The training set consisted of all-atom models generated from a large set of different alignment methods; in total 11,108 protein models were used. From these models a set of structural features, such as atom-atom contacts, residue-residue contacts, surface accessibility and secondary structure information, were calculated, and neural networks were trained to predict the correctness of the model using these features. By combining all the structural features it was possible to increase the correlation coefficient between the predicted and the actual model quality from 0.4 to 0.75. It was quite clear that the combination of different types of structural features was the main reason for the improvement. This can be explained by the fact that incorrect protein models might have many good, i.e. native-like, features, whereas correct protein models might have many bad, i.e. native-unlike, features. The best way to distinguish between these models is to use a combination of measures. The performance of ProQ was compared to the performance of existing methods for evaluating protein model quality. We found that ProQ was at least as good as other methods at identifying the native structure, and better at the detection of correct models. This performance was also maintained over several different test sets and

quality measures. Finally, we combined ProQ with Pcons in a predictor called Pmodeller. This predictor built all-atom models using Modeller (Sali & Blundell, 1993) for all Pcons alignments. ProQ was then applied to assess the quality of the models, and all models were given a combined score based both on the consensus analysis using Pcons and on the structural evaluation using ProQ. This combined Pmodeller score was shown to perform better than Pcons alone. Later studies (e.g. LiveBench and CASP) have confirmed this result, indicating that the structural evaluation using ProQ indeed is helpful in the selection of good models.

5.2 Pcons, ProQ and Pmodeller at CASP5 / CAFASP3 (Paper II)

The great exam was coming up – CASP season! According to our benchmarks, Pmodeller should do pretty well in CASP. However, being good in your own benchmark is one thing; being good in CASP is another story. CASP is the ultimate test. If you do well, no one can say that you have optimized your method to the test set. If you do badly, that is exactly what you will hear. Another nice thing about CASP is that it provides a unique opportunity to compare the performance of automatic servers (like Pmodeller and Pcons) with the performance of manual experts, who might even use these methods. We participated in CASP with four servers: Pcons2, Pcons3, Pmodeller and Pmodeller3, where Pmodeller uses Pcons2 and Pmodeller3 uses Pcons3. The difference between the two Pcons versions is the servers they are using: Pcons2 uses seven servers, while Pcons3 uses only three (but better) servers (Table I, Paper I). So the questions were: Would Pmodeller do better than Pcons, i.e. does the quality assessment with ProQ really help? And how good are we in comparison to other servers and other manual groups? It turned out that Pmodeller did extremely well and was one of the top-ranked servers (Table IV, Paper II); it actually performed so well that we were selected to give a talk at the CASP5 meeting. The performance of Pmodeller was similar to all but a few manual experts.

Although a small group of experts still performed better, most of the experts participating in CASP5 actually performed worse, even though they had full access to all automatic predictions and to Pmodeller in particular. Pmodeller was consistently ranked higher than the corresponding Pcons version in CASP5 and in CAFASP3 (Fischer et al., 2003), and later also in LiveBench (Rychlewski et al., 2003). Further, it could also be shown that the homology modeling step in Pmodeller was not the reason for this improvement, since no examples of significantly better models were observed when Pmodeller and Pcons used the same alignment. Thus, the main reason for the difference in performance between Pmodeller and Pcons was the ProQ structural evaluation of the quality of the protein models. This procedure increased both the total number of correct predictions and the specificity of the predictions, by reducing the number of false positives. The main lesson from CASP5 was that by using a consensus server, like Pmodeller, it was possible to compete with most manual experts. However, some manual experts made significantly better predictions than the best consensus servers. One noticeable difference between Pmodeller and these best manual groups was that the total number of residues predicted was about 5% lower for Pmodeller, i.e. the models from Pmodeller were shorter. A possible reason for this is that Pmodeller was restricted to one alignment per model. During the development of Pmodeller we tried using multiple-template alignments to increase the target coverage, but without success. Another related problem is the division of the target sequence into domains, where most servers will only return the hit to the best domain, resulting in a model with only one domain. An iterative approach, where regions with no hits are resubmitted to the servers, has proven to be useful (Chivian et al., 2003).

5.3 Detecting correct residues with ProQres and ProQprof (Paper III)

After CASP5 it was clear that the best groups had used meta-servers in combination with additional measures to assess the quality of the final models. In particular, many groups used Verify3d (Bowie et al., 1991; Lüthy et al., 1992) to assess the quality of different parts of the models (Kosinski et al., 2003; von Grotthuss et al., 2003). The consensus idea has also been applied to find the parts of a model that are most reliable (Fischer, 2003). Both these approaches were then used as a basis to build a hybrid model using multiple templates. Given the success of the Pmodeller server, and in particular the ProQ module, we thought that it would be possible to use similar ideas to develop a local version of ProQ: ProQres. This version would, instead of predicting the quality of the whole model, predict the quality of single residues. Since the combination of different structural features made ProQ perform significantly better than Verify3d at detecting correct protein models, it should be possible to use different local structural features and perform better at the residue level as well. In addition, we also developed a local quality predictor that uses a window of profile scores to assess the quality of protein models built from an alignment: ProQprof. As for ProQ, we used neural networks. These were trained on models built from structural alignments between proteins related on the superfamily level according to SCOP, using five-fold cross-validation. The final predictors were then benchmarked using two additional test sets, one derived from LiveBench and one derived from SCOP. The structural features used in ProQres were identical to those used in ProQ, i.e. atom-atom contacts, residue-residue contacts, solvent accessibility and secondary structure information. However, in order to get a local quality prediction, the environment around each residue was described by calculating the different structural features for a sliding window and predicting the correctness of the central residue. In this way, each residue in the sequence is described by all the contacts made by all residues in the window.
ProQprof used a window of profile scores calculated from the alignment as the basis for training. These scores were calculated by applying the LogAver (von Öhsen & Zimmer, 2001) scoring function to equivalent positions (vectors) in the two profiles corresponding to the two sequences in the alignment.

Both ProQres and ProQprof were shown to be more accurate than existing methods. ProQprof performs better than ProQres for models based on distant relationships, while ProQres performs best for models based on closer relationships. We also showed that a combination of ProQres and ProQprof performs better than either alone. A possible explanation for this difference in performance depending on model quality is that poor models lack many native long-range interactions, which makes a structural evaluation less reliable, while a profile-profile comparison is still able to identify the correct positions. Models of higher quality, on the other hand, contain many more native long-range interactions, improving the structural evaluation.
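The profile-column scoring underlying the ProQprof input can be illustrated with a minimal sketch of a log-average-style score. The two-letter alphabet and odds matrix below are toy assumptions; the real LogAver score (von Öhsen & Zimmer, 2001) is defined over the 20 amino acids with substitution-derived odds:

```python
import math

# Minimal sketch of log-average-style profile-profile scoring: two profile
# columns p and q (probability distributions over the alphabet) are scored
# as the log of the expected substitution odds ratio, where
# odds[a][b] = P(a,b) / (f(a) * f(b)) is a precomputed odds matrix.

def log_average(p, q, odds):
    expected = 0.0
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            expected += pa * qb * odds[a][b]
    return math.log(expected)

# Toy 2-letter alphabet: a matching pair is twice as likely as by chance.
odds = [[2.0, 0.5],
        [0.5, 2.0]]
print(log_average([1.0, 0.0], [1.0, 0.0], odds))  # log 2: conserved match
print(log_average([0.5, 0.5], [0.5, 0.5], odds))  # log 1.25: uninformative columns
```

In a ProQprof-style setting, a window of such column scores along the alignment would form the input to the neural network.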

5.4 Pcons5: the fifth generation of Pcons (Paper IV)

In this paper we describe the development of Pcons5, which differs significantly from the previous Pcons versions. Pcons5 is also the first available stand-alone version of Pcons, and it should be ideal for anyone who would like to set up a local meta-server. Pcons5 was developed with the goal of being easy to update with new servers. It consists of three modules: a consensus module, a quality assessment module and a score module. Each module returns two scores, and these scores are combined into a final score using a weighted sum. Pcons5 is not restricted to a fixed number of specific servers; both the consensus and the quality assessment modules can be run on any set of models. Only the score module is method specific. The consensus module returns the average similarity to all models and the average similarity to the first-ranked models from each server. The quality module assesses the quality of the models using a backbone version of ProQ. The score module returns two scores that measure the reliability of the server score, e.g. a BLAST E-value below 10^-10 would be given a high score and an E-value above 10^-1 a low score. In addition, we developed a Pmodeller5 version based on Pcons5 that builds all-atom models using Modeller (Sali & Blundell, 1993) and re-ranks the models using the original version of ProQ, which is significantly better than the backbone ProQ module used in Pcons5.
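The scoring scheme described above can be sketched as follows. This is a hypothetical illustration, with each module collapsed to a single number for brevity (the real server uses two scores per module); the similarity function, weights and values are made-up placeholders:

```python
# Hypothetical sketch of a Pcons5-style scoring scheme: a consensus score
# (average similarity of a model to the other models), a quality score and
# a server-specific score are combined with a weighted sum.

def consensus_score(model, all_models, similarity):
    """Average structural similarity of `model` to every other model."""
    others = [m for m in all_models if m != model]
    return sum(similarity(model, m) for m in others) / len(others)

def final_score(module_scores, weights):
    """Combine the per-module scores with a weighted sum."""
    return sum(w * s for w, s in zip(weights, module_scores))

# Toy set of three models with faked pairwise similarities.
sims = {("A", "B"): 0.8, ("A", "C"): 0.6, ("B", "C"): 0.7}
def sim(x, y):
    return sims.get((x, y)) or sims.get((y, x))

models = ["A", "B", "C"]
cons = consensus_score("A", models, sim)  # (0.8 + 0.6) / 2 = 0.7
quality = 0.5                             # stand-in for the ProQ module
server = 0.9                              # stand-in for the score module
print(final_score([cons, quality, server], [0.5, 0.3, 0.2]))  # ~0.68
```

Because only the last module depends on how a server reports its confidence, adding a new server in such a design means writing one small score translator rather than retraining the whole predictor.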

Pcons5 and Pmodeller5 participated in CASP6. For technical reasons we had to start the server manually, but the predictions were produced without human intervention. Thus, in the official CASP rankings we are listed as a manual group, while in fact it was an automatic server. This was also the reason why we were not selected to give a talk this time, even though Pmodeller was indeed one of the best servers. It also performed better than Pcons5 and the previous versions of Pmodeller and Pcons (according to the sum of the GDT_TS scores for the first-ranked models).
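The GDT_TS measure used for this ranking can be sketched as follows. This is a simplified illustration that assumes per-residue model-to-native distances after superposition are already available; the real GDT_TS maximizes each count over many superpositions:

```python
# Simplified sketch of GDT_TS: the average, over distance cutoffs of
# 1, 2, 4 and 8 Angstrom, of the percentage of residues whose model
# position lies within the cutoff of the corresponding native position.

def gdt_ts(distances):
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    n = len(distances)
    percents = [100.0 * sum(d <= c for d in distances) / n for c in cutoffs]
    return sum(percents) / len(percents)

print(gdt_ts([0.5, 1.5, 3.0, 9.0]))  # (25 + 50 + 75 + 75) / 4 = 56.25
```

Summing this score over the first-ranked model for every target gives a single number on which servers can be compared.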

5.5 Are all homology modeling programs equal? (Paper V)

In this study we presented the first large-scale benchmark of publicly available homology modeling programs, i.e. programs that construct the 3D coordinates of a protein model based on the alignment to a known protein structure. The following six programs were tested: Modeller, SegMod/encad, swiss-model, 3d-jigsaw, nest and Builder. The idea was simple: give each program the same input (alignment and template) and analyze the model it produced with respect to physicochemical criteria and structural similarity to the correct structure. However, making the comparison turned out not to be as simple and straightforward as we had expected, because some programs crashed for certain inputs and sometimes did not return coordinates for all residues in the sequence. Not returning any coordinates is probably worse than returning wrong coordinates, but not everyone will agree on this assumption. In any case, we chose to ignore these cases and only compared models, or rather residues, that existed in all models from all programs. In the following, the most important results from this study are presented.

From our study it was clear that the different modeling approaches give rise to differences in the final model. It is true that the alignment is crucial for the quality of the final model and that no model-building procedure can recover from an incorrect alignment. The effect of an alignment error, however, can depend strongly on the modeling procedure; e.g. a rigid-body assembly method will be much more sensitive to alignment errors than a method that derives distance restraints from the alignment (Figure 1, Paper V). It was also clear that no single modeling program performs best in all tests. All programs had their pros and cons (Table 4, Paper V), but three of them, Modeller, nest and SegMod/encad, performed better than the others. These programs were reliable and most often produced chemically correct models.

SegMod/encad performs very well in all tests except for backbone conformation; nest very rarely makes the models worse than the template, but the chemistry is not as good as for SegMod/encad; Modeller is in general good, except for a few examples of poor convergence and suboptimal side-chain positioning. swiss-model failed to produce models for 10% of all alignments (all the others failed for less than 1%), but performed quite well when it did produce a model. 3d-jigsaw and Builder had problems with modeling all residues, resulting in decreased overall performance.

Also, none of the programs built side-chains as well as a program specialized for this task, such as SCWRL, using the same information. There should therefore be room for improvement in this area.
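The restriction to residues present in all models, described at the start of this section, can be sketched as follows. The dict-of-coordinates representation is a toy stand-in for parsed PDB files:

```python
# Sketch of the evaluation restriction: only residues that received
# coordinates in every program's model are compared.  Each model is
# represented as a {residue_number: coordinates} dict.

def common_residues(models):
    """Sorted residue numbers present in all models."""
    return sorted(set.intersection(*(set(m) for m in models)))

model1 = {1: (0, 0, 0), 2: (1, 0, 0), 3: (2, 0, 0)}
model2 = {1: (0, 0, 0), 3: (2, 1, 0)}                    # residue 2 missing
model3 = {1: (0, 0, 0), 2: (1, 1, 0), 3: (2, 0, 1), 4: (3, 0, 0)}

print(common_residues([model1, model2, model3]))  # [1, 3]
```

All per-residue and whole-model statistics are then computed over this common subset, so that a program is not penalized twice, once for a missing residue and once in the structural comparison.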

6 Conclusions and reflections

By using a machine-learning approach we have shown that it is possible to predict the quality of protein models based on structural features (Papers I-III). We also showed that different structural features contain orthogonal information and that the combination of different structural features is important to distinguish between correct and incorrect protein models (Paper I). Further, we have improved the best meta-predictor, Pcons, by including a quality assessment step using our quality predictor, ProQ, in the Pmodeller series. The improvement from using ProQ has been confirmed both in LiveBench and in CASP, where Pmodeller has been one of the top-performing servers. Our latest predictors, ProQres and ProQprof, predict the local model quality based on local structural features and on evolutionary information, respectively. Using evolutionary information is clearly beneficial for detecting good regions in protein models, while structural information is more useful for detecting poor regions. These predictors might prove useful in the construction of hybrid models using more than one template.

We have also conducted the first large-scale benchmark of publicly available homology modeling programs. Here we showed that there are differences between the modeling programs. Three of them, Modeller, nest and SegMod/encad, performed better than the others.

Clearly, one of the major bottlenecks in the protein structure prediction field today is refinement, e.g. taking a 2 Å model down to 1 Å. There has been some progress in this area, by using protein evolution to restrict the refinement to the parts of the model that are most likely to undergo a change (Qian et al., 2004). However, even the restricted sampling requires accurate energy functions to select the final conformation. Most likely these energy functions will be a combination of different features taken both from physics and from evolution.

7 Acknowledgments

This is probably the most important part. I want to acknowledge all the people who contributed in one way or the other to the production of this thesis.

First of all, Arne Elofsson, my supervisor, for many encouraging discussions, great ideas and for being the best supervisor you could ever have.

My roommates at SBC, Erik and Åsa, for always arriving early and for providing a perfect excuse to discuss the important things in life.

Johannes for introducing me to the world of Hattrick and for his unbelievable stories that actually are true.

Tomas for explaining how to order "macchiato due" in Rome.

All other members of the constantly growing AE group: Diana Ekman, Oliva Eriksson, Kristoffer Illergård, Per Larsson and Håkan Viklund for being such lovely colleagues.

All other PhD students, post-docs, project students, sys admins, assistant, associate and full professors at SBC for creating a wonderful working environment.

The Swedish government for setting up the Research School in Genomics and Bioinformatics to pay for my salary.

Helge Ax:son Johnsons Foundation and Jan-Arthur Ekströms Minnesstiftelse for providing additional financial support.

Olov, Jakob and Magnus for your friendship - see you on the beach in Herta in October!

Magnus for our endless discussions at the gym, I even explained the problem of angles in three dimensions once.

My parents, for encouraging me to walk my own paths and pursue my own goals and for always helping out.

Susanna, the love of my life, thanks for always being by my side.

Stockholm, August 2005.

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J.. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25 (17):3389–3402.

Anfinsen, C. B.. 1973. Principles that govern the folding of protein chains. Science 181 (96):223–230.

Aszodi, A., and Taylor, W. R.. 1996. Homology modelling by distance geometry. Fold Des 1 (5):325–334.

Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N., and Yeh, L. S.. 2005. The universal protein resource (uniprot). Nucleic Acids Res 33 (Database issue):D154–9.

Baker, D., and Sali, A.. 2001. Protein structure prediction and structural genomics. Science 294 (5540):93–96.

Bates, P. A., Kelley, L. A., MacCallum, R. M., and Sternberg, M. J.. 2001. Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins Suppl 5:39–46.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E.. 2000. The protein data bank. Nucleic Acids Res 28 (1):235–242.

Bishop, Christopher M.. 1995. Neural Networks for Pattern Recognition. Great Clarendon St, Oxford OX2 6DP, UK: Oxford University Press.

Blom, N., Gammeltoft, S., and Brunak, S.. 1999. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 294 (5):1351–1362.

Blundell, T. L., Sibanda, B. L., Sternberg, M. J., and Thornton, J. M.. 1987. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326 (6111):347–352.

Bonneau, R., and Baker, D.. 2001. Ab initio protein structure prediction: progress and prospects. Annu Rev Biophys Biomol Struct 30:173–189.

Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C.E., and Baker, D.. 2001. Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 45 (Suppl 5):119–126.

Bowie, J.U., Lüthy, R., and Eisenberg, D.. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253 (5016):164–170.

Browne, W. J., North, A. C., Phillips, D. C., Brew, K., Vanaman, T. C., and Hill, R. L.. 1969. A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J Mol Biol 42 (1):65–86.

Bryant, S.H., and Amzel, L.M.. 1987. Correctly folded proteins make twice as many hydrophobic contacts. Int J Pept Prot Res 29:46– 52.

Bryant, S. H., and Lawrence, C. E.. 1991. The frequency of ion-pair substructures in proteins is quantitatively related to electrostatic potential: a statistical model for nonbonded interactions. Proteins 9 (2):108–119.

Bujnicki, J.M., Elofsson, A., Fischer, D., and Rychlewski, L.. 2001a. Livebench-1: continuous benchmarking of protein structure prediction servers. Protein Sci. 10 (2):352–361.

Bujnicki, J.M., Elofsson, A., Fischer, D., and Rychlewski, L.. 2001b. Livebench-2: large-scale automated evaluation of protein structure prediction servers. Proteins 45 (Suppl 5):184–191.

Chivian, D., Kim, D. E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C. E., Bonneau, R., Rohl, C. A., and Baker, D.. 2003. Automated prediction of CASP-5 structures using the robetta server. Proteins 53 Suppl 6:524–533.

Chothia, C., and Lesk, A. M.. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J 5 (4):823–826.

Claessens, M., Van Cutsem, E., Lasters, I., and Wodak, S.. 1989. Modelling the polypeptide backbone with 'spare parts' from known protein structures. Protein Eng 2 (5):335–345.

Colovos, C., and Yeates, T.O.. 1993. Verification of protein structures: patterns of nonbonded atomic interactions. Protein Sci. 2 (9):1511–1519.

Cristobal, S., Zemla, A., Fischer, D., Rychlewski, L., and Elofsson, A.. 2001. A study of quality measures for protein threading models. BMC Bioinformatics 2 (5).

Dominy, B.N., and Brooks, C.L.. 2002. Identifying native-like protein structures using physics-based potentials. J Comput Chem. 23 (1):147–160.

Ekman, D., Björklund, A. K., Frey-Skott, J., and Elofsson, A.. 2005. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol 348 (1):231–243.

Elofsson, A.. 2002. A study on how to best align protein sequences. Proteins 15 (3):330–339.

Emanuelsson, O., Nielsen, H., and von Heijne, G.. 1999. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 8 (5):978–984.

Felts, A.K., Gallicchio, E., Wallqvist, A., and Levy, R.M.. 2002. Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the Surface Generalized Born solvent model. Proteins 48 (2):404–422.

Fischer, D.. 2003. 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins 51 (3):434–441.

Fischer, D., Elofsson, A., Rychlewski, L., Pazos, F., Valencia, A., Rost, B., Ortiz, A. R., and Dunbrack, R. L. Jr.. 2001. CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins Suppl 5:171–183.

Fischer, D., Rychlewski, L., Dunbrack, R. L. Jr., Ortiz, A. R., and Elofsson, A.. 2003. CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins 53 Suppl 6:503–516.

Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L.. 2003. 3d-jury: a simple approach to improve protein structure predictions. Bioinformatics 19 (8):1015–1018.

Greer, J.. 1990. Comparative modeling methods: application to the family of the mammalian serine proteases. Proteins 7 (4):317–334.

Gregoret, L. M., and Cohen, F. E.. 1991. Protein folding. effect of packing density on chain conformation. J Mol Biol 219 (1):109– 122.

Havel, T. F., and Snow, M. E.. 1991. A new method for building protein conformations from sequence alignments with homologues of known structure. J Mol Biol 217 (1):1–7.

Holm, L., and Sander, C.. 1996. Mapping the protein universe. Science 273:595–603.

Jones, D.T.. 1999a. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287 (4):797–815.

Jones, D.T.. 1999b. Protein secondary structure prediction based on position–specific scoring matrices. J Mol Biol 292 (2):195–202.

Jones, T. A., and Thirup, S.. 1986. Using known substructures in protein model building and crystallography. EMBO J 5 (4):819– 822.

Koehl, P., and Delarue, M.. 1994a. Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J Mol Biol 239 (2):249–275.

Koehl, P., and Delarue, M.. 1994b. Polar and nonpolar atomic environments in the protein core: implications for folding and binding. Proteins 20 (3):264–278.

Koehl, P., and Delarue, M.. 1995. A self consistent mean field approach to simultaneous gap closure and side-chain positioning in homology modelling. Nat Struct Biol 2 (2):163–170.

Koehl, P., and Delarue, M.. 1996. Mean-field minimization methods for biological macromolecules. Curr Opin Struct Biol 6 (2):222–226.

Kosinski, J., Cymerman, I. A., Feder, M., Kurowski, M. A., Sasin, J. M., and Bujnicki, J. M.. 2003. A ”FRankenstein’s monster” approach to comparative modeling: merging the finest fragments of fold-recognition models and iterative model refinement aided by 3D structure evaluation. Proteins 53 Suppl 6:369–379.

Laskowski, R.A., MacArthur, M.W., Moss, D.S., and Thornton, J.M.. 1993. Procheck: a program to check the stereochemical quality of protein structures. J Appl Cryst 26 (2):283–291.

Lazaridis, T., and Karplus, M.. 1999. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol 288 (3):477–487.

Lazaridis, T., and Karplus, M.. 2000. Effective energy functions for protein structure prediction. Curr. Opin. Struct. Biol. 10 (2):139– 145.

Levitt, M.. 1992. Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 226 (2):507–533.

Lindahl, E., and Elofsson, A.. 2000. Identification of related proteins on family, superfamily and fold level. J Mol Biol 295 (3):613–625.

Lundström, J., Rychlewski, L., Bujnicki, J., and Elofsson, A.. 2001. Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci. 10 (11):2354–2362.

Lüthy, R., Bowie, J.U., and Eisenberg, D.. 1992. Assessment of protein models with three-dimensional profiles. Nature 356 (6364):83–85.

Marti-Renom, M.A., Stuart, A.C., Fiser, A., Sánchez, R., Melo, F., and Sali, A.. 2000. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325.

Mirny, L.A., and Shakhnovich, E.I.. 1999. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol 291 (1):177–198.

Moult, J.. 2005. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15 (3):285–289.

Moult, J., Fidelis, K., Zemla, A., and Hubbard, T.. 2001. Critical assessment of methods of protein structure prediction (CASP): round IV. Proteins 43 (Suppl 5):2–7.

Moult, J., Fidelis, K., Zemla, A., and Hubbard, T.. 2003. Critical assessment of methods of protein structure prediction (CASP)- round V. Proteins 53 Suppl 6:334–339.

Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C.. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247 (4):536–540.

Ohlson, T., Wallner, B., and Elofsson, A.. 2004. Profile-profile methods provide improved fold-recognition: a study of different profile- profile alignment methods. Proteins 57 (1):188–197.

Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M., Mitros, T., Soto, C. S., Goldsmith-Fischman, S., Kernytsky, A., Schlessinger, A., Koh, I. Y., Alexov, E., and Honig, B.. 2003. Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 53 Suppl 6:430–435.

Qian, B., Ortiz, A. R., and Baker, D.. 2004. Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci U S A 101 (43):15346–15351.

Qian, N., and Sejnowski, T. J.. 1988. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202 (4):865–884.

Rost, B.. 1999. Twilight zone of protein sequence alignments. Protein Eng 12 (2):85–94.

Rost, B., and Sander, C.. 1993. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232 (2):584–599.

Rost, B., Sander, C., and Schneider, R.. 1994. PHD–an automatic mail server for protein secondary structure prediction. Comput Appl Biosci 10 (1):53–60.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J.. 1986. Learning internal representations by error propagation. Cambridge, MA, USA: MIT Press.

Rychlewski, L., Fischer, D., and Elofsson, A.. 2003. Livebench-6: large-scale automated evaluation of protein structure prediction servers. Proteins 53 Suppl 6:542–547.

Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A.. 2000. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 9 (2):232–241.

Sadreyev, R., and Grishin, N.. 2003. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326 (1):317–336.

Sali, A., and Blundell, T.L.. 1993. Comparative modelling by satisfaction of spatial restraints. J Mol Biol 234 (3):779–815.

Schwede, T., Kopp, J., Guex, N., and Peitsch, M.C.. 2003. SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res 31 (13):3381–3385.

Siew, N., Elofsson, A., Rychlewski, L., and Fischer, D.. 2000. Maxsub: an automated measure to assess the quality of protein structure predictions. Bioinformatics 16 (9):776–785.

Simons, K.T., Bonneau, R., Ruczinski, I., and Baker, D.. 1999. Ab initio protein structure predictions of CASP III targets using ROSETTA. Proteins Suppl 3:171–176.

Sippl, M.J.. 1993. Recognition of errors in three–dimensional structures of proteins. Proteins 17 (4):355–362.

Sippl, M.J., Lackner, P., Domingues, F.S., Prlic, A., Malik, R., Andreeva, A., and Wiederstein, M.. 2001. Assessment of the CASP4 fold recognition category. Proteins 45 (Suppl 5):55–67.

Srinivasan, S., March, C. J., and Sudarsanam, S.. 1993. An automated method for modeling proteins on known templates using distance geometry. Protein Sci 2 (2):277–289.

Still, W.C., Tempczyk, A., Hawley, R.C., and Hendrickson, T.J.. 1990. Semianalytical treatment of solvation for molecular mechanics and dynamics. J Am Chem Soc 112 (16):6127–6129.

Sánchez, R., and Sali, A.. 1997. Comparative protein modeling as an optimization problem. J. Mol. Struct. 398 (3):489–496.

Unger, R., Harel, D., Wherland, S., and Sussman, J. L.. 1989. A 3d building blocks approach to analyzing and predicting structure of proteins. Proteins 5 (4):355–373.

Venclovas, C., Zemla, A., Fidelis, K., and Moult, J.. 2003. Assessment of progress over the CASP experiments. Proteins 53 Suppl 6:585– 595.

Vitkup, D., Melamud, E., Moult, J., and Sander, C.. 2001. Completeness in structural genomics. Nat Struct Biol 8 (6):559–566.

von Grotthuss, M., Pas, J., Wyrwicz, L., Ginalski, K., and Rychlewski, L.. 2003. Application of 3D-jury, GRDB, and Verify3D in fold recognition. Proteins 53 Suppl 6:418–423.

von Öhsen, N., and Zimmer, R.. 2001. Improving profile-profile alignments via log average scoring. In WABI '01: Proceedings of the First International Workshop on Algorithms in Bioinformatics pp. 11–26, Springer-Verlag, London, UK.

Wallner, B., Fang, H., Ohlson, T., Frey-Skott, J., and Elofsson, A.. 2004. Using evolutionary information for the query and target improves fold recognition. Proteins 54 (2):342–350.

Yona, G., and Levitt, M.. 2000. Towards a complete map of the protein space based on a unified sequence and structure analysis of all known proteins. Proc Int Conf Intell Syst Mol Biol 8:395–406.
