Protein Structure Prediction: Model Building and Quality Assessment

BJÖRN WALLNER

Stockholm Bioinformatics Center
Department of Biochemistry and Biophysics
Stockholm University, Sweden

Stockholm 2005

Björn Wallner
Stockholm University, AlbaNova University Center
Stockholm Bioinformatics Center
S-106 91 Stockholm, Sweden
e-mail: [email protected]
Tel: +46-8-55 37 85 66
Fax: +46-8-55 37 82 14

Academic dissertation which, by permission of Stockholm University, will be presented for public examination for the degree of Doctor of Philosophy on Friday 30 September 2005 at 10.00 in Svedbergssalen, AlbaNova University Center, Roslagstullsbacken 21, Stockholm. Faculty opponent: Michael Levitt, Stanford University School of Medicine.

Doctoral Thesis © Björn Wallner, 2005
ISBN 91-7155-121-2
Printed by Universitetsservice US AB, Stockholm 2005

To my family

Protein Structure Prediction: Model Building and Quality Assessment
Björn Wallner, [email protected]
Stockholm Bioinformatics Center, Stockholm University, SE-106 91 Stockholm, Sweden

Abstract

Proteins play a crucial role in all biological processes. The wide range of protein functions is made possible through the many different conformations that the protein chain can adopt. The structure of a protein is extremely important for its function, but determining the structure of a protein experimentally is both difficult and time consuming. In fact, with current methods it is not possible to study all the billions of proteins in the world by experiment. Hence, for the vast majority of proteins the only way to get structural information is through the use of a method that predicts the structure of a protein based on the amino acid sequence. This thesis focuses on improving the current protein structure prediction methods by combining different prediction approaches together with machine-learning techniques. This work has resulted in some of the best automatic servers in the world – Pcons and Pmodeller. As a part of the improvement of our automatic servers, I have also developed one of the best methods for predicting the quality of a protein model – ProQ. In addition, I have developed methods to predict the local quality of a protein model, based on the structure – ProQres – and based on evolutionary information – ProQprof. Finally, I have also performed the first large-scale benchmark of publicly available homology modeling programs.

Contents

1 Publications
  1.1 Papers included in this thesis
  1.2 Other publications

2 Introduction

3 Protein Structure Prediction
  3.1 De Novo
  3.2 Comparative modeling
    3.2.1 Finding related structures
    3.2.2 Aligning related structures
    3.2.3 Model building
    3.2.4 Model evaluation
  3.3 Meta prediction

4 Techniques and databases
  4.1 SCOP – Structural Classification Of Proteins
  4.2 Neural networks
    4.2.1 Input data encoding
    4.2.2 Network architecture
    4.2.3 Neural network training
    4.2.4 Cross-validation
    4.2.5 Other issues when using neural networks
    4.2.6 The neural networks in this study
  4.3 Benchmarks
    4.3.1 SCOP based
    4.3.2 CASP
    4.3.3 CAFASP
    4.3.4 LiveBench

5 Present investigation
  5.1 Development of a method that identifies correct protein models (Paper I)
  5.2 Pcons, ProQ and Pmodeller at CASP5/CAFASP3 (Paper II)
  5.3 Detecting correct residues with ProQres and ProQprof (Paper III)
  5.4 Pcons5: the fifth generation of Pcons (Paper IV)
  5.5 Are all homology modeling programs equal? (Paper V)

6 Conclusions and reflections

7 Acknowledgments

1 Publications

1.1 Papers included in this thesis

Paper I
Björn Wallner and Arne Elofsson. Can correct protein models be identified? Protein Science, 12(5):1073-1086, 2003.

Paper II
Björn Wallner, Huisheng Fang and Arne Elofsson. Automatic consensus based fold recognition using Pcons, ProQ and Pmodeller. Proteins, 53 Suppl 6:534-541, 2003.

Paper III
Björn Wallner and Arne Elofsson. Identification of correct regions in protein models using structural, evolutionary and consensus information. Submitted, 2005.

Paper IV
Björn Wallner and Arne Elofsson. Pcons5: combining consensus, structural evaluation and fold recognition scores. Submitted, 2005.

Paper V
Björn Wallner and Arne Elofsson. All are not equal: A benchmark of different homology modeling programs. Protein Science, 14(5):1315-1327, 2005.

Papers are reproduced with permission.

1.2 Other publications

Björn Wallner, Huisheng Fang, Tomas Ohlson, Johannes Frey-Skött and Arne Elofsson. Using evolutionary information for the query and target improves fold recognition. Proteins, 54(2):342-350, 2004.

Tomas Ohlson, Björn Wallner and Arne Elofsson. Profile-profile methods provide improved fold-recognition. A study of different profile-profile alignment methods. Proteins, 57(1):188-197, 2004.

Ann-Charlotte Berglund, Björn Wallner, Arne Elofsson and David A. Liberles. Tertiary Windowing to Detect Positive Diversifying Selection. J. Mol. Evol., 60(4):499-504, 2005.

Huisheng Fang, Björn Wallner, Jesper Lundström, Christer von Wowern and Arne Elofsson. Improved fold recognition by using the Pcons consensus approach. Chapter 16 in Protein Structure Prediction: Bioinformatic Approach, IUL, 2002.

2 Introduction

This work is about the prediction of the structure of proteins from the amino acid sequence. Proteins play a crucial role in all biological processes and constitute a central part of the cell machinery. They are the active components that catalyze various processes, and in that way they control and direct the chemical pathways underlying all processes of the living cell. They can act as transporters, transporting oxygen in the blood or controlling the flow of ions across the cell membrane. They also provide mechanical support for the cell and the whole body. This wide range of functions is made possible through the many different conformations that proteins can adopt. Thus, the structure of a protein is extremely important for its function. Proteins are built from 20 different amino acids forming a chain of variable length. The different amino acids have different properties, and when put together they interact to form a three-dimensional structure. The order of the amino acids, also called the sequence, uniquely defines this structure (Anfinsen, 1973). Obtaining the sequence of a protein is today relatively easy; however, determining the structure of a protein experimentally using X-ray crystallography or NMR spectroscopy is both difficult and very time consuming. The high-throughput methods developed in the last few years have increased the rate at which structures are determined experimentally, but it is still unreasonable to believe that the structure of more than a tiny fraction of all the billions of proteins in the world will be studied by experimental methods in the foreseeable future. Hence, for the vast majority of proteins the only way to get structural information is through the use of methods that predict the structure of a protein based on the amino acid sequence. The structure of a protein depends solely on the amino acid sequence.
This property is extremely important when predicting the structure based on the sequence, because similar sequences will have similar structures. Similar structures most often also have similar functions, which means that by comparing sequences of proteins it is possible to say that two proteins not only have the same structure but also the same function. This forms the basis for homology based structure and function prediction, which today is routinely used to infer function from sequences with known function to sequences with unknown

function. However, how to compare sequences is by no means trivial, and many different methods have been developed throughout the years. This thesis focuses on improving the current protein structure prediction methods by combining different prediction approaches together with machine-learning techniques. In particular, we have developed methods to predict the quality of a predicted structure using neural networks. These predictions have successfully been combined with consensus analysis of predicted structures from existing methods. In the following, the protein structure prediction problem will be described and different approaches to solving the problem will be addressed (section Protein Structure Prediction). Some techniques and databases relevant for the development (section Techniques and databases) and benchmarking (section Benchmarks) of our prediction methods will also be introduced. Finally, my contribution to the protein structure prediction field will be summarized (section Present investigation) and also presented in full (Papers I-V).

3 Protein Structure Prediction

The three-dimensional (3D) structure of a protein is guided by two distinct sets of principles operating on vastly different time scales: the laws of physics and the theory of evolution (Figure 1). From a physical point of view the protein molecule is a consequence of a variety of forces, such as entropy, chemical bonds, Coulomb and van der Waals interactions. Under native conditions these forces fold a protein into a stable, well-defined 3D structure in less than a second. From an evolutionary point of view the protein molecule is the result of evolution over millions of years. Proteins evolve through duplication, speciation, horizontal transfer and mutations. During the course of evolution a protein changes gradually because its function is usually retained, which requires the conservation of its structure and therefore also its sequence (Chothia & Lesk, 1986). These two viewpoints give rise to two different approaches to the protein structure prediction problem (Baker & Sali, 2001): de novo and comparative modeling. The first approach predicts the structure from

the sequence alone, while the second approach relies on finding a similar sequence with known structure. Both these approaches will be described in more detail below. Let me just first mention that any protein structure prediction method can be seen as an optimization of a protein structure model with respect to a certain objective function (Sánchez & Sali, 1997). De novo methods attempt to find the most likely structure of a protein sequence given the physical forces between the atoms in the protein, while comparative modeling methods attempt to find the most likely structure of a protein sequence given its relationship to known similar protein structures.

[Figure 1: Proteins obey two sets of principles, the laws of physics and the theory of evolution. These two principles give rise to two different approaches to the protein structure prediction problem, de novo prediction and comparative modeling. The figure contrasts the physical folding of a sequence (de novo prediction) with its evolution across species such as baboon, mouse, human and cow (comparative modeling).]

3.1 De Novo

De novo methods predict the structure from the sequence alone, without relying on similarity between the modeled sequence and any known structure. These methods assume that the native structure corresponds to the global free energy minimum of the protein and attempt to find this minimum by an exploration of many possible protein conformations. Two key steps in de novo methods are the procedure for efficiently carrying out the conformational search and the free energy function used for evaluating possible conformations. This class of methods was earlier called ab initio, because they predicted the protein structure from scratch without prior knowledge. However, because most of the so-called ab initio methods in fact used prior knowledge, the name de novo is nowadays used for this category, de novo referring to the prediction of structures with new folds. The most successful method in the de novo category is the Rosetta method (Simons et al., 1999; Bonneau et al., 2001) developed by David Baker and co-workers. It uses 3- and 9-residue fragments of known structures with local sequences similar to the sequence to be modeled, and assembles these fragments into complete tertiary structures using a Monte Carlo simulated annealing procedure. The scoring function used in the simulated annealing procedure consists of sequence-dependent terms representing hydrophobic burial and specific pair interactions such as electrostatics and disulfide bonding, and sequence-independent terms representing hard sphere packing, α-helix and β-strand packing, and the collection of β-strands into β-sheets.

The de novo methods have improved in the last few years. In CASP1 and CASP2 they showed little success; however, in CASP3 some de novo methods outperformed fold recognition methods for certain proteins in the fold recognition category (Bonneau & Baker, 2001) (for a description of CASP see the Benchmarks section). At CASP4 the de novo predictions were more consistent and more accurate than at previous CASPs (Bonneau et al., 2001). However, at CASP5 there was no significant increase in performance compared to CASP4 (Venclovas et al., 2003). This is probably because the main improvement in CASP3 was the introduction of the fragment approach, which was then further optimized and led to an improvement also at CASP4. At CASP5 there were no really new ideas in the de novo field and consequently no performance increase. Thus, despite the initial progress, many issues must be resolved if a consistent and reliable de novo prediction scheme is to be developed. For instance, no method performs consistently across all classes of proteins, most methods perform worse on all-β proteins, and all methods fail completely on sequences longer than 150 residues.
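The Monte Carlo simulated annealing search mentioned above can be sketched in a few lines. This toy illustrates only the generic accept/reject strategy; the quadratic "energy" and the proposal move below are invented stand-ins, not anything from Rosetta:

```python
import math
import random

# Toy Metropolis simulated-annealing loop (generic search strategy only):
# propose a random move, always accept improvements, and accept worse
# moves with a probability that shrinks as the temperature is lowered.
def anneal(score, propose, state, t_start=1.0, t_end=0.01, steps=2000):
    t = t_start
    cooling = (t_end / t_start) ** (1.0 / steps)  # geometric cooling schedule
    current = score(state)
    for _ in range(steps):
        candidate = propose(state)
        delta = score(candidate) - current
        if delta < 0 or random.random() < math.exp(-delta / t):
            state, current = candidate, score(candidate)
        t *= cooling
    return state, current

random.seed(0)
# Invented one-dimensional "energy" with its minimum at x = 3.0.
best, energy = anneal(score=lambda x: (x - 3.0) ** 2,
                      propose=lambda x: x + random.uniform(-0.5, 0.5),
                      state=10.0)
```

In a fragment-assembly setting, the proposal would instead replace a short stretch of the backbone with a fragment from a known structure, and the score would be the full energy function described above.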

3.2 Comparative modeling

The aim of all comparative modeling methods is to build a 3D model for a protein of unknown structure (the target) on the basis of sequence similarity to proteins of known structure (the templates). This is possible because small changes in the protein sequence usually result in small changes in its 3D structure (Chothia & Lesk, 1986). Despite the progress in the de novo prediction field (see above), the comparative modeling approach is by far the more accurate of the two. The overall accuracy of comparative models spans a wide range, from low resolution models with only a correct fold to more accurate models comparable to medium resolution structures determined by X-ray crystallography or NMR spectroscopy. Even low resolution models can be useful in biology, as some aspects of function can sometimes be predicted from only the coarse structural features of a model. Today (Mar-2005), it is possible to find structures with similar sequences for one third of all proteins (Ekman et al., 2005). Because the number of known protein sequences is approximately 2,000,000 (Bairoch et al., 2005) (05-Jul-2005), comparative modeling could be applied to more than 650,000 proteins. This number should be compared to approximately 30,000 protein structures determined experimentally (Berman et al., 2000). The usefulness of comparative modeling is also steadily increasing, because the number of different structural folds that proteins adopt is limited and the number of experimentally determined new structures is increasing rapidly (Holm & Sander, 1996). This trend is also accentuated by the recently initiated structural genomics project that aims at determining at least one structure for most protein families (Vitkup et al., 2001). It is conceivable that structural genomics will achieve its aim in 5-10 years, making comparative modeling applicable to most protein sequences.

[Figure 2: The different steps in comparative modeling. Flowchart: START → identify related structures (templates) → align target sequence to template structures (example alignment: TARGET ...SCDKLLDDELDDDIACAKKILAIKGID..., TEMPLATE ...SCDKFLDDDITDDIMCAKKILDIKGID...) → build a model for the target sequence using information from the template structures → evaluate the model (per-residue quality profile) → if the model is not OK, iterate; otherwise END.]

Most comparative modeling methods consist of four steps (Figure 2): finding known structures related to the target sequence to be modeled (i.e. templates); aligning the sequence to the templates; building a 3D model; and assessing the model (Marti-Renom et al., 2000). All four steps can of course be repeated until a satisfactory model is obtained. In the following, each of these steps will be described in more detail.

3.2.1 Finding related structures

The starting point in comparative modeling is the identification of protein structures related to the target sequence (homologs), and the selection of a suitable template structure. The knowledge that two or more proteins are homologous is, besides being the basis for protein structure modeling, the most commonly used means of transferring functional knowledge from one protein to another, and it can also be used for evolutionary studies. Many bioinformatic methods and tools are also based on these relationships; for instance, the reliable detection of related proteins might actually improve secondary structure prediction (Rost & Sander, 1993) and other prediction methods. This is also the reason why the identification of related sequences has been the subject of intensive research during the last 10 years, resulting in a wide range of different methods and approaches to the problem. Many of these methods are available as web servers on the Internet. An extremely useful site is the Meta-server at http://bioinfo.pl/meta, where it is possible to submit a sequence to many different servers using a simple interface.

The basic approach to finding relationships between proteins is to compare their sequences. This can be performed in many different ways, ranging from simple pairwise comparisons to more elaborate comparisons using evolutionary information for the target, the template, or both. The output from all of these approaches is the same: a list of which amino acids in the two sequences are equivalent; this list is called the alignment. It has been demonstrated that methods using multiple sequences, i.e. evolutionary information, are superior to methods that only use single sequences (Lindahl & Elofsson, 2000), and more recently that methods using evolutionary information for both the target and template sequences are even more efficient (Wallner et al., 2004), probably because they better describe which parts of the sequences are conserved and which are variable throughout evolution. However, simpler methods might have computational advantages. The most commonly used program to find related structures is psi-blast (Altschul et al., 1997).
This program uses evolutionary information for the target sequence by using the initial database hits to refine the search to find more homologs in an iterative way. For a given sequence the first database hits are used to build a multiple sequence alignment, a position-specific scoring matrix is constructed from the alignment, and this matrix is then used to search the database for new homologs. These new homologs can then be used to construct a new matrix and do the search again until no more homologs are detected. psi-blast

is quite fast, making it possible to search a database with 2,000,000 sequences in just a few minutes. In most cases psi-blast detects a sufficient number of homologs; however, there are slower methods that are significantly better, especially for the really hard targets. These methods use evolutionary information for both the target and template sequences. The evolutionary information is encoded in a profile, which is constructed from a multiple sequence alignment (for instance using psi-blast). A profile is a set of vectors, where each vector contains the frequency of each type of amino acid in a particular position of the multiple sequence alignment. In this way it is possible to detect which parts of a sequence are conserved and which are variable. Methods using evolutionary information for both the target and template compare two profiles and are therefore often called profile-profile methods (Rychlewski et al., 2000; von Öhsen & Zimmer, 2001; Yona & Levitt, 2000; Sadreyev & Grishin, 2003; Ohlson et al., 2004). Profile-profile comparisons can be done in several different ways, including calculating the sum of pairs, the dot product, or a correlation coefficient between the two vectors. Once a list of related structures is obtained, the templates that are appropriate for the given modeling problem should be selected. Usually, a higher overall sequence similarity between the target and template sequence yields a better template. However, there are also other factors that should be taken into account in the final template selection, such as ligands, quaternary interactions, resolution and the R-factor of a crystal structure. Ultimately the criteria for template selection depend on the purpose of the comparative model.
For instance, when modeling a protein-ligand complex, selecting a template with a similar ligand is probably more important than the resolution of the template, while the modeling of an active site of an enzyme might require a high resolution template. In many cases, several different templates are used and the best models are selected using other criteria (see the Model evaluation section).
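The profile representation and the dot-product comparison described above can be illustrated with a minimal sketch; the alignment columns below are invented, and real profile-profile methods use considerably more elaborate scoring:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column):
    """Amino acid frequency vector for one column of a multiple sequence alignment."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for aa in column:
        if aa in counts:
            counts[aa] += 1
    total = sum(counts.values()) or 1
    return [counts[aa] / total for aa in AMINO_ACIDS]

def dot_product_score(col_a, col_b):
    """Score two profile columns by the dot product of their frequency vectors."""
    return sum(a * b for a, b in zip(col_a, col_b))

# Invented alignment columns: a conserved leucine column from the target
# profile against a more variable hydrophobic column from the template.
target_col = column_profile("LLLLIL")
template_col = column_profile("LLIVLM")
print(round(dot_product_score(target_col, template_col), 3))  # prints 0.444
```

Two strongly conserved columns with the same dominant residue would score close to 1, while unrelated columns score close to 0; a full profile-profile method computes such column scores for every pair of positions and feeds them into an alignment algorithm.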

3.2.2 Aligning related structures

Most methods for finding related structures also produce an alignment between the target sequence and the sequence of the template structure. For closely related protein sequences with identities over 40%, the alignment is most often close to optimal. As the sequence similarity decreases, the alignment becomes more difficult and will contain an increasingly large number of gaps and alignment errors (Rost, 1999; Marti-Renom et al., 2000; Elofsson, 2002). It is extremely important that the alignment is correct, since no modeling program can recover from an incorrect alignment. Since the search methods are optimized for the detection of remote relationships and not for optimal alignments, the alignment from the database search is often not the optimal target-template alignment for comparative modeling. Therefore, a specialized method could be used to align the target sequence with the template structures (Elofsson, 2002). In difficult alignment cases, evolutionary information together with structural information from alternative templates should be used. The structural information can help avoid placing gaps in secondary structure elements, in buried regions or between residues that are far apart in space. Secondary structure prediction for the target sequence can also be used to obtain a more accurate alignment (Elofsson, 2002). However, in many cases the evaluation of a model is more reliable than the evaluation of an alignment. Thus, a good strategy is to build 3D models for all alternative alignments, evaluate the corresponding models and select the best model according to the 3D model evaluation instead of the alignment score (Jones, 1999a) (Paper I).
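The strategy of building models for all alternative alignments and selecting by model evaluation can be sketched as follows; `build_model` and `evaluate_model` are placeholders for a real modeling program and a real evaluation method:

```python
# Build a model for every alternative alignment and keep the one preferred
# by the model evaluation; the callables are placeholders for real tools.
def select_best_model(alignments, build_model, evaluate_model):
    best_model, best_score = None, float("-inf")
    for alignment in alignments:
        model = build_model(alignment)   # e.g. run a comparative modeling program
        score = evaluate_model(model)    # e.g. a structural evaluation score
        if score > best_score:
            best_model, best_score = model, score
    return best_model

# Invented stand-ins: "models" are numbers and the evaluator prefers
# values close to 1.0.
best = select_best_model([0.2, 0.9, 0.5],
                         build_model=lambda a: a,
                         evaluate_model=lambda m: -abs(m - 1.0))
print(best)  # prints 0.9
```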

3.2.3 Model building

Given a target-template alignment there exists a variety of methods that can build a 3D model for the target protein. In principle they can be grouped into three groups: rigid-body assembly (Browne et al., 1969; Blundell et al., 1987; Greer, 1990; Koehl & Delarue, 1994a; Koehl & Delarue, 1995; Bates et al., 2001; Petrey et al., 2003; Schwede et al., 2004), segment matching (Jones & Thirup, 1986; Claessens et al., 1989; Unger et al., 1989; Levitt, 1992) and modeling by satisfaction of spatial restraints (Havel & Snow, 1991; Sali & Blundell, 1993; Srinivasan et al., 1993; Aszodi & Taylor, 1996). The accuracy of the different modeling approaches is similar when used optimally (Marti-Renom et al., 2000). Other factors, like template selection and alignment accuracy, usually have a larger impact on model quality than the choice of modeling method. Still, there are differences between the models produced by the different methods (see Paper V).

Modeling by rigid-body assembly. The first modeling programs were based on rigid-body assembly methods. This is still a widely used approach, e.g. in the following programs: swiss-model (Schwede et al., 2004), nest (Petrey et al., 2003), 3d-jigsaw (Bates et al., 2001) and Builder (Koehl & Delarue, 1994a; Koehl & Delarue, 1995). The model is assembled from a small number of rigid bodies obtained from the core of the aligned regions. The assembly starts with building a framework from the coordinates of the aligned Cα atoms, and involves fitting the rigid bodies onto the framework and rebuilding the non-conserved parts, i.e. loops and side-chains. The main difference between the rigid-body assembly programs lies in how side-chains and loops are built. nest uses a stepwise approach, changing one evolutionary event from the template at a time, while 3d-jigsaw and Builder use mean-field minimization methods (Koehl & Delarue, 1996).

Modeling by segment matching. The basis for modeling by segment matching is the fact that most hexamers (oligopeptides of six amino acid residues) occurring in protein structures can be clustered into 100 structural classes (Unger et al., 1989). This means that a 3D model can be constructed by using a subset of atomic positions derived from the target-template alignment as a guide to find matching segments in a representative database of all known protein structures (Jones & Thirup, 1986; Claessens et al., 1989; Levitt, 1992). This approach can construct both main-chain and side-chain atoms and can also model insertions and deletions. The idea is implemented in the program SegMod (Levitt, 1992).

Modeling by satisfaction of spatial restraints. Methods in this class derive restraints from the target-template alignment. This is usually done by assuming that the corresponding distances and angles between aligned residues in the template and the target structure are similar. These restraints are then supplemented by stereochemical restraints on bond lengths, bond angles, dihedral angles and non-bonded atom-atom contacts using a molecular mechanics force field. The model is then obtained by minimizing the violations of all restraints. One of the most frequently used modeling programs, Modeller (Sali & Blundell, 1993), uses this approach. A further benefit of this approach is that it is easy to add restraints derived from a number of different sources to the restraints derived from the alignment, for instance predicted secondary structure or restraints obtained from NMR experiments or fluorescence spectroscopy. In this way, the 3D models can be improved by making use of available experimental data.
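The idea of obtaining a model by minimizing restraint violations can be illustrated with a one-dimensional toy; the three "atoms", the distance restraints and the plain gradient descent below are invented stand-ins for what a real program such as Modeller does in 3D with many more restraint types:

```python
# One-dimensional toy of restraint satisfaction: place three points so
# their pairwise distances match "template-derived" restraints, by
# minimizing the summed squared violations with plain gradient descent.
restraints = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 2.0}  # target distances

def violation(x):
    return sum((abs(x[i] - x[j]) - d) ** 2 for (i, j), d in restraints.items())

def minimize(x, lr=0.02, steps=5000):
    for _ in range(steps):
        grad = [0.0] * len(x)
        for (i, j), d in restraints.items():
            diff = x[i] - x[j]
            dist = abs(diff) or 1e-9              # guard against division by zero
            g = 2.0 * (dist - d) * (diff / dist)  # d/dx_i of (|x_i - x_j| - d)^2
            grad[i] += g
            grad[j] -= g
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x

# Start from a distorted conformation; the distances converge to 1, 1 and 2.
model = minimize([0.0, 0.3, 0.6])
```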

3.2.4 Model evaluation

The quality of a predicted model determines the information that can be extracted from it. Thus, it is important to estimate the accuracy of a model before it is used. The model can be evaluated as a whole or in individual regions. There are many different kinds of errors that can occur in a protein model, including mistakes in side-chain packing, relatively small shifts and distortions in correctly aligned regions, errors in the unaligned regions (often loops), alignment errors and fold assignment mistakes. The first step in the evaluation process is usually to assess whether the protein model has the correct fold. This will be true if the correct template is chosen and if the target-template alignment is approximately correct. The score from the database search is usually a good measure for deciding whether the correct fold has been found or not; conserved key functional or structural residues in the target sequence can also be used. Once the fold has been established, a more detailed estimate of the model quality can be obtained from the similarity between the target and template sequences. The sequence identity between the target and template sequence is a fairly good measure of model quality if the identity is above 30%. If the sequence identity is below 30% it becomes unreliable as a quality measure, and in these cases model evaluation methods are especially useful. A basic requirement for a model is to have good stereochemistry. This is usually not a problem, not even for totally incorrect models, since most modeling programs use a force field to enforce proper stereochemistry.
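As an illustration of sequence identity as a quality measure, percent identity over the non-gapped columns of an alignment can be computed as follows (the sequences are the example target-template alignment from Figure 2):

```python
def percent_identity(aln_a, aln_b):
    """Percent identity over the non-gapped columns of two aligned sequences."""
    assert len(aln_a) == len(aln_b)
    pairs = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    identical = sum(a == b for a, b in pairs)
    return 100.0 * identical / len(pairs)

# The example target-template alignment from Figure 2.
target   = "SCDKLLDDELDDDIACAKKILAIKGID"
template = "SCDKFLDDDITDDIMCAKKILDIKGID"
print(round(percent_identity(target, template), 1))  # prints 77.8
```

At such high identity the model can be expected to be quite accurate; it is in the below-30% regime that dedicated model evaluation methods become essential.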

Distributions of structural features from known protein structures can also be used. Large deviations from the most likely values are usually a strong indicator of errors in the model. Such features include packing (Gregoret & Cohen, 1991), formation of a hydrophobic core (Bryant & Amzel, 1987), residue and atomic solvent accessibility (Koehl & Delarue, 1994b), spatial distribution of charged groups (Bryant & Lawrence, 1991), distribution of atom-atom contacts (Colovos & Yeates, 1993) and main-chain hydrogen bonding (Laskowski et al., 1993). There are also methods that combine the different criteria listed above. ProsaII (Sippl, 1993) uses the probability of finding two residues at a specific distance from each other. Verify3d (Bowie et al., 1991; Lüthy et al., 1992) uses 3D profiles derived from the local environment around each residue. There has been some progress in the more physics-based approaches to model evaluation (Lazaridis & Karplus, 1999; Lazaridis & Karplus, 2000). It has recently been shown that, using the generalized Born solvation model (Still et al., 1990) as a description of solvent effects, physical energy functions can be used to identify the native-like conformation (Felts et al., 2002; Dominy & Brooks, 2002). However, for model evaluation purposes they have not reached the same accuracy as the methods based on statistics from known protein structures.

3.3 Meta prediction

The increasing variety of protein structure prediction methods has led to a series of different meta-predictors or consensus predictors. These methods do not make new predictions; instead they combine information from different methods in order to increase performance. The basic idea is that if several independent methods produce similar models, it is a strong indication that the models are correct. The first consensus predictor was Pcons (Lundström et al., 2001); since then a number of other predictors have been developed, e.g. 3d-shotgun (Fischer, 2003) and 3d-jury (Ginalski et al., 2003). In a way the meta-predictors try to mimic the work of an expert that looks at many different alternative models before making the final selection, see Figure 3. By using this approach around 10% more correct predictions are generated, but perhaps even more important is the increased specificity. Meta-prediction

is also an excellent way to evaluate different alternative models, and in combination with structural evaluation the performance can be increased even further (Paper I and II).

[Figure 3: Meta-predictors mimic the work of an expert that looks at many alternative models (here, ranked hits from the servers FFAS03, ORFEUS and PDB-BLAST) before making the final selection. Two examples are illustrated, one easy and one hard; in both cases the meta-predictor makes the correct choice. In the easy case it does not really matter which model is selected, since all of them are more or less correct. In the hard case, on the other hand, most of the selections would be incorrect. It is also here that the meta-servers have proven to be most useful.]
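The consensus idea can be sketched as follows: score each candidate model by its average similarity to all the others, so that models many methods agree on rank highest. The one-dimensional "models" and the similarity function below are invented stand-ins for real models compared with a structural similarity score:

```python
# Score each candidate model by its average similarity to all the others;
# models that many independent methods agree on get high consensus scores.
def consensus_scores(models, similarity):
    scores = {}
    for name, model in models.items():
        others = [m for other, m in models.items() if other != name]
        scores[name] = sum(similarity(model, m) for m in others) / len(others)
    return scores

# Invented stand-ins: "models" are numbers, and similarity decays with
# distance; a real meta-predictor would compare 3D structures instead.
models = {"server_a": 1.0, "server_b": 1.1, "server_c": 0.9, "outlier": 5.0}
scores = consensus_scores(models, lambda a, b: 1.0 / (1.0 + abs(a - b)))
best = max(scores, key=scores.get)
print(best)  # prints server_a
```

The outlier prediction receives a low consensus score even though nothing about it individually flags it as wrong, which is exactly why agreement between independent methods improves specificity.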

4 Techniques and databases

In this section, the main techniques and resources that were used in the development of the predictors in Paper I-IV will be described. This involves a description of the protein classification database SCOP, a description of neural networks, as well as a description of the different types of benchmarks that can be used to assess the results.

4.1 SCOP – Structural Classification Of Proteins

The aim of the SCOP database (Murzin et al., 1995) is to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins in the Protein Data Bank (PDB). The quality of the database is very high, since the classification is constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality. On the negative side, it is not very frequently updated. Each protein is classified on four hierarchical levels: family, superfamily, fold and class. Proteins within a family have a clear common evolutionary relationship; they either have residue identities of 30% and greater, or lower sequence identity but very similar structure and function. That two families share the same superfamily means that the proteins, based on their structural and functional features, most likely have a common evolutionary origin. Proteins with the same fold have the same major secondary structures in the same arrangement and with the same topological connections. Different folds are grouped into seven pre-defined classes, such as all-α helices, all-β sheets, or α and β. The latest release of the database (release 1.69, July 2005) contains 70,859 protein domains from 25,973 PDB entries grouped into 2,845 families, 1,539 superfamilies and 945 folds.

4.2 Neural networks

Here, the machine learning technique that was used in the development of the predictors in Paper I-IV will be described. Neural networks are a standard technique in pattern recognition and classification that has been used for many years in a wide variety of applications, ranging from investment analysis to engine management and medical phenomena. Neural networks are inspired by how the brain is built up of interconnected neurons, even though the number of neurons and connections is much smaller than in the brain. The most common type of neural network is the feed-forward network with sigmoidal nodes and error back-propagation for training. The feed-forward network provides a general framework for representing non-linear functional mappings from a set of input variables to a set of output variables. In biology, neural networks have been applied successfully to the prediction of secondary structure (Qian & Sejnowski, 1988; Rost et al., 1994; Jones, 1999b), subcellular localization (Emanuelsson et al., 1999), the recognition of post-translational modifications (Blom et al., 1999) and many more. Below, the feed-forward networks will be introduced in more detail; for a more thorough introduction see (Bishop, 1995).

4.2.1 Input data encoding

A very important issue when using neural networks is how to encode the input data. The input data should capture general features and should also be normalized, allowing for comparison between different examples. In our case we have performed prediction on protein structure models. The models consist in principle of a list of 3D coordinates for all atoms in the protein structure. There are numerous ways to translate a list of coordinates for atoms (with different properties) to numbers that the neural networks can use. On top of this, since our goal is to decide the quality of a protein model, there will always be a question of what to consider as correct and what to consider as incorrect. A number of protein quality measures have been developed throughout the years; for a review see (Cristobal et al., 2001). All of them compare the model to the correct structure, and if the model and the correct structure are similar, the quality should be high. This is also true for most measures, when the model and the correct structure are similar enough. However, most disagreements occur when the model and the correct structure are not so similar, but still contain similar “features”. Large-scale benchmarks like LiveBench (Bujnicki et al., 2001a; Bujnicki et al., 2001b; Rychlewski et al., 2003) and CASP (Moult et al., 2001) (see Benchmark section) have pushed the development of measures to a set that most researchers in the field can agree on. The focus of this thesis has not been the development of protein quality measures. Instead we have used established measures such as MaxSub (Siew et al., 2000) and LGscore (Cristobal et al., 2001), and focused on the encoding of the input parameters.

Figure 4: Input parameters. The protein model is described by structural features that can be calculated based on the positions of the atoms. These features include atom-atom contacts, residue-residue contacts, surface accessibility and correspondence with predicted secondary structure.

As stated above, the encoding of the input parameters can be done in many different ways. We have focused on simple features, such as contacts between different types of atoms or residues and where different types of amino acids are situated in the protein, on the surface or in the interior (Figure 4). For more details see Paper I and III.
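As an illustration of this kind of encoding, a normalized atom-atom contact count can be computed directly from the coordinate list. This is only a minimal sketch: the atom typing, distance cutoff and normalization used in Paper I and III are more elaborate than shown here.

```python
import math

def contact_features(atoms, cutoff=4.0):
    """atoms: list of (atom_type, (x, y, z)).  Count contacts between
    pairs of atom types within a distance cutoff, then normalize by the
    total number of contacts so that models of different size can be
    compared -- the kind of simple, normalized feature fed to the
    neural networks."""
    counts, total = {}, 0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            (t1, p1), (t2, p2) = atoms[i], atoms[j]
            if math.dist(p1, p2) <= cutoff:
                key = tuple(sorted((t1, t2)))
                counts[key] = counts.get(key, 0) + 1
                total += 1
    return {k: v / total for k, v in counts.items()} if total else {}

# Toy coordinate list: only the C-N pair lies within the cutoff.
atoms = [("C", (0.0, 0.0, 0.0)), ("N", (1.3, 0.0, 0.0)), ("O", (10.0, 0.0, 0.0))]
features = contact_features(atoms)
```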

4.2.2 Network architecture

All neural networks consist of nodes (neurons) that are connected to each other through synapses with different weights (Figure 5). The artificial neuron works in a similar way as a neuron in the brain: it calculates a weighted sum of the inputs connected to it and outputs a value, z_j, which depends on the sum.

Figure 5: A neuron.

The different weights on the synapses, w_ij, correspond to the fact that biological synapses also have different impact on the post-synaptic neuron (excitatory and inhibitory synapses). The weighted sum is processed through an activation function, g, before output. This function can be of many different types, e.g. linear, threshold or sigmoidal. The logistic function, belonging to the sigmoidal type, is perhaps the most common; it is defined as:

g(a_j) = z_j = 1 / (1 + exp(−a_j))    (1)

where a_j is the weighted sum

a_j = Σ_i w_ij y_i    (2)

Note that the output of the logistic function lies in the range (0,1) and that the function is both continuous and differentiable. All these properties are useful in the derivation of the training algorithm. For the prediction of real values (in contrast to classification) it is quite common to use a linear activation function at the output node, so as not to limit the range of possible outputs. The neurons in a feed-forward network are organized in layers; there are three types of layers: input, hidden and output (Figure 6). The network input is presented to the input layer, processed through the synapses and the activation function at the hidden layer, before it reaches the output neuron(s). There are no feedback loops or connections between neurons in the same layer; there is only one direction of data flow, and this is also the reason why the network is called feed-forward. Further, this also ensures that the network outputs can be calculated as explicit functions of the inputs and the weights. Unfortunately, it is not possible to decide which network architecture is optimal for a given problem. Thus, to achieve optimal performance, many different networks with different architectures need to be trained and optimized.

Figure 6: A feed-forward neural network. The input data is presented to the input nodes in the input layer, processed through the synapses and the hidden nodes in the hidden layer, to the output node in the output layer.
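The data flow just described can be written out directly: the weighted sum and logistic activation correspond to equations (2) and (1) above. A minimal sketch of a one-hidden-layer network with a linear output node, the architecture used throughout this work:

```python
import math

def forward(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass through a feed-forward network with one hidden
    layer: logistic hidden nodes, linear output node.  Because there
    are no feedback loops, the output is an explicit function of the
    inputs and the weights."""
    hidden = []
    for wj, bj in zip(W_hidden, b_hidden):
        a = sum(w * xi for w, xi in zip(wj, x)) + bj   # weighted sum, eq. (2)
        hidden.append(1.0 / (1.0 + math.exp(-a)))      # logistic activation, eq. (1)
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out
```

With all hidden weights zero, each hidden node outputs g(0) = 0.5, so the network output is simply the weighted sum of 0.5s plus the output bias.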

4.2.3 Neural network training

The training of a neural network involves an iterative procedure of minimization of an error function, with adjustments of the weights being made in a sequence of steps. The first of these steps is to initialize all weights to random values. Then, the input data of the training examples are presented at the input nodes and fed forward in the network to the output node, where the result is compared to the desired output using an error function. Based on the difference between the actual and the desired output, the weights are adjusted to capture the main features in the data. The algorithm for adjusting the weights is called the error back-propagation algorithm (Rumelhart et al., 1986); the name comes from the fact that the errors are propagated backwards in the network from the output, enabling adjustment also of the weights at the input nodes.

The error signal, e_j, at node j is

e_j = z_j − d_j    (3)

where d_j is the desired and z_j the actual output at the node. There exists a wide range of error functions, E, such as the sum-of-squares, city-block and cross-entropy error functions. The simplest and most commonly used is the sum-of-squares, defined as

E = (1/2) Σ_j e_j²    (4)

where j runs over all output nodes. The error function is a function of all the weights in the network; the idea is to impose a change Δw_ij on the weights that is proportional and opposite to the gradient of the error function:

Δw_ij = −η ∂E/∂w_ij    (5)

where η is the learning rate, controlling how fast the weight adjustment should be. Using the chain rule

∂E/∂w_ij = (∂E/∂a_j)(∂a_j/∂w_ij) = δ_j ∂a_j/∂w_ij    (6)

where δ_j denotes the local gradient associated with neuron j. From equation (2) we get

∂a_j/∂w_ij = y_i,  which inserted in eq. (5) gives  Δw_ij = −η δ_j y_i    (7)

This equation states that in order to get the weight adjustment for w_ij (the weight connecting node i with node j), δ_j should be multiplied by y_i (the value at the input end of the weight).
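For a single linear neuron, z = Σ_i w_i y_i, the local gradient is simply the error e (since z = a), and eq. (7) reduces to the classic delta rule w_i ← w_i − η e y_i. A minimal sketch with made-up training data that happens to follow z = 2y₁ − y₂:

```python
def train_neuron(samples, n_weights, eta=0.1, cycles=200):
    """Delta rule for one linear neuron: repeatedly apply
    w_i <- w_i - eta * e * y_i (eq. 7 with delta = e), where
    e = z - d is the error signal of eq. (3)."""
    w = [0.0] * n_weights
    for _ in range(cycles):
        for y, d in samples:                     # on-line learning
            z = sum(wi * yi for wi, yi in zip(w, y))
            e = z - d                            # eq. (3)
            w = [wi - eta * e * yi for wi, yi in zip(w, y)]
    return w

# Invented, self-consistent data generated by z = 2*y1 - y2; the
# weights converge towards [2, -1].
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = train_neuron(samples, 2)
```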

For the output nodes the evaluation of δ_j is straightforward:

δ_j^output = g′(a_j) ∂E/∂z_j = g′(a_j) e_j    (8)

and if j is a neuron in the hidden layer

δ_j^hidden = g′(a_j) Σ_k δ_k w_kj    (9)

where k runs over all output nodes. A presentation of all the input patterns once is called a training cycle. The weight adjustment is either done after the presentation of each pattern (on-line learning) or after first summing the derivatives over all the patterns in the training set (batch learning). The weights are updated in order to give minimal errors when the input training data is presented to the network. However, in order for the network to work on real problems, all different types of examples must be present in the training set; e.g. in the case of structure prediction, both good and bad models need to be present in the training data, not only good models or only native structures. The ultimate aim of neural network training is to capture general features in the data that can be used to make predictions on unseen data, i.e. the network should be able to generalize. To achieve this goal it is important not to over-train the network. Over-training occurs when the network is trained to such an extent that it performs very well on the training data, but rather poorly on a new test set that is not part of the training data, see Figure 7. To avoid over-training, the training must be stopped at an appropriate

training cycle before it occurs. This is usually done by monitoring the error on the test set during training and stopping the training before the error on the test set starts to increase. If possible, a separate test set should then be used to assess the true generalization ability of the network.

Figure 7: Over-training. The error on the training set continues to decrease over all 1,000 cycles, while the error on the test set starts to increase from 250 cycles. This means that the generalization ability of the network deteriorates.
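Putting equations (1)-(9) together, one on-line training step for a network with a single logistic hidden layer and a linear output node can be sketched as follows. The weights and the training example are made up for illustration; a full training run simply repeats this step over all patterns for many cycles while monitoring the test-set error as in Figure 7.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def predict(x, W, b, v, c):
    """Forward pass (eqs 1-2): logistic hidden layer, linear output."""
    y = [logistic(sum(Wji * xi for Wji, xi in zip(Wj, x)) + bj)
         for Wj, bj in zip(W, b)]
    return sum(vj * yj for vj, yj in zip(v, y)) + c, y

def backprop_step(x, d, W, b, v, c, eta=0.5):
    """One on-line error back-propagation update (eqs 3-9)."""
    z, y = predict(x, W, b, v, c)
    e = z - d                                     # eq. (3)
    delta_out = e                                 # eq. (8): linear output, g' = 1
    # eq. (9): g'(a) = y*(1-y) for the logistic function
    delta_h = [yj * (1 - yj) * delta_out * vj for yj, vj in zip(y, v)]
    v = [vj - eta * delta_out * yj for vj, yj in zip(v, y)]            # eq. (7)
    c = c - eta * delta_out
    W = [[Wji - eta * dj * xi for Wji, xi in zip(Wj, x)]
         for Wj, dj in zip(W, delta_h)]
    b = [bj - eta * dj for bj, dj in zip(b, delta_h)]
    return W, b, v, c

# One step on a made-up example reduces the error on that example.
x, d = [1.0, 0.5], 1.0
W, b, v, c = [[0.1, -0.2], [0.3, 0.1]], [0.0, 0.0], [0.2, -0.1], 0.0
before = predict(x, W, b, v, c)[0]
W, b, v, c = backprop_step(x, d, W, b, v, c)
after = predict(x, W, b, v, c)[0]
```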

4.2.4 Cross-validation

To check the performance and generalization ability of the network, cross-validation is usually performed. In cross-validation the total amount of training data is split into n parts (typically n ∈ [5, 10]). The parts should not be too similar; thus, two similar examples should be placed in the same part. The training is then performed on n − 1 parts, and the remaining part is used as a test set to check the performance. This is repeated n times, using a different part as test set each time, to minimize the randomness. The cross-validation process allows efficient use of all training examples, while at the same time giving some confidence that the neural network is able to generalize, provided that the performances on the different cross-validation sets are similar.
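The splitting scheme can be sketched as follows. The round-robin partition below is only for illustration; in practice similar examples (e.g. all models of the same protein) must first be grouped so that they end up in the same part.

```python
def cross_validation_splits(examples, n=5):
    """Split the examples into n parts and yield (train, test) pairs,
    each part serving as the test set exactly once."""
    parts = [examples[i::n] for i in range(n)]
    for i in range(n):
        train = [x for j, part in enumerate(parts) if j != i for x in part]
        yield train, parts[i]

data = list(range(10))
splits = list(cross_validation_splits(data, 5))
```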

4.2.5 Other issues when using neural networks

Neural networks are a powerful technique for providing a non-linear functional mapping from a set of input variables to a set of output variables, which is illustrated by the large number of successful applications. However, there are in particular two kinds of problems that have to be dealt with when using neural networks. First, a neural network consists of a large number of parameters (sometimes on the order of several thousands), and in order not to over-fit the parameters on the training data, the number of training examples should be of the same order of magnitude as the number of parameters. Thus, a sufficient amount of data is needed. Second, even though the network architecture and parameter updating are well defined in a mathematical way, there is no obvious way of extracting knowledge about what features are important for the neural network prediction. This makes it hard to understand which features are important, and it also limits the biological usefulness of neural networks in the sense of finding exactly which structural features are important for good models. However, neural networks are still useful for finding which features are important in a more general sense, and also for finding out if certain features are orthogonal to each other, i.e. if the prediction ability increases when different types of features are used in training.

4.2.6 The neural networks in this study

In this study we have used Matlab and the neural network package netlab (Bishop, 1995) to perform all network training. In all cases feed-forward neural networks were used, together with five-fold cross-validation and the error back-propagation algorithm for training, logistic hidden nodes and a linear output node. Only one layer of hidden nodes was used. The most important part of neural network training is the input data. The input data should capture general features and should also allow for comparisons between different examples. We have used features that can be calculated from a protein structure model, i.e. based on the 3D coordinates of all the different atoms in the protein (Figure 4). We have also performed dimensionality reduction by classifying the 20 amino acids into six different groups (Mirny & Shakhnovich, 1999).
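Such a reduction shrinks, for example, a 20x20 residue-residue contact matrix to 6x6. The partition below is a plausible six-class grouping by physico-chemical character, shown only for illustration; the exact partition of Mirny & Shakhnovich (1999) used in the papers may differ.

```python
# Hypothetical six-class reduction of the 20 amino acids (illustrative,
# not necessarily the exact grouping used in Paper I and III).
GROUPS = {
    "aliphatic": "AVLIMC",
    "aromatic":  "FWYH",
    "polar":     "STNQ",
    "positive":  "KR",
    "negative":  "DE",
    "special":   "GP",
}
AA_TO_GROUP = {aa: g for g, members in GROUPS.items() for aa in members}

def reduce_sequence(seq):
    """Map a one-letter amino acid sequence onto the reduced alphabet."""
    return [AA_TO_GROUP[aa] for aa in seq]
```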

4.3 Benchmarks

Once a predictor has been developed, it is important to benchmark its performance against other methods. It is extremely important that the data used for benchmarking has not been used for optimizing the parameters of the predictor. In this section, benchmark techniques will be described. This involves a description of benchmarks based on the structural database SCOP, a description of CASP and CAFASP, and finally the continuous benchmarking done through the LiveBench project.

4.3.1 SCOP based

Benchmarks based on SCOP are used frequently for the evaluation of the recognition of correct structures based on sequence similarity. As described above, SCOP is a hierarchical database with three main levels: fold, superfamily and family. Two structures related on the family level are more closely related (more similar) than two structures related on the fold level. This also means that two sequences related on the family level are more easily recognized than two sequences related on the fold level. The first step in this type of benchmark is to do an all-against-all SCOP search with the method to be benchmarked. For each sequence all hits are collected, and the result is typically presented in specificity-sensitivity plots, where the specificity and sensitivity are plotted against each other for all possible cutoffs. The specificity is defined as:

spec = TP / (TP + FP)    (10)

where TP is the number of True Positives and FP the number of False Positives, i.e. the specificity is the fraction of positive hits defined by the method that are also correct. The sensitivity is defined as:

sens = TP / (TP + FN)    (11)

where TP is the number of True Positives and FN the number of False Negatives, i.e. the sensitivity is the fraction of all possible positive hits that are found. The specificity and sensitivity are two competing properties: it is always possible to find all TP (100% sensitivity), but the specificity might then be low (a high number of FP). An example of a specificity-sensitivity plot is seen in Figure 8. The closer the method is to the upper right hand corner (1,1), the better. The figure contains three sets of curves, one for each level of the SCOP database. For the family curves all hits are used, for the superfamily curves hits to the same family are ignored, and for the fold curves hits to the same superfamily and family are ignored. In this way the detection ability at different difficulty levels can be assessed. As expected, it is also clearly seen that detecting sequences related on the fold level is much more difficult than detecting sequences related on the family level.
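Equations (10) and (11) translate directly into code; sweeping the cutoff over all scores produces the points of a specificity-sensitivity plot. The scores and labels below are invented for illustration:

```python
def specificity_sensitivity(scores, labels, cutoff):
    """Specificity and sensitivity (eqs 10-11) at one score cutoff:
    everything scoring at or above the cutoff is called positive."""
    tp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and l)
    fp = sum(1 for s, l in zip(scores, labels) if s >= cutoff and not l)
    fn = sum(1 for s, l in zip(scores, labels) if s < cutoff and l)
    spec = tp / (tp + fp) if tp + fp else 1.0
    sens = tp / (tp + fn) if tp + fn else 0.0
    return spec, sens

# Invented method scores; True = hit to a correctly related protein.
scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, True, False, True]
curve = [specificity_sensitivity(scores, labels, t) for t in sorted(scores)]
```

Note the trade-off: lowering the cutoff raises the sensitivity but can only lower the specificity.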

4.3.2 CASP

CASP (Critical Assessment of Structure Prediction) is a large-scale community experiment, conducted every two years since 1994 (Moult et al., 2003). The key feature is that the participants make blind predictions of structures. Since the predictions are blind, no one can cheat (hopefully), and for the same reason methods cannot be optimized to do well in CASP. The participants get the sequences of protein structures that are about to be determined by X-ray crystallography or NMR spectroscopy. The experiment is conducted over the summer; the results are collected, and the most successful groups present their work at a conference in December the same year. Since the structures are not known before the predictions are made, the predictions are completely blind. The experiment has been a success in the predictor community, each time

Figure 8: Example of a specificity-sensitivity plot, from (Wallner et al., 2004).

attracting more and more participants and generating more and more predictions (Table 1). In contrast, the target collection, which relies on the generosity of experimentalists, has been fairly constant. Even though the number of participating groups has increased over the years, the largest increase is in the number of submitted models. This is mostly due to the development of automatic methods that routinely submit predictions for every target (see CAFASP). Predictors fall into two categories: teams of participants who devote considerable time and effort to modeling each target, usually having a period of several weeks to complete their work; and automatic servers, which must return a model within 48 hours, without human intervention. The predictions are evaluated using numerical criteria and by manual, independent assessors. There has been progress in the protein structure prediction field over the CASP years; for a recent review see (Moult, 2005). A couple of

CASP number   Year   No. targets   No. predictors   Total no. predictions
CASP1         1994       33             35                    135
CASP2         1996       42             72                    947
CASP3         1998       43             98                  3,807
CASP4         2000       43            163                 11,136
CASP5         2002       67            215                 28,728
CASP6         2004       64            208                 41,283

Table 1: The CASP experiments in numbers. Note the rapid increase in the total number of predictions.

highlights include the successful use of PSI-BLAST (Altschul et al., 1997) in CASP3, the performance of the Baker group in ab initio prediction in CASP4 (Bonneau et al., 2001), and the meta-servers (Paper II) in CASP5. However, one major bottleneck is still the refinement problem, i.e. improving the models beyond a simple backbone copy of the template. At CASP6, for the first time, there was a report of an initial model of a small protein (target 281) refined from a backbone RMSD of about 2.2 Å to about 1.6 Å, with many of the core side-chains correctly oriented. Still, this is for a small protein, and it remains to be seen if it can be scaled to larger structures in the future.

4.3.3 CAFASP

CAFASP (Critical Assessment of Fully Automated Structure Prediction) is the automatic counterpart of CASP (Fischer et al., 2003). The prediction in CAFASP follows the same scheme as in CASP and is also run in parallel, with one important difference: no human intervention is allowed. The goal is to evaluate the performance of fully automatic structure prediction servers. Like the predictions, the assessment is also fully automated, allowing fast evaluation of many servers. Another important difference from CASP is that all predictions have a score (from the server). This makes it possible to assess the reliability of the different methods.

4.3.4 LiveBench

The LiveBench project is a continuous benchmarking program (Bujnicki et al., 2001a; Bujnicki et al., 2001b; Rychlewski et al., 2003). The basic scheme is as follows: every week, new PDB proteins are submitted to the participating protein structure prediction servers, and the results are collected and evaluated using automated model assessment programs. This provides a potential user with an excellent tool for evaluating the confidence of obtained models, and from the server perspective, the server gets an independent, unbiased benchmark. The main advantages of LiveBench compared to CASP are the fast evaluation cycle and the larger number of targets. The assessment is conducted a week after the targets are released. Each week provides around 5 new targets, which can be used to check differences in the performance of the servers. The LiveBench project has been run for almost six years with a steadily increasing number of participating servers, with the exception of set 9 (Figure 9). However, it should also be stated that many of the “new” servers are new versions of an existing server, with the previous version kept as reference.

5 Present investigation

The work presented in this thesis aims at the development of methods that improve the current protein structure prediction methods. In the following, the methods and the results from the five papers included in this thesis (Paper I-V) will be summarized. The presentation will closely follow the development of the methods in relation to the biennial CASP meetings (see above). It has been clear since the first CASPs that human experts do better than automatic methods. However, at CASP4 the best servers started to compete with the best humans. Only 11 human groups were ranked above the best server (3D-PSSM). Furthermore, at rank 7 was the semi-automated method named CAFASP-CONSENSUS, which used the results from other servers to do the prediction (Fischer et al.,

2001). The CAFASP-CONSENSUS idea was then fully automated in Pcons (Lundström et al., 2001), which was the first consensus or meta-predictor. By using the results from six different servers, Pcons generated approximately 8-10% more correct predictions, with a significantly higher specificity. Thus, at that time, Pcons was the best automatic method for protein structure prediction. And this is where our story begins.

Figure 9: The number of participating servers in the LiveBench project.

5.1 Development of a method that identifies correct protein models (Paper I)

The main strength of the consensus analysis, as used in Pcons, is coupled to the structural similarity between the different models returned from the servers. However, there are of course other factors that can be used in the selection process as well, e.g. the score from the server

or an evaluation of the protein model using an energy function. Most energy functions to date seemed to be pretty good at picking the native structure from a set of misfolded decoys. However, protein models are usually relatively far from the native structure, and even a quite good model can contain extended regions and unfavorable interactions to which an energy function may be too sensitive. The aim of this project was to develop an energy function, ProQ, that would be able to handle the difficulties involved in finding the best possible protein model. We decided to use a neural network approach, as it would allow us to test many different types of input in a fairly straightforward manner. An important difference from earlier studies is that we used a more relaxed, and in our minds more realistic, definition of native-like models. We defined “correct” models in a similar way as used in CASP (Moult et al., 2001; Sippl et al., 2001), CAFASP (Fischer et al., 2001) and LiveBench (Bujnicki et al., 2001a; Bujnicki et al., 2001b), i.e. by finding similar fragments between the native structure and a model. The training set consisted of all-atom models generated from a large set of different alignment methods; in total 11,108 protein models were used. From these models a set of structural features, such as atom-atom contacts, residue-residue contacts, surface accessibility and secondary structure information, were calculated, and neural networks were trained to predict the correctness of the model using these features. By combining all the structural features it was possible to increase the correlation coefficient between the predicted and the actual model quality from 0.4 to 0.75. It was quite clear that the combination of different types of structural features was the main reason for the improvement. This can be explained by the fact that incorrect protein models might have many good, i.e. native-like, features, whereas correct protein models might have many bad, i.e. native-unlike, features. The best way to distinguish between these models is to use a combination of measures. The performance of ProQ was compared to the performance of existing methods for evaluating protein model quality. We found that ProQ was at least as good as other methods at identifying the native structure, and better at the detection of correct models. This performance was also maintained over several different test sets and

quality measures. Finally, we combined ProQ with Pcons in a predictor called Pmodeller. This predictor built all-atom models using Modeller (Sali & Blundell, 1993) for all Pcons alignments. ProQ was then applied to assess the quality of the models, and all models were given a combined score based both on the consensus analysis using Pcons and on the structural evaluation using ProQ. This combined Pmodeller score was shown to perform better than Pcons alone. Later studies (e.g. LiveBench and CASP) have confirmed this result, indicating that the structural evaluation using ProQ indeed is helpful in the selection of good models.

5.2 Pcons, ProQ and Pmodeller at CASP5 / CAFASP3 (Paper II)

The great exam was coming up – CASP season! According to our benchmarks, Pmodeller should do pretty well in CASP. However, being good in your own benchmark is one thing; being good in CASP is another story. CASP is the ultimate test. If you do well, no one can say that you have optimized your method to the test set. If you do badly, that is exactly what you will hear. Another nice thing about CASP is that it provides a unique opportunity to compare the performance of automatic servers (like Pmodeller and Pcons) with the performance of manual experts, who might even use these methods. We participated in CASP with four servers: Pcons2, Pcons3, Pmodeller and Pmodeller3, where Pmodeller uses Pcons2 and Pmodeller3 uses Pcons3. The difference between the two Pcons versions is the servers they are using: Pcons2 uses seven servers, while Pcons3 uses only three (but better) servers (Table I, Paper I). So the questions were: Would Pmodeller do better than Pcons, i.e. does the quality assessment with ProQ really help? And how good are we in comparison to other servers and other manual groups? It turned out that Pmodeller did extremely well and was one of the top-ranked servers (Table IV, Paper II); it actually performed so well that we were selected to give a talk at the CASP5 meeting. The performance of Pmodeller was similar to all but a few manual experts.

Although a small group of experts still performed better, most of the experts participating in CASP5 actually performed worse, even though they had full access to all automatic predictions and to Pmodeller in particular. Pmodeller was consistently ranked higher than the corresponding Pcons version in CASP5 and in CAFASP3 (Fischer et al., 2003), and later also in LiveBench (Rychlewski et al., 2003). Further, it could also be shown that the homology modeling step in Pmodeller was not the reason for this improvement, since no examples of significantly better models were observed when Pmodeller and Pcons used the same alignment. Thus, the main reason for the difference in performance between Pmodeller and Pcons was the ProQ structural evaluation of the quality of the protein models. This procedure increased both the total number of correct predictions and the specificity of the predictions, by reducing the number of false positives. The main lesson from CASP5 was that by using a consensus server, like Pmodeller, it was possible to compete with most manual experts. However, some manual experts made significantly better predictions than the best consensus servers. One noticeable difference between Pmodeller and these best manual groups was that the total number of residues predicted was about 5% lower for Pmodeller, i.e. the models from Pmodeller were shorter. A possible reason for this is that Pmodeller was restricted to one alignment per model. During the development of Pmodeller we tried using multiple-template alignments to increase the target coverage, but without success. Another related problem is the division of the target sequence into domains, where most servers will only return the hit to the best domain, resulting in a model with only one domain. An iterative approach, where regions with no hits are resubmitted to the servers, has proven to be useful (Chivian et al., 2003).

5.3 Detecting correct residues with ProQres and ProQprof (Paper III)

After CASP5 it was clear that the best groups had used meta-servers in combination with additional measures to assess the quality of the final models. In particular, many groups used Verify3d (Bowie et al., 1991; Lüthy et al., 1992) to assess the quality of different parts of the models (Kosinski et al., 2003; von Grotthuss et al., 2003). The consensus idea has also been applied to find the parts of a model that are most reliable (Fischer, 2003). Both these approaches were then used as a basis to build a hybrid model using multiple templates. Given the success of the Pmodeller server, and in particular the ProQ module, we thought that it would be possible to use similar ideas to develop a local version of ProQ: ProQres. This version would, instead of predicting the quality of the whole model, predict the quality of single residues. Since the combination of different structural features made ProQ perform significantly better than Verify3d at detecting correct protein models, it should be possible to use different local structural features and perform better at the residue level as well. In addition, we also developed a local quality predictor that uses a window of profile scores to assess the quality of protein models built from an alignment: ProQprof. As for ProQ, we used neural networks. These were trained on models built from structural alignments between proteins related on the superfamily level according to SCOP, using five-fold cross-validation. The final predictors were then benchmarked using two additional test sets, one derived from LiveBench and one derived from SCOP. The structural features used in ProQres were identical to those used in ProQ, i.e. atom-atom contacts, residue-residue contacts, solvent accessibility and secondary structure information. However, in order to get a local quality prediction, the environment around each residue was described by calculating the different structural features for a sliding window and predicting the correctness of the central residue. In this way, each residue in the sequence is described by all the contacts made by all residues in the window.
ProQprof used a window of profile scores calculated from the alignment as the basis for training. These scores were calculated by applying the LogAver (von Öhsen & Zimmer, 2001) scoring function to equivalent positions (vectors) in the two profiles corresponding to the two sequences in the alignment.

Both ProQres and ProQprof were shown to be more accurate than existing methods. ProQprof performs better than ProQres for models based on distant relationships, while ProQres performs best for models based on closer relationships. We also showed that a combination of ProQres and ProQprof performs better than either alone. A possible explanation for this difference in performance depending on model quality is that poor models lack many native long-range interactions, which makes a structural evaluation less reliable, while a profile-profile comparison is still able to identify the correct positions. Models of higher quality, on the other hand, contain many more native long-range interactions, improving the structural evaluation.
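The profile-column scoring underlying the ProQprof input can be illustrated with a minimal sketch of a log-average-style score. The two-letter alphabet and odds matrix below are toy assumptions; the real LogAver score (von Öhsen & Zimmer, 2001) is defined over the 20 amino acids with substitution-derived odds:

```python
import math

# Minimal sketch of log-average-style profile-profile scoring: two profile
# columns p and q (probability distributions over the alphabet) are scored
# as the log of the expected substitution odds ratio, where
# odds[a][b] = P(a,b) / (f(a) * f(b)) is a precomputed odds matrix.

def log_average(p, q, odds):
    expected = 0.0
    for a, pa in enumerate(p):
        for b, qb in enumerate(q):
            expected += pa * qb * odds[a][b]
    return math.log(expected)

# Toy 2-letter alphabet: a matching pair is twice as likely as by chance.
odds = [[2.0, 0.5],
        [0.5, 2.0]]
print(log_average([1.0, 0.0], [1.0, 0.0], odds))  # log 2: conserved match
print(log_average([0.5, 0.5], [0.5, 0.5], odds))  # log 1.25: uninformative columns
```

In a ProQprof-style setting, a window of such column scores along the alignment would form the input to the neural network.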

5.4 Pcons5: the fifth generation of Pcons (Paper IV)

In this paper we describe the development of Pcons5, which differs significantly from the previous Pcons versions. Pcons5 is also the first available stand-alone version of Pcons, and it should be ideal for anyone who would like to set up a local meta-server. Pcons5 was developed with the goal of being easy to update with new servers. It consists of three modules: a consensus module, a quality assessment module and a score module. Each module returns two scores, and these scores are combined into a final score using a weighted sum. Pcons5 is not restricted to a fixed number of specific servers; both the consensus and the quality assessment modules can be run on any set of models. Only the score module is method specific. The consensus module returns the average similarity to all models and the average similarity to the first-ranked models from each server. The quality module assesses the quality of the models using a backbone version of ProQ. The score module returns two scores that measure the reliability of the server score, e.g. a BLAST E-value below 10^-10 would be given a high score and an E-value above 10^-1 a low score. In addition, we developed a Pmodeller5 version based on Pcons5 that builds all-atom models using Modeller (Sali & Blundell, 1993) and re-ranks the models using the original version of ProQ, which is significantly better than the backbone ProQ module used in Pcons5.
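The scoring scheme described above can be sketched as follows. This is a hypothetical illustration, with each module collapsed to a single number for brevity (the real server uses two scores per module); the similarity function, weights and values are made-up placeholders:

```python
# Hypothetical sketch of a Pcons5-style scoring scheme: a consensus score
# (average similarity of a model to the other models), a quality score and
# a server-specific score are combined with a weighted sum.

def consensus_score(model, all_models, similarity):
    """Average structural similarity of `model` to every other model."""
    others = [m for m in all_models if m != model]
    return sum(similarity(model, m) for m in others) / len(others)

def final_score(module_scores, weights):
    """Combine the per-module scores with a weighted sum."""
    return sum(w * s for w, s in zip(weights, module_scores))

# Toy set of three models with faked pairwise similarities.
sims = {("A", "B"): 0.8, ("A", "C"): 0.6, ("B", "C"): 0.7}
def sim(x, y):
    return sims.get((x, y)) or sims.get((y, x))

models = ["A", "B", "C"]
cons = consensus_score("A", models, sim)  # (0.8 + 0.6) / 2 = 0.7
quality = 0.5                             # stand-in for the ProQ module
server = 0.9                              # stand-in for the score module
print(final_score([cons, quality, server], [0.5, 0.3, 0.2]))  # ~0.68
```

Because only the last module depends on how a server reports its confidence, adding a new server in such a design means writing one small score translator rather than retraining the whole predictor.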

Pcons5 and Pmodeller5 participated in CASP6. For technical reasons we had to start the server manually, but the predictions were produced without human intervention. Thus, in the official CASP rankings we are listed as a manual group, while in fact it was an automatic server. This was also the reason why we were not selected to give a talk this time, even though Pmodeller was indeed one of the best servers. It also performed better than Pcons5 and the previous versions of Pmodeller and Pcons (according to the sum of the GDT_TS scores for the first-ranked models).
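The GDT_TS measure used for this ranking can be sketched as follows. This is a simplified illustration that assumes per-residue model-to-native distances after superposition are already available; the real GDT_TS maximizes each count over many superpositions:

```python
# Simplified sketch of GDT_TS: the average, over distance cutoffs of
# 1, 2, 4 and 8 Angstrom, of the percentage of residues whose model
# position lies within the cutoff of the corresponding native position.

def gdt_ts(distances):
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    n = len(distances)
    percents = [100.0 * sum(d <= c for d in distances) / n for c in cutoffs]
    return sum(percents) / len(percents)

print(gdt_ts([0.5, 1.5, 3.0, 9.0]))  # (25 + 50 + 75 + 75) / 4 = 56.25
```

Summing this score over the first-ranked model for every target gives a single number on which servers can be compared.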

5.5 Are all homology modeling programs equal? (Paper V)

In this study we presented the first large-scale benchmark of publicly available homology modeling programs, i.e. programs that construct the 3D coordinates of a protein model based on the alignment to a known protein structure. The following six programs were tested: Modeller, SegMod/encad, swiss-model, 3d-jigsaw, nest and Builder. The idea was simple: give each program the same input (alignment and template) and analyze the model it produced with respect to physicochemical criteria and structural similarity to the correct structure. However, making the comparison turned out not to be as simple and straightforward as we had expected, because some programs crashed for certain inputs and sometimes did not return coordinates for all residues in the sequence. Not returning any coordinates is probably worse than returning wrong coordinates, but not everyone will agree on this assumption. In any case, we chose to ignore these cases and only compared models, or rather residues, that existed in all models from all programs. In the following, the most important results from this study are presented.

From our study it was clear that the different modeling approaches give rise to differences in the final model. It is true that the alignment is crucial for the quality of the final model and that no model-building procedure can recover from an incorrect alignment. The effect of an alignment error, however, can depend strongly on the modeling procedure; e.g. a rigid-body assembly method will be much more sensitive to alignment errors than a method that derives distance restraints from the alignment (Figure 1, Paper V). It was also clear that no single modeling program performs best in all tests. All programs had their pros and cons (Table 4, Paper V), but three of them, Modeller, nest and SegMod/encad, performed better than the others. These programs were reliable and most often produced chemically correct models.

SegMod/encad performs very well in all tests except for backbone conformation; nest very rarely makes the models worse than the template, but the chemistry is not as good as for SegMod/encad; Modeller is in general good, except for a few examples of poor convergence and suboptimal side-chain positioning. swiss-model failed to produce models for 10% of all alignments (all the others failed for less than 1%), but performed quite well when it did produce a model. 3d-jigsaw and Builder had problems with modeling all residues, resulting in decreased overall performance.

Also, none of the programs built side-chains as well as a program specialized for this task, such as SCWRL, using the same information. There should therefore be room for improvement in this area.
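The restriction to residues present in all models, described at the start of this section, can be sketched as follows. The dict-of-coordinates representation is a toy stand-in for parsed PDB files:

```python
# Sketch of the evaluation restriction: only residues that received
# coordinates in every program's model are compared.  Each model is
# represented as a {residue_number: coordinates} dict.

def common_residues(models):
    """Sorted residue numbers present in all models."""
    return sorted(set.intersection(*(set(m) for m in models)))

model1 = {1: (0, 0, 0), 2: (1, 0, 0), 3: (2, 0, 0)}
model2 = {1: (0, 0, 0), 3: (2, 1, 0)}                    # residue 2 missing
model3 = {1: (0, 0, 0), 2: (1, 1, 0), 3: (2, 0, 1), 4: (3, 0, 0)}

print(common_residues([model1, model2, model3]))  # [1, 3]
```

All per-residue and whole-model statistics are then computed over this common subset, so that a program is not penalized twice, once for a missing residue and once in the structural comparison.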

6 Conclusions and reflections

By using a machine-learning approach we have shown that it is possible to predict the quality of protein models based on structural features (Papers I-III). We also showed that different structural features contain orthogonal information and that the combination of different structural features is important to distinguish between correct and incorrect protein models (Paper I). Further, we have improved the best meta-predictor, Pcons, by including a quality assessment step using our quality predictor, ProQ, in the Pmodeller series. The improvement from using ProQ has been confirmed both in LiveBench and in CASP, where Pmodeller has been one of the top-performing servers. Our latest predictors, ProQres and ProQprof, predict the local model quality based on local structural features and on evolutionary information, respectively. Using evolutionary information is clearly beneficial for detecting good regions in protein models, while structural information is more useful for detecting poor regions. These predictors might prove useful in the construction of hybrid models using more than one template.

We have also conducted the first large-scale benchmark of publicly available homology modeling programs. Here we showed that there are differences between the modeling programs. Three of them, Modeller, nest and SegMod/encad, performed better than the others.

Clearly, one of the major bottlenecks in the protein structure prediction field today is refinement, e.g. taking a 2 Å model down to 1 Å. There has been some progress in this area, by using protein evolution to restrict the refinement to the parts of the model that are most likely to undergo a change (Qian et al., 2004). However, even the restricted sampling requires accurate energy functions to select the final conformation. Most likely these energy functions will be a combination of different features taken both from physics and from evolution.

7 Acknowledgments

This is probably the most important part. I want to acknowledge all the people who contributed in one way or the other to the production of this thesis.

First of all, Arne Elofsson, my supervisor, for many encouraging discussions, great ideas and for being the best supervisor you could ever have.

My roommates at SBC, Erik and Åsa, for always arriving early and for providing a perfect excuse to discuss the important things in life.

Johannes for introducing me to the world of Hattrick and for his unbelievable stories that actually are true.

Tomas for explaining how to order "macchiato due" in Rome.

All other members of the constantly growing AE group: Diana Ekman, Oliva Eriksson, Kristoffer Illergård, Per Larsson and Håkan Viklund for being such lovely colleagues.

All other PhD students, post-docs, project students, sys admins, assistant, associate and full professors at SBC for creating a wonderful working environment.

The Swedish government for setting up the Research School in Genomics and Bioinformatics to pay for my salary.

Helge Ax:son Johnsons Foundation and Jan-Arthur Ekströms Minnesstiftelse for providing additional financial support.

Olov, Jakob and Magnus for your friendship - see you on the beach in Herta in October!

Magnus for our endless discussions at the gym, I even explained the problem of angles in three dimensions once.

My parents, for encouraging me to walk my own paths and pursue my own goals and for always helping out.

Susanna, the love of my life, thanks for always being by my side.

Stockholm, August 2005.

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J.. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25 (17):3389–3402.

Anfinsen, C. B.. 1973. Principles that govern the folding of protein chains. Science 181 (96):223–230.

Aszodi, A., and Taylor, W. R.. 1996. Homology modelling by distance geometry. Fold Des 1 (5):325–334.

Bairoch, A., Apweiler, R., Wu, C. H., Barker, W. C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M. J., Natale, D. A., O’Donovan, C., Redaschi, N., and Yeh, L. S.. 2005. The universal protein resource (uniprot). Nucleic Acids Res 33 (Database issue):D154–9.

Baker, D., and Sali, A.. 2001. Protein structure prediction and structural genomics. Science 294 (5540):93–96.

Bates, P. A., Kelley, L. A., MacCallum, R. M., and Sternberg, M. J.. 2001. Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM. Proteins Suppl 5:39–46.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E.. 2000. The protein data bank. Nucleic Acids Res 28 (1):235–242.

Bishop, Christopher M.. 1995. Neural Networks for Pattern Recognition. Great Clarendon St, Oxford OX2 6DP, UK: Oxford University Press.

Blom, N., Gammeltoft, S., and Brunak, S.. 1999. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. J Mol Biol 294 (5):1351–1362.

Blundell, T. L., Sibanda, B. L., Sternberg, M. J., and Thornton, J. M.. 1987. Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326 (6111):347–352.

Bonneau, R., and Baker, D.. 2001. Ab initio protein structure prediction: progress and prospects. Annu Rev Biophys Biomol Struct 30:173–189.

Bonneau, R., Tsai, J., Ruczinski, I., Chivian, D., Rohl, C., Strauss, C.E., and Baker, D.. 2001. Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 45 (Suppl 5):119–126.

Bowie, J.U., Lüthy, R., and Eisenberg, D.. 1991. A method to identify protein sequences that fold into a known three-dimensional structure. Science 253 (5016):164–170.

Browne, W. J., North, A. C., Phillips, D. C., Brew, K., Vanaman, T. C., and Hill, R. L.. 1969. A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J Mol Biol 42 (1):65–86.

Bryant, S.H., and Amzel, L.M.. 1987. Correctly folded proteins make twice as many hydrophobic contacts. Int J Pept Prot Res 29:46– 52.

Bryant, S. H., and Lawrence, C. E.. 1991. The frequency of ion-pair substructures in proteins is quantitatively related to electrostatic potential: a statistical model for nonbonded interactions. Proteins 9 (2):108–119.

Bujnicki, J.M., Elofsson, A., Fischer, D., and Rychlewski, L.. 2001a. Livebench-1: continuous benchmarking of protein structure prediction servers. Protein Sci. 10 (2):352–361.

Bujnicki, J.M., Elofsson, A., Fischer, D., and Rychlewski, L.. 2001b. Livebench-2: large-scale automated evaluation of protein structure prediction servers. Proteins 45 (Suppl 5):184–191.

Chivian, D., Kim, D. E., Malmstrom, L., Bradley, P., Robertson, T., Murphy, P., Strauss, C. E., Bonneau, R., Rohl, C. A., and Baker, D.. 2003. Automated prediction of CASP-5 structures using the robetta server. Proteins 53 Suppl 6:524–533.

Chothia, C., and Lesk, A. M.. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J 5 (4):823–826.

Claessens, M., Van Cutsem, E., Lasters, I., and Wodak, S.. 1989. Modelling the polypeptide backbone with 'spare parts' from known protein structures. Protein Eng 2 (5):335–345.

Colovos, C., and Yeates, T.O.. 1993. Verification of protein structures: patterns of nonbonded atomic interactions. Protein Sci. 2 (9):1511–1519.

Cristobal, S., Zemla, A., Fischer, D., Rychlewski, L., and Elofsson, A.. 2001. A study of quality measures for protein threading models. BMC Bioinformatics 2 (5).

Dominy, B.N., and Brooks, C.L.. 2002. Identifying native-like protein structures using physics-based potentials. J Comput Chem. 23 (1):147–160.

Ekman, D., Björklund, A. K., Frey-Skott, J., and Elofsson, A.. 2005. Multi-domain proteins in the three kingdoms of life: orphan domains and other unassigned regions. J Mol Biol 348 (1):231–243.

Elofsson, A.. 2002. A study on how to best align protein sequences. Proteins 15 (3):330–339.

Emanuelsson, O., Nielsen, H., and von Heijne, G.. 1999. ChloroP, a neural network-based method for predicting chloroplast transit peptides and their cleavage sites. Protein Sci 8 (5):978–984.

Felts, A.K., Gallicchio, E., Wallqvist, A., and Levy, R.M.. 2002. Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the Surface Generalized Born solvent model. Proteins 48 (2):404–422.

Fischer, D.. 2003. 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins 51 (3):434–441.

Fischer, D., Elofsson, A., Rychlewski, L., Pazos, F., Valencia, A., Rost, B., Ortiz, A. R., and Dunbrack, R. L. Jr.. 2001. CAFASP2: the second critical assessment of fully automated structure prediction methods. Proteins Suppl 5:171–183.

Fischer, D., Rychlewski, L., Dunbrack, R. L. Jr., Ortiz, A. R., and Elofsson, A.. 2003. CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins 53 Suppl 6:503–516.

Ginalski, K., Elofsson, A., Fischer, D., and Rychlewski, L.. 2003. 3d-jury: a simple approach to improve protein structure predictions. Bioinformatics 19 (8):1015–1018.

Greer, J.. 1990. Comparative modeling methods: application to the family of the mammalian serine proteases. Proteins 7 (4):317–334.

Gregoret, L. M., and Cohen, F. E.. 1991. Protein folding. effect of packing density on chain conformation. J Mol Biol 219 (1):109– 122.

Havel, T. F., and Snow, M. E.. 1991. A new method for building protein conformations from sequence alignments with homologues of known structure. J Mol Biol 217 (1):1–7.

Holm, L., and Sander, C.. 1996. Mapping the protein universe. Science 273:595–603.

Jones, D.T.. 1999a. GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol 287 (4):797–815.

Jones, D.T.. 1999b. Protein secondary structure prediction based on position–specific scoring matrices. J Mol Biol 292 (2):195–202.

Jones, T. A., and Thirup, S.. 1986. Using known substructures in protein model building and crystallography. EMBO J 5 (4):819– 822.

Koehl, P., and Delarue, M.. 1994a. Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J Mol Biol 239 (2):249–275.

Koehl, P., and Delarue, M.. 1994b. Polar and nonpolar atomic environments in the protein core: implications for folding and binding. Proteins 20 (3):264–278.

Koehl, P., and Delarue, M.. 1995. A self consistent mean field approach to simultaneous gap closure and side-chain positioning in homology modelling. Nat Struct Biol 2 (2):163–170.

Koehl, P., and Delarue, M.. 1996. Mean-field minimization methods for biological macromolecules. Curr Opin Struct Biol 6 (2):222–226.

Kosinski, J., Cymerman, I. A., Feder, M., Kurowski, M. A., Sasin, J. M., and Bujnicki, J. M.. 2003. A ”FRankenstein’s monster” approach to comparative modeling: merging the finest fragments of fold-recognition models and iterative model refinement aided by 3D structure evaluation. Proteins 53 Suppl 6:369–379.

Laskowski, R.A., MacArthur, M.W., Moss, D.S., and Thornton, J.M.. 1993. Procheck: a program to check the stereochemical quality of protein structures. J Appl Cryst 26 (2):283–291.

Lazaridis, T., and Karplus, M.. 1999. Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J Mol Biol 288 (3):477–487.

Lazaridis, T., and Karplus, M.. 2000. Effective energy functions for protein structure prediction. Curr. Opin. Struct. Biol. 10 (2):139– 145.

Levitt, M.. 1992. Accurate modeling of protein conformation by automatic segment matching. J Mol Biol 226 (2):507–533.

Lindahl, E., and Elofsson, A.. 2000. Identification of related proteins on family, superfamily and fold level. J Mol Biol 295 (3):613–625.

Lundström, J., Rychlewski, L., Bujnicki, J., and Elofsson, A.. 2001. Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci. 10 (11):2354–2362.

Lüthy, R., Bowie, J.U., and Eisenberg, D.. 1992. Assessment of protein models with three-dimensional profiles. Nature 356 (6364):83–85.

Marti-Renom, M.A., Stuart, A.C., Fiser, A., Sánchez, R., Melo, F., and Sali, A.. 2000. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 29:291–325.

Mirny, L.A., and Shakhnovich, E.I.. 1999. Universally conserved positions in protein folds: reading evolutionary signals about stability, folding kinetics and function. J Mol Biol 291 (1):177–198.

Moult, J.. 2005. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr Opin Struct Biol 15 (3):285–289.

Moult, J., Fidelis, K., Zemla, A., and Hubbard, T.. 2001. Critical assessment of methods of protein structure prediction (CASP): round IV. Proteins 43 (Suppl 5):2–7.

Moult, J., Fidelis, K., Zemla, A., and Hubbard, T.. 2003. Critical assessment of methods of protein structure prediction (CASP)- round V. Proteins 53 Suppl 6:334–339.

Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C.. 1995. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247 (4):536–540.

Ohlson, T., Wallner, B., and Elofsson, A.. 2004. Profile-profile methods provide improved fold-recognition: a study of different profile- profile alignment methods. Proteins 57 (1):188–197.

Petrey, D., Xiang, Z., Tang, C. L., Xie, L., Gimpelev, M., Mitros, T., Soto, C. S., Goldsmith-Fischman, S., Kernytsky, A., Schlessinger, A., Koh, I. Y., Alexov, E., and Honig, B.. 2003. Using multiple structure alignments, fast model building, and energetic analysis in fold recognition and homology modeling. Proteins 53 Suppl 6:430–435.

Qian, B., Ortiz, A. R., and Baker, D.. 2004. Improvement of comparative model accuracy by free-energy optimization along principal components of natural structural variation. Proc Natl Acad Sci U S A 101 (43):15346–15351.

Qian, N., and Sejnowski, T. J.. 1988. Predicting the secondary structure of globular proteins using neural network models. J Mol Biol 202 (4):865–884.

Rost, B.. 1999. Twilight zone of protein sequence alignments. Protein Eng 12 (2):85–94.

Rost, B., and Sander, C.. 1993. Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232 (2):584–599.

Rost, B., Sander, C., and Schneider, R.. 1994. PHD–an automatic mail server for protein secondary structure prediction. Comput Appl Biosci 10 (1):53–60.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J.. 1986. Learning internal representations by error propagation. Cambridge, MA, USA: MIT Press.

Rychlewski, L., Fischer, D., and Elofsson, A.. 2003. Livebench-6: large-scale automated evaluation of protein structure prediction servers. Proteins 53 Suppl 6:542–547.

Rychlewski, L., Jaroszewski, L., Li, W., and Godzik, A.. 2000. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 9 (2):232–241.

Sadreyev, R., and Grishin, N.. 2003. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 326 (1):317–336.

Sali, A., and Blundell, T.L.. 1993. Comparative modelling by satisfaction of spatial restraints. J Mol Biol 234 (3):779–815.

Schwede, T., Kopp, J., Guex, N., and Peitsch, M.C.. 2003. SWISS-MODEL: an automated protein homology-modeling server. Nucleic Acids Res 31 (13):3381–3385.

Siew, N., Elofsson, A., Rychlewski, L., and Fischer, D.. 2000. Maxsub: an automated measure to assess the quality of protein structure predictions. Bioinformatics 16 (9):776–785.

Simons, K.T., Bonneau, R., Ruczinski, I., and Baker, D.. 1999. Ab initio protein structure predictions of CASP III targets using ROSETTA. Proteins Suppl 3:171–176.

Sippl, M.J.. 1993. Recognition of errors in three–dimensional structures of proteins. Proteins 17 (4):355–362.

Sippl, M.J., Lackner, P., Domingues, F.S., Prlic, A., Malik, R., Andreeva, A., and Wiederstein, M.. 2001. Assessment of the CASP4 fold recognition category. Proteins 45 (Suppl 5):55–67.

Srinivasan, S., March, C. J., and Sudarsanam, S.. 1993. An automated method for modeling proteins on known templates using distance geometry. Protein Sci 2 (2):277–289.

Still, W.C., Tempczyk, A., Hawley, R.C., and Hendrickson, T.J.. 1990. Semianalytical treatment of solvation for molecular mechanics and dynamics. J Am Chem Soc 112 (16):6127–6129.

Sánchez, R., and Sali, A.. 1997. Comparative protein modeling as an optimization problem. J. Mol. Struct. 398 (3):489–496.

Unger, R., Harel, D., Wherland, S., and Sussman, J. L.. 1989. A 3d building blocks approach to analyzing and predicting structure of proteins. Proteins 5 (4):355–373.

Venclovas, C., Zemla, A., Fidelis, K., and Moult, J.. 2003. Assessment of progress over the CASP experiments. Proteins 53 Suppl 6:585– 595.

Vitkup, D., Melamud, E., Moult, J., and Sander, C.. 2001. Completeness in structural genomics. Nat Struct Biol 8 (6):559–566.

von Grotthuss, M., Pas, J., Wyrwicz, L., Ginalski, K., and Rychlewski, L.. 2003. Application of 3D-jury, GRDB, and Verify3D in fold recognition. Proteins 53 Suppl 6:418–423.

von Öhsen, N., and Zimmer, R.. 2001. Improving profile-profile alignments via log average scoring. In WABI '01: Proceedings of the First International Workshop on Algorithms in Bioinformatics pp. 11–26, Springer-Verlag, London, UK.

Wallner, B., Fang, H., Ohlson, T., Frey-Skott, J., and Elofsson, A.. 2004. Using evolutionary information for the query and target improves fold recognition. Proteins 54 (2):342–350.

Yona, G., and Levitt, M.. 2000. Towards a complete map of the protein space based on a unified sequence and structure analysis of all known proteins. Proc Int Conf Intell Syst Mol Biol 8:395–406.
