CASP5 Methods Abstracts

123D_server (P0476) - 68 predictions: 68 3D

123D: an Old Program for Fold Recognition

N.Alexandrov Ceres, Inc. Malibu, CA, USA [email protected]

I used the 123D+ web site at http://123d.ncifcrf.gov/ to make predictions. The predictions were completely automatic, with no manual intervention except for multi-domain proteins. For such proteins, the strongest local hit was cut out of the query sequence and the rest of the sequence was submitted again. To find the optimal sequence-structure alignment, 123D+ uses PSI-BLAST-generated profiles for both the query sequence and the fold library, secondary structure compatibility, and contact capacity potentials. The fold library was constructed from the 40% non-redundant ASTRAL set of SCOP-1.59 domains.
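The multi-domain handling described above can be sketched in Python as follows. This is only an illustration: the function name, the half-open hit coordinates, and the minimum segment length of 30 residues are assumptions, not details of 123D+.

```python
def split_query(sequence, hit_start, hit_end, min_len=30):
    """Cut the strongest local hit [hit_start, hit_end) out of the query.

    Returns the flanking segments that are long enough to be resubmitted
    as separate queries (min_len is a hypothetical threshold).
    """
    flanks = [sequence[:hit_start], sequence[hit_end:]]
    return [segment for segment in flanks if len(segment) >= min_len]


if __name__ == "__main__":
    query = "M" * 100
    # A hit covering residues 40..59 leaves two 40-residue flanks.
    print([len(s) for s in split_query(query, 40, 60)])
```

Each returned segment would then be submitted again, and the procedure repeated until no segment of useful length remains.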

Accelrys (P0210) - 24 predictions: 24 3D

Comparative Modeling Using GeneAtlas™

Dana Haley-Vicente, Velin Spassov, Tina Yeh, Ken Butenhof, Christoph Schneider, Azat Badretdinov and Lisa Yan

Accelrys Inc., 9685 Scranton Road, San Diego, CA 92121, USA [email protected]

GeneAtlas™ (1) is a high-throughput pipeline for automated protein structure prediction and function annotation. For template structure identification it uses PSI-BLAST searches and our fold recognition program, SeqFold. To maximize homology recognition, both direct and reverse PSI-BLAST searches are performed and the hits are combined. Automated model building is carried out with Modeler, and models are evaluated using Profiles-3D Verify scores.

For CASP5 targets, we first use GeneAtlas to help identify and select potential PDB templates, and then the alignments are adjusted manually with the aid of various alignment tools (e.g. Align123) in the Homology module in InsightII. Align123 is based on ClustalW, augmented with a secondary structure match term added to the alignment score. If multiple templates are used to build a model, structure-structure alignments are explored using InsightII’s structure alignment tools, as well as Modeler’s MALIGN3D and the protein structure alignment program CE. Subsequently, the sequence-structure alignment is carried out with Modeler’s Align2D. Multiple models are built with Modeler, including the new loop refinement routine based on the optimization of statistical pair potentials. Models are checked for proper stereochemistry and evaluated by comparing the restraint violations reported by Modeler and by the Profiles-3D Verify scores, which measure the compatibility of each residue in the model with its environment.

In addition, some targets were selected to test two new methods that we have developed, ChiRotor and Looper, for side-chain and loop prediction. ChiRotor is a fast algorithm that predicts the conformation of all or part of amino-acid side chains with an average RMSD of about 1 Å for the core residues. The loop-modeling program, Looper, produces a number of energy-minimized loop backbone conformations ranked according to force-field energy terms. Both algorithms are a combination of a discrete search in dihedral angle space and CHARMm energy minimization.

1. Kitson et al. (2002) Functional annotation of proteomic sequences based on consensus of sequence and structural analysis. Briefings in Bioinformatics 3(1), 1-13.

Advanced-ONIZUKA (P0214) - 92 predictions: 92 3D

Fold Selection and Patchwork Energy Minimization

Kentaro Onizuka Advanced Technology Research Laboratories, Matsushita Electric Industrial Co. Ltd. [email protected]

The new method developed for CASP5 consists of three units.

1) Fold recognition unit. This unit selects ten to one hundred conformations that have relatively good compatibility with the target protein sequence from approximately two thousand non-redundant protein structures collected from PDB release 100. The selected conformations are aligned to the target protein sequence. The compatibility of a conformation with the target sequence is evaluated as the sum of multi-dimensional mean-force potentials between all possible pairs of residues in that conformation, with the target sequence aligned to it.

2) Patchwork energy minimization unit. This unit builds a protein conformation by concatenating structure segments cut out of the conformations selected by the fold recognition unit, each of which is aligned to the target protein sequence. The concatenation is done as follows: (a) select two conformations (the i-th and j-th), each aligned to the target protein sequence; (b) choose a residue M in the sequence as the crossover point; (c) generate the new conformation by concatenating the segment from the N-terminus (of the target sequence) to M of the j-th conformation with the segment from M to the C-terminus (of the target sequence) of the i-th conformation. The minimization algorithm is partially analogous to both a genetic algorithm and dynamic programming. The minimization procedure first sets several segment core residues, which must never be crossover points. The core residues are those with locally minimal energy, where the energy of each residue is calculated as the average energy (sum of potentials involving that residue) over all the selected conformations. The first concatenation step takes crossover points M between the N-terminus and the first segment core residue. For the i-th conformation, the combination of M and j that yields the conformation with minimal energy is selected. The k-th step takes M between the (k-1)-th and k-th core residues. The last step takes M between the last core residue and the C-terminal residue. Finally, the conformation with minimal energy is selected from the remaining new conformations as the result of the energy minimization.

3) Gap caulking unit. The protein conformation built by the patchwork energy minimization unit contains gaps that arise from insertions and deletions in the alignment process. This unit tries to caulk those gaps by searching the conformations (selected by the fold recognition unit) for a combination of two gapless conformation segments in that region that can substitute for the conformation segments containing gaps.
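The crossover step of the patchwork unit can be sketched in Python as follows. This is a toy illustration, not the author's implementation: conformations are represented as equal-length lists with one entry per target residue, the energy function is supplied by the caller, and residue M itself is assumed to be taken from the i-th conformation.

```python
def crossover(conf_i, conf_j, m):
    """Join residues before m from conformation j with residues from m on from i."""
    if len(conf_i) != len(conf_j):
        raise ValueError("conformations must be aligned to the same target")
    return conf_j[:m] + conf_i[m:]


def best_crossover(conf_i, others, candidates, energy):
    """Try every partner j and crossover point m; keep the lowest-energy result.

    `candidates` holds the allowed crossover points, i.e. positions that are
    not segment core residues. Returns (energy, conformation).
    """
    best_energy, best_conf = energy(conf_i), conf_i
    for conf_j in others:
        for m in candidates:
            child = crossover(conf_i, conf_j, m)
            child_energy = energy(child)
            if child_energy < best_energy:
                best_energy, best_conf = child_energy, child
    return best_energy, best_conf


if __name__ == "__main__":
    # Toy per-residue "energies" stand in for real structure segments.
    print(best_crossover([5, 5, 1, 1], [[1, 1, 5, 5]], range(1, 4), sum))
```

In the method described above this search would be repeated step by step, with the candidate points restricted to the region between consecutive core residues.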

The multi-dimensional mean force potentials $E^{ab}_k$ are pairwise potentials between two residues, defined with respect to the residue types a and b, the sequence separation k, and the six-dimensional relative configuration, whose components are 1) the distance between the two residues, 2) the direction of residue b from a, and 3) the orientation of b relative to a (three Euler angles). The fold recognition unit, however, first employs singleton potentials, which depend on only one residue type of the pair, to generate an energy profile for each conformation in the non-redundant conformation data set. The target sequence is then aligned to each profile using a dynamic programming algorithm.

The compatibility of each conformation with the target sequence is evaluated by calculating the total energy, which is the sum of pairwise potentials according to that alignment. The energy minimization unit employs pairwise potentials plus attractive force potentials, because energy minimization using only the net mean force potentials [1] generates an extended conformation rather than a compact one. The attractive potentials adopted here are proportional to the square of the distance between residues.
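As a small illustration of the attractive term described above, the Python sketch below sums a penalty proportional to the squared distance over all residue pairs. The weight, the function name, and the use of one representative point per residue are assumptions made for the example.

```python
def attractive_energy(coords, weight=1.0):
    """Sum of weight * d_ij^2 over all residue pairs i < j.

    `coords` is a list of (x, y, z) tuples, one per residue. Larger
    values mean a more extended chain, so minimizing this term pulls
    the conformation toward compactness.
    """
    total = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            squared_distance = sum(
                (a - b) ** 2 for a, b in zip(coords[i], coords[j])
            )
            total += weight * squared_distance
    return total


if __name__ == "__main__":
    extended = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0)]
    compact = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0)]
    print(attractive_energy(extended), attractive_energy(compact))
```

In the minimization, a term of this kind would be added to the sum of pairwise mean force potentials.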

The performance of the proposed minimization algorithm is intense, although the algorithm is not guaranteed to generate the optimal solution. The most difficult remaining problem is the choice of potentials for minimization.

1. Sippl M.J. (1990) Calculation of Conformational Ensembles from Potentials of Mean Force: An Approach to the Knowledge-based Prediction of Local Structure in Globular Proteins. J. Mol. Biol., 213, 859-883.
2. Onizuka K., Noguchi T., Akiyama Y., Matsuda H. (2002) Using Data Compression for Multidimensional Distribution Analysis. Intelligent Systems May/June 2002, 48-54.

ALAX (P0234) - 39 predictions: 39 3D

A New Sequence Alignment Method ALAX and Its Application to

Atsushi Hijikata1, Tosiyuki Noguti 2 and Mitiko Go1 1 Division of Biological Science, Graduate School of Science, Nagoya University, 2 Saga Medical School [email protected]

One of the important issues in homology modeling is obtaining an accurate sequence alignment. This is particularly true when the sequence identity between the target and template proteins is low (less than 30%). At low sequence identity, one of the difficulties lies in placing the insertions/deletions (in/del) at the proper positions. To accommodate the in/del at correct locations, we developed a new sequence alignment method for protein pairs with weak identity in their amino acid sequences. A new gap penalty function was introduced that is based on the solvent accessibility of the corresponding amino acid residues of the template structure. The new method combines this gap penalty function with the Position Specific Scoring Matrix (PSSM) of PSI-BLAST [1]. We named the method ALAX (ALignment based on ACCessibility). In CASP5/CAFASP3, we used ALAX for template-target sequence alignment and the homology modeling software FAMS for model building.
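The abstract does not give the form of the accessibility-based gap penalty, so the Python sketch below is only one plausible shape: a gap-opening penalty that falls linearly as the template residue becomes more solvent exposed. The function name, the linear form, and the two penalty values are all assumptions chosen for illustration.

```python
def gap_open_penalty(relative_accessibility, buried=12.0, exposed=4.0):
    """Hypothetical accessibility-dependent gap-opening penalty.

    relative_accessibility is the template residue's relative solvent
    accessibility, clamped to [0, 1]. Buried positions (0) get the
    largest penalty; fully exposed positions (1) get the smallest.
    """
    a = min(max(relative_accessibility, 0.0), 1.0)
    return buried + (exposed - buried) * a


if __name__ == "__main__":
    for acc in (0.0, 0.5, 1.0):
        print(acc, gap_open_penalty(acc))
```

The intent of such a function is to make insertions and deletions cheaper in exposed regions such as surface loops than in the buried core.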

In CASP5/CAFASP3, we obtained the target models through the following three steps.

1) Template structure selection. To identify a template structure, we ran five iterations of PSI-BLAST against the non-redundant protein sequence database (nr) of the NCBI. All sequences with an e-value lower than 0.1 were included in the PSSM construction. The PSSM was then used to search the PDB sequence database, and the PDB sequence with the lowest e-value was selected as the template structure.

2) Target-template sequence alignment. To align the template and target sequences, we used ALAX with the solvent accessibility of the residues of the template structure and the PSSM constructed in step 1).

3) Model building. Model building was then carried out with the FAMS [2] program according to the alignment obtained by ALAX. All steps of the homology modeling procedure, 1) to 3), are fully automatic.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Ogata K. and Umeyama H. (2000) An automatic homology modeling method consisting of database searches and simulated annealing. J. Mol. Graph. Model. 18 (3), 258-272, 305-306.

Aligners (P0064) - 31 predictions: 31 3D

Fold Recognition Using Only Boilerplate Methods of Database Search and Multiple Sequence Alignment

Arcady Mushegian Stowers Institute for Medical Research [email protected]

I believe that most if not all approaches for predicting protein structure from sequence form a continuum of methods, at the core of which lies probabilistic modeling of evolutionarily related sequence families. (Ab initio methods may be an exception, but they used to be practical mostly for short peptides). Thus, there is no “threading” really distinct from “fold recognition” really distinct from “homology modeling” – the difference is mainly in the atomic detail of the resulting model.

In order to falsify, and thereby scientifically test, the above statement, one has to demonstrate that various complementary physico-chemical approaches 1. are not reducible to probabilistic modeling of protein sequence families and 2. result in a statistically significant improvement over methods that use alignment information alone.

In order to provide a benchmark against which the level of improvement can be scored, I applied the “no-new-methods” approach to structure prediction of CASP5 targets. In the first step, I removed the targets that had a statistically significant match (arbitrary cutoff E ≤ 10^-4) at the first iteration of the PSI-BLAST program [1] to a sequence with a known (PDB) structure. These are straight homology modeling targets, where the real issue is not fold recognition but the RMSD of the model. I know nothing about methods of reducing RMSD. I also left out several very short peptides. The result is 37 targets where fold recognition, i.e., identification of and alignment to an appropriate template, is a legitimate yet non-trivial task.

The main database search program was PSI-BLAST (the cutoff for inclusion into a profile was set at 0.05, and composition-based statistics were used when helpful). The program was run to convergence, and the homologs were collected and used in new rounds of similarity searches. This is an important step because none of the existing similarity search methods is assured to recover all family members in one search, even an iterative one [2]. If a template was not discovered, the RPS-BLAST program [3] was used and proved helpful in two cases. Several HMM-based applications were also employed but did not give any gain in template identification.
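The practice of re-seeding searches with the homologs found so far amounts to a transitive closure over search hits. A minimal Python sketch follows; `search` is a hypothetical callable standing in for one converged PSI-BLAST run and is not a real BLAST API.

```python
def collect_family(seed, search):
    """Collect family members by using every new hit as a fresh query.

    `search(query)` must return an iterable of sequence identifiers
    found with `query`. Each sequence is used as a query at most once.
    """
    found = {seed}
    queue = [seed]
    while queue:
        query = queue.pop()
        for hit in search(query):
            if hit not in found:
                found.add(hit)
                queue.append(hit)
    return found


if __name__ == "__main__":
    # Toy hit table: "c" is only reachable through "b", never from the target.
    hits = {"target": ["a"], "a": ["target", "b"], "b": ["c"], "c": []}
    print(sorted(collect_family("target", hits.__getitem__)))
```

In the toy table, the member "c" is recovered only because "b" was itself used as a query, which is the situation the paragraph above describes.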

Sequences of multiple family members, including the target, the template, and several homologs with different degrees of similarity to both, were aligned using MACAW [4] and T-COFFEE [5], then converted to the AL format (I thank Ognen Duzlevski for giving me a converter program). The only manual check was to ensure that the alignment makes structural sense, i.e. that the major elements of secondary structure are aligned and that their connectivity is possible given the distances between the aligned elements in each structure. Loops were not modeled if they could not be aligned on the basis of sequence similarity.

I submitted 28 models for 28 targets. The assessors are invited to see whether the results are, on average, comparable with the ones achieved by more sophisticated approaches.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Aravind L., Koonin E.V. (1999) Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J Mol Biol. 287(5):1023-1040.
3. Schaffer A.A. et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics. 15:1000-1011.
4. Schuler G.D. et al. (1991) A workbench for multiple alignment construction and analysis. Proteins 9: 180-190.
5. Notredame C. et al. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 302: 205-217.

arby-scai (P0183) - 68 predictions: 68 3D

The Arby Automated Structure Prediction Server

Ingolf Sommer1, Niklas von Öhsen2 1 – Max-Planck-Institute for Informatics, 2 – Fraunhofer Institute for Scientific Computing and Algorithms [email protected]

Our fully automated protein structure prediction server Arby combines the results of several fold recognition methods to find suitable templates in a database of structural representatives of protein domains.

The method starts by constructing a set of subsequences from the query sequence, each subsequence representing a hypothesis for a possible protein domain. This is done by scanning against the InterPro database and using hits as domain hypotheses [1]. Additional hypotheses are constructed using a secondary structure prediction from PSIPRED [2]. Segments of predicted loops are used as potential domain boundaries. Finally, the set of subsequences is reduced to a reasonable size by removing subsequences that are highly similar or short.

For each subsequence a multiple alignment is constructed by searching the NR database using PSI-BLAST [3]. A frequency profile is calculated from this multiple alignment using a slightly modified version of the Henikoff-Henikoff sequence-weighting algorithm [4].
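For reference, the unmodified position-based weighting scheme of Henikoff and Henikoff [4] can be sketched in Python as below. The server uses a slightly modified version whose details are not given here, so this shows only the textbook scheme; skipping gap characters is an assumption of the sketch.

```python
from collections import Counter


def henikoff_weights(alignment):
    """Position-based sequence weights, normalized to sum to 1.

    `alignment` is a list of equal-length strings. In each column, a
    sequence receives 1 / (k * n), where k is the number of distinct
    residue types in the column and n is the number of sequences that
    share its residue. Gap characters ('-') are skipped.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        k = len(counts)
        for index, residue in enumerate(column):
            if residue != "-":
                weights[index] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    # Two identical sequences share weight; the divergent one counts more.
    print(henikoff_weights(["AA", "AA", "CC"]))
```

A weighted frequency profile is then obtained by counting each residue in a column with its sequence's weight instead of a count of one.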

Each of the potential domains is then subjected to four different fold recognition methods. Each method searches for an optimal structure in our template database. The template database is a representative subset of the SCOP domains with pairwise sequence identity lower than 40% [5, 6]. For each of these template domains, a frequency profile was constructed as described above for the targets. The first fold recognition method is PSI-BLAST, which is used to search through our set of template domains (augmented by the NR sequence database). The second one is the 123D threading program. It uses frequency profiles on the target side and 3D structural information on the template side [7, 8]. The third one is the JProp profile-profile alignment method recently developed in our group [9, 10]. It compares frequency profiles on the target side with profiles on the template side using the log average scoring approach. The fourth method is again the JProp profile-profile alignment program, but in this version it makes use of additional secondary structure information on the target and template side (publication in preparation).

The quality of each of these search results is assessed using confidence measures. For PSI-BLAST, these are readily available [11]; for the other methods, they were developed in a recent study [12].

The target sequence is then annotated with all the produced quadruplets (subsequence, fold recognition method, search result, confidence value). Finally, we select a set of non-overlapping annotations along the sequence, by performing combinatorial optimization of a heuristic score based on the confidence values. For each of these selected annotations, a separate protein domain is predicted. The structure of this domain prediction is computed by aligning the subsequence against the template structure using JProp.
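Selecting non-overlapping annotations can be illustrated with weighted interval scheduling, sketched below in Python. The actual heuristic score used by the server is not specified in this abstract; here the score of a selection is simply assumed to be the sum of per-annotation scores, and intervals are half-open.

```python
import bisect


def select_annotations(annotations):
    """Pick non-overlapping (start, end, score) intervals with maximal total score.

    Intervals are half-open [start, end). Returns (total, chosen), with
    `chosen` ordered along the sequence.
    """
    items = sorted(annotations, key=lambda a: a[1])
    ends = [a[1] for a in items]
    # best[i] is the optimum using only the first i items (sorted by end).
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(items):
        # Count of earlier items that end at or before this start.
        p = bisect.bisect_right(ends, start, 0, i)
        take = best[p][0] + score
        if take > best[i][0]:
            best.append((take, best[p][1] + [items[i]]))
        else:
            best.append(best[i])
    return best[-1]


if __name__ == "__main__":
    hypotheses = [(0, 100, 0.75), (0, 50, 0.5), (50, 100, 0.5), (90, 150, 0.25)]
    print(select_annotations(hypotheses))
```

In this toy example, two half-length domain hypotheses together outscore the single full-length hypothesis, so the two-domain annotation is selected.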

The underlying machinery is a Java-based data flow engine, designed for stability. Since it is general and independent of the specific pipeline (such as the one described above), it can be used as infrastructure for other projects as well: we developed a component framework in which all algorithms and programs are encapsulated in small Java classes. Each of these components specifies an algorithm to be executed along with its input parameters, the output that it produces, and possible error conditions. The accompanying engine provides a number of features for the components. First of all, the input/output dependencies of components are resolved. Once all inputs for a specific algorithm have been determined, the algorithm itself is scheduled for execution. The components are executed in parallel on any number of CPUs, in our case 10 CPUs of a SunFire 4800 server. A frequent problem in fully automated systems is reliable error handling. We solve this problem by catching potential error conditions and adaptively pruning the data-flow tree. Additionally, persistence of the computed results is accomplished by using a relational database, thus offering convenient and fast access to previously computed results for identical input parameters.
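The dependency resolution and pruning behaviour can be sketched in a few lines of Python. This is not the Arby engine, which is written in Java, runs components in parallel, and persists results in a database; the component tuple layout and all names below are assumptions for illustration, and execution here is sequential.

```python
def run_pipeline(components, initial):
    """Run components whose inputs are ready; prune those downstream of a failure.

    `components` is a list of (name, inputs, output, func) tuples and
    `initial` maps already-known data items to values.
    Returns (values, failed_outputs).
    """
    values = dict(initial)
    failed = set()
    pending = list(components)
    progress = True
    while pending and progress:
        progress = False
        for component in list(pending):
            name, inputs, output, func = component
            if any(item in failed for item in inputs):
                # An upstream result is missing for good: prune this branch.
                failed.add(output)
            elif all(item in values for item in inputs):
                try:
                    values[output] = func(*(values[item] for item in inputs))
                except Exception:
                    failed.add(output)
            else:
                continue  # inputs not yet available; retry on the next pass
            pending.remove(component)
            progress = True
    return values, failed


if __name__ == "__main__":
    steps = [
        ("length", ["sequence"], "length", len),
        ("double", ["length"], "twice", lambda n: 2 * n),
    ]
    print(run_pipeline(steps, {"sequence": "ACDE"}))
```

A failing component marks its output as failed, and every component that depends on that output is removed without being executed, while independent branches still complete.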

The power of the structure prediction server is based on the use of modern profile-profile algorithms for fold recognition, the quality assessment using confidence measures, and the stable and powerful Java data flow engine. In future work, we will use the latter technology as a basis for our bioinformatics computing environment.

Acknowledgements. In addition to the authors, the ARBY CAFASP3 Team includes Mario Albrecht, Thomas Lengauer, Theo Mevissen, and Ralf Zimmer. We thank Daniel Hanisch for his contributions to the Java implementation. Part of this research has been supported by BMBF grant no. 01 SF 9984/3 (Helmholtz Network for Bioinformatics).

1. Apweiler R. et al. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29 (1), 37-40.
2. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 292 (2), 195-202.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-402.
4. Henikoff S. and Henikoff J.G. (1994) Position-based sequence weights. J Mol Biol. 243 (4), 574-8.
5. Chandonia J.M., et al. (2002) ASTRAL compendium enhancements. Nucleic Acids Res. 30 (1), 260-3.
6. Brenner S.E., Koehl P., and Levitt M. (2000) The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res. 28 (1), 254-6.
7. Zien A., Zimmer R., and Lengauer T. (2000) A simple iterative approach to parameter optimization. J Comput Biol. 7 (3-4), 483-501.
8. Alexandrov N.N., Nussinov R., and Zimmer R. (1996) Fast protein fold recognition via sequence to structure alignment and contact capacity potentials. Pac Symp Biocomput, 53-72.
9. Von Öhsen N., Sommer I., and Zimmer R. (2003) Profile-Profile Alignment: A Powerful Tool For Protein Structure Prediction. In Pac Symp Biocomput.
10. Von Öhsen N. and Zimmer R. (2001) Improving profile-profile alignment via log average scoring. Lecture Notes in Computer Science. 2149, 11-26.
11. Karlin S. and Altschul S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A. 87 (6), 2264-8.
12. Sommer I., et al. (2002) Confidence measures for protein fold recognition. Bioinformatics. 18 (6), 802-12.

AS2TS (P0081) - 26 predictions: 26 3D

AS2TS – A New Protein Structure Prediction Server

J. Zemla Independence High School, Brentwood, CA, US [email protected]

We have attempted to predict structures of twenty-six CASP5 targets using a preliminary version of a fully automated method AS2TS (Amino acid Sequence to Tertiary Structure) [1].

The AS2TS server built 3D protein models using the top sequence-structure alignment provided by PSI-BLAST [2] for a given target. Coordinates for loop regions were assigned from a library of folds generated by the LGA program (Local-Global Alignment) [3]. Side chains were added using the SCWRL program [4]. Human intervention was limited to entering an amino acid sequence into the AS2TS server and checking whether the model building process ran to completion.

Our main goal during this round of CASP was to test the ability and effectiveness of combining two independently working processes: a sequence alignment method and a loop building procedure. An analysis of the evaluation results will help in further development of the AS2TS system.

1. Zemla A. http://protein.llnl.gov/as2ts
2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W. & Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17), 3389-3402.
3. Zemla A. http://PredictionCenter.llnl.gov/local/lga/lga.html
4. Bower M., Cohen F.E. and Dunbrack R.L. Jr. (1997) Sidechain prediction from a backbone-dependent rotamer library: A new tool for homology modeling. J. Mol. Biol. 267, 1268-1282.

ATOME (P0464) - 318 predictions: 318 3D

Evaluation of an Automatic Pipeline, ATOME for Protein Structure Modelling

G. Labesse, V. Catherinot, J.-L. Pons, L. Martin and D. Douguet 1 - Centre de Biochimie Structurale (CNRS), Montpellier, France [email protected]

The fold compatibility between the targets and PDB entries was analyzed using our recently developed meta-server [1]. Query sequences are sent automatically to six distinct fold recognition or protein structure prediction servers: 3D-PSSM [2], PDB-BLAST (http://bioinformatics.burnham-inst.org/pdb_blast/), FUGUE [3], GenTHREADER [4], SAM-T99 [5] and J-PRED2 [6], with default parameters except for PDB-BLAST (10 iterations). No particular treatment was applied to multi-domain targets, as proper domain delimitation was not yet automated. This likely led to partially incorrect alignments or to incorrect fold recognition for a few targets.

As most “threaders” use the “frozen approximation”, each structural alignment was further evaluated using T.I.T.O [7]. PSI-BLAST [8] on SWISSPROT [9] sequence database run on the NPSA server [10] was used to search homologous sequences using the target sequence as a query. The homologs and the target sequence were used to produce a multiple alignment using CLUSTALW. This alignment was used to assess the structural alignments.

A consensus ranking was deduced for each template taking into account its score and its ranking (both computed by the original server), the T.I.T.O score and the level of sequence identity.

For all targets, three models were built directly using MODELLER 6.0 [11] for the top-ranking structural alignments. Additional restraints to be used in MODELLER 6.0 were deduced from the template secondary structure assignment made using P-SEA [12]. Models were evaluated using PROSA [13] and Verify3D [14]. Side chain modelling in the common core (as defined by the target-template alignment) was also performed using SCWRL 2.8 [15], and the resulting models were similarly evaluated but not further refined.

For each target, all the three-dimensional models were ranked according to the scores computed by PROSA [13] and Verify3D [14]. The top five models were deposited for each target.

1. Douguet D. et al. (2001) Easier threading through web-based comparisons and cross-validations. Bioinformatics 17, 752-753.
2. Kelley L.A. et al. (2000) Enhanced Genome Annotation using Structural Profiles in the Program 3D-PSSM. J. Mol. Biol. 299, 501-522.
3. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243-257.
4. McGuffin L.J. et al. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405.
5. Karplus K. et al. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846-856.
6. Cuff J.A. et al. (1998) Jpred: A Consensus Secondary Structure Prediction Server. Bioinformatics 14, 892-893.
7. Labesse G. et al. (1998) A Tool for Incremental Threading Optimization (T.I.T.O.) to help alignment and modelling of remote homologs. Bioinformatics 14, 206-211.
8. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
9. Bairoch A. et al. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28, 45-48.
10. Combet C. et al. (2000) NPS@: network protein sequence analysis. Trends Biochem. Sci. 25, 147-150.
11. Sali A. et al. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
12. Labesse G. et al. (1997) P-SEA: a new efficient assignment of secondary structure from Cα trace of proteins. CABIOS 13, 291-295.
13. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17, 355-362.
14. Eisenberg D. et al. (1997) VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol 277, 396-404.
15. Dunbrack R.L. et al. (1993) Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol. 230, 543-574.

Avbelj-Franc (P0341) - 25 predictions: 25 3D

Torsion Space Monte Carlo Simulations of Folding Using Electrostatic Screening of Backbone and Charged Side-Chain Interactions

F. Avbelj National Institute of Chemistry [email protected]

Three-dimensional structures of small proteins are predicted ab initio using torsion space Monte Carlo simulations from sequence alone. Protein structures in the databank are not used in this method. The method is based on the electrostatic screening of main-chain and charged side-chain interactions.

The screening of main-chain electrostatic interactions by water solvation is used to model the backbone conformational propensities (the electrostatic screening model: ESM) [1-7]. The strongest support for the ESM has been provided by the recent experimental studies, which demonstrated that an enthalpic factor is involved in determining the preferences for α-helices and β-strands [8-10].

The energy function in the Monte Carlo procedure contains main-chain and charged side-chain electrostatic interactions, electrostatic solvation free energies of main-chain and charged side-chain groups, and the hydrophobic effect. The screening of charged side-chain electrostatics by water solvation is used to model the interactions of charged side-chain groups. The interactions of polar non-charged side-chain groups are ignored. The hydrophobic effect is modeled by the long-range interactions. The main-chain and charged side-chain electrostatic interactions are calculated using Coulomb's law with a dielectric constant of 1. The electrostatic solvation free energies of polar main-chain and charged side-chain groups (ESF) are calculated using the finite difference Poisson-Boltzmann model (DelPhi) with the PARSE parameter set [11]. The electrostatic potential of the molecule is first calculated using a very large box and a large grid size. This potential then provides boundary conditions for more accurate calculations of the electrostatic potential around each residue (focusing).

Torsion space Monte Carlo simulations of small proteins are performed using hierarchic condensation. In the first phase of the simulation, only the local electrostatic energies and backbone solvation free energies of residues are activated. After equilibration, the protein molecules display native-like local conformational propensities and dimensions characteristic of highly denatured proteins. The calculated NMR ³J(HN,Hα) coupling constants agree well with those obtained from the COIL residues in experimental protein structures. In this phase the β-strands are formed. In the second phase of the simulation, the main-chain hydrogen bonds are included in the energy function. In this phase α-helices and hairpins are formed. In the third phase of the simulation, the long-range hydrophobic and electrostatic interactions between charged residues are gradually included in the energy function. The electrostatic interactions between charged residues are screened by the electrostatic solvation free energies of charged side-chains. In this phase α-helices and β-strands gradually condense into compact structures.

In order to improve sampling of the conformational space, a large number of independent Monte Carlo simulations (~100) are performed. All heavy atoms and polar hydrogens are included in the simulations. The geometry of amino acids is generated using the Discover residue library. Only torsion angles are allowed to vary during simulations. The ω peptide bond torsion angles are fixed to 180°. Hard sphere repulsion is enforced by discarding conformations with steric clashes. Pairs of atoms related by torsion angles are not checked for steric clashes. Conformational space is sampled by varying the torsion angles of proteins using different types of moves. The Metropolis criterion is used to decide whether to accept or reject a move. The temperature is 300 K.
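The Metropolis acceptance test mentioned above can be written as a short Python function. The energy unit (kcal/mol) and the corresponding value of the gas constant are assumptions of this sketch; the abstract states only the temperature.

```python
import math
import random

# Gas constant in kcal/(mol K); assumes energies are given in kcal/mol.
GAS_CONSTANT = 0.0019872


def metropolis_accept(delta_energy, temperature=300.0, rng=random.random):
    """Return True if a move with energy change `delta_energy` is accepted.

    Downhill moves are always accepted; uphill moves are accepted with
    probability exp(-delta_energy / (R * T)). `rng` must return a
    uniform number in [0, 1).
    """
    if delta_energy <= 0.0:
        return True
    return rng() < math.exp(-delta_energy / (GAS_CONSTANT * temperature))


if __name__ == "__main__":
    random.seed(0)
    accepted = sum(metropolis_accept(1.0) for _ in range(10000))
    print("fraction of +1 kcal/mol moves accepted:", accepted / 10000)
```

In a torsion space simulation, `delta_energy` would be the change in the energy function after a trial change of one or more torsion angles, evaluated only for conformations that pass the steric clash check.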

1. Avbelj F. and Moult J. (1995) Role of electrostatic screening in determining protein main-chain conformational preferences. Biochemistry, 34, 755-764.
2. Avbelj F. and Fele L. (1998) Role of main-chain electrostatics, hydrophobic effect, and side-chain conformational entropy in determining the secondary structure of proteins. J. Mol. Biol., 279, 665-684.
3. Avbelj F. (2000) Amino acid conformational preferences and solvation of polar backbone atoms in peptides and proteins. J. Mol. Biol., 300, 1337-61.
4. Avbelj F. and Moult J. (1995) The conformation of folding initiation sites in proteins determined by computer simulation. Proteins: Struc., Funct., Genet., 23, 129-141.
5. Avbelj F. and Fele L. (1998) Prediction of the three dimensional structure of proteins using the electrostatic screening model and hierarchic condensation. Proteins: Struc., Funct., Genet., 31, 74-96.
6. Avbelj F. (1992) Use of a potential of mean force to analyze free energy contributions in protein folding. Biochemistry, 31, 6290-6297.
7. Avbelj F. and Baldwin R. L. (2002) Role of backbone solvation in determining thermodynamic β-propensities of the amino acids. Proc. Natl. Acad. Sci. U.S.A., 99, 1309-1313.
8. Luo P. and Baldwin R. L. (1999) Interactions between water and polar groups of the helix backbone: An important determinant of helix propensities. Proc. Natl. Acad. Sci. U.S.A., 96, 4930-4935.
9. Lorch M. et al. (2000) Effects of mutants on the thermodynamics of a protein folding reactions: Implications for the mechanism of formation of the intermediate and transition states. Biochemistry, 39, 3480-3485.
10. Thomas S. T. et al. (2001) Hydration of the peptide backbone largely defines the thermodynamic propensity scale of residues at the C' position of the C-capping box of α-helices. Proc. Natl. Acad. Sci. U.S.A., 98, 10670-10675.
11. Sitkoff D. et al. (1994) Accurate calculations of hydration free energies using macroscopic solvent models. J. Phys. Chem., 98, 1978-198.

BAKER (P0002) - 377 predictions: 377 3D

Comparative Modeling Using Rosetta

D. Chivian1+, C.A. Rohl1+, C.E.M. Strauss2, P. Murphy1 and D. Baker1 1 - University of Washington, 2 - Los Alamos National Laboratory, + - authors contributed equally [email protected]

Comparative modeling using Rosetta [1] comprises up to five steps: A) detection of the best parent for each putative domain, B) sequence alignment to that parent, C) modeling of structurally variable regions, D) optimization to increase the physical reasonableness of the final model, and E) re-assembling the complete chain when domains were parsed and processed individually.

(A) Homolog Detection. Queries were initially screened for simple Blast or PSI-Blast parents. Large regions of query sequence without parent coverage were then submitted to the Bioinfo meta-server and candidate parents from Pcons2 and Pcons3 [2] were selected. Occasionally, parents with functions similar to that reported for the query were also considered. Human intervention was then employed to select the appropriate parent. Domains for which no significant matches were found were modeled using the Rosetta de novo prediction protocol [3; see also the de novo abstract in this volume].

(B) Sequence Alignment. We employed a "kitchen sink" approach, called "K*SYNC", which produces large sets of candidate alignments by varying the way in which information is derived and used by a modified Smith-Waterman alignment algorithm. The information used includes the similarity between PSI-Blast derived residue substitution profiles for the query and parent, supplementing the parent residue substitution profile with counts from its FSSP, matching predicted regular secondary structure (PSIPRED, PHD, SAM, and/or JUFO) with three-state collapsed DSSP assigned secondary structure, and position-specific obligateness and contiguousness as defined by the occupancy and degree of gapping for the query and parent in the PSI-Blast multiple sequence alignment and from the parent's FSSP multiple structural alignment.
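For orientation, the standard Smith-Waterman local alignment that K*SYNC modifies can be sketched as follows. This is not the K*SYNC implementation: the precomputed `score` matrix (standing in for a combined profile and secondary structure similarity) and the linear `gap` penalty are assumptions made for illustration, and all names are hypothetical.

```python
def smith_waterman(score, gap=1.0):
    """Local alignment over a precomputed similarity matrix.

    score[i][j] is the similarity of query position i and parent position j.
    Returns (best_score, list of aligned (query, parent) index pairs).
    """
    n, m = len(score), len(score[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0.0,                                  # start a new local alignment
                H[i - 1][j - 1] + score[i - 1][j - 1],  # align i with j
                H[i - 1][j] - gap,                    # gap in the parent
                H[i][j - 1] - gap,                    # gap in the query
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the local alignment starts.
    pairs = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        if H[i][j] == H[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs
```

Varying how `score` is built (which profiles, which secondary structure predictor, which weights) and rerunning the alignment is one way to obtain an ensemble of candidate alignments of the kind described above.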

The ensemble of sequence alignments was converted to an ensemble of three-dimensional template structures, and short to medium unaligned regions (< 17 residues) were modeled in the context of these templates using an abbreviated insertion modeling procedure (see C below). Alignments containing insertions that failed to produce conformations in agreement with the geometry of the template stems were discarded from the ensemble. Remaining alignments were ranked by evaluation of the structural models by several energy criteria. Human intervention was employed to either select one of the high-ranking alignments or to produce a new alignment by recombining the preferred features of multiple high-ranking alignments.

(C) Insertion modeling. Unaligned regions corresponding to gaps in the sequence alignment as well as regions estimated likely to show significant structural divergence from the parent structure were modeled by the Rosetta fragment assembly strategy in the context of the fixed template. For regions < 17 residues, ~300 initial conformations were selected from a database of known structures using similarity of sequence, secondary structure, and stem geometry. Initial conformations for longer regions were built up using three- and nine-residue fragments. The conformations of all variable regions were then optimized using fragment replacement and random angle perturbations. A gap closure term in the potential in combination with conjugate gradient minimization was used to ensure continuity of the peptide backbone. Optimization of variable regions was accomplished by use of the standard Rosetta potential with centroid representation of side chains, followed by optimization with explicit side chains. All variable regions were optimized simultaneously, starting from a random selection of initial conformations. Generally, ~1000 independent optimizations were carried out. Variable regions were ranked independently by energy and low energy conformations for each variable region combined into a final model, manually ensuring that interacting variable regions were compatible. For the purposes of evaluating alignments (see B above), variable regions were modeled sequentially rather than simultaneously, stricter geometry requirements were enforced in selecting initial conformations, and the optimization step was severely truncated.

(D) Idealization and Optimization of Template Regions. To make the models more physically reasonable, most structural models were modified to possess ideal backbone bond lengths and angles. Additionally, residue clashes were alleviated using a combination of small backbone perturbations and a rotamer-repacking algorithm. For most targets, models pre- and post-optimization were submitted. For targets for which idealization and/or optimization resulted in significant backbone perturbation (> ~1.5-2 Å), this step was eliminated.

(E) Domain Assembly. Domain scope models were combined into a contiguous chain by fragment insertion in the putative linker region, and evaluated by a coarse energy function. Finally, side-chains were repacked [4] in either the single or the multi-domain context.

1. Simons K.T. et al. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol. Biol. 268 (1), 209-25. 2. Lundstrom J. et al. (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci. 10 (11), 2354-62. 3. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 45 (S5), 119-26. 4. Kuhlman B. and Baker D. (2000) Native Protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. USA 97 (19), 10383-8.

BAKER (P0002) - 377 predictions: 377 3D

De Novo Structure Predictions Using Rosetta

P. Bradley1+, J. Meiler1+, K.M.S. Misura1+, W.R. Schief1+, J. Schonbrun1+, W.J. Wedemeyer1+, O. Schueler-Furman1, M Kuhn1, P. Murphy1, C.E.M. Strauss2, and D. Baker1 1 - University of Washington, 2 - Los Alamos National Laboratory, + - authors contributed equally [email protected]

De novo structure predictions for CASP5 were made using Rosetta. The basic method has been described previously [1]. One of the fundamental assumptions underlying Rosetta is that the conformations adopted by short (3-9 residue) segments of the target polypeptide chain are similar to those adopted by related sequences in fragments of experimentally determined protein structures. Fragment libraries for each three and nine residue segment of the target polypeptide chain were extracted from the protein data bank using a profile-profile comparison method as described previously [2]. The conformational space defined by these fragments is then searched using a Monte Carlo procedure with an energy function favoring compact structures, buried hydrophobic residues, and paired beta strands. 10,000 - 400,000 independent simulations were carried out for the target sequence and homologous sequences (when available). Longer sequences were often parsed; smaller segments were folded and served as nuclei for folding the remainder of the chain.
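The fragment-based search described above can be illustrated with a toy sketch. It is not the Rosetta code: the chain is reduced to a list of (phi, psi) pairs, `fragments` is a hypothetical library mapping a start position to candidate torsion segments, `energy` is any user-supplied callable, and the fixed `kT` is an assumption (the abstract does not give the temperature schedule).

```python
import math
import random


def fragment_insertion_search(torsions, fragments, energy, steps=1000, kT=1.0, seed=0):
    """Monte Carlo search in which each move copies a library fragment into the chain.

    torsions:  list of (phi, psi) tuples, one per residue
    fragments: dict mapping start index -> list of fragments (lists of (phi, psi));
               every fragment is assumed to fit inside the chain
    energy:    callable scoring a torsion list (lower is better)
    """
    rng = random.Random(seed)
    current = list(torsions)
    e_cur = energy(current)
    best, e_best = list(current), e_cur
    positions = sorted(fragments)
    for _ in range(steps):
        start = rng.choice(positions)
        frag = rng.choice(fragments[start])
        trial = list(current)
        trial[start:start + len(frag)] = frag  # replace the window with the fragment
        e_new = energy(trial)
        # Metropolis test: always accept downhill, sometimes accept uphill.
        if e_new <= e_cur or rng.random() < math.exp(-(e_new - e_cur) / kT):
            current, e_cur = trial, e_new
            if e_cur < e_best:
                best, e_best = list(current), e_cur
    return best, e_best
```

Running many such searches from different seeds gives an ensemble of independent decoys, in the spirit of the 10,000 - 400,000 simulations mentioned in the text.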

For sequences longer than 110 amino acid residues, the resulting models were subjected to a filter which provided an even distribution of topologies generated during the Monte Carlo search procedure, and reduced the number of models with local contacts. The filtered models were then clustered as described in [3]. In some cases, representative decoys from each cluster were refined to improve the hydrogen bonding of their beta sheets. For proteins with fewer than 110 residues, decoys were scored with the energy function as described above, and the low free energy models were subjected to a Monte Carlo Minimization procedure to relieve backbone atomic clashes. Following this, sidechains were built onto the models using Dunbrack’s backbone dependent rotamer library and the method described in [4] and a similar Monte Carlo Minimization procedure was then used to minimize an all-atom energy function dominated by Lennard-Jones interactions, an orientation dependent hydrogen bonding term, and an implicit solvation model. Side chain conformations were periodically optimized using a full combinatorial optimization procedure. Models with the lowest free energy were selected.

Recent advances in the Rosetta method have been in the areas of decoy discrimination and improvement of the energy function for small proteins; and formation of beta sheets, generation of complex topologies and non-local contacts, and development of a protocol to identify decoys which have successfully incorporated these features for larger proteins. Attempts were made to improve secondary structure packing in all decoys. We have also attempted to compensate for incorrect secondary structure predictions in any given region of the polypeptide chain, and to increase the conformational space searched in regions where secondary structure could not be assigned with confidence. A new method, JUFO (manuscript in preparation), has been included in efforts to improve the accuracy of secondary structure prediction and aid generation of a more robust fragment library for a given sequence.

1. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 45 (S5), 119-26. 2. Simons K.T. et al. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol. Biol. 268 (1), 209-25. 3. Shortle D., Simons K.T. and Baker D. (1998) Clustering of low energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. USA 95 (19), 11158-62. 4. Kuhlman B. and Baker D. (2000) Native Protein sequences are close to optimal for their structures. Proc. Natl. Acad. Sci. USA 97 (19), 10383-8.

BAKER-ROBETTA (P0029) - 199 predictions: 199 3D

Automated Method for Full Chain Structure Prediction Using Rosetta

D. Chivian, D.E. Kim, C.A. Rohl, L. Malmstrom, J. Meiler, T. Robertson, and D. Baker University of Washington [email protected]

We have automated our basic comparative modeling and de novo protocols in an effort to determine the ability of the Rosetta [1] method to produce full chain models without human intervention. The server, called “Robetta”, provides de novo, comparative, or mixed models in which the appropriate method is selected for each putative domain. Additionally, the server provides a secondary structure prediction from the JUFO-3D [2] method.

Regrettably, the server had several shortcomings during the CAFASP-3 experiment. Much of the code was implemented just prior to the experiment and not properly tested. It was not entirely free of logical errors, and some of the models are probably quite poor for this reason. Additionally, the automated methods that were implemented for CAFASP-3 employed reduced protocols either in an effort to meet the time demands required of a server method, or because they could not be completed in time for the experiment. In the interest of brevity, this abstract will only discuss differences from the full de novo and comparative modeling protocols (for full protocols, please see the “baker group” methods abstracts in this volume and [3]).

(A) Homolog Detection and Domain Parsing. We developed a method, called “Ginzu”, to determine domains in the full chain of the query and assign them for de novo or comparative modeling. It consisted of sequential processing of the sequence with Blast, PSI-Blast, and Pcons2 [4] in order to identify regions of the query with parent PDB coverage. A Blast e-value of at most 0.001 or a Pcons2 confidence value of at least 1.5 was considered sufficient to justify comparative modeling. A single parent for each region of coverage was then selected based on confidence and length of coverage. Next, putative domain boundaries were determined for both comparative modeling and de novo regions of the chain. PSI-Blast detected homologous sequences were clustered by region of coverage and assigned to the query as non-overlapping domain regions in order of cluster size. Cut points between domains were assigned at positions of reduced occupancy in the PSI-Blast MSA and of loop strongly predicted by PSIPRED [5]. Domain lengths for de novo regions were forced to be shorter (not more than ~200 residues) than they probably often were in the native structure in recognition of the current limitation of Rosetta’s de novo protocol to produce good quality models for large domains when generating a small decoy ensemble.
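The decision between comparative and de novo modeling can be sketched as a simple threshold test. The sketch reads the Blast threshold as an upper bound on the e-value (an assumption, since smaller e-values are more significant) and the Pcons2 threshold as a lower bound on the confidence value; the function and parameter names are hypothetical and do not come from Ginzu.

```python
def choose_protocol(blast_evalue=None, pcons2_score=None,
                    evalue_cutoff=1e-3, pcons2_cutoff=1.5):
    """Return 'comparative' if either detection method is confident, else 'de novo'.

    blast_evalue: best Blast/PSI-Blast e-value against PDB, or None if no hit
    pcons2_score: Pcons2 confidence value, or None if not available
    """
    if blast_evalue is not None and blast_evalue <= evalue_cutoff:
        return "comparative"
    if pcons2_score is not None and pcons2_score >= pcons2_cutoff:
        return "comparative"
    return "de novo"
```

For example, `choose_protocol(blast_evalue=0.5, pcons2_score=2.1)` selects comparative modeling on the strength of the Pcons2 value alone.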

(B) Modeling. Domains were modeled either by the de novo or comparative modeling Rosetta protocol. Reductions to the full protocols included generating a smaller decoy ensemble for de novo domains, producing only one default-weighted K*SYNC alignment to the most confident parent for comparative modeling domains, and not rebuilding short and medium loops for unaligned regions in comparative models with our more rigorous protocol. Lastly, the final full chain model was produced trivially by spacing the coordinates of each domain model by 100 Angstroms.
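One possible reading of the trivial full chain assembly is sketched below. The abstract does not say how the 100 Angstrom spacing was applied, so centering each domain on its centroid and offsetting successive domains along the x axis are assumptions, as are all names.

```python
def space_domains(domains, spacing=100.0):
    """Concatenate domain models, placing successive centroids `spacing` apart on x.

    domains: list of domains, each a list of (x, y, z) coordinates in Angstroms
    Returns a single list of (x, y, z) coordinates for the full chain.
    """
    chain = []
    for k, coords in enumerate(domains):
        n = len(coords)
        cx = sum(p[0] for p in coords) / n
        cy = sum(p[1] for p in coords) / n
        cz = sum(p[2] for p in coords) / n
        for x, y, z in coords:
            # Center the domain at the origin, then shift domain k by k * spacing.
            chain.append((x - cx + k * spacing, y - cy, z - cz))
    return chain
```

The result keeps each domain model internally unchanged and makes no attempt to model the relative domain orientation.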

(C) Secondary Structure Prediction. JUFO-3D is a version of the JUFO neural-net secondary structure predictor that uses Rosetta de novo decoys or comparative models in addition to PSI-Blast multiple sequence information and an amino acid property profile to produce three-state predictions.

1. Simons K.T. et al. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol. Biol. 268 (1), 209-25. 2. Meiler J. and Baker D. (manuscript in preparation) 3. Bonneau R. et al. (2001) Rosetta in CASP4: progress in ab initio protein structure prediction. Proteins 45 (S5), 119-26. 4. Lundstrom J. et al. (2001) Pcons: a neural-network-based consensus predictor that improves fold recognition. Protein Sci. 10 (11), 2354-62. 5. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.

Baldi (P0021) - 61 predictions: 61 3D
Baldi-CONpro (P0022) - 62 predictions: 62 RR
Baldi-SSpro (P0023) - 63 predictions: 63 SS
CMap23Dpro (P0253) - 1 prediction: 1 3D
CMapPro (P0255) - 0 predictions
SSpro2 (P0254) - 65 predictions: 65 SS

Automated ab initio Prediction of Protein Structure Through Contact Maps by Recurrent Neural Networks

Gianluca Pollastri and Pierre Baldi University of California, Irvine [email protected]

The strategy we implemented to predict protein structure splits the problem into three stages, as described in [1]. The first stage corresponds to modules that predict structural features including secondary structure and relative solvent accessibility. The second stage corresponds to modules that predict the contact map of the protein at the amino acid level, using the primary sequence and the structural features. The final stage is the reconstruction of 3D coordinates of Cα atoms from the predicted contact map and secondary structure. All the steps are entirely automated and performed without any human intervention.

The methods we use for secondary structure and relative solvent accessibility prediction have been described in [2,3]. These methods try to overcome the limitations of simple feed-forward networks and consist of BRNNs (Bidirectional Recurrent Neural Networks) with the capability of capturing at least partial long-ranged information without overfitting. The recurrent neural networks are given as input PSI-BLAST profiles derived as described in [2]. Both in the case of secondary structure and relative solvent accessibility an ensemble of networks is used for the final prediction. Secondary structure predictions were submitted to CASP in two versions (SSpro2 and Baldi-SSpro), trained on different data sets. Versions of the methods are also freely available as web servers (SSpro, ACCpro) at the address: http://promoter.ics.uci.edu/BRNN-PRED/

In the second step, we go from the primary sequence and the structural features to the map of contacts between amino acids. Training a large neural network to directly predict 3D coordinates from primary sequence information is in fact likely to fail because the problem is highly degenerate. Translations and rotations leave the structure invariant but greatly change the 3D coordinates. In contrast, contact maps provide a topological representation of the structure that is invariant under rotation and translation. Furthermore, contact maps typically contain enough information to reconstruct the full structure even in presence of noise [4]. A previous attempt to predict protein contact maps is described in [5]. Our current approach to the problem rests on a generalization of the graphical model that underlies BRNNs, which process one-dimensional objects. The generalization of this architecture to two-dimensional objects, such as contact maps, is described in [6]. In its basic version the model consists of nodes regularly arranged in 6 planes: one input plane, one output plane, and 4 hidden planes. This graphical model is implemented with five feed-forward neural networks, four representing transitions in the hidden planes given the input, one representing the input-output transformation. The main advantages of this model are that it chooses automatically an optimal context to base its decision on, and that it can capture at least partial long-ranged information without overfitting. We submitted contact map predictions to CASP at 8 and 12 Angstrom (respectively as CMapPro and Baldi-CONpro). The 12 Angstrom predictor is trained on a larger data set and proves to be more reliable in our tests, especially on proteins of length greater than 100 amino acids.
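The invariance argument can be made concrete with a short sketch that computes a binary contact map from coordinates and checks that it is unchanged by a translation and a rotation. The choice of one representative point per residue, the strict "less than" comparison, and the function name are assumptions made for illustration.

```python
import math


def contact_map(coords, cutoff=8.0):
    """Binary contact map: entry (i, j) is 1 if residues i and j are within cutoff."""
    n = len(coords)
    return [[1 if math.dist(coords[i], coords[j]) < cutoff else 0
             for j in range(n)]
            for i in range(n)]


# Four residues spaced 3.8 Angstroms apart along a line.
chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]

# Rigid-body transforms change every coordinate but not the map.
translated = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in chain]
rotated = [(-y, x, z) for x, y, z in chain]  # 90 degree rotation about z

map_8 = contact_map(chain, cutoff=8.0)
map_12 = contact_map(chain, cutoff=12.0)
```

The two cutoffs correspond to the 8 and 12 Angstrom maps mentioned in the text; the wider cutoff marks more residue pairs as being in contact.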

In contrast with the first two stages that heavily rely on machine learning methods, the last reconstruction step is addressed using distance geometry and optimization techniques without learning. Our approach partly follows [4] but with a number of significant modifications due to the fact that, in our case, predicted maps differ from exact maps, as well as from random perturbations of exact maps by uniform additive noise. In particular, in order to deal with the specific properties of predicted contact maps we use: (1) semi-random moves of variable length (combining a random vector and an attraction vector directed towards putative contacts); (2) a bond length term in the energy function to deal with unphysical bond lengths introduced by the moves; and (3) a two-phase search with a first rough phase comprising large steps where only the predicted contact map contributes to the energy, and a second refinement phase comprising smaller steps that take into account the effects of chirality, bond length and amino acid hard-core repulsion forces. The submitted 3D structures are predicted using two different versions of the reconstruction algorithm: the first version (CMap23Dpro) uses a direct in-house implementation of [4], the other (Baldi) uses the modified version described above.
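A semi-random move of the kind listed in item (1) might look like the sketch below. The abstract gives no formula, so the linear blend controlled by `mix`, the use of the centroid of predicted partners as the attraction target, and all names are assumptions rather than the authors' method.

```python
import math
import random


def semi_random_move(coords, i, contacts, step=1.0, mix=0.5, rng=random):
    """Propose a new position for atom i.

    The displacement blends a random unit vector (weight `mix`) with a unit
    vector pointing toward the centroid of atoms predicted to contact atom i
    (weight 1 - mix), scaled by `step`.
    """
    x, y, z = coords[i]

    # Random direction from an isotropic Gaussian, normalized to unit length.
    rx, ry, rz = (rng.gauss(0.0, 1.0) for _ in range(3))
    rnorm = math.sqrt(rx * rx + ry * ry + rz * rz) or 1.0
    rx, ry, rz = rx / rnorm, ry / rnorm, rz / rnorm

    # Attraction direction toward the centroid of predicted contact partners.
    partners = [coords[j] for j in range(len(coords)) if j != i and contacts[i][j]]
    ax = ay = az = 0.0
    if partners:
        dx = sum(p[0] for p in partners) / len(partners) - x
        dy = sum(p[1] for p in partners) / len(partners) - y
        dz = sum(p[2] for p in partners) / len(partners) - z
        dnorm = math.sqrt(dx * dx + dy * dy + dz * dz)
        if dnorm > 0.0:
            ax, ay, az = dx / dnorm, dy / dnorm, dz / dnorm

    w = 1.0 - mix
    return (x + step * (mix * rx + w * ax),
            y + step * (mix * ry + w * ay),
            z + step * (mix * rz + w * az))
```

A two-phase search in the sense of item (3) could call this with a large `step` in the rough phase and a smaller one during refinement.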

1. Baldi P. and Pollastri G. (2002) Machine Learning Structural and Functional Proteomics, IEEE Intelligent Systems (Intelligent Systems in Biology II), March/April. 2. Pollastri G., Przybylski D., Rost B., Baldi P. (2002) Improving the Prediction of Protein Secondary Structure in Three and Eight Classes Using Recurrent Neural Networks and Profiles, Proteins, 47, 228-235. 3. Pollastri G., Baldi P., Fariselli P., Casadio R. (2002) Prediction of Coordination Number and Relative Solvent Accessibility in Proteins, Proteins, 47, 142-153. 4. Vendruscolo M., Domany E. (2000) Protein folding using contact maps. Vitam Horm. 58, 171-212. 5. Fariselli P, Olmea O, Valencia A, Casadio R. (2001) Prediction of contact maps with neural networks and correlated mutations, Protein Eng. Nov;14(11), 835-43. 6. Pollastri G, Baldi P. (2002) Prediction of contact maps by GIOHMMs and recurrent neural networks using lateral propagation from all four cardinal corners, Bioinformatics. Jul;18 Suppl 1, S62-S70.

Bass-Michael (P0384) - 51 predictions: 51 3D

A Threading Approach to Structure Prediction

M. Bass and R. Luethy Computational Biology, Amgen Inc. [email protected]

The threading approach used here was employed to test the accuracy of a threading method when applied to a variety of test sequences. In target sequences that are similar to a known structure, if threading produced a different alignment, the thread alignment was used to test if the method can improve the residue shift error in alignments.

The threading method uses residue-based statistical potentials. The potentials were calculated as log-odds of the interaction. Three potentials were used. Each potential was given equal weight. The surface area potential was evaluated by dividing the surface exposure into ten equal bins. The pairwise interaction potential was calculated by measuring the closest atom-atom contact between pairs of amino acids such that the pair of amino acids was at least five amino acids apart. Only interactions between 2.5Å and 12.5Å were counted. The backbone dihedral potential was calculated for each amino acid. The dihedral angles were divided into 20 equal bins and separated according to amino acid. The standard statistical potentials were calculated against a subset of the Protein Data Bank (July 2002 release) such that no two proteins share more than 35% sequence identity. This set was reduced by removing any structures that fail a self-thread test. That is, a sequence must be able to find its structure with the threading algorithm. This produced a unique subset of the Protein Data Bank containing 2399 structures. A similar subset of the Protein Data Bank was used for the query database. This subset contained proteins that share no more than 45% sequence identity (2875 structures). The algorithm produces gapped alignments without end penalties using an adaptation of the Needleman-Wunsch algorithm [1]. The gap creation penalty is 3.5 and the gap extension penalty is 0.7. The Z-score was calculated for all of the alignments and alignments producing a Z-score in excess of 5.0 were considered. WU-Blast [2] was also run for each target against the structural database to provide a comparison alignment.

The sequence alignment was converted into a three-dimensional structure by the following method. The alignments were converted to a C-alpha trace based on the coordinates of the template structure. Residues around any gaps in the alignment were allowed to vary according to the method of Luethy [3]. After the structure optimization, all-atom coordinates were constructed in the following way: first all coordinates from the PDB fragments were copied, then missing backbone atoms were inserted by looking up the closest 5 residue backbone fragment in PDB, finally missing side-chain atoms were copied from the closest 5 residue fragment from PDB with the same residue in the middle. The structure was then minimized using TINKER [4] using a steepest descent method with fixed C-alpha atoms.
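Two ingredients of the scoring described above, a binned log-odds potential and the Z-score filter, can be sketched as follows. The sign convention (positive values mean "more frequent than expected"), the pseudocount, the use of the population standard deviation, and the function names are assumptions; the abstract does not specify them.

```python
import math
import statistics


def log_odds(observed, expected, pseudocount=1.0):
    """Per-bin log-odds ln(P_observed / P_expected) from raw bin counts."""
    obs_total = sum(observed) + pseudocount * len(observed)
    exp_total = sum(expected) + pseudocount * len(expected)
    return [
        math.log(((o + pseudocount) / obs_total) / ((e + pseudocount) / exp_total))
        for o, e in zip(observed, expected)
    ]


def z_score(score, background):
    """Number of standard deviations by which `score` exceeds the background mean."""
    mu = statistics.mean(background)
    sigma = statistics.pstdev(background)
    return (score - mu) / sigma


def passes_filter(score, background, cutoff=5.0):
    """Keep an alignment only if its Z-score exceeds the cutoff (5.0 in the text)."""
    return z_score(score, background) > cutoff
```

Here `observed` would hold, for example, counts of a residue type in each of the ten surface exposure bins, and `background` the threading scores of the same sequence against the rest of the fold library.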

1. Needleman S.B. and Wunsch C.D. (1970) A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol., 48, 443-453. 2. Gish W. (1996-2002) http://blast.wustl.edu 3. Luethy R. (2002) Unified Prediction Approach for Comparative Modeling and ab initio Predictions. CASP5 Abstract. 4. Ponder J.W. and Richards F.M. (1987) An Efficient Newton-like Method for Molecular Mechanics Energy Minimization of Large Molecules. J. Comput. Chem., 8, 1016-1024 (http://dasher.wustl.edu/tinker/)

Bates-Paul (P0096) - 72 predictions: 72 3D

Comparative Modelling By In Silico Recombination of Templates, Alignments and Models

Bruno Contreras-Moreira, Paul W. Fitzjohn, Marc Offman, Graham R. Smith and Paul A. Bates Biomolecular Modelling Laboratory Cancer Research UK - London Research Institute [email protected]

After the CASP4 assessment it was concluded that template selection and sequence alignment remain the main problems awaiting solution in the field of comparative modelling [1]. Models were rarely found to be closer to the experimental structures than the optimal template and often manual intervention only marginally improved their quality. Similar problems were found in the fold recognition category [2,4], suggesting that the same approach may be applied in the search for possible solutions in both fields. During CASP5 our group has tested a novel procedure to tackle these problems. This new method was used to generate models for all 67 targets, with roughly half of them classified as fold recognition targets by the CAFASP3 meta-server (www.cs.bgu.ac.il/~dfischer/CAFASP3). This procedure is named in silico protein recombination, as it is a computational implementation of genetic recombination, a well known mechanism for generating population variability, but at the protein level. For each CASP5 target a population of models was generated from a variety of templates and sequence alignments. Care was taken to ensure that models had similar length and were complete, adding missing loops when necessary and smoothing their phi/psi geometry to permit later energy calculations and minimizations. The algorithm can be outlined as:

initial population of models
(1) grow population: r recombination + (1-r) mutation
(2) select best proportion according to fitness
converged? stop : otherwise back to (1)

This is a standard genetic algorithm with two genetic operators (recombination and mutation) and a fitness function acting as an artificial selection agent. We will now briefly describe each step in the protocol.

Initial population of models. Initially, our server Domain Fishing [3] (www.bmm.icnet.uk/servers/3djigsaw/dom_fish) was used to define protein domains within each target sequence and to find suitable modelling templates. Resulting alignments were inspected and corrected if suspected to be incorrect. If reasonable alternative alignments could be found they too were added to the pool. When possible, only alignments with bit-scores (average pssm-logodds+secondary structure agreement/residue) around 2 were selected. In several cases annotations from the templates or their corresponding PFAM families were used to check the correctness of the alignment in active/binding sites. Usually several models were built using the same template changing parts in the alignment. Models from these alignments were built using our server 3D-JIGSAW [4] (www.bmm.icnet.uk/servers/3djigsaw). Additional models were obtained from the CAFASP3 server after inspection of the alignments to gain extra variability in sequence alignments, templates used and exposed loops. These models were taken from different sources, including FAMS (physchem.pharm.kitasatou.ac.jp/FAMS), Pmodeller (www.sbc.su.se/~arne/pcons) and EsyPred3D (www.fundp.ac.be/urbm/bioinfo/esypred). Models were inspected and missing parts, typically loops, added using in-house software before going to the next step. In essence, this software explores phi/psi space to allow a peptide (the missing loop) to connect a gap in a protein fold.

1. Growing the population by recombination and mutation. The initial population was grown by randomly selecting pairs of protein models and applying one of the two possible operators. In the case of recombination, the models were superimposed based on their sequence alignment and a crossover point drawn. Crossover was not permitted inside secondary structure elements. The resulting recombinant model inherits the N-terminus from one parent and the C-terminus from the other. In mutation events (occurring with frequency 1-r, where r is the recombination probability) a new protein model was obtained by simply averaging its parents' coordinates after superimposition. In many cases this process produced distorted side-chain conformations.

2. Selecting the best proportion. Fitness function. The whole idea of the algorithm is that it should be possible to obtain optimized mosaic models by shuffling them in a rational way. The key point in this approach is thus the choice of an appropriate fitness function. After some benchmarking experiments (unpublished results) we chose a function that calculates a free energy estimate based on two terms: protein contact pair-potentials and side-chain solvation energies estimated from their solvent accessible area. This function seems to yield a consistent measure of protein structural quality. When each population reaches the upper limit (between 2 and 4 times its initial size), this energy function is used to rank its members. Only the worst 25% of the population is discarded at this point, to assure that quality models are not lost prematurely.

3. Convergence criterion and final refinements. When the population has converged to similar energies, there is no room for further generation of variability and the evolution process stops. At this point the final population is inspected. In most cases this consists of several representations of the same protein conformation with average backbone deviations in the order of 0.1Å. One of these representatives is then taken as the final model, which is carefully inspected to detect unfavorable peptide conformations and a final energy minimization using the CHARMM22 force field is performed. This procedure is able to fix distorted side-chains. At this point we have a CASP5 unrefined model. In addition, for targets T0134, T0165, T0177 and T0185 we tested a further refinement step consisting of running an all-atom, molecular dynamics simulation inside a water box, with neutral total charge for around 0.5ns. For these simulations we used the GROMACS package (www.gromacs.org) and the OPLSAA force field. Snapshots taken from the trajectory were clustered according to average backbone deviations and one conformation from the most populated cluster was selected. After a few rounds of CHARMM22 energy minimization, it was submitted as a refined model. Insufficient computer resources prevented us from refining all targets.
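The three steps above can be condensed into a toy sketch of the recombination loop. It is not the authors' software: models are reduced to flat lists of numbers standing in for superimposed coordinates, the crossover point is drawn anywhere along the chain (the restriction to positions outside secondary structure elements is omitted), and the energy spread test used for convergence, the parameter defaults, and all names are assumptions.

```python
import random


def recombine(a, b, point):
    """Child inherits the N-terminal part of a and the C-terminal part of b."""
    return a[:point] + b[point:]


def average(a, b):
    """'Mutation' operator: average the parents' coordinates."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def evolve(models, energy, r=0.5, growth=2, tol=1e-3, max_rounds=100, seed=0):
    """Grow, rank and cull a population of models until energies converge.

    models: at least two equal-length coordinate lists
    energy: callable returning the fitness of a model (lower is better)
    r:      probability of recombination; mutation occurs with probability 1 - r
    growth: upper size limit as a multiple of the initial population size
    """
    rng = random.Random(seed)
    pop = [list(m) for m in models]
    limit = growth * len(pop)
    for _ in range(max_rounds):
        # (1) Grow the population up to its upper limit.
        while len(pop) < limit:
            a, b = rng.sample(pop, 2)
            if rng.random() < r:
                child = recombine(a, b, rng.randrange(1, len(a)))
            else:
                child = average(a, b)
            pop.append(child)
        # (2) Rank by energy and discard only the worst 25%.
        pop.sort(key=energy)
        pop = pop[:max(2, (3 * len(pop)) // 4)]
        # Stop once the surviving models have similar energies.
        if energy(pop[-1]) - energy(pop[0]) < tol:
            break
    return pop[0]
```

Because the best-ranked models always survive the cull, the returned model is never worse, by the chosen energy, than the best member of the initial population.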

1. Tramontano A., Leplae R. and Morea V. (2001) Analysis and Assessment of Comparative Modeling Predictions in CASP4. Proteins suppl 5, 22-38. 2. Sippl M.J., Lackner P., Domingues F.S., Prlic A., Malik R., Andreeva A. and Wiederstein M. (2001) Assessment of the CASP4 Fold Recognition Category. Proteins suppl 5, 55-67. 3. Contreras-Moreira B. and Bates P.A. (2002) Domain Fishing: a first step in protein comparative modelling. Bioinformatics 18, 1141-1142. 4. Bates P.A., Kelley L.A., MacCallum R.M. and Sternberg M.J.E. (2001) Enhancement of Protein Modelling by Human Intervention in Applying the Automatic Programs 3D-JIGSAW and 3D-PSSM. Proteins suppl 5, 39-46. (www.bmm.icnet.uk/servers/3djigsaw)

Benner-steve (P0524) - 35 predictions: 18 3D, 17 SS

Evolution-based Structure Prediction

D.W. De Kee, T.J. McCormack, and S.A. Benner Foundation for Applied Molecular Evolution P.O. Box 13174, Gainesville FL 32604 [email protected]

Predictions for fourteen CASP5 ab initio targets were submitted in a collaborative effort to explore the potential for predicting secondary structure in the transparent secondary structure prediction method [1]. The targets were selected based on the availability of homologous protein sequences in adequate numbers and evolutionary distributions in the MasterCatalog, a commercial naturally organized database developed in collaboration with EraGen Biosciences (Madison, WI).

Multiple alignments were generated using the automated DARWIN-server [2]. Secondary structures were predicted based on automated heuristics to assign surface, interior, active site and parsing residues by analysis of patterns of conservation and variation among homologous protein sequences in light of evolutionary models that interpret amino acid substitutions as the consequence of neutral variation subjected to functional constraints [3].

For the targets with a homolog whose structure has been solved, multiple alignment trials were performed. The alignments were executed with different gap-opening and gap-extension penalties. The alignments were then evaluated by visualizing them in relation to the solved structure, on the assumption that the greatest sequence variation lies outside the boundaries of conserved secondary structure motifs, i.e., α-helices and β-strands. In addition, further homologous sequences were added to the alignments to obtain a family profile, which allowed us to optimize the alignments, since key residues are more likely to be universally conserved.

1. Benner S.A. and Gerloff D.L. (1991) Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure: a prediction of the structure of the catalytic domain of protein kinases. Adv. Enzyme Regul. 31, 121-181.
2. Gonnet G.H. et al. (1992) Exhaustive matching of the entire protein sequence database. Science 256, 1443-1445.
3. Benner S.A. et al. (1994) Bona fide prediction of aspects of protein conformation. J. Mol. Biol. 235, 926-958.

Benner-steve (P0524) - 35 predictions: 18 3D, 17 SS

Evolution-based Structure Prediction Tools

Steven A. Benner, Danny De Kee, Thomas McCormack University of Florida, Foundation for Applied Molecular Evolution email: [email protected]

In 1992, the first convincing tools were introduced for predicting protein conformation from sequence data. These started with a set of aligned homologous sequences for proteins diverging under functional constraints (1). They were applied to the two ab initio targets presented in the CASP 1 prediction contest, phospho-beta-galactosidase and synaptotagmin, and generated correct tertiary structure models for both. The judges noted that these represented the first two successful ab initio predictions in the CASP program (2). In CASP 2, these tools generated another prediction, this time for the heat shock protein 90 (3). Here, the prediction was sufficiently accurate that it correctly assigned HSP90 as a distant homolog of gyrase, generated a functional hypothesis for HSP90, and identified as incorrect certain interpretations of experimental data concerning the function of HSP90.

Outside of the CASP project, the tools have been used to analyze the structures of protein kinase, the pleckstrin homology domain, and ribonucleotide reductase, among others. In each case the outcome went beyond modelling the fold and answered questions of interest to biologists and biomedical researchers working with these systems (4). A version of the method has been applied to every protein sequence family in GenBank, and these predictions are incorporated into the MasterCatalog, an interpretive proteomics database marketed by EraGen Biosciences (Madison WI) (5). The MasterCatalog helps identify diagnostic and therapeutic targets, assess the value of animal models for human disease, and correlate genomic data with function, starting with pathway interactions and extending to the cell, organism, ecosystem, and planetary biosphere (6).

At the time that they were introduced, it was clear that evolution-based structure prediction methods suffer from specific weaknesses inherent in their formulation. These weaknesses are expected regardless of the details of their implementation. Thus, the PHD tool, which implements the same basic idea in the form of a neural network, is expected to suffer from the same weaknesses, and this has been suggested anecdotally. The purpose of our participation in CASP5 is to generate a reference record of predictions made using the 1992 method, which is described in detail both in (1) and in the patent literature (7).

1. Benner S.A., Gerloff D.L. (1991) Patterns of divergence in homologous proteins as indicators of secondary and tertiary structure. The catalytic domain of protein kinases. Adv. Enzyme Regulat. 31, 121-181.
2. DeFay T., Cohen F.E. (1995) Evaluation of current techniques for ab initio protein structure prediction. Proteins 23, 431-445.
3. Gerloff D.L., Cohen F.E., Korostensky C., Turcotte M., Gonnet G.H., Benner S.A. (1997) A predicted consensus structure for the N-terminal fragment of the heat shock protein HSP90 family. Proteins: Struct. Funct. Genet. 27, 450-458.
4. Benner S.A., Cannarozzi G., Chelvanayagam G., Turcotte M. (1997) Bona fide predictions of protein secondary structure using transparent analyses of multiple sequence alignments. Chem. Rev. 97, 2725-2843.
5. Benner S.A., Chamberlin S.G., Liberles D.A., Govindarajan S., Knecht L. (2000) Functional inferences from reconstructed evolutionary biology involving rectified databases. An evolutionarily-grounded approach to functional genomics. Research Microbiol. 151, 97-106.

Bilab (P0080) - 200 predictions: 200 3D

Tertiary Structure Prediction of Proteins Using Probability Maps of Mainchain Torsion Angles for New Fold Targets and Comparative Modeling Method for Other Targets

S. Nakamura1, T. Nishimura2, T. Ishida1, T. Miki1, J. Sasaki1, K. Hibi1, T. Ishizuka1,3 and K. Shimizu1 1 - Department of Biotechnology, the University of Tokyo, 2 - Graduate School of Humanities and Sociology, the University of Tokyo, 3 - Faculty of Industrial Science and Technology, Tokyo University of Science [email protected]

We submitted tertiary structure prediction models for most CASP5 target proteins, the exceptions being T0136 and T0145. First, we searched the Protein Data Bank for structural templates for the target sequence using PSI-BLAST and the 3D-PSSM server. When no template could be found for the target, we used "ABLE", an ab initio protein structure modeling tool developed in our laboratory, to produce prediction models. Otherwise we used MODELLER to build prediction models based on the alignments of the templates and the target.

Modeling with ABLE was based on minimization of a statistical potential energy by simulated annealing. First, we built probability maps for the mainchain torsion angles (phi-psi) at each position of the target sequence. Sequence similarity scores were calculated between the nine-residue sub-sequence of the target at each position and all fragments of the same length from a tertiary structure database. As this database we used the NCBI non-redundant PDB (nrpdb) list with a cutoff p-value of 1.0e-7. Membrane proteins and proteins with irregular residues, chain breaks, or missing sidechains were eliminated from the list, leaving 1164 chains in total. The similarity score function was similar to that of Fischer et al. [1] and included sequence identity and the matching of secondary structure. The BLOSUM62 matrix was used for the calculation of the sequence identity. Secondary structure prediction of the target was performed using the PSIPRED server. Weight factors were used to emphasize matching at the center of the nine-residue window. Probability maps of mainchain phi-psi torsion angles were obtained from the phi-psi values of the amino acids at the center of all nine-residue fragments with similarity scores larger than a threshold. In this procedure, fragments with higher similarity scores were given greater influence. Gaussian smoothing was applied to these maps.
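The construction of one phi-psi probability map can be sketched as follows. This is only an illustration under stated assumptions: the 10-degree bin width, the "score minus threshold" weighting, the Gaussian width and the function name `build_map` are all hypothetical choices, since the abstract does not give these details.

```python
import math

BIN = 10                      # bin width in degrees (assumed, not from the abstract)
NBINS = 360 // BIN            # 36 bins per torsion angle


def build_map(fragments, threshold, sigma_bins=1.0):
    """Return an NBINS x NBINS probability map for one target position.

    fragments: iterable of (score, phi, psi) for the central residue of each
    matching nine-residue fragment; angles in degrees in [-180, 180).
    """
    raw = [[0.0] * NBINS for _ in range(NBINS)]
    for score, phi, psi in fragments:
        if score <= threshold:
            continue                      # only fragments above the threshold count
        i = int((phi + 180.0) // BIN) % NBINS
        j = int((psi + 180.0) // BIN) % NBINS
        # Higher-scoring fragments contribute more (assumed weighting form).
        raw[i][j] += score - threshold

    # One-dimensional Gaussian kernel, truncated at 3 sigma.
    radius = max(1, int(math.ceil(3 * sigma_bins)))
    kernel = {d: math.exp(-0.5 * (d / sigma_bins) ** 2)
              for d in range(-radius, radius + 1)}

    smooth = [[0.0] * NBINS for _ in range(NBINS)]
    for i in range(NBINS):
        for j in range(NBINS):
            if raw[i][j] == 0.0:
                continue
            for di, wi in kernel.items():
                for dj, wj in kernel.items():
                    # Torsion angles are periodic, so bin indices wrap around.
                    smooth[(i + di) % NBINS][(j + dj) % NBINS] += raw[i][j] * wi * wj

    total = sum(map(sum, smooth))
    if total == 0.0:
        # No fragment passed the threshold: fall back to a uniform map.
        return [[1.0 / NBINS ** 2] * NBINS for _ in range(NBINS)]
    return [[v / total for v in row] for row in smooth]
```

Each map is normalized to sum to one, so it can be sampled directly when proposing new torsion angles for that position.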

After building probability maps for each amino acid, a number of tertiary structure models of the target were produced by simulated annealing, using these maps to minimize the potential energy. The potential energy function was a modification of that of Simons et al. [2]. The degree of buriedness of each amino acid, contacts between residues, and the average distance between hydrophobic residues were used to evaluate the match between the sequence and the structure; hydrogen bonds between mainchains, packing of secondary structure segments, excluded volume to avoid overlap of residues, and radius of gyration were used to evaluate the plausibility of the model as a protein tertiary structure. When we could not obtain compact structures for a target, distance restraints between several residue pairs were added to the potential energy function. The weight factor for each energy term was adjusted for each target to obtain compact model structures and was changed as the simulated annealing progressed. At each simulated annealing step, we changed the mainchain phi-psi torsion angles at a random position to random values drawn according to the probability maps. About 200 to 5000 structures were produced for each target by simulated annealing (about 30000 to 200000 steps per run), followed by clustering of these structures. Up to five structures nearest to the centers of large clusters were selected, and sidechain modeling was performed for these structures using SCWRL. The order of submission was determined by manual inspection of these structures.
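The annealing loop described above can be sketched as a Metropolis search in which every move redraws the phi-psi pair at one random position. The sketch is illustrative only: `sample_torsion` and `energy` are hypothetical callables standing in for map-based sampling and for the modified potential, and the geometric cooling schedule and temperatures are assumptions, not values from the abstract.

```python
import math
import random


def anneal(n_res, sample_torsion, energy, steps=30000,
           t_start=2.0, t_end=0.1, seed=0):
    """Simulated annealing over per-residue (phi, psi) pairs.

    sample_torsion(pos, rng) -> (phi, psi) drawn for position pos
    energy(angles)           -> float, lower is better
    Returns the lowest-energy angle list seen and its energy.
    """
    rng = random.Random(seed)
    angles = [sample_torsion(pos, rng) for pos in range(n_res)]
    e = energy(angles)
    best, best_e = list(angles), e
    for step in range(steps):
        # Geometric cooling from t_start to t_end (assumed schedule).
        t = t_start * (t_end / t_start) ** (step / max(1, steps - 1))
        pos = rng.randrange(n_res)
        old = angles[pos]
        angles[pos] = sample_torsion(pos, rng)   # move: redraw one position
        e_new = energy(angles)
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            e = e_new
            if e < best_e:
                best, best_e = list(angles), e
        else:
            angles[pos] = old                    # reject: restore previous angles
    return best, best_e
```

Running many independent calls with different seeds would give the pool of structures that is subsequently clustered.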

Our procedure for modeling with MODELLER was as follows. We searched for templates for the target using PSI-BLAST and 3D-PSSM. One or more templates were selected considering their matching scores, the agreement between their secondary and tertiary structures, and the results of secondary and tertiary structure prediction of the target by CAFASP3 servers. The sequence alignment of the templates and the target sequence was first obtained using PSI-BLAST or 3D-PSSM, and then modified manually according to the secondary structures of each sequence and the results of tertiary structure modeling by MODELLER. When more than one template was used, all possible combinations of templates were tried for the alignments. About 20 to 500 models were produced for each alignment using MODELLER, and a few models were selected according to the plausibility of their secondary structures and the compactness of the models. If there were no templates for some parts of the target sequence, ABLE with distance restraints between residue pairs was used for those parts of the target. Sidechain modeling was performed for these structures using SCWRL. Finally, up to five models were selected and the order of submission was determined by manual inspection.

1. Fischer D. et al. (1996) Protein fold recognition using sequence-derived predictions. Protein Science 5 (5), 947-955.
2. Simons K.T. et al. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34 (1), 82-95.
3. Benner S.A., Caraco M.D., Thomson J.M., Gaucher E.A. (2002) Planetary biology. Paleontological, geological, and molecular histories of life. Science 296, 864-868.
4. Benner S.A. (1999) Predicting Folded Structures of Proteins. US Patent 5,958,784.

BioInfo.PL (P0006) - 75 predictions: 75 3D

3D-Jury

Leszek Rychlewski BioInfoBank Institute, Poznań, Poland [email protected]

3D-Jury is a simple consensus structure prediction system, which shares similarities with solutions employed in the field of ab initio fold prediction. Recent advances in this area can be credited to the application of non-energetic constraints such as preferences for high contact order or the detection of clusters of abundant conformations. Experience with ab initio prediction methods led to the conclusion that averages of the low-energy conformations obtained most frequently by folding simulations are closer to the native structure than the conformation with the lowest energy. Translated directly into the field of fold recognition by threading methods, this finding would mean that the most abundant high-scoring models are closer to the native structure than the model with the highest score. This is the main rationale behind the 3D-Jury approach.

3D-Jury takes as input groups of models generated by a set of servers. All models are compared with each other and a similarity score is assigned to each pair, equal to the number of C-alpha atom pairs that are within 3.5 Å after optimal superposition. If this number is below 40, the pair of models is annotated as not similar and the score is set to zero. The cutoff value of 40 was taken from previous benchmarking results and indicates a roughly 90% chance that both models belong to the same fold class. The final 3D-Jury score of a model is the sum of all similarity scores of considered model pairs divided by the number of considered pairs plus one. The 3D-Jury system can operate in two modes, which differ in the allowed set of considered model pairs. The best-model mode (3D-Jury-single) allows only one model from each server to be used in the sum, while the all-models mode (3D-Jury-all) allows all models of the servers to be considered:

N Ni   sim(M a,b , M i, j ) 3DjuryAll(M )  i j,ai OR bj a,b N Ni 1  1 i j,ai OR bj

N Ni max sim(M a,b , M i, j )  j,ai OR bj i 3DjuryAll(M a,b )  N 1 1 i

sim(M a,b , M i, j ) : similarity score between model M a,b and model M i, j 3DjuryAll : 3D - Jury score in the all - models - mode 3DjurySingle : 3D - Jury score in the best - model- mode

M a,b : model number b from the server a

M i, j : model number j from the server i N : number of servers

N i : Number of top ranking models from the server i (maximum 10)
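The two scoring modes can be sketched directly from the definitions above. The function name `jury_scores`, the dictionary layout of the input and the user-supplied `sim` callable (returning the raw count of superimposed C-alpha pairs) are assumptions made for this illustration; the superposition itself is not implemented here.

```python
def jury_scores(models, sim, cutoff=40):
    """Compute 3D-Jury scores in both modes.

    models: dict mapping server name -> list of models (at most 10 each)
    sim:    callable(m1, m2) -> number of C-alpha pairs within 3.5 A
            after optimal superposition (supplied by the caller)
    Returns dict (server, index) -> (score_all, score_single).
    """
    def thresholded(m1, m2):
        v = sim(m1, m2)
        return v if v >= cutoff else 0     # pairs below the cutoff score zero

    n_servers = len(models)
    scores = {}
    for a, models_a in models.items():
        for b, m_ab in enumerate(models_a):
            total = 0        # numerator of the all-models score
            n_pairs = 0      # number of considered pairs
            best_sum = 0     # numerator of the best-model score
            for i, models_i in models.items():
                best = 0
                for j, m_ij in enumerate(models_i):
                    if a == i and b == j:
                        continue           # a model is never paired with itself
                    v = thresholded(m_ab, m_ij)
                    total += v
                    n_pairs += 1
                    best = max(best, v)
                best_sum += best           # only the best model of each server
            scores[(a, b)] = (total / (1 + n_pairs), best_sum / (1 + n_servers))
    return scores
```

The model with the highest score in the chosen mode would then be selected as the consensus prediction.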

The 3D-Jury system does not directly use the reliability scores assigned to the models by the servers. This does not necessarily mean that the information in the original scores is lost. Highly reliable models produced by fold recognition methods can be expected to have fewer ambiguities in their alignments to template structures, which would result in higher similarity between models built on templates with the same fold and, finally, in higher 3D-Jury scores.

Biokol (P0258) - 23 predictions: 23 3D

Low-Mode Optimization

I. Kolossváry1,2 1 – BIOKOL Research LLC, 2 – Budapest University of Technology and Economics [email protected]

The author introduced the low-mode conformational search procedure (LMOD) a few years ago [1] for automated small molecule conformational analysis and has further developed it to treat larger and larger systems with applications to flexible active site docking [2], protein loop optimization [3] and most recently, fully flexible induced fit docking [4]. The CASP5 experiment has been my first attempt to use LMOD for making ab initio fold predictions.

The success of LMOD can mainly be attributed to taking full advantage of correlation among the moving degrees of freedom. While the total number of degrees of freedom in a large molecular system is too large for them to be treated independently, the high degree of correlation allows for a significant reduction in dimensionality. Conformational interconversions proceed via concerted atomic motions and can be described with only a few, non-correlated degrees of freedom in terms of low-frequency/large-amplitude vibrational modes. LMOD automatically generates its own low-dimensional search space, which is spanned by the low-frequency mode eigenvectors of the Hessian matrix. LMOD can explore large systems efficiently, because the effective number of degrees of freedom associated with collective motions responsible for conformational dynamics is rather small even for proteins.

Another way of looking at LMOD is by referring to domain motions. A domain can be defined – with respect to a particular vibrational mode – as a set of atoms in a protein, which move only a little with respect to each other, but move considerably with respect to the rest of the protein. In other words, the concerted motion of semi-rigid parts – the domains – of the protein can often describe protein motions in a simplified way. It is clear that the required number of degrees of freedom to describe such domain motion is only a tiny fraction of the total number of degrees of freedom. It is instructive to note that the conformational dynamics of certain classes of molecules can be described by different types of “natural” motions, such as torsional rotation for small to medium size acyclic molecules, or certain “kayaking” and “flapping” motions for cycloalkanes [5]. For proteins, domain motion is the natural motion, not torsional rotation. Levinthal’s paradox does not apply to LMOD, because LMOD operates in normal mode space, not in torsion space.

LMOD has been utilized in CASP5 to refine crude protein models by optimizing backbone and side chain conformations simultaneously. The initial models for each prediction were selected from a diverse set of about 100,000 low-energy Cα-trace folds obtained by threading. The threading models were clustered and the ten lowest energy cluster representatives were subjected to LMOD optimization. First, all ten Cα models were turned into all-atom models by fitting an approximate backbone and attaching side chains. The idea of LMOD optimization is to refine and relax the backbone and the side chains simultaneously by applying large-amplitude structural changes along the low-frequency vibrational modes. The particular modes and their amplitudes were selected randomly [1-4] to generate a mainly downhill trajectory on the potential energy landscape (AMBER94 with GB/SA solvation). The five lowest energy structures found during the course of several independent LMOD runs were submitted to CASP5. LMOD runs were carried out on Linux clusters using MacroModel 8.0. A universal LMOD optimization package will be available from the author.

1. Kolossváry I., Guida W.C. (1996) Low mode search. An efficient, automated computational method for conformational analysis: Application to cyclic and acyclic alkanes and cyclic peptides. J. Am. Chem. Soc. 118 (21), 5011-5019.
2. Kolossváry I., Guida W.C. (1999) Low-mode conformational search elucidated. Application to C39H80 and flexible docking of 9-deazaguanine inhibitors into PNP. J. Comput. Chem. 20 (15), 1671-1684.
3. Kolossváry I., Keserü G.M. (2001) Hessian-free low-mode conformational search for large scale protein loop optimization: Application to c-jun N-terminal kinase JNK3. J. Comput. Chem. 22 (1), 21-30.
4. Keserü G.M., Kolossváry I. (2001) Fully flexible low-mode docking. Application to induced fit in HIV integrase. J. Am. Chem. Soc. 123 (50), 12708-12709.
5. Kolossváry I., Guida W.C. (1993) Comprehensive conformational analysis of the four- to twelve-membered ring cycloalkanes: Identification of the complete set of interconversion pathways on the MM2 potential energy hypersurface. J. Am. Chem. Soc. 115 (6), 2107-2119.

Bion (P0474) - 63 predictions: 63 SS

Secondary Structure Prediction with Shuffled Training by SPAM

R. Shigeta and J.P. LeFlohic Bion Bioinformatics Consulting [email protected]

This instance of the Structure Prediction Application Metatool (SPAM) uses two sequential neural networks. The first is a 15-75-3 sequence-to-structure network which takes as input the actual residue and a PSIBLAST [2] position specific sequence profile (PSSM). Similar to the JNET architecture [1], a window of output from the first neural network is fed into a 15-55-3 structure-to-structure network. SPAM also feeds a copy of the original residue and the PSSM probability data to the second network.

A non-redundant set of 504 protein sequences and structures from the Protein Data Bank [3] was used as the training set, with a random 114 set aside as a non-trained test set. Proteins without any identified secondary structure were discarded. Upon loading, the sequences are broken into window-length training patterns and shuffled such that the neural network is presented with each class of secondary structure at each training step and with a similar number of examples of each structure.

Training proceeds in epochs until all the errors from the neural networks in the application cease to change more than an epsilon value which must be assigned by hand, between 1e-3 and 1e-5. A training epoch is defined as the presentation of 10,000 patterns, and so the training cycles do not contain exactly the same data.

The final prediction of beta, helix, or coil is selected by choosing the highest of the three outputs for each residue. No weighting is applied to the outputs.

The confidence is calculated in the standard way as the difference between the highest float output and the second highest one. CASP entries were then edited by hand for improbable patterns in secondary structure.
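The final class assignment and confidence described above amount to a few lines of code. In this sketch the output ordering (beta, helix, coil), the one-letter labels and the function name `predict` are assumptions made for illustration.

```python
CLASSES = ("E", "H", "C")   # beta, helix, coil; the output ordering is assumed


def predict(outputs):
    """Turn per-residue network outputs into (class, confidence) pairs.

    outputs: sequence of 3-tuples of floats, one tuple per residue.
    The class is the highest unweighted output; the confidence is the gap
    between the highest and the second-highest output.
    """
    result = []
    for triple in outputs:
        ranked = sorted(range(3), key=lambda k: triple[k], reverse=True)
        top, second = ranked[0], ranked[1]
        result.append((CLASSES[top], triple[top] - triple[second]))
    return result
```

A residue whose two largest outputs are nearly equal therefore receives a confidence close to zero, which flags it for the manual editing step.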

1. Cuff J.A. and Barton G.J. (1999) Application of enhanced multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 40, 502-511.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
3. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. (2000) The Protein Data Bank. Nucleic Acids Research 28, 235-242.

Bionomix (P0475) - 61 predictions: 61 3D

STRUCTFAST: Structure Realization Utilizing Cogent Tips From Aligned Structural Templates

J.F. Danzer and D.A. Debe Eidogen, Inc. [email protected]

While the alignment methods used in comparative modeling techniques have recently begun to incorporate structural information from the homology template, current techniques still do not capture much of the available information from a multiple structure profile [1-2]. We have developed a novel dynamic programming algorithm, STRUCTFAST, uniquely capable of incorporating important information from a structural family directly into the sequence alignment process [3]. This fully automated algorithm routinely produces alignments that meet or exceed the quality obtained by an expert human homology modeler.

Models are constructed from the automated alignments using a variant of the assembly of rigid bodies technique [4] and unconserved side chains are built using a standard rotamer library [5]. Models are refined via a standard minimization procedure employing the Dreiding force field [6] in conjunction with a surface area based solvation potential [7]. Quality of the final model is assessed using the ProsaII program [8]. A final pG reliability score is computed from the ProsaII scores [9].

1. Kelley L.A. et al. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299 (2), 499-520.
2. Shi J. et al. (2001) FUGUE: Sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310 (1), 243-257.
3. Debe D.A. et al. Unpublished work.
4. Marti-Renom M.A. et al. (2000) Comparative protein structure modeling of genes and genomes. Annu. Rev. Biophys. Biomol. Struct. 29, 291-325.
5. Tuffery P. et al. (1991) A new approach to the rapid-determination of protein side-chain conformations. J. Biomol. Struct. Dyn. 8 (6), 1267-1289.
6. Mayo S.L. et al. (1990) Dreiding -- A generic force-field for molecular simulations. J. Phys. Chem. 94 (26), 8897-8909.
7. Danzer J.F. et al. Unpublished work.
8. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17 (4), 355-362.
9. Sanchez R. and Sali A. (1998) Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. USA 95 (23), 13597-13602.

Braun-Werner (P0024) - 65 predictions: 65 3D

Evaluating Sequence Alignment of Fold-Recognition Tools by Quantitative Scoring of Physical-Chemical Property Based Motifs

Venkatarajan Mathura, Ovidiu Ivanciuc, Numan Oezguen, Catherine Schein, Yuan Xu and Werner Braun Sealy Center for Structural Biology, Department of Human Biological Chemistry and Genetics, University of Texas Medical Branch, Galveston, TX 77555 [email protected]

We participated in CASP5 to test, in a systematic way, our new methods for incorporating sequence decomposition tools into our modeling methods. Protein sequences of the CASP targets were separated into sequential motifs which were then used to judge and improve the alignment between templates and targets given by fold recognition methods. As in CASP4, we used the 3D modeling suite EXDIS/DIAMOD/FANTOM to generate 3D models of the targets, but this time the modeling was facilitated by incorporating the individual programs into one central tool, MPACK.

Sequence motifs characterizing the protein families of the CASP5 targets were automatically generated by our program MASIA [1]. These motifs are based on conservation of physical-chemical properties, as we have previously demonstrated that even distantly related proteins share contiguous segments with similar patterns of physical-chemical properties. We have recently developed five-dimensional quantitative descriptors for each of the 20 amino acids based on a large number of physical-chemical properties [2]. Conservation of physical-chemical properties at equivalent residue positions in a protein family is then defined by measuring the standard deviations and the relative entropies of these descriptors.

For each target we prepared multiple alignments of the target sequence with similar sequences from other organisms as identified by BLAST/PSIBLAST. These multiple sequence alignments were then analyzed with MASIA to identify regions where the distribution of the quantitative descriptors within the multiple alignment is significantly different from the 'a priori' expected distribution. Each motif is quantitatively expressed as a profile consisting of vector magnitude, standard deviation and relative entropy. The relative entropy measures the deviation of the observed distribution of the descriptors from that of randomly occurring amino acids at a given sequence position in the multiple alignment. These profiles were then used by our program ALIGNSCORER to determine which templates and alignments, from the various fold recognition servers that participated in CAFASP, had the highest score according to their degree of fit in the highly conserved areas. For some of the targets we combined alignments with high-scoring motifs from different fold recognition servers and from different templates. For targets with no significant differences among the alignments for the motifs, we used secondary structure predictions to generate an improved alignment. In most cases, the profiles of the template molecule were also determined. In several cases, where there were few known sequences similar to the target protein, we instead determined motifs (and molegos) in the suggested templates from the PDB and matched these to the target in the alignment. For two targets (T139, T170) we could not find suitable templates and made ab initio 3D structure predictions based on secondary structure prediction from JPRED and inside/outside prediction from MASIA.

The final models were judged according to stereochemical criteria (PROCHECK) and energy (ECEPP) and, where possible, biological criteria, such as surface location of known glycosylation and inter-protein interaction sites, and similar configuration of the active sites of enzymes. We have included in all submission files a general description of our method and a specific section with details for each target.
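The standard deviation and relative entropy used to quantify conservation at one alignment column can be illustrated with a small sketch. It is not the MASIA implementation: the five-dimensional descriptors of [2] are replaced here by a single stand-in property (the Kyte-Doolittle hydropathy scale), the three-bin discretization is arbitrary, and the background assumes that all 20 amino acids occur with equal frequency.

```python
import math
from collections import Counter

# Stand-in one-dimensional descriptor (Kyte-Doolittle hydropathy); the actual
# method uses five-dimensional descriptors that are not reproduced here.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def bin_of(value):
    """Discretize a descriptor value into three bins (arbitrary boundaries)."""
    return 0 if value < -2.0 else (1 if value <= 2.0 else 2)


# Background: bin occupancy when every amino acid is equally likely.
BACKGROUND = Counter(bin_of(v) for v in HYDROPATHY.values())


def column_profile(column):
    """Return (mean, standard deviation, relative entropy in bits) for a column."""
    values = [HYDROPATHY[r] for r in column if r in HYDROPATHY]  # skips gaps
    if not values:
        raise ValueError("column contains no standard residues")
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    observed = Counter(bin_of(v) for v in values)
    rel_ent = sum(
        (c / n) * math.log2((c / n) / (BACKGROUND[b] / len(HYDROPATHY)))
        for b, c in observed.items()
    )
    return mean, std, rel_ent
```

A column with a small standard deviation and a large relative entropy is strongly conserved in the chosen property, which is the kind of position that would contribute to a motif.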

1. Zhu H. et al. (2000) MASIA: recognition of common patterns and properties in multiple aligned protein sequences. Bioinformatics 16, 950-951.
2. Venkatarajan M.S. and Braun W. (2001) New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J. Mol. Modeling 7, 445-453.

Brooks (P0373) - 252 predictions: 252 3D

Structure Prediction with Multiscale Modeling Methods

Michael Feig, Charles L. Brooks III NIH Research Resource for Multiscale Modeling Tools in Structural Biology The Scripps Research Institute, La Jolla, CA [email protected], [email protected]

We have used a multiscale modeling strategy to predict protein structures ab initio or from templates based on sequence homology or fold recognition. In our approach we combine low-resolution, lattice-based representations of protein structures with models in full atomic detail using the newly developed MMTSB Tool Set [1]. Such a modeling method was used to provide both speed and accuracy while exploring conformational space to search for the global free energy minimum that is presumed to coincide with the native fold.

In the standard protocol employed for most targets we first generated a large number of conformations using lattice-based sampling with a modified version of the program MONSSTER [2] that uses the replica exchange methodology for enhanced sampling [3]. For ab initio predictions we started from random extended chains using only information from secondary structure prediction servers to guide the lattice simulations. If structural templates were available from sequence homology or fold recognition, the lattice sampling protocol was applied only to missing regions (from small loops to larger fragments) in the context of the template structure. In some cases different models were built from multiple templates and refined and ranked together at a later stage. Templates were selected based on alignments provided through CAFASP or obtained from public fold recognition servers, in particular 3D-PSSM (http://www.sbg.bio.ic.ac.uk/~3dppsm) and PDB-Blast (http://bioinformatics.burnham-inst.org/pdb_blast/PB_help.html).

The lattice model conformations were subsequently rebuilt to complete all-atom models using an accurate reconstruction procedure [4] and clustered according to distance RMSD. The all-atom models were then minimized and scored with an all-atom energy function using the CHARMM program [5]. As part of the scoring function we used a new, highly accurate Generalized Born implementation [6] to provide an implicit solvent description since this function was found to be most effective in scoring CASP4 predictions [7]. A number of structures were then selected from the clusters with the lowest average energy scores, on the order of 10 to 100, and submitted to further refinement.

In the final refinement step, we used molecular dynamics simulations with Generalized Born implicit solvation and replica exchange. In replica exchange simulations, a number of simulations are run concurrently at different temperatures while temperatures are exchanged periodically according to Metropolis criteria based on the system’s total energy. In such simulations the most favorable conformations populate the lowest temperature windows while less favorable conformations at higher temperatures can search conformational space more extensively for lower energy regions. This greatly enhances sampling towards the global free energy minimum but also provides intrinsic ranking of structures based on free energies from the average temperature for a given replica during the course of a simulation. Accordingly, the final structures at the lowest temperatures from the replica exchange simulations were submitted as predictions.
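The exchange step can be illustrated with the standard replica-exchange acceptance rule of [3]. This is a sketch, not the MMTSB Tool Set code: the function names, the neighbour-pairing scheme and the use of kcal/mol energies with the Boltzmann constant below are assumptions.

```python
import math
import random

K_B = 0.0019872041   # Boltzmann constant in kcal/(mol K)


def swap_probability(e_i, t_i, e_j, t_j):
    """Metropolis probability of exchanging the temperatures of two replicas.

    e_i, e_j: current energies of replicas i and j
    t_i, t_j: their current temperatures in kelvin
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def exchange_sweep(energies, temps, rng, offset=0):
    """Attempt swaps between temperature neighbours; return the new temperatures.

    offset alternates between 0 and 1 on successive sweeps so that every
    neighbouring pair is attempted in turn.
    """
    temps = list(temps)
    order = sorted(range(len(temps)), key=lambda r: temps[r])
    for k in range(offset, len(order) - 1, 2):
        i, j = order[k], order[k + 1]
        if rng.random() < swap_probability(energies[i], temps[i],
                                           energies[j], temps[j]):
            temps[i], temps[j] = temps[j], temps[i]
    return temps
```

Because a swap is always accepted when the colder replica has the higher energy, low-energy conformations drift toward the lowest temperatures, which is the intrinsic ranking the text refers to.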

1. Feig M., Karanicolas J., Brooks C.L. III. (2001) MMTSB Tool Set. NIH Research Resource for Multiscale Modeling Tools in Structural Biology, The Scripps Research Institute. http://mmtsb.scripps.edu/software/mmtsbtoolset.html
2. Kolinski A., Skolnick J. (1994) Monte Carlo Simulations of Protein Folding. I. Lattice Model and Interaction Scheme. Proteins 18, 338-352.
3. Sugita Y., Okamoto Y. (1999) Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314, 141-151.
4. Feig M., Rotkiewicz P., Kolinski A., Skolnick J., Brooks C.L. III. (2000) Accurate Reconstruction of All-Atom Protein Representations From Side-Chain-Based Low-Resolution Models. Proteins 41, 86-97.
5. Brooks B.R., Bruccoleri R.E., Olafson B.D., States D.J., Swaminathan S., Karplus M. (1983) CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations. J. Comp. Chem. 4, 187-217.
6. Lee M.S., Salsbury F.R. Jr, Brooks C.L. III. (2002) Novel Generalized Born Methods. J. Chem. Phys. 116, 10606-10614.
7. Feig M., Brooks C.L. III. (2002) Evaluating CASP4 Predictions with Physical Energy Functions. Proteins 49, 232-245.

Bujnicki-Janusz (P0020) - 215 predictions: 67 3D, 58 SS, 49 RR, 41 DR

Consensus Prediction Using Fragments of Fold-Recognition-Based Homology Models Weighted by the Score of Inter-Residue Contacts and Compatibility with the Local Environment

J.M. Bujnicki International Institute of Molecular and Cell Biology (IIMCB) in Warsaw. Trojdena 4, 01-109 Warsaw, Poland [email protected]

A protein structure prediction strategy has been developed that is applicable to all prediction categories in CASP5: homology modeling, fold recognition (FR), prediction of new folds, secondary structure (SS) prediction, residue-residue (RR) distance prediction and order-disorder (DR) region prediction. The steps are:

i) a multiple sequence alignment is built for the target sequence using as many homologs as possible; careful refinement is carried out to remove false positives and correctly align weakly conserved motifs,
ii) FR and SS predictions are carried out for the target sequence/alignment using as many different methods as possible,
iii) target-template alignments generated by FR methods are converted into full-atom homology models (HM),
iv) the model structures are clustered to identify the most frequently reported folds,
v) all models are evaluated using independent criteria to obtain uniformly scaled values corresponding to the expected accuracy of each model,
vi) the quality of modeled segments is evaluated by calculation of the 3D profile score in a moving-window scan,
vii) for each candidate fold, the superimposed models are used to generate a low-resolution weighted consensus structure, using weights at the level of entire models and individual amino acid residues; models and regions below a certain cutoff are disregarded,
viii) the sequence-structure alignment for the core of the target is obtained from superposition of the consensus model onto the family of template structures,
ix) the alignment of the core is refined to preserve the continuity of SS elements,
x) the core of the target is modeled based on all templates,
xi) the loops of the target are modeled using cladistic criteria (from the template structures, only those loops that exhibit similar length and sequence are used to guide the modeling of loops in the target structure),
xii) the model is re-evaluated by calculation of the 3D profile score (using a knowledge-based atomic-detailed potential) and poorly scoring regions are subject to refinement (see step xvi),
xiii) the consensus SS is obtained from the SS calculated from the FR model and the FR-independent SS prediction (step ii),
xiv) the weighted consensus of RR contacts is calculated from all intermediate models for the selected fold (step iii), using weights from the 3D profile score (step vi),
xv) the disordered regions are predicted by analysis of R-factors and unstructured regions in the superimposed template structures combined with analysis of the local sequence composition in the target structure,
xvi) the 3D model is refined, using alternative FR alignments as the starting points and restraints from the consensus SS, RR and DR predictions.
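Steps vi and vii can be illustrated with a short sketch. It assumes superimposed C-alpha coordinates for M models of L residues, one weight per model, and one windowed score per residue. The function names, the array layout and the way the two weight levels are multiplied together are assumptions for illustration only; the abstract does not specify the actual weighting formula.

```python
import numpy as np

def window_scores(per_residue, window=5):
    """Average a per-residue score over a centred moving window
    (the window is truncated at the chain termini)."""
    x = np.asarray(per_residue, dtype=float)
    half = window // 2
    return np.array([x[max(0, i - half): i + half + 1].mean()
                     for i in range(len(x))])

def weighted_consensus(coords, model_w, residue_w, cutoff=0.0):
    """coords: (M, L, 3) superimposed coordinates; model_w: (M,) model weights;
    residue_w: (M, L) per-residue scores. Residues scoring below `cutoff`
    are ignored. Returns (L, 3) consensus coordinates and an (L,) mask of
    residues supported by at least one model."""
    coords = np.asarray(coords, dtype=float)
    residue_w = np.asarray(residue_w, dtype=float)
    w = np.asarray(model_w, dtype=float)[:, None] * residue_w
    w = np.where(residue_w >= cutoff, w, 0.0)   # assumes non-negative cutoff
    total = w.sum(axis=0)
    covered = total > 0
    consensus = np.full(coords.shape[1:], np.nan)
    weighted_sum = (w[..., None] * coords).sum(axis=0)
    consensus[covered] = weighted_sum[covered] / total[covered][:, None]
    return consensus, covered

smoothed = window_scores([1, 2, 3, 4, 5], window=3)
models = [[[0, 0, 0], [1, 0, 0]],
          [[2, 0, 0], [3, 0, 0]]]
cons, mask = weighted_consensus(models, [1, 1], [[1, 0], [1, 1]], cutoff=0.5)
```

In the example the second residue of the first model falls below the cutoff, so the consensus position for that residue is taken from the second model alone.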

In CASP5, I submitted predictions in all categories (TS, DR, RR, and SS). Where possible, all types of prediction algorithms were queried with multiple sequence alignments. Additional FR alignments and ab initio models for the CASP5 targets were obtained from the CAFASP website. For homology modeling, I used Modeller 6v1 [1] and Swiss-Model [2]. Following visual inspection of the models, the conformation of selected sidechains was adjusted by hand. Long loops with no counterpart in homologous structures were often modeled by hand by partial extension of the neighboring secondary structure elements. For evaluation of the local environment and inter-residue contacts I used Verify3D [3] with a window of 5 residues. In most cases, I submitted only the best single model.

1. Sali A. and Blundell T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
2. Guex N. and Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18, 2714-2723.
3. Luthy R. et al. (1992) Assessment of protein models with three-dimensional profiles. Nature 356, 83-85.

Bystroff (P0131) - 132 predictions: 45 3D, 40 SS, 45 RR, 2 DR

Contact Map Threading Using HMMSTR

Y. Shao and C. Bystroff Department of Biology, Rensselaer Polytechnic Institute [email protected], [email protected]

During the CASP5 prediction, we developed a threading method to predict protein contact maps. The target sequence was aligned with each of the template sequences selected from the PDBSelect database [1]. Target contact maps were generated from these alignments along with the template contact maps. Then the predicted contact maps were evaluated by the contact free energy calculation. The final contact map prediction was selected based upon the contact free energy as well as three other parameters, along with extensive use of intuition.

The templates we used were selected from the PDBSelect database and satisfied the following conditions: X-ray structure available, high resolution (<2.5Å) and not alpha-carbon only. The total number of templates was 1239. For each template, a sequence profile was generated using PSI-BLAST [2].

First, the target sequence profile was aligned against each template sequence profile by a Bayesian adaptive sequence alignment method [3]. For each template, the method generated one alignment score but a large number of alignments between the target and the template. If the alignment score was below the threshold value, the template was rejected. Otherwise, 100 alignments were selected at random for contact map alignment.

The alignment sets were then improved in two steps. First they were pruned by the compactness score. The Bayesian alignment method tended to insert large gaps into the alignments. In order to find which alignments had longer regions that were aligned, for each alignment we calculated the length of the longest aligned region (ignoring small gaps ≤ 3 residues). This length was defined as the compactness score of the alignment. We kept the top-scoring 10 alignments and discarded the remaining 90. Then the physicality of these 10 alignments was checked. The checking was limited to the ends of the gaps. If in the template structure, the distance between the residues at the opposite ends of an insertion was inconsistent with the sequence distance in the target, then those residues were removed from the alignments. This procedure was repeated until the gap distances in the template were consistent with the sequence distances in the target. In other words, for all gaps (i,j),

$$D_{i'j'} \le 3.8\,\text{\AA} \cdot |i - j| \qquad (1)$$

where i' and j' are the template positions aligned with i and j, and D is the distance in Å between the alpha-carbons.
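The two pruning steps above (compactness score and the gap distance condition of Equation 1) can be sketched as follows. An alignment is represented as a list of (target index, template index) pairs in increasing order. The function names, the data layout, the rule that a gap is "small" only if it is at most 3 residues on both sequences, and the choice to drop both flanking pairs of a violating gap are assumptions of this sketch.

```python
import math

CA_SPACING = 3.8  # maximum C-alpha to C-alpha distance per residue, in Angstrom

def compactness(pairs, max_gap=3):
    """Number of aligned pairs in the longest region that contains
    no gap longer than max_gap residues."""
    best = run = 1 if pairs else 0
    for (t1, u1), (t2, u2) in zip(pairs, pairs[1:]):
        small = (t2 - t1 - 1) <= max_gap and (u2 - u1 - 1) <= max_gap
        run = run + 1 if small else 1
        best = max(best, run)
    return best

def trim_unphysical(pairs, ca_coords):
    """Repeatedly remove the aligned pairs flanking any gap whose template
    C-alpha distance exceeds 3.8 A times the target sequence separation."""
    pairs = list(pairs)
    changed = True
    while changed:
        changed = False
        for k in range(len(pairs) - 1):
            (i, i_tpl), (j, j_tpl) = pairs[k], pairs[k + 1]
            dist = math.dist(ca_coords[i_tpl], ca_coords[j_tpl])
            if dist > CA_SPACING * abs(i - j):
                del pairs[k:k + 2]
                changed = True
                break
    return pairs

# Toy template: ten C-alpha atoms on a straight line, 3.8 A apart.
ca = [(3.8 * k, 0.0, 0.0) for k in range(10)]
score = compactness([(0, 0), (1, 1), (2, 2), (10, 3), (11, 4)])
trimmed = trim_unphysical([(0, 0), (1, 5), (2, 6)], ca)
```

In the toy case, target residues 0 and 1 are adjacent in sequence but aligned to template positions 19 Å apart, so both pairs are removed and only the last pair survives.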

For each template, these 10 trimmed alignments along with the template contact map were used to generate candidate contact map predictions. For every two target residues that were aligned to the template, if those residues in the template were in contact, then a contact was predicted for those two residues in the target. Each candidate contact map (C) was then scored using the "contact free energy".
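The contact transfer rule in this paragraph is simple enough to state as code. The sketch below assumes the same (target index, template index) pair representation of an alignment and a template contact map given as a list of residue index pairs; both representations and the function name are illustrative choices, not taken from the original method.

```python
def transfer_contacts(aligned_pairs, template_contacts):
    """Predict a target contact for every template contact whose two
    residues are both aligned to target residues.

    aligned_pairs: iterable of (target_index, template_index)
    template_contacts: iterable of (a, b) template index pairs in contact
    Returns a sorted list of (i, j) target pairs with i < j.
    """
    to_target = {tpl: tgt for tgt, tpl in aligned_pairs}
    predicted = set()
    for a, b in template_contacts:
        if a in to_target and b in to_target:
            i, j = sorted((to_target[a], to_target[b]))
            if i != j:
                predicted.add((i, j))
    return sorted(predicted)

candidate = transfer_contacts(
    [(0, 10), (1, 11), (5, 20)],          # alignment
    [(10, 20), (11, 30), (20, 11)],       # template contacts
)
```

The template contact (11, 30) is dropped because template residue 30 has no aligned target residue.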

Contact free energy (CFE) was calculated in three steps, as stated in detail in the following paragraphs. In brief: first, the HMM state contact potentials were pre-calculated from the database. Secondly, the contact potential map for the target sequence was calculated. Finally, the candidate contact maps were multiplied by the contact potential map to give the CFE score.

The HMMSTR position-specific state probability matrices (γ matrices [4]) were pre-calculated for all 1239 templates using HMMSTR, a hidden Markov model for local sequence-structure correlations in proteins [5]. The HMM state contact potential between any two HMM states p and q, G(p,q,s), was calculated as the negative log of the ratio between the sum of the product of HMM state probabilities for states p and q at residues i and i+s, respectively, that are in contact (C-alpha distance less than 8Å) in the template database, and the sum of the same product over all residue pairs i and i+s (Equation 2):

$$G(p,q,s) = -\log \frac{\sum_{\text{PDBSelect}} \; \sum_{i:\, D_{i,i+s} < 8\,\text{\AA}} \gamma(i,p)\,\gamma(i+s,q)}{\sum_{\text{PDBSelect}} \; \sum_{i} \gamma(i,p)\,\gamma(i+s,q)} \qquad (2)$$

where γ(i,p) is the probability of state p at position i in the sequence/structure, calculated using the forward/backward algorithm [4,5]. The sensitivity of discriminating contacts from non-contacts was increased by calculating G as a function of the sequence separation s (4 ≤ s ≤ 20). For sequence separations greater than 20, s=20 was used. The total number of potential functions was 1037153 (247 × 247 × 17), one for every pair of 247 Markov states in HMMSTR and every separation distance from 4 to 20.

The contact potential between residues i and j, E(i,j), in the target was calculated in the following way. First the target sequence profile was generated by PSI-BLAST [2], then the HMM state probability matrix (γ) was generated by HMMSTR [5] using the sequence profile. Finally, the contact potential was calculated as the γ-weighted sum of the contact potentials, G(p,q,s), for all HMM state pairs (p and q) with sequence separation s = |i-j|:

$$E(i,j) = \sum_{p} \sum_{q} \gamma(i,p)\,\gamma(j,q)\,G(p,q,s)$$

Finally, the CFE was calculated by summing the contact potential of all the pairs of residues that were predicted to be in contact in the candidate contact map, C:

$$\mathrm{CFE} = \sum_{i,j:\; C_{ij}=1,\; j>i+3} \left[ E(i,j) - \langle E \rangle \right]$$

where ⟨E⟩ is the mean contact potential for the target. For each template, we calculated the CFE for all the 10 target contact map candidates and chose the one with the best energy as the contact map prediction for that template.
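The last two steps of the CFE calculation can be sketched with NumPy. Here `gamma` stands for the HMM state probability matrix of the target (one row per residue, one column per state) and `G` for the pre-calculated state contact potentials indexed as G[s, p, q]. The array layout, the function names, and the choice to take the mean contact potential over all residue pairs with j > i+3 are assumptions of this sketch; the toy example uses 2 states instead of the 247 in HMMSTR.

```python
import numpy as np

def contact_potential_map(gamma, G, s_min=4, s_max=20):
    """E(i,j) = sum_p sum_q gamma(i,p) * gamma(j,q) * G(p,q,s), s = |i-j|.
    Separations above s_max reuse the s_max potential; pairs closer than
    s_min are left at zero."""
    length = gamma.shape[0]
    E = np.zeros((length, length))
    for i in range(length):
        for j in range(i + s_min, length):
            s = min(j - i, s_max)
            E[i, j] = E[j, i] = gamma[i] @ G[s] @ gamma[j]
    return E

def contact_free_energy(C, E, s_min=4):
    """Sum of (E(i,j) - mean E) over predicted contacts with j > i + 3."""
    upper = np.triu_indices(E.shape[0], k=s_min)
    values = E[upper]
    mean_e = values.mean()   # assumed: mean over all pairs with j > i + 3
    in_contact = C[upper] == 1
    return float((values[in_contact] - mean_e).sum())

# Toy example: 6 residues, 2 states, every residue fully in state 0.
gamma = np.tile([1.0, 0.0], (6, 1))
G = np.zeros((21, 2, 2))
G[4, 0, 0] = -1.0   # favourable at separation 4
G[5, 0, 0] = 2.0    # unfavourable at separation 5
E = contact_potential_map(gamma, G)
C = np.zeros((6, 6), dtype=int)
C[0, 4] = 1
cfe = contact_free_energy(C, E)
```

Lower (more negative) CFE values correspond to candidate maps whose predicted contacts fall on favourable regions of the contact potential map.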

After we carried out the above procedure for every template in our dataset, we usually generated several hundred target contact map predictions. How to evaluate them and choose one as the final prediction became a problem in itself. The decision was made by referring to 4 parameters: the CFE, the Bayesian alignment score, the compactness score and the sequence length similarity between the target and the template. The primary parameter was the CFE since it represented the free energy of the sequence aligned to the template. But as we observed during the CASP5 prediction, better alignments (which were represented by Bayesian alignment score and compactness) and similar lengths between the target and template sequences improved the (perceived) prediction quality. The strategy we used during the CASP5 prediction was to first sort the predicted contact maps by CFE, and then, among the top-scoring 20 to 30 contact maps, to choose (subjectively) one candidate as the final prediction by judging each contact map by eye, along with the other 3 parameters, and manually editing if necessary.

Often several of the top-scoring candidates contained the same fold. Consensus was considered a strong indicator, especially if the fold was uncommon. Multiple candidates were sometimes used to construct a single composite map. If no promising prediction, consensus prediction or composite was found, then an ab initio contact map was made, based on contact potential alone, and rule-based filtering/editing.

1. Hobohm U. and Sander C. (1994) Enlarged representative set of protein structures. Protein Science 3, 522.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
3. Zhu J. et al. (1998) Bayesian adaptive sequence alignment algorithms. Bioinformatics 14 (1), 25-29.
4. Rabiner L.R. (1989) A tutorial on Hidden Markov Models and selected applications in speech recognition. Proc. IEEE 77, 257-286.
5. Bystroff C. et al. (2000) HMMSTR: a Hidden Markov Model for local Sequence-Structure Correlations in Proteins. J. Mol. Biol. 301 (2), 173-190.

Cam-Biochem (P0447) - 74 predictions: 74 3D

Iteration of Alignment and Model Building, Using Novel Techniques for Modelling and Validation

T. L. Blundell, V. Bolanos-Garcia, S. C. Brewerton, D. F. Burke, L. Chen, M. V. Cubellis, P. I. W. de Bakker, M. A. DePristo, A. C. B. Drake, M. T. Ehebauer, H. S. Gweon, N. J. Harmer, M. L. Kilkenny, B. S. Kochupurakkal, M. J. Lai, C. M. C. Lobley, S. C. Lovell, R. N. Miguel, K. Mizuguchi, R. W. Montaluoa, B. Popovic, R. P. Shetty, L. A. Stebbings, J. J. C. Thorpe Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, United Kingdom [email protected]

In the CASP 5 experiment the Cam-Biochem group submitted models of proteins predicted to have either close or distant relationships to proteins of known structure. For both of these categories we used a number of novel methods, many of which have been developed since CASP 4.

In the closely related/comparative modelling class we used FUGUE[1] to identify evolutionary relationships between the targets and HOMSTRAD[2] families. Alignments were often edited by hand. Backbone models of conserved regions were built using SCORE[3], and backbone models of variable regions using RAPPER[4,5] or SLOOP[6]. SCORE defines conserved regions of the parents based on geometric features. It then selects the appropriate conserved parent fragment based on the match between it and the target sequence. The target sequence/parent fragment match is derived from environmentally-constrained substitution tables. RAPPER combines both ab initio and knowledge-based methods with model selection using state-of-the-art energy functions. The ab initio method samples conformations using fine-grained phi/psi state sets and side chain conformer libraries under idealized-geometry, excluded-volume and chain-closure restraints, generating thousands of plausible conformations. These conformations, supplemented with compatible fragments from the PDB, are ranked by an all-atom statistical potential (RAPDF) and by the AMBER force field with the generalized Born continuum solvation model. SLOOP selects loops from a database of loop conformations based on sequence/structure profiles and surrounding secondary structure.

Side chains were then built with CELIAN[7], which models side chain conformations by (1) borrowing side chain conformations, where appropriate, from parent structures and (2) building remaining side chains from a high-quality rotamer library[8], optimising the packing by using a SCWRL-like[9] algorithm. Rules determining when to borrow conformations from parent structures are based on an analysis of substitutions in defined structural environments.

In the distantly related/fold recognition category we used FUGUE[1] in conjunction with other fold recognition servers to identify potential homologues and MODELLER to build models. FUGUE uses environment-specific substitution tables derived from the HOMSTRAD database, along with structure-dependent gap penalties, to construct structural profiles. The latest version (FUGUE2[10]) uses structural profiles enriched by homologous sequences. In most cases, variable regions were rebuilt using RAPPER.

In previous CASPs, we have had a degree of success due to the care taken in the production and validation of the alignment, and in validation of the model[11, 12]. An important element of our strategy in CASP 5 was the use of our new program, HARMONY3[13], which validates models and alignments. It uses substitution probabilities, amino-acid propensities and a unique "alignment flexibility" score to predict which regions of the alignment are likely to be incorrect. Based on this, and on visual examination of alignments after annotation with JOY[14], we iterated rounds of alignment and model building. In CASP 4 we applied strict validation criteria, which resulted in us submitting a small number of high-quality models. In CASP 5 we took equal care with the alignment and modelling procedure, but submitted a substantially larger number of models.

1. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243-257.
2. de Bakker P.I.W. et al. (2001) HOMSTRAD: Adding sequence information to structure-based alignments of homologous protein families. Bioinformatics 17, 748-749.
3. Deane C.M. et al. (2001) SCORE: predicting the core of protein models. Bioinformatics 17 (6), 541-50.
4. DePristo M.A. et al. (2002) Ab initio construction of polypeptide fragments: Efficient generation of consistent, representative ensembles. Proteins: Structure, Function and Genetics, in press.
5. de Bakker P.I.W. et al. (2002) Ab initio construction of polypeptide fragments: II. Accuracy of loop decoy discrimination by an all-atom statistical potential and the AMBER force field with the Generalized Born solvation model. Proteins: Structure, Function and Genetics, in press.
6. Burke D.F. et al. (2001) Improved loop prediction from sequence alone. Protein Engineering 14 (7), 473-478.
7. Chen L. et al. Unpublished.
8. Lovell S.C. et al. (2000) The Penultimate Rotamer Library. Proteins: Structure, Function and Genetics 40, 389-408.
9. Bower et al. (1997) Sidechain prediction from a backbone-dependent rotamer library. A new tool for homology modelling. J. Mol. Biol. 267, 1268-1282.
10. Shi J. et al. Unpublished.
11. Burke D.F. et al. (1999) An iterative structure-assisted approach to sequence alignment and comparative modelling. Proteins: Structure, Function and Genetics Suppl 3, 55-60.
12. Williams et al. (2002) Sequence-structure homology recognition by iterative alignment refinement and comparative modelling. Proteins: Structure, Function and Genetics Suppl 5, 92-97.
13. Shi J. (2001) Thesis, University of Cambridge.
14. Mizuguchi et al. (1998) JOY: protein sequence-structure representation and analysis. Bioinformatics 14, 617-623.

Camacho-Carlos (P0098) - 46 predictions: 46 3D

Automated Consensus Method of Alignment for Accurate Comparative Modeling

Jahnavi C. Prasad, Sandor Vajda, Carlos J. Camacho Bioinformatics Program, Boston University, Boston, MA 02215 [email protected]

Quality of the alignment has been cited as the major determinant of the accuracy of the final predicted structure in comparative modeling. Errors that occur in the alignment stage cannot be recovered from in the later stages. Besides, the target may significantly diverge from the template in certain regions, thus making it undesirable to model the entire target from that template. Therefore, in addition to having the best possible alignment, it is also important to identify the target regions that are likely to be structurally dissimilar from the template. We have developed and completely automated a method that addresses both these issues.

Several methods exist for alignment. In a separate benchmarking analysis [1], we tested ten widely used methods and selected five of them in a hierarchical manner so that we cover a broad range of alignments. We have developed a methodology to build a consensus based on the alignments from variations of the finally selected five methods. Each position in the consensus alignment is assigned a confidence level. Then the regions reliable for homology modeling are predicted by applying criteria involving secondary structure and solvent exposure profile of the template, predicted secondary structure of the target, consensus confidence level, template domain boundaries and structural continuity of the predicted region with other predicted regions.
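One simple way to attach a confidence level to each consensus position is to count how many methods agree on it. The sketch below is only an illustration of that idea: the abstract does not give the actual consensus rule, so the voting scheme, the representation of an alignment as a list of template positions (or None for unaligned target residues) and the function name are all assumptions.

```python
from collections import Counter

def consensus_alignment(alignments):
    """alignments: one list per method; element k is the template position
    aligned to target residue k, or None if the residue is unaligned.
    Returns a list of (template_position, confidence) per target residue,
    where confidence is the fraction of methods supporting the position."""
    n_methods = len(alignments)
    consensus = []
    for column in zip(*alignments):
        votes = Counter(pos for pos in column if pos is not None)
        if not votes:
            consensus.append((None, 0.0))
            continue
        position, count = votes.most_common(1)[0]
        consensus.append((position, count / n_methods))
    return consensus

result = consensus_alignment([
    [0, 1, 2],
    [0, 1, 3],
    [0, None, 4],
])
```

Positions where all methods agree receive confidence 1.0, and positions where the methods disagree can then be excluded or handled by the later selection stages.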

The method is best suited to predict accurate (structural) alignments given a template. For CAFASP3 we implemented a template search algorithm that resembles PDB-BLAST. This method provided templates for X number of targets, mostly homology models. Alignments are then obtained from the five chosen methods. Method 1 is a SAM-T99 [2] based method that uses a PSI-BLAST [4] multiple alignment of target and template hits as its seed. The second method is similar but based on HMMER [3]. Method 3 is a simple pairwise alignment by BLAST [4]. This alignment is used only when it is long enough and its E-value is acceptable. The fourth and fifth methods are again SAM-T99 and HMMER based methods that use HSSP alignments [5] of the template as their seed alignment for generating the first iteration HMM.

Then the consensus alignment is built from these alignments. Now regions suitable for homology modeling are 'selected' in several stages, some of which will be mentioned here. The first stage involves selection of highly confident alignment regions corresponding to buried regions in the template structure. In the second stage, these selected core regions are extended on both sides to lower confidence levels till a Glycine is encountered. Regions that are buried, inside an alpha helix on the template side and satisfying a certain consensus threshold are then selected. Similarly regions corresponding to beta sheets on the template are then selected subject to certain other criteria. If the terminal regions of template or target are suspected to be loose, i.e. dissimilar, then such regions are deselected.

After the selection procedure, the selected regions of target are predicted by simply following the backbone of corresponding regions of the template. If the template is multi-domain, target regions corresponding to each domain are predicted separately. In such cases, full predictions were submitted as first models and the domains as subsequent models. The entire method is automatic and is available as a server at http://structure.bu.edu/cgi-bin/consensus/consensus.cgi

1. Prasad J.C., Comeau S.R., Vajda S., Camacho C.J. Confident Homology Modeling Based On Consensus Alignment. Submitted for publication.
2. Hughey R. and Krogh A. (1995) SAM: Sequence alignment and modeling software system. Technical Report UCSC-CRL-95-7, Univ. of Calif., Santa Cruz, CA.
3. Eddy S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755-763.
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
5. Holm L. and Sander C. (1996) Mapping the protein universe. Science 273, 595-602.

Camacho-Carlos (P0099) - 184 predictions: 184 3D

Building Protein Structures from Spare Parts

Carlos J. Camacho Department of Biomedical Engineering, Boston University, Boston, MA 02215 [email protected]

Protein structure motifs can be divided into two qualitatively different groups: more or less unique disorder regions (mostly solvent exposed loops), and a rather limited set of recurrent motifs or spare parts (helices, sheets, alpha-turn-alpha, alpha-turn-beta, etc). Given the more than 10,000 crystal structures available to date, it is very likely that most spare parts are already present in the Protein Data Bank (PDB). With this in mind, we have developed an algorithm named Consensus [1-2] that attempts to detect spare part motifs even in sequences with no apparent sequence similarity with other proteins. We generalized this automated procedure to select motifs from multiple templates, and then combined them to predict protein models including side chains. Based on sequence, template secondary structure, secondary structure prediction, and simple energetic constraints (no functional information), we apply this new technology to all CASP5 targets.

Specifically, for any given target sequence, we run Consensus on multiple templates selected from CAFASP3 results, obtaining confident structural alignments of different length. Then, guided by the confidence level associated with each motif, we put together the pieces and build the scaffold of the protein by methodically matching the junctions. We did not put much effort into predicting all disordered regions. The main strength of the method is the ability of Consensus to confidently select the portions of templates that bear structural similarities to the target.

For CASP5, we had just finished the implementation of this software. Therefore, some intermediate steps were done manually. Since the main application we envision for the models generated by our technique is large-scale prediction of physical interactions between proteins, our first aim was to predict only one high confidence model for each target. For practical reasons, we often submitted one model and smaller portions of the same model. For the first 13 targets (deadline before Aug 24th) the order of our submissions was based on length (shorter alignment first) and not overall confidence. We submitted multiple different models only for very few targets (mainly for the short NMR targets). Due to time constraints we were not able to finish the analysis for Targets 187 and 194, so we only submitted preliminary models. In summary, we submitted 185 structures: one structure for 11 targets, two for 14, three for 21, four for 13, and for 6 targets we submitted all 5 possible models.

1. Prasad J.C., Comeau S.R., Vajda S., Camacho C.J. (2002) Confident homology modeling based on consensus alignment. Submitted for publication. 2. Prasad J.C., Vajda S., Camacho C.J. (2002) Using Consensus to predict confident structural alignments. Abstract CASP5 Asilomar Meeting.

CaspIta (P0108) - 133 predictions: 70 3D, 63 SS

Secondary Structure Prediction by Consensus and Homology

S. C. E. Tosatto1, M. Albrecht2, A. Cestaro1, S. Toppo1 and G. Valle1 1 - CRIBI Biotechnology Centre, Universita' di Padova 2 - Max-Planck-Institut fuer Informatik [email protected]

The secondary structure of the CASP targets was predicted by an automated consensus of three public servers: Psi-Pred [1], ssPRO [2] and Sam-T2K [3]. The consensus is built from a jury decision of the three servers. In case of a tie, where possible, the neighbouring secondary structure is extended, otherwise the ‘coil’ state is predicted.

The prediction is then filtered to remove single mispredictions (e.g. ‘EEEHEE’). Secondary structure elements shorter than two (strand) or three residues (helix) are converted to ‘coil’.
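The jury and filtering rules can be sketched as below, using the three-state alphabet H (helix), E (strand) and C (coil). Several details are not specified in the text and are assumptions here: a tie is resolved by extending the left neighbour's state if one of the servers voted for it, otherwise the right neighbour's; a "single misprediction" is taken to be any residue flanked on both sides by the same, different state; and the function names are hypothetical.

```python
from collections import Counter
from itertools import groupby

MIN_LENGTH = {'E': 2, 'H': 3}  # minimum element length for strand and helix

def smooth(ss):
    """Replace isolated residues flanked by one common state, e.g. EEEHEE."""
    out = list(ss)
    for k in range(1, len(ss) - 1):
        if ss[k - 1] == ss[k + 1] != ss[k]:
            out[k] = ss[k - 1]
    return ''.join(out)

def drop_short_elements(ss):
    """Convert strands shorter than 2 and helices shorter than 3 to coil."""
    out = []
    for state, run in groupby(ss):
        n = len(list(run))
        out.append(('C' if n < MIN_LENGTH.get(state, 0) else state) * n)
    return ''.join(out)

def consensus(predictions):
    """Majority vote over equal-length prediction strings, then filtering."""
    columns = list(zip(*predictions))
    states = []
    for column in columns:
        state, count = Counter(column).most_common(1)[0]
        states.append(state if count >= 2 else None)  # None marks a tie
    for k, column in enumerate(columns):
        if states[k] is None:
            left = states[k - 1] if k > 0 else None
            right = states[k + 1] if k + 1 < len(states) else None
            if left in column:
                states[k] = left
            elif right in column:
                states[k] = right
            else:
                states[k] = 'C'
    return drop_short_elements(smooth(''.join(states)))

final = consensus(["HHHHCCEEE", "HHHHCCEEE", "HHHCCCEEC"])
```

Applying the short-element filter after smoothing means that a lone strand or helix residue that survives the jury is still removed, in line with the length thresholds stated above.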

Targets with high sequence identity (>= 60%) to a template structure were predicted using a different approach. In an attempt to reach higher prediction rates, the true secondary structure of a constructed 3D model is used. The model is constructed from the PSI-BLAST [4] alignment using comparative modeling techniques as described in another abstract. The program DSSP [5] is then used to extract the secondary structure.

1. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.
2. Baldi P. et al. (1999) Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15, 937-946.
3. Karplus K. et al. (2001) What is the Value Added by Human Intervention in Protein Structure Prediction? Proteins Suppl 5, 86-91.
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
5. Kabsch W., Sander C. (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.

CaspIta (P0108) - 133 predictions: 70 3D, 63 SS

Modeling Protein Structures Using a Combination of Biological Information, Fold Recognition and Loop Modeling Methods

S. C. E. Tosatto1, A. Cestaro1, E. Bindewald2, F. Fogolari3 and G. Valle1 1 - CRIBI Biotechnology Centre, Universita' di Padova 2 - Bioinformatics Center, Buffalo University 3 - Science and Technology Dept., Universita' di Verona [email protected]

The first step in predicting a protein structure consists in searching for database information on the target sequence. Four cycles of PSI-BLAST [1] are performed on the NR database of protein sequences with an e-value inclusion threshold of 0.005. Only sequences with at least 30% sequence identity, an e-value <= 10^-10 and an alignment length of at least 2/3 of the target sequence are considered. In difficult cases the e-value inclusion threshold in PSI-BLAST is relaxed to 0.02 and sequences with an e-value of up to 10^-5 that are aligned over more than half of the target sequence are considered. The domain structure of the target is searched in parallel on the PFAM database [2].
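The hit selection thresholds can be written as a small filter. The `Hit` record and the function name are hypothetical, and the sketch makes one assumption the text leaves open: in the difficult-case protocol the 30% identity requirement is kept and only the e-value and coverage thresholds are relaxed.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    identity: float       # sequence identity as a fraction (0..1)
    evalue: float         # PSI-BLAST e-value of the hit
    aligned_length: int   # number of target residues covered

def keep_hit(hit, target_length, difficult=False):
    """Apply the standard or the relaxed (difficult-case) thresholds."""
    if hit.identity < 0.30:   # assumed to apply in both protocols
        return False
    if difficult:
        return hit.evalue <= 1e-5 and hit.aligned_length > target_length / 2
    return hit.evalue <= 1e-10 and hit.aligned_length >= 2 * target_length / 3

strong = Hit(identity=0.35, evalue=1e-12, aligned_length=80)
weak = Hit(identity=0.35, evalue=1e-7, aligned_length=60)
```

For a 100-residue target, `strong` passes the standard thresholds, while `weak` is accepted only when the difficult-case protocol is used.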

If the results allow the unambiguous identification of a function for the target, the existence of structures associated with this function and their degree of homology to the target are verified. For easier targets this is accomplished by using the previously generated sequence profile to scan the PDBAA database with PSI-BLAST (i.e. PDBBLAST protocol). If the target is associated with a PFAM domain, the alignment is analyzed with respect to conserved and/or functionally important residues and the position of gaps; the presence of a representative structure is also verified. In order to accumulate more functional information, the target is scanned for PROSITE patterns [3]. Only patterns compatible with the characteristics of the target are considered, e.g. all eukaryotic patterns are excluded in bacterial targets.

Once a plausible function has been established, the relevant structural and biological features (e.g. substrate, interacting cofactors and/or ions, cellular localization) are retrieved from the literature. For targets where no template structure has been identified with sufficient confidence, the program MANIFOLD [4] is used to suggest possible folds.

MANIFOLD is a fold recognition program based on sequence, secondary structure, accessibility and functional similarity. For sequence similarity, it uses the output of PDBBLAST performed on the SCOP 1.53 [5] database of domain structures with a very large e-value cutoff (200). The consensus of secondary structure, described in a different abstract, is compared to the predicted secondary structure of all SCOP 1.53 domains using the segment overlap criterion. The sequence and secondary structure similarities are augmented with accessibility, predicted with ACCPRO [6], and function, as expressed by the enzyme code (EC) number (where applicable), to rank the structures. Each of the four features is weighted using a non-linear scoring function that mimics the behaviour of a neural network.

The top 20 results from MANIFOLD are compared to the previously collected functional information. The templates are re-ranked based on the characteristics shared with the target, mainly function and interacting molecules. The SCOP classes of all candidate templates are analyzed to establish the functional similarity. If a sufficiently similar structure is found, a template is selected. If the SCOP classes are very diverse and the target function uncertain, the search is abandoned without selecting a template.

Depending on how the template was selected, one of three protocols is used to align target and template sequences. In all cases, the automatically generated alignment is inspected visually by locating proposed insertions and deletions on the template structure. Insertions and deletions are shifted in order to optimize their position relative to secondary structure elements.

For comparative modeling targets, i.e. those easily detected by PDBBLAST, 4 rounds of PSI-BLAST search are first performed for the target sequence against the NR database of protein sequences. The sequences aligned in this way are used to build a hidden Markov model (HMM) using the HMMer package [7]. This HMM is then used to align target and template.

Fold recognition targets which are believed to be related on the sequence level are subjected to an extended version of the previous protocol. Rather than constructing only a HMM of the target sequence, this is also done with the template sequence. Different similarity cutoffs for inclusion in the HMMs are empirically used to select a sequence alignment that will conserve most of the secondary structure elements.

The third and most difficult case is for fold recognition targets which are either believed to be only related on the secondary structure level or for which the previous protocol fails to produce a satisfactory alignment. In these cases the alignment is constructed from the output of the MANIFOLD program. MANIFOLD uses a global alignment heuristic based on optimizing the segment overlap measure of the secondary structure elements. In practice, these alignments are a starting point for manual intervention, due to their fragmented nature.

The model is generated using the package HOMER [8]. This involves the following steps. First a raw model of the conserved parts is constructed from the template. The backbone 3D coordinates of target amino acids aligned with the template are copied. The coordinates of conserved side chains are modeled at this stage, with only the Cb atoms being copied for all others. SCWRL [9] is used to place all missing side chain atoms. Some basic checks are performed on the model, e.g. the RAPDF [10] and solvation [11] energies are calculated and the model is inspected visually, to exclude obvious errors.
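The raw-model step can be illustrated with a minimal Python sketch. It is not HOMER's code: the function name `build_raw_model`, the dictionary-based residue representation and the atom names are assumptions made for illustration. For each aligned position it copies the template backbone, copies the whole side chain when the residue type is conserved, and otherwise keeps only the Cb atom.

```python
BACKBONE = ("N", "CA", "C", "O")

def build_raw_model(target_aln, template_aln, template_residues):
    """Copy coordinates from template to target along a gapped alignment.

    target_aln / template_aln: equal-length strings with '-' for gaps.
    template_residues: list of dicts {atom_name: (x, y, z)}, one per
    template residue in sequence order (hypothetical representation).
    Returns one entry per target residue: a dict of copied atoms, or
    None where the target residue is unaligned (left for loop modeling).
    """
    if len(target_aln) != len(template_aln):
        raise ValueError("alignment rows must have equal length")
    model = []
    t_idx = 0  # index of the next template residue
    for tgt, tpl in zip(target_aln, template_aln):
        atoms = None
        if tpl != "-":
            atoms = template_residues[t_idx]
            t_idx += 1
        if tgt == "-":
            continue  # deletion: the template residue has no target partner
        if atoms is None:
            model.append(None)  # insertion: no template coordinates
        elif tgt == tpl:
            model.append(dict(atoms))  # conserved residue: copy all atoms
        else:
            keep = BACKBONE + ("CB",)
            model.append({n: xyz for n, xyz in atoms.items() if n in keep})
    return model
```

Entries left as `None`, and side chains truncated at Cb, are the parts that a side-chain placement program and the loop modeling step would have to fill in afterwards.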

Insertions and deletions are reconstructed after raw model generation using an enhanced version of the fast divide & conquer loop modeling method [12]. This method uses a database of pre-calculated loop fragments derived from a Ramachandran plot distribution of (phi,psi) torsion angles to generate a set of candidate loops. Candidates with steric clashes or amino acids in prohibited areas of the Ramachandran distribution are filtered out. The remaining solutions are then ranked according to a combination of RAPDF energy and geometric fit to the anchor regions. Enhancements to the published protocol consist of adding side chains with SCWRL to the top 20 solutions after ranking, before performing a restrained local minimization, typically 500 cycles of conjugate gradient, for each solution. These solutions are then re-ranked according to their CHARMM [13] force field energy and inspected visually to select the most appropriate solution. The extended protocol was found to improve the overall quality of the results. After repeating the above procedure for all insertions and deletions, the model is subjected to a limited local minimization with CHARMM, typically 100 steps of steepest descent, to reduce bad contacts. The final model is again inspected visually before being submitted.
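The filter-then-rank stage of the loop protocol can be sketched as follows. This is an illustration under stated assumptions, not the published method: the class `LoopCandidate`, the function `rank_loops`, the linear combination `energy + rmsd_weight * anchor_rmsd` and the caller-supplied `is_allowed` Ramachandran test are all hypothetical, since the abstract does not state how energy and geometric fit are combined.

```python
from dataclasses import dataclass

@dataclass
class LoopCandidate:
    name: str
    energy: float       # knowledge-based energy, lower is better
    anchor_rmsd: float  # fit to the anchor regions, lower is better
    has_clash: bool     # True if the loop clashes with the rest of the model
    phi_psi: list       # (phi, psi) pairs in degrees, one per loop residue

def rank_loops(candidates, is_allowed, rmsd_weight=1.0, top_n=20):
    """Discard clashing or Ramachandran-prohibited loops, rank the rest.

    is_allowed(phi, psi) -> bool is supplied by the caller; the scoring
    formula below is an assumed stand-in for the real combination.
    """
    kept = [
        c for c in candidates
        if not c.has_clash
        and all(is_allowed(phi, psi) for phi, psi in c.phi_psi)
    ]
    kept.sort(key=lambda c: c.energy + rmsd_weight * c.anchor_rmsd)
    return kept[:top_n]
```

The `top_n=20` default mirrors the 20 solutions that are carried forward to side-chain placement and minimization in the text.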

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Bateman A. et al. (2000) The Pfam protein families database. Nucleic Acids Res. 28, 263-266.
3. Falquet L. et al. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res. 30, 235-238.
4. Bindewald E. et al. (2002) In preparation.
5. Murzin A.G. et al. (1995) SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.
6. Pollastri G. et al. (2002) Prediction of coordination number and relative solvent accessibility in proteins. Proteins 47(2), 142-53.
7. Eddy S.R. (1998) Profile Hidden Markov Models. Bioinformatics 14 (9), 755-763.
8. Tosatto S.C.E. et al. (2002) In preparation.
9. Bower M.J. et al. (1997) Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. J. Mol. Biol. 267, 1268-1282.
10. Samudrala R., Moult J. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275, 895-916.
11. Tosatto S.C.E. et al. (2002) A divide and conquer approach to fast loop modeling. Protein Eng. 15(4), 279-286.
12. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.
13. MacKerell J.A.D. et al. (1998) All-hydrogen empirical potential for molecular modeling and dynamics studies of proteins using the CHARMM22 force field. J. Phys. Chem. B 102, 3586-3616.

CBC-FOLD (P0008) - 151 predictions: 151 3D

What’s so Good About Real Proteins?

Ajay K. Royyuru, Ruhong Zhou, Prasanna Athma, B. David Silverman, Gelonia Dent and Rosalia Tungaraza Computational Biology Center, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA [email protected]

The basic protocol used in our CASP5 effort was to create initial candidate structures with Comparative Modeling or Fold Recognition processes and to refine them with enhanced sampling methods using an all-atom force field with a continuum solvent model. A newly developed hydrophobic profiling procedure was used to filter candidates before and after the refinement.

For targets with high sequence similarity to existing proteins in PDB, we used Comparative Modeling. The target sequence was aligned to each sequence in the pdb_select95 database using PSI-BLAST. The alignment score was based on the BLOSUM62 substitution matrix, and the sequence with the highest score was selected as the best template. To generate an all-atom model, we copied coordinates of the backbone atoms of aligned residues from the template to the target. Sidechain coordinates were copied if the atom type in the target was identical or very similar to that in the template. Coordinates of remaining sidechain atoms and gaps in the template were generated with random initial locations and the simulated annealing protocol in X-PLOR. The candidate structure was then subjected to hydrophobic profiling and refinement. For targets without sufficient sequence similarity, we used a Fold Recognition process. We used 3DJury alignments from the CAFASP3 server to build all-atom models corresponding to every alignment, and scored and refined them using the same protocol.

One interesting question about protein structures is: What is so good about real proteins? Recently, a detailed spatial profiling of hydrophobicity in native proteins revealed that the shapes of the 2nd order moment profiles are comparable, yielding a relatively constant quantity called the hydrophobic ratio [1]. The shape of the hydrophobic profile is similar for 5387 non-redundant globular protein domains in PDB, with hydrophobic ratio 0.71 +/- 0.08. This surprising and interesting feature was thus used in our new procedure, Hydro, to screen candidate structures. Furthermore, a new hydrophobic score has been defined to distinguish native-like proteins from decoy structures. Thorough tests on three widely used decoy sets showed very encouraging results [2]. We also examined the correlations between hydrophobicity and proximity to the surface of residues along the sequence. The hydrophobic moment profiling and hydrophobic score provided useful complementary information to the force field calculations and were used as a filtering tool before and after refinement.

The candidate structures were then refined with enhanced conformational space sampling using the Replica Exchange Method [3]. The method utilizes high-temperature walkers to cross energy barriers that would otherwise be difficult for low-temperature walkers to overcome. The OPLSAA force field was adopted with a continuum solvent model, the Surface Generalized Born (SGB) model. The following procedure was used: (a) conjugate gradient minimization followed by a short molecular dynamics equilibration at 310 K; (b) extensive conformational space searching with the Replica Exchange Method using 21 replicas at temperatures ranging from 300 K to 500 K; (c) clustering of the conformations sampled at 310 K to retain structures that differ by at least 1 Å (1st clustering); (d) minimization of the structures from each cluster bin; (e) ranking of the structures by OPLSAA/SGB energy to identify the one with the lowest energy; (f) examination of the ensemble of structures sampled at various temperatures to identify those within 1 Å of the lowest-energy structure identified above, followed by a second clustering to retain structures that differ by >0.25 Å (2nd clustering); (g) minimization of the structures from the 2nd clustering for 1000 steps; (h) final ranking of the structures by OPLSAA/SGB energy; (i) analysis of the five lowest-energy structures through structural alignment to identify distinct and optimal models.
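The exchange step at the heart of the Replica Exchange Method can be written down compactly. The sketch below is illustrative only: the abstract does not say how the 21 temperatures were spaced, so the geometric ladder is an assumption, and the function names are hypothetical. The acceptance rule itself is the standard Metropolis criterion for swapping two replicas, min(1, exp((beta_i - beta_j)(E_i - E_j))).

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def temperature_ladder(t_min, t_max, n):
    """Geometrically spaced temperatures from t_min to t_max (assumed spacing)."""
    ratio = (t_max / t_min) ** (1.0 / (n - 1))
    return [t_min * ratio ** i for i in range(n)]

def swap_probability(e_i, e_j, t_i, t_j):
    """Metropolis probability of exchanging replicas i and j.

    e_i, e_j: potential energies (kcal/mol) of the conformations
    currently held at temperatures t_i and t_j (kelvin).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    delta = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if delta >= 0 else math.exp(delta)

ladder = temperature_ladder(300.0, 500.0, 21)
```

A swap is always accepted when the colder replica holds the higher-energy conformation; otherwise it is accepted with a probability that decays as the temperature gap or the energy gap grows, which is why neighbouring temperatures are usually the ones exchanged.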

Predictions for 23 targets were obtained through Comparative Modeling and 35 through Fold Recognition.

1. Silverman B. D. (2001), Hydrophobic moments of protein structures: spatially profiling the distribution. Proc. Natl. Acad. Sci. 98, 4996-5001.
2. Zhou R. and Silverman B. D. (2002), Detecting native protein folds among large decoy sets with hydrophobic moment profiling, Pac. Symp. Biocomput. 02, 673-84.
3. Zhou R., Berne B. J. and Germain R. (2001) The free energy landscape for β-hairpin folding in explicit water, Proc. Natl. Acad. Sci. 98, 14931-14936.

CBRC (P0041) - 385 predictions: 279 3D, 105 SS, 1 DR

Integrating a New Fold Recognition Method with an Exhaustive Molecular Modeling System: FORTE1 and FORTE-SUITE

K. Tomii1, T. Hirokawa1, T. Noguchi1, A. Suenaga2 and Y. Akiyama1 1 - Computational Biology Research Center National Institute of Advanced Industrial Science and Technology, Japan 2 – Bioinformatics Group, RIKEN Genomic Science Center, Japan [email protected]

The CBRC team attempted to submit TS/AL prediction results for all CASP5 targets. Basically, our prediction method is a pipeline composed of two or three steps: (1) fold recognition and alignment by the new FORTE1 program, (2) exhaustive 3-D structure modeling by the FORTE-SUITE system and, if needed, (3) molecular dynamics simulation for further structure refinement.

(1) Fold recognition and alignment by FORTE1
We have devised a novel profile-profile comparison technique to increase the sensitivity of fold recognition and improve alignment accuracy. The FORTE1 program by Tomii et al. [1] measures the similarity between two profiles differently from other published methods, such as FFAS [2] and the method developed by Yona and Levitt [3], which exploit alignment information. The FORTE1 program utilizes the sequence profiles of both the target and the templates to predict the structure of the target sequence. The template sequences were derived from the ASTRAL [4] (version 1.59) 40% identity list and from selected PDB entries that are not registered in the SCOP (1.59 release) database. With a powerful computational resource (the Magi cluster, http://www.cbrc.jp/magi/), we performed up to 20 PSI-BLAST iterations against the NCBI non-redundant database to prepare the profiles of both target and templates. In profile comparisons the global-local algorithm was employed to build an optimal alignment of a query sequence profile onto a template profile. The statistical significance of each alignment score was estimated by calculating a Z-score with a simple log-length correction. The candidate templates were sorted by Z-score, and the prediction results were then submitted in the AL format.
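The ranking of templates by length-corrected Z-score can be sketched in a few lines of Python. The abstract does not give the form of the log-length correction, so the version below, which divides each raw alignment score by the natural logarithm of the template length before standardizing, is purely an assumption, as are the function names.

```python
import math
import statistics

def z_scores(raw_scores, template_lengths):
    """Standardize alignment scores after an assumed log-length correction."""
    if any(n < 2 for n in template_lengths):
        raise ValueError("template lengths must be at least 2")
    corrected = [s / math.log(n) for s, n in zip(raw_scores, template_lengths)]
    mu = statistics.mean(corrected)
    sigma = statistics.pstdev(corrected)
    if sigma == 0:
        return [0.0] * len(corrected)
    return [(c - mu) / sigma for c in corrected]

def rank_templates(names, raw_scores, template_lengths):
    """Return (name, Z-score) pairs sorted from best to worst."""
    zs = z_scores(raw_scores, template_lengths)
    return sorted(zip(names, zs), key=lambda pair: pair[1], reverse=True)
```

With this correction, two templates with the same raw score are separated by length: the shorter one receives the higher Z-score, which counteracts the tendency of long templates to accumulate score by chance.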

(2) 3D structure modeling and evaluation
3D models were built with the target-parent(s) alignments from FORTE1. Two molecular modeling programs, MODELLER [5] and SegMod [6], were used. MODELLER optimizes probability density functions for each of the restraint features of the model, while SegMod performs segment-match modeling using a database of known protein X-ray structures. For longer-loop cases, we gave priority to the SegMod results. Human inspection, multiple sequence alignments and secondary structure predictions were also used to refine the target-parent(s) alignments as far as possible. The modeling process was divided into two categories by FORTE1 Z-score level: (a) for CM and easy FR targets, simple modeling was done using only promising parents with very high Z-scores, and (b) for FR/NF targets, exhaustive modeling was performed utilizing available parents (maximum 100 parents each for both modeling programs) with a medium Z-score level, and final models were selected based on the structural quality score (q-score) calculated by Verify3D [7]. For the latter process, Hirokawa developed a semi-automatic exhaustive modeling and evaluation system on a parallel computer environment, called FORTE-SUITE. The q-score ordering was in some cases overridden by human intervention, when related knowledge was available from the literature or other bioinformatics analysis results.

(3) Molecular dynamics simulation for further structure refinement
When the q-score of the model was not sufficiently high (typically not greater than 0.5), we tried molecular dynamics calculations in order to obtain a structure with a better q-score. In all simulations, the parm96 force field was adopted and an explicit water model was used. Non-bonded interactions were calculated without cutoff approximation using the Barnes-Hut tree algorithm. For CASP5, we employed two different parallelized MD programs on different parallel computers. The first program, MolTreC2 [8], ran on the Magi cluster (976 Gflops in total, 30 Gflops on average for a simulation) at CBRC and was operated by Noguchi and Akiyama. The second program, a modified version of AMBER6.0 [9] (AMBER for the MDM special-purpose computer), ran on a PC with two MD-Grape2 boards (16 chips, 240 Gflops) and was operated by Suenaga at RIKEN. With the MolTreC2 simulation, we tried 14 CASP5 targets, and 7 models for 5 of these targets were submitted because they showed improved q-scores (T0132_4, T0140_5, T0153_2, T0155_2, T0180_{1,2,3}). The simulations were typically run for one or two nanoseconds. With the AMBER simulation, we tried three CASP5 targets, and 3 models for 2 of these targets were submitted (T0129_4, T0135_{1,2}).

(4) Secondary structure prediction
We also submitted SS predictions for all CASP5 target proteins based on our fold recognition techniques. The SS-FORTE program was newly developed for CASP5 and uses 50 secondary structures of template proteins suggested by FORTE1, with sequence weighting according to the FORTE1 Z-score. We also combined the result from our previous program, the New SSThread [10], which utilizes an averaged output from other threading methods. The secondary structure predictions were mainly done by Noguchi.
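The idea of deriving a secondary structure prediction from Z-score-weighted templates can be illustrated with a simple weighted vote. This is a sketch of the general technique, not SS-FORTE itself: the function name, the three-state alphabet (H, E, C), the use of '-' for unaligned positions and the fallback to coil are assumptions.

```python
from collections import defaultdict

def consensus_secondary_structure(template_ss, z_scores, length):
    """Weighted per-position vote over template secondary structures.

    template_ss: strings of the target's length over 'H', 'E', 'C',
    with '-' where a template is not aligned to the target.
    z_scores: one weight per template; negative weights count as zero.
    """
    prediction = []
    for pos in range(length):
        votes = defaultdict(float)
        for ss, z in zip(template_ss, z_scores):
            state = ss[pos]
            if state != "-":
                votes[state] += max(z, 0.0)
        if votes:
            # sorted() makes ties resolve deterministically (alphabetically)
            prediction.append(max(sorted(votes), key=votes.get))
        else:
            prediction.append("C")  # assumed fallback when nothing is aligned
    return "".join(prediction)
```

For three templates "HHHC-", "HHEC-" and "EEEC-" with weights 10, 5 and 6, the vote at the third position is H = 10 against E = 11, so a strongly scored minority template can still be outvoted by two weaker ones that agree.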

1. Tomii K. et al. (2002) Fold recognition using FORTE1 server, in CASP5.
2. Rychlewski L. et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science. 9 (2), 232-241.
3. Yona G. et al. (2002) Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 315 (5), 1257-1275.
4. Chandonia J.M. et al. (2002) ASTRAL compendium enhancements. Nucleic Acids Res. 30 (1), 260-263.
5. Sali A. and Blundell T.L. (1993) Comparative protein modeling by satisfaction of spatial restraints, J. Mol. Biol. 234, 779-815.
6. Levitt M. (1992) Accurate modeling of protein conformation by automatic segment matching, J. Mol. Biol. 226, 507-533.
7. Luthy R., Bowie J.U. and Eisenberg D. (1992) Assessment of protein models with three-dimensional profiles, Nature 356, 83-85.
8. Misoo K, Akiyama Y. et al. (2000) Development of Molecular Dynamics Programs for Protein with a Parallelized Barnes-Hut Code, Proc. HPC-Asia 2000, 1103-1111.
9. Case D.A., et al. (1999) Amber 6.0, University of California San Francisco.
10. Noguchi T. et al. (2001) Prediction of Protein Secondary Structure Using the Threading Algorithm and Local Sequence Similarity, Research. Comm. in Biochem., Cell & Mol. Biol., 5, 115-131.

CBSU (P0417) - 173 predictions: 173 3D

An Attempt to Improve over the CAFASP Prediction Servers by Means of Manual Intervention

D. Ripoll and J. Pillardy Computational Biology Service Unit, Cornell Theory Center, Cornell University; Rhodes Hall, Ithaca NY 14853-3801 [email protected]

Structure prediction for the CASP5 targets was carried out using all the sequence and structure information that we were able to gather. The principal source of structural information for each target was the CAFASP summary web page [1]. The templates used to generate our models were selected using the following conditions: (a) CAFASP predictions from different servers that consistently pointed to a structure (or domain of a structure) from PDB [2] were preferred; (b) if the servers provided predictions with a low level of confidence, only those templates that were consistent with the secondary structure prediction of the respective target sequence were analyzed further; (c) server predictions showing similarities to isolated fragments only were discarded. For each template, attempts were made to improve the sequence alignments provided by the servers by generating all-atom 3D models [3] with reasonable loops and hydrophilic/hydrophobic arrangements of the residues. In addition, during model generation for some targets, we tried to use information from structural neighbors of the template protein. The Combinatorial Extension method [4] was used to obtain the corresponding sequence alignments of proteins having low sequence similarity and low Cα rms deviations with the template. This information was used to attempt to improve the alignment between the target and template sequences obtained from the servers.

1. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/index.html
2. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. (2000). The Protein Data Bank. Nucleic Acids Research 28, 235-242.
3. Šali A., Blundell T.L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
4. Shindyalov I.N., Bourne P.E. (1998). Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11(9), 739-747.

Celltech (P0028) - 347 predictions: 347 3D

Error Detection in Sequence-Structure Alignment Using HARMONY3

J. Shi Celltech R&D Inc, 1631 220th Street SE, Bothell, WA 98021, USA [email protected]

MOTIVATION: In comparative modeling practice, the accuracy of the sequence-structure alignment is the dominant factor in the quality of the final models [1; 2; 3]. The initial sequence-structure alignment for distant homologues, even when produced by experienced modeling experts, often contains erroneous regions. In practice, predictors can identify alignment errors that have a significant impact on model quality from the problems they find in the 3D models built from the alignment, and can thus iteratively refine the alignments and improve the quality of the models [4]. Unfortunately, this is a subjective and time-consuming process that requires detailed insight into protein structures, which makes it unsuitable for novices or large-scale modeling efforts. No methods have yet been reported to automate this process.

OBJECTIVES: An automated procedure, HARMONY3, has been developed recently in Blundell’s lab to tackle this problem [5]. Our key objectives in CASP5 are to validate the capability of HARMONY3 in identifying erroneous alignment regions that have a significant impact on model quality, and to test the potential of integrating this procedure into a fully automated comparative modeling platform.

METHODS: In the CASP5 experiment, we used FUGUE [6] to identify structural templates of the target, and to produce the sequence-structure alignment. HARMONY3 was then used to build models (using MODELLER [7]) and to predict problematic alignment regions (see below for a detailed description of the HARMONY3 methodology). If any problematic regions were found, the alignment was manually adjusted and then passed to HARMONY3 again, forming a refinement iteration. Due to limited resources (only 1 predictor available), the number of iterations was limited to 2, and no more than 2 hours of human intervention were spent on each target. As a result, neither literature nor functional information was used for the predictions.

Thus, the performance of HARMONY3 can be evaluated by the comparison between our results and the results of the FUGUE servers, which are registered with CAFASP3 under the names of FUGUE2 and FUGUE3. We are interested to see whether 2-hour human intervention on the problematic sequence-structure alignment regions, as predicted by HARMONY3, could improve the accuracy of the final models. If the HARMONY3 predictions were incorrect, our results should be no better than the results of the FUGUE servers. We are also interested to see whether the structural templates were indeed incorrect when HARMONY3 indicated global alignment errors, even after manual refinement.

The HARMONY3 protocol [5] consists of the following steps. (a) Five models are generated from a given sequence-structure alignment using MODELLER [7]. (b) The observed local structural environment, as defined by main-chain conformation, solvent accessibility and hydrogen-bonding status, is calculated from the models for each residue. (c) The observed amino acid distribution at each sequence position is derived from an alignment between the target sequence and its sequence homologues collected by PSI-BLAST [8]. (d) An amino acid propensity score M is calculated from the agreement between the observed local environment of each residue and the expected value based on known structures. (e) An amino acid substitution score N is calculated from the agreement among the observed amino acid distribution and two expected distributions, one being predicted from the local environment of the models and the environment-specific substitution tables [6], and the other from an environment-independent substitution table. (f) A local alignment flexibility score F is calculated for each position of the given sequence-structure alignment. (g) Local alignment errors are predicted by evaluating the M, N and F scores and averaging the results over the 5 models.

The M and N scores were used to assess the correctness of the model. It has been reported that the amino acid propensity score M could be used to describe whether a model is reasonable from the perspective of sequence-structure compatibility [9]. However, this score uses only the structural information derived from the models, and makes no use of the readily available information from the sequence homologues of the target. In HARMONY3, the amino acid substitution score N was introduced to take into account such information. The assumption was that the structural constraints on amino acid substitutions provide extra predictive power on the amino acid distribution at each sequence position [6]. Thus, if the local environment of the model is correct, the position-specific amino acid distribution derived from sequence homologues of the target should agree better with the expected distribution derived from the environment-specific substitution tables, than with that derived from the environment-independent substitution table.

The local alignment flexibility score F was introduced to account for the observation that local alignment errors are more likely to occur in regions of low sequence identity and in regions where many insertions/deletions are found. Furthermore, there is a problem with the M and N scores: the empirically derived structural restraints cannot be applied to functional sites, where amino acid conservation/substitution is mainly constrained by functional reasons. The score F can also “mask” functional sites and minimize such problems, because functional sites are usually well conserved.

The combination of the M, N and F scores indicated the problems in a model that were most likely introduced by alignment errors. Averaging over 5 models further reduced the noise from random modeling errors. The final result was mapped to the sequence-structure alignment to indicate erroneous alignment regions.
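The final averaging and mapping step can be sketched as follows. How HARMONY3 combines the M, N and F scores is not described here, so the sketch starts from an already combined per-position error score for each model (higher meaning more suspicious); the function names, the threshold and the minimum region length are hypothetical.

```python
def average_over_models(per_model_scores):
    """Average per-position error scores over several models."""
    length = len(per_model_scores[0])
    if any(len(scores) != length for scores in per_model_scores):
        raise ValueError("all models must cover the same positions")
    n_models = len(per_model_scores)
    return [sum(m[i] for m in per_model_scores) / n_models
            for i in range(length)]

def flag_regions(scores, threshold, min_len=3):
    """Return (start, end) pairs, end exclusive, of runs above threshold."""
    regions = []
    start = None
    for i, s in enumerate(scores):
        if s > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    if start is not None and len(scores) - start >= min_len:
        regions.append((start, len(scores)))
    return regions
```

Averaging suppresses a spike that appears in only one of the five models, while a stretch that scores badly in every model survives and is reported as a candidate misaligned region.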

1. Moult J., et al. (1999). Critical assessment of methods of protein structure prediction (CASP): round III. Proteins Suppl(3), 2-6.
2. Venclovas C., et al. (1999). Some measures of comparative performance in the three CASPs. Proteins Suppl(3), 231-7.
3. Johnson M.S., et al. (1994). Knowledge-based protein modeling. Crit Rev Biochem Mol Biol 29(1), 1-68.
4. Williams M.G., et al. (2001). Sequence-structure homology recognition by iterative alignment refinement and comparative modeling. Proteins Suppl(5), 92-7.
5. Shi J. & Blundell T.L. (unpublished).
6. Shi J., et al. (2001). Fugue: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310(1), 243-57.
7. Sali A., et al. (1993). Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234(3), 779-815.
8. Altschul S.F., et al. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25(17), 3389-402.
9. Luthy R., et al. (1992). Assessment of protein models with three-dimensional profiles. Nature 356(6364), 83-5.

CHEN-WENDY (P0264) - 37 predictions: 37 3D

A Newtonian Force-Based Algorithm for Mixed-Integer Optimization in Comparative Protein Modeling

J.L. Pellequer1, G. Imbert1, O. Pible1 and S-w. W. Chen2 1- CEA Valrhô – Centre de Marcoule – DSV/DIEP – Unit of post-genomique Biochemistry and Nuclear Toxicology. BP17171 – 30207 Bagnols sur Cèze – France, 2- 13 ave. de la Mayre – 30200 Bagnols sur Cèze – France [email protected]

Our comparative modeling approach is based on semi-automatic prediction schemes with permanent user intervention. Putative template molecules were identified using the CAFASP3 web server, with strong emphasis placed on threading methods. Protein sequences of the identified putative templates were aligned with each other using CLUSTALW/T-COFFEE. When the resulting multiple alignments were unsatisfactory, we applied a newly developed procedure that collects all protein sequences of the putative templates and re-aligns them to select the most appropriate one. Subsequently, a pair-wise alignment was generated by taking into account the locations of secondary structure elements of the selected template. Indel locations were identified through an in-depth analysis of the three-dimensional structure of the selected template. In modelling CASP5 targets, most of the time was spent on the sequence alignment.

The positions of side-chain atoms were placed using our automatic program. Replaced side chains were clustered and optimised in two steps: first, rotamers (from Tufféry et al. 1991. J. Biomol. Struct. Dynam. 8:1267-1289) of side chains were optimised at a cluster level, and second, the chi dihedral angles of each side chain were minimized at a residue level. Most CASP5 targets resulted in one large cluster (>50 residues) and several other small ones. We employed the Powell algorithm to perform the minimization. We chose the non-bonded energy (in the 12-6-1 format) with CHARMM-22 all-atom force field parameters as the scoring function.
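A 12-6-1 non-bonded energy of the kind used as the scoring function combines a Lennard-Jones 12-6 term with a Coulomb 1/r term. The sketch below uses the CHARMM-style functional form with well depth eps and minimum-energy distance Rmin; it is not the authors' program, the function names and tuple layout are hypothetical, and exclusions for bonded atom pairs and any distance cutoffs are deliberately omitted.

```python
import math

COULOMB = 332.0716  # converts e^2/angstrom to kcal/mol

def pair_energy(r, eps_ij, rmin_ij, q_i, q_j):
    """Lennard-Jones 12-6 plus Coulomb energy for one atom pair (kcal/mol)."""
    ratio6 = (rmin_ij / r) ** 6
    e_lj = eps_ij * (ratio6 * ratio6 - 2.0 * ratio6)
    e_el = COULOMB * q_i * q_j / r
    return e_lj + e_el

def nonbonded_energy(atoms):
    """Sum pair energies over all atom pairs.

    atoms: list of (xyz, eps, rmin_half, charge) tuples, where eps is the
    positive well depth and rmin_half is half the minimum-energy distance.
    Pair parameters use the usual combination rules: geometric mean for
    eps and the sum of the two rmin_half values for Rmin.
    """
    total = 0.0
    for a in range(len(atoms)):
        xa, ea, ra, qa = atoms[a]
        for b in range(a + 1, len(atoms)):
            xb, eb, rb, qb = atoms[b]
            r = math.dist(xa, xb)
            total += pair_energy(r, math.sqrt(ea * eb), ra + rb, qa, qb)
    return total
```

A derivative-free optimiser such as Powell's method only needs a function of this kind that maps a set of chi angles, via the rebuilt coordinates, to a single energy value.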

Loop closure was performed partly automatically. In a similar manner to side-chain minimization, the backbone conformation of loops was also determined using the Powell algorithm. The side-chain conformations of loop residues were taken into account by optimising their rotamers. Identification of edged residues and conformational refinement of loops for residue deletions and insertions were performed manually using computer graphics. Additional minor refinements were performed by XPLOR energy minimization in the CHARMM-22 force field. To avoid over-minimization, the convergence criterion was set to between 1 and 4 kcal/mol/Å, while the Coulombic interaction was turned off for minimizing side-chain atoms. Each model was visually scrutinized to identify potential conflicts in side-chain conformations and to maximize side-chain-to-main-chain hydrogen bonds.

CHIMERA (P0153) - 94 predictions: 94 3D

Comparative Modeling Using CHIMERA Modeling System

Mayuko Takeda-Shitaka, Chieko Chiba, Hirokazu Tanaka, Daisuke Takaya and Hideaki Umeyama Kitasato University

[email protected]

Our laboratory registered group CHIMERA [1, 2] in CASP5 and group FAMS [3, 4] in CAFASP3. The procedure of group FAMS, a fully automatic modeling system, is essential for large-scale genome modeling. In some cases, however, procedures using human intervention are more accurate than fully automated modeling procedures. Therefore we tried to construct more accurate models with human intervention.

We constructed 3D model structures using the CHIMERA modeling system, which predicts protein structure based on homology modeling methods using more than one reference protein. The modeling procedure is 1) selection of reference proteins, 2) alignments, and 3) construction of model structures. CHIMERA is a partially automatic modeling system that enables human intervention at the necessary stages.

Selection of reference proteins
Searches for reference proteins are based on results shown by group FAMS (see abstract of group FAMS). We select reference proteins according to the target information given beforehand by the CASP5 organizers, related papers, secondary structure predictions and the CAFASP3 meta-server.

Alignments
We generate alignments taking biologically important regions, secondary structure predictions, homology, the hydrophobic core etc. into consideration. Multiple templates are used when possible.

Construction of model structures
First, the main chain is constructed by loop searches if necessary. Second, side chains are replaced by suitable amino acids, keeping the original side-chain torsional angles where possible. Short contacts within 2.0 angstrom are removed.
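Detecting short contacts is a simple distance test. The following sketch lists atom pairs closer than the 2.0 angstrom limit; it only illustrates the check, not how CHIMERA removes the contacts, and the function name, the atom labels and the explicit set of bonded pairs to ignore are assumptions.

```python
import math
from itertools import combinations

def short_contacts(atoms, cutoff=2.0, bonded=frozenset()):
    """List non-bonded atom pairs closer than cutoff (angstrom).

    atoms: dict mapping an atom label to its (x, y, z) coordinates.
    bonded: set of frozenset({label_a, label_b}) pairs to skip, since
    covalently bonded atoms are always closer than the cutoff.
    """
    hits = []
    for (a, xa), (b, xb) in combinations(atoms.items(), 2):
        if frozenset((a, b)) in bonded:
            continue
        d = math.dist(xa, xb)
        if d < cutoff:
            hits.append((a, b, d))
    return hits
```

The all-pairs loop is quadratic in the number of atoms, which is acceptable for a sketch; a production check would normally use a spatial grid or neighbour list.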

The accuracy of the models depends on the selection of reference proteins and on the alignments. If the reference proteins or alignments are wrong, the model structures will be wrong even though the modeling software is reliable. Therefore, we laid emphasis on these steps. The results show a tendency for us to select the same reference proteins as group FAMS in high-homology cases, and different ones in low-homology cases. Even when the reference proteins were the same, we manually modified the alignments to maintain active site residues, secondary structures, the hydrophobic core etc.

1. Yoneda, T., Komooka, H. and Umeyama, H. (1997) A computer modeling study of the interaction between tissue factor pathway inhibitor and blood coagulation factor Xa. J. Protein Chem. 16, 597-605.
2. Takeda-Shitaka, M. and Umeyama, H. (1998) Effect of exceptional valine replacement for highly conserved Ala55 in serine proteases. FEBS Lett. 425, 448-452.
3. Ogata, K. and Umeyama, H. (2000) An automatic homology modeling method consisting of database searches and simulated annealing. J. Mol. Graphics Mod. 18, 258-272.
4. Iwadate M., Ebisawa K. and Umeyama H. (2001) Comparative Modeling of CAFASP2 Competition. Chem-Bio Informatics Journal 1, 136-148.

CHIMERAX (P0170) - 74 predictions: 74 3D

Full Length Protein Modeling Using CHIMERA eXtending Procedure

Genki Terashi, Ryota Yamatsu, Youji Kurihara, Mayuko Takeda-Shitaka, Mitsuo Iwadate and Hideaki Umeyama

Kitasato University [email protected]

PSI-BLAST and other programs provide different alignments, including several that cover different ranges of the target and reference proteins. Homology modeling is performed for each alignment, and the resulting template models are used to construct the full-length protein. Full-length models are generally not obtained directly, owing to low homology between the target sequence and the reference. In such cases, several template models are connected by overlapping a few amino acid residues between neighboring template models. In addition, other connecting methods are used to make full-length protein models.

Method

Making of template models and selection of base model
After template models are made for many alignments, a primary candidate is selected as the base reference protein for the homology modeling. The primary base protein is selected from the modeling products of the FAMS [1, 2], FAMSD and CHIMERA [3, 4] teams in Umeyama’s group. The choice is based on alignment length and on the degree of agreement between the secondary structures predicted for the target and those calculated for the reference.

Architecture of secondary structure database
Fragmented structures including more than two secondary structures are modeled using the initially obtained alignments, and these are stored as a secondary structure database.

Connecting base model with template models
Other template models are fitted onto the base protein model step by step until the full-length protein is produced.

Extension of models with secondary structure database
N-terminal or C-terminal moieties of the full-length target protein that the fitted models do not cover are modeled by a similar RMS fitting procedure, using the modeling database based upon secondary structures.

Model refinement
Finally, in order to refine the protein model connected from several modeled proteins, we use the FAMS (full automatic modeling system) program again.

Results and Discussion

When the primary reference protein has high homology to the target, the model covers almost the full length of the target protein, because nearly complete alignments with only small insertions and deletions are obtained. The region over which the model must be extended is therefore very small, and the planned extension of the primary base protein is straightforward. When the primary reference protein has low homology, however, some models are extended substantially while template models are fitted onto the primary base protein. Even so, some low-homology models are good overall, because the locally modeled regions are expected to be reasonably accurate: the template and base models are built from reliable alignments with low E-values.

1. Iwadate M., Ebisawa K., and Umeyama H. (2001) Homology modeling of CAFASP2 competition, Chem-Bio Informatics J. 1 (4), 136-148.
2. Ogata K., and Umeyama H. (2000) An automatic homology modeling method consisting of database searches and simulated annealing. J. Mol. Graph. Model. 18 (3), 258-272, 305-306.
3. Yoneda T., Komooka H. and Umeyama H. (1997) A computer modeling study of the interaction between tissue factor pathway inhibitor and blood coagulation factor Xa. J. Protein Chem. 16, 597-605.
4. Takeda-Shitaka M. and Umeyama H. (1998) Effect of exceptional valine replacement for highly conserved Ala55 in serine proteases. FEBS Lett. 425, 448-452.

CIRB (P0397) - 263 predictions: 200 3D, 63 RR

Prediction of the Residue-Residue Contacts With Neural Networks

P. Fariselli1, O. Olmea2, A. Valencia2 and R. Casadio1 1 - CIRB and Dept. of Biology, University of Bologna Via Irnerio 42, 40126 Bologna, Italy. 2 - Protein Design Group. CNB-CSIC, Cantoblanco, Madrid 28049. Spain. [email protected], [email protected]

We use an ab initio method based on neural networks to predict residue-residue contacts. Our networks were trained to learn the association rules between the covalent structure of each protein from a selected data base and its contact map. The neural network implemented here is similar to that previously described in [1] and called NET, including in the input code evolutionary information in the form of sequence profile.

For training the network, we use a large set of non-homologous proteins of known 3D structure. The list includes all proteins in the PDB-select list of non-sequence-redundant protein structures whose chains were not interrupted and for which alignments with more than 15 sequences were obtained; in total, our set includes 173 proteins [1,2].

We consider two residues to be in contact when the Euclidean distance between the coordinates of the corresponding C-beta atoms is lower than 8 Å (||r_i - r_j|| < 8 Å).
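The contact definition above can be illustrated with a minimal sketch. The function name `contact_map` and the example coordinates are hypothetical and not part of the authors' software; only the 8 Å C-beta cutoff comes from the text.

```python
import math

def contact_map(cb_coords, cutoff=8.0):
    """Return a symmetric boolean contact map from C-beta coordinates (in Å).

    Two residues are in contact when their C-beta distance is below `cutoff`.
    """
    n = len(cb_coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(cb_coords[i], cb_coords[j]) < cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

# Made-up coordinates for three residues, for illustration only
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
cmap = contact_map(coords)
```

Here residues 0 and 1 are 5 Å apart and count as a contact, while residue 2 is too far from both.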

The topology of the neural network consists of: (i) a single output neuron which codes for contact (output value close to 1) and non-contact (output value close to 0); (ii) a hidden layer of 8 neurons; (iii) an input coding of 1050 input neurons, which represent the ordered pairs (in the parallel and anti-parallel pairing of two segments, each 3 residues long) as described in [1].

The basic novelty is that, after the neural network predictions, a filter procedure is applied to impose an upper bound on the number of contacts per residue. Based on the output value, the procedure eliminates the least probable contacts for those residues whose number of predicted contacts is larger than 10. Eventually, the backbone connectivity and the secondary structure predictions are included in the filtering procedure. This is done in two ways: 1) setting the intra-helical predictions as contacts, 2) increasing or decreasing the number of contacts among strands depending on their relative network activation values. The average accuracy, measured as the number of correct contacts divided by the number of predicted contacts and evaluated with a cross-validation procedure, is about 0.20.
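A rough sketch of the per-residue cap and of the accuracy measure follows. The 0.5 decision threshold, the data layout and the function names are assumptions made for illustration; the text specifies only the upper bound of 10 contacts per residue and the accuracy ratio. The secondary-structure-based adjustments are not modeled here.

```python
def cap_contacts(scores, threshold=0.5, max_per_residue=10):
    """Keep at most `max_per_residue` predicted contacts for every residue.

    scores: dict mapping residue pairs (i, j) to the network output value.
    The threshold used to call a contact is an assumption of this sketch.
    """
    predicted = {pair: s for pair, s in scores.items() if s >= threshold}
    kept = set(predicted)
    residues = sorted({r for pair in predicted for r in pair})
    for r in residues:
        # Contacts of residue r that are still kept, most probable first
        mine = sorted((p for p in kept if r in p),
                      key=lambda p: predicted[p], reverse=True)
        for pair in mine[max_per_residue:]:
            kept.discard(pair)
    return kept

def accuracy(predicted, observed):
    """Number of correct contacts divided by number of predicted contacts."""
    return len(predicted & observed) / len(predicted) if predicted else 0.0

# Residue 0 is predicted to contact residues 1..12 with decreasing scores
scores = {(0, j): 1.0 - 0.01 * j for j in range(1, 13)}
kept = cap_contacts(scores)
```

In this toy case the two weakest contacts of residue 0 are dropped, leaving ten.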

1. Fariselli P. et al. (2001) Prediction of contact maps with neural networks and correlated mutations. Protein Eng 14, 835-843.
2. Fariselli P. et al. (2001) Progress in predicting inter-residue contacts of proteins with neural networks and correlated mutations. Proteins: Suppl 5, 157-162.

CIRB (P0397) - 263 predictions: 200 3D, 63 RR

Building on the Basics: Protein Fold Recognition/Threading Based on “Generalized” Profile Alignments

E. Capriotti1,3, P. Fariselli2, I. Rossi2,3 , and R. Casadio2 1 - Dept. of Physics/CIRB, University of Bologna, Italy, 2 - Dept. of Biology/CIRB, University of Bologna, Italy, 3 - BioDec srl, Bologna, Italy [email protected], [email protected]

We developed Tangram, a method for protein fold recognition/threading based on sequence profile alignments.

The profile-profile comparison algorithm known as BASIC [1] can be generalized to represent any type of “property” profile comparison. Assume that A and B are two strings of symbols, P_A and P_B are the rectangular matrices representing the position-specific frequencies of the alphabet symbols composing the strings, and S is a (symmetric) substitution matrix. It can then be derived that the matrix D, defined as

$$D = P_A^{T} \, S \, P_B$$

(where superscript T indicates a matrix transpose operation), represents the “dot” matrix for the profile comparison of the two strings. This can be efficiently computed by means of standard linear algebra routines. The D matrix can be searched for high-scoring alignments by means of standard techniques.

In Tangram, for a given target/template comparison, we compute a generalized dot matrix D as follows:

$$D = a \, [P_A^{T} \, \mathrm{Bl62} \, P_B] + b \, [P_{A,ss}^{T} \, S_{ss} \, P_{B,ss}] + c \, [P_A^{T} \, C \, P_B]$$

where a, b and c are the weights of the linear combination, P_X and P_{X,ss} are the composition and secondary-structure profiles, Bl62 is the BLOSUM62 [2] substitution matrix, S_ss is a secondary-structure substitution matrix proposed in [3], and C is a long-range contact capacity potential [4]. Then D is searched for the top-scoring alignment using the local Smith-Waterman dynamic programming algorithm [5].
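The computation of the dot matrix and its local alignment score might be sketched as follows. This is not the Tangram code: the helper names, the linear gap penalty and the toy two-letter alphabet are assumptions. Profiles are stored with one row per alphabet symbol and one column per sequence position, so that the transposed product has one row per position of A and one column per position of B.

```python
def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def dot_matrix(P_A, S, P_B):
    """D = P_A^T S P_B for profiles laid out as (alphabet x positions)."""
    return matmul(matmul(transpose(P_A), S), P_B)

def weighted_sum(matrices, weights):
    """Linear combination of equally shaped matrices (weights a, b, c)."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * M[i][j] for w, M in zip(weights, matrices))
             for j in range(cols)] for i in range(rows)]

def smith_waterman_score(D, gap=1.0):
    """Best local alignment score over D; the linear gap cost is assumed."""
    n, m = len(D), len(D[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + D[i - 1][j - 1],
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            best = max(best, H[i][j])
    return best

# Toy example: two-letter alphabet, both strings are "AB" (one-hot profiles)
P = [[1, 0], [0, 1]]
S = [[1, -1], [-1, 1]]
D = dot_matrix(P, S, P)
score = smith_waterman_score(D)
```

With real data, `weighted_sum` would combine the three bracketed terms of the formula using the weights a, b and c.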

The composition profiles are generated by multiple alignment of the sequences reported from a three-iteration PSI-BLAST [6] search on the Non-Redundant database, using an inclusion threshold of E = 10^-3. Secondary structure profiles for the target are generated by means of a neural network predictor [7]. Our template set comprises 2167 PDB structures whose sequence homology is less than 40%; the same set has also been used to derive the long-range contact capacity potential.

On a set comprising 185 template/target couples of PDB structures that share the same SCOP label with less than 25% sequence identity, we measured 70% accuracy in detecting the correct SCOP assignment.

1. Rychlewski J. et al. (1998) Fold and function predictions for Mycoplasma genitalium proteins. Fold. Des. 3, 229-238.
2. Henikoff S. et al. (1998) Superior performance in protein homology detection with the BLOCKS database server. Nucleic Acids Res. 26, 309-312.
3. Wallqvist A. et al. (2000) Iterative sequence/secondary structure search for protein homologs: comparison with amino acid sequence alignments and application to fold recognition in genome databases. Bioinformatics 16, 988-1002.
4. Alexandrov N.N. et al. (1996) Fast protein fold recognition via sequence to structure alignment and contact capacity potentials. Pac. Symp. Biocomput., 53-72.
5. Smith T.F. and Waterman M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.
6. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
7. Jacoboni I. et al. (2000) Predictions of protein segments with the same amino acid sequence and different secondary structure: a benchmark for predictive methods. Proteins 41, 535-544.

DelCLAB (P0050) - 310 predictions: 310 3D

Protein Folding Prediction by Spectral Analysis Methods

Carlos A. Del Carpio-Muñoz Lab. for BioInformatics. Dept. of Ecological Eng. Toyohashi University of Technology, Tempaku. Toyohashi. 441-8580 [email protected]

Our methodology [1-2] combines spectral analysis of the physicochemical properties of the amino acids constituting a particular sequence with a consensus analysis of the secondary structure predicted for that sequence by several previously reported methodologies, including those participating in the CAFASP contest.

1. Spectral Representation of a Protein Folding

We adopted a well-known front-end processing technique from robust automatic speech recognition (ASR), whose objective is to preserve critical linguistic information while suppressing irrelevant information such as speaker-specific characteristics, channel characteristics, and noise. This analysis-synthesis technique is based on transforming a signal into its cepstrum, which is a measure of the periodic wiggliness of a frequency response plot. The cepstrum is calculated from the logarithm of the power spectrum of a signal and leads to a logarithmic periodogram, for which the spectral envelope is obtained as a smooth curve connecting the main local peaks of the fine structure of the frequency spectrum.

Applied to the profile of physicochemical features of a protein sequence, the technique extracts information in the form of the spectral envelope, which is used to model the relationship between the primary and tertiary structures of a protein.

Comparison of two sequences is then reduced to an alignment of the spectral envelopes representing their primary structures. The profile of physicochemical characteristics is first obtained and then converted to the frequency domain by applying a Fourier transform.
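The conversion of a physicochemical profile to the frequency domain can be illustrated with a small sketch. The two-valued scale below is invented for the example (the method itself draws on AAINDEX), and the code stops at the logarithm of the power spectrum; the envelope extraction step is not reproduced.

```python
import cmath
import math

# Hypothetical two-valued scale, for illustration only
SCALE = {"A": 1.0, "L": 1.0, "K": -1.0, "E": -1.0}

def profile(seq, scale):
    """Map each residue to its value on the chosen property scale."""
    return [scale[aa] for aa in seq]

def power_spectrum(x):
    """Power at frequencies k = 0 .. n/2, from a direct discrete Fourier transform."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        c = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(c) ** 2 / n)
    return spectrum

def log_spectrum(x, eps=1e-12):
    """Logarithm of the power spectrum; eps avoids log(0)."""
    return [math.log(p + eps) for p in power_spectrum(x)]

# A sequence with period 4 in the chosen property
spec = power_spectrum(profile("ALKE" * 4, SCALE))
peak = max(range(len(spec)), key=spec.__getitem__)
```

For this 16-residue sequence the dominant harmonic is k = 4, corresponding to the period of 4 residues.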

2. Spectral Alignment and Protein Structure Similarity

Spectral similarities are then obtained by aligning the spectra representing two primary sequences using a dynamic programming (DP) algorithm. The hypothesis underlying the methodology is that patterns that are similar in spectral space represent similar folding patterns in proteins. In ordinary sequence alignment by DP, a penalty is imposed when no match occurs and the search can continue in both the vertical and horizontal directions. This cannot be done with spectral matching, since it would give the match operation unlimited flexibility. To avoid this effect when using DP for spectral alignment, a gradient is imposed on the search so that the match continues smoothly in the diagonal direction. This amounts to reducing the number of gaps in the DP process: a horizontal or vertical advance is allowed only once, so consecutive gaps do not occur. The similarity of two sequences in frequency space is then computed as the Euclidean distance between the spectral harmonics. Values close to zero indicate high similarity, while higher values indicate increasing dissimilarity.
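One possible reading of this constrained alignment is sketched below: a dynamic programming recursion in which a horizontal or vertical step must always be followed by a diagonal step. The exact recursion used by the author is not given in the abstract, so the state layout, the squared-difference local cost and the function name are assumptions.

```python
import math

def constrained_spectral_distance(a, b):
    """Euclidean distance along the best path with no two consecutive gap steps.

    a, b: lists of spectral values. Returns math.inf when the lengths differ
    too much for any admissible path to exist.
    """
    INF = math.inf
    n, m = len(a), len(b)
    diag = [[INF] * m for _ in range(n)]  # path reached (i, j) by a diagonal step
    gap = [[INF] * m for _ in range(n)]   # path reached (i, j) by a single gap step
    diag[0][0] = (a[0] - b[0]) ** 2
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            local = (a[i] - b[j]) ** 2
            if i > 0 and j > 0:
                diag[i][j] = local + min(diag[i - 1][j - 1], gap[i - 1][j - 1])
            # A gap step may only follow a diagonal step
            before_gap = INF
            if i > 0:
                before_gap = min(before_gap, diag[i - 1][j])
            if j > 0:
                before_gap = min(before_gap, diag[i][j - 1])
            gap[i][j] = local + before_gap
    return math.sqrt(min(diag[n - 1][m - 1], gap[n - 1][m - 1]))
```

A value of zero means the two spectra can be matched exactly under the constraint, for example `[1, 2, 3]` against `[1, 2, 2, 3]`, where one isolated gap absorbs the repeated value.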

3. Dominant Parameters and the SCOP Data Base

Finding the parameters for which the alignments are optimal (within a protein folding category) leads to the determination of the dominant physicochemical properties for a particular folding, class, family and finally super-family of proteins.

This evaluation is performed for the proteins recorded in the SCOP database. The analysis is carried out after preprocessing the structural information found in each family and super-family in the database: sequences with more than 80% similarity are deleted, so that each super-family maintains diversity in residue sequence. Physicochemical parameters are obtained from the AAINDEX database, a compilation of 434 amino acid indices for the twenty naturally occurring residues. Dominant physicochemical parameters for each class of proteins are obtained by aligning the spectra for all pairs of proteins constituting the class, using each of the 434 amino acid indices. The five indices for which the spectral alignment scores are the highest are selected as the dominant physicochemical parameters.

4. Threading of the Target Sequence onto the Template of a Candidate Folding Pattern

Because the methodology allows the comparison of amino acid sequences of different lengths, it poses some difficulties when the candidate folding pattern is used as the template for modeling the target protein. Since the candidate may contain more or fewer amino acids than the target, we propose a combination of two procedures as a paradigm for building an unknown protein from instances derived from analogical analysis such as the one presented here. This paradigm involves the following steps: (1) derivation of a consensus secondary structure based on secondary structure prediction methodologies; (2) a threading algorithm based on a genetic algorithm. We briefly describe each of these procedures:

4.1) Consensus Secondary Structure. Several methodologies for predicting the secondary structure of a sequence of amino acid residues have been proposed in the literature; more than 20 methods are available through the Internet. In addition, a consensus secondary structure obtained from analysis of the CAFASP contest gives an idea of the secondary structure of the target. We have developed an alignment of secondary structures between that of the recognized folding pattern (the template structure output by the spectral analysis described above) and the consensus secondary structure.
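The abstract does not state how the consensus is formed. As an illustration only, a simple per-residue majority vote over several three-state predictions (H, E, C) could look like this; the tie-breaking rule (fall back to coil) and the function name are assumptions.

```python
from collections import Counter

def consensus_secondary_structure(predictions):
    """Per-residue majority vote over equal-length H/E/C prediction strings.

    Ties are resolved to coil ("C"); this rule is an assumption of the sketch.
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        top = max(counts.values())
        winners = [state for state, c in counts.items() if c == top]
        consensus.append(winners[0] if len(winners) == 1 else "C")
    return "".join(consensus)
```

For example, the predictions `"HHEC"`, `"HHEE"` and `"HCEC"` give the consensus `"HHEC"`.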

4.2) Threading based on a Genetic Algorithm. This algorithm threads the target sequence of amino acid residues onto the template candidate. The procedure uses the results of the secondary structure alignment from the preceding step: pieces of backbone are cut from and/or inserted into the template so as to achieve the consensus secondary structure, and the target structure is then built. To constrain distortions of the backbone to a reasonable RMS difference, this threading operation is executed by a genetic algorithm whose penalty function is the least RMS deviation.

1. Del Carpio C.A. and Yoshimori A. (2002) Fully Automated Protein Tertiary Structure Prediction Using Fourier Transform Spectral Methods. Protein Structure Prediction: Bioinformatic Approach. Edited by: Igor Tsigelny. University of California. International University Line Inc. 173-197.
2. Del Carpio-Muñoz C.A. (2002) Folding Pattern Recognition in Proteins Using Spectral Analysis Methods. Genome Informatics. In Press.

Doniach (P0401) - 42 predictions: 42 3D

Ab Initio Protein Structure Prediction Method - Topological Assembly Of Predicted Secondary Structure Segments

Wenjun Zheng1 and Sebastian Doniach1,2 1 – Department of Physics, 2 – Department of Applied Physics Stanford University, Stanford, CA94305 [email protected]

The basic idea of our method is to assemble predicted secondary structure segments (helices and strands) so that the 'hydrophobic' center of mass of each segment is in contact with that of at least one other segment.

These hydrophobic contacts are reached in two ways. First, if the number of secondary structure segments is small, human observation is used to specify a short list of possible contact schemes, and folding simulations are run to satisfy them as pairwise contact constraints. Second, if the number of secondary structure segments is large and the number of combinations is beyond human enumeration, we run the folding simulation directly multiple times, trying to pair each helix or strand with a counterpart that is not predefined, thus generating multiple models for later selection. The folding simulation is implemented by a Monte Carlo simulated annealing process in which the fitness score is compiled from the contact constraints, the radius of gyration (Rg) and the hydrophobic burial score [1].

To better model the formation of beta sheets, we construct all topologically different ways of forming a beta sheet from the predicted beta strands before running the folding simulation.

The selection of the top 5 predictions is based on the following criteria: 1. compactness, measured by the Rg of all non-loop residues; 2. burial of hydrophobic residues [1]; 3. human observation of helix docking and strand pairing.
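The compactness criterion can be illustrated with a short sketch that computes Rg over non-loop residues only. The function name, the use of one point per residue and the labelling of loop residues as "C" are assumptions of this example.

```python
import math

def radius_of_gyration(coords, ss=None):
    """Rg of a set of points (one per residue, in Å).

    If a secondary structure string `ss` is given, residues labelled "C"
    (loop) are excluded before the calculation.
    """
    if ss is not None:
        coords = [c for c, state in zip(coords, ss) if state != "C"]
    n = len(coords)
    center = tuple(sum(c[k] for c in coords) / n for k in range(3))
    return math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n)
```

A model with a smaller Rg over its helices and strands is more compact; a distant loop residue does not affect the value when `ss` is supplied.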

1. Huang E. S., Subbiah S. & Levitt M. (1995). Recognizing native folds by the arrangement of hydrophobic and polar residues. J Mol Biol 252 (5), 709-720.

DOROTA (P0589) - 1 prediction: 1 3D

Structure Based Modeling of the Target T0190

D. Sawicka Lawrence Livermore National Laboratory, Livermore, California [email protected]

The AS2TS server [1] was used to select the main template required for model building and to identify possible templates for loop building. Structure-based modeling of the high-homology target T0190 was performed using the crystal structure of human transthyretin (PDB code 1f41) as the main template [2]. Modeler [3] was used to generate the 3D coordinates from the 1f41 template, using an alignment generated by PSI-BLAST [4]. Only one region of the target required loop building, for which the human GM2-activator protein (PDB code 1g13) was used [5]. Side chains were built using the SCWRL program [6], and PROCHECK was used to assess the quality of the final model [7].

1. Zemla A.: http://protein.llnl.gov/as2ts
2. Hornberg A., Eneqvist T., Olofsson A., Lundgren E., Sauer-Eriksson A.E. (2000) A Comparative analysis of 23 structures of the amyloidogenic protein transthyretin. J Mol Biol, 302, 649.
3. Sali A. and Blundell T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol, 234, 779-815.
4. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W. and Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-3402.
5. Wright C.S., Li S-C., Rastinejad F. (2000) Crystal Structure of Human GM2-Activator protein with a novel beta-cup topology. J Mol Biol, 304, 411.
6. Bower M.J., Cohen F.E. and Dunbrack R.L., Jr. (1997) Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J Mol Biol, 267, 1268-1282.
7. Laskowski R. A., MacArthur M. W., Moss D. S. & Thornton J. M. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Cryst, 26, 283-291.

Dunbrack (P0329) - 46 predictions: 46 3D

Comparative Modeling of CASP5 Targets

R. L. Dunbrack, Jr., Y. Li, G. Wang, and A. A. Canutescu Fox Chase Cancer Center, Philadelphia PA USA [email protected]

We used two different profile-profile alignment/search methods to identify known structures homologous to the CASP5 targets. Profiles for all sequences in the PDB and for the target sequences were derived with PSI-BLAST [1] using a common procedure, as follows. The non-redundant protein sequence database was searched for 5 rounds with an E-value cutoff ("-h") of 0.002 for inclusion in the position-specific score matrix. We checked for drift by noting whether hits in one round of PSI-BLAST with E-value better than 0.002 were present in subsequent rounds with E-value worse than 0.002. If drift occurred, we used the last round without drift. All hits identified with E-value less than 10.0 were saved and placed in a new database. Multiple sequence alignments were then created by searching this database with PSI-BLAST. The multiple alignment was culled at 98% sequence identity to remove redundant sequence information, and the sequences were weighted according to the method of Henikoff and Henikoff [2] to produce the sequence profile.
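The position-based sequence weighting of Henikoff and Henikoff [2] can be sketched as follows. In each alignment column, a sequence receives 1/(k·n), where k is the number of distinct residue types in the column and n is the number of sequences sharing its residue; the contributions are summed over columns and normalized. Treating the gap character as an ordinary symbol and normalizing to a sum of 1 are simplifications made for this sketch.

```python
from collections import Counter

def henikoff_weights(alignment):
    """Position-based sequence weights for an alignment given as equal-length strings."""
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for s, residue in enumerate(column):
            weights[s] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

# Two identical sequences share the weight that the divergent one gets alone
w = henikoff_weights(["AA", "AA", "CC"])
```

In the toy alignment the two identical sequences receive 0.25 each and the divergent one 0.5, which is the down-weighting of redundant sequences that the method is designed to achieve.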

We have developed two new scoring mechanisms for profile-profile alignments. The first is a Dirichlet mixture substitution matrix (DIMSUM) analogous to ordinary amino acid substitution matrices, but in which the scores represent probabilities of substituting profile columns for one another. The columns in the profiles are represented as components of a Dirichlet mixture developed from multiple sequence alignments and structural characteristics (secondary structure and surface exposure). The DIMSUM matrices were developed from structure alignments of homologous proteins using the CE program [3] in a manner similar to the BLOSUM matrices [4]. The profile-profile alignments are performed with a standard local-alignment dynamic programming algorithm.

The second scoring method is a combination of an amino acid substitution matrix and a matrix that represents the probability of predicted secondary structure in one profile (the CASP target) aligning to known secondary structure in the PDB entry. This matrix (SSAAC) was also developed from structure alignments, by determining the substitution rates of predicted secondary structure in one protein in each structural alignment versus known secondary structure in the other protein. We combined both DIMSUM and SSAAC with a structure-derived amino acid substitution matrix (SDM) [5], applied to the two profile columns, such that the score is the sum over all i,j of p_i q_j S_ij, where p_i and q_j are the probabilities of amino acid types i and j in the two columns and S_ij is the element from the substitution matrix. We use a gap penalty scheme that depends on the evolutionary distance between the two profiles. The scoring schemes were optimized at 50% SDM/50% DIMSUM for the DIMSUM method and 65% SDM/35% SSAAC for the SSAAC method.
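The column-column term and the weighted combination can be written out as a short sketch. The two-letter alphabet, the matrix values and the function names are invented for illustration; only the form of the sum and the mixing percentages come from the text.

```python
def column_score(p, q, S):
    """Sum over all i, j of p_i * q_j * S_ij for two profile columns."""
    return sum(p[i] * q[j] * S[i][j]
               for i in range(len(p)) for j in range(len(q)))

def mixed_score(sdm_score, other_score, sdm_weight):
    """Weighted combination, e.g. sdm_weight=0.5 with DIMSUM or 0.65 with SSAAC."""
    return sdm_weight * sdm_score + (1.0 - sdm_weight) * other_score

# Toy two-letter alphabet and substitution matrix (made-up values)
S = [[2.0, -1.0], [-1.0, 2.0]]
p = [1.0, 0.0]   # column of the first profile
q = [0.5, 0.5]   # column of the second profile
s = column_score(p, q, S)
```

Here `s` is 1.0·0.5·2.0 + 1.0·0.5·(-1.0) = 0.5.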

We applied both the DIMSUM and SSAAC methods to identify homologues of known structure for the CASP targets, and chose parent structures and alignments that subjectively appeared to represent the best alignments, either by length, biological function, or alignment quality. This alignment was then optimized by visual examination of the known structure, with modest changes made to the alignments near insertion-deletion regions. Using visual examination, we chose segments of the parent protein to be replaced by new sequence of different length (i.e., indels) using a loop modeling method we are developing. In most cases, we chose to replace the entire loop between the flanking regular secondary structures. If the loop was too long to model, in some cases we chose shorter segments that seemed likely to be affected by the insertion or deletion.

We have developed an energy function that determines ranking of Ramachandran conformations for a protein loop segment based purely on its sequence. The function is derived from database analysis using Bayesian methods. The analysis provides probabilities of Ramachandran conformations for each position in a loop as a function of the amino acid type of the residue at that position as well as the amino acid type and conformation of the residue previous to that position and the amino acid type and conformation of the residue following it. This allows us to search the entire space of Ramachandran conformations and sort their energies. We find empirically that 99% of true conformations can be found in low energy conformations (<1.5 kcal/mol per residue) from this function. Generally the correct loop conformation is found in the top 100 conformations sorted by energy.

We built 100 random conformations for each of the top 100 Ramachandran conformations by sampling from phi,psi distributions for each of the 20 amino acids from loop residues in the PDB. We have developed a new loop closure method using an algorithm from robotics called “cyclic coordinate descent” (CCD). The loop closure problem occurs in nearly all loop building methods, and current algorithms such as “random tweak” [6] are slow and sometimes do not converge. CCD starts from an initial loop conformation that is built from the N-terminal anchor and does not close the loop at the C-terminal anchor. It alters one dihedral angle at a time to optimize the overlap of the C-terminal residue of the constructed loop with the fixed C-terminal anchor residue. Each move in the process is defined by the solution to an equation in one variable (the solution is an inverse tangent). This is in contrast to random tweak, which alters all dihedrals and requires matrix inversion. CCD is very fast and converges 99.95% of the time. It fails only occasionally, for very short, highly extended loops, where it may get stuck in a local minimum. CCD is also very flexible, in that any desired constraints can be applied to each dihedral angle as a Metropolis criterion either to accept a proposed move or to reject it. We added a Ramachandran probability map, so that moves to higher-probability phi,psi conformations for the residue type were accepted, and moves to lower-probability conformations were accepted with probability p(new)/p(old).
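The idea behind cyclic coordinate descent can be shown on a simplified planar chain, where each joint angle plays the role of a dihedral and the optimal single-angle update is an inverse tangent. This is a two-dimensional analogue written for illustration, not the authors' three-dimensional loop closure code; the function names and the example chain are assumptions, and the Ramachandran-based acceptance step is omitted.

```python
import math

def forward(angles, lengths):
    """Positions of all joints (and the chain end) of a planar chain from the origin."""
    x = y = heading = 0.0
    points = [(x, y)]
    for angle, length in zip(angles, lengths):
        heading += angle  # each angle is relative to the previous link
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points

def ccd_close(angles, lengths, target, tol=1e-6, max_sweeps=1000):
    """Adjust one joint angle at a time so that the chain end approaches `target`."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in reversed(range(len(angles))):
            points = forward(angles, lengths)
            px, py = points[i]            # pivot of the joint being rotated
            ex, ey = points[-1]           # current chain end
            ux, uy = ex - px, ey - py     # pivot -> end
            vx, vy = target[0] - px, target[1] - py  # pivot -> target
            # Rotation that aligns pivot->end with pivot->target
            angles[i] += math.atan2(ux * vy - uy * vx, ux * vx + uy * vy)
        if math.dist(forward(angles, lengths)[-1], target) < tol:
            break
    return angles

# Three unit links, initially straight, closed onto a reachable target
lengths = [1.0, 1.0, 1.0]
target = (1.5, 1.0)
closed = ccd_close([0.0, 0.0, 0.0], lengths, target)
```

Each update never increases the distance between the chain end and the target, which is why the procedure is simple and fast compared with methods that change all angles at once.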

We used CCD to close each of the 10,000 trial conformations per loop. We are in the process of developing and optimizing an appropriate energy function for choosing the best conformation from the set of trial conformations. For CASP, we performed two calculations on the 10,000 trials: first, we built the side chains onto the loop conformation using our program SCWRL [7], with the rest of the protein model as a steric frame; second, we used CHARMM [8] to minimize the energy of the loop conformation (including side chains built with SCWRL). We used these two energies to choose a Ramachandran conformation that appeared most frequently at low energy. One of these conformations was used as the predicted loop and placed into the structure.

We used SCWRL to predict the side chains on the entire structure, including the constructed loops. Finally, we performed a brief CHARMM energy minimization of the structure to relieve short contacts that result from the rotamer assumption in SCWRL and to fix the phi,psi conformations of non-Pro to Pro mutations.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Henikoff S. and Henikoff J.G. (1994) Position-based sequence weights. J. Mol. Biol. 243 (4), 574-578.
3. Shindyalov I.N. and Bourne P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11 (9), 739-747.
4. Henikoff S. and Henikoff J.G. (1993) Performance evaluation of amino acid substitution matrices. Proteins 17 (1), 49-61.
5. Prlic A., Domingues F.S., and Sippl M.J. Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng. 13 (8), 545-550.
6. Shenkin P.S. et al. (1987) Predicting antibody hypervariable loop conformation. I. Ensembles of random conformations for ringlike structures. Biopolymers 26 (12), 2053-2085.
7. Bower M.J., Cohen F.E., and Dunbrack R.L. Jr. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J. Mol. Biol. 267 (5), 1268-1282.
8. MacKerell A.D. et al. All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B102, 3586-3616.

Dunker-Keith (P0355) - 195 predictions: 195 DR

Predicting Intrinsic Protein Disorder

A. Keith Dunker1, Pedro Romero1, Xiahong Li1, Ethan C. Garner1, Celeste J. Brown1, Predrag Radivojac2 1 - Washington State University, 2 - Temple University [email protected]

Many protein segments and a few whole proteins are unstructured under their putative physiological conditions. These “intrinsically disordered sequences” have important functions. Recognizing the over-simplification of partitioning into two states, order and disorder, and recognizing that all protein structure is condition-dependent, we are nevertheless focusing on the simplified problem of predicting intrinsic order and disorder from amino acid sequence. From our point of view, a region existing as an ensemble of Ramachandran phi, psi angles, whether static or dynamic, is intrinsically disordered.

PONDR is a collection of neural network Predictors of Natural Disordered Regions. MODEL 1 is an integration of three predictors, one for each terminus and one for internal sequences [1-2]. For the internal sequences, a training set of 15 disordered regions having a total of 1149 residues was compiled and balanced by an equal number of ordered residues taken randomly from NRL_3D. Of the 15 disordered regions in the training set, 8 were characterized by X-ray diffraction (PDB IDs: 2tbv, 2ts1, 1aui, 1bgw, 1elo, 1af3, 1ati and 1lbh) and 7 by NMR (SW IDs: prio_mouse, h5_chick, flgm_salty, regn_lambd, hsf_klula, and hmgi_human, and PIR accession: S50866).

From an initial pool of 31 attributes, a branch and bound search was used to select 10 attributes that gave the best collective discrimination between order and disorder in the training set using a Mahalanobis distance criterion. The 31 attributes in the initial pool included the 20 amino acid compositions, two different hydropathy scales, flexibility index, alpha-moment, beta-moment, net charge (K + R - D - E), aromatic composition (W + F + Y), coordination number, codon number, alphabet size, and side chain volumes. The attributes selected by this process were fraction of W, Y, F, D, E, K, R, aromatic composition, coordination number, and net charge.
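Most of the selected attributes are simple functions of the residues in a sequence window, as the following sketch shows. The function name is hypothetical, net charge is computed here as a raw count (whether the original used counts or fractions is not stated), and the coordination number is left out because it requires a per-residue table that the abstract does not give.

```python
def window_attributes(window):
    """Composition-based attributes for one sequence window (a string of residues)."""
    n = len(window)

    def fraction(letters):
        return sum(window.count(aa) for aa in letters) / n

    attrs = {f"frac_{aa}": fraction(aa) for aa in "WYFDEKR"}
    attrs["aromatic"] = fraction("WFY")           # W + F + Y
    attrs["net_charge"] = (window.count("K") + window.count("R")
                           - window.count("D") - window.count("E"))  # K + R - D - E
    return attrs

attrs = window_attributes("KKDEWFAA")
```

For the window `KKDEWFAA` the aromatic composition is 0.25 and the net charge is 0.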

The back-propagation learning algorithm was used to train a feedforward neural network having the ten selected attributes as inputs, a fully connected hidden layer of ten neurons and a single output. To estimate errors, the training was repeated on 5 disjoint subsets each having 80% of the data with 3 different initializations, so neural network training was repeated 5 x 3 = 15 times. Once the accuracy was established by this 5-fold cross-validation procedure, a new neural network was trained to the same accuracy using all the data.

To enable prediction from the first to the last residue in a protein, disorder was partitioned according to position, with the development of different predictors for N-terminal and C-terminal regions. These predictors used 8 inputs.

The integration of the three predictors is carried out in 3 steps. First, predictions are made by the three predictors over their respective domains, with overlapping predictions for positions 11 - 14 by the N-terminal and internal predictors, and, for a protein of length M, with overlapping predictions from M-14 to M-11 by the C-terminal and internal predictors. Second, the values for each of the 4 pairs of overlapping predictions are averaged. Third, the now integrated prediction outputs are smoothed by averaging over sliding windows of 9 amino acids, with the first and last 4 sequence positions being assigned the unsmoothed prediction output values from the N- and C-terminal predictors, respectively. This integrated predictor is used herein. Studies have shown that neural network scores are only roughly equivalent to probabilities.
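The averaging and smoothing steps can be sketched as below. The representation of each predictor's output as a full-length list with `None` outside its domain, and the function names, are assumptions made for this illustration.

```python
def integrate(n_term, internal, c_term):
    """Average the available predictions at each position.

    Each argument is a list of the same length holding a score, or None where
    that predictor makes no prediction. Every position must be covered by at
    least one predictor.
    """
    merged = []
    for values in zip(n_term, internal, c_term):
        present = [v for v in values if v is not None]
        merged.append(sum(present) / len(present))
    return merged

def smooth(raw, window=9):
    """Sliding-window average; the first and last window//2 positions stay unsmoothed."""
    half = window // 2
    out = list(raw)
    for i in range(half, len(raw) - half):
        out[i] = sum(raw[i - half:i + half + 1]) / window
    return out
```

With a window of 9, `half` is 4, so the first and last 4 positions keep the values produced by the terminal predictors, as described above.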

Disorder is indicated by a score greater than 0.5. Short strings of amino acids are erroneously predicted to be disordered more often than long strings. The predictor used here has a per-residue error rate of 22%, whereas the error rate for consecutive lengths of disorder of 30 residues or longer is only 3%, and for 40 residues or longer only 0.4%. In the remarks section of each prediction, consecutive lengths of disorder of 17 residues or shorter were considered not significant, except at the termini. Ten or more consecutive residues at the termini were considered significant.
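The significance rule for predicted disordered segments can be expressed as a short filter. The function name and the 0-based inclusive segment indexing are choices made for this sketch; the thresholds (score above 0.5, at least 18 residues internally, at least 10 at a terminus) follow the text.

```python
def significant_disorder(scores, cutoff=0.5, min_internal=18, min_terminal=10):
    """Return (start, end) index pairs of predicted disordered segments kept as significant."""
    segments = []
    start = None
    # A sentinel equal to the cutoff closes a segment that runs to the last residue
    for i, score in enumerate(list(scores) + [cutoff]):
        if score > cutoff:
            if start is None:
                start = i
        elif start is not None:
            end = i - 1
            at_terminus = start == 0 or end == len(scores) - 1
            needed = min_terminal if at_terminus else min_internal
            if end - start + 1 >= needed:
                segments.append((start, end))
            start = None
    return segments

# 10 disordered residues at the N-terminus, 12 internal, then 20 internal
example = [0.9] * 10 + [0.1] * 5 + [0.9] * 12 + [0.1] * 5 + [0.9] * 20 + [0.1] * 3
kept = significant_disorder(example)
```

The 12-residue internal stretch is discarded, while the terminal stretch of 10 and the internal stretch of 20 are retained.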

MODEL 2 is a linear predictor known as PONDR VL2 (S. Vucetic, C.J. Brown, A.K. Dunker, Z. Obradovic, Supervised Partitioning of Disordered Proteins, in progress). PONDR VL2 was trained on 145 disordered regions whose lengths were 40 amino acids or longer. These regions were identified either by missing electron density in X-ray crystal structures, or by author's designation for NMR, circular dichroism or proteolysis. The ordered training set was composed of 130 completely ordered proteins with no sequence similarity. (Both datasets are available at http://disorder.chem.wsu.edu). The attributes used for training this predictor were 19 amino acid compositions (excluding F), flexibility and sequence complexity. These values were calculated for a window size of 41. A collapsing window size was used for the termini of each sequence. The algorithm used for prediction was ordinary least squares regression with ordered windows designated 0 and disordered windows designated 1. In this case, prediction values can range below 0 and above 1; any value below 0.5 is considered ordered and greater than or equal to 0.5 is considered disordered.

MODEL 3 is based on an ensemble of feed-forward neural networks, combined using bagging methodology and augmented using an order/disorder boundary predictor. The dataset of disordered proteins had 154 disordered regions that were 30 residues long or longer. The dataset of ordered proteins had 290 completely ordered proteins. (Both datasets are available at http://disorder.chem.wsu.edu). All networks were trained using the Levenberg-Marquardt algorithm with at most 200 iterations, and were optimized for detecting intrinsic disorder of length 30 residues or longer.

The ensemble of predictors contains 50 feed-forward neural networks. Each predictor in the ensemble uses 20 input attributes: 19 are amino acid frequencies in a sliding window of 41 residues, and the last is sequence complexity. The structure of each network is 20 x 5 x 2. All hidden neurons use a logistic activation function, and all output neurons use a linear activation function. Output nodes approximate the conditional posterior probability of each class, provided that the number of examples used for training was large enough and the training algorithm found a global optimum. Predictions were windowed using an output window of length 31.
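
A forward pass through one 20 x 5 x 2 network, and the averaging of outputs across ensemble members, might look like the sketch below. The weights here are random placeholders, not trained values, and the plain averaging of member outputs is an assumed way of combining a bagged ensemble; training with Levenberg-Marquardt is not shown.

```python
import math
import random


def forward(x, w1, b1, w2, b2):
    """Logistic hidden layer followed by a linear output layer."""
    hidden = [
        1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
        for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def random_net(rng, n_in=20, n_hidden=5, n_out=2):
    """Placeholder network with random weights (illustration only)."""
    w1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    b2 = [rng.uniform(-1, 1) for _ in range(n_out)]
    return w1, b1, w2, b2


def ensemble_predict(nets, x):
    """Average the outputs of all ensemble members."""
    outputs = [forward(x, *net) for net in nets]
    return [sum(col) / len(outputs) for col in zip(*outputs)]


rng = random.Random(0)
ensemble = [random_net(rng) for _ in range(50)]
prediction = ensemble_predict(ensemble, [0.05] * 20)
```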

The order/disorder boundary predictor [3] is a logistic regression predictor that uses the frequencies of specific residues at each position on either side of an order/disorder boundary. Training sequences came from the same data sets as the ensemble predictors; however, there were only 123 order/disorder boundaries in the disorder set. Inputs to the predictor are the frequencies of amino acids at each position in a window of 24 residues. To reduce dimensionality, only amino acids that showed significantly different frequencies at an order/disorder boundary relative to the same positions in completely disordered and completely ordered segments were used as inputs. Dimensionality was further reduced using principal component analysis.

After bagging the ensemble of neural network predictors, the order/disorder boundary predictor was applied and the predictions were recalculated. The final predictor is therefore a boundary-augmented, bagged ensemble of feed-forward neural networks.

1. Romero P. et al. (2001) Sequence complexity of disordered protein. Proteins: Struc. Func. Genetics 42, 38-48. 2. Li X., et al. (1999) Predicting protein disorder for N-, C-, and internal regions. Genome Informatics 10, 30-40. 3. Radivojac P., et al. (2003) Prediction of boundaries between intrinsically ordered and disordered protein regions. Pacific Symp Biocomputing (in press).

ESyPred3D (P0034) - 36 predictions: 36 3D

ESyPred3D: Prediction of Proteins 3D Structures

C. Lambert and E. Depiereux Unité de Recherche en Biologie Moléculaire, Facultés Universitaires Notre-Dame de la Paix, rue de Bruxelles 61, 5000 Namur, Belgium [email protected]

The aim of our work is to propose a reliable automatic method for homology modeling (ESyPred3D[1]), especially when the protein of interest shares a low percentage of identities (20-30%) with the chosen template.

Our strategy consists of the usual steps for homology modeling: template search in databanks, target-template alignment, and model building. At present, our method does not provide any assessment of the model.

For the template search, we used four iterations of PSI-BLAST [2] on the non-redundant protein database (nr) of the NCBI. All sequences with an expected value lower than 0.001 are included in the profile building. The template is chosen as the sequence of known structure (PDB) that has the lowest expected value. The search in the nr databank also gives us a large number of similar sequences.

As far as possible, two sets of sequences are built. The first contains the 50 best hits below the expected value cutoff of 0.001. The second contains a subset of these sequences, obtained by dropping those that are too redundant. The aim is to create different conditions for running the multiple alignment programs and to extract different consensus alignments, in order to raise the confidence of the sequence-structure alignment.
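
The hit filtering and template choice can be sketched as below. The `Hit` record and function names are hypothetical, the sketch assumes the 0.001 cutoff also applies to the template, and the redundancy filter for the second set is omitted because its criterion is not given here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    evalue: float
    has_structure: bool  # True when the hit is a PDB sequence


def select_hits(hits, cutoff=0.001, max_hits=50):
    """First sequence set: the best hits below the expected value cutoff."""
    kept = sorted((h for h in hits if h.evalue < cutoff), key=lambda h: h.evalue)
    return kept[:max_hits]


def choose_template(hits, cutoff=0.001):
    """Hit of known structure with the lowest expected value, or None."""
    candidates = [h for h in hits if h.has_structure and h.evalue < cutoff]
    return min(candidates, key=lambda h: h.evalue) if candidates else None
```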

The two sets are then submitted to five alignment programs: ClustalW[3], Dialign2[4], Match-Box[5], Multalin[6] and T-Coffee[7]. A pairwise alignment between the target and template sequences is extracted from each multiple alignment. All the pairwise alignments including the one provided by PSI-BLAST are used to generate a database of aligned positions (boxes). A neural network is used to assign a score to each box. Most confident boxes are taken as anchor points for the building of the final sequence-structure alignment. A three-dimensional model is built using MODELLER [8] on this final alignment.
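
The idea of collecting aligned positions from several pairwise alignments can be sketched as follows. Note that this sketch replaces the neural network scoring with a simple vote count, which is a stand-in and not the ESyPred3D scoring; function names are hypothetical.

```python
from collections import Counter


def aligned_pairs(target_row, template_row):
    """Set of (target index, template index) pairs aligned in one pairwise alignment."""
    pairs = set()
    i = j = 0
    for t, s in zip(target_row, template_row):
        if t != "-" and s != "-":
            pairs.add((i, j))
        if t != "-":
            i += 1
        if s != "-":
            j += 1
    return pairs


def consensus_anchors(alignments, min_support):
    """Aligned positions supported by at least min_support alignments (vote count)."""
    votes = Counter()
    for target_row, template_row in alignments:
        votes.update(aligned_pairs(target_row, template_row))
    return sorted(pair for pair, n in votes.items() if n >= min_support)
```

Positions on which most programs agree would then serve as anchor points, with the less certain regions filled in between them.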

ESyPred3D web site: http://www.fundp.ac.be/urbm/bioinfo/esypred

1. Lambert C. et al. (2002) ESyPred3D: Prediction of proteins 3D structures. Bioinformatics. 18 (9), 1250-1256 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 3. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22, 4673-4680 4. Morgenstern B. et al. (1998) DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14, 290-294 5. Depiereux E. et al. (1997) Match-Box server: a multiple sequence alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13, 249-256 6. Corpet F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881-10890 7. Notredame C. et al. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205-217 8. Sali A. et al. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234(3), 779-815.

evolutionaries (P0180) - 99 predictions: 99 3D

A Phylogenomic Approach to Fold Prediction

Kimmen Sjölander1, Emma Hill1, David Konerding1, Steven Brenner1, Andrej Sali2 and Andras Fiser2 1 – UC Berkeley, 2 – Rockefeller University [email protected]

The Berkeley Evolutionaries approach focused on the use of phylogenetic inference to guide our selection of structural templates for targets and to produce an alignment of a target and template pair. These alignments were used as the basis for model construction (using MODELLER), from which the best model was chosen using a statistical potential function (PROSAII).

1. HMM library construction. We constructed an HMM library using the SCOP PDB40 sequence set (Astral version 1.57, with 4013 domains) as seeds, the UCSC SAM-T99 software to cluster and align homologous sequences in the NR database to each seed, and the UCSC fw0.5 software to construct a general HMM for each cluster. We then ran BETE (Bayesian Evolutionary Tree Estimation) to construct a phylogenetic tree and identify subfamilies (Section 6), and constructed subfamily HMMs (Section 7). This HMM library was expanded to include HMMs for new sequences submitted to PDB since the Astral 1.57 PDB40 database had been completed.

2. Creation of a set of sequence homologs to PDB structures. We generated a consensus sequence for each subfamily in the HMM library, creating a representative set of ‘sequences’ for each cluster. We collected all consensus sequences into one large file (“sfreps.seqs”), containing 552,000 sequences, each of which is mapped to a specific PDB40 domain.

3. Target identification. Each target went through a multi-stage analysis, using first those methods which are computationally efficient but less sensitive, and continuing (as necessary) to the more computationally expensive and sensitive methods. In cases where a target appeared to be composed of multiple domains, we constructed HMMs for each domain separately, and each domain was treated as a separate target with all stages of the target identification process performed independently. All putative target-template matches were assessed using various criteria (see Section 4).
Stage 1. Target is scored against the general HMMs in the HMM library.
Stage 2. Target is scored against subfamily HMMs for high-scoring clusters.
Stage 3. The FlowerPower algorithm (Section 5) is used to identify homologs to the target from the NR database, and to construct a multiple sequence alignment (MSA). The target homologs are scored against the general HMMs in the HMM library, followed by scoring of subfamily HMMs for high-scoring clusters.
Stage 4. We construct general and subfamily HMMs for the target and homologs (using BETE); the HMMs are then used to score the PDB database.
Stage 5. The target general HMM is used to score the ‘sfreps.seqs’ file (Section 2), followed by scoring all the top hits against the target subfamily HMMs.
Stage 6. We score the NR database with the general HMM constructed for the target to find additional homologs to the target. Any accepted sequences are included in the target HMM training set, and stages 3-5 are repeated.

4. Assessing the likelihood of a target-template match. We used a three-stage alignment analysis: (1) analysis of the MSA for the target and homologs to identify key residues, conserved motifs, regions of variability, and so on; (2) an identical analysis of any template MSAs (including literature search for experimentally determined key residues); and (3) construction and analysis of a joint alignment of the members of the target and template families. In cases where the target-template sequence similarity was very low, we used a variety of alignment methods, including SATCHMO (Section 8), and joint HMM construction to generate a joint alignment. Joint HMM construction employs subfamily HMMs to detect intermediate sequences from the target and template homolog sets. These sequences are mutually aligned, using one of the subfamily HMMs, followed by FlowerPower expansion of the joint HMM training set, until the target and template structure can be aligned accurately. Structural alignments (from DALI) were also used as inputs to FlowerPower, for inclusion of sequence homologs, and construction of general and subfamily HMMs. These structurally informed HMMs were then used to align the target and generate a pairwise alignment between the target and template. All alignments produced for the target and template and their homologs were then inspected for agreement at predicted or known key positions in either the target or template structure.

5. FlowerPower clustering and multiple sequence alignment. FlowerPower integrates clustering and alignment into a single process, and thus has obvious similarities to both PSI-BLAST and SAM-T99. There are two fundamental differences which distinguish our approach. First, instead of using a single HMM or profile to expand the existing cluster, we use a set of HMMs: a general HMM for the family as a whole, and a subfamily HMM for each subfamily found by BETE. Each HMM competes for all sequences, and is used to align those sequences most closely related to it. Because subfamily HMMs have specificity for individual subfamilies included in previous iterations, they prevent profile drift (i.e., sequences identified as homologous in early iterations will continue to be identified as homologs at all subsequent iterations) and result in improved alignment, particularly in regions of overall diversity among family members. Second, FlowerPower uses alignment quality control after each cluster expansion step, to prevent the potential intrusion of non-homologs and/or poorly aligned homologs.

6. Bayesian Evolutionary Tree Estimation (BETE). BETE[1] employs agglomerative clustering to construct a tree, given an input multiple sequence alignment. Initially all sequences are in separate classes, and form the leaves of the tree. For each sequence in the alignment we construct a profile to represent the amino acid probabilities at each position, using Dirichlet mixture priors. We measure the distance between all pairs of classes, using a symmetrized form of relative entropy between the profiles, and find the two closest classes. We then estimate a new profile to represent all the sequences in the merged class, compute the distances between the new class and the other classes, and join the closest pair. Dirichlet mixture densities are also used to compute the encoding cost of the set of alignments defined at each stage of the agglomeration; the point during the agglomeration producing the minimum encoding cost defines a cut of the tree into subtrees, and a decomposition of the sequences into subfamilies. BETE was used at Celera Genomics (in combination with subfamily HMM construction) to produce the functional classification of the proteins encoded in the human genome[2].

7. Subfamily HMM construction and performance. For each subfamily, and at each position, we use Dirichlet mixture priors to weight the contribution of amino acids aligned by other subfamilies. This enables subfamily HMMs to retain specificity in homolog detection without sacrificing sensitivity, even when individual subfamilies may contain very few members. Experimental validation on the PDB40 datasets shows that subfamily HMMs provide very high specificity of classification of novel sequences, and improve sensitivity of homolog detection, particularly in the case of fragment detection[3].

8. Simultaneous Alignment and Tree Construction using Hidden Markov mOdels (SATCHMO). SATCHMO[4] simultaneously estimates a phylogenetic tree and generates a set of multiple sequence alignments, one for each node in the tree. Because SATCHMO requires sequences in each subtree to retain their mutual alignment when two subtrees are joined, sequences are not allowed to get out of register with others that are closely related. The multiple sequence alignment at each node models the consensus structure held by the sequences descending from that node; at the root, the alignment predicts the ‘conserved core structure’ shared by all members of the family, with alignments increasing in length and in specificity as a path is traced from the root towards a leaf. In experiments on the BAliBASE benchmark alignment database, SATCHMO is shown to perform comparably to ClustalW and the UCSC SAM alignment ‘tune-up’ software.

9. Building a full-atom model. Once alignments of the target sequence with several different candidate template structures or alternative alignments with a given template were obtained, MODELLER was used to construct all-atom models of the target[5]. MODELLER implements comparative modeling by satisfaction of spatial restraints. For each template selection and alignment, 20 models were built and subsequently evaluated by statistical potential functions in PROSAII[6]. The template selections and alignments were iteratively refined by hand to increase the PROSAII Z-scores of the corresponding models. In addition, well-defined insertions were modelled with the 'ab initio' loop modeling module of MODELLER[7]. For the difficult modeling cases involving remotely related templates, some predicted helices and strands were also explicitly restrained to maintain the predicted fold. In a few cases, the target sequence was modeled in complex with corresponding cofactors or inhibitors. After the final alignment and template selection were found by optimizing the PROSAII Z-score, the best among the final 20 models was selected based on the value of the MODELLER objective function.
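
The distance used in the agglomeration of item 6 can be illustrated with a short sketch. It computes a symmetrized relative entropy (the sum of the two directed divergences) between class profiles and finds the closest pair of classes. As a loudly stated simplification, uniform pseudocounts stand in for the Dirichlet mixture priors, and the encoding-cost criterion for cutting the tree is not shown; all names are hypothetical.

```python
import math
from itertools import combinations

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(residues, pseudocount=1.0):
    """Amino acid probabilities for one column (uniform pseudocounts, not Dirichlet mixtures)."""
    total = len(residues) + pseudocount * len(ALPHABET)
    return {aa: (residues.count(aa) + pseudocount) / total for aa in ALPHABET}


def class_profile(aligned_seqs, pseudocount=1.0):
    """One probability distribution per alignment column; gaps are skipped."""
    length = len(aligned_seqs[0])
    return [
        column_profile([s[i] for s in aligned_seqs if s[i] != "-"], pseudocount)
        for i in range(length)
    ]


def symmetrized_relative_entropy(prof_a, prof_b):
    """Sum over columns of D(p||q) + D(q||p)."""
    total = 0.0
    for p, q in zip(prof_a, prof_b):
        for aa in ALPHABET:
            total += (p[aa] - q[aa]) * math.log(p[aa] / q[aa])
    return total


def closest_pair(classes):
    """Indices of the two classes (lists of aligned sequences) with the smallest distance."""
    profiles = [class_profile(c) for c in classes]
    return min(
        combinations(range(len(classes)), 2),
        key=lambda ij: symmetrized_relative_entropy(profiles[ij[0]], profiles[ij[1]]),
    )
```

Repeatedly merging the closest pair and re-estimating the merged profile yields the tree; each merge would replace the two classes with their union.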

1. Sjolander K. (1998) Phylogenetic inference in protein superfamilies: analysis of SH2 domains. Proc Int Conf Intell Syst Mol Biol 6: p. 165-74. 2. Venter J.C. et al. (2001) The sequence of the human genome. Science 291(5507): p. 1304-51. 3. Christopher W. and Sjolander K. The sum of the parts is greater than the whole: Protein classification using subfamily HMMs. Submitted. 4. Edgar R. and Sjolander K. Simultaneous Sequence Alignment and Tree Construction using Hidden Markov Models, To appear in Pacific Symposium on Biocomputing. 2003. Kauai, HI. 5. Sali A. and Blundell T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 234(3): p. 779-815. 6. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17(4): p. 355-62. 7. Fiser A., Do R.K., and Sali A. (2000) Modeling of loops in protein structures. Protein Sci 9(9): p. 1753-1773.

FAMS (P0168) - 324 predictions: 324 3D

Homology Modeling Server, FAMS

M. Iwadate, R. Yamatsu, G. Terashi, R. Arai and H. Umeyama School of Pharmaceutical Sciences, Kitasato University [email protected]

We developed a homology-modeling server, FAMS [1], which runs the homology modeling software FAMS (Full Automatic Modeling System) [2]. Alignments were generated with several BLAST-type programs (BLAST, PSI-BLAST, RPS-BLAST and IMPALA [3]) together with our own alignment and re-alignment software.

Many model structures were calculated from the various alignments, and a scoring function was defined that takes into account the E-values of the alignments, the hydrophobic interactions of the model structures and their secondary structures. The five highest-scoring models were submitted.

When the target protein has high sequence homology to a known structure in the PDB database, the BLAST-type programs produce many alignments. FAMS modeling was executed for each alignment, so many models were produced. Given the 48-hour CAFASP3 deadline, the alignments for each target were prioritized using a list sorted by E-value.

When the target had little or no homology to a known reference structure, our original alignment algorithm worked effectively, and it could always provide many model structures for all 67 CAFASP3 targets. The total number of structures submitted from the FAMS server was 335 (67 * 5).

The computing time actually used for each target depended strongly on the number of query sequences sharing the same deadline day, and a complete calculation sometimes requires more than 48 hours. To choose the best structure from many candidates within the limited time, we built a PC cluster of 150 Linux CPUs on which the FAMS programs run. This made it possible to calculate a large number of model structures from a large number of alignments.

1. Iwadate M., Ebisawa K. and Hideaki U. (2001) Homology modeling of CAFASP2 competition, Chem-Bio Informatics J. 1 (4), 136-148 2. Ogata K. and Hideaki U. (2000) An automatic homology modeling method consisting of database searches and simulated annealing, J. Mol. Graph. Model., 18 (3), 258-272, 305-306 3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402

FAMSD (P0169) - 322 predictions: 322 3D

Homology Modeling Server, FAMSD

M. Iwadate, R. Yamatsu, G. Terashi, R. Arai and H. Umeyama School of Pharmaceutical Sciences, Kitasato University [email protected]

We developed a homology-modeling server, FAMSD, an advanced version of FAMS [1]. This server runs the homology modeling software FAMS (Full Automatic Modeling System) [2], together with several BLAST-type programs [3] and our own re-alignment software.

FAMSD differs from the FAMS server in two main respects. 1) The SCOP domain database was used for the homology search instead of the PDB database. 2) The E-value threshold was set high, at 10, in all BLAST-type alignment programs. A scoring function calculated from the modeled structures and the E-value was defined to determine the priority order, and the 5 models with the highest scores were submitted.

For targets with high homology to a known reference structure, similar alignments and many template protein coordinates were selected from the FAMS server, and the BLAST-type programs produced many alignments at the high E-value threshold, so many models were produced. Given the 48-hour CAFASP3 deadline, the alignments for each target were prioritized by sorting on E-value. For targets with little or no homology to a known reference structure, many model structures were still always obtained for all 67 CAFASP3 targets because of the high E-value threshold. The total number of structures submitted from this server was 334 (for target T0129, FAMSD returned 4 structures).

The FAMSD server introduces new techniques: a model fitting algorithm and an alignment combining algorithm. For some targets these algorithms worked effectively, as judged by the score mentioned above. The alignment combining algorithm was also used in the FAMS server.

1. Iwadate M., Ebisawa K. and Hideaki U. (2001) Homology modeling of CAFASP2 competition, Chem-Bio Informatics J. 1 (4), 136-148 2. Ogata K. and Hideaki U. (2000) An automatic homology modeling method consisting of database searches and simulated annealing, J. Mol. Graph. Model., 18 (3), 258-272, 305-306 3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402

FFAS03 (P0309) - 314 predictions: 314 3D

FFAS03: Automated Profile-Profile Distant Homology Recognition Server Applied to Fold Recognition.

L.Jaroszewski1 and A.Godzik2 1 JCSG Bioinformatics, UCSD, 2 – The Burnham Institute [email protected]

The FFAS03 (Fold and Function Assignment System 03) automated fold recognition server is based on the dynamic programming alignment of sequence profiles and builds on the FFAS algorithm described previously [1-3].

FFAS03 sequence profiles are calculated from multiple sequence alignments obtained with PSI-BLAST [4] with a special weighting system. Five iterations of PSI-BLAST were applied to collect sequences from the non-redundant NCBI database after clustering it at 85% sequence identity with the CD-HIT program [5] and masking low-complexity regions with the SEG program [6]. FFAS03 uses the sequences identified by PSI-BLAST with E-values below 0.01, together with the alignment from the PSI-BLAST output.

A two-dimensional weighting system is based on the matrix of all-to-all similarities within the homologous family. In addition, FFAS03 performs a normalization of the matrix containing the comparison scores between all positions of both aligned profiles. The final (normalized) score is obtained from the raw dynamic programming score by comparison with an empirically obtained distribution of raw scores on a representative domain library of different folds based on the SCOP database [7].
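
A generic profile-profile dynamic programming alignment and a score normalization can be sketched as follows. This is not the FFAS03 scoring scheme: the dot-product column score, the baseline offset, the linear gap penalty and the Z-score normalization are all assumptions chosen to keep the illustration short.

```python
import statistics


def column_score(p, q, baseline=0.3):
    """Dot product of two frequency columns minus a baseline (assumed scoring)."""
    return sum(p.get(aa, 0.0) * w for aa, w in q.items()) - baseline


def local_alignment_score(prof_a, prof_b, gap=1.0, baseline=0.3):
    """Best local (Smith-Waterman style) alignment score between two profiles."""
    n, m = len(prof_a), len(prof_b)
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = column_score(prof_a[i - 1], prof_b[j - 1], baseline)
            h[i][j] = max(0.0, h[i - 1][j - 1] + s, h[i - 1][j] - gap, h[i][j - 1] - gap)
            best = max(best, h[i][j])
    return best


def normalized_score(raw, background_scores):
    """Z-score of a raw score against an empirical background distribution."""
    mean = statistics.mean(background_scores)
    sd = statistics.stdev(background_scores)
    return (raw - mean) / sd
```

In practice the background distribution would come from scoring the query against a library of unrelated folds, as the text describes.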

1. Rychlewski L., Jaroszewski Ł., Li W. & Godzik A. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science 9, 232-241 2. Jaroszewski Ł., Rychlewski L. & Godzik A. (2000).Improving the quality of twilight-zone alignments. Protein Science 9, 1487-1496 3. Jaroszewski Ł., Li W., & Godzik A. (2002) In the search for more accurate alignments in the twilight zone. Protein Science 11:1702-13 4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 5. Li W, Jaroszewski Ł, Godzik A. (2002) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics 17: 282-283 6. Wootton J.C. (1994) Non-globular domains in protein sequences: automated segmentation using complexity measures. Comput Chem. 18:269-85. 7. Murzin A.G., Brenner S.E., Hubbard T., Chothia C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol 247:536-540.

Flohil (P0545) - 3 predictions: 3 3D

Completion and Refinement of 3-D Homology Models With Restricted Molecular Dynamics

J.A. Flohil, S.W. de Leeuw Delft University of Technology, Department of Applied Physics Lorentzweg 1, 2628 CJ Delft, The Netherlands [email protected]

The modeling and refinement protocol was applied according to the method described in [1]. http://www3.interscience.wiley.com/cgi-bin/fulltext?ID=95016016&PLACEBO=IE.pdf

1. Flohil J.A. et al. (2002) Completion and Refinement of 3-D Homology Models with Restricted Molecular Dynamics: Application to Targets 47,58, and 111 in the CASP Modeling Competition and Posterior Analysis. Proteins 48, 593-604.

Flohil (P0545) - 3 predictions: 3 3D

Molecular Dynamics Simulation of Hydrophobic Collapsing From a Model in Extended State

J.A. Flohil, S.W. de Leeuw Delft University of Technology, Department of Applied Physics Lorentzweg 1, 2628 CJ Delft, The Netherlands [email protected]

An initial simulation model was created by mutation of a fully extended polyglycine into the target sequence. GROMACS [1] with the GROMOS96 force field was used for all simulations; the applied parameters included periodic boundary conditions, temperature coupling and long-range electrostatics. The system was initially built by adding a 0.6 nm shell of explicit water around the stretched model and placing it in a 12x12x150 nm box. An energy minimization was performed before each of the following simulations. After adding or renewing the water, a 10 ps run with position restraints on the protein was done to equilibrate the water. A main collapsing run of 1 ns was then performed. This run started with all amino acid positions harmonically restrained; every 10 ps one consecutive residue was released from its restraints, until the complete chain was able to move freely. The first residue restraint was released at the N-terminal side, the last at the C-terminal side. If the residue-releasing procedure was completed before the end of the simulation, the remaining time was used to continue without restraints. A snapshot of the 1 ns trajectory was recorded every 1 ps, and for each snapshot the radii of gyration of the protein atoms about the x, y and z axes were computed. The frame with the model in the most compact state was selected for further refinement. Drifting water molecules having no contact with the water shell or the protein were removed from the box, the dimensions of the box were reduced as far as possible, and empty holes in the box were filled with water molecules. The simulation was restarted to run for another 3 ns, and the best model conformation was selected for submission based on the evolution of forming secondary and tertiary segments, as well as backbone-backbone contacts and the free energy of the water.
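
The restraint release schedule and the selection of the most compact snapshot can be sketched as below. This is not GROMACS code. It assumes that the first residue is released after the first 10 ps, and it uses a single overall radius of gyration with unit masses as the compactness measure, which simplifies the per-axis values mentioned above; all names are hypothetical.

```python
import math


def released_residues(time_ps, n_residues, interval_ps=10.0):
    """Number of residues (counted from the N-terminus) already free at a given time."""
    return min(n_residues, int(time_ps // interval_ps))


def radius_of_gyration(coords):
    """Unit-mass radius of gyration of a list of (x, y, z) coordinates."""
    n = len(coords)
    cx = sum(c[0] for c in coords) / n
    cy = sum(c[1] for c in coords) / n
    cz = sum(c[2] for c in coords) / n
    sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords)
    return math.sqrt(sq / n)


def most_compact_frame(frames):
    """Index of the snapshot with the smallest radius of gyration."""
    return min(range(len(frames)), key=lambda i: radius_of_gyration(frames[i]))
```

Under this schedule a 1 ns run can release at most 100 residues, so for longer chains the release procedure would not complete within the main run.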

1. Lindahl E., et al. (2001) GROMACS 3.0: A Package for Molecular Simulation and Trajectory Analysis. Journal Molecular Modeling, 7, 306-317.

FISCHER (P0427) - 161 predictions: 161 3D

Fold Recognition Using 3D-Shotgun

D. Fischer1 and N. Siew1,2 1 Bioinformatics/Computer Science, 2 Dept. of Chemistry Ben Gurion University, Beer-Sheva, Israel [email protected]

Fully automated structure-prediction methods can currently produce reliable models for only a fraction of the target sequences. However, using a number of semi-automated procedures, human-expert predictors are often able to produce more and better predictions than automated methods. We have recently developed a novel, fully automatic, fold-recognition meta-predictor, named 3D-SHOTGUN [1], that incorporates some of the strategies human predictors have successfully applied. This new method is reminiscent of the so-called cooperative algorithms of Computer Vision. The input to 3D-SHOTGUN is the top models predicted by a number of independent fold-recognition methods. The meta-predictor consists of four steps:

Assembly of hybrid models, Confidence assignment, Selection, and Model Refinement.

The first three steps are fully automated within the bioinbgu fold-recognition server. MaxSub [2] and LiveBench tests have demonstrated that 3D-SHOTGUN is more sensitive than any of the individual methods, and the predicted hybrid models are, on average, more similar to their corresponding native structures than those produced by the individual servers. The models produced by bioinbgu were submitted to CAFASP. The fourth step, which includes model refinement using Modeller, is not yet part of the server, although it is also fully automatic. The Fischer group's predictions to CASP were the results of applying this fourth step to the server’s 3D-SHOTGUN predictions.

1. Fischer D. (2002) 3D-SHOTGUN: A Novel, Cooperative, Fold-Recognition Meta-Predictor. Proteins, In press. 2. Siew N, Elofsson A, Rychlewski L, and Fischer D. (2000). MaxSub: an Automated Measure for the Assessment of Protein Structure Prediction Quality. Bioinformatics 16 (9), 776-85.

Floudas-C.A. (P0011) - 15 predictions: 15 3D

ASTRO-FOLD: Ab-Initio Tertiary Structure Prediction of Proteins

Christodoulos A. Floudas and John L. Klepeis Department of Chemical Engineering, Princeton University, Princeton, NJ [email protected]

ASTRO-FOLD is an integrated methodology for the ab-initio structure prediction of proteins based on an overall deterministic global optimization framework coupled with mixed-integer optimization. The novel four-stage approach combines the classical and new views of protein folding, while using free energy calculations and integer linear optimization to predict the location of helical segments and the topology of beta-sheet structures and disulfide bridges, respectively. Detailed atomistic-level energy modeling and the deterministic global optimization method, αBB, coupled with torsion angle dynamics, form the basis for the final tertiary structure prediction [1-3].

The first stage of the approach involves the identification of helical segments. This is accomplished through detailed atomistic-level energy modeling of overlapping subsequences of the overall protein sequence using the selected force-field (e.g., ECEPP/3 [4]). The amino-acid sequence is first decomposed into subsequences of overlapping oligopeptides (e.g., pentapeptides, heptapeptides, nonapeptides). For instance, using heptapeptides, the following subsequences are generated: 1-7, 2-8, 3-9, etc. For each subsequence, global optimization is used to generate an ensemble of low energy conformations along with the global minimum energy conformation [5]. Rigorous free energies that include entropic, cavity formation, polarization and ionization contributions, and involve solution of the Poisson-Boltzmann equation, are calculated for a subset of conformations for each oligopeptide system. Finally, these free energy values are combined to determine helical propensities for each residue by calculating equilibrium occupational probabilities for each possible helical cluster [6].
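
The decomposition into overlapping oligopeptides, and one plausible way of turning free energies into occupational probabilities, can be sketched as follows. The Boltzmann weighting and the value of kT (about 0.593 kcal/mol near room temperature) are assumptions made for illustration; the free energy calculations themselves are not shown, and the function names are hypothetical.

```python
import math


def overlapping_oligopeptides(sequence, length=7):
    """(first residue, last residue, subsequence) with 1-based residue numbers."""
    return [
        (i + 1, i + length, sequence[i:i + length])
        for i in range(len(sequence) - length + 1)
    ]


def boltzmann_probabilities(free_energies, kT=0.593):
    """Equilibrium occupation probabilities from free energies (assumed kcal/mol)."""
    lowest = min(free_energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(f - lowest) / kT) for f in free_energies]
    z = sum(weights)
    return [w / z for w in weights]
```

For a nine-residue sequence and heptapeptides this yields the windows 1-7, 2-8 and 3-9, as in the example in the text.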

The second stage focuses on the prediction of beta-sheet and disulfide bridge topology through the analysis of amino acid properties that are based on residue hydrophobicities. The approach, which borrows key concepts from a mathematical framework developed in the area of process synthesis of chemical systems [7], is based on the idea that beta-structure formation relies on a hydrophobic driving force. To model this force, it is necessary to predict contacts between hydrophobic residues. The first important component of the approach is the postulation of a beta-strand superstructure that encompasses all alternative beta-strand arrangements. A novel mathematical model is then formulated to provide the formation of ordered structural features, such as beta-sheets and disulfide-bridge connectivity. The solution of this integer linear programming problem, with the objective being the maximization of the hydrophobic contact energy, provides a rank ordered list of preferred hydrophobic residue contacts, beta strand topologies and disulfide bridge connectivities [8].

The third stage involves the derivation of restraints based on helical and beta-sheet predictions in the form of dihedral angle and atomic distance restraints to enforce the predicted secondary and tertiary arrangements. Additional restraints are determined for the intervening loop residues connecting helical and strand regions through novel application of free energy simulation [9-11]. More specifically, the identified loops are extended on each side to incorporate three additional amino acids of both secondary structure elements that the loop connects. Each set of three flanking amino acids is constrained to be in its respective secondary structure state (e.g., helix, beta-strand). Then, a series of free energy calculations is conducted through the principles of overlapping oligopeptides, similar to the free energy calculations used in the helix prediction stage. The objective of these calculations is to produce improved bounds on the dihedral angles and backbone distances within the loop residues.

The fourth and final stage of the approach involves the prediction of the tertiary structure of the full protein sequence. The problem formulation, which relies on dihedral angle and atomic distance restraints introduced from the previous stages, as well as on detailed atomistic energy modeling, represents a nonconvex constrained global optimization problem. This problem is solved through the combination of a deterministic global optimization approach, αBB, and torsion angle dynamics [1,11]. A distributed computing implementation of each stage of the proposed approach has been developed, and our CASP5 predictions employ this parallel implementation.

1. Klepeis J.L. and Floudas C.A. (2002) Ab-Initio Tertiary Structure Prediction of Proteins, Journal of Global Optimization, in press.
2. Floudas C.A. (2000) Deterministic Global Optimization: Theory, Algorithms and Applications, Kluwer Academic Publishers.
3. Klepeis J.L. et al. (2002) Deterministic global optimization and ab-initio approaches for the structure prediction of polypeptides, dynamics of protein folding, and protein-protein interactions, Advances in Chemical Physics 120, 265-457.
4. Nemethy et al. (1992) Energy parameters in polypeptides. 10. Improved geometrical parameters and nonbonded interactions for use in the ECEPP/3 algorithm with applications to proline-containing peptides, Journal of Physical Chemistry 96, 6472-6484.
5. Klepeis J.L. and Floudas C.A. (1999) Free energy calculations for peptides using deterministic global optimization, Journal of Chemical Physics, 110, 7491-7512.
6. Klepeis J.L. and Floudas C.A. (2002) Ab-Initio Prediction of Helical Segments in Polypeptides, Journal of Computational Chemistry, 23, 1-22.
7. Floudas C.A. (1995) Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications, Oxford University Press.
8. Klepeis J.L. and Floudas C.A. (2002) Prediction of Beta-Sheet Topology and Disulfide Bridges in Polypeptides, Journal of Computational Chemistry, in press.
9. Klepeis J.L. and Floudas C.A. (2002) Analysis and Prediction of Loop Segments in Protein Structures, in preparation.
10. Klepeis J.L., Pieja M.J. and Floudas C.A. (2002) A new class of hybrid global optimization algorithms for peptide structure prediction: integrated hybrids, Computer Physics Communications, in press.
11. Klepeis J.L., Pieja M.J. and Floudas C.A. (2002) Hybrid global optimization algorithms for protein structure prediction: alternating hybrids, Biophysical Journal, in press.

FM-AF (P0571) - 17 predictions: 17 3D

Folding Machine: Coarse-Grained Folding Dynamics Using an Implicit-Solvent Potential

A. Colubri1, A. Fernández2,3 1 - Department of Chemistry, University of Chicago, IL 60637, 2 - Institute for Biophysical Dynamics, University of Chicago, IL 60637, 3- Instituto de Matemática, Universidad Nacional del Sur, Consejo Nacional de Investigaciones Científicas y Técnicas, Bahía Blanca 8000, Argentina [email protected]

The CASP5 predictions submitted by this group were generated with a coarse-grained ab-initio folding algorithm implemented as a computer program called Folding Machine (FM). This algorithm is based on three key simplifications:

1. The folding dynamics is computed in the space of backbone torsional angles by assuming constant bond lengths and plane angles. Furthermore, the algorithm takes advantage of the local geometrical constraints imposed on backbone motion: the torsional dynamics is subordinated to a discrete process of hopping between the Ramachandran basins accessible to each residue.
2. The conformations of the side-chains are space-averaged by using soft spheres to represent the actual side-chain geometry. This is based on the assumption that side-chain dihedral motions are much faster than those of the main-chain torsional angles.
3. An empirical intramolecular potential with an implicit treatment of the solvent is used to quantify the stability of the conformations generated along the folding pathway. This potential is constructed so that distinctive local environments shaped by the chain during folding are treated through many-body correlations that define a rescaling of the two-body energy contributions.

Earlier versions of the algorithm are described in [1-2]. The FM has been used previously to study pathway diversity [3-4] and cooperativity [5-6] in protein folding. Detailed analyses of points 1, 2 and 3 are given in these references.

The dynamics can be briefly described as follows:

a. At time t, the extent of structural involvement of each residue is quantified by means of our empirical potential function. In this way, we assign a certain probability of basin hopping to each residue.
b. According to these hopping probabilities, the residues that change basin in the interval [t, t+dt] (where dt = 10^-8 s) are determined, and an initial selection of phi-psi coordinates is made for those residues.
c. Using the resulting selection of basins as a constraint, a simulated annealing optimization of the protein conformation is performed in order to improve the nascent secondary and tertiary structures. This intra-basin optimization is not performed at every step, but at a given frequency, which is a parameter of our algorithm.

The main aspects of steps a, b and c are now presented (for more details see refs. [3-4]). Let P(k, t) denote the hopping probability of residue k at instant t, that is, the probability that residue k undergoes a basin change at time t. This quantity is defined by P(k, t) = exp[-dG(k, t)/RT], where dG(k, t) is the change in free energy associated with the basin hopping of k, assuming that all the interactions that depend on residue k are destroyed during this basin change.
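The hopping rule can be illustrated with a minimal Python sketch. It only assumes the formula P(k, t) = exp[-dG(k, t)/RT] given above; the energy units (kcal/mol), the temperature, the cap at 1 for non-positive dG, and the function names are illustrative assumptions, not part of the FM program.

```python
import math
import random

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K); the unit choice is an assumption


def hopping_probability(dG, temperature=300.0):
    """P = exp(-dG/RT); capped at 1 when dG <= 0 (the cap is an assumption)."""
    return min(1.0, math.exp(-dG / (R_KCAL * temperature)))


def select_hopping_residues(dG_per_residue, temperature=300.0, rng=None):
    """Return the indices of residues that change basin in the current time step."""
    rng = rng or random.Random()
    return [
        k
        for k, dG in enumerate(dG_per_residue)
        if rng.random() < hopping_probability(dG, temperature)
    ]
```

A residue with no stabilizing interactions (dG = 0) always hops in this sketch, while a strongly engaged residue (large positive dG) almost never does.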

The empirical potential function used to evaluate dG(k, t) has the following terms: U_excluded-volume + U_solvophobic + U_coulombic + U_dipolar + U_hydrogen-bond. In a zeroth-order approximation, generically denoted U0, each of these terms can be expressed as a sum of pairwise contributions U0(i, j). Under this approximation, the potential function does not reflect the effect of local solvent environments on the stability of the dielectric-dependent interactions (solvophobic, coulombic and hydrogen-bond). In order to incorporate this effect, we rescale the zeroth-order contribution of each pair, U0(i, j), by introducing renormalization factors f_i and f_j which depend on the level of desolvation of residues i and j. Thus the rescaled pairwise energy which implicitly includes the solvent effect is U(i, j) = f_i f_j U0(i, j), where f_i = f_i(L_i) and L_i is the extent of desolvation of residue i.
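The rescaling can be sketched in a few lines of Python. The abstract does not give the functional form of f_i(L_i), so the linear `desolvation_factor` below, its `gain` parameter, and the range of L are purely hypothetical placeholders; only the product form U(i, j) = f_i f_j U0(i, j) comes from the text.

```python
def desolvation_factor(L, gain=0.5):
    """Hypothetical f(L): 1 for a fully solvated residue (L = 0), 1 + gain at L = 1."""
    clipped = min(max(L, 0.0), 1.0)
    return 1.0 + gain * clipped


def rescaled_energy(U0, L):
    """Sum f_i * f_j * U0[i][j] over all residue pairs i < j.

    U0 is a symmetric matrix of zeroth-order pair energies and
    L[i] is the extent of desolvation of residue i.
    """
    f = [desolvation_factor(x) for x in L]
    total = 0.0
    for i in range(len(L)):
        for j in range(i + 1, len(L)):
            total += f[i] * f[j] * U0[i][j]
    return total
```

With this placeholder, a favorable contact between two buried residues is strengthened relative to the same contact between exposed residues.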

The accessible Ramachandran basins for each amino acid and the intra-basin distributions of phi-psi points were obtained by analyzing the torsional coordinates of the chains listed in the culled PDB database [7]. Using these data, the parameters for the discrete basin dynamics were calculated. For example, the probability of adopting a basin is given by its relative lacunar area.

In order to include local correlations between the basin dynamics of neighboring residues, the I-sites library of local sequence-structure patterns [8] is used to complement the information encoded in the distributions of Ramachandran basins. This is done in the following way: when a certain region of the chain matches the sequence of one of the I-sites motifs and its backbone geometry is close enough to the basin assignment of this motif, the torsional coordinates of the motif are applied to all the residues of that region of the chain. The FM also includes the possibility of using sequence-structure motifs obtained from the PHD server [9].

The structure optimization step requires a secondary structure assignment done in parallel with the simulation. The secondary structure assignment algorithm built into the FM is roughly similar to the DSSP algorithm [10], but is more error-tolerant in order to recognize imperfectly formed structures. It is also able to detect the structure topology, that is, the pattern of connections between the secondary structure elements, information which is needed in the subsequent structure optimization.

The structure optimization algorithm is designed to minimize Q(S1, S2) subject to restrictions of the form Rn(S1,n, S2,n) < qn*, n = 1, 2, ..., where S1, S2, S1,n, S2,n stand for different secondary structures detected in the chain, and Q(S1, S2), Rn(S1,n, S2,n) denote different quantities defined between structures, for example, the total interaction energy, the hydrogen-bond energy, the alignment variable (the angle between the end-to-end vectors of both structures), etc. Using this general optimization algorithm, different optimization schemes can be constructed in the FM, for example: to minimize the hydrogen-bond energy between two strands while preserving the remaining interactions, or to improve the alignment pattern between some set of nascent secondary structures without interfering with the tertiary structures already formed.

1. Fernández A., Colubri A., Appignanesi G. (2001) Finding the collapse-inducing nucleus in a folding protein. J. Chem. Phys. 114, 8678-8684
2. Fernández A., Appignanesi G., Colubri A. (2001) Semiempirical prediction of protein folds. Phys. Rev. E 64, 21901-21914
3. Colubri A., Fernández A. (2002) Pathway diversity and concertedness in protein folding: an ab-initio approach. J. Biomol. Struct. & Dyn. 19, 739-764
4. Fernández A., Colubri A. (2002) Pathway Heterogeneity in Protein Folding. Proteins: Func., Struct. & Genetics 48, 293-310
5. Fernández A., Colubri A., Berry R.S. (2001) Topologies to geometries in protein folding: hierarchical and non-hierarchical scenarios. J. Chem. Phys. 114, 5871-5887
6. Fernández A., Colubri A., Berry R.S. (2002) Three bodies correlations in protein folding: the origin of cooperativity. Physica A 307, 235-259
7. Culled PDB website: http://www.fccc.edu/research/labs/dunbrack/culledpdb.html
8. Bystroff C., Baker D. (1998) Prediction of Local Structure in Proteins Using a Library of Sequence-Structure Motifs. J. Mol. Biol. 281, 565-577
9. Rost B., Sander C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584-599
10. Kabsch W., Sander C. (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637

FORTE1 (P0290) - 276 predictions: 276 3D

Fold Recognition Using the FORTE1 Server

K. Tomii and Y. Akiyama Computational Biology Research Center National Institute of Advanced Industrial Science and Technology [email protected]

We attempted to submit the prediction results for all CASP5 targets in order to evaluate a novel fold recognition technique we have devised (K. Tomii, manuscript in preparation). To this end we have constructed a fold recognition system FORTE1 based on our new technique.

Lessons from distant comparative modeling and fold recognition have shown us that alignment accuracy remains the most important factor influencing model quality [1-2]. We have also realized that fold recognition methods utilizing evolutionary information outperform other methods [3]. Thus, we have devised a novel profile-profile comparison technique to increase the sensitivity of fold recognition and improve alignment accuracy. Our method measures the similarity between two profiles differently from other published methods, such as FFAS [4] and the method developed by Yona and Levitt [5], which exploit alignment information.

The FORTE1 system utilizes the sequence profiles of both the target and the templates to predict the structure of the target sequence. The template sequences are derived from the ASTRAL [6] (version 1.59) 40% identity list and from selected PDB entries that are not registered in the SCOP [7] (release 1.59) database. We update the template library mainly in step with updates of the PDB. Because an exceptionally powerful computational resource (http://www.cbrc.jp/magi/) is available at our center, CBRC, up to 20 PSI-BLAST [8] iterations are performed to prepare the profiles of both target and templates. During the prediction season the profiles are updated about every half month against the NCBI non-redundant database.

In profile comparisons, a global-local algorithm is employed to build an optimal alignment of the query sequence profile onto a template profile. The statistical significance of each alignment score is estimated by calculating a Z-score with a simple log-length correction. The candidate templates are sorted by Z-score, and the prediction results are then submitted in AL format. The whole procedure runs fully automatically. The server is available at http://www.cbrc.jp/forte1/.
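The ranking step can be sketched as follows. The abstract does not specify the form of the log-length correction, so dividing the raw score by the natural logarithm of the template length is an assumption made only for illustration, as are the function name and the input layout.

```python
import math
import statistics


def rank_templates(hits):
    """Rank templates by Z-score of a length-corrected alignment score.

    hits: list of (template_id, raw_score, template_length) with length > 1.
    The correction raw_score / ln(length) is an assumed, illustrative form.
    """
    corrected = [score / math.log(length) for _, score, length in hits]
    mean = statistics.mean(corrected)
    sd = statistics.pstdev(corrected)
    if sd == 0.0:
        z_scores = [0.0 for _ in corrected]
    else:
        z_scores = [(c - mean) / sd for c in corrected]
    ranked = [(hit[0], z) for hit, z in zip(hits, z_scores)]
    # Best (largest) Z-score first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Under this assumed correction, two templates with equal raw scores are separated by length, with the shorter template ranked higher.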

1. Venclovas C. et al. (2001) Comparison of performance in successive CASP experiments. Proteins. Suppl. 5, 163-170.
2. Tramontano A. et al. (2001) Analysis and assessment of comparative modeling predictions in CASP4. Proteins. Suppl. 5, 22-38.
3. Fischer D. et al. (2001) CAFASP2: The second critical assessment of fully automated structure prediction methods. Proteins. Suppl. 5, 171-183.
4. Rychlewski L. et al. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science. 9 (2), 232-241.
5. Yona G. et al. (2002) Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J. Mol. Biol. 315 (5), 1257-1275.
6. Chandonia J.M. et al. (2002) ASTRAL compendium enhancements. Nucleic Acids Res. 30 (1), 260-263.
7. Murzin A. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247 (4), 536-540.
8. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.

Friesner (P0112) - 174 predictions: 174 3D

Bridging the Gap Between Physical Chemistry and Bioinformatics

Matthew P. Jacobson1,4, Yuling An1, Tyler J. F. Day2, Volker A. Eyrich2, Ramy S. Farid2, John R. Gunn2, Susan Harrington1, Xin Li1, David L. Pincus1, Chaya S. Rapp3, Daron M. Standley2, and Richard A. Friesner1,* 1 Columbia University Department of Chemistry, 2 Schrödinger, Inc.,3 Yeshiva University, 4 Currently: UCSF Department of Pharmaceutical Chemistry. * [email protected]

Overview

The major emphasis of our participation in CASP5 was the integration of knowledge-based and physics-based methods for protein structure prediction. For proteins with low sequence identity, threading methods employing novel alignment techniques and residue based potential energy functions are used to identify remote homologues and build low resolution structures. When identification of the template family is straightforward, the alignment methods are combined with a physical chemistry based energy function (all-atom force field including electrostatics, Generalized Born model for the polar component of the solvation free energy, and a function describing the nonpolar component of the solvation free energy) which is deployed at one or more points during the modeling process: model construction, refinement, and/or final scoring. This energy function is of course substantially more computationally expensive than most scoring functions used for building and refining protein models, and we have invested substantial effort towards the development of new sampling algorithms to accelerate convergence.

Specific notable aspects of our methodology include 1) deliberate sampling of helix conformations, to complement the usual sampling of side chains and loops, 2) the construction of several models with biological symmetry and/or the explicit inclusion of ligands (by analogy to the templates), to improve local structural details of the models, 3) the use of a new alignment algorithm that employs both sequence and secondary structure information, 4) the use of a well-validated fold recognition algorithm that identified reasonable templates for several difficult targets, including all-alpha helical targets (T129, T139, T170) that were classified as challenging cases on the CAFASP website, 5) ab initio generation of unaligned beta sheet regions, and 6) the use of a composite model building facility, when a combination of multiple templates appeared to be superior to any single template.

Alignments

Our alignment algorithm used both an amino acid substitution matrix and secondary structure matching using a profile built from several prediction servers [1]. A variable gap penalty was employed, with a larger penalty assigned to gaps within secondary structure elements. In a majority of comparative modeling cases, the alignments generated with this algorithm were broadly consistent with alignments generated by other algorithms, such as PDB-Blast, with differences confined mostly to loops, and some minor shifts (a few residues) in secondary structure regions. In a few cases, including T133, the alignment produced using our algorithm differed substantially from others, although these were mostly targets at the outer fringes of comparative modeling. In these cases, we used our alignments in preference to the others, because our algorithm was designed to operate seamlessly over a large range of sequence identity, ranging from true fold recognition cases, where secondary structure matching dominates the alignment scoring, to homology modeling, where sequence matching dominates. Regions with significant variability among different alignment algorithms, including our own, were isolated for sampling after model building. In a majority of cases, multiple models were built using several different alignments and/or different templates, each model was refined independently, and the lowest energy resultant structure was submitted.
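A toy global alignment in Python illustrates how a substitution term, a secondary structure matching term, and a structure-dependent gap penalty can be combined. This is a generic dynamic programming sketch, not the authors' algorithm: the identity-based substitution score, the weight `w_ss`, the linear gap penalties, and the rule that an unaligned residue is charged according to its own secondary structure state are all assumptions.

```python
def align_score(query, q_ss, template, t_ss, w_ss=1.0, gap_loop=2.0, gap_sse=6.0):
    """Optimal global alignment score with structure-dependent gap penalties.

    query/template are residue strings; q_ss/t_ss are per-residue states
    (H helix, E strand, C coil) of the same lengths.
    """
    def sub(a, b):  # toy substitution score (assumption)
        return 2.0 if a == b else -1.0

    def ss_match(a, b):  # secondary structure agreement term
        return 1.0 if a == b else -1.0

    def gap(state):  # skipping a residue inside a helix or strand costs more
        return gap_sse if state in "HE" else gap_loop

    n, m = len(query), len(template)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] - gap(q_ss[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] - gap(t_ss[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = (D[i - 1][j - 1] + sub(query[i - 1], template[j - 1])
                     + w_ss * ss_match(q_ss[i - 1], t_ss[j - 1]))
            skip_query = D[i - 1][j] - gap(q_ss[i - 1])
            skip_template = D[i][j - 1] - gap(t_ss[j - 1])
            D[i][j] = max(match, skip_query, skip_template)
    return D[n][m]
```

Raising `w_ss` relative to the substitution score shifts the sketch toward the fold recognition regime described above, where secondary structure matching dominates.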

Composite templates were constructed for certain comparative modeling and many fold recognition targets. With respect to the latter, when the choice of template was ambiguous (about 15% of the cases), we generated alignments based on composite templates by enumerating variable regions within a group of structurally related proteins [2]. This procedure, which in general resulted in many thousands of candidate structures, was followed by a hierarchical filtering process that utilized clustering and statistics-based scoring [3].

Model Refinement and Scoring

We have created a new software package, PLOP (Protein Local Optimization Program) [4], which is explicitly designed to build and refine protein models using physical chemistry all-atom force fields and implicit solvent models (specifically a Generalized Born model). The primary emphasis has been on the development of new sampling algorithms that complement molecular dynamics and utilize knowledge-based statistics and fast steric screening to reduce computational expense. The hierarchy of sampling algorithms includes direct minimization, combinatorial side chain optimization, loop sampling, and sampling of helix positions/orientations. Together, these sampling algorithms permit energy-based refinement of homology models. Automated, iterative refinement was carried out in a parallel manner on up to 20 processors until the energy ceased to decrease substantially (typically 1–3 days).

The lowest level sampling algorithm is direct minimization, which is accomplished using a novel multi-scale algorithm based on the Truncated Newton method [5]. Side chain conformational sampling [6,7] is accomplished primarily through the use of highly detailed rotamer libraries developed by Xiang and Honig [8]. Loop prediction utilizes a dihedral angle sampling procedure for the backbone degrees of freedom to generate many loop candidates (10^2–10^6), followed by clustering, and finally side chain optimization and complete energy minimization on representative conformations in a hierarchical manner. Finally, because helix positions are also variable among homologs, particularly at relatively low sequence identity, we have implemented a helix sampling algorithm, in which rigid body (6 degrees of freedom) sampling of helix positions is coupled with loop prediction on either side of the helix, side chain re-packing, and energy minimization. In all of these sampling algorithms, the energy function consisted of the all-atom OPLS force field [7,9,10] and SGB/NP solvent model [11,12].

The sampling strategy for model refinement was informed by 1) location and number of gaps in the sequence alignment, 2) structural diversity among proteins in the same family as the template, as evidenced by multiple structure alignments, 3) non-conservative amino acid substitutions, particularly those involving Gly and Pro, 4) known structural problems with the initial model (e.g., steric clashes), and 5) local sequence similarity in different portions of the alignment.

Case-Specific Strategies

Composite Model Building: The submitted models for T132, T149 (N-terminal domain), T186, and T192 were each constructed from two homologous templates in PLOP. Models for T136, T146, T147, T162, T172 (domain 2), T173, T174, T181, T187, and T194 were constructed from several structurally similar proteins using the automated method of generating and filtering composite templates.

Biological Symmetry: Several targets (T151, T160, T167, T184 N-terminal domain, T189, T190) were specified to be homodimers or homotetramers (or assumed to be so based on the template protein). Neglect of the inter-chain interactions could lead to significant errors, for example due to exposure of hydrophobic residues that are buried in the biologically relevant complex. For this reason, we used symmetry operations, derived from the template structures, to replicate the monomer appropriately. All copies of the monomer retain the same conformation at all times during the refinement, thus reducing the sampling effort required [6].
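Replicating a monomer by symmetry operations amounts to applying a rotation and a translation to every atom. The following Python sketch shows this under simple assumptions: coordinates are plain (x, y, z) tuples, each operation is a hypothetical (rotation matrix, translation vector) pair taken from the template, and the function names are illustrative rather than part of PLOP.

```python
def apply_symmetry(coords, rotation, translation):
    """Return coords transformed by x' = R x + t for each (x, y, z) tuple."""
    return [
        tuple(
            sum(rotation[r][c] * xyz[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        for xyz in coords
    ]


def build_oligomer(monomer, operations):
    """First chain is the monomer itself; one further copy per (R, t) operation.

    Because every copy is generated from the same monomer, all chains
    share one conformation, as described in the text.
    """
    return [list(monomer)] + [apply_symmetry(monomer, R, t) for R, t in operations]
```

For a homodimer with a twofold axis along z, a single operation with R = diag(-1, -1, 1) and zero translation generates the second chain.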

Explicit Inclusion of Ligands: HETATM groups were explicitly included in model building and refinement for several targets, either because the CASP instructions specified their existence, or because the biological function of the protein would presumably require a cofactor. Examples include the fatty acid ligand and several strongly conserved water molecules in T137, a Zn ion in T141, the CoA cofactor in T169, two Co ions in T182, and a Mg ion in the N-terminal domain of T184.

Ab Initio Sheet Construction: Unaligned beta sheets were constructed using an algorithm that enumerates all possible ways of combining unpaired strands and extending existing sheets, given a fixed assignment of strand residues [13]. New sheet topologies were screened according to strand-strand connectivity [14] and loop crossing [15], and scored based on hydrophobic contacts [14]. Each model topology was then simulated independently and ranked as with other FR predictions. This method was applied to several difficult targets, including T130, T140, T148, T149, and T156.

1. An Y.L. and Friesner R.A. (2002) A novel fold recognition method using composite predicted secondary structures. Proteins 48, 352–366.
2. An Y.L. (2002) Homology-based protein structure prediction: fold recognition and alignment. Ph.D. Thesis, Columbia University.
3. Eyrich V.A. et al. (2001) Ab initio protein structure prediction using a size dependent tertiary folding potential. Adv. Chem. Phys., 120.
4. Jacobson M.P. and Friesner R.A. Unpublished.
5. Schlick T. and Overton M. (1987) A powerful truncated Newton method for potential energy minimization. J. Comput. Chem. 8, 1025–1039.
6. Jacobson M.P. et al. (2002) On the role of crystal packing forces in determining protein sidechain conformations. J. Mol. Biol. 320, 597–608.
7. Jacobson M.P. et al. (2002) Force field validation using protein sidechain prediction. J. Phys. Chem. B, accepted.
8. Xiang Z. and Honig B. (2001) Extending the accuracy limits of prediction for side-chain conformations. J. Mol. Biol. 311, 421–430.
9. Jorgensen W.L. et al. (1996) Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 118, 11225–11236.
10. Kaminski G.A. et al. (2001) Evaluation and reparameterization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations. J. Phys. Chem. B 105, 6474–6487.
11. Ghosh A. et al. (1998) Generalized Born model based on a surface integral formulation. J. Phys. Chem. B 102, 10983–10990.
12. Gallicchio E. et al. (2002) The SGB/NP hydration free energy model based on the Surface Generalized Born solvent reaction field and novel non-polar hydration free energy estimators. J. Comput. Chem. 23, 517–529.
13. Gunn J.R. and Friesner R.A. Unpublished.
14. Klepeis J.L. and Floudas C.A. (2002) Prediction of beta-sheet topology and disulfide bridges in polypeptides. Preprint.
15. Ruczinski I. et al. (2002) Distribution of beta sheets in proteins with applications to structure prediction. Proteins 48, 85–97.

FROST-MIG (P0047) - 72 predictions: 72 3D

FROST: a filter based fold recognition method

A. Marin1, J. Pothier2, K. Zimmermann1 and J-F. Gibrat1 1 -Mathématique Informatique et Génome, INRA, Jouy-en-Josas, 78352 cedex, FRANCE 2 -Atelier de Bioinformatique, 12 rue Cuvier, 75005 Paris, France [email protected]

The FROST method consists of four components: i) a library of cores, ii) a fitness function that measures the compatibility of a sequence to a fold, iii) an algorithm for optimal alignment of the sequence onto the fold and iv) a statistical analysis of the raw scores.

We have clustered all the sequences of proteins in the PDB into groups having more than 35% identical residues. The best structure from each group (based on miscellaneous criteria such as the resolution, the number of missing residues, etc.) was then chosen as the group representative. The corresponding core is defined as the conserved secondary structure elements of this representative, disregarding the loops.

The fitness function is based on two distinct sets of parameters. The first set (1D parameter set) involves only one site in the structure (sites are defined as the Cα atoms of the protein residues). The second set (3D parameter set) involves pairs of sites in contact in the structure. The first set only requires the knowledge of a degenerate version of the 3D structure, namely, the list of the structural states. These states are defined by their secondary structure (H helix, E strand, C coil) and their burial state (b buried, e exposed). The second set, on the other hand, requires the knowledge of the true 3D structure. Parameters for both sets are calculated using a definition of information due to Fano [1]. 1D parameters are a direct extension of BLOSUM matrices in which the known state of the residues, i.e., Hb, He, Eb, Ee, Cb, Ce, is taken into consideration. 3D parameters are a further extension in which one considers the cost of replacing a pair of amino acids in contact in a given structural context by another pair.
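The Fano information measure underlying these parameters is a log-odds ratio, I(a; s) = log[P(a, s) / (P(a) P(s))]. The Python sketch below computes it for amino acid and structural state pairs from raw counts. It illustrates only the information measure, not the FROST substitution parameters themselves; the input format and function name are assumptions.

```python
import math
from collections import Counter


def fano_information(observations):
    """Return I(a; s) = ln[P(a, s) / (P(a) P(s))] for each observed pair.

    observations: list of (amino_acid, state) pairs, e.g. ("L", "Hb").
    Pairs that never occur are omitted (their estimate would be -infinity).
    """
    pair_counts = Counter(observations)
    total = sum(pair_counts.values())
    aa_counts = Counter()
    state_counts = Counter()
    for (aa, state), n in pair_counts.items():
        aa_counts[aa] += n
        state_counts[state] += n
    return {
        (aa, state): math.log(n * total / (aa_counts[aa] * state_counts[state]))
        for (aa, state), n in pair_counts.items()
    }
```

Positive values mark amino acids that are enriched in a given state, negative values mark depleted ones, and zero corresponds to statistical independence.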

Using the 1D parameters we align a profile corresponding to the core (for which we know the residue states) with a profile corresponding to the query sequence using a suitably modified dynamic programming algorithm. Insertions/deletions inside core elements are strongly penalized. For 3D parameters that involve pairs of residues it is not possible to use the dynamic programming algorithm and we have to use an exact but slow branch and bound algorithm [2] for the smallest proteins or a faster heuristic method for the largest proteins.

Each set of parameters is used in a different filter. Each filter provides, for the query sequence, scores for being aligned with the database cores. These scores are only meaningful when they are normalized. In addition, one must also evaluate the significance of the normalized score. In FROST this is done empirically for each core in the database by aligning a set of true protein sequences unrelated to the core, thus providing an empirical distribution of scores (see [3] for details).

We have developed a test database that allows us to empirically determine which threshold for the normalized filter scores must be used to obtain a given error rate (say 1% or 5%). Note that the results of the 1D and 3D filters are considered simultaneously. For this we use an approach similar to the technique known as the support vector machine [4], where points in an M-dimensional space are separated into different classes by hyperplanes. Here M=2 and the hyperplanes are just lines. Using the test database we have determined the position of the lines for obtaining a given error rate.
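The two-filter decision can be pictured as a line in the plane of normalized 1D and 3D scores. The Python sketch below is a simplified stand-in for the procedure described above: it fixes the direction of the line through hypothetical weights `w` and only calibrates its offset so that at most a chosen fraction of unrelated pairs is accepted. It is not the authors' support-vector-like fitting.

```python
def calibrate_offset(unrelated, w, error_rate):
    """Choose the offset of the line w[0]*s1 + w[1]*s3 + offset = 0.

    unrelated: list of (score_1d, score_3d) for unrelated sequence/core pairs.
    At most error_rate of these pairs end up on the accepting side.
    """
    projections = sorted(w[0] * s1 + w[1] * s3 for s1, s3 in unrelated)
    allowed = min(int(error_rate * len(projections)), len(projections) - 1)
    # Largest projection that must still be rejected.
    cutoff = projections[len(projections) - allowed - 1]
    return -cutoff


def is_related(score_1d, score_3d, w, offset):
    """Accept a pair when it lies strictly on the positive side of the line."""
    return w[0] * score_1d + w[1] * score_3d + offset > 0
```

Once the offset is calibrated on the test database, a new query/core pair is accepted only if its two filter scores fall on the positive side of the line.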

Using the test database we showed that for an error rate of 1% we are able to detect 60% of all related pairs. Under the same conditions, PSI-BLAST only detects 30% of the related pairs.

1. Fano R.M., Transmission of information: a statistical theory of communication. MIT Press, Cambridge, 1961
2. Lathrop R.H. and Smith T.F., (1996) Global optimum protein threading with gapped alignment and empirical pair score functions. J. Mol. Biol. 255, 641-665.
3. Marin A., et al (2002), FROST: a filter based fold recognition method, Proteins, 49, 493-509.
4. Vapnik V., Statistical learning theory, Wiley, New York, 1998

FUGUE2 (P0014) - 330 predictions: 330 3D
FUGUE3 (P0226) - 330 predictions: 330 3D

Homology Recognition Using Environment-Specific Substitution Scores Enriched with Homologous Sequence Information

K. Mizuguchi, T.L. Blundell, H.S. Gweon, J. Shi* and L.A. Stebbings Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1GA, UK, [email protected]

Our sequence-structure homology recognition program FUGUE (http://www-cryst.bioc.cam.ac.uk/fugue/)[1] has been ranked among the top fold recognition servers in CAFASP2 and LiveBench exercises. Unlike most other fold recognition servers, it utilizes environment-specific substitution tables and structure-dependent gap penalties, where scores for amino acid matching and insertions/deletions are evaluated depending on the local environment of each amino acid residue in a known structure. Key features defining local environments (such as secondary structure, solvent accessibility and hydrogen-bonds) have been examined and various weighting parameters optimized using extensive benchmark sets [1]. The program has been successfully used to identify novel homologies [2-3].

Since CASP4, FUGUE has been updated substantially and we now call the new version FUGUE2. While the original version constructs a position-specific scoring table (profile) for each family in the HOMSTRAD database (http://www-cryst.bioc.cam.ac.uk/homstrad)[4] using the environment-specific substitution tables, the new version enriches it by adding information derived from the homologous sequences. This is done by taking each sequence from the HOMSTRAD structural alignment, running PSI-BLAST [5] against the NCBI nr database and combining all the sequence alignments with the original structural alignment. The new profile is calculated by assuming that all the homologues adopt the same environments as the known structures. Our new benchmarking suggests that the enriched profiles can indeed recognize 10% more homologues than the original profiles (unpublished). These improvements are particularly pronounced for protein families that have only one representative known structure. These single-member families account for more than half the total HOMSTRAD families.

One important issue is whether multiply-aligned structures with divergent sequences (as in some HOMSTRAD families) always contain more information and thus improve fold/homology recognition performance, or whether they can instead increase the noise, so that profiles derived from individual single structures are preferable. To test this, we set up two separate servers. FUGUE2 searches against a library of profiles, which are derived from the HOMSTRAD families as described above. FUGUE3 uses an enlarged library, which includes, in addition to the original HOMSTRAD profiles, all representative single structures in the PDB, as defined in the culled PDB [6]. The search programs themselves are identical in FUGUE2 and FUGUE3. Although we cannot draw firm conclusions at this stage, it appears that in some cases, the single-structure profiles in FUGUE3 produced significant Z-scores, whereas the multiple-structure profiles in FUGUE2 failed to recognize the target. Thus, using a highly redundant structural library such as that in FUGUE3 may further improve the performance of FUGUE.

1. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243-257.
2. Shirai H. et al. (2001) A novel superfamily of enzymes that catalyze the modification of guanidino groups. Trends Biochem. Sci. 26, 465-468.
3. Witty M. et al. (2002) Structure of the periplasmic domain of Pseudomonas aeruginosa TolA: evidence for an evolutionary relationship with the TonB transporter protein. EMBO J. 21, 4207-4218.
4. Mizuguchi K. et al. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Sci. 7, 2469-2471.
5. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
6. Wang G. and Dunbrack R.L. Jr. (2002) PISCES: a protein sequence culling server. Bioinformatics, submitted.

* Present Address: Celltech R&D Inc., 1631 220th Street SE, Bothell, WA 98021, USA

Garnier-Kloczkowski (P0396) - 91 predictions: 91 SS

Combining the GOR V Algorithm With Evolutionary Information for the Protein Secondary Structure Prediction from the Amino Acid Sequence

A. Kloczkowski1, R.L. Jernigan1 and J. Garnier2 1 Laboratory of Experimental and Computational Biology, NCI, NIH 2 Analytical Biostatistics Section, Laboratory of Structural Biology, CIT, NIH [email protected]

The GOR algorithm has been modified by employing the evolutionary information provided by multiple sequence alignments, adding triplet statistics and optimizing various parameters (GOR V, 1, 2). The PSI-BLAST multiple sequence alignments were used after 5 iterations with an E value of 5 × 10⁻⁴.

The methodological procedure was based on the calculation of the matrices of the probabilities of the secondary structure states (H, E, C), $P_H(i, j)$, $P_E(i, j)$ and $P_C(i, j)$, for each j-th residue in the i-th alignment (with the inclusion of alignment gaps). The gaps were skipped by the GOR program in the calculation of the probabilities of the secondary structure conformations, but the information about them was retained for averaging purposes. The averages $\langle P_H(j) \rangle$, $\langle P_E(j) \rangle$ and $\langle P_C(j) \rangle$ at the j-th position in the alignment were then calculated by summing $P_H(i, j)$ (and similarly $P_E(i, j)$ and $P_C(i, j)$) over i and dividing this sum by the number of alignments, excluding (in the alignment count) alignments with gaps at the j-th position. In the alignment matrix A, columns containing gaps in the query sequence were skipped, contracting the size of the matrix A to the original length of the query sequence. The prediction of the secondary structure conformation for the j-th residue was based on the set of three probabilities $\{\langle P_H(j) \rangle, \langle P_E(j) \rangle, \langle P_C(j) \rangle\}$. The secondary structure of the j-th residue was assigned to the conformation with the largest probability value, $\max\{\langle P_H(j) \rangle, \langle P_E(j) \rangle, \langle P_C(j) \rangle\}$, modified by introducing decision constant thresholds. The coil state was predicted only if the calculated probability of the coil conformation was greater than the probability of the other states (H, E) by the imposed thresholds (0.15 for E and 0.075 for H).
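The averaging and threshold step described above can be illustrated with a minimal Python sketch. It is not the GOR V code: the function names, the input layout (one list of (P_H, P_E, P_C) tuples per aligned sequence, with None marking a gap) and the rule used when coil is rejected (take the larger of H and E) are assumptions made for illustration. Only the two threshold values come from the text.

```python
from typing import List, Optional, Tuple

# Per-residue probabilities (P_H, P_E, P_C); None marks an alignment gap.
Probs = Optional[Tuple[float, float, float]]

E_THRESHOLD = 0.15   # margin by which coil must exceed strand (from the text)
H_THRESHOLD = 0.075  # margin by which coil must exceed helix (from the text)


def average_probabilities(rows: List[List[Probs]]) -> List[Tuple[float, ...]]:
    """Average (P_H, P_E, P_C) at each position, not counting gapped rows.

    Assumes rows[0] is the query and has no gaps, so every column has
    at least one contributing sequence.
    """
    averages = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows if row[j] is not None]
        n = len(column)  # alignments with a gap at position j are excluded
        averages.append(tuple(sum(p[k] for p in column) / n for k in range(3)))
    return averages


def assign_state(p_h: float, p_e: float, p_c: float) -> str:
    """Predict coil only if it beats both H and E by the thresholds.

    Falling back to the larger of H and E is an assumption of this sketch.
    """
    if p_c > p_h + H_THRESHOLD and p_c > p_e + E_THRESHOLD:
        return "C"
    return "H" if p_h >= p_e else "E"


def predict(rows: List[List[Probs]]) -> str:
    """Return the three-state prediction string for the query."""
    return "".join(assign_state(*avg) for avg in average_probabilities(rows))
```

For example, `predict([[(0.6, 0.1, 0.3), (0.2, 0.3, 0.5)], [(0.4, 0.2, 0.4), None]])` averages two sequences at the first position and only one at the second, where the other sequence has a gap.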

All calculations of the parameters for the observed states were performed with the translation of the eight-state DSSP assignments into the three secondary structure states H, E and C as follows: DSSP states H and E were translated to H and E in the three-state code, and all other letters of the DSSP code were translated to coil (C). Additionally, helices shorter than 5 residues (HHHH or less) and sheets shorter than 3 residues (EE or E) were considered as coils. Similarly, the GOR algorithm has a built-in correction scheme, which removes secondary structure segments that are too short (helices shorter than 4 residues, and sheets shorter than 3 residues), treating them as the most likely prediction errors.
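The eight-to-three state reduction for the observed states can be written compactly. The following Python sketch is an illustration only; the function name and constants are hypothetical, while the mapping and the minimum lengths (5 for helices, 3 for strands) follow the description above.

```python
from itertools import groupby

MIN_HELIX = 5   # helices shorter than 5 residues are treated as coil
MIN_STRAND = 3  # strands shorter than 3 residues are treated as coil


def reduce_dssp(dssp: str) -> str:
    """Map an eight-state DSSP string to H/E/C and turn short segments into coil."""
    # Keep H and E; every other DSSP symbol (G, I, B, T, S, blank, ...) becomes C.
    three_state = "".join(s if s in ("H", "E") else "C" for s in dssp)
    out = []
    for state, run in groupby(three_state):
        n = len(list(run))
        if (state == "H" and n < MIN_HELIX) or (state == "E" and n < MIN_STRAND):
            state = "C"  # segment too short to count as helix or strand
        out.append(state * n)
    return "".join(out)
```

With this sketch, `reduce_dssp("HHHHTEEESEE")` gives `"CCCCCEEECCC"`: the four-residue helix and the two-residue strand are both converted to coil, while the three-residue strand is kept.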

When the PSI-BLAST alignment detected a sequence from the PDB with an E value smaller than 5 × 10⁻⁴ to the query sequence, two models of prediction were given. Model 1 used the PSI-BLAST alignment to transfer the observed conformation of the PDB sequence to the predicted conformation of the aligned residues of the query sequence; if more than one PDB sequence was below this E value, no more than three of them were taken. Most of the three observed conformations were identical, subject to alignment errors or ends of secondary structures. Only the DSSP secondary structures displayed by the PDB site were used, although they are somewhat at variance with the crystallographer assignments and with our own DSSP assignments used for the calculation of the GOR parameters. Model 2 was the prediction made as described in the paragraphs above, without taking into account the observed conformation of the PDB sequences. The order was chosen expecting that model 1 should be the most accurate even if PSI-BLAST alignments might differ from structural alignments.

1. Kloczkowski A. et al. (2002) Protein Secondary Structure Prediction Based on the GOR Algorithm Incorporating Multiple Sequence Alignment Information. Polymer 43, 441-449.
2. Kloczkowski A. et al. (2002) Combining the GOR V Algorithm With Evolutionary Information for Protein Secondary Structure Prediction from Amino Acid Sequence. Proteins: Structure, Function, and Genetics 49, 154-166.

GEM (P0359) - 76 predictions: 76 3D

Comparative Modeling in CASP5

H. Scheib1,2, K. Koretke1, A. Diemand1,2, M. Word1, C. Combet1,3 and E. Migliavacca1,2 1 GlaxoSmithKline, 2 Swiss Institute of Bioinformatics, 3 Institut de Biologie et Chimie des Protéines, University of Lyon [email protected]

Comparative modeling in CASP5 was divided into five steps: (1) template structure selection, (2) alignment, (3) model building, (4) loop building, and (5) model structure refinement. From the possible 65 targets, our group submitted model structures for 40 targets in comparative modeling.

Template structure selection. Template structures were obtained mainly from SENSER[1] and Match2, and to a lesser extent from 3D-PSSM[2]. In cases where multiple templates were available for a target, a multistep process was used: (1) the set of templates was superimposed using either DeepView (formerly SwissPDBViewer)[3] or STAMP[4], producing a structure-based multiple sequence alignment; (2) a multiple sequence alignment of the target protein family was created; (3) the two alignments were combined using both sequence similarity and mapping of the target's predicted secondary structure elements to the corresponding ones in the templates. The final template was selected based on the best "fit" of the target sequence to a putative template.

Alignment. The target sequence was manually aligned to a single template structure or a set of superimposed template structures using DeepView or MACAW[5]. The alignment was guided by conserved residues as identified by multiple sequence alignments of related proteins, secondary structure prediction (PHD[6] and results from CAFASP3 server[7]), InterPro signature sequences[8] as well as hydrophobicity patterns. We attempted to move insertions and deletions into loop regions to conserve the secondary structure elements, i.e. in the core of the protein. However, in rare cases gaps were placed violating these conditions in order to retain an overall homologous and compact shape of the resulting model.

Model building. Models were built using either DeepView or MODELLER[9]. The advantage of using DeepView is that the user keeps full control over the modeling process, because user intervention is allowed at any step. In DeepView, coordinates were assigned to the target sequence by applying the “Build Preliminary Model” option. In MODELLER, the default parameters were used.

Loop building. When generating models using DeepView, most loops were built manually, since the automatic loop building facility implemented in DeepView has its limitations. A loop database scan was carried out to identify possible solutions for a loop. In cases where this approach failed, loops were built de novo. Anchor points were chosen manually taking into account the local environment of the putative loop and neighboring biologically relevant residues. The resulting loops were ordered according to their force field energy as implemented in DeepView. The most suitable loop was usually selected among the top five proposed solutions with the lowest force field energy. This procedure was applied to loops in models generated by MODELLER wherever gaps of more than 2 residues occurred.

Model structure refinement. The resulting structures were energy minimized by applying 100 to 200 steps of either Steepest Descent or Conjugate Gradient to the model using the GROMOS96 force field[10] as implemented in DeepView. Unfavorable side chain conformations were identified using “Amino Acids Making Clashes”, “Amino Acids Making Clashes With Backbone” and a force field energy report. Steric problems were fixed by changing the side chain rotamer of one or more residues in the affected area. After fixing side chains, another 100 steps of either Steepest Descent or Conjugate Gradient minimization were carried out. In all cases, clashes could be removed by this procedure.

1. Koretke K.K. et al. (2002) Fold recognition without folds. Protein Sci. 11, 1575-1579.
2. Kelley L.A. et al. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499-520.
3. Guex N. and Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18, 2714-2723.
4. Russell R.B. and Barton G.J. (1992) Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 14, 309-323.
5. Schuler G.D., Altschul S.F., Lipman D.J. (1991) A workbench for multiple alignment construction and analysis. Proteins 9, 180-190.
6. Rost B. (1996) PHD: predicting one-dimensional protein structure by profile based neural networks. Methods Enzym. 266, 525-539.
7. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/index.html
8. Apweiler R. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37-40.
9. Sali A. and Blundell T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.
10. van Gunsteren et al. (1996) Biomolecular Simulation, the GROMOS96 Manual and User Guide, Vdf Hochschulverlag AG an der ETH Zuerich, Zuerich, Switzerland, 1-1042.

GEM (P0359) - 76 predictions: 76 3D

Applying Sequence Homology and Secondary Structure Prediction in Fold Recognition

K. Koretke1, H. Scheib1 and A. Lupas2 1 - GlaxoSmithKline, 2 - Max-Planck-Institute for Developmental Biology [email protected]

Summary. All CASP5 targets were submitted to the sensitive search routine program SENSER[1] to gather information about each sequence's homologous space. Secondary structure and other fold recognition predictions were obtained from CAFASP2 server[2]. Additional sequence searches were done using regular expression patterns and HMMs. If a protein of known structure appeared to match the properties of the target, alignments were generated using a combination of MACAW[3], PSI-Blast[4] and HMMer[5], with final adjustments made by mapping conserved hydrophobicity patterns onto secondary structure elements. A total of 18 targets were predicted as fold recognition; 4 had templates identified using SENSER, 4 were predicted based on distant homology detected through other methods and 10 were predicted through secondary structure patterns.

Details. SENSER is a multi-step program that uses PSI-Blast to search sequence space and identify distantly related sequences for a given query sequence. In the first step SENSER performs a PSI-Blast search with the target sequence for a maximum of 6 iterations. Proteins identified in the search are divided into a significant sequence space, containing those sequences with an E value lower than 10⁻², and a 'trailing end' of sequences between 10⁻² and 10. Because some of the proteins detected may contain unrelated domains, all proteins are trimmed to the actual region detected in the PSI-Blast run.
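The split of PSI-Blast hits into a significant sequence space and a trailing end can be sketched as follows. This is not SENSER code: the function name and the dictionary input (sequence identifier mapped to E value) are hypothetical, and treating both ends of the trailing-end interval as inclusive is an assumption. The two cutoffs are the ones quoted in the text.

```python
from typing import Dict, List, Tuple

SIGNIFICANT_E = 1e-2  # hits below this E value form the significant sequence space
TRAILING_E = 10.0     # upper E value limit of the trailing end


def partition_hits(hits: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Split hits (id -> E value) into (significant, trailing_end) id lists."""
    significant = [h for h, e in hits.items() if e < SIGNIFICANT_E]
    # Inclusive bounds are an assumption; the text only says "between".
    trailing = [h for h, e in hits.items() if SIGNIFICANT_E <= e <= TRAILING_E]
    return significant, trailing
```

Hits with E values above 10 fall into neither list and are simply discarded in this sketch.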

In the second step, transitive searches are used to expand the significant sequence space. Only proteins within the significant sequence space that have less than 30% identity to the target sequence are used as starting points for further PSI-Blast searches, in order to avoid redundant searches, i.e. those that produce similar profiles and sequence spaces. This value was chosen as it is a frequently quoted threshold for the 'twilight zone', below which sequences cannot be confidently said to be homologous.

In the third step trailing-end sequences are tested for their ability to back-validate, i.e. detect any sequence of the significant sequence space of the target in PSI-Blast. Because several PSI-Blast searches were performed to establish the significant sequence space, trailing-end sequences are pooled and ranked first by number of occurrences and second by E-value, before being tested. If a trailing-end sequence back-validates, its significant sequence space is added to that of the target. The process is then repeated until no further sequences are detected.
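The pooling and ranking of trailing-end sequences before back-validation might look like the Python sketch below. The function name and input layout (one dictionary of identifier to E value per PSI-Blast search) are hypothetical, and using the lowest E value seen for a sequence as the secondary sort key is an assumption, since the text does not say which E value is used when a sequence occurs in several searches.

```python
from collections import defaultdict
from typing import Dict, List


def rank_trailing_end(searches: List[Dict[str, float]]) -> List[str]:
    """Pool trailing-end hits and rank by occurrence count, then by E value."""
    counts: Dict[str, int] = defaultdict(int)
    best_e: Dict[str, float] = {}
    for hits in searches:
        for seq_id, e_value in hits.items():
            counts[seq_id] += 1
            # Assumption: keep the lowest E value observed across searches.
            best_e[seq_id] = min(e_value, best_e.get(seq_id, float("inf")))
    # Most frequent first; ties broken by the smaller E value.
    return sorted(counts, key=lambda s: (-counts[s], best_e[s]))
```

The ranked list would then be tested in order, each candidate being used as a PSI-Blast query to see whether it detects any member of the target's significant sequence space.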

Domains were automatically predicted for each target and identified using alignment information from the final iteration of the target sequence's PSI-Blast run. A domain was identified if a significant sequence aligned to less than 50% of the target sequence. The boundaries of a domain were determined by the maximum overlap of a target to all of the significant sequences that overlapped the same region and only aligned to less than 50% of the target sequence. If a domain was predicted, a new SENSER run was initiated with the target sequence trimmed to the predicted domain region.
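One possible reading of this domain rule is sketched below in Python. The function name and the span representation are hypothetical, and interpreting "maximum overlap" as the merged extent of all overlapping partial alignments is an assumption; the original program may define the boundaries differently.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # 1-based inclusive (start, end) on the target sequence


def predict_domains(target_len: int, spans: List[Span],
                    max_frac: float = 0.5) -> List[Span]:
    """Merge overlapping alignments that each cover less than max_frac of the target."""
    # Keep only significant hits aligned to less than 50% of the target.
    partial = sorted(s for s in spans if (s[1] - s[0] + 1) / target_len < max_frac)
    domains: List[Span] = []
    for start, end in partial:
        if domains and start <= domains[-1][1]:
            # Overlaps the current domain: extend its boundary (assumed rule).
            domains[-1] = (domains[-1][0], max(domains[-1][1], end))
        else:
            domains.append((start, end))
    return domains
```

For a 300-residue target with aligned regions (1, 120), (10, 130), (1, 290) and (200, 290), the near full-length hit is ignored and two candidate domains, (1, 130) and (200, 290), are returned.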

If SENSER identified a potential template structure, its match with the target was evaluated using predicted secondary structure, the occurrence of sequence patterns, and biochemical information. The alignment was generated using MACAW, HMMer, PSI-Blast or a combination of these methods to produce the alignment that seemed most plausible to us based on conserved residues, hydrophobicity, and secondary structure.

If SENSER did not identify a potential template structure, regular expression patterns, predicted secondary structure, other fold recognition predictions and biochemical information were used to search for possible templates. In addition, in cases where the target was only a fragment of a larger protein, the entire protein was used in sequence searches. If a template was judged to match the properties of the target, an alignment was produced using MACAW, HMMer, Clustal[6], or a combination of these methods, to produce the alignment that seemed most plausible to us based on conserved residues, hydrophobicity, and secondary structure.

1. Koretke K.K., Russell R.B., Lupas A.N. (2002) Fold recognition without folds. Protein Sci. 11 (6), 1575-1579.
2. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/index.html
3. Schuler G.D., Altschul S.F., Lipman D.J. (1991) A workbench for multiple alignment construction and analysis. Proteins 9, 180-190.
4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
5. Eddy S.R. (1998) Profile hidden Markov models. Bioinformatics 14, 755-763.
6. Thompson J.D., Higgins D.G., and Gibson T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680.

GeneSilico (P0517) - 195 predictions: 86 3D, 64 SS, 45 RR

From Fold-Recognition Analysis via the Genesilico Meta-Server, to Modeling and Refinement by Several Predictors, to Uniform Evaluation and Generation of Hybrid Models

M. Kurowski, M. Feder, J. Kosinski, I. Cymerman, J. Sasin, J.M. Bujnicki International Institute of Molecular and Cell Biology (IIMCB) in Warsaw. Trojdena 4, 01-109 Warsaw, Poland [email protected]

Assessments of protein structure prediction (CASP, CAFASP, LiveBench) have demonstrated that fold recognition (FR) methods can identify remote similarities when standard sequence search methods fail, but the reported target-template alignments are often only partially correct, leading to models with misfolded parts. The use of additional information, such as secondary structure (SS) and/or localization of ligand-binding residues, can help to improve the target-template alignments. Moreover, models constructed from multiple parents are often found to be more accurate than models constructed from single parents only. The final prediction accuracy can therefore be improved if the best fragments obtained from various FR alignments can be judiciously combined to generate a consensus model.

Based on our experience with the meta-server approach to protein structure prediction, and both fully automated and expert-refined fold-recognition analysis in CASP4 and CAFASP2, we developed a novel fold-recognition gateway, which combines the useful features of other meta-servers available previously with the greater flexibility of the input (the beta version of the new tool is available at http://genesilico.pl/meta).

We attempted to identify as many homologs of the target sequence as possible. For this purpose, we created a database of putative translation products (length > 20 aa) of all unfinished genomes whose sequences were publicly available. This allowed a roughly two-fold increase in the size of the non-redundant database. For a few targets, it increased the size of the multiple sequence alignment from ~3 sequences to > 10 sequences and allowed much better delineation of conserved and variable regions. The alignments were used to divide the query sequence into domain-size fragments. Fold recognition analysis was carried out for the single sequences, for the individual domains, as well as for the alignment sections corresponding to the individual domains. When alignments were submitted for the fold-recognition analysis, two options were used: i) columns with >30% of gaps were deleted (i.e. only the core regions were analyzed) and ii) gaps were treated as unknown characters (X) (i.e. the variable regions of the target sequence were “extended” to the maximal size, using the longest insertions present in homologous sequences as the reference).
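The two ways of preparing an alignment for submission can be illustrated with a short Python sketch. The function names are hypothetical, and the representation of the alignment as equal-length strings with "-" as the gap character is an assumption; only the 30% gap cutoff and the use of X for gaps are taken from the text.

```python
from typing import List


def drop_gappy_columns(alignment: List[str], max_gap_frac: float = 0.30) -> List[str]:
    """Option i: delete columns in which more than max_gap_frac of rows have a gap."""
    n_rows = len(alignment)
    keep = [
        j for j in range(len(alignment[0]))
        if sum(row[j] == "-" for row in alignment) / n_rows <= max_gap_frac
    ]
    return ["".join(row[j] for j in keep) for row in alignment]


def gaps_to_unknown(alignment: List[str]) -> List[str]:
    """Option ii: keep every column and replace gaps with the unknown residue X."""
    return [row.replace("-", "X") for row in alignment]
```

For the toy alignment `["MK-LV", "MKALV", "M--LV", "MK-L-"]`, option i removes only the third column (75% gaps), while option ii keeps the full width and pads each sequence with X.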

Results of fold-recognition analysis carried out via our meta-server for all the variants of the target sequence, as well as all FR and ab initio predictions obtained from the CAFASP website, were collected and presented to a team of six human predictors. Their varying experience in protein sequence analysis and modeling notwithstanding, all members of the GeneSilico team attempted to build and refine the models independently. They used different software (Modeller, Swiss-model, MOE, WhatIf, and ICM-Pro) and applied different refinement protocols. The purpose of this exercise was to sample the “model space” in the vicinity of the solution suggested by the consensus between the fold-recognition methods. This sampling was not meant to be too extensive and was carried out on the assumption that the knowledge-based refinement carried out by a human predictor in the case of crude FR models is superior to the refinement carried out by fully automated procedures. All models were evaluated using independent criteria (Verify3D and ProsaII) and compared with each other. Following superposition of all modeled structures, a consensus model was built from the best-scoring fragments of all models. The consensus model was re-evaluated and further refined; if generation of a “hybrid” model resulted in deterioration of the score due to apparent incompatibility of fragments from different preliminary models, its parts were replaced with parts from other models. Following the manual correction of selected sidechains and energy minimization with the GROMOS forcefield, the refined model was submitted in the TS format. The same procedure was applied to the targets in the homology modeling, fold recognition and “novel folds” categories.

The final model and all the well-scoring parts of the intermediate models were used to calculate the average residue-residue separation distances (submitted as the RR category). The secondary structure of the final model was inferred according to DSSP and combined with the independent alignment/sequence-based prediction to generate the output in the SS format. For targets with no reliable models of the tertiary structure, the independent SS prediction was based solely on alignment-based predictions.
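Averaging residue-residue distances over a set of models is straightforward once each residue is reduced to one point. The Python sketch below is illustrative only: the function name is hypothetical, and it assumes that every model covers the same residues in the same order and that one representative atom per residue (for example C-alpha or C-beta) has already been extracted.

```python
import math
from typing import Dict, List, Tuple

Coord = Tuple[float, float, float]


def mean_pair_distances(models: List[List[Coord]]) -> Dict[Tuple[int, int], float]:
    """Average the distance between residues i < j over all models."""
    n_res = len(models[0])
    result: Dict[Tuple[int, int], float] = {}
    for i in range(n_res):
        for j in range(i + 1, n_res):
            total = sum(math.dist(m[i], m[j]) for m in models)
            result[(i, j)] = total / len(models)
    return result
```

Handling fragments that are present in only some models would require counting, for each residue pair, only the models that contain both residues.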

GERLOFF (P0240) - 9 predictions: 9 3D

Incorporation of Constraints Derived from Active/Functional Site Predictions in Protein Tertiary Structure Assembly

R. Schmid, D. C. Soares, Z. A. M. Hussein, B. J. Mitchell, R. S. Hamilton and D. L. Gerloff Biocomputing Research Unit & Structural Biochemistry Group, Institute of Cell and Molecular Biology, University of Edinburgh, UK [email protected]

We have submitted tertiary structure predictions for five CASP5 target proteins in order to investigate the potential of knowledge and/or predictions about functional sites in these proteins for being used in combination with established structure prediction methods. The degrees of difficulty assigned to the prediction targets, and the categories in which our predictions are considered, vary. Similarly, the way in which functional site information is used, and its impact on the final model varies slightly from target to target.

Our primary postulates are that: (a), the interchange between structure and functional prediction (or knowledge) leads to improvement at both ends; (b), formulation/adaptation of systematic fold-specific heuristics and function-specific heuristics is possible, at least for certain folds and functions; and (c), prediction of structure/function can go beyond trying to find re-occurrences of known cases.

While we found little opportunity within the set of CASP5 targets to demonstrate and/or test postulate (b), we attempted to use function prediction/knowledge in all predictions we submitted. Primarily, we used predicted key residues in proteins presumed to function as enzymes to “anchor” threading alignments (in T0130, T0173, and to an extent in T0136 and T0132) so that their arrangement in the model would allow catalysis. In T0129, we could not find a suitable fold template and used the presumed proximity of presumed functional residues to guide the assembly of helices ab initio.

Besides our emphasis on formulating distance and/or geometrical constraints for our models based on functional site prediction from multiple sequence alignments, another unifying link between our submissions is the consideration of predicted Surface/Interior/ActiveSite/Parse positions (SIAP-predictions) according to the approach by Benner & Gerloff [1], in refining our predictions/threading alignments. Prediction of key residues from multiple sequence alignments (typically generated using ClustalX on sequences retrieved by standard BLAST/PSI-BLAST searches on the nr database, with subsequent manual editing (!)) was generally based on complete, or high, conservation of functional type amino acids, sometimes taking into consideration patterns of conservation similar to those described in [1]. The choice of template structures used in our predictions was often influenced by the publicly available CAFASP2 predictions by automated servers, albeit not exclusively. Here again, the compatibility between the folds and biologically sensible arrangements of predicted key residues was our primary criterion in non-obvious cases. Secondary structure predictions by CAFASP2-servers were used by default but often refined according to [1] and in the course of modeling. Refer to individual prediction headers for further details of interest, particularly the speculative functional roles of individual predicted key residues in predictions where this was at all possible. These blind predictions of functional aspects are influenced by the structure predictions as much as vice versa. While the “manual component” in our CASP-predictions is obviously significant, our goal is to identify systematic aspects in the way biochemists’ knowledge influences (and quite often improves) tertiary structure predictions, with the aim of providing “refinement modules” for existing automated methods.
Besides functional site assembly, consideration of the usually observed pseudo-symmetry in protein quaternary structures is under-explored in our field, and we believe that the prediction of (non-transient) quaternary structure besides tertiary structure would be a highly relevant addition to future CASPs. Interesting quaternary structure cases in the targets we considered were T0132 and T0136. Again, further developing efforts in this direction could benefit both tertiary and quaternary structure prediction.

1. Benner S.A., Cannarozzi G.M., Gerloff D., Turcotte M. and Chelvanayagam G. (1997) Bona fide predictions of protein secondary structure using transparent analyses of multiple sequence alignments. Chem. Reviews 97, 2725-2843.

Ginalski (P0453) - 71 predictions: 71 3D

Modeling of CASP5 Target Proteins with 3D-CAM

K. Ginalski1,2 1 - Interdisciplinary Centre for Mathematical and Computational Modelling, Warsaw University, Warsaw, Poland, 2 - BioInfoBank Institute, Poznań, Poland [email protected]

For the fifth round of Critical Assessment of Techniques for Protein Structure Prediction (CASP5), 67 target proteins were modeled using the 3D-Consensus Alignment Method (3D-CAM). The issue of sequence-to-structure alignment of target sequences with their respective parent structures was the main emphasis, and as shown in previous rounds of CASP, this part of the modeling procedure is the major source of errors. The critical steps in modeling: selection of template(s) and generation of sequence-to-structure alignment, were based on the results of secondary structure prediction and tertiary fold recognition carried out using the Meta Server [1].

Initially, related proteins with known structures were identified from the consensus of the Meta Server results. For difficult targets, template (fold) identification was based on the results of the 3D-Jury method (Rychlewski L., unpublished). Structural determinants of the fold were then analyzed: all the structures representing a given fold, and corresponding structural alignment extracted from the FSSP database [2], were inspected for both conservation and variability of the structural elements. Conservation of specific residues and contacts responsible for maintaining tertiary structure, and critical for substrate binding and/or catalysis, were also established. Additionally, homologous sequences that matched the targets were collected with PSI-BLAST searches [3] performed against the non-redundant protein sequence database and unfinished genomes until profile convergence. The CLUSTAL W program [4] was used to generate multiple sequence alignments for sets of sequences containing target, and other closely-related proteins, to identify conserved residues within the family.

All alignments produced by different servers interacting with the Meta Server were inspected for both variability and violation of structural integrity. Initial alignment was obtained by taking, in most cases, the common alignment for each region (mainly for each secondary structure element), taking into account the structural alignment of templates where possible, within the context of the structural and sequential constraints identified above. In some cases close homologues were also submitted to the Meta Server as the query sequences. For regions that displayed low stability (i.e. highly dependent on the server), possible alignment variants were derived manually, guided mainly by secondary structure predictions.

All plausible alternative sequence-to-structure alignments were tested by building 3D molecular models for the target sequence with the Homology module of InsightII (Accelrys Inc., San Diego, CA). Backbone conformation was taken from the template structure, and only non-conserved side chains were substituted. Modeling of loops that contained insertion and deletion regions was skipped in this procedure. Models were then subjected to detailed evaluation, mainly by visual inspection of structural consistency and using Verify3D [5] and ProsaII [6] energy profiles. Such a 3D evaluation procedure enabled selection of final sequence-to-structure alignments.

Final models of target proteins were built using the MODELLER program [7]. Where possible, more than one template protein was used, after superimposition of their molecular structures. The overall quality of each modeled structure was checked in detail with the WHAT_CHECK program [8]. No energy minimization procedures were employed.

1. Bujnicki J.M. et al. (2001) Structure prediction meta server. Bioinformatics 17 (8), 750-751.
2. Holm L. et al. (1996) Mapping the protein universe. Science 273 (5275), 595-603.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
4. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22 (22), 4673-4680.
5. Luthy R. et al. (1992) Assessment of protein models with three-dimensional profiles. Nature 356 (6364), 83-85.
6. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17 (4), 355-362.
7. Sali A. et al. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234 (3), 779-815.
8. Hooft R.W. et al. (1996) Errors in protein structures. Nature 381 (6580), 272.

harrison (P0188) - 43 predictions: 43 3D

Estimated Distance Matrices and Self-Assembling Models

John Petock1, Ping Liu1, Irene T. Weber1, and Robert W. Harrison2,1 1- Department of Biology, 2- Department of Computer Science, Georgia State University [email protected]

Three potential improvements in methodology were explored in our submissions to CASP5. These were applied to both ab initio and similarity modeling. The problem of modeling inserted and deleted regions in homology modeling is related to the problem of modeling a structure from sequence alone and therefore we expect that techniques developed for one problem will be partially transferable to the other. The improvements were: 1) using Floyd’s algorithm to generate a full rank distance matrix for homology modeling, 2) an improved soft hydrophobicity potential with structural fragments and an improved version of self-assembling neural network model for ab initio modeling, and 3) a non-stationary multiple sequence alignment algorithm for initial alignments.

The fundamental problem in insertion/deletion modeling is to generate coordinates from potential energy information alone. Unfortunately, molecular mechanics potentials will not generate a unique minimum structure by energy minimization, although sometimes solvated molecular dynamics will converge to a reasonable model. Therefore it is a common practice[1] to augment the potentials by searching the structural database and finding structural fragments that fit the known parts of the structure. These are then used to generate an initial guess for the missing parts of the structure. The basic problem with this approach is that while there are O(N^2) distances that should be estimated for a well-conditioned full rank model building problem, this approach only produces O(N) distance estimates. Floyd’s algorithm is a dynamic programming algorithm that uses repeated iterations of the triangle inequality to fill in missing distances. Floyd’s algorithm produces a strict upper bound for each distance, but is not a reliable predictor of lower bounds. Distance terms were added to the molecular mechanics potential from three sources: 1) distances in the initial model were added with tight error bounds, 2) distances were added from searching the structure database with error bounds derived from the deviation to the model, and 3) upper-bound-only distance terms were added by using Floyd’s algorithm. These distances were then used as distance restraints during the initial part of the model building to generate reasonable insertion/deletion structures.
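The way Floyd’s algorithm tightens upper bounds through the triangle inequality can be illustrated with a minimal sketch. It assumes a small dictionary of known pairwise upper bounds; the function name and the example distances are hypothetical and are not taken from AMMP.

```python
import math

def floyd_upper_bounds(n, known):
    """Fill in missing distance upper bounds with the triangle inequality.

    n     -- number of atoms (illustrative)
    known -- dict mapping (i, j) index pairs to a known upper-bound distance
    """
    # Unknown distances start at infinity; self-distances are zero.
    ub = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), d in known.items():
        ub[i][j] = ub[j][i] = min(ub[i][j], d)
    # Floyd's all-pairs update: d(i, j) <= d(i, k) + d(k, j).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = ub[i][k] + ub[k][j]
                if via_k < ub[i][j]:
                    ub[i][j] = via_k
    return ub

# Hypothetical example: a four-atom chain with one extra known distance.
bounds = floyd_upper_bounds(
    4, {(0, 1): 3.8, (1, 2): 3.8, (2, 3): 3.8, (0, 2): 6.0}
)
```

The result contains only upper bounds, which matches the remark above that the method does not predict lower bounds reliably.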

Ab initio models were generated for several systems where no close homolog was detected. The basic approach is a development of our earlier self-assembling methods, where the self-assembly of proteins and other polymers is modeled by the self-assembly of a Kohonen neural network[2]. Two improvements were implemented. First, a softer hydrophobicity potential was used, and second the relaxation scheme in the Kohonen network was improved after extensive studies. The softer hydrophobicity potential meant that we could use short structural restraints (10mers) to impose local secondary structure preferences, and take the best structure as our model, rather than use a longer structural restraint and average the distances over an ensemble of models. Structural restraints were generated with a diagonal programming algorithm that maximized the similarity between overlapping segments while picking the individual segments based on local sequence similarity. The improvements in the implementation of the Kohonen network changed the initial relaxation radius from a small fixed value to a successively decreasing value that started half the size of the space for the molecule and linearly contracted to a small value. This resulted in faster and more accurate convergence in test systems. This algorithm is capable of producing left and right-handed folds with all L amino acids, so both hands of the fold were converted to all-atom models and submitted.
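The linearly contracting relaxation radius can be illustrated with a short sketch. The function name, the number of steps and the final radius are illustrative assumptions; the text states only that the radius starts at half the size of the space for the molecule and contracts linearly to a small value.

```python
def relaxation_radius(step, n_steps, space_size, r_final):
    """Relaxation radius at a given step of a linear contraction schedule.

    step       -- current iteration, from 0 to n_steps - 1
    n_steps    -- total number of iterations (assumed parameter)
    space_size -- size of the space available to the molecule
    r_final    -- small radius reached at the last step (assumed parameter)
    """
    r_start = 0.5 * space_size  # start at half the size of the space
    frac = step / (n_steps - 1) if n_steps > 1 else 1.0
    return r_start + frac * (r_final - r_start)

# Hypothetical schedule: 11 steps in a space of size 40 contracting to 1.
schedule = [relaxation_radius(s, 11, 40.0, 1.0) for s in range(11)]
```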

The alignment approach was altered in a small, but important manner. Earlier we had implemented a non-stationary alignment scheme that maximized the correlation between the distributions of amino acids at each position along two sequences. This approach worked by constructing a local set of moments of the distribution and assumed that two sequences were aligned when the distributions were similar. The algorithm was implemented as a dynamic programming algorithm and is fast enough to scan the entire pdb database. The problem with this work is that pairs of sequences can be misleading. It is possible for the closest alignment between a pair of sequences to not be the best alignment between a sequence and the structural class of the protein. Multiple sequence algorithms attempt to remedy this problem by building a discrete model for the structural class from several high similarity sequences. The CASP5 target and several other sequences were aligned by conventional approaches, which are perfectly adequate for high homology (80%+, no gaps) and this aligned set was then used to derive the target vs. starting point alignment.

The models were built using the program AMMP with the sp4 potential set[3]. For homology models the structures were constrained to the initial coordinates while building the unknown parts, and then the constraints were released. The ab initio models were built using a reduced atom potential, and then converted to an all-atom model. Distance restraints were implemented with a split harmonic potential where the potential is zero between an upper and a lower bound. Nine ab initio models (targets 129, 132, 135, 138, and 157) and 29 homology models (targets 132, 137, 139, 143, 144, 150, 151, 153, 154, 155, 156, 158, 160, 163, 167, 178, 179, 182, 183, 188, 190, 192, 193 and 194) were submitted.
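A split harmonic restraint of the kind described above can be sketched as follows. The force constant `k` and the purely quadratic form outside the bounds are assumptions made for illustration; the exact functional form used in AMMP is not given here.

```python
def split_harmonic(d, lower, upper, k=1.0):
    """Restraint energy that is zero between the lower and upper bounds.

    d     -- current distance
    lower -- lower bound (no penalty at or above this value)
    upper -- upper bound (no penalty at or below this value)
    k     -- force constant (assumed; not specified in the text)
    """
    if d < lower:
        return k * (lower - d) ** 2
    if d > upper:
        return k * (d - upper) ** 2
    return 0.0
```

For an upper-bound-only term, such as those derived from Floyd’s algorithm, `lower` can simply be set to 0.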

1. Bates P.A., Sternberg M.J. (1999) Model building by comparison at CASP3: Using expert knowledge and computer automation. Proteins 37 (s3), 47-54.
2. Harrison R.W. (1999) A self-assembling neural network for modeling polymer structure. J. Math. Chem. 26, 125-137.
3. Bagossi P., Zahuczky G., Tozser J., Weber I.T., and Harrison R.W. (1999) Improved parameters for generating partial charges: correlation with observed dipole moments. J. Mol. Model. 5, 143-152.

Head-Gordon (P0271) - 93 predictions: 93 3D

A Physical Approach to Protein Structure Prediction

Teresa Head-Gordon1, Silvia Crivelli2, Oliver Kreylos3, Betty Eskow4, Harry Choi1, Richard Byrd4, and Robert Schnabel4 1 Department of Bioengineering, University of California, Berkeley, 2 NERSC, Lawrence Berkeley National Laboratory, 3 Department of Computer Science, University of California, Davis, 4 Department of Computer Science, University of Colorado, Boulder [email protected]

Stochastic Perturbation with Soft Constraints (SPSC) is a global optimization method that uses some information from known proteins to predict secondary structure, but uses none in the tertiary structure predictions or in generating the terms of the physics-based energy function [1-4]. Our approach is also characterized by the use of an all-atom energy function that includes a novel hydrophobic solvation function derived from experiments, which shows promising ability for energy discrimination against misfolded structures [5-7]. We competed for the first time in CASP4, where we showed that our approach is more effective on targets for which less information from known proteins is available. Our SPSC method produced the best prediction for one of the most difficult targets of the competition, a new fold protein of 240 amino acids [4].

The SPSC algorithm is a two-phased approach in which the first phase generates starting structures which are local minima containing predicted secondary structure, and the second phase improves upon the starting structures using both global and local optimizations. The most substantial differences between our CASP4 and CASP5 methods are in Phase I. In CASP4, Phase I begins with a starting structure that is the fully extended chain, and locates good minimizers through local minimizations with soft constraints. The soft constraints are derived from predictions of secondary structure obtained from Psi-Pred [8], and encourage the formation of helices and sheets through the use of penalty (reward) functions; the strength of a penalty (reward) function depends on the strength of a neural network prediction. In CASP5, all starting structures were generated with a new inverse kinematics (IK) tool developed by Kreylos and co-workers [9], that allows for interactive manipulation of local and global dihedral angle moves.

Using Psi-Pred predictions, the IK tool forms helices and (isolated) strand structures based on ideal geometric definitions of the two types of local secondary structure. This tool allowed us to create a diverse population of initial configurations regardless of the size and topology of the targets. It was used to form different starting beta-sheet topologies for proteins with predicted strands, in part based on the full list of open sheet motifs described in Ruczinski et al. [10], although we did not use their scoring function for ordering or eliminating certain sheet topologies. We also used the IK tool to generate some starting sheet topologies that were not representative of those found by Ruczinski et al. [10], but seemed possible given the mechanics of the chain. These IK structures were then locally minimized and added for Phase II optimization. We also relied less on strand prediction by trying a new strategy that uses the IK tool to form all-strand structure for all backbone dihedral angles predicted not to be helical. We have found that the global optimization itself forms its own sheet topologies; this is especially important for cases where the secondary structure predictions are poor or ambiguous, as was the case for targets such as T145.

The local minimizers resulting from Phase I contain predicted secondary structure but do not contain any significant tertiary structure. Phase II improves those minimizers through global minimizations in a sub-space of the torsion angles of amino acids predicted to be coil. A brute-force search is avoided by selectively doing a local minimization based on whether a new proposed start structure lies within a certain distance metric of another structure, and whether its energy is lower than other existing structures; if a new start structure lies within the distance metric, and is higher in energy, it is assumed to lie within an existing basin of attraction, and is rejected from further computational consideration. This global optimization approach is one of the few that provides a theoretical guarantee of finding the global optimum, and is general in the sense that subspaces of arbitrary dimension can be explored. However, in practice, the amount of work required to reach the theoretical guarantee is prohibitive for large subspaces. Because the guarantee is more attainable for low-dimensional problems, we select a subset of ~6-10 variables from the space of torsion angles predicted to be coil, and a global optimization is performed on the selected torsion angles as variables while keeping the rest temporarily fixed at their current values. The global optimization produces a number of local minimizers in the subspace of torsion angles chosen, and a number of those conformations with low energy values are selected for local minimizations in the full variable space. The new minimizers obtained from the local minimizations are merged into the current list, clustered and ordered by energy value, and the second phase starts again. The process repeats for a number of iterations, until no further progress is made according to the following stopping criteria.

At the end of each Phase II run the algorithm returns between 60 and 140 of the lowest energy configurations found thus far. These structures are clustered into groups in which members of a given cluster are within 5-15Å r.m.s.d. of each other (lower bound for small proteins, upper bound for large proteins) and the members of each cluster are energy ranked. The best configuration from each cluster is used as a starting point for the next round of Phase II. Our experience is that convergence correlates with no new distinct clusters, and an energy value that is no longer changing. In general, we submitted as the first prediction one of the three lowest energy structures of the lowest energy cluster, as the second prediction one of the three lowest energy members of the second lowest energy cluster, etc.
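The rejection test that Phase II uses to avoid a brute-force search can be sketched as below. The Euclidean distance over torsion angles, the function name and the `radius` parameter are assumptions; the actual distance metric and threshold are not specified in this abstract.

```python
import math

def should_minimize(candidate, energy, minima, radius):
    """Decide whether a proposed start structure deserves a local minimization.

    candidate -- tuple of torsion angles for the proposed start structure
    energy    -- energy of the proposed start structure
    minima    -- list of (torsion_angles, energy) for existing structures
    radius    -- distance threshold defining "close" (assumed parameter)
    """
    for angles, e_min in minima:
        close = math.dist(candidate, angles) <= radius
        # Close to an existing structure and not lower in energy:
        # assumed to lie in that structure's basin of attraction.
        if close and energy >= e_min:
            return False
    return True

# Hypothetical example with a single known minimum.
known_minima = [((0.0, 0.0), -10.0)]
```

A production version would need a metric that respects the periodicity of torsion angles, which this sketch ignores.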

The SPSC algorithm uses the AMBER 95 molecular mechanics energy function to represent the protein-protein interactions. We have added an empirical solvation free energy term that models the hydrophobic effect as a two-body interaction between all aliphatic carbon centers, and is motivated by our recent experimental and theoretical work to determine the role of hydration forces of model protein systems [5-7]. In addition to other validation studies of the potential [1-4], a validation using the set of protein misfolds in the decoys database [http://dd.stanford.edu/] is in progress. Our recent preliminary tests found the native structure as the 11th lowest in energy relative to the several hundred subtly misfolded structures of 2CRO. Further validation on the full set of decoy structures is planned, but was put aside to compete in CASP5.

The method is parallelizable as different subspaces can be searched independently. The global optimization algorithm runs on the T3E and IBM/SP at NERSC running up to 256 processors, the T3E at the Pittsburgh SC, a 40-processor IBM/SP, and a 32-node cluster of Compaq DS20's. The parallelization uses a new load balancing technique that is general for large tree search problems using a hierarchical approach [11]. We selected targets by considering the CAFASP servers' estimates of what might be a new fold or difficult fold recognition target. We submitted predictions on 20 targets with strength percentages below 60% according to CAFASP statistics, and ranging in size from 53 to 417 amino acids. Some predictions on the largest targets were not converged, although convergence is certainly possible with at least a few more weeks of lead time.

1. Crivelli S., Head-Gordon T., Byrd R. H., Eskow E., Schnabel R. (1999). A hierarchical approach for parallelization of a global optimization method for protein structure prediction. Lecture Notes in Computer Science, Euro-Par '99, Amestoy, Berger, Dayde, Duff, Fraysse, Giraud, Ruiz (eds.), 578-585.
2. Crivelli S., Philip T.M., Byrd R., Eskow E., Schnabel R., Yu R.C., Head-Gordon T. (2000). A global optimization strategy for predicting protein tertiary structure: α-helical proteins. Computers & Chemistry 24, 489-497.
3. Azmi A., Byrd R.H., Eskow E., Schnabel R., Crivelli S., Philip T.M., Head-Gordon T. (2000). Predicting protein tertiary structure using a global optimization algorithm with smoothing. Optimization in Computational Chemistry and Molecular Biology: Local and Global Approaches, Floudas and Pardalos, editors (Kluwer Academic Publishers, Netherlands), 1-18.
4. Crivelli S., Eskow E., Bader B., Lamberti V., Byrd R., Schnabel R., Head-Gordon T. (2002). A physical approach to protein structure prediction. Biophys. J. 82, 36-49.
5. Pertsemlidis A., Soper A.K., Sorenson J.M. & Head-Gordon T. (1999). Evidence for microscopic, long-range hydration forces for a hydrophobic amino acid. Proc. Natl. Acad. Sci. 96, 481-486.
6. Sorenson J.M., Hura G., Soper A.K., Pertsemlidis A. & Head-Gordon T. (1999). Determining the role of hydration forces in protein folding. Feature Article for J. Phys. Chem. B 103, 5413-5426.
7. Hura G., Sorenson J.M., Glaeser R.M. & Head-Gordon T. (1999). Solution x-ray scattering as a probe of hydration-dependent structuring of aqueous solutions. Perspectives in Drug Discovery and Design 17, 97-118.
8. McGuffin L.J., Bryson K., Jones D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405.
9. Kreylos O., Hamann B., Max N., Bethel W. & Crivelli S. (2002). Interactive Protein Manipulation, Tech. Report CSE-2002-28, UC Davis.
10. Ruczinski I., Kooperberg C., Bonneau R., Baker D. (2002). Distributions of beta sheets in proteins with application to structure prediction. Proteins: Structure, Function and Genetics 48, 85-97.
11. Crivelli S. & Head-Gordon T. (2002). A New Load Balancing Technique for the Solution of Large Tree Search Problems using a Hierarchical Approach. In preparation for IBM Research Journal.

HMMSPECTR (P0025) - 285 predictions: 285 3D

Protein Structure Prediction Using Hidden Markov Models Based System (HMMSPECTR)

I.F. Tsigelny, Y.V. Sharikov, A.P. Kornev, V. Kotlovyi, L.F. Ten Eyck University of California, San Diego, San Diego Supercomputer Center [email protected]

For CASP5 predictions we used an advanced version of the HMMSPECTR system (http://hmm-spectr.sdsc.edu) that was initially implemented in CASP4 (group TSIGELNY (Tsigelny) P0274). The system is based on searching for the best alignments between the target primary sequence and members of a library of Hidden Markov Models (HMMs) of protein structural homologs [1]. As compared with the previous version of HMMSPECTR, we have increased the size of the HMM library and developed new search approaches.

Structural classification of the library was made in accordance with SCOP [2]. The total size of the HMM library used was 26263 models, including three sub-libraries derived from three different sources. The first was the HMM library of SCOP superfamilies, SUPERFAMILY v.1.59, downloaded from http://supfam.mrc-lmb.cam.ac.uk/SUPERFAMILY. The second and the third sub-libraries were created using protein structure alignments made by two different programs: CE[3] and the original structure alignment tool SA (http://www.npaci.edu/CCMS/wsat) [4]. In the case of the SA implementation, alignments were made within each SCOP family, while in CE the alignments were built within a group of structural homologs selected from the PDB (Protein Data Bank) for typical representatives (“title proteins”) of each SCOP family. The HMMs were built using the HMMER package [5], with 8 different filters for gaps in HMM matrix columns (from 10% to 90% gaps allowed). While building HMMs using the SA engine for families with a small number of proteins, we included up to five of the best versions of alignments for each pair of compared proteins. Such an enhancement of the HMM basis diminished the influence of its reduction to the homologs’ primary sequences in the sparsely populated families.

The next step of protein structure prediction was the alignment of a target primary sequence with suggested template proteins. The above-mentioned SCOP “title proteins” were considered as template proteins for each HMM. The alignment of the target protein and the title protein was made using the HMMER package with the BLOSUM62 substitution matrix.

The final step of the prediction was assessment of the alignments obtained. The assessment was made using multiple parameters, including the HMM-score, Secondary Structure score (SS-score) and Hydrogen Bonds score (HB-score). The HMM-score was calculated directly by the HMMER package. For assessment of the secondary structure identity between the target and the template proteins an original approach for the target secondary structure prediction was applied.

The secondary structure prediction was performed by analyzing information presented in the PDB and creating a database of all secondary structure patterns related to particular primary sequences. Patterns of 6 or more residues were taken into account. Different secondary structure patterns related to identical primary sequence patterns were averaged. After that, a BLAST-like search for the primary sequence patterns in the created database was made in the target protein primary sequence. The predicted secondary structure of the target protein was calculated from a sum of all secondary structure patterns aligned. The SS-score was derived from comparison of the predicted secondary structure of the target protein with the secondary structure of the template protein.

To minimize the dependency of the scores on alignment length, all of the alignments were separated into groups of similar size (alignments within each group did not differ by more than 5% of the target protein length). Within every group the alignments were ranked twice: according to HMM-score and according to SS-score. A joint weighted rank was used for each alignment: the HMM-score weight was 100%; the SS-score weight did not exceed 50% and depended on the secondary structure prediction quality coefficient. The three best alignments belonging to three unique proteins were selected from each group for the subsequent analysis.
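The joint weighted ranking within one length group can be illustrated with a minimal sketch. It assumes that higher scores are better, that the quality coefficient lies in [0, 1], and that the SS weight scales linearly with it up to 50%; the function name and the example scores are hypothetical.

```python
def joint_rank(alignments, ss_quality):
    """Order alignments by a weighted sum of their HMM-score and SS-score ranks.

    alignments -- list of dicts with 'name', 'hmm_score' and 'ss_score'
    ss_quality -- secondary structure prediction quality in [0, 1] (assumed)
    """
    w_hmm = 1.0                                      # HMM-score weight: 100%
    w_ss = 0.5 * max(0.0, min(1.0, ss_quality))      # SS-score weight: at most 50%
    by_hmm = sorted(alignments, key=lambda a: -a["hmm_score"])
    by_ss = sorted(alignments, key=lambda a: -a["ss_score"])
    hmm_rank = {a["name"]: r for r, a in enumerate(by_hmm, start=1)}
    ss_rank = {a["name"]: r for r, a in enumerate(by_ss, start=1)}
    combined = {
        name: w_hmm * hmm_rank[name] + w_ss * ss_rank[name] for name in hmm_rank
    }
    # A lower combined rank is better.
    return sorted(combined, key=combined.get)

# Hypothetical alignments from one length group.
group = [
    {"name": "a", "hmm_score": 100.0, "ss_score": 0.1},
    {"name": "b", "hmm_score": 90.0, "ss_score": 0.9},
    {"name": "c", "hmm_score": 80.0, "ss_score": 0.8},
    {"name": "d", "hmm_score": 70.0, "ss_score": 0.5},
]
```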

In the final stage the HB-score was taken into account. Calculation of the HB-score was performed by locating hydrogen bonds in template proteins using the HBPLUS program [6]. To exclude hydrogen bonds involved in alpha-helix formation, bonds with the donor and acceptor separated by less than 5 residues were not considered. The pairs of target protein residues corresponding to each hydrogen bond in the template protein were then assessed according to their ability to form a hydrogen bond. Finally, the total bonus/penalty score was normalized by the number of analyzed hydrogen bonds.
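A rough sketch of such a normalized bonus/penalty score is given below. The bonus and penalty values, the `can_bond` predicate, the handling of unaligned residues and all names are assumptions made for illustration; only the sequence-separation filter and the normalization follow the description above.

```python
def hb_score(template_hbonds, template_to_target, can_bond,
             bonus=1.0, penalty=-1.0, min_separation=5):
    """Normalized hydrogen-bond compatibility score for one alignment.

    template_hbonds    -- list of (donor_index, acceptor_index) in the template
    template_to_target -- dict mapping template residue index to the aligned
                          target residue type
    can_bond           -- hypothetical predicate (donor_type, acceptor_type) -> bool
    """
    total, analyzed = 0.0, 0
    for donor, acceptor in template_hbonds:
        # Skip short-range bonds, which are typical of alpha-helices.
        if abs(donor - acceptor) < min_separation:
            continue
        # Assumption: bonds involving unaligned residues are not analyzed.
        if donor not in template_to_target or acceptor not in template_to_target:
            continue
        analyzed += 1
        pair = (template_to_target[donor], template_to_target[acceptor])
        total += bonus if can_bond(*pair) else penalty
    return total / analyzed if analyzed else 0.0

# Hypothetical example: side chains in POLAR are treated as able to bond.
POLAR = set("STNQ")
example = hb_score(
    [(1, 5), (1, 10), (3, 20), (2, 30)],
    {1: "S", 10: "T", 3: "A", 20: "S", 2: "N", 30: "Q"},
    lambda d, a: d in POLAR and a in POLAR,
)
```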

The HB-score was used for selection between two alignments of the same template protein as well as for distinguishing between template proteins with close HMM/SS-ranks. For example, if a template protein had two alignments, a short one with a high HMM-score and a long one with a relatively low HMM-score, and the HB-score for the long alignment was high enough, the long alignment was considered a probable candidate for the target structure prediction.

If several template proteins were aligned with different parts of the target protein, multi-domain prediction (CASP5 – AL format) was presented (e.g. T0139 #5, T0162 #4, T0173 #5). For some of the target proteins a new approach using multi-aligned HMMs was tested. It was used when different HMMs were aligned with different parts of a target protein (overlapped or not). In this case the aligned HMM matrices were cut in accordance with the maximum scores and merged into one ‘synthetic’ HMM matrix. This HMM was used to search through PDB primary sequences. The protein aligned with the HMM was considered to be the most probable structure template for the target protein.

1. Tsigelny I. et al. (2002) Hidden Markov Models-based system (HMMSPECTR) for detecting structural homologies on the basis of sequential information. Protein Eng. 15(5), 347-352.
2. Murzin A.G. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247(4), 536-540.
3. Shindyalov I.N., Bourne P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11(9), 739-747.
4. Kotlovyi V. et al. (2002) A flexible method for structural alignment: Application to structure prediction assessments. In Protein structure prediction: Bioinformatic Approach (ed. I.F.Tsigelny), pp. 433-447. International University Line, La Jolla, CA.
5. Eddy S.R. (1998) Profile hidden Markov models. Bioinformatics 14(9), 755-763.
6. McDonald I.K., Thornton J.M. (1994) Satisfying hydrogen bonding potential in proteins. J. Mol. Biol. 235(5), 777-793.

Ho-Kai-Ming (P0437) - 129 predictions: 129 3D

Three Dimensional Threading Approach to Protein Structure Recognition

Kai-Ming Ho, Haibo Cao, Yungok Ihm, Zhong Gao, Cai-Zhuang Wang and Drena Dobbs Iowa State University [email protected]

Our protein recognition scheme uses a threading approach in which candidate structures are represented by contact matrices following the work of Miyazawa and Jernigan[1]. The alignment of the target sequence on a template structure is determined by a scoring function consisting of a sum of all residue-residue contacts with hydrophobic strengths evaluated using the Li, Tang and Wingreen parameterization[2] of the Miyazawa-Jernigan matrix[1]. Contributions of local secondary structure preference are included by multiplying the raw score by an enhancement factor equal to (1 + alpha*(Nright - Nwrong)/Naligned), where Naligned is the number of residues aligned to the structure, Nright is the number of residues where the secondary structure (helix, sheet or loop) of the template agrees with the result from secondary structure predictions and Nwrong is the number of residues where they disagree. The secondary structure predictions are obtained from jpred, psipred and samt99 (as posted on the CAFASP website) and only predictions where all three methods agree are counted.
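The enhancement factor can be written out as a short sketch. The value of alpha is not given in the text, so it is left as a parameter; the one-letter state strings and the function name are illustrative assumptions.

```python
def enhanced_score(raw_score, template_ss, predictions, alpha):
    """Scale a raw threading score by (1 + alpha*(Nright - Nwrong)/Naligned).

    template_ss -- template secondary structure at the aligned positions,
                   one character per residue (e.g. 'H', 'E', 'L')
    predictions -- predicted state strings for the same positions, one per
                   prediction method (three methods in the text)
    alpha       -- weighting parameter (value not specified in the text)
    """
    n_right = n_wrong = 0
    for i, template_state in enumerate(template_ss):
        votes = {p[i] for p in predictions}
        if len(votes) != 1:
            continue  # the methods disagree, so this position is not counted
        if votes == {template_state}:
            n_right += 1
        else:
            n_wrong += 1
    n_aligned = len(template_ss)
    return raw_score * (1 + alpha * (n_right - n_wrong) / n_aligned)

# Hypothetical six-residue alignment with three prediction strings.
preds = ["HHHEEL", "HHHELL", "HHEEEL"]
score_match = enhanced_score(10.0, "HHHEEL", preds, alpha=0.3)
score_one_wrong = enhanced_score(10.0, "HHHEEH", preds, alpha=0.3)
```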

Our searches are divided into two classes. For targets which have significant sequence similarities to proteins of known structure (homology modeling or HM targets), as indicated by the results from various servers posted on the CAFASP website, we run threading studies on the suggested structural families (using the structural classification from the SCOP and ASTRAL databases). If the threading score is above our threshold, we stop the search. For non-HM targets, threading studies are done with ~14000 protein structures selected from the ASTRAL domain library[3]. This dataset covers ~1500 out of ~1800 domain families in the ASTRAL database and includes all domain families which are shorter than 300 residues. For longer target sequences, we augment the above database with additional families with lengths from 300 to 600 residues. To facilitate recognition, we perform our initial threading studies not on the whole target sequence but on short subsequences with lengths ~100-120 selected from different positions on the target. Once we have hits with threading scores above threshold, threading studies are performed on the selected families using longer and longer subsequences representing a larger and larger fraction of the target. In some cases, high-scoring fragments are pasted together to yield a more complete structural prediction for the whole protein. In the last part of the prediction process, final PDB geometries are generated from high-scoring template structures using the alignment obtained from the threading studies. The final modeling and refinement are done with the software package JACKAL (J. Xiang, Columbia University) and/or the MODELLER and PROCHECK packages.

1. Miyazawa S. and Jernigan R.L. (1985) Estimation of effective interresidue contact energies from protein crystal structures: Quasi-chemical approximation. Macromolecules 18, 534-552
2. Li H., Tang C., and Wingreen N.S. (1997) Nature of driving force for protein folding: A result from analyzing the statistical potential. Phys. Rev. Lett. 79, 765-768
3. Brenner S.E., Koehl P., and Levitt M. (2000) The astral compendium for protein structure and sequence analysis. Nucleic Acids Res. 28, 254-256

HOGUE-SLRI (P0267) - 254 predictions: 254 3D

Homology Modeling using a Novel Flexible Fragment Assembly Approach and Ab Initio Prediction Using Distributed Computing

H.J. Feldman1,2, M. Dumontier1,2 and C.W.V. Hogue1,2 1 – Samuel Lunenfeld Research Institute, Mount Sinai Hospital 2 – Department of Biochemistry, University of Toronto [email protected]

We employed two distinct approaches for structure prediction, depending on whether homology with a protein of known structure existed for the target. Initially, CDDSearch (http://www.ncbi.nlm.nih.gov/structure/cdd) was employed to identify protein families with significant similarity to the target. We also checked for low E-value hits from PDB-BLAST on the CAFASP site. If a template sequence with an E-value below 0.01 was found (from either PDB-BLAST or CDD) we proceeded with the homology modeling procedure outlined below. Otherwise, we assigned the target to the Distributed Folding Project (described below).

A total of 39 targets were modeled using homology modeling in the following way. The family with the best E-value which also contained a protein of known structure was aligned to the query CASP target, and the most similar sequence with a structure in that family was chosen as the template for homology modeling. In the case of multi-domain proteins, the best hit from CDD for each domain was used as template. Template structure(s) were manually inspected, and gaps manually adjusted when necessary to ensure they fell on loop regions and not in the center of secondary structure elements.

Next, using a modified version of our TRADES algorithm [1] the backbone alpha-carbon trajectory of the template was recorded, and a trajectory distribution built with the new sequence of the target. Each gapless stretch of alignment was replaced by a single fragment from the recorded trace. Where gaps occurred in the alignment, fragments were built to span the gaps. These fragments were created as follows.

The 'takeoff angles' were recorded starting from one residue prior to the gap and ending one residue following the gap, on the template structure. These consisted of six degrees of freedom: the distance between the start and end of the gap, two Cα virtual angles (i.e. angles between three consecutive Cα atoms, ‘virtual’ because they are not covalently bonded) and three Cα virtual dihedrals. Then three Cα atoms from each side of the gap were placed in space, according to the recorded takeoff angles. Alpha carbons required to fill the gap were then given arbitrary starting co-ordinates within the gap region, and a steepest descent energy minimization was carried out. For the purposes of this minimization, the energy function consisted of Cα virtual bond length restraints, Cα virtual angle restraints, and a van der Waals term. The three anchoring atoms on either side of the gap were held fixed during the minimization. Finally, the resulting loop was incorporated as a fragment using its own Cα trace.
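The Cα virtual angles and virtual dihedrals that make up the takeoff angles can be computed with standard vector geometry, as in the sketch below. The helper names are hypothetical, and which atoms define each of the six recorded values is not specified beyond the description above.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def virtual_angle(p1, p2, p3):
    """Angle in degrees at p2 formed by three consecutive C-alpha positions."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cos_t = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

def virtual_dihedral(p1, p2, p3, p4):
    """Dihedral in degrees defined by four consecutive C-alpha positions."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))
```

Together with the end-to-end distance of the gap (for example `math.dist(start, end)`), these two functions cover the kinds of quantities listed among the six degrees of freedom.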

Fragments are not completely rigid: a small amount of flexibility is introduced by adding random noise to each Ca position relative to the previous backbone. Roughly 1000 structures were then generated from the fragments obtained in the previous steps using our Foldtraj software, with bump checking disabled. Because of the slight flexibility in the fragments, this pool of structures shows some variation.

Using a slightly modified version of a statistical residue-based potential [2] which we have termed 'crease energy', the best five structures were chosen. These were then refined with a steepest-descent minimization using the CHARMM EEF1 force field to resolve any steric clashes but without significantly changing the structure (typically 1 Å RMSD between the refined and unrefined structures).

A total of 13 targets were predicted with the help of distributed computing using an ab initio approach. Using a modified version of our TRADES algorithm [1] we incorporated secondary structure prediction from PsiPred [3] and performed random walks in Ramachandran space. Sidechains were placed probabilistically using Dunbrack's backbone-dependent rotamer library [4]. All residues are chirally and sterically valid, having a minimum of non-hydrogen van der Waals collisions.

Up to 1 billion structures were generated for each target using the Distributed Folding Project framework (http://www.distributedfolding.org/). This allowed us to make use of spare CPU cycles on thousands of computers across the world to sample structures.

Finally, from the pool of generated structures various statistics were collected including radius of gyration, exposed surface area, exposed hydrophobic surface area, and energy score according to three different scoring functions: the EEF1 solvation term, a modified version of a statistical residue-based potential [2] which also compared actual secondary structure content to predicted content, and a species-specific contact potential developed in our lab. Structures with radii of gyration greater than 120% * 2.59 * N^0.346, where N is the number of residues in the protein, were all discarded. This ensured only compact structures were retained.
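The compactness filter can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the radius of gyration is computed unweighted over whatever coordinates are supplied, and the use of Ångström units is an assumption.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    sq = sum((p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
             for p in coords)
    return math.sqrt(sq / n)


def rg_cutoff(n_residues, tolerance=1.20):
    """Cutoff from the text: 120% * 2.59 * N^0.346."""
    return tolerance * 2.59 * n_residues ** 0.346


def is_compact(coords, n_residues):
    """Keep a structure unless its radius of gyration exceeds the cutoff."""
    return radius_of_gyration(coords) <= rg_cutoff(n_residues)
```

Structures for which `is_compact` returns `False` would be the ones discarded before ranking by energy score.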

The best structures were chosen based on their energy scores. The top 2-5 structures for each of the three energy functions used were visually inspected and five chosen for submission.

1. Feldman H.J. and Hogue C.W.V. (2000) A Fast Method to Sample Real Protein Conformational Space. Proteins 39 (2), 112-131. 2. Bryant S.H. and Lawrence C.E. (1993) An Empirical Energy Function for Threading Protein Sequence through the Folding Motif. Proteins 16 (1), 92-112. 3. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 4. Dunbrack R.L., Jr. and Karplus M. (1993) Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J. Mol. Biol. 230 (2), 543-574.

Holm (P0090) - 38 predictions: 38 3D

Adaptive Profile Alignment

A. Heger and L. Holm Institute of Biotechnology, P.B. 56, 00014 University of Helsinki, Finland [email protected]

The method uses only sequence information and has two steps: (1) select a set of homologous sequences that includes both the target and template proteins, (2) align target and template sequences using many intermediate sequences as stepping stones. For the first step, we used the transitivity of homology to search for connected sets of sequence-similar proteins, as it has been shown that profile-based sequence similarity searching fails to detect a large fraction of more distant homology relationships. The second step used a novel algorithm, MaxFlow, which in our own tests has improved both the reliability and coverage of alignments compared to PSI-Blast. In particular, MaxFlow is capable of generating accurate alignments between proteins that are only indirect PSI-Blast neighbours. If the target and template are distant homologues, they can have genuinely different amino acid preferences, which cannot be reasonably modelled by a single profile. MaxFlow mimics evolutionary adaptation in that it allows the profile model to change gradually through many intermediate stages.

Data: Sequence analysis was based on our PairsDb database, which organizes one million non-identical protein sequences (nrdb100 set) into hierarchical clusters. Nrdb90 is a representative subset generated at the 90 % identity level, and nrdb40 is a representative subset generated at the 40 % identity level. All sequences in nrdb100 are mapped to the nrdb90 or nrdb40 representative via Blast [1] alignments. The database also stores the results of all-against-all PSI-Blast searches [1] in the nrdb40 set.

Homology detection: Templates for structure prediction were selected based on overlapping sequence neighbourhoods. Sequence neighbours were defined as reciprocal PSI-Blast hits, i.e., profiles seeded from protein A or B detected protein B or A, respectively, with an e-value < 1. We used a pre-computed library of sequence neighbours of all PDB structures. If any sequence neighbour of the target was included in the library, we counted a hit. We then pooled the counts per SCOP superfamily. The superfamily with the most votes was chosen as the template set. Typically, one SCOP superfamily had a distinctly higher count than any other. We gain in sensitivity compared to a plain PSI-Blast search, as the target A and template B need not be direct PSI-Blast neighbours but may be linked by an intermediate (domain) X as in A-X, X-B.

Sequence alignment: The notion of scoring alignments for consistency, rather than amino acid similarity, has been around for a long time [2-4]. The input to MaxFlow is a library of pairwise alignments. The input set of sequences was the union of the sequence neighbours of the template set and of the target. The pairwise alignments were taken from the PSI-Blast all-against-all database. Provided that the target and template are in the same connected component, the pairwise alignment library implies a transitive alignment between target and template via a number of intermediates. There are, of course, very many choices of intermediates that will lead to mutually inconsistent alignments between the proteins at the start and end of the chain. The classical multiple sequence alignment problem aims to reconcile such inconsistencies using ad hoc objective functions, usually a sum-of-pairs score, leading to NP-complete optimisation problems [4]. MaxFlow uses a novel type of objective function for transitive alignment, which is based on a path score. The path score measures the total support in the alignment library for pairing a given pair of residues of two proteins. The algorithmic advantage is that we need only address the standard pairwise alignment problem, which can be solved exactly.
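The superfamily voting used for homology detection can be sketched as below. This is a minimal illustration under stated assumptions: the data structures and function names are hypothetical, and counting one vote per target neighbour shared with each PDB structure is one reading of "we counted a hit", not a confirmed detail of the method.

```python
from collections import Counter


def vote_superfamilies(target_neighbours, pdb_neighbours, superfamily_of):
    """Pool hits per SCOP superfamily.

    target_neighbours: set of sequence ids neighbouring the target.
    pdb_neighbours: dict mapping a PDB domain id to its set of neighbours.
    superfamily_of: dict mapping a PDB domain id to its superfamily id.
    """
    votes = Counter()
    for domain, neighbours in pdb_neighbours.items():
        shared = target_neighbours & neighbours
        if shared:
            # Assumption: each shared neighbour counts as one hit.
            votes[superfamily_of[domain]] += len(shared)
    return votes


def choose_template_set(votes):
    """Return the superfamily with the most votes, or None if no hits."""
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

Because the target and a template only need to share a neighbour, this kind of vote can link A and B through an intermediate X even when A and B are not direct PSI-Blast hits of each other.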

Empirically, MaxFlow’s consistency score correlates with the reliability of alignment, so that one can select more reliable core parts of the alignment, but this information could not be entered into the AL format of CASP.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 2. Vingron M., Argos P. (1991) Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol. 218, 33-43 3. Notredame C., Higgins D.G., Heringa J. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302, 205-217 4. Notredame C. (2002) Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 3, 131-144

Honig (P0110) - 113 predictions: 113 3D

Comparative Modeling Using HMAP, NEST, Troll and Physical-Chemical Principles

Zhexin Xiang1,2, Donald Petrey1,2, Cinque Soto2, Chris Tang3 and Barry Honig1,2 1 Howard Hughes Medical Institute, 2 Department of Biochemistry and Molecular Biophysics, 3 Integrated Program in Cellular, Molecular and Biophysical Studies, Columbia University [email protected]

Overview - We participated in the fold recognition and homology sections of the experiment using primarily in-house software. Much of this software is novel and has not yet been published. The in-house software we used includes HMAP (a hybrid sequence and structure based alignment between query and template profiles), NEST (a new homology modeling program that is based on an artificial evolution method), SCAP [1] and LOOPY [2] (a side-chain and loop prediction program based on the colony energy approach), Troll [3] /GRASP2 (an interactive program which contains all of the features of GRASP plus multiple structure alignments and an easy to use graphical user interface that displays both sequence and structure alignments), DIFALN/BINGO (a graphical program to display and manually tune sequence alignments between HMAP and CAFASP servers) and physical-chemical based energy functions to evaluate alternate conformations.

Our strategies for fold recognition and homology modeling were very similar. For fold recognition we generally attempted targets where HMAP detected templates with a reasonable e-value threshold, or where we felt that HMAP improved the alignments that came from the CAFASP servers. On occasion, we noticed that CAFASP servers would detect significant hits where HMAP did not. In all cases, this happened because the hit detected by the servers was not in our database. Thus, we built a profile using HMAP for the new template and used it to generate our own alignments. If we felt we had nothing to add beyond what the servers listed, we decided not to submit that target.

For each target we would perform the following: 1) build 3D models for sequence alignments from HMAP and selected CAFASP servers; 2) evaluate each model with our own energy functions and with Verify3D [4]; and 3) identify regions of the sequence where multiple structure alignments of family members revealed either similarities or differences. If differences were identified, we generally used energetic criteria to decide between models, but on occasion used intuition derived from visually inspecting the alignments. The alignments were adjusted based on the energy criteria and steps 1-3 above were carried out again. This process was repeated until a satisfactory structure was generated. One area where visual inspection was particularly useful was in deleting insertions. In many cases we could easily delete loops and even some secondary structure elements while minimally perturbing the structure.

Our strategy for homology modeling was closely related to that used in fold recognition but with a few additional steps. Since NEST works so rapidly we were able to use regions from different templates where we believed they provided better local templates, and then fuse the ends of these regions into our original template with a loop closure procedure [2]. In general, we did not try to keep the target as close as possible to the template. We realized that this was a risky procedure but we felt it important to test our ability, for example using the refinement module of NEST, to try to relax the structure. This was sometimes done with manual input. For example, we always tested for buried charges and, unless we could visually identify a potential ion-pairing partner, we would either change the alignment or try to change the structure. This involved both backbone and side chain movement.

Methods - HMAP is a fold-recognition and alignment program that relies on profile-to-profile dynamic programming. Template profiles were derived from SCOP-defined protein domains (version 1.57 at the time we built our database) and consisted of several different types of information that could be derived from the sequence and structure of a protein. In CASP5, our templates primarily used information derived from secondary structure, fixed-length sequence motifs, automated multiple structure alignments and sequence-based profiles. Position-specific gap penalties were derived from the secondary structure profiles generated from multiple alignments of structurally related proteins. The results were stored in the form of a database of structural templates. Profiles were calibrated so that the statistical significance of a hit could be estimated. When a new target was released from CASP, we built a query profile for the sequence based on its sequence-based profile and secondary structure prediction (using a consensus between PSI-PRED [5], PHD [6] and JNET [7]). The alignments given by HMAP were manually assessed and then fed to the homology-modeling program NEST.

NEST is a homology modeling program based on an artificial evolution method (http://trantor.bioc.columbia.edu/~xiang/jackal). The program can build and refine homology models based on single, composite or multiple templates. Given an alignment between a query sequence and a template, the alignment can be considered as a list of operations such as residue mutation, insertion or deletion. Building a structure for the query sequence based on the template is a process of performing these operations. Each operation disturbs the template structure and involves an energy cost, either positive or negative. Model building starts with the operation of least energy cost and proceeds in order of increasing cost. Each operation is finished with a slight energy minimization to remove atomic clashes. The final structure is then subjected to more thorough energy minimization. The minimization is done in torsion angle space. The energy function consists of the following terms: van der Waals energy, hydrophobic, electrostatics, torsion angle energy, hydrogen-bond network energy of the template, and statistical energy of a residue’s solvent accessibility.

The structure refinement module in NEST can refine the models at four levels: energy minimization of clashing atoms, refinement of insertion and deletion regions, refinement in all loop regions and refinement in all α/β regions. Refinement of loop regions is done using LOOPY and refinement of side-chain conformations is performed using SCAP, where both SCAP and LOOPY use the colony energy approach to account for the flexibility of side chains and loops on the protein surface. Refinement of helix or sheet regions is done by a procedure similar to LOOPY, but hydrogen-bond constraints are applied in the regular secondary structure regions so that the refinement does not disrupt the original hydrogen bond network.

Models were evaluated by comparing energies of the models using a protocol that combines an extensive molecular mechanics minimization with an evaluation of the total electrostatic energy using the finite-difference Poisson-Boltzmann method. Powell minimization using an all-hydrogen model and CHARMM22 parameters and a dielectric constant of 10 was performed. Low energy structures were considered for submission. This procedure was combined with visual evaluation of the models using the program GRASP2 written with the Troll software library of molecular analysis and visualization tools. In addition to the molecular graphics, surface display, and electrostatic features of the original version of GRASP, GRASP2 now integrates structure alignment and sequence display/alignment tools into the graphical user interface. These tools allow a user to conveniently search a database of domains for proteins that are structurally homologous to a given template, and to simultaneously display/compare different alignments to a template or alignments to different templates. This is accomplished by carrying out a multiple structure alignment of a set of templates and then adding alignments of a query to each template to the multiple structure alignment.

Structure alignments were generated as follows. First, equivalent secondary structure elements are identified using a double-dynamic programming algorithm. Once structurally equivalent secondary structure elements are identified, structurally equivalent residues are identified by superposing the end-points of the equivalent secondary structure elements and then carrying out an iterative process of sequence alignment. Residue similarity at this stage is a simple function of the distance between alpha-carbons given the current rigid body superposition. A sequence alignment is determined using this similarity score and rigid body superposition is carried out again. This process is repeated until the root-mean-square deviation of aligned alpha-carbon atoms changes by no more than a given threshold.

The simultaneous display/comparison of alignments and structures allows convenient identification of structural features that may be responsible for differences in the more objective evaluation criteria such as calculation of molecular mechanics energies or Verify3D profiles [4] and contributed significantly to the decision as to which model/alignment to submit.

1. Xiang Z. and Honig B. (2001) Extending the Accuracy Limits of Prediction for Side Chain Conformations. J. Mol. Biol. 311:421-430. 2. Xiang Z., Soto C. and Honig B. (2002) Evaluating Conformational Free Energies: The Colony Energy and its Application to the Problem of Loop Prediction. Proc. Natl. Acad. Sci. USA 99:7432-7437. 3. Petrey D. and Honig B. (2000) Free Energy Determinants of Tertiary Structure and the Evaluation of Protein Models. Protein Science 9:2181-2191. 4. Luthy R., Bowie J.U. and Eisenberg D. (1992) Assessment of Protein Models with Three-Dimensional Profiles. Nature 356:83-85. 5. Jones D. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 292(2):195-202. 6. Rost B. (1996) PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymology. 266:525-39. 7. Cuff J.A. and Barton G.J. (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins. 240(3): 502-11.

Huber-Torda (P0351) - 83 predictions: 83 3D

Fold Recognition and Sequence to Structure Alignment Using Wurst

T. Huber1, J.B. Procter2 and A.E. Torda2 1 - Department of Mathematics, The University of Queensland, Australia, 2 - Zentrum für Bioinformatik Hamburg, University of Hamburg, Germany [email protected], [email protected], [email protected]

Our calculations were performed with "wurst" [1], a locally written protein structure prediction package. Fold recognition calculations were done using a two-step approach where completely different score functions are used for alignment and ranking of models [2].

The first score function was based on optimized Bayesian-mixture models designed to measure sequence to structure compatibility in small fragments. This treats both sequence and backbone angles as statistical descriptors. Despite the unusual formalism, it allows one to build a score matrix for sequence to structure alignments and is easily combined with a term accounting for sequence similarity. For ranking models, these terms were mixed with a low-resolution (five site / residue), z-score optimized score function [3].

Alignments were calculated using a Smith and Waterman style local alignment and extended using the globally optimal Needleman and Wunsch algorithm. Gap penalties and relative contributions of different terms were optimized using a simplex method to produce the best models, in a geometric sense, for a set of structurally similar proteins. The penalty function only considered the quality of a model and did not refer to any “ideal” alignment.
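For readers unfamiliar with the local alignment step, the following is a minimal sketch of a Smith and Waterman style alignment over a precomputed sequence to structure score matrix. It is not the wurst implementation: the function name, the linear gap penalty and its default value are assumptions, and the subsequent Needleman and Wunsch extension is not shown.

```python
def smith_waterman(score, gap=-2.0):
    """Local alignment over score[i][j] (sequence site i, structure site j).

    Returns the best local score and the list of aligned (i, j) pairs.
    A single linear gap penalty is assumed for simplicity.
    """
    n, m = len(score), len(score[0])
    h = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_end = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(0.0,
                          h[i - 1][j - 1] + score[i - 1][j - 1],
                          h[i - 1][j] + gap,
                          h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_end = h[i][j], (i, j)
    # Trace back from the best cell until the local score drops to zero.
    pairs = []
    i, j = best_end
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs
```

Here the score matrix stands in for whatever combination of fragment-compatibility and sequence-similarity terms is used; only the dynamic programming recurrence is illustrated.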

The library of candidate models was built using a clustering of known proteins, based on their apparent similarity in the appropriate score function, rather than a conventional sequence or structure measure.

Finally, models were regularized using a distance geometry code.

1. Huber T et al. (1999) SAUSAGE: Protein threading with flexible force fields. Bioinformatics 15 , 1064-1065. 2. Huber T and Torda A.E. (1999) Protein sequence threading, the alignment problem and a two-step strategy. J. Comput. Chem. 20, 1455-1467. 3. Huber T. and Torda A.E. (1998) Protein sequence threading, the alignment problem and a two-step strategy. Protein Sci. 7, 142-149. 4. Russel A.J. and Torda A.E. (2002) Protein sequence threading – averaging over structures. Proteins 47, 496-505.

I-sites/Bystroff (P0132) - 64 predictions: 64 3D

Fully Automated Ab Initio Tertiary Structure Prediction Using I-SITES, HMMSTR and ROSETTA

C. Bystroff and Y. Shao Department of Biology, Rensselaer Polytechnic Institute [email protected], [email protected]

The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence structure motifs and the HMMSTR hidden Markov model for local structure in proteins, to form a new public server for the ab initio prediction of protein tertiary structure (www.bioinfo.rpi.edu/~bystrc/hmmstr/server.html). The server also predicts fragment structure, backbone angles, and secondary structure. The CASP4 results for this server are presented in [1].

The following is a short description of each of the main programs in the order they appear in the server script.

Generate a multiple sequence alignment

Single sequences were submitted to PSI-BLAST [2], which returns a multiple sequence alignment. The multiple sequence alignment was converted to a sequence profile.

Predict I-sites motifs

The sequence profile is compared in a sliding-window fashion with each of the 261 I-sites Library scoring matrices [3]. The score is mapped to a confidence. The server returns a list of fragment predictions, expressed as backbone angles, sorted by confidence. The highest confidence fragments are referred to as "I-sites predictions," the whole list as "I-sites fragments."

Generate a fragment moveset

The I-sites fragment list was converted to a ROSETTA move set containing libraries of length 3 and 9 peptides for each window of the sequence. Each I-sites fragment with length L ≥ 9 was divided into L-9+1 subsegments of length 9. The 25 highest confidence fragments were kept for each 9-residue window in the query. If fewer than 25 high confidence fragments were found, then the list was augmented by extending 7 and 8 residue I-sites fragments. A similar procedure was done for the moveset of length 3.

Restrain high-confidence regions

High confidence I-sites predictions were restrained to their predicted backbone angles to increase the efficiency of ROSETTA. Fragment insertion was allowed in the restrained regions, but moves were rejected if any angle deviated by more than 60° from the I-sites prediction. A maximum of one-third of the residues could be restrained.
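The fragment subdivision and the 60° restraint test can be sketched as follows. This is an illustrative reading of the text rather than the server code: the function names are hypothetical, and representing each residue as a (phi, psi) tuple in degrees is an assumption.

```python
def split_fragment(start, angles, window=9):
    """Divide a fragment of length L >= window into L - window + 1 pieces.

    start: query position of the first residue of the fragment.
    angles: list of per-residue backbone angle tuples, e.g. (phi, psi).
    Returns a list of (window_start, window_angles) pairs.
    """
    length = len(angles)
    if length < window:
        return []
    return [(start + k, angles[k:k + window])
            for k in range(length - window + 1)]


def angle_deviation(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def move_allowed(proposed, predicted, restrained, max_dev=60.0):
    """Reject a move if any restrained angle deviates by more than max_dev."""
    for new, ref, is_restrained in zip(proposed, predicted, restrained):
        if is_restrained and any(angle_deviation(a, b) > max_dev
                                 for a, b in zip(new, ref)):
            return False
    return True
```

The periodic wrap in `angle_deviation` matters because backbone angles near +180° and -180° are close to each other even though their numerical difference is large.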

Divide long sequences

If the target sequence had more than 36 un-restrained residues after the previous step, then it was divided into overlapping segments having about 36 un-restrained residues each. Adjacent segments overlap by at least 18 un-restrained residues, plus any intervening restrained segments.

Assemble fragments

ROSETTA [4] searches protein conformational space using fragment insertion moves and a Monte Carlo acceptance criterion. An insertion point in the target is selected at random, then a fragment (either length 3 or 9) is selected at random from the fragment library. The backbone angles are changed to those of the fragment, new coordinates are computed from the backbone angles, and the move is accepted or rejected, using Monte Carlo. The energy function [5] is composed of structure-based Bayesian conditional probability expressions, drawn from the PDBselect database [6].

The probability of acceptance depends on the change in energy and the temperature setting (T). T is initially set to a high value so that most physically possible moves are accepted, then decreased linearly over 12,000 moves. The optimal temperature schedule depends on the length of the chain being simulated, or more specifically, the number of degrees of freedom. In this automated procedure, a fixed temperature schedule was used and the length of the input sequence was restricted to a narrow range.
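A minimal sketch of such an acceptance rule with a linear temperature schedule is given below. The Metropolis form of the acceptance probability is an assumption (the text says only that acceptance depends on the energy change and T), and the function names and the start and end temperatures are hypothetical; only the 12,000-move linear decrease comes from the text.

```python
import math
import random


def temperature(step, n_steps=12000, t_start=100.0, t_end=1.0):
    """Temperature decreasing linearly from t_start to t_end over n_steps."""
    frac = step / max(n_steps - 1, 1)
    return t_start + (t_end - t_start) * frac


def accept(delta_e, temp, rng=random):
    """Metropolis rule: always accept downhill, sometimes accept uphill."""
    if delta_e <= 0:
        return True
    return rng.random() < math.exp(-delta_e / temp)
```

At high T the exponential is close to 1, so nearly every move is accepted; as T falls, uphill moves are accepted less and less often.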

Re-assemble split sequences

A total of 15 fragment predictions were produced by ROSETTA for each segment. The 5 best predictions for adjacent segments were re-combined by exhaustive splicing. Starting with two sets of overlapping segment predictions, all possible crossover hybrid models were made and the five with the lowest energy were saved for the next round, or for final output.

1. Bystroff C. & Shao Y. (2002). Fully automated ab initio protein structure prediction using I-SITES, HMMSTR and ROSETTA. Bioinformatics 18, S54-61. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 3. Bystroff C. & Baker D. (1998). Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol 281, 565-77. 4. Simons K.T., Kooperberg C., Huang E. & Baker D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268, 209-25. 5. Simons K.T., Ruczinski I., Kooperberg C., Fox B.A., Bystroff C. & Baker D. (1999). Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34, 82-95. 6. Hobohm U. and Sander C. (1994) Enlarged representative set of protein structures. Protein Science 3, 522.

INFORMAX (P0326) - 24 predictions: 24 3D

Modeling of CASP5 Targets with GenoMax 3.3 ™ Homology Modeling Tool

Feodor Tereshchenko InforMax Inc. [email protected]

The homology modeling algorithm is described in [1]. Alignments were created in the GenoMax 3.3 ™ environment [2] and manually edited. The following predictions were submitted:

T0133: Residues 293-304 modeled. Template - 1SP1.
T0135: Residues 42-100. Template - 1FCE.
T0137: Residues 1-131. Template - 1B56.
T0141: Residues 3-73. Template - 1MIM.
T0144: Residues 5-172. Template - 1DYW_A.
T0150: Residues 2-97. Template - 1CN9_A.
T0154: Residues 12-286. Template - 1IHO_A.
T0155: Residues 4-118. Template - 2DHN.
T0158: Two models submitted. Model No. 1: Residues 12-317. Template - 1JJI_B. Model No. 2: Residues 10-319. Template - 1EVQ_A.
T0160: Residues 11-128. Template - 2MSP_A.
T0163: Residues 1-216. Template - 1EL7_A.
T0169: Residues 17-69. Template - 1CZF_B.
T0171: Residues 43-130. Template - 1OIL_A.
T0175: Residues 14-211. Template - 1JG4_A.
T0177: Residues 41-83. Template - 1FYJ_A.
T0178: Residues 96-166. Template - 1JCL_A.
T0179: Residues 4-275. Template - 1JQ3_D.
T0180: Residues 10-30. Template - 1F6D_A.
T0182: Two models submitted. Model No. 1: Residues 6-245. Template - 1C27_A. Model No. 2: Residues 6-246. Template - 1MAT.
T0190: Residues 6-114. Template - 1FTP_A.
T0193: Residues 94-203. Template - 2SCU_A.
T0194: Residues 164-233. Template - 1D4U_A.

1. Tereshchenko F., Daraselia N. (2000) A homology modeling algorithm for protein tertiary structure prediction. CASP4 Proceedings, Pacific Grove, CA. 50-51. 2. GenoMax 3.3 Users Manual. (2001) InforMax Inc. Bethesda, MD.

Irback (P0559) - 20 predictions: 20 3D

Hierarchical All-Atom Procedure for Protein Structure Prediction

G. Favrin1, A. Irbäck1, B. Samuelsson1, F. Sjunnesson1 and S.Wallin1 1 – Complex Systems Division, Department of Theoretical Physics, Lund University, Sölvegatan 14A, SE-22362 Lund, Sweden [email protected]

Our calculations are performed using an all-atom model with a minimalistic potential [1], in combination with PSIPRED [2] secondary structure predictions. The potential function of this model consists of three terms representing excluded volume, hydrogen bonds and hydrophobic attraction, respectively. The model has been tested on an α-helix and a β-hairpin, using the same parameters for both peptides [1]. To be able to deal with larger chains, we proceed in a hierarchical manner, using secondary structure predictions.

The first step in our calculations is to perform separate simulations of different fragments of the chain. The secondary structure prediction guides the choice of fragments to be studied, and is sometimes also used to bias parts of the fragments towards α-helix or β-strand structure. Structures obtained from these simulations are then taken as input for simulations of larger fragments. This is iterated until the full chain has been studied. If no reasonable structure is found, the calculations are restarted from the beginning or some intermediate level. To analyze the structures obtained from the simulations, we use a simple clustering algorithm, based on root-mean-square deviations.

Our conformational search is Monte Carlo-based. Only torsional degrees of freedom are considered. Backbone torsion angles are updated using different move sets at different places along the chain. For parts of the chain to be refined only, we use a semi-local 'small-step' algorithm. For other parts, we use both this update and more drastic updates that lead to non-local deformations of the chain.

1. Irbäck A., Samuelsson B., Sjunnesson F. and Wallin S. (2002) Thermodynamics of α- and β-structure formation in proteins. Preprint LU TP 02-28, available at www.thep.lu.se/complex/publications.html. 2. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.

IST-ZORAN (P0454) - 195 predictions: 195 DR

Three Ensemble Predictors of Protein Disorder

K. Peng, S. Vucetic, and Z. Obradovic, Center for Information Science and Technology, Temple University [email protected]

To predict protein disorder in CASP proteins, we used three different predictors. The first was based on attributes derived from amino acid compositions, without incorporating evolutionary information [1]. The other two predictors incorporated evolutionary information [2]: one was built using a training set enhanced by homologues of the true disordered regions, and the other was built from family profiles obtained from PSI-BLAST [3]. All three predictors are ensembles of 10 feed-forward neural networks constructed from the same training data.

Predictor1. The first predictor constructs 20 attributes (18 amino acid frequencies, average flexibility and sequence complexity) at each sequence position using an input window of length Win centered at the position. The raw predictions are averaged over an output window of length Wout to obtain the final prediction for a given position. The dataset included 150 proteins with disordered regions longer than 30 consecutive residues and 290 completely ordered proteins. By comparing the predictor performance, we selected the best window size combination as Win = 41 and Wout = 61. The accuracy using 30-fold cross-validation was 76.08% on disordered regions and 91.11% on ordered proteins.
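The output-window smoothing can be sketched as below. This is a minimal illustration, not the authors' code: the function name is hypothetical, and truncating the window at the sequence ends (rather than padding) is an assumption the text does not settle.

```python
def window_average(values, width=61):
    """Centered moving average of per-position raw predictions.

    width plays the role of Wout; near the ends of the sequence the
    window is truncated to the positions that exist.
    """
    half = width // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

For example, `window_average(raw_scores, width=61)` would turn the raw network outputs for one protein into the final per-position predictions under these assumptions.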

Predictor2. To incorporate evolutionary information, the second predictor was trained using disordered examples taken from 150 disordered proteins as well as their homologues found by PSI-BLAST search against the non-redundant database (nr), while ordered examples were still taken from 290 completely ordered proteins. Homologues with too high E-values (>1e-05) or too low E-values (<1e-30) were excluded. Using random sampling, the disordered examples from the same family had an equal chance to be selected for training. This predictor constructs a set of composition-based attributes and averages the raw predictions in the same manner as predictor1. Given Win = 41, Wout = 61, the accuracy using 30-fold cross-validation was 80.77% on disordered regions and 89.87% on ordered proteins.

Predictor3. The third predictor constructs attributes from the family profiles obtained by PSI-BLAST search against the non-redundant database (nr). The substitution scores for each amino acid are averaged over an input window of length Win to obtain 20 profile-based attributes. They are used along with the average flexibility and the sequence complexity attributes calculated over the same input window of length Win. As in predictor1, the raw predictions are also averaged over an output window of length Wout. Given Win = 41 and Wout = 61, the accuracy using 30-fold cross-validation was 79.17% on disordered regions and 91.34% on ordered proteins.

1. Vucetic S. et al. (2001) Methods for Improving Protein Disorder Prediction, Int'l Joint Conf. on Neural Networks 2001. 2. Peng K. et al. (unpublished) Improving Protein Disorder Prediction by Incorporating Evolutionary Information and Optimizing Knowledge Representation. 3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.

Jager (P0582) - 7 predictions: 7 3D

Protein Minimization by Multiscale Decomposition

Lukas Jager University of Bonn, Institute for Applied Mathematics, Department for Scientific Computing and Numerical Simulation [email protected]

The tertiary structures of the target proteins are predicted using a novel multilevel minimization algorithm. The proteins are described by the CHARMM [1] force-field with an all-atom representation including hydrogens. The minimization procedure employs a three-level decomposition of the protein in which the coarse-level potentials are computed automatically. This computation is based on the all-atom representation only. For the minimization itself we use an outer (global) Basin-Hopping algorithm [2]. The Basin-Hopping algorithm accepts or rejects each minimum, which is calculated from a perturbed configuration of the last accepted minimum, according to a standard Metropolis probability. To speed up convergence, each outer minimization is preconditioned by a combined Basin-Hopping on the coarsest level with a local minimization on the other levels. Our numerical results show that this multilevel sampling strategy significantly improves the efficiency compared to a simple local minimization algorithm [3]. The details of the method are as follows:

Starting from the all-atom representation of the protein, two additional representations with fewer particles are calculated. The first coarse representation of the protein, which we call the "endless" representation, is built by simply removing all ends of the protein, i.e., all atoms with only one bond-neighbor are removed. To adjust the force-field, the potential parameters of the remaining atoms are recalculated. The second coarse level consists only of hard spheres, each representing one amino acid. The spheres are located at the center of "mass" of the amino acids, where the "epsilon" parameter of the Lennard-Jones potential on the finest level, which determines the depth of the potential minimum, is used for weighting. The radius of the sphere is defined by the maximum distance between an atom of an amino acid and the sphere's center. The potential on the coarsest level consists only of bonds between adjacent spheres and a (coarse) Lennard-Jones potential whose (coarse) parameters are computed from the all-atom representation on the finest level.
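The construction of one coarse sphere per amino acid can be sketched as follows. The function name `coarse_sphere`, the tuple layout of the atoms, and the use of the absolute value of epsilon as the weight are assumptions for illustration; the abstract only states that epsilon is used for weighting.

```python
import math


def coarse_sphere(atoms):
    """Build one coarse sphere from the atoms of a single amino acid.

    atoms: list of (x, y, z, epsilon) tuples.
    Returns (center, radius), where the center is the epsilon-weighted mean
    position and the radius is the largest atom-to-center distance.
    Assumption: |epsilon| is used so that the sign convention of the
    force-field parameter does not matter.
    """
    weights = [abs(a[3]) for a in atoms]
    total = sum(weights)
    center = tuple(
        sum(a[k] * w for a, w in zip(atoms, weights)) / total for k in range(3)
    )
    radius = max(math.dist(center, a[:3]) for a in atoms)
    return center, radius


# Toy residue with two atoms; the heavier-weighted atom pulls the center.
center, radius = coarse_sphere([(0.0, 0.0, 0.0, 3.0), (4.0, 0.0, 0.0, 1.0)])
print(center, radius)
```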

The coarsest model of the protein is then minimized using the Basin-Hopping algorithm described above. The "endless" representation is rearranged according to the lowest found minimum on the coarsest level and updated by local minimization. Finally the all-atom representation is minimized locally starting from the structure which was calculated by refining the "endless" representation. Each minimum found by this procedure is taken as one step of the global Basin-Hopping algorithm and thus accepted or rejected depending on the potential energy on the finest level. Finally the configuration with the lowest potential energy is taken as the best guess for the protein configuration.
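The outer Basin-Hopping loop with Metropolis acceptance can be illustrated on a one-dimensional toy function. This is a sketch under stated assumptions, not the multilevel method itself: the crude step-wise local minimizer, the double-well test function, and all parameter values are hypothetical stand-ins, and no coarse-level preconditioning is performed.

```python
import math
import random


def double_well(x):
    """Toy energy with two minima of different depth (illustrative only)."""
    return x ** 4 - 3.0 * x ** 2 + x


def local_minimize(f, x, step=0.01, iters=2000):
    """Crude derivative-free descent used as a stand-in local minimizer."""
    fx = f(x)
    for _ in range(iters):
        moved = False
        for cand in (x - step, x + step):
            fc = f(cand)
            if fc < fx:
                x, fx, moved = cand, fc, True
                break
        if not moved:
            break
    return x, fx


def basin_hopping(f, x0, n_steps=200, kick=1.0, temperature=1.0, seed=0):
    """Perturb the last accepted minimum, re-minimize, accept by Metropolis."""
    rng = random.Random(seed)
    x, fx = local_minimize(f, x0)
    best_x, best_f = x, fx
    for _ in range(n_steps):
        trial_x, trial_f = local_minimize(f, x + rng.uniform(-kick, kick))
        delta = trial_f - fx
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, fx = trial_x, trial_f
            if fx < best_f:
                best_x, best_f = x, fx
    return best_x, best_f


best_x, best_f = basin_hopping(double_well, x0=1.0)
print(best_x, best_f)
```

The lowest minimum seen during the run is returned, mirroring the final step in which the configuration with the lowest potential energy is taken as the prediction.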

1. MacKerell A.D. Jr., Brooks B., Brooks C.L., Nilsson L., Roux B., Won Y., Karplus M. (1998) CHARMM: The Energy Function and Its Parameterization with an Overview of the Program, The Encyclopedia of Computational Chemistry, 1, 271-277, P. v. R. Schleyer et al., editors John Wiley & Sons: Chichester. 2. Wales D.J., Doye J.P.K. (1997) Global Optimization by Basin-Hopping and the Lowest Energy Structures of Lennard-Jones Clusters Containing Up to 110 Atoms. Chem. Phys. Lett. 269, 408-412. 3. Jager L. (2002) Zur globalen Minimierung von Energie-Funktionen. Diplomarbeit, Institut für Angewandte Mathematik, Universität Bonn.

jive (P0506) - 37 predictions: 37 3D

JIVE: Protein Structure Prediction by the Assembly of Local Supersecondary Structural Motifs

David F. Burke and Tom L. Blundell Department of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge, CB2 1GA, United Kingdom [email protected]

In the CASP 5 experiment, models of proteins which had low confidence values across the CAFASP3 servers were selected to be modelled by JIVE.

JIVE predicts the structure of small continuous domains of proteins by the assembly of fragments of local supersecondary motifs. Initially, homologous sequences were identified using PSI-BLAST [1] and secondary structure prediction was performed using PHD [2]. The conformational classes of the supersecondary fragments were predicted using SLOOP [3-5]. SLOOP uses sequence/structure profiles derived from a database of loops clustered on the conformation of the loop and surrounding secondary structures. These fragments were then assembled using a Monte Carlo simulation. Unsuitable models were rejected based on excluded volume and a distance-dependent conditional probability function [6].

The generated structures were then searched against protein structures from both the HOMSTRAD [7] database of homologous families and the CAMPASS [8] database using the program SEA [9]. Potential hits were then analysed further for validation. In all, 17 targets were submitted.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402. 2. Rost B. et al. (1994) PHD - an automatic mail server for protein secondary structure prediction. Comput Appl Biosci. 10(1):53-60. 3. Donate L.E. et al. (1996) Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5(12):2600-16. 4. Rufino S.D. et al. (1997) Predicting the conformational class of short and medium size loops connecting regular secondary structures: application to comparative modelling. J Mol Biol. 267(2):352-67. 5. Burke D.F. et al. (2001) Improved Loop prediction from sequence alone. Protein Engineering 14(7):473-478. 6. Samudrala R. et al. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol. 275(5):895-916. 7. Mizuguchi K. et al. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 7:2469-2471. 8. Sowdhamini R. et al. (1996) A database of globular protein structural domains: clustering of representative family members into similar folds. Fold Des 1(3):209-20. 9. Rufino S.D. et al. (1994) Structure-based identification and clustering of protein families and superfamilies. J Comput Aided Mol Des 8(1):5-27.

Jones (P0067) - 121 predictions: 68 3D, 53 SS

Fold Recognition Using THREADER and GenTHREADER

D. T. Jones & L. McGuffin Bioinformatics Unit, Dept. of Computer Science, University College London, Gower Street, London WC1E 6BT [email protected]

THREADER3 is the latest incarnation of our original program to implement threading [1] (D.T. Jones et al. Nature 358, 86-89, 1992). Although it now incorporates a number of new features (in particular the use of PSI-BLAST [2] profiles) and a more refined set of potentials, the overall components of the current implementation remain more or less unchanged since CASP2. The fold library and potentials used throughout CASP5 were based on representative protein chains from the FSSP data bank and domains found in SCOP V1.57.

As for our CASP4 predictions, the raw threading output was evaluated using a neural network (similar to that used in GenTHREADER [3]) trained to discriminate between correct and incorrect fold recognition matches. This method is still very experimental, but it was used for all "non-obvious" prediction targets. Final predictions were based on the neural network output. Predictions for targets where the neural network output (range 0-1) of the top match was < 0.5 were not submitted (but were still considered for ab initio prediction if the size permitted). Only a single prediction was submitted for each target, unless either a second fold had an equal score to the top hit or, in a few cases, more than one alignment was generated with and without secondary structure prediction inputs.

Targets with obvious homology to existing structures were predicted using GenTHREADER and mGenTHREADER [3] as submitted to the CAFASP3 prediction section. However, in making CASP5 submissions, we also considered other models obtained from the CAFASP3 web server. A new program called MODCHECK was used to evaluate the ensemble of collected structures in order to identify the model predicted to have the highest accuracy. MODCHECK is based on the same potentials used for THREADER3, but makes use of a large number of shuffled sequence re-alignments in order to estimate the specificity of the initial sequence-structure alignment.

1. Jones D.T., Taylor W.R. & Thornton J.M. (1992) A new approach to protein fold recognition. Nature 358, 86-89. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 3. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815.

Jones-NewFold (P0068) - 214 predictions: 87 3D, 63 SS, 64 DR

FRAGFOLD, PSIPRED and DISOPRED: Methods for Prediction Of New Folds and Elements of Local Structure

D. T. Jones, J. Ward & L. McGuffin Bioinformatics Unit, Dept. of Computer Science, University College London, Gower Street, London WC1E 6BT [email protected]

For CASP5 targets which we could not reliably predict using fold recognition methods, our FRAGFOLD [1] method was used to generate up to 5 structures. This approach to protein tertiary structure prediction is based on the assembly of recognized supersecondary structural fragments taken from highly resolved protein structures using a simulated annealing algorithm.

For all targets (including CM and FR targets), secondary structure was predicted using PSIPRED [2-3]. PSIPRED predictions in CASP5 (as opposed to CAFASP3) were generated with a database updated at the CASP deadline rather than the CAFASP deadline. Also for CASP targets which were obviously multidomain, PSIPRED predictions were made for the individual domains and then combined. Two new programs were tried at CASP5: PSIPRED-SVM and DISOPRED. PSIPRED-SVM is a reimplementation of PSIPRED using Support Vector Machines rather than neural networks. PSIPRED-SVM was trained on a much smaller dataset than PSIPRED, and yet appears to have equal, if not slightly better performance from our own cross-validation benchmarks. DISOPRED makes use of a variation of the original PSIPRED method to predict disordered regions in proteins. Regions which are predicted by PSIPRED to be coil regions are further analysed using a second neural network trained to identify disordered regions. At present, a crude training set has been used for this network, which is derived by defining missing regions in protein structures (determined by alignment of the PDB SEQRES records with the ATOM records) as regions likely to be disordered. We hope to refine this training set by manual inspection, and by examination of NMR structure ensembles.

Our 3-D submissions were calculated using the following procedure:

1. Selection of fragment library. At each sequence position a list of 10 structural fragments is generated. These fragments (supersecondary motifs or fixed length fragments) are taken from 200 highly resolved protein chains with no chain breaks. The selection process involves ranking the fragments in order of potential energy Z-scores (an ungapped alignment is used for this ranking), and excluding fragments based on the PSIPRED secondary structure prediction.

2. Simulation. A classic Metropolis scheme is employed in running the simulation. Random moves are made by selecting either one of the preselected 10 fragments at a randomly chosen sequence position, or a free choice is made from all 3-5 residue fragments from the entire fold library. These moves are first tested to ensure that the generated structure is physically possible (steric checks) and then accepted if the Metropolis criterion is met. The starting temperature for the simulation is selected by making 500 random moves to the starting conformation and calculating the largest absolute energy change between any two moves. The simulation is started at a temperature corresponding to 10 times this delta-E, and the temperature is halved after either 5000 random moves have been accepted by the Metropolis criterion, or a total of 50000 sterically allowable moves have been tested.

3. Potentials. The THREADER V3.0 potentials were used. These are distance-dependent potentials of mean force compiled from a non-redundant set of protein chains with resolutions < 2.6 Angstroms. Predicted secondary structure was not incorporated in the objective function. Energies were summed over homologous sequences. In addition to the mean force terms, simple terms were added to take account of hydrogen bonding in beta-sheets and steric hindrance. An improved solvation potential was used which allowed us to dispense with an additional chain compactness term.

4. Final model selection. Between 500 and 2000 separate simulations were run in parallel with different random seed values on a farm of 75 dual-CPU Linux machines. The final structures were clustered using a fast iterative rigid-body clustering program (RMSDCLUST) and the representatives of the largest clusters were submitted (up to the CASP maximum of 5) as final predictions.
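The temperature schedule of step 2 can be sketched in a few lines. The class and function names are hypothetical, and resetting both counters after each halving is an assumption; the abstract gives only the starting rule (10 times the largest absolute energy change) and the two halving triggers.

```python
import math
import random


def metropolis_accept(delta_e, temperature, rng):
    """Standard Metropolis criterion for a proposed energy change."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature)


def starting_temperature(energy_changes, factor=10.0):
    """Start at `factor` times the largest absolute trial energy change."""
    return factor * max(abs(d) for d in energy_changes)


class HalvingSchedule:
    """Halve T after max_accepted accepted moves or max_tested tested moves.

    Assumption: both counters restart after every halving.
    """

    def __init__(self, t_start, max_accepted=5000, max_tested=50000):
        self.temperature = t_start
        self.max_accepted = max_accepted
        self.max_tested = max_tested
        self.accepted = 0
        self.tested = 0

    def record(self, accepted):
        """Register one sterically allowed move and update the temperature."""
        self.tested += 1
        if accepted:
            self.accepted += 1
        if self.accepted >= self.max_accepted or self.tested >= self.max_tested:
            self.temperature /= 2.0
            self.accepted = 0
            self.tested = 0


rng = random.Random(0)
schedule = HalvingSchedule(starting_temperature([1.0, -3.0, 2.0]))
schedule.record(metropolis_accept(-0.5, schedule.temperature, rng))
print(schedule.temperature)
```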

1. Jones D.T. (1997) Successful ab initio prediction of the tertiary structure of NK-Lysin using multiple sequences and recognized supersecondary structural motifs. PROTEINS. Suppl. 1, 185-191. 2. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 3. http://www.psipred.net

Kaznessis (P0548) - 15 predictions: 15 3D

Guiding Molecular Mechanics Simulations of Protein Folding with Correlated Mutations Analysis

Yiannis Kaznessis, Himanshu Khandelia and Spyros Vicatos Department of Chemical Engineering and Materials Science, and Digital Technology Center, University of Minnesota, Minneapolis, MN 55455 [email protected]

A combination of correlated mutations analysis and molecular dynamics simulations was used to predict the structure of the target sequences.

CORRELATED MUTATIONS ANALYSIS Multiple sequence alignments (MSA) were built by processing the target sequences with PSI-BLAST and ClustalW [1,2]. A correlated mutations analysis (CMA) was performed on each MSA to predict pairs of amino acids that are proximal in the structure of the protein [3]. Specifically, the CMA consists of calculating three different correlation coefficients for all pairs of positions in the MSA. Principal component analysis was used on 142 experimentally determined amino acid properties [4] to extract three orthogonal descriptors of amino acid properties. The first principal component is associated with the hydrophobicity of the residues, the second is a measure of size, and the third is related to electronic properties of amino acids. The descriptors were used to calculate correlation coefficients for each pair of positions in the MSA. Pairs of positions distant in the alignment (|i-j| > 4) and having the highest coefficients were used to form a set of distance constraints in molecular dynamics simulations. Eventually, 18 pairs of amino acids were proposed to be proximal in the structure of each target protein. Proximity is declared if the C-alpha atoms are closer than 6 Angstrom.
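The pair-selection step can be sketched for a single descriptor. This is an illustration only: a plain Pearson correlation between alignment columns stands in for the coefficient actually used, the three-letter toy descriptor is invented, the alignment is assumed to be gap-free, and the function names are hypothetical.

```python
import math


def column_correlation(msa, i, j, descriptor):
    """Pearson correlation of a residue descriptor between columns i and j."""
    xs = [descriptor[seq[i]] for seq in msa]
    ys = [descriptor[seq[j]] for seq in msa]
    n = len(msa)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a fully conserved column carries no covariation signal
    return sxy / math.sqrt(sxx * syy)


def top_pairs(msa, descriptor, n_pairs=18, min_sep=5):
    """Return the n_pairs highest-scoring column pairs with |i - j| > 4."""
    length = len(msa[0])
    scored = []
    for i in range(length):
        for j in range(i + min_sep, length):
            scored.append((column_correlation(msa, i, j, descriptor), i, j))
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:n_pairs]]


# Toy alignment: columns 0 and 5 vary together, all other columns are fixed.
toy_descriptor = {"A": 0.0, "L": 1.0, "K": -1.0}  # invented values
toy_msa = ["LAAAALA", "KAAAAKA", "LAAAALA", "KAAAAKA"]
print(top_pairs(toy_msa, toy_descriptor, n_pairs=1))
```

In the abstract three such coefficients are computed, one per principal-component descriptor, and the selected pairs become distance constraints for the simulations.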

MOLECULAR DYNAMICS CHARMM [5] was used to carry out molecular dynamics (MD) simulations. The protein's initial configuration was a linear chain. The simulation was carried out at infinite dilution in a continuum dielectric of epsilon=70.0. A restraining spring-like attractive potential was used to bring together the C-alpha atoms of the amino acids predicted to be close by the CMA. The strength of the spring constant was varied in different simulations, so that the energy of the constraints ranged between 0.5% and 3% of the total energy of the protein. The protein was minimized for 5000 steps using the SHAKE algorithm. The minimized structure was then subjected to 10 ps of heating by velocity scaling from 23.3 K to 323.3 K. MD was then carried out for 11 ns with a time step of 2 femtoseconds. The constraints were then relaxed and MD was carried out for another 11 ns with the same time step. We picked the protein conformations having the lowest energies in the latter half of the simulation (i.e. without constraints).
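A spring-like restraint and the length of each MD stage can be illustrated as follows. The flat-bottom harmonic form, the function names, and the value of the spring constant are assumptions; the abstract says only that the potential is spring-like and attractive, and that proximity means a C-alpha distance below 6 Angstrom.

```python
def restraint_energy(distance, k, cutoff=6.0):
    """Flat-bottom harmonic penalty (assumed form).

    Zero when the C-alpha pair is already within the cutoff, quadratic in
    the excess distance otherwise, so the pair is pulled together.
    """
    if distance <= cutoff:
        return 0.0
    return 0.5 * k * (distance - cutoff) ** 2


def md_steps(duration_ns, timestep_fs):
    """Number of integration steps; 1 ns = 1,000,000 fs (integer arithmetic)."""
    return (duration_ns * 1_000_000) // timestep_fs


print(restraint_energy(8.0, k=2.0))  # hypothetical spring constant
print(md_steps(11, 2))
```

Each 11 ns stage at a 2 fs time step corresponds to 5.5 million integration steps.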

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 2. Thompson J.D., Higgins D.G. & Gibson T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, positions-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680. 3. Neher E. (1994) How frequent are correlated changes in families of protein sequences?, Proc. Natl. Acad. Sci., 91, 98-102. 4. Kawashima S., Ogata H. & Kanehisa M. (1999) AAindex: amino acid index database. Nucleic Acids Res. 27, 368-369. 5. Brooks B.R., Bruccoleri R.E., Olafson B.D., States D.J., Swaminathan S. & Karplus M. (1983). CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comp. Chem. 4, 187-217.

keasar (P0429) - 90 predictions: 90 3D

Protein Structure Prediction with an Ant Lion Town Potential

N. Kalisman and C. Keasar Ben-Gurion Univ. of the Negev [email protected]

Due to the inherently rough energy landscape, conformational search simulations of proteins hardly ever converge to the global minimum. This so-called “multiple minima problem” is generally considered a major obstacle for protein structure prediction. An Ant Lion Town Potential (ALTP) is an attempt to take advantage of this malady. Inspired by the impressive earthworks of the small insect [1] as well as a previous scientific work [2], an ALTP has local minima with wide basins of attraction. As a result, reduction of the protein’s conformation space is achieved by the convergence of large regions of the space into single points, namely the local minima.

Wide basins of attraction for local minima are achieved by using five types of energy terms: soft-atom van der Waals [2], long-range hydrophobic term, cooperative hydrophilic term, cooperative hydrogen bond term and soft distance constraint term. The cooperative hydrogen bonding term is used to bias the conformation search towards a predicted secondary structure. The soft distance constraint term is used to force disulfide bonds and (when available) structural insight from fold recognition or homology modeling. The ALTP allows a rapid generation of decoy sets for protein structure prediction by repeated torsion-angle energy minimization of random starting points [3].

During the CASP5 experiment our group predicted 17 targets. Due to computer time and memory limitations we restricted ourselves to proteins and protein fragments of up to 140 residues. For each target a decoy set of 10,000 to 60,000 models was built. The submitted models were selected from the lowest energy percent of the decoys by clustering and visual inspection.

We predicted eleven targets which were given low fold recognition scores by the servers and meta-servers which participated in the CAFASP experiment [4] (T0131, T0135, T0136 fragment, T0148 fragment, T0149 fragment, T0157, T0170, T0172 fragment, T0173 two fragments, T0180 and T0181 fragment). These were considered ab-initio targets.

Secondary structure was assigned to these targets based on the consensus of several secondary structure prediction sites. The predicted secondary structures of remote, but clearly related, homologs of the targets were used to confirm the prediction. Decoy sets were generated and models were chosen for submission as described above. When the secondary structure seemed ambiguous two decoy sets were generated independently. Submitted models were taken from both.

The ALTP approach to protein structure prediction was originally developed for ab-initio predictions. We believe, though, that predicting the structure of large insertions in fold recognition/homology modeling has much in common with ab-initio prediction. In this experiment we have tried for the first time to use the ALTP scheme for fold recognition/homology modeling targets.

We predicted six such targets (T0130, T0138, T0139, T0140 fragment, T0176 and T0188). For each of them we used the most reliable parts of the top-ranking CAFASP [4] model as a template. The distances between the alpha carbons of the template structure were used as (soft) constraints to the energy minimization simulations. Otherwise, the prediction was performed as with the ab-initio targets.

While differing in quite a few details, the prediction scheme presented here is very similar to one used by the Levitt group in CASP4 [5].

1. http://waynesword.palomar.edu/pljuly97.htm 2. Head-Gordon T. and Stillinger F.H. (1993) Predicting polypeptide and protein structures from amino-acid-sequence – Antlion method applied to melittin. Biopolymers 33, 293-303. 3. Levitt M. (1983) Protein folding by restrained energy minimization and molecular dynamics. J. Mol. Biol. 170, 723-764 4. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/ 5. http://predictioncenter.llnl.gov/casp4/abstracts/casp4-abstracts-ab.html#38

KGI-QMW (P0015) - 19 predictions: 19 3D

A Bayesian Network Model for Protein Fold and Superfamily Recognition

D.L. Wild1, A. Raval1,3, and M. Saqi2 1 – Keck Graduate Institute, 2 – Queen Mary School of Medicine and Dentistry, London, 3 – Claremont Graduate University, Dept. of Mathematics [email protected]

A library of Bayesian network models based on SCOP superfamilies and homologous sequences in SWISS-PROT was constructed using the methods outlined in [1] and references therein. The Bayesian network approach is a framework which combines graphical representation and probability theory, which includes, as a special case, hidden Markov models. Our implementation is a Bayesian network which simultaneously learns amino acid sequence, secondary structure and residue accessibility for proteins of known three-dimensional structure. An awareness of the errors inherent in predicted secondary structure and residue accessibility may be incorporated into the model by means of a confusion matrix. The Bayesian network models we have utilized for CASP 5 can thus be seen as extensions of hidden Markov models to incorporate multiple observations and confusion matrices.

In preparation for CASP 5, we modeled 89 superfamilies from SCOP 1.59 using BN1-PRED models and 53 of these using both BN1-PRED and BN2 models [1]. The BN1-PRED models are trained and tested with predicted secondary structure and residue accessibility whilst the BN2 models are trained with DSSP-calculated secondary structure and residue accessibility and tested with predicted secondary structure and residue accessibility. For the BN2 models, a confusion matrix is applied to allow for errors in secondary structure and residue accessibility prediction.

A complete list of modeled superfamilies is given at http://public.kgi.edu/~wild/BN/CASPtrained.html. For the superfamilies that have a low number of representative domains in SCOP 1.59, only BN1-PRED models were trained, using the representative domain sequences themselves as well as sequence homologs of these domains in order to create a larger training set.

The recent release of targets for CASP 5 contained a list of 67 targets. Out of these, we submitted predictions for 18 targets (after filtering out Psi-Blast [2] detectable targets that we deemed suitable for comparative modeling). We first carried out a secondary structure and residue accessibility prediction for each target using JNET [3]. Barring some exceptional targets (targets 130 and 136, described below), the typical method of prediction for each target was as follows:

1) The target was scored against both BN2 and BN1-PRED model libraries by evaluating its posterior probability for belonging to each superfamily in the library.

2) The top 3 superfamilies (as ranked by the posterior score) according to both BN2 and BN1-PRED models were then identified and compared against predictions of other automated fold recognition methods. Any superfamily prediction that we considered to be in the wrong structural class (based on predicted secondary structure) was removed from the top-3 list.

3) The closest template of the target within each of the top ranking superfamilies was found using one of two procedures: (a) Comparing the posterior score of the target to the posterior scores of the representative domains within the superfamily, and identifying the domain whose posterior score was closest to the posterior score of the target as the template. (b) Comparing the Fisher score vector [4] of the target to the Fisher score vectors of the representative domains within the superfamily, and identifying the domain whose Fisher score vector was closest to that of the target (with "closeness" defined in terms of Euclidean distance between Fisher score vectors).

4) After identifying the closest templates within each of the top-ranking superfamilies, the sequences were then individually aligned to the target using either a global (GCG program GAP) or local (GCG program SSEARCH) sequence alignment, depending on the length difference between the template and the target.

5) The template(s) that gave the best alignment were then reported as predictions in AL format.
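The two template-selection procedures of step 3 can be sketched as nearest-neighbor lookups. The function names, the domain labels, and the numeric values are hypothetical; computing the posterior scores and Fisher score vectors from the Bayesian network models is outside the scope of this sketch.

```python
import math


def closest_by_posterior(target_score, domain_scores):
    """Procedure (a): domain whose posterior score is nearest the target's."""
    return min(domain_scores, key=lambda d: abs(domain_scores[d] - target_score))


def closest_by_fisher(target_vec, domain_vecs):
    """Procedure (b): domain with the smallest Euclidean distance between
    its Fisher score vector and that of the target."""
    return min(domain_vecs, key=lambda d: math.dist(domain_vecs[d], target_vec))


# Invented scores for three representative domains of one superfamily.
posterior = {"domA": -120.0, "domB": -95.5, "domC": -140.2}
fisher = {"domA": (0.1, 0.9, 0.3), "domB": (0.8, 0.2, 0.5), "domC": (0.4, 0.4, 0.4)}

print(closest_by_posterior(-100.0, posterior))
print(closest_by_fisher((0.7, 0.3, 0.5), fisher))
```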

Templates for targets 130 and 136 were identified using Psi-Blast, as follows. For target 130, Psi-Blast against the nr database identified the target as a nucleotidyltransferase. A search against SCOP for the keyword “nucleotidyltransferase” found the domain 1kny, which gave a good alignment to the target and was reported in the final prediction. For target 136, Blast detects a carboxyl transferase conserved domain and Psi-Blast reports a hit against 1bob, which is an N-acyltransferase and gives a good alignment to the target. 1bob was reported in the final prediction.

1. Raval A. et al. (2002) A Bayesian network model for protein fold and remote homologue recognition. Bioinformatics 18(6), 788-801. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 3. Cuff J.A. and Barton G.J. (2000) Application of multiple sequence alignment profiles to improve secondary structure prediction. Proteins, 40, 502-11. 4. Jaakkola T. et al. (2000) A discriminative framework for detecting remote protein homologies. J. Comput. Biol. 7 (1-2), 95-114.

KIAS (P0531) - 479 predictions: 176 3D, 303 SS

Prediction of Protein Secondary Structure using PREDICT, a Novel Method Based on Pattern Matching

Keehyung Joo1, Ilsoo Kim1, Julian Lee1, Seung-Yeon Kim1, Sung Jong Lee1,2, and Jooyoung Lee1 1 School of Computational Sciences, Korea Institute for Advanced Study 2 Department of Physics, Suwon University [email protected]

We introduce a novel method for secondary structure prediction, PREDICT (PRofile Enumeration DICTionary). This method uses a concept of distance between patterns. For a given protein sequence, this method uses PSI-BLAST (Position Specific Iterative Basic Local Alignment Search Tool) to generate profiles, which define patterns for amino acid residues. Each pattern is compared with those in the pattern database generated from the PDB (Protein Data Bank), and the patterns close to the query pattern are selected to determine the secondary structure of the query residue. This method combines the idea of the nearest-neighbor method of Yi and Lander [1] with the profile generating technology of PSI-BLAST [2].

To elaborate on our method, we first generate profiles for the query sequence and also for those in the database using the PSI-BLAST multiple sequence alignment. The profile defines a pattern for each residue of these sequences by considering seven neighboring residues to the left and right of the given residue, which makes a window of size 15. The pattern is a 15 x 21 matrix, where 21 stands for the 20 amino acid types plus one indicating the N- and C-terminal ends of the protein sequence. The elements of this matrix are the relative frequencies of amino acids observed in the multiple sequence alignment. The distance between patterns is defined by $D_{jk} = \sum_{\ell} W_{\ell} \, |P_j^{\ell} - P_k^{\ell}|$, where $P_j^{\ell}$ ($\ell = 1, \ldots, 315$) are the $\ell$-th components of the pattern j, and $W_{\ell}$ are the weight parameters. For each pattern in the database, we calculate the secondary structure according to the DSSP (Dictionary of Protein Secondary Structure) [3]. We use the 3-state classification for the secondary structure: H (helix), E (extended), and C (coil). We compare the query pattern with those in the pattern database. This database is called the first-layer database, and consists of 7777 proteins selected from the PDB, with 1988085 residues. With a preset number N, we choose the N nearest patterns. A naïve way for the prediction would be to use the secondary structure of the majority of these N patterns as the prediction. We call it the first-layer prediction. However, instead of performing the first-layer prediction, we count the number of patterns corresponding to H, E, C, which defines a 15 x 4 matrix for the query residue. We construct this matrix for each residue in the set of non-homologous proteins CB513, consisting of 513 proteins with 84119 residues. These matrices constitute the second-layer pattern database. The 15 x 4 matrix pattern of a query residue is now compared with those in the second-layer pattern database, and the N closest ones are chosen.
The secondary structure of the majority of these patterns is used as the final prediction.
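The weighted pattern distance and the simple majority vote over the N nearest patterns (the naive first-layer prediction) can be sketched as follows. The function names and the three-element toy patterns are illustrative assumptions; real patterns have 315 elements, and the second-layer comparison is not reproduced here.

```python
from collections import Counter


def pattern_distance(p, q, weights):
    """Weighted L1 distance: sum over elements of W * |P_j - P_k|."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, p, q))


def nearest_neighbor_vote(query, database, weights, n_neighbors):
    """Majority secondary-structure label among the n_neighbors closest patterns.

    database: list of (pattern, label) pairs with label in {"H", "E", "C"}.
    """
    ranked = sorted(
        database, key=lambda entry: pattern_distance(query, entry[0], weights)
    )
    votes = Counter(label for _, label in ranked[:n_neighbors])
    return votes.most_common(1)[0][0]


# Toy database with flattened 3-element patterns (real ones have 315).
toy_db = [
    ([0.0, 0.0, 0.0], "H"),
    ([0.1, 0.0, 0.0], "H"),
    ([1.0, 1.0, 1.0], "E"),
    ([0.9, 1.0, 1.0], "E"),
    ([1.0, 0.9, 1.0], "E"),
]
print(nearest_neighbor_vote([0.05, 0.0, 0.0], toy_db, [1.0, 1.0, 1.0], 3))
```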

Since we expect the pattern elements near the center residue to be more important in defining the distance, we use an initial guess for the weights of $W_{\ell} = |8 - |8 - j||^2$, where j (j = 1, ..., 15) is the index labeling the residue corresponding to element $\ell$. We call this parameter set W0. We then optimize the parameters. We use the first-layer prediction for this purpose, and the database used consists of the patterns from the CB513 set only. We also use a different method of prediction in this case. We select three sets of 100 nearest patterns whose secondary structures are H, E, and C, respectively. We calculate the average distance of the patterns in each set from the query pattern. We use the secondary structure of the group nearest to the query residue as the prediction. The Q3 value for W0 is 71.0%. For the residues whose predicted secondary structures are different from the correct one, the parameters are modified by a small amount in such a way that the set with the correct secondary structure becomes closer to the query pattern relative to the other groups. This procedure is iterated, with the final percentage of correct prediction being 73.1%. We call this optimized parameter set W300. W300 is used only in the step of the first-layer prediction.
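The initial weight set W0 can be generated directly from the formula. The function name and the row-major ordering of the 315 elements (all 21 entries of window position 1 first, then position 2, and so on) are assumptions; the abstract does not state how the matrix is flattened.

```python
def initial_weights(window=15, n_types=21):
    """Initial weights W0: (8 - |8 - j|)^2 for window position j = 1..15.

    Assumption: the 15 x 21 pattern is flattened row by row, so every
    element belonging to window position j shares the same weight.
    """
    center = (window + 1) // 2  # 8 for a window of 15
    weights = []
    for j in range(1, window + 1):
        w = abs(center - abs(center - j)) ** 2
        weights.extend([w] * n_types)
    return weights


w0 = initial_weights()
print(len(w0), w0[0], max(w0))
```

The weights rise from 1 at either edge of the window to 64 at the central residue.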

The predictions with parameters W300 and N=200, W0 and N=200, W300 and N=100, W0 and N=200,300 were chosen as models 1, 2, 3, 4, and 5, respectively. The preliminary Q3 value for secondary structure prediction on the CB513 set, using the 7777-protein set as the database, is about 80%.

1. Yi T. et al. (1993) Protein Secondary Structure Prediction using Nearest-Neighbor Methods. J. Mol. Biol. 232, 1117-1129.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
3. Kabsch W. et al. (1983) Dictionary of Protein Secondary Structure: Pattern Recognition of Hydrogen-Bonded and Geometrical Features. Biopolymers 22, 2577-2637.

KIAS (P0531) - 479 predictions: 176 3D, 303 SS

Prediction of Protein Tertiary Structure using PROFESY, a Novel Method Based on Pattern Matching and Fragment Assembly

Julian Lee, Seung-Yeon Kim, Keehyung Joo, Ilsoo Kim, Saejoon Kim, and Jooyoung Lee School of Computational Sciences, Korea Institute for Advanced Study [email protected]

We introduce a novel method for tertiary structure prediction, PROFESY (PROFile Enumerating SYstem). This method utilizes secondary structure prediction information and fragment assembly. The secondary structure prediction is performed using the method PREDICT (PRofile Enumeration DICTionary) recently developed by our group, which uses a concept of distance between patterns. For a given protein sequence, this method uses PSI-BLAST to generate profiles, which define patterns for amino acid residues. Each pattern is compared with those in the pattern database generated from the PDB, and the patterns close to the query pattern are selected to determine the secondary structure of the query residue. In order to construct the tertiary structure, we also collect the backbone dihedral angles along with these patterns. These constitute a library of fragments for a given protein sequence.

By construction, the secondary structure of the tertiary structure obtained from PROFESY agrees with that predicted by PREDICT. In order to obtain the optimal tertiary packing of these secondary structure elements, we define a score function based on the number of long-range hydrogen bonds, the radius of gyration, and inter-residue Lennard-Jones interactions to avoid steric clashes. Fragments are replaced by ones from the library so that the score function is minimized. The minimization is performed by a powerful global optimization method, conformational space annealing (CSA) [1], which enables one to sample diverse low-lying minima of the score function. A square-well type function is used for the radius of gyration: whenever the radius of gyration is greater than an upper bound $R_{\max}$, only the radius of gyration is minimized; otherwise, the other terms in the score function are used. $R_{\max} = (3 N_{res}/0.026/3.14)^{1/3}$ was used for targets T147-T163, where $N_{res}$ is the number of residues, and a smaller value of $R_{\max} = (3 N_{res}/0.026/3.14)^{1/3}/1.2$ was used for targets T167-T194. The hydrogen bonding term was introduced into the score function for targets T167-T194. Since the hydrogen bonding term favors alpha helices, we included hydrogen bonding energy terms only between residues separated by more than 5 positions in sequence. This restriction was implemented for targets T178-T194. SASA solvation terms in CHARMM were used for T129-T134, and ASAP solvation terms were used for T147-T163, but the resulting conformations were not satisfactory. We realized that, since there are no side chains in our models, it is unphysical to use the solvation terms. Therefore, the solvation terms were discarded for targets T167-T194.
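The radius-of-gyration bound and the square-well switching can be sketched as follows. The formula is transcribed as written in the text; the function names, the `shrink` parameter (1.0 or 1.2), and the tagged return value are assumptions for illustration. The abstract does not say how scores from the two regimes are compared inside CSA, so the sketch only reports which regime applies.

```python
import math

CONSTANT = 0.026  # constant from the text; its physical meaning is not stated there

def r_max(n_res, shrink=1.0):
    """Upper bound (3 * N_res / 0.026 / 3.14)**(1/3) / shrink, as written in the text."""
    return (3.0 * n_res / CONSTANT / 3.14) ** (1.0 / 3.0) / shrink

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_sq = sum(sum((p[k] - center[k]) ** 2 for k in range(3)) for p in coords) / n
    return math.sqrt(mean_sq)

def square_well_score(coords, other_terms, shrink=1.0):
    """Above the bound only the radius of gyration counts; below it only the other terms."""
    rg = radius_of_gyration(coords)
    if rg > r_max(len(coords), shrink):
        return ("radius_of_gyration", rg)
    return ("other_terms", other_terms(coords))
```

Here `coords` is one point per residue and `other_terms` is any callable returning the hydrogen-bond and Lennard-Jones contributions for a conformation.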

To select the five best conformations, we first clustered the conformations, selected the five largest clusters, and then chose the conformation at the center of each cluster. For targets T167-T194, a score function was used as the selection criterion. For T167-T175, the structure with the largest number of hydrogen bonds was chosen from each cluster. For T176-T194 we introduced a new criterion based on burial of hydrophobic residues and exposure of hydrophilic residues. This term is based on the exposed volume in the Reduced Radius Independent Gaussian Sphere (RRIGS) approximation [2]. Because of time constraints, this score function was not implemented directly in the conformational search algorithm, but was used as the selection criterion for the top structures.

1. Lee J. et al. (1997) New optimization method for conformational energy calculations on polypeptides: Conformational Space Annealing. J. Comp. Chem. 18 (9), 1222-1232.
2. Lee J. et al. (1998) Conformational analysis of the 20-residue membrane-bound portion of Melittin by Conformational Space Annealing. Biopolymers 46, 103-115.
3. Lee J. et al. (1999) Conformational Space Annealing by parallel computations: extensive conformational search of Met-enkephalin and the 20-residue membrane-bound portion of Melittin. Int. J. Quant. Chem. 75, 255-265.
4. Lee J. et al. (1999) Energy-based de novo protein folding by conformational space annealing and an off-lattice united-residue force field: Application to the 10-55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc. Natl. Acad. Sci. USA 96, 2025-2030.
5. Liwo A. et al. (1999) Protein structure prediction by global optimization of a potential energy function. Proc. Natl. Acad. Sci. USA 96, 5482-5485.
6. Lee J. et al. (1999) Calculation of protein conformation by global optimization of a potential energy function. PROTEINS: Structure, Function, and Genetics 3:204-208.
7. Lee J. et al. (2000) Hierarchical energy-based approach to protein-structure prediction: Blind-test evaluation with CASP3 targets. Int. J. Quant. Chem. 77, 90-117.
8. Augspurger J. D. et al. (1996) An efficient, differentiable hydration potential for peptides and proteins. J. Comp. Chem. 17 (13), 1549-1558.

Kim-Park (P0442) - 65 predictions: 65 SS

Protein Secondary Structure Prediction by Support Vector Machines and Position-specific Scoring Matrices

H. Kim and H. Park University of Minnesota, Twin Cities, MN 55455, U.S.A. [email protected], [email protected]

The prediction of protein secondary structure is an important step toward the prediction of the tertiary structure of a protein. We introduce the SVMpsi method, which uses support vector machines (SVMs) and the position-specific scoring matrix (PSSM) generated by PSI-BLAST and achieves improved prediction accuracy [1].

The final position-specific scoring matrix from PSI-BLAST [2] against the SWALL non-redundant protein sequence database is used. We applied PFILT [3] to mask out low-complexity regions, coiled-coil regions, and transmembrane spans. For PSI-BLAST, an inclusion E-value threshold of 0.001 and three iterations were used to search the non-redundant sequence database.

We designed two additional tertiary classifiers based on the one-versus-one scheme and the directed acyclic graph scheme [4]. The one-versus-one classifier for secondary structure prediction chooses the majority result of three classifiers, H/E, E/C, and C/H. Many test results show that one-versus-one classifiers are more accurate than one-versus-rest classifiers, because the one-versus-rest scheme often needs to deal with two data sets of very different sizes, i.e., unbalanced training data [5]. However, a potential problem of the one-versus-one scheme is that the voting might suffer from incompetent classifiers. For example, when the test point is a helix, the result from the one-versus-one classifier E/C, which is not related to helix, inappropriately contributes to the decision. We can reduce this problem by using the directed acyclic graph (DAG) scheme, which classifies a new data point after 2 binary classifications for a 3-class problem without influence from incompetent classifiers. For example, if the test point is predicted to be E (not C) by the E/C classifier, then the H/E classifier is applied, while if the point is predicted to be not sheet (~E) by the E/C classifier, the C/H classifier is applied to tell whether it is coil or helix.
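The two combination schemes can be illustrated with a short sketch in which each binary classifier is a callable returning one of its two labels. The trained SVMs are not reproduced here: `toy_pair` is a hypothetical stand-in that compares two made-up scores, and returning `None` on a three-way tie is an assumption, since the abstract does not describe tie handling.

```python
from collections import Counter

def one_versus_one(x, clf_he, clf_ec, clf_ch):
    """Majority vote over the three pairwise classifiers H/E, E/C and C/H."""
    votes = Counter([clf_he(x), clf_ec(x), clf_ch(x)])
    label, count = votes.most_common(1)[0]
    return label if count >= 2 else None  # one vote each: undecided

def dag(x, clf_he, clf_ec, clf_ch):
    """Two binary decisions: E/C first, then H/E if E was chosen, else C/H."""
    if clf_ec(x) == "E":
        return clf_he(x)
    return clf_ch(x)

def toy_pair(a, b):
    """Hypothetical binary classifier: picks the label with the larger score in x."""
    return lambda x: a if x[a] >= x[b] else b

clf_he, clf_ec, clf_ch = toy_pair("H", "E"), toy_pair("E", "C"), toy_pair("C", "H")
point = {"H": 0.9, "E": 0.2, "C": 0.5}
print(one_versus_one(point, clf_he, clf_ec, clf_ch), dag(point, clf_he, clf_ec, clf_ch))
```

For a helix-like point, the DAG path consults E/C only to decide which of the two helix-aware classifiers to apply, so the E/C output never counts as a vote.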

The new protein secondary structure prediction method SVMpsi achieves Q_3=76.1% and SOV94 = 79.6% on the RS126 set and Q_3=76.6% and SOV94 = 80.1% on the CB513 set through seven-fold cross validation, which outperforms the other existing methods that we are aware of. We prepared the KP480 set from the CB513 set, on which the prediction accuracy of SVMpsi was Q_3=78.5% and SOV94 = 82.8%. Moreover, we assembled a set of 136 protein sequences for a blind test. The blind test results were Q_3=77.2% and SOV94 = 81.8%. This shows that the support vector machine approach is another good method for predicting protein secondary structure. The major improvement of the new SVMpsi method comes from the PSI-BLAST PSSM profile and the new optimization strategy in SVM for maximizing the Q_3 measure.

1. Kim H. and Park H. (2002) Protein Secondary Structure Prediction by Support Vector Machines and Position-specific Scoring Matrices. Submitted to Bioinformatics.
2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
3. Jones D. T. and Swindells M. B. (2002) Getting the most from PSI-BLAST. TRENDS in Biochemical Science 27, 161-164.
4. Heiler M. (2002) Optimization Criteria and Learning Algorithms for Large Margin Classifiers. Diploma Thesis, University of Mannheim.
5. Hsu C. W. and Lin C. J. (2002) A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13, 415-425.

LAMBERT-Christophe (P0035) - 131 predictions: 131 3D

Evaluation of Different Methods for Comparative Modeling

C. Lambert, N. Léonard, B. Damien and E. Depiereux Unité de Recherche en Biologie Moléculaire, Facultés Universitaires Notre-Dame de la Paix, rue de Bruxelles 61, 5000 Namur, Belgium [email protected]

The aim of our work is to compare different approaches for comparative modeling.

Model 1 was built by running ESyPred3D [1] on the best template chosen by the CAFASP3 jury.

Model 2 was built by homology modeling using the MOE [4] program and the best template selected by the PDB-BLAST feature of MOE. No minimization or dynamics were done. This method was used to provide a comparison with the more complex modeling techniques used in our group.

Model 3 was built by homology modeling using the MOE [4] program and the best templates selected by the PDB-BLAST feature of MOE. The best template was chosen as the primary template, and 1-3 others were used to model parts not present in the primary template. The model obtained was compared with models obtained from Swiss-Model and ESyPred3D. The model was then thoroughly analyzed and corrected accordingly using restraints on angles, distances and dihedrals. Finally, the model underwent energy minimization and molecular dynamics until only a minimal number of dihedral, angle and distance errors remained.

Models 4 and 5 were built by using ESyPred3D [1] to find possible templates and pairwise alignments between the query sequence and each chosen template. A multiple structure alignment between all templates was built using the STAMP [3] program. The pairwise alignments between the query and each template and the multiple structure alignment were combined to obtain a multiple alignment between the query and all templates. This multiple sequence alignment was used by MODELLER [2] to build the final model.

ESyPred3D web site: http://www.fundp.ac.be/urbm/bioinfo/esypred

1. Lambert C. et al. (2002) ESyPred3D: Prediction of proteins 3D structures. Bioinformatics 18 (9), 1250-1256.
2. Sali A. et al. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234 (3), 779-815.
3. Russell R. B. et al. (1992) Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 14 (2), 309-323.
4. Chemical Computing Group Inc., Montreal, Quebec, Canada.

LIBELLULA (P0230) - 216 predictions: 216 3D

A New Web Server For Fold Recognition Evaluation

O. Graña (1), D. Juan (1), F. Pazos (2), P. Fariselli (3), R. Casadio (3) and A. Valencia (1) (1) Protein Design Group, National Center for Biotechnology, CNB-CSIC, Spain; (2) ALMA Bioinformatica, Tres Cantos, Madrid, Spain; (3) CIRB Biocomputing Unit and Lab. of Biophysics, Dept. of Biology, University of Bologna, Italy

This approach improves the selection of correct folds from among the results of two methods implemented as web servers (SAMT99 [2] and 3DPSSM [3]).

LIBELLULA is based on the training of a system of neural networks with models generated by the servers and a set of associated characteristics such as the quality of the sequence-structure alignment, distribution of sequence features (sequence conserved positions and apolar residues) and compactness of the resulting models.

It has been implemented as an automatic system available as a public web server at http://www.pdg.cnb.uam.es/servers/libellula.html.

1. Juan D., Graña O., Pazos F., Fariselli P., Casadio R. and Valencia A. A neural network approach to evaluate fold recognition results. Accepted in Proteins: Structure, Function and Genetics.
2. Karplus K., Barrett C., Hughey R. Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846-856.
3. Kelley L.A., MacCallum R.M., Sternberg M.J. Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499-520.

Lomize-Andrei (P0288) - 76 predictions: 76 3D

Fold Recognition and Homology Modeling of Protein Cores

A.L. Lomize, I.D. Pogozheva, and H.I. Mosberg College of Pharmacy, University of Michigan, Ann Arbor, MI [email protected]

3D models of 66 CASP5 target proteins were generated using several different techniques. Thirty targets were modeled simply by homology using PSI-BLAST searches, our in-house software, and QUANTA. The initial PSI-BLAST alignments were often corrected to superimpose hydrophobic residues of the target on the water-inaccessible positions of the experimental template, and to remove any gaps within regular secondary structures, which is a part of our threading approach. The side-chain packing procedure applied provided geometrical “tracing” of template side-chains, removal of significant hindrances, and optimization of hydrogen bonding.

The remaining 36 targets had no or very low sequence homology to experimental structures. They were modeled using the following steps: (1) prediction of secondary structure, β-sheet topology, and general protein “architecture” based on the hydrophobicity patterns observed in multiple sequence alignments [1]; (2) fold recognition using the program MIMIC that is under development in our group; (3) selection of the most probable template, also taking into account predictions of the 3D-PSSM server and the biological function of the target; (4) refinement of the target-template alignment and generation of the corresponding full atomic model. The modeling included human intervention. Only T0131 was modeled ab initio by a hierarchic assembly of α-helices and β-sheets [2].

Our fold recognition program, MIMIC, was designed to identify the lowest energy sequence-structure alignment and the lowest energy experimental template. The optimal and suboptimal sequence-structure alignments are generated first using the dynamic programming algorithm and approximate energy functions. At this step, the thermodynamic stability of the protein core (excluding nonregular loops) is estimated as the sum of backbone energy, secondary structure propensities, interactions between side-chains in α-helices and β-sheets, and transfer energies of side-chains from water to the protein interior [2]. Next, a set of 3D models, which correspond to the selected low-energy sequence-structure alignments, must be evaluated using more rigorous all-atom free energy functions that have been derived recently from mutagenesis data [3]. At the beginning of the CASP5 experiment, our program MIMIC was only at the initial development stage, with many important options (including all-atom threading) not yet implemented. Therefore, its actual performance still remains to be tested.

1. Lomize A.L. and Mosberg H.I. (1997) Thermodynamic model of secondary structure for alpha-helical peptides and proteins. Biopolymers 42 (2), 239-269.
2. Lomize A.L., Pogozheva I.D., and Mosberg H.I. (1999) Prediction of protein structure: The problem of fold multiplicity. Proteins 37 (Suppl. 3), 199-203.
3. Lomize A.L., Reibarkh M.Y., and Pogozheva I.D. (2002) Interatomic potentials and solvation parameters from protein engineering data for buried residues. Protein Sci. 11 (8), 1984-2000.

luethy (P0419) - 240 predictions: 240 3D

Unified Prediction Approach for Comparative Modelling and Ab-initio Predictions

R. Luethy Amgen Inc. [email protected]

Sequence and structural similarities vary gradually between proteins in the same family. Here we attempted to use the same overall approach to structure prediction for all target classes. The first step was to generate a multiple sequence alignment for the target; the sequence profile method [2] was then used to find similar sequences in the PDB [1]. The multiple sequence alignment was checked and adjusted manually if needed. It was also used to guess domain boundaries in potential multidomain proteins, and a separate multiple sequence alignment was made for each predicted domain. The highest scoring PDB sequences were then aligned with the target sequence profile, and a database of structural fragments was generated from these alignments. A fragment was defined by an ungapped region in the alignment. These fragments were then used in a folding procedure which has the following components [3]: (i) a simplified representation of protein structures that can be locally modified; (ii) a structure modification method based on selecting blocks from known 3D structures; and (iii) evaluation of structures and optimization. The simplified models are based on a sequence of internal coordinates: the torsion angles between four consecutive Cα atoms and the angles between three consecutive Cα atoms. Different structures were generated by randomly selecting blocks from the database of structural blocks derived from the profile alignments and substituting them into the model. To evaluate structures, Cartesian coordinates for the Cα and Cβ atoms were reconstructed using constants for all distances and for the angles needed to reconstruct the Cβ positions. These structures were then evaluated using knowledge-based potentials derived from known 3D structures. The potentials used were a residue-specific pairwise distance potential, a residue-specific number-of-contacts potential, a compactness function and a penalty for too close contacts.

For targets suitable for comparative modeling the following additions were made. If the sequence similarity between the target and its best match in the PDB was significant, the fragments from the corresponding structure were inserted into the starting model and were not allowed to change during the optimization. Their relative positions were constrained by a distance matrix derived from the known structure. After the structure optimization, all atom coordinates were reconstructed in the following way: first, all coordinates from the PDB fragments were copied; then missing backbone atoms were inserted by looking up the closest 5-residue backbone fragment in the PDB; finally, missing side-chain atoms were copied from the closest 5-residue fragment from the PDB with the same residue in the middle. The structure was then minimized with TINKER [4] using a steepest descent method with fixed Cα atoms.

1. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E. (2000) The Protein Databank. Nucleic Acids Research 28, 235-242.
2. Gribskov M., McLachlan A.D. and Eisenberg D. (1987) Profile analysis: Detection of distantly related proteins. PNAS 84, 4355-4358.
3. Zhu J. and Luethy R. (2002) Three-dimensional structure prediction using simplified structure models and Bayesian block fragments. In: Protein structure prediction. Bioinformatics approach. Ed. Igor F. Tsigelny, International University Line, pp. 85-107.
4. Ponder J.W. and Richards F.M. (1987) An Efficient Newton-like Method for Molecular Mechanics Energy Minimization of Large Molecules. J. Comput. Chem. 8, 1016-1024 (http://dasher.wustl.edu/tinker/).

Lund-Ole (P0391) - 39 predictions: 39 3D

X3M – a Computer Program to Extract 3D Models

O. Lund, M. Nielsen, C. Lundegaard and P. Worning. Center for Biological Sequence Analysis, Biocentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark [email protected]

Summary A novel method was developed for fold recognition/homology modeling, in which a large sequence database is iteratively searched to construct a sequence profile until a template can be found in a database of proteins with known structure. The method differs from the PDB-BLAST method in that a sequence profile is only made if a template is not readily found in the database of known structures. A sequence profile is subsequently made for the template, using the same number of PSI-BLAST iterations that were used to identify it. Query and template sequences are then aligned using a score based on profile-profile comparisons. The alignment score is modified so as to ensure that unreliable parts of the alignment are discarded.

Background A problem often encountered when doing iterative sequence searches in a database is that the search may go astray and start picking up unrelated sequences, often with hydrophobic or low-complexity regions. It has been found that using PSI-BLAST [1] to build a profile from a sequence database and subsequently using this profile to search a database of proteins with known structures (PDB-BLAST) works better than searching one merged database [2]. We have developed a method related to PDB-BLAST in which we only perform iterative searches against the sequence database if no match can be found in the database of proteins with known structure.

It has been shown that methods based on profile-profile alignment can produce more accurate alignments than methods based on sequence-profile or sequence-sequence alignment [3]. A number of different methods for scoring two profiles against each other have been suggested in recent years: the average score between all amino acid pairs according to the probability distribution in each profile [2], the probability that the same amino acid is found in given positions in the two profiles (the dot product of the amino acid probability vectors) [4], the probability that two amino acid distributions are the same [5], or combinations of different profile-profile scores with other scoring terms [6]. Kelley et al. [7] use the average of the alignment score of the query profile with the template sequence and that of the query sequence with the template profile for fold recognition. Here we take that average for each residue pair and use it as a scoring matrix for the alignment algorithm. This approach has the advantage that it reduces to classical sequence-sequence alignment when no homologous proteins can be found.

In CASP4 Venclovas [8] successfully selected correctly aligned regions by discarding regions which aligned differently in different BLAST searches. Another way to select reliable parts of the alignment is to change the scoring matrix that is used to align the two proteins. It has been found that scoring matrices with low PAM values (corresponding to high BLOSUM values) are appropriate for making shorter alignments [9]. Subtracting a number from the scoring matrix also leads to shorter but more accurate alignments [10,3]. BLOSUM alignment scores S are often measured in half bits and derived from log-odds scores, $S = 2 \log_2(Q_{ij}/(P_i P_j))$ [11]. In this case, subtracting two from the alignment score corresponds to demanding that the probability $Q_{ij}$ of finding amino acids i and j aligned must be twice as big as the background probability $P_i P_j$ in order for S to be positive. We have used this method in an attempt to make a reliable profile-profile alignment.
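The effect of shifting a half-bit log-odds score can be checked numerically. The sketch below assumes the half-bit form S = 2 log2(Qij / (Pi Pj)); the probabilities used in the example are made-up values chosen only to show where the sign changes.

```python
import math

def half_bit_score(q_ij, p_i, p_j):
    """Log-odds score in half bits: 2 * log2(Q_ij / (P_i * P_j))."""
    return 2.0 * math.log2(q_ij / (p_i * p_j))

def shifted_score(q_ij, p_i, p_j, shift=2.0):
    """Score after subtracting a constant; positive only above a higher odds ratio."""
    return half_bit_score(q_ij, p_i, p_j) - shift

# Illustrative numbers: background P_i * P_j = 0.01
for q in (0.01, 0.02, 0.04):
    print(q, round(shifted_score(q, 0.1, 0.1), 6))
```

With a shift of two half bits, a pair observed exactly twice as often as expected scores zero, so only pairs with an odds ratio above two contribute positively to the alignment.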

Databases A FASTA file containing all PDB entries (pdb) was downloaded from NCBI (ftp://ftp.ncbi.nih.gov/blast/db/pdbaa.Z). A non-redundant database of known protein sequences (sp) was compiled from files downloaded from Swiss-Prot (ftp://ftp.ebi.ac.uk/pub/databases/sp_tr_nrdb/fasta/*.fas.gz). PDB entries were downloaded from RCSB (ftp://ftp.rcsb.org/pub/pdb/data/structures/all/pdb/).

Template identification The program blastpgp [1] was used to search the databases. In order to find a template, the query sequence was run against the pdb database. If a template could not be found with an E-value of less than 0.05, the sequence was run for two iterations against sp, and a binary checkpoint file was saved as well as the position-specific scoring matrix in ASCII format (blastpgp does not update these files after the last iteration, so the saved files correspond to the profile obtained after the first iteration). The checkpoint file was used to restart a blastpgp search of the query sequence against the pdb database. The procedure of iteratively using the sp database to generate a profile that in turn is used to search the pdb database was continued until a template was found with an E-value of less than 0.05 or a total of five iterations against the pdb database had been performed.

Alignment If a template was identified, we attempted to improve the alignment by performing a profile-profile alignment. In order to make a sequence profile for the template sequence we ran the template sequence the same number of iterations as the query sequence against the sp database and saved the scoring matrix in ASCII format. If no sequence profile was generated for either the query or the template sequence, it was constructed from a blosum62 matrix [11]. A scoring matrix Sij was constructed based on the two profiles.

$S_{ij} = (QP_i(TA_j) + TP_j(QA_i))/2 - 1$

where $QP_i(TA_j)$ is the score of residue j in the template sequence with the profile at position i in the query sequence, and $TP_j(QA_i)$ is the score of residue i in the query sequence with the profile at position j in the template sequence. These two scores were averaged, and 1 was subtracted to reduce the lengths of the alignments and make them more accurate. The query was then aligned to the template using a local alignment algorithm [12], with the maximum number of gaps set to 20, a first gap penalty of 11, and a gap elongation penalty of 1.
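The construction of the averaged, shifted scoring matrix and a local alignment over it can be sketched as below. This is an illustration under stated assumptions rather than the X3M code: each profile is represented as a list of per-position dictionaries mapping amino acids to scores; the gap penalties are read as 11 for the first gapped position and 1 for each further position; the limit of 20 gaps is not implemented; and only the best score is returned, without traceback.

```python
def profile_profile_matrix(q_seq, q_pssm, t_seq, t_pssm, shift=1.0):
    """S[i][j] = (QP_i(TA_j) + TP_j(QA_i)) / 2 - shift."""
    return [
        [(q_pssm[i][t_seq[j]] + t_pssm[j][q_seq[i]]) / 2.0 - shift
         for j in range(len(t_seq))]
        for i in range(len(q_seq))
    ]

def local_alignment_score(s, gap_first=11.0, gap_extend=1.0):
    """Best local alignment score over matrix s with affine gaps (Gotoh recursion)."""
    n, m = len(s), len(s[0])
    neg = float("-inf")
    h = [[0.0] * (m + 1) for _ in range(n + 1)]  # best score ending at (i, j)
    e = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap in the query
    f = [[neg] * (m + 1) for _ in range(n + 1)]  # ending with a gap in the template
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = max(h[i][j - 1] - gap_first, e[i][j - 1] - gap_extend)
            f[i][j] = max(h[i - 1][j] - gap_first, f[i - 1][j] - gap_extend)
            h[i][j] = max(0.0, h[i - 1][j - 1] + s[i - 1][j - 1], e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best
```

When no profile exists for a sequence, each per-position dictionary can simply be the substitution-matrix row of that residue, in which case the averaged score reduces to an ordinary sequence-sequence score minus the shift.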

Modeling The corresponding atoms derived from the alignment can be extracted from the template file and used as a starting point for homology modeling. Missing atoms were added using the segmod program [13] from the GeneMine package (www.bioinformatics.ucla.edu/genemine/). The structures can then be refined using the encad program [14], also from the GeneMine package. The modeling step was not in place for CASP5, so only alignments were submitted.

Submissions Alignments were submitted for 41/67 (61%) of the targets (T0130, T0132, T0133, T0137, T0140, T0141, T0142, T0143, T0144, T0149, T0150, T0151, T0152, T0153, T0154, T0155, T0158, T0160, T0163, T0164, T0165, T0166, T0167, T0169, T0171, T0172, T0175, T0178, T0179, T0182, T0183, T0184, T0185, T0186, T0188, T0189, T0190, T0191, T0192, T0193, T0195). We only submitted alignments for targets where we estimated that it was at least 95% certain that we had identified the correct fold. We furthermore sought to perform the alignment in such a way that regions where a reliable alignment could not be made were excluded. We look forward to seeing whether this strategy worked and to comparing our results with those submitted by other groups.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Rychlewski L., Zhang B., Godzik A. (1998) Fold and function predictions for Mycoplasma genitalium proteins. Fold Des. 3 (4), 229-38.
3. Jaroszewski L., Rychlewski L., Godzik A. (2000) Improving the quality of twilight-zone alignments. Protein Sci. 9 (8), 1487-96.
4. Lyngsø R.B., Pedersen C.N., Nielsen H.R. (1999) Metrics and similarity measures for hidden Markov models. Proc Int Conf Intell Syst Mol Biol. 178-86.
5. Yona G., Levitt M. (2000) Towards a complete map of the protein space based on a unified sequence and structure analysis of all known proteins. Proc Int Conf Intell Syst Mol Biol. 8, 395-406.
6. Fischer D. (2000) Hybrid fold recognition: combining sequence derived properties with evolutionary information. Pac Symp Biocomput. 119-30.
7. Kelley L.A., MacCallum R.M., Sternberg M.J. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol. 299 (2), 499-520.
8. Venclovas C. (2001) Comparative modeling of CASP4 target proteins: combining results of sequence search with three-dimensional structure assessment. Proteins Suppl 5, 47-54.
9. Altschul S.F. (1991) Amino acid substitution matrices from an information theoretic perspective. J Mol Biol. 219 (3), 555-65.
10. Vogt G., Etzold T., Argos P. (1995) An assessment of amino acid exchange matrices in aligning protein sequences: the twilight zone revisited. J Mol Biol. 249 (4), 816-31.
11. Henikoff S., Henikoff J.G. (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 89 (22), 10915-9.
12. Smith T.F., Waterman M.S. (1981) Identification of common molecular subsequences. J Mol Biol. 147 (1), 195-7.
13. Levitt M. (1992) Accurate modeling of protein conformation by automatic segment matching. J. Mol. Biol. 226 (2), 507-533.
14. Levitt M., Hirshberg M., Sharon R. and Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Computer Physics Comm. 91, 215-231.

MacCallum (P0393) - 130 predictions: 130 SS

Evolved Post-processing of PSIPRED Predictions

R. M. MacCallum Stockholm Bioinformatics Center, Stockholm University, Sweden [email protected]

I have submitted completely automated secondary structure (SS) predictions for every target using a novel post-processing method for PSIPRED[2] predictions. In outline, a set of conditional reassignments (rules) are applied to each predicted secondary structure element, based on local and global information in the raw PSIPRED output. The rules are evolved with genetic programming, an evolutionary search technique which operates on trees.

PSIPRED predictions were run locally because the .ss2 output files are needed. The database used during the PSI-BLAST[1] search phase of PSIPRED was “nr” from the NCBI [ ftp://ftp.ncbi.nih.gov/blast/db/ ]. This was downloaded once on 22 July 2002 and used for all targets. Complexity filtering of the database was performed exactly as recommended in the PSIPRED documentation. BLAST version 2.2.2 was used.

The .ss2 files are read into object oriented data structures which store for each residue the helix, strand and coil probabilities (neural network outputs). Consecutive residues of the same predicted secondary structure type are grouped into objects called “elements”. All subsequent operations are done at the element level, i.e. from the “element's eye view”.
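Reading per-residue outputs and grouping them into elements might look like the following sketch. The original work used Perl objects; this Python version is only illustrative. It assumes each data line of the .ss2 file carries six whitespace-separated fields (index, residue, predicted state, then coil, helix and strand outputs in that order), and the sample values in the usage are made up.

```python
from itertools import groupby

def parse_ss2(text):
    """Parse .ss2-style text into per-residue records (assumed column order: C, H, E)."""
    residues = []
    for line in text.splitlines():
        fields = line.split()
        if line.startswith("#") or len(fields) != 6:
            continue  # skip header and blank lines
        coil, helix, strand = (float(v) for v in fields[3:6])
        residues.append({
            "index": int(fields[0]), "aa": fields[1], "ss": fields[2],
            "C": coil, "H": helix, "E": strand,
        })
    return residues

def group_elements(residues):
    """Group consecutive residues with the same predicted state into elements."""
    return [list(run) for _, run in groupby(residues, key=lambda r: r["ss"])]

sample = """# PSIPRED VFORMAT (illustrative values)

   1 M C   0.900  0.050  0.050
   2 K H   0.100  0.800  0.100
   3 L H   0.100  0.850  0.050
   4 V E   0.200  0.100  0.700
"""
elements = group_elements(parse_ss2(sample))
print([(e[0]["ss"], len(e)) for e in elements])
```

Each element is a list of residue records, so element-level quantities such as length or mean coil output can be computed directly from it.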

The goal is to create an object method (a Perl subroutine) which will adjust the secondary structure prediction of each predicted element towards the “correct”, DSSP-based, assignment. This is done with a population-based stochastic search in program tree space with fitness selection, crossover and mutation, better known as genetic programming. Fitness is measured as Q3. The adjustment of the prediction is based on decisions made from information about neighbouring elements and the global prediction. The building blocks for the genetic programming can be summarised as: IF-THEN flow control statements, numeric inequality conditions, functions which return numeric values, global constants, arithmetic operations and finally “action” methods, where the secondary structure prediction is altered in some way.

Many methods/functions and global constants are available to the genetic programming but not all of them are used, so to save space I just describe one of the “best of run” programs, in fact the one used for the CASP predictions, shown here in pseudo-code:

Global quantities from original PSIPRED prediction:
  A = mean length of strands
  B = number of strands
  C = mean length of helices
  D = length of longest helix

IF ((fwd_helix(2)) AND (prev_strand(B) OR A<4)) THEN
  reassign_this_element_weighting(
    helices by 4,
    coils by 3,
    strands by 1.3*B/(this_element_length() - fwd_strand(A)))
IF (this_element_lowest_helix_probability() > 0.29) THEN
  reassign_this_element_weighting(
    helices by 67*this_element_mean_coil_probability(),
    coils by 4,
    strands by 48.45 + 9.9*C - C*D)

Here fwd_helix(2) returns 1 if one of the next 2 elements is a helix and 0 otherwise; prev_strand(B) returns 1 if there is a strand in the previous B elements. The functions this_element_lowest_helix_probability() and this_element_mean_coil_probability() are hopefully self-explanatory: they perform calculations on the current element's PSIPRED per-residue network outputs. Finally, reassign_this_element_weighting() recalculates the secondary structure assignment on a winner-takes-all basis from the per-residue helix, strand and coil PSIPRED network outputs, applying a specified weighting to each. As a result, element boundaries may change. Sometimes no changes are made to the PSIPRED prediction.

What does the evolved subroutine do? The general strategy agrees quite nicely with common sense: it disregards strand predictions if there are globally very few strands. In other words, orphan strands are not tolerated, because only in rare circumstances can they form sheets. In addition to the global number of strands, it also seems to be using “longer-range” information, i.e. the presence or absence of helices or strands a small number of elements away, which will often be more than the +/-7 residues of the PSIPRED neural network window. These extra details may or may not be significant, and in any case the improvement in Q3 on the training and test sets (ASTRAL SCOP 10% identity) was only around +0.4%. This is preliminary work and I need to explore the fitness landscape and different ways to represent and manipulate the input data, and hopefully get at least +1% improvement in time for CASP6.
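The weighted winner-takes-all reassignment can be illustrated with a small Python sketch. It is not the original Perl method: the function name `reassign_weighted`, the (helix, strand, coil) tuple order, the tie-breaking order and the toy numbers are all assumptions made for the example.

```python
from typing import List, Tuple


def reassign_weighted(
    probs: List[Tuple[float, float, float]],
    w_helix: float,
    w_coil: float,
    w_strand: float,
) -> str:
    """Re-derive a three-state string for one element.

    probs holds (helix, strand, coil) network outputs per residue.
    Each output is multiplied by its weight and the largest product wins.
    Ties fall to the first of H, C, E (an arbitrary choice for this sketch).
    """
    states = []
    for p_helix, p_strand, p_coil in probs:
        scores = {
            "H": w_helix * p_helix,
            "C": w_coil * p_coil,
            "E": w_strand * p_strand,
        }
        states.append(max(scores, key=scores.get))
    return "".join(states)


# A weakly predicted two-residue strand (toy values).
orphan = [(0.30, 0.45, 0.25), (0.35, 0.40, 0.25)]
print(reassign_weighted(orphan, 1.0, 1.0, 1.0))  # unweighted: stays strand
print(reassign_weighted(orphan, 4.0, 3.0, 0.5))  # strands down-weighted
```

Because the decision is made per residue, a reassigned element can split or merge with its neighbours, which is why element boundaries may change.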

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202.

Martin-Andrew (P0471) - 55 predictions: 55 3D

CASP5 Comparative Modelling

D. Talbot, N.W. Boxall, A.L. Cuff, H. Fooks, R.C. Gibson, E.G. Hutchinson, B.S. Lattimore, E.F. Murphy, S.J.Wills, A.C.R. Martin School of Animal & Microbial Sciences, The University of Reading, Whiteknights, PO Box 228, Reading RG6 6AJ, UK. [email protected]

We attempted only targets that could be addressed by comparative modelling, with sequence identities between target and principal parent from 13% to 71%. As shown previously [1], sequence alignment is the primary factor in obtaining good quality models and the focus of our effort has been on getting good alignments. MODELLER-6 [2] was used to generate the actual models with no additional refinement.

Our strategy proceeded as follows. (1) Determine whether a target was suitable for comparative modelling; (2) reject targets judged to be ‘too hard’; (3) determine the alignment using PSI-BLAST [3], structural alignment information and hand-modifications; (4) build models using MODELLER-6; (5) screen alternative models.

Target suitability and rejection. PSI-BLAST searches of the target sequence were made against non-redundant Genpept plus non-redundant PDB sequences. This database was regularly updated throughout the CASP prediction season. In the case of distant homologues, predictions were also made with GenThreader [4] and/or SAMT99 [5] to help confirm homology. Targets were judged as ‘too hard’ if they contained a large number of indels or if individual indels were longer than 5 residues.
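The rejection rule can be expressed as a simple screen over a pairwise alignment. The sketch below is only an illustration in Python. The 5-residue limit comes from the text, but the cut-off for "a large number of indels" is not stated, so `max_indels=10` is a placeholder, and the function names are hypothetical. Terminal overhangs are counted as indels here for simplicity.

```python
import re
from typing import List


def indel_lengths(target_aln: str, parent_aln: str) -> List[int]:
    """Lengths of all gap runs ('-') in either row of a pairwise alignment."""
    if len(target_aln) != len(parent_aln):
        raise ValueError("aligned rows must have equal length")
    runs = []
    for row in (target_aln, parent_aln):
        runs.extend(len(m.group()) for m in re.finditer(r"-+", row))
    return runs


def too_hard(
    target_aln: str,
    parent_aln: str,
    max_indels: int = 10,     # placeholder: the threshold is not given in the text
    max_indel_len: int = 5,   # from the text: indels longer than 5 residues
) -> bool:
    runs = indel_lengths(target_aln, parent_aln)
    return len(runs) > max_indels or any(n > max_indel_len for n in runs)


print(too_hard("ACDE--FGH", "ACDEKLFGH"))          # one 2-residue indel
print(too_hard("ACDE------FGH", "ACDEKLMNPQFGH"))  # one 6-residue indel
```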

Alignment. This was the most complex phase and the subject of most of our efforts. In general, the alignments used were those generated by PSI-BLAST. The initial maximal-coverage PSI-BLAST alignment and the final alignment were both used and examined in the light of the structure. In most cases, it was observed that the final alignment from PSI-BLAST placed indels in structurally more acceptable positions. Where necessary, alignments were hand-modified in light of the structure to minimize the structural impact of indels. Where more than one PDB parent was available, these were first aligned structurally using SSAP [6]. The PSI-BLAST alignments of the target against these parents were then applied and hand-modified to resolve conflicts between the alignments.

In the case of T0133 (13% sequence identity), a novel alignment procedure was also used in which secondary structure elements from the two parents, aligned using SSAP were converted into pseudo-profiles which were aligned against the target using global dynamic programming.

Model Generation. Models were generated using MODELLER-6 with the DO_LOOPS option set. In some cases multiple models were generated, but in the main, default options were chosen. No further energy refinement was performed on the resulting models.

Model Screening. Pseudo-energies were calculated using the RAM potential [7] (obtained from http://prostar.carb.nist.gov/), and the percentage of residues in core Ramachandran areas was obtained from PROCHECK [8]. A combination of these scores and visual inspection of the models was used to make a final selection.

1. Martin A.C.R., et al. (1997) Assessment of comparative modelling in CASP2. Proteins: Struct., Funct., Genet. Suppl 1, 14–28.
2. Marti-Renom M.A., et al. (2000) Comparative protein structure modelling of genes and genomes. Ann. Rev. Biophys. and Biomol. Struct. 29, 291–325.
3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.
4. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797–815.
5. Karplus K., et al. (1998) Hidden Markov Models for detecting remote protein homologies. Bioinformatics 14, 846–856.
6. Taylor W.R. and Orengo C.A. (1989) Protein structure alignment. J. Mol. Biol. 208, 1–22.
7. Samudrala R. and Moult J. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275, 895–916.
8. Laskowski et al. (1993) PROCHECK – a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283–291.

Meller-Adamczak (P0441) - 23 predictions: 23 3D

Reading Weak Threading Signals for Difficult Fold Recognition Targets

J. Meller and R. Adamczak Pediatric Informatics, Children’s Hospital Research Foundation, University of Cincinnati, 3333 Burnet Avenue, Cincinnati, OH 45229 [email protected]

Challenging fold recognition targets, with significant sequence and structure variations with respect to known proteins, often result in largely correct matches of low statistical confidence. In other words, primary, secondary and tertiary structure signals, which are used to make predictions, are too weak compared to what is expected by chance, given certain background probability distributions associated with our NULL models. However, context dependent a priori knowledge (e.g. about binding partners) imposes additional constraints that may be utilized in order to postulate a better reference model and to estimate what is expected by chance among putative matches satisfying such constraints.

Many groups in CASP4 used strategies combining automated predictions with manually enhanced and further validated annotations. However, adding biological insights and a priori knowledge to recognition protocols proved to be difficult to automate. In our experience, one encouraging example of such an approach was an effective protein length filter used in the LOOPP server threading protocol during CAFASP2 to enhance prediction for difficult targets [1-3]. Without using family profiles and secondary structures, the LOOPP server provided best models for three difficult targets and was ranked as the third best server in the category of difficult targets [3].

Here, we attempted to combine the threading-based LOOPP predictions with further biological insights and manual evaluation of putative matches, following the strategy of other groups and our own experience. We used the CAFASP3 prediction servers to estimate the difficulty of a given target, and (with one exception) only those targets resulting in low consistency among the servers were chosen to test our ability to improve upon the initial LOOPP prediction. We would like to stress that we did not use the new incarnation of the LOOPP server at the Cornell Theory Center [6], but the one used during the previous CASP assessment [2], which we also continue to use for annotations of divergent genomic sequences [4-5].

The initial, high scoring threading matches were used to build a library of related folds and their variations and the target sequences were realigned using our flavor of structurally biased sequence alignment [1]. Such alignments proved to be more reliable compared to “pure” threading alignments based on a “local” contact model (THOM2), developed by Meller and Elber [1]. Nevertheless, sequence alignments are often statistically insignificant for remote homologs and they were used here in the context of the initial threading matches.

The level of the observed inconsistency between the alternative alignments was used as one of the filtering criteria. Other criteria, used for some of the targets included analysis of amino acid residue packing in the models implied by the (local) alignments in terms of pair distribution functions [7] and consistency with strongly predicted secondary structure elements, using our novel protocol [8]. The alignments with putative matches were next analyzed for consistency with their biological role, investigating (using extensive literature searches) all known interactions of the matches and their structural implications.

In several cases our approach did not result in a clear winner. The decision whether a model should be submitted and what should be the ranking of the models was then made based on intuitive and esthetic comparisons of the models. We allowed ourselves to submit up to three models per target in order to evaluate our ranking in some cases. A detailed description accompanied each model submitted to the CASP server.

1. Meller J. and Elber R. (2001) Linear Programming Optimization and a Double Statistical Filter for Protein Threading Potentials. Proteins 45: 241
2. http://ser-loopp.tc.cornell.edu/loopp_old.html
3. http://www.cs.bgu.ac.il/~dfischer/CAFASP2; see also Proteins CASP Sup.
4. Frary A., Nesbitt T., Frary A., Grandillo S., vd Knaap E., Cong B., Liu J., Meller J., Elber R., Alpert K., Tanksley S.D. (2000) fw2.2: A Quantitative Trait Locus Key to the Evolution of Tomato Fruit Size. Science 289: 85
5. Kuznetsova A., Meller J. et al., PNAS, submitted
6. http://www.tc.cornell.edu/CBIO/loopp
7. http://sift.chmcc.org; manuscript under preparation
8. http://pressage.chmcc.org; manuscript under preparation

Levitt (P0016) - 350 predictions: 350 3D

The Levitt Group Comparative Modeling and Ab Initio Methods for Protein Structure Prediction

E. Lindahl, P. Koehl, R. Kolodny, T.M. Raschke, C.M. Summa, and M. Levitt Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305 USA [email protected]

All target sequences submitted to CASP5 were first screened using the results from the CAFASP3 servers and comparative modeling was only attempted for targets where at least one server showed intermediate or high scores. The remaining sequences were considered ab initio targets. For comparative modeling at CASP5, our group focused on improving sequence alignments and on the prediction of sidechains and loop regions in proteins. For ab initio targets, we generated decoys using fragment assembly followed by selection based on energy functions and clustering.

Comparative Modeling. A consensus secondary structure prediction was derived from all the servers available at the CAFASP3 website, giving additional weight to the PsiPred [1] method. For all significant fold recognition hits we extracted both the original structures and other structures in the same SCOP superfamily [2] with good SPACI scores [3] to get high quality templates. We computed a sequence profile based on the structural alignments of these templates, derived from the FSSP database [4]. Position-dependent gap penalties were introduced based on the experimental and predicted secondary structures, FSSP fragments, and the distance between endpoints in the template structures for deletions. We used both our alignments derived from the structural profiles and automated alignments from CAFASP3 to create a set of manually tweaked alignments for each target. The emphasis in this tuning process was not mainly on matching features, but rather on manual discrimination, correcting possible mismatches, and taking any additional knowledge about the sequence/structure into account. For large insertions or changes in secondary structure we first altered the backbone structure of the template and applied energy minimization with SEGMOD [5] and Gromacs [6] to get the structure to a reasonable state. Starting from the manual alignments, a model backbone framework was built by removing two residues on each side of insertions/deletions in the template. Candidate loop fragments were selected from a set of geometrically compatible backbone fragments. Similar fragment sets were generated for positions where there were PRO and GLY mutations between the template and query sequences. This approach is limited to insertions shorter than about 15 residues, and for a couple of cases we had to apply manual modeling using the O program [7] to generate potential loops. 
In the final modeling step, we select a set of rotamers for each sidechain, and use a self-consistent mean-field approach [8-9] to simultaneously optimize sidechains and the altered backbone fragments. Manual inspection and the energy of the resulting all-atom models were used to select which predictions to submit.

Ab Initio Modeling. We applied the following ab initio prediction method to target proteins that received low scores from the CAFASP3 comparative modeling servers. Models were generated by assembling regularized backbone segments of length 9 (derived from a 2000-protein library) using Monte Carlo swap moves, as per Jones’ method used in CASP2 [10] and Baker’s method in CASP3 and CASP4 [11-12]. The energy function used for annealing consisted of terms representing cooperative hydrogen bonds (as done by Keasar & Levitt in CASP4), residue-based hydrophobic burial propensity, and residue-based hydrophobic pair interactions. After 50,000 steps of annealing with the segment replacement method, the models were annealed with “refinement moves,” consisting of small 2° rotations of the backbone torsion angles, for 10,000 steps. This process was used to model the native sequence and homologous sequences (where appropriate) using the predicted secondary structures from several automated servers [1,13-14]. For some targets, the most likely emitted sequence from a Hidden Markov Model built from the target sequence family was also used [15]. A set of 1000 decoys was generated for each sequence/secondary structure combination, and all models were combined into one large dataset for selection. This dataset was pruned to 3000 members using a “colony energy” score [16] that combined several energy functions (atom cluster energy, electrostatic energy, RAPDF [17], and the energy from the decoy generation procedure) with a measure of the structural similarities between the decoys. The 3000 best models were clustered with a hierarchical clustering method using a Floyd distance metric (distance along the graph) [18]. Decoys in the top 5 clusters were evaluated by manual inspection, and typically one decoy from each of the top 5 clusters was submitted.
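A "distance along the graph" can be illustrated with the Floyd-Warshall all-pairs shortest-path algorithm applied to a neighbourhood graph over the decoys. The Python sketch below is an assumption-laden illustration, not the group's code: the abstract does not say how the graph was built, so the k-nearest-neighbour construction, the function name `graph_distances` and the toy distance matrix are all hypothetical.

```python
import math
from typing import List


def graph_distances(dist: List[List[float]], k: int = 2) -> List[List[float]]:
    """All-pairs shortest-path distances on a k-nearest-neighbour graph.

    dist is a symmetric matrix of pairwise decoy distances (e.g. RMSD).
    Each decoy is linked to its k nearest neighbours (edges are undirected),
    then Floyd-Warshall relaxes every pair through every intermediate node.
    """
    n = len(dist)
    g = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        nearest = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for j in nearest:
            g[i][j] = g[j][i] = dist[i][j]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = g[i][m] + g[m][j]
                if via < g[i][j]:
                    g[i][j] = via
    return g


# Toy matrix: decoys 0 and 3 are fairly close directly (1.5),
# but are only connected through 1 and 2 in the neighbour graph.
toy = [
    [0.0, 1.0, 2.0, 1.5],
    [1.0, 0.0, 1.0, 2.0],
    [2.0, 1.0, 0.0, 1.0],
    [1.5, 2.0, 1.0, 0.0],
]
geodesic = graph_distances(toy, k=1)
print(geodesic[0][3])
```

The resulting matrix could then be passed to any hierarchical clustering routine in place of the raw pairwise distances.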

1. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. http://bioinf.cs.ucl.ac.uk/psipred/
2. Murzin A.G., Brenner S.E., Hubbard T., Chothia C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540.
3. Brenner S.E., Koehl P., Levitt M. (2000) The Astral compendium for protein structure and sequence analysis. Nucleic Acids Res. 28, 254-256.
4. Holm L., Sander C. (1996) Mapping the protein universe. Science 273, 595-602.
5. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential Energy Functions and Parameters for Simulations of Molecular Dynamics of Proteins and Nucleic Acids in Solution. Comp. Phys. Comm. 91, 215-231.
6. Lindahl E., Hess B., van der Spoel D. (2001) GROMACS 3.0: A package for molecular simulation and trajectory analysis. J. Mol. Mod. 7(8), 306. http://www.gromacs.org
7. Jones T.A., Kjeldgaard M. (1998) Essential O, Software manual, Uppsala University. http://xray.bmc.uu.se/alwyn/o_related.html
8. Koehl P., Delarue M. (1994) Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J. Mol. Biol. 239, 249-275.
9. Koehl P., Delarue M. (1995) A self consistent mean field approach to simultaneous gap closure and side-chain positioning in homology modeling. Nature Struct. Biol. 2, 163-170.
10. Jones D.T. (1997) Successful ab initio prediction of the tertiary structure of NK-lysin using multiple sequences and recognized supersecondary structural motifs. Proteins: Struct. Funct. Genet. S1, 185-191.
11. Simons K.T., Bonneau R., Ruczinski I. and Baker D. (1999) Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins: Struct. Funct. Genet. S3, 171-176.
12. Bonneau R., et al. (2001) Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins: Struct. Funct. Genet. S5, 119-126.
13. PHD, http://www.embl-heidelberg.de/predictprotein/predictprotein.html
14. SAM-T02-STRIDE, http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
15. Gough J. and Madera M. (2002) The next generation of structural genome analysis. CASP5 Abstract.
16. Xiang Z.X., Soto C.S. and Honig B. (2002) Evaluating conformational free energies: The colony energy and its application to the problem of loop prediction. Proc. Natl. Acad. Sci. USA 99 (11), 7432-7437.
17. Samudrala R. and Moult J. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275 (5), 895-916.
18. Tenenbaum J.B., de Silva V. and Langford J.C. (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319.

MPALIGN (P0135) - 327 predictions: 327 3D

MPALIGN: a Protein Threading Program Using Multiple Profiles

T. Akutsu¹, M. Fujita¹, H. Saigo¹, J.-P. Vert² and K. Horimoto³ ¹Bioinformatics Center, Kyoto Univ., ²Ecole des Mines de Paris, ³Human Genome Center, Univ. Tokyo [email protected]

MPALIGN combines various alignment methods. PSI-blast[1] is used for easy targets. For the others, dynamic programming based sequence-to-profile alignment is employed using profiles from (i) PSI-blast search from each representative sequence, (ii) multiple profiles from sequences in the same fold, (iii) combination of (i) and profiles from structurally similar fragments. In the following, we briefly describe the methods for (i)-(iii).

(i). For each sequence in the ASTRAL database (with less than 40% sequence identity) [2], PSI-blast search is performed against the nr database using ‘–Q’ option, which outputs a PSSM (position specific score matrix) as a result. For each PSSM, both local alignment and global alignment between the target sequence and the PSSM are computed. ASTRAL sequences are scored based on these alignment scores.
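Scoring a target sequence against a PSSM by global and local alignment can be sketched with standard dynamic programming. The Python below is a minimal illustration under assumptions that are not in the abstract: a linear gap penalty, a PSSM stored as one score dictionary per column, a score of 0 for residues missing from a column, and the function name `align_score`.

```python
from typing import Dict, List


def align_score(
    seq: str,
    pssm: List[Dict[str, float]],
    gap: float = -2.0,       # assumed linear gap penalty
    local: bool = False,
) -> float:
    """Best alignment score of seq against the PSSM columns.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style, scores floored at 0).
    """
    n, m = len(seq), len(pssm)
    prev = [0.0 if local else j * gap for j in range(m + 1)]
    best = 0.0
    for i in range(1, n + 1):
        curr = [0.0 if local else i * gap] + [0.0] * m
        for j in range(1, m + 1):
            match = prev[j - 1] + pssm[j - 1].get(seq[i - 1], 0.0)
            score = max(match, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0.0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[m]


# Toy three-column PSSM (hypothetical values).
PSSM = [{"A": 2, "C": -1}, {"A": -1, "C": 3}, {"A": 1, "C": -2}]
print(align_score("ACA", PSSM))                  # global
print(align_score("GGACAGG", PSSM, local=True))  # local
```

Only two rows of the score table are kept, which is enough when just the score, and not the alignment path, is needed for ranking.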

(ii). Multiple profiles from several sequences in different families but in the same fold are used, where each profile is obtained as in (i). An alignment between the target sequence and multiple profiles is computed for each fold class, where a simple dynamic programming algorithm is employed for alignment. Fold classes are scored based on these alignment scores.

(iii). In order to obtain profiles based on fragments of protein structures, fragments are classified into several tens of groups based on structural similarities using the UPGMA clustering method, where each fragment consists of 9 consecutive C-alpha atoms [3]. For each group, we construct a profile based on the residue frequency at each position (among the 9 positions). For each protein structure in the ASTRAL database, we construct a PSSM by concatenating these short profiles, where profiles obtained as in (i) are also used for regions that do not correspond to any group of fragments. An alignment between the target sequence and each PSSM is computed using a simple dynamic programming algorithm. ASTRAL sequences are scored based on these alignment scores.
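Building a per-position residue frequency profile for one fragment group can be sketched as below. This Python illustration assumes the group is given as the sequences of its 9-residue fragments. It uses raw frequencies without pseudocounts or log-odds conversion, which the abstract does not specify, and the function name and toy fragments are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def fragment_profile(fragments: List[str], length: int = 9) -> List[Dict[str, float]]:
    """Residue frequencies at each of `length` positions across a fragment group."""
    if not fragments or any(len(f) != length for f in fragments):
        raise ValueError("all fragments must have the expected length")
    profile = []
    for pos in range(length):
        counts = Counter(f[pos] for f in fragments)
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()})
    return profile


# Toy group of four 9-residue fragments.
group = [
    "A" * 9,
    "A" * 8 + "G",
    "AC" + "A" * 6 + "G",
    "AC" + "A" * 6 + "G",
]
profile = fragment_profile(group)
print(profile[1], profile[8])
```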

Finally, each candidate sequence is ranked based on weighted sum of the above scores and the result of secondary structure prediction by PSIPRED[4], where we only use information about the ratio of the number of residues in alpha-helices to the number of residues in beta-strands.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Chandonia J.M. et al. (2002) ASTRAL compendium enhancements. Nucleic Acids Res. 30, 260-263.
3. Simons K.T. et al. (1997) Assembly of protein tertiary structure from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209-225.
4. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195-202.

Murzin (P0448) - 21 predictions: 21 3D

Knowledge-based Approach to Modelling of Homologous Structures with Low Sequence Similarity and Other Tricks

A.G. Murzin MRC Centre for Protein Engineering, Cambridge, UK [email protected]

Our semi-manual approach to protein structure prediction is based on the knowledge of all known structural and probable evolutionary relationships among proteins of known structure classified in the SCOP database. In CASP2 and CASP4, we successfully applied this approach to the recognition of probable distant homology, where it existed, between the target proteins and proteins of known structure and significantly improved the quality of our distant homology models between the two CASP experiments. In CASP5, I have applied this approach to distant homology modelling, a new emerging prediction subcategory.

Having previously been a subset of the fold recognition targets, the distant homology targets are now becoming a subject of comparative modelling. The advent of sequence profile-based methods, combined with the enlargement of protein families resulting from the sequencing of many complete genomes, has eased the detection of remote homology. In contrast, the modelling of distant homology targets remains a challenging problem. It is different from classic comparative modelling, where the similarity between the target and parent structures is likely to be extensive. In distantly related proteins, the amount of similar structure is generally smaller, about one half on average. Thus, apart from the identification and alignment of the regions of similar structure, there is the problem of predicting the remaining dissimilar regions. In theory, some of these regions could be “copied” from alternative parent structures, if they are available; that is, it may be possible to assemble a composite “parent” structure from fragments of several distantly related structures that would approximate the target structure better than any of the currently known structures. This theory is sound, as shown by our initial tests of composite models in previous CASP experiments. My main objective in CASP5 was the further improvement of composite models. Ideally, such a model should provide a structural explanation of every conserved feature in the multiple sequence alignment of the target's immediate family.

I have selected about 15 targets ranging across two prediction categories, from difficult comparative modelling targets to easy-to-medium fold recognition targets. The selection criteria were the lack of a high sequence similarity to a known structure and the availability of at least two probably related structures with low sequence similarities to each other. Thus, each selected target has been assigned to a true SCOP superfamily (that is, one containing more than one family of known structure) either by sequence similarity searches or by our distant homology recognition techniques. The composite “parent” structures have been assembled from manually selected fragments of the representative structures of different constituent families. In a few cases, my earlier predictions speeded up the modelling process. Having previously explored a number of true SCOP superfamilies, I assigned to them many sequence families of unknown structure and aligned the representative sequences of these families with the sequences of known structures. The CASP5 targets T0130, T0132, T0136, T0152 and T0169 appeared to be members of already assigned sequence families, which enabled me to use the prepared alignments. An original, post-CASP4 model of T0130, built in 2001 for a structural genomics project, has been submitted without further refinement. This model suggests a novel topology not yet observed in the target superfamily. My other models are expected to improve on the prediction of local details of particular interest, including variable irregular elements (alpha-helical caps, beta-bulges etc.) near the putative active sites. It should be noted that my method does not deal with the problems of classic comparative modelling, like the prediction of loop or side chain conformations. Any credit for a correct prediction of these details goes to MODELLER, used to seal the gaps and fix the stereochemistry of the joints in the composite models.

My other CASP5 exercise was the exploration of a loophole in the design of the CASP experiment. Ideally, for a given target, the prediction should be made before the target structure is known to anybody. In reality, most of the CASP5 targets were submitted after their structures had been solved, so they were known to at least their authors. Although officially unpublished, these structures could have been presented elsewhere, so some information on these structures could have “leaked” into the public domain. Indeed, I have found previously available information, mainly on the Internet, on the structures of several CASP5 targets in all prediction categories. It contained general descriptions of the overall protein fold or similarity to a known structure. Such information probably has little or no effect on my distant homology models, which are aimed at the prediction of fine details, but it can be crucial for predictions of protein fold. To evaluate its effect on the quality of predictions, I have built and submitted models utilising the collected structural information, as recorded in the REMARK field of the corresponding predictions.

MZ-Brussels (P0246) - 54 predictions: 54 3D

Energy-based 3D Protein Structure Predictions

Koji Ogata¹,², Raphael Leplae² and Shoshana J. Wodak² ¹ZoeGene Corp., Japan, ²Service de Conformation de Macromolecule Biologique et Bioinformatique, Université Libre de Bruxelles, CP 263, Blv. du Triomphe, Brussels, Belgium [email protected]

The ModzingerZ (MZ) package performs homology modelling and ab initio structure prediction respectively, depending on the presence or absence of template structures in the PDB. Templates are identified in the PDB by a two step procedure using Psi-BLAST [1] with the default options. When suitable templates are found, homology modelling is performed, by combining a profile-profile alignment and energy based loops modelling procedure. In absence of template structures in PDB, a fragments grafting method is applied. The grafted fragments are selected from a library of non-redundant fragment conformations, which is derived from known protein structures by structure superposition and clustering. In both approaches, generated conformations are scored using an approximate force-field derived from averaging main chain and side chain interactions in proteins computed using the AMBER force-field, and modelling residues by 2 interactions centers. Further details about these procedures are given in the following paragraphs:

Homology modelling procedure: To identify structural templates in the PDB, a two-step procedure was used. First, the target sequence was aligned against a sequence database combining sequences from GenBank and PDB-sub (PDB-sub containing sequences with <90% sequence identity) by using Psi-BLAST with default parameters. Second, individual PDB structures identified by this search were re-run against PDB-sub to identify additional homologs with known 3D structure. All the identified PDB structures were then structurally aligned. A profile was derived from these structural alignments and the target sequence was aligned against this profile [2]. In addition, a sequence profile was computed for each identified template protein by running Psi-BLAST against GenBank [3] and pruning so as to leave highly similar sequences (with identity more than 50% and less than 100%). In performing these alignments, gaps inside the secondary structure elements (computed using DSSP [4]) were penalised.
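The pruning step (keep hits with identity above 50% and below 100%) can be illustrated with a short Python sketch. The abstract does not say how identity was computed, so the definition used here (identical residues over columns where neither sequence has a gap) is an assumption, as are the function names and toy sequences.

```python
from typing import List


def percent_identity(a: str, b: str) -> float:
    """Identity over aligned columns in which neither row has a gap ('-')."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def prune_hits(template: str, hits: List[str], low: float = 50.0, high: float = 100.0) -> List[str]:
    """Keep hits whose identity to the template lies strictly between low and high."""
    return [h for h in hits if low < percent_identity(template, h) < high]


template = "ACDEFGHIKL"
hits = [
    "ACDEFGHIKL",  # identical: dropped
    "ACDEFGHIKV",  # 90% identical: kept
    "ACDEVVVVVV",  # 40% identical: dropped
]
print(prune_hits(template, hits))
```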

Structurally conserved regions (SCR) in the target sequence were then defined as residues that aligned to those of the structural templates that display an RMSD≤1.0Å in the corresponding multiple structural alignment. For residues in the target corresponding to SCR’s, the backbone was built using the main chain coordinates of the template with the highest BLOSUM62 score to the target. Side chain coordinates from the same template were also used whenever the amino acid of the target and template were the same.

The remaining regions, called structurally variable regions (SVR), were built by using the main chain atom coordinates of template structures having the highest BLOSUM62 score computed without insertion/deletion regions, with different templates being used for different regions. For the insertions/deletions, an energy-based loop modelling method [5] was used to find suitable loop conformations. The force-field used for evaluating conformations models each residue by two interaction centres positioned at the Cα and Cβ atoms. The pairwise interaction energies between these centres were derived by computing the average of the potential energy of the AMBER force field [6] for main chain and side chain interactions for specific residue pairs in the PDB. We verified that this force-field yields rather accurate predictions for individual protein loops as well as several interacting loops [Ogata & Wodak, in preparation]. However, the maximum loop length amenable to this procedure is 22 residues. Longer loops were therefore simply not modelled. Residues without side chain coordinates from a template structure were generated using the Monte Carlo method with the AMBER force field. Models output by the above procedure were examined, and the alignment was adjusted (either manually or with alignment tools) whenever inconsistencies (on the sequence, structure or biological level) were discovered. The new alignment was then re-fed to the model building method described above.

Ab initio modelling approach: When no suitable template was found by the search procedure described above, a fragment-grafting method was used. This method uses a library of eight-residue fragments with non redundant conformations (conformations with rmsd>1Å), derived from known protein structures by structure superposition and clustering. The information on the amino acid sequence and the secondary structure (computed by DSSP [4]) associated with each fragment cluster is also stored in the library.

The protein main chain was generated by chaining together fragments starting from the protein N-terminus in such a way that the first four residues of the following fragment and the last four residues of the preceding fragment overlap. To select fragments from the library, the rmsd of the overlapping portion was required to be below 1Å. Since this still yielded a very large number of overlapping fragments, information on secondary structure was used to further reduce conformation space, as follows. The target secondary structure was predicted using PHD [7] and fragments with secondary structure more than 80% similar to the target were selected. The similarity of the secondary structures is defined as the percentage of identity between secondary structure elements of the same type (helix, strand and random coil). The remaining conformations were evaluated using the force-field described above.
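The secondary structure filter described above can be sketched as follows. This is a minimal illustration, assuming that both the predicted target window and each library fragment are encoded as equal-length strings over H (helix), E (strand) and C (coil); the function names and the toy library are hypothetical.

```python
from typing import Dict, List


def ss_similarity(predicted: str, fragment: str) -> float:
    """Percentage of positions with the same secondary structure type."""
    if not predicted or len(predicted) != len(fragment):
        raise ValueError("strings must be non-empty and of equal length")
    matches = sum(p == f for p, f in zip(predicted, fragment))
    return 100.0 * matches / len(predicted)


def select_fragments(predicted_window: str,
                     library: Dict[str, str],
                     threshold: float = 80.0) -> List[str]:
    """Keep fragments whose similarity to the prediction exceeds threshold."""
    return [frag_id for frag_id, ss in library.items()
            if ss_similarity(predicted_window, ss) > threshold]


# Hypothetical eight-residue fragments with their DSSP-like strings.
library = {
    "frag_a": "HHHHHHHH",
    "frag_b": "HHHHHHCC",
    "frag_c": "CHHHHHHH",
    "frag_d": "EEEEEECC",
}
selected = select_fragments("HHHHHHHC", library)
print(selected)
```

For eight-residue fragments, a threshold of more than 80% means that at least seven of the eight positions must agree with the prediction.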

Over one million main chain conformations were generated in this way and the main chain conformation with the lowest energy was selected as the optimal solution. Side chain conformations of the selected main chain were then generated using a Monte Carlo procedure with AMBER force field.

For the CASP5 predicted targets, we used two single processor machines. Due to the size of the conformation search space, each prediction has been limited to 24 hours CPU time.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Rychlewski L., Jaroszewski L., Li W., Godzik A. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci. 9(2), 232-241.
3. Benson D.A., Karsch-Mizrachi I., Lipman D.J., Ostell J., Rapp B.A., Wheeler D.L. (2002) GenBank. Nucleic Acids Res. 30, 17-20.
4. Kabsch W. and Sander C. (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577-2637.
5. Ogata K., Leplae R., Wodak S.J. An Energy Based Predictions for Multi-loops of Proteins, in preparation.
6. Weiner S.J., Kollman P.A., Nguyen D.T. and Case D.A. (1986) An all atom force field for simulations of proteins and nucleic acids. J. Comput. Chem. 7, 230-252.
7. Rost B. and Sander C. (1994) Combining evolutionary information and neural networks to predict protein secondary structure. Proteins 19, 55-72.

nexxus-delrio (P0370) - 7 predictions: 7 3D

Protein Structure Assessment by Matching Residues Function and Centrality

G. del Rio1, A. Garciarrubio2 and D.E. Bredesen1 1 – Buck Institute, 2 – Biotechnology Institute (UNAM) [email protected]

Biological systems can be represented by their elements and their interactions in a graph or network. Graph theory analyzes systems represented by vertices (i.e. elements) and edges (i.e. interactions). From these graphs, central vertices or edges (nexuses) can be detected based on diverse criteria, including connectivity. Nexuses defined in terms of connectivity are those vertices upon which the connectivity relies. Since graphs are models of biological systems, nexuses represent essential elements of biological systems [1, 2].

Several biological systems have been represented as graphs, including metabolism [3] and protein-protein interactions [2]. In these examples, the connectivity distribution presents a tail following a power-law distribution, that is, there are few vertices having many interactions while most vertices have few interactions. It has been proposed that in such distribution the most connected vertices play an essential function for the system being modeled [2].

However, none of the systems studied in this way are completely characterized in terms of vertices and edges. On the other hand, proteins (i.e. protein structures) can be modeled as graphs where the vertices are amino acid residues and edges are spatial-distance relationships. In this case, the system is better defined and can be used as a model to readily test mathematical methods to identify central/essential elements.

We have found that the connectivity distribution in protein structures does not follow a power-law distribution but an exponential one [4]. We developed a method, referred to as Minimum Interacting Networks (MIN) [1, 4], to identify nexuses in graphs with either power-law or exponential connectivity distributions. MINs are obtained by tracing the shortest path connecting all the vertices in a graph. From these paths, the most traversed vertices are identified as nexuses, as opposed to those vertices having the most interactions.
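The idea of identifying nexuses as the most traversed vertices can be illustrated with a small sketch. It is not the MIN implementation: it assumes an unweighted graph, traces a single breadth-first shortest path between every pair of vertices, and counts how often each vertex appears as an intermediate. The toy graph and function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def shortest_path(graph, source, target):
    """One shortest path from source to target by breadth-first search,
    or None if the two vertices are not connected."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:        # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


def traversal_counts(graph):
    """Count how often each vertex is an intermediate on the traced paths."""
    counts = {vertex: 0 for vertex in graph}
    for a, b in combinations(sorted(graph), 2):
        path = shortest_path(graph, a, b)
        if path:
            for vertex in path[1:-1]:      # exclude the two endpoints
                counts[vertex] += 1
    return counts


# Toy residue graph: "c" and "d" bridge two groups of peripheral vertices.
graph = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a", "b", "d"],
    "d": ["c", "e", "f"],
    "e": ["d"],
    "f": ["d"],
}
counts = traversal_counts(graph)
nexuses = sorted(counts, key=counts.get, reverse=True)[:2]
print(counts, nexuses)
```

In a protein, the graph would be built from residues and their spatial-distance relationships; the distance criterion is not specified here and would have to be chosen.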

In proteins, essential residues tend to be evolutionarily conserved, and because of this feature, they have been used to classify and identify proteins by function and structure. In modeling protein structures, refinement and assessment of the models is systematically done by methods accounting for geometry and energy criteria. These models then are subject to other approaches to identify essential residues (e.g. residues in the active site) as a way to test them functionally. We present a method to assess protein structure models that incorporates functional validation. Our strategy is represented in figure 1:

Figure 1. The 3D structure of a protein sequence (line) is modeled by structural homology with a known 3D structure (cylinder). Multiple conformers (cylinders, cones) can be generated where geometry and energy criteria are satisfied. Nexuses can be identified and matched with known functionally essential residues. The models where the matching is the highest are identified as functionally/structurally correct.

As with sequence motifs, nexuses are proposed to serve as a signature to identify a particular structure. In the figure, the central regions of a cone are not the same as those of a cylinder; hence, cylinders can be filtered out from cones using centrality as a criterion. This appears to be true for proteins, since these do not present all the possible different shapes. Also, proteins with similar folds may have distinct essential residues; hence these may be used to identify the corresponding protein structures.

We tested our method with three proteins (beta-lactamase (BL), HIV-1 protease (HIV1P) and T4 lysozyme (T4Lz)) for which the 3D structure is available and extensive mutagenesis studies have been conducted to identify essential residues for the wild-type activity. We found that in all three cases, we were able to identify the closest models to the crystal structure by matching nexuses with essential residues (matching ratio) (see figure 2).

Figure 2. RMSD of 40 models of HIV1P superimposed with the crystal structure (1HIV) is plotted against the specificity of our method. The models were built with MODELLER and the 1HIV structure as template. Specificity = No. of essential residues predicted as nexuses/No. of essential residues. Specificity is used here as the matching ratio.

We also used our method in fold recognition. That is, sequences with different folds may be identified as templates for a target sequence. In order to identify the best template for the target sequence, 3D models are built to detect nexuses and match these with essential residues. We tested this procedure with the death domain of the low affinity neurotrophin receptor (NTRP75) and PROSPECT as the fold recognition approach. The 5 best sequence alignments identified by PROSPECT were used to test our approach. As seen in Table 1, the fold with highest matching ratio was the correct fold.

Table 1. Fold recognition scores based on matching ratio (column CN/NCN) for NTRP75. Essential residues here are considered evolutionarily conserved ones. CN: Nexuses found to be conserved residues; NCN: Nexuses found to be non-conserved residues.

Template Structure   CN   NCN   CN/NCN (%)   Fold Classification
1B3U (A)             19   39    48.71        Alpha-alpha super helix
1D2Z (A)             19   35    54.28        DEATH domain
1NGR (1)             18   22    81.81        DEATH domain
1QPZ (A)             19   47    40.42        Periplasmic binding protein-like I
1R69                 17   45    37.77        Lambda repressor-like DNA-binding domains
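The template ranking implied by Table 1 can be reproduced with a few lines. The CN and NCN counts below are taken from the table; the function name `matching_ratio` and the data layout are illustrative.

```python
def matching_ratio(cn: int, ncn: int) -> float:
    """CN/NCN expressed as a percentage."""
    if ncn == 0:
        raise ValueError("NCN must be non-zero")
    return 100.0 * cn / ncn


# (template, CN, NCN) as listed in Table 1.
templates = [
    ("1B3U (A)", 19, 39),
    ("1D2Z (A)", 19, 35),
    ("1NGR (1)", 18, 22),
    ("1QPZ (A)", 19, 47),
    ("1R69", 17, 45),
]

# Rank templates from highest to lowest matching ratio.
ranked = sorted(templates,
                key=lambda t: matching_ratio(t[1], t[2]),
                reverse=True)
best = ranked[0][0]
for name, cn, ncn in ranked:
    print(f"{name:10s} {matching_ratio(cn, ncn):6.2f}")
```

The highest ratio belongs to 1NGR, a DEATH domain structure, in agreement with the statement that the fold with the highest matching ratio was the correct one.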

For CASP5, we used PROSPECT to identify the closest sequence homologues and using our approach we detected those models containing the largest matching ratios (Essential residues were considered evolutionarily conserved ones). The models were built using MODELLER.

1. del Rio G., et al. (2001) Mining DNA microarray data using a novel approach based on graph theory. FEBS Lett. 509(2), 230-234.
2. Jeong H., et al. (2001) Lethality and centrality in protein networks. Nature 411 (6833), 41-42.
3. Jeong H., et al. (2000) The large-scale organization of metabolic networks. Nature 407 (6804), 651-654.
4. del Rio G. et al. Detecting central elements from protein derived networks to predict essential function. In preparation.

ORNL-PROSPECT (P0012) - 330 predictions: 330 3D

Fold Recognition Using PROSPECT

D. Kim, D. Xu, J. Guo, S. Passovets, M. Shah, K. Ellrott, and Y. Xu* Protein Informatics Group, Oak Ridge National Laboratory * [email protected]

We have predicted the protein tertiary structures for all 67 targets. The predictions were mainly carried out using our newly improved fold recognition program, PROSPECT (http://compbio.ornl.gov/PROSPECT/), and a recently developed computational pipeline for automated protein structure prediction (http://compbio.ornl.gov/PROSPECT/PROSPECT-pipeline), with occasional human intervention.

PROSPECT uses both the sequential and structural information to recognize the correct sequence-structure relationship. Built on its previous unique features for rigorously treating pairwise contact energy and protein-specific data as threading constraints, the new version of PROSPECT has the following unique features: (1) the use of the evolutionary information not only in a profile-profile sequence alignment score but also in calculating the single and pairwise energies; and (2) the use of a combined z-score for a sequence-structure alignment for fold recognition, based on a z-score for the raw alignment score and a z-score for the pairwise interaction energy. In addition, by performing objective statistical analysis on a large data set for threading of 600 query proteins against the whole FSSP database, a prediction confidence index for measuring the prediction reliability is tabulated.
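How two z-scores might be merged into one ranking score can be sketched as below. The abstract does not state how PROSPECT combines them, so the weighted sum, the equal weights, and the convention that larger input values are better are all assumptions made purely for illustration.

```python
from statistics import mean, pstdev
from typing import List


def z_scores(values: List[float]) -> List[float]:
    """Standardise values against their own mean and standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def combined_z(raw_scores: List[float],
               pair_scores: List[float],
               w_raw: float = 0.5,
               w_pair: float = 0.5) -> List[float]:
    """Weighted sum of the two z-scores (hypothetical combination rule).

    Both inputs are assumed to be oriented so that larger means better;
    energies would need to be negated first.
    """
    z_raw = z_scores(raw_scores)
    z_pair = z_scores(pair_scores)
    return [w_raw * a + w_pair * b for a, b in zip(z_raw, z_pair)]


# Hypothetical scores of one query against four templates.
raw = [10.0, 12.0, 30.0, 11.0]
pair = [1.0, 2.0, 9.0, 1.5]
scores = combined_z(raw, pair)
best_template = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_template)
```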

The tests on several benchmark sets indicate that the evolutionary information and other new features in PROSPECT greatly improve its alignment accuracy. We have also demonstrated that PROSPECT's performance on fold recognition is significantly better than that of any other publicly available method at all levels of sequence similarity. The improved sensitivity of fold recognition, especially at the superfamily and fold levels, makes PROSPECT a reliable prediction tool for large-scale applications of protein structures. We have developed a fully automated prediction pipeline for protein structures, centered around PROSPECT.

The pipeline consists of five key components: (a) preprocessing for identification of protein domains, identification and removal of signal peptides, and protein secondary structure prediction (using our own in-house program); (b) protein triage for classification of proteins into membrane proteins, soluble proteins with and without significant BLAST hits; (c) protein fold recognition and sequence-structure alignment using PROSPECT, using additional available information as prediction constraints (like secondary structure or known disulfide bonds, etc); (d) protein structure modeling, using MODELLER, based on threading alignments of PROSPECT; and (e) post-processing for structural quality check, using PROCHECK. A pipeline manager system was developed to automatically determine the process and prediction pathway based on a user specification, a set of preset conditions and related control flow, and the triage result. The whole pipeline is implemented as a client/server system, with a web interface. XML technology is used for data exchange between the web interface, the pipeline manager and the tools. Currently the pipeline is running on a 64-node linux cluster at Oak Ridge National Laboratory.

A general procedure for the CASP5 structure predictions starts with running the pipeline for each target. If both the prediction reliability score and the structure quality assessment score from WHATIF were above some pre-determined thresholds, a structure model generated by MODELLER was submitted as the final prediction without any human intervention. Most of the homology modeling targets belong to this case. For the cases with high reliability scores but low WHATIF scores, several alternative sequence-structure alignments for a chosen template were generated using different alignment schemes, including global, global-local, and local alignments, and/or by using different sets of weighting factors for each energy term. The majority of the fold recognition targets belong to this case. For targets with low reliability scores, such as new fold targets and some of the difficult fold recognition targets, additional information such as predicted functions and consistency with the predicted secondary structures was used to select the templates and adjust the alignments. Overall, the predictions were made by maximally utilizing two automated programs, PROSPECT and Pipeline, while human intervention was kept to a minimum.
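The three cases described above amount to a simple decision rule, sketched below. The threshold values and the pathway labels are hypothetical; the abstract only says that pre-determined thresholds were used.

```python
def choose_pathway(reliability: float,
                   quality: float,
                   reliability_cutoff: float = 0.8,
                   quality_cutoff: float = 0.5) -> str:
    """Map the two scores to one of the three prediction pathways.

    The cutoff values are placeholders, not the thresholds actually used.
    """
    if reliability > reliability_cutoff and quality > quality_cutoff:
        # Both scores high: submit the MODELLER model automatically.
        return "submit_automatic_model"
    if reliability > reliability_cutoff:
        # Reliable fold but poor model quality: try alternative alignments.
        return "generate_alternative_alignments"
    # Low reliability: bring in function and secondary structure information.
    return "use_additional_information"


print(choose_pathway(0.95, 0.9))
print(choose_pathway(0.95, 0.2))
print(choose_pathway(0.30, 0.9))
```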

Osgdj (P0292) - 100 predictions: 100 3D

The PROTSCAPE Protein Folding Algorithm

D.J. Osguthorpe University of Bath in Swindon [email protected]

The PROTSCAPE protein folding algorithm is based on a simplified geometry model of protein structure with a physics-based force field representing the interactions between the pseudo-atoms.

The model used to represent the solvation aspects of protein folding physics is a novel one unique to this algorithm. It is called the "Differential Dielectric Model" as it describes what happens to electrostatic interactions when there is a difference in dielectric between two regions. A consequence of this model is that there is, for the first time, a direct physical basis for a cooperative effect in protein structures which develops from the interaction between non-polar groups and electrostatic interactions. Further, it allows another mechanism for denaturation by solvents such as Guanidinium HCl which is not based on disrupting "hydrophobicity" but on reducing the difference in dielectric of the core and outer surface regions.

The Reduced Representation Model and Force Field

Simplified Geometry Model

The model involves representing the backbone of each residue by one sphere, or 'atom', and the side chains by up to 4 'atoms'. The side chains of Ala, Val, Ile, Ser, Thr and Pro are represented by one sphere; Leu, His, Asp, Glu, Asn, Gln, Cys and Met by two spheres; Phe, Trp, Lys and Arg by three spheres; and Tyr by four spheres. The different number of spheres reflects the anisotropic nature of the average shape of the corresponding side chains. It also enables assigning different characteristics to parts of the side chain of a residue; for example, the side chain of Arg includes a hydrophobic chain and a polar/charged end. Although in this representation many residues have the same number of atoms, they do not lose their unique identity since they have different parameters.
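The sphere assignment above can be written down as a lookup table. In this sketch the entry for Gly (no side chain sphere) is an assumption, since glycine is not listed in the text; the function name is illustrative.

```python
# Number of side chain spheres per residue type, as listed in the text.
SIDE_CHAIN_SPHERES = {
    **dict.fromkeys(["ALA", "VAL", "ILE", "SER", "THR", "PRO"], 1),
    **dict.fromkeys(["LEU", "HIS", "ASP", "GLU",
                     "ASN", "GLN", "CYS", "MET"], 2),
    **dict.fromkeys(["PHE", "TRP", "LYS", "ARG"], 3),
    "TYR": 4,
    "GLY": 0,  # assumption: glycine is not mentioned in the text
}


def pseudo_atom_count(sequence):
    """Total pseudo-atoms: one backbone sphere plus the side chain spheres."""
    return sum(1 + SIDE_CHAIN_SPHERES[residue] for residue in sequence)


# Example: a four-residue peptide given as three-letter codes.
peptide = ["GLY", "ALA", "TYR", "ARG"]
print(pseudo_atom_count(peptide))
```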

Simplified Potentials

The potentials required can be split into three major groups: the virtual internal potentials, which stabilise the geometry of the protein; secondary structure stabilisation potentials; and the global potentials, which deal with the effects of the environment but do not require the environment to be modeled explicitly. The potential energy function for the model is defined as:

$$E_{\text{total}} = E_{\text{Internal}} + E_{\text{Secondary Structure}} + E_{\text{van der Waals}} + E_{\text{Global}}$$

Internal Potentials

The values of the parameters were derived by fitting observed distributions of the corresponding internals in experimental structures and by emulating the energy surface calculated using a full atom model. The internal energy is defined in terms of virtual bond, angles and torsions (or out of plane). A number of functional forms are used: the standard full-atom model harmonic terms, quadratic functions and Gaussian functions, plus combinations of these terms. Additionally an out of plane-virtual valence angle cross-term is defined.

$$E_{\text{Internal}} = E_{\text{V. bond}} + E_{\text{V. angle}} + E_{\text{V. torsion}} + E_{\text{V. oop}} + E_{\text{V. oop} \times \text{V. angle}}$$

Virtual angle - virtual angle - virtual torsion angle (theta-theta-phi) cross-terms are defined for dealing with correlations between the two internal valence angles of a torsion angle in the backbone. These are particularly important for turn conformations.

Secondary structure energy/Backbone Hydrogen bonding Potentials

With the simplified geometry model only C alpha atoms exist for the backbone and yet backbone hydrogen bonding is very important in the stabilisation of the standard secondary structures. However, the standard secondary structures have a fixed and specific set of distances between the C alpha atoms. Hence the basic approach was to determine the equilibrium distances between C alpha atoms in 3-10, alpha-helices and parallel and anti-parallel beta-sheets and to use Gaussian functions to stabilise these distances.

$$E_{\text{Secondary Structure}} = E_{\text{Helix}} + E_{\text{Sheet}}$$

For the beta-sheets it was also necessary to include some vector terms to ensure that the potential was strong only when the two strands were aligned. Further improvements to the sheet potentials were necessary, as trial folding runs made it clear that additions were needed to remove conformations that are never seen in real proteins. It should be noted that in all cases the secondary structure potentials merely stabilise distances that are found; this is not a pre-imposition of secondary structure. The beta-sheet potentials do a full search of all residue pairs to find any that are close enough to form sheets in each energy calculation.

Secondary structure "preference" energy

This term accounts for the observation that certain residues prefer a particular secondary structure. This is required in this model as this preference is due to local interactions between side chain atoms and the backbone atoms which are missing in this model. Ala, Lys, Arg, Glu, Gln, Leu and Met are assigned a helix preference, while Val, Ile, Thr, His, Phe, Tyr, Trp and Cys are assigned a strand preference. An overall preference for any residue has been added by stabilising virtual torsions and angles using i-i+2, i+1-i+3, and i-i+3 distances and Gaussian functions for both the helical and strand conformations. As individual residue conformations only affect the virtual valence angle, the overall preference is specifically increased only for contiguous pairs of residues which both prefer the helical conformation or both prefer the strand conformation. That is, the two central C alphas of a backbone virtual torsion must both prefer the helical or strand conformation to increase the secondary structure "preference" potential of the virtual torsion.

$$E_{\text{Secondary Structure Prediction}} = E_{\text{Turn}} + E_{\text{Strand}}$$

Global/Solvation Potentials

The remaining potentials are used to represent the non-bonded interactions of the residues with each other and the interactions with solvent. The fundamental idea behind the solvation potentials was to use fast approximations to the physical forces involved in real protein structures. Also, as molecular dynamics was seen as one of the primary tools to be used in the parameterisation procedure and for first attempts at protein folding, the potentials had to have analytical derivatives for speed.

Physical Model Solvation Potentials

In this potential model the physical forces of solvation were included using simple potential models. The main idea was that most protein atoms should not have an attractive interaction with other protein atoms, reflecting the fact that the real interactions with protein atoms would be replaced by solvent interactions if the atom became exposed; hence its overall energy would not change depending on whether it was buried or exposed. However, the atoms should still have excluded volume, so a repulsion potential is required at short distances.
An offset Lennard-Jones potential is used, where the well-depth is offset to 0 at the Lennard-Jones radius and the energy is set to 0 for distances between atoms greater than the Lennard-Jones radius. This potential is used for most atoms, in particular the C alpha backbone atoms and any atom which does not have a specific Lennard-Jones potential.

$$E_{\text{van der Waals}} = E_{\text{Offset Lennard-Jones}}$$

Physical Model Solvation Potentials - Hydrophobicity

The next effect to consider is the "hydrophobic" effect. This is separated into two parts, the Van der Waals potential between atoms (which is attractive) and effects due to interactions with water. Side chain atoms of hydrophobic groups were given a standard Lennard-Jones potential with an initial energy assignment for interactions between the same atom close to the enthalpy of vapourisation of the most similar hydrocarbon. This would reproduce the energy of the hydrophobic core when hydrophobic side chains are buried. This determined the potential between the same side chain atom types. For dissimilar side chain atom types, an analysis of the distribution of side chain atoms around an atom in known protein structures showed, to a first approximation, little difference in preference between the atoms. This distribution is not that which is created by rules such as the geometric mean rules. A function was created which would give such a distribution and this was used to generate the mixed terms for the Lennard-Jones parameters of hydrophobic side chain atoms. Unlike in previous CASP predictions, in CASP5 no additional hydrophobicity term was included. A final adjustment to the "hydrophobicity" potential was to give certain groups in residues not normally considered hydrophobic a non-zero Lennard-Jones function so that an interaction existed between them and hydrophobic groups. Such groups were the Ala C beta, the Thr C beta (because of the methyl group), the C beta of the charged amino-acids Asp, Glu, Lys and Arg and Asn and Gln.
It also included the C gamma atom of Lys and Arg. Observations of experimental structures and surface accessibility calculations show that these groups are as buried as any of the atoms in the classic hydrophobic side chains.

$$E_{\text{Global}} = E_{\text{van der Waals}} + E_{\text{hydrophobic sigmoid}}$$

Physical Model Solvation Potentials - Electrostatics

In these calculations an inverse Kirkwood-Tanford model is used, in which the electrostatic interactions between the charged groups are varied according to their local dielectric environment, defined by counting how many non-polar groups are surrounding them, using a sigmoid function. To take account of ionic strength effects, which are assumed to have an effect at large distances between charges but not at short range (as the Debye-Huckel theory on which this aspect is based assumes an averaged ionic atmosphere around each charge, which is certainly not true for charges on the surface of a folded protein), a distance dependent dielectric of the distance squared was used. Electrostatic interactions were computed using a distance cubed term and in addition the same term scaled by the sigmoid function of surrounding non-polar groups. Note that the scaled term accounts for salt-bridges automatically, as interactions between charged pairs not surrounded by non-polars (high dielectric) will be weak but strong when surrounded by non-polars (low dielectric). The distance cubed term was chosen such that at 4.0 angstrom the charge effect was approximately equivalent to two unit charges with dielectric 80 or so (approx. 1 kcal) whereas at 10-12 angstrom the charge interaction was reduced to less than 0.1 kcal.

$$E_{\text{Global}} \mathrel{+}= E_{\text{Electrostatic}} + E_{\text{scaled Electrostatic}}$$

The other feature of electrostatics that needs to be covered is the difficulty of burying charges, the self-energy. It is actually a much stronger rule of proteins that the charged group of charged residues is exposed than that the side chains of hydrophobic residues are buried. Charged groups are only buried if in a salt bridge or extensively hydrogen bonded. The simple electrostatic explanation for this is the self-energy of a charge, which says it requires a lot of energy to move a charge from a high dielectric region into a low dielectric region. As there is a big difference in surface accessibility between the 4 charged residues, Lys, Glu, Asp, and Arg, independent potentials are used for Lys, Asp/Glu and Arg. The Lysine charged end point is the most solvent exposed group of proteins, with an average relative surface accessible area greater than 50%. Glu is next followed by Asp, both in the 45% region, and Arg is the least exposed at around 35%. This is what would be expected based on charge density considerations, the self-energy being much greater for a charge field which is small and highly charged. The amine group of Lysine is the smallest charged group, with only one heavy atom; the carboxyl spreads the charge further, while the guanidinium group charge is spread over a very large area (four heavy atoms).
The same sigmoid function counting the number of non-polar groups surrounding a charge is used as before, scaled by a potential constant which gives a positive energy for burying a charge.

$$E_{\text{Global}} \mathrel{+}= E_{\text{charge-non-polar sigmoid}}$$

In CASP5 this treatment was extended to the polar side chains, Ser, Thr, Tyr, Trp, His, Asn and Gln. The same argument that applies to charges also applies to atoms with a partial charge; however, the self-energy is much weaker because overall the group is neutral. Recent test folding simulations have suggested that these groups are too easily buried with a zero Lennard-Jones potential, so they have had an additional self-energy term added.

$$E_{\text{Global}} \mathrel{+}= E_{\text{polar-non-polar sigmoid}}$$

Further additions in CASP5 were Lennard-Jones terms between polar side chains (Asn, Gln, Tyr, Ser) and the backbone CA, and between charged residues (Asp, Glu, Lys, Arg) and the backbone CA. These attempt to represent side chain-backbone hydrogen bonding interactions. Further, Lennard-Jones terms between polar and charged side chains (Tyr and Arg, Asp, Glu, Lys) were introduced. These attempt to represent polar-charged side chain hydrogen bonding interactions, although relatively weakly at present.

$$E_{\text{Global}} \mathrel{+}= E_{\text{polar-backbone}} + E_{\text{charged-backbone}} + E_{\text{polar-charged}}$$

Physical Model Solvation Potentials - Differential Dielectric Model

In the low dielectric environment of the folded protein the stability of the backbone-backbone hydrogen bonds is significantly enhanced, as these hydrogen bonds are excluded from solvent and a hydrogen bond is essentially an electrostatic interaction. In the unfolded protein the stability of backbone-backbone hydrogen bonds is likely to be similar to that of backbone-water hydrogen bonds; hence there should be no energy stabilising backbone hydrogen bonds. This effect has been included by scaling the backbone hydrogen bond energy term (E Helix and E Sheet) by a sigmoid function counting the number of surrounding non-polar groups.
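Two of the non-bonded terms described above, the offset Lennard-Jones potential and the distance-cubed electrostatic term, can be sketched numerically. This is an illustration under stated assumptions rather than the PROTSCAPE implementation: the "Lennard-Jones radius" is taken to be the position of the 12-6 minimum, the distance dependent dielectric is written as 5 r^2 so that it equals 80 at 4.0 angstrom, and 332.06 kcal angstrom/(mol e^2) is used as the Coulomb constant. All function names are hypothetical.

```python
COULOMB_KCAL = 332.06  # Coulomb constant in kcal*angstrom/(mol*e^2)


def offset_lj(r: float, epsilon: float, r_min: float) -> float:
    """12-6 Lennard-Jones potential shifted up by the well depth so that it
    is zero at r_min, and set to zero beyond r_min (purely repulsive)."""
    if r >= r_min:
        return 0.0
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 * ratio6 - 2.0 * ratio6) + epsilon


def cubic_electrostatic(r: float, q1: float = 1.0, q2: float = 1.0,
                        dielectric_scale: float = 5.0) -> float:
    """Coulomb energy with dielectric = dielectric_scale * r**2, which gives
    an overall 1/r**3 distance dependence (scale of 5.0 is an assumption)."""
    dielectric = dielectric_scale * r * r
    return COULOMB_KCAL * q1 * q2 / (dielectric * r)


# Energies (kcal/mol) for two unit charges at the distances quoted above.
for distance in (4.0, 10.0, 12.0):
    print(distance, round(cubic_electrostatic(distance), 4))

# The offset potential is repulsive inside r_min and vanishes outside it.
print(offset_lj(3.5, 0.2, 4.0), offset_lj(4.0, 0.2, 4.0), offset_lj(6.0, 0.2, 4.0))
```

Under these assumptions the electrostatic term is close to 1 kcal at 4.0 angstrom and below 0.1 kcal at 10 to 12 angstrom, consistent with the values quoted in the text.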
Folding Simulations - Simulated Annealing procedure

The starting conformation is an all-extended structure using a rigid geometry procedure based on a standard geometry for the RR model. A random Maxwell-Boltzmann distribution is used to assign initial velocities. The initial temperature is set such that the average temperature initially is 380 K. The annealing protocol was first to reduce the total energy by the energy equivalent to 25 K in 84000 steps, followed by 84000 steps at constant total energy. This was repeated three times. 84000 steps at constant total energy followed and then the energy was reduced by 12.5 K in 84000 steps. The total energy was then continuously reduced by 3.125 K in 84000 steps until the average temperature was around 175 K. This was followed by 5 runs at a constant temperature of 175 K. Final annealing to close to 0 K was done in 4 runs of 84000 steps cooling by 25 K each run, followed by a final 110 K cooling run of 84000 steps.

The average energies of the final run at the constant temperature of 175 K were used to determine which structures to select; the five lowest-energy structures were submitted in energy order, i.e. model 1 is the lowest energy structure.

Pan (P0032) - 164 predictions: 99 3D, 65 SS

Secondary Structure Prediction

Yang Han and Xian-Ming Pan Institute of Biophysics, CAS [email protected]

Protein secondary structure predictions have been performed by various methods [1-17]. We have employed a multiple linear regression method [18] developed in our previous work to predict protein secondary structures of the targets in CASP5.

It is known that protein secondary structure prediction can be improved by exploiting the evolutionary information from protein families [17,19-22]. With the application of multiple sequence alignment profiles, we also hope to improve our prediction results.

Algorithms

For each of CASP5 sequences, a multiple sequence alignment was constructed using PSIBLAST at first. The protein sequence databases searched by PSIBLAST were SWISS-PROT 40, TrEMBL 21 and PDB with three iterations. Each group of sequences that we got were then screened in order to exclude the sequences which had too high or poor homology. Moreover, the gaps in the query sequences were deleted to improve the prediction. CLUSTALW was then executed with default parameters to generate multiple sequence alignments.

We used four profiles, extracted from the two alignments described above, as inputs to the multiple linear regression algorithm:

1. Generating the FASTA format of multiple alignment files from the results of CLUSTALW.

2. Using HMMER2 [23] package to generate position specific profiles from alignments of CLUSTALW.

3. The simple frequency counts for each amino acid in the PSIBLAST alignment expressed as percentage of the total for a given column.

4. Each amino acid residue in PSIBLAST alignment was scored by its corresponding BLOSUM62 matrix score. The scores were averaged based on the number of sequences in that column.
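The third profile in the list above, the per-column frequency counts, can be sketched as follows. The handling of gaps (ignored when computing the column total) is an assumption, as is the toy alignment; the function name is illustrative.

```python
from collections import Counter
from typing import Dict, List


def column_frequencies(alignment: List[str]) -> List[Dict[str, float]]:
    """Percentage of each amino acid in every alignment column.

    Gap characters ('-') are excluded from the column total, which is an
    assumption of this sketch.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: 100.0 * n / total for aa, n in counts.items()}
                       if total else {})
    return profile


# Hypothetical four-sequence alignment with one gap in the last column.
alignment = ["MKV", "MRV", "MK-", "LKV"]
profile = column_frequencies(alignment)
print(profile)
```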

The prediction was then run with each of these four profiles as input. The outputs were assessed for consensus at each position. Positions where the predicted states were in full agreement were taken as the final prediction. For positions where the predictions disagreed, the most frequent prediction was taken as the final prediction.
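The consensus step can be sketched as a per-position majority vote. How ties were resolved is not stated, so this sketch falls back on the state from the earliest prediction among the tied ones, which is an assumption; the example strings are hypothetical.

```python
from collections import Counter
from typing import List


def consensus(predictions: List[str]) -> str:
    """Per-position majority vote over equal-length prediction strings.

    Ties are resolved in favour of the earliest prediction in the list
    (an assumption; the tie rule is not given in the text).
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must have the same length")
    result = []
    for states in zip(*predictions):
        counts = Counter(states)
        top = max(counts.values())
        # First state, in input order, that reaches the top count.
        result.append(next(s for s in states if counts[s] == top))
    return "".join(result)


# Hypothetical predictions obtained from the four profiles.
preds = ["HHHCEE", "HHHCEE", "HHCCEE", "HHCCEC"]
final = consensus(preds)
print(final)
```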

1. Maggio E.T. and Ramnarayan K. (2001). Recent developments in computational proteomics. Trends in Biotech. 19, 266-272.
2. Cuff J.A. and Barton G.J. (1999). Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34, 508-519.
3. Rost B. (1996). PHD: predicting one-dimensional protein structure by profile based neural networks. Methods Enzymol. 266, 525-539.
4. Przybylski D. and Rost B. (2002). Alignments grow, secondary structure prediction improves. Proteins 46, 197-205.
5. Ouali M. and King R.D. (2000). Cascaded multiple classifiers for secondary structure prediction. Protein Sci. 9, 1162-1176.
6. Jones D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195-202.
7. Karplus K., Barrett C., Cline M., Diekhans M., Grate L. and Hughey R. (1999). Predicting protein structure using only sequence information. Proteins Suppl. 3, 121-125.
8. Baldi P., Brunak S., Frasconi P., Soda G. and Pollastri G. (1999). Exploiting the past and the future in protein secondary structure prediction. Bioinformatics 15, 937-946.
9. Rost B. and Sander C. (1996). Bridging the protein sequence-structure gap by structure predictions. Annu. Rev. Biophys. Biomol. Struct. 25, 113-136.
10. Rost B. and O’Donoghue S.I. (1997). Sisyphus and prediction of protein structure. CABIOS 13, 345-356.
11. Szent-Györgyi A.G. and Cohen C. (1957). Role of proline in polypeptide chain configuration of proteins. Science 126, 697.
12. Rost B. and Sander C. (1995). Progress of 1D protein structure prediction at last. Proteins 23, 295-300.
13. Rost B. (1997). Better 1D predictions by experts with machines. Proteins Suppl. 1, 192-197.
14. Eyrich V.A., Marti-Renom M.A., Przybylski D., Madhusudhan M.S., Fiser A., Pazos F., Valencia A., Sali A. and Rost B. (2001). EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics 17, 1242-1243.
15. Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M. and Barton G.J. (1998). JPred: a consensus secondary structure prediction server. Bioinformatics 14, 892-893.
16. McGuffin L.J., Bryson K. and Jones D.T. (2000). The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405.
17. Rost B. and Sander C. (1993). Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584-599.
18. Pan X.M. (2001). Multiple linear regression for secondary structure prediction. Proteins 43, 256-259.
19. Zvelebil M.J.J.M., Barton G.J., Taylor W.R. and Sternberg M.J.E. (1987). Prediction of protein secondary and active sites using alignment of homologous sequences. J. Mol. Biol. 195, 957-961.
20. King R.D. and Sternberg M.J.E. (1996). Identification and application of the concepts important for accurate and reliable protein secondary structure prediction. Protein Sci. 5, 2298-2310.
21. Salamov A.A. and Solovyev V.V. (1995). Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignments. J. Mol. Biol. 247, 11-15.
22. Frishman D. and Argos P. (1996). Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence. Protein Eng. 9, 133-142.
23. Eddy S.R. (1999). HMMer2. http://hummer.wustl.edu/.

Pas (P0513) - 73 predictions: 73 3D

Multimethod Protein Structure Prediction

J. Pas Independent predictor [email protected]

To determine whether the structure of a target protein could be predicted by homology modeling, a PSI-BLAST [1] search was carried out against the non-redundant protein sequence database. PSI-BLAST iterations were performed using a manual inclusion/exclusion procedure. A multiple sequence alignment was then built with the CLUSTALW [2] program from proteins selected from the PSI-BLAST profile. All alignments were manually inspected.

The template selection was confirmed using the structure prediction METASERVER [3]. The METASERVER was also used to choose a template when PSI-BLAST searches found no significant hits.

In addition, other available information was used in an attempt to link the target with a protein of known structure. This consisted mainly of literature searches, known metabolic pathways, gene expression data, position on the chromosome, the distribution of folds in the organism, and secondary structure prediction.

Selected target–template structural alignments were visually inspected in Swiss-PdbViewer and modified if necessary. 3D molecular models were then built using both the SWISS-MODEL [4] and MODELLER [6] programs. Initial models were subjected to detailed evaluation, mainly by additional visual inspection of structural consistency and with the Verify 3D program [5]. The same evaluation procedure was performed for the final models.

More than one template protein was used where possible, after superimposition of their molecular structures using the 3d hit program [Plewczynski in press]. During the modeling procedure, superimposition of the initial models was used to find the best possible backbone conformation.

The overall quality of each modeled structure was evaluated in detail with the Verify 3D program.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting. Nucleic Acids Res. 22, 4673-4680.
3. Bujnicki J.M., Elofsson A., Fischer D., Rychlewski L. (2001) Structure prediction meta server. Bioinformatics 17 (8), 750-751.
4. Guex N., Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis 18 (15), 2714-2723.
5. Luthy R., Bowie J.U., Eisenberg D. (1992) Assessment of protein models with three-dimensional profiles. Nature 356, 83-85.
6. Sali A., Blundell T.L. (1993) Comparative protein modeling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815.

PILOT (P0378) - 146 predictions: 146 3D

PILOT: a Fold Recognition Server Based on PSI-BLAST, IMPALA and Libra-Rotamer

K. Tomii1, M. Ota2, T. Noguchi1, and Y. Akiyama1 1 – Computational Biology Research Center National Institute of Advanced Industrial Science and Technology 2 – Global Scientific Information and Computing Center Tokyo Institute of Technology [email protected]

We have constructed an automated fold recognition server, PILOT, which integrates the prediction results of PSI-BLAST [1], IMPALA [2], and Libra-rotamer [3]. To estimate the prediction accuracy of this server, we submitted its prediction results for the CASP5 targets.

The PILOT server has two characteristic features. (1) 1D-PSSMs and 3D-PSSMs are combined in IMPALA. (2) Candidates are re-evaluated using threading potentials through the ‘remount’ procedure [4]. With these methods, we have slightly improved the recognition rate of protein structural similarity. Template sequences whose E-values, as estimated by PSI-BLAST and IMPALA, are less than or equal to 10 are selected as candidates.

The template sequences are derived from release 1.59 of the SCOP [5] database (15 May 2002). These templates have 50% or lower sequence identity with one another, a length of 40 residues or more, and no chain breaks in the structure. They are selected according to the priority specified by our automatic selection system “PDB-REPRDB” [6].

From preliminary results, we have assigned seven regions in the E-value/Z-score plane according to their confidence level. Here, the E-value is estimated by PSI-BLAST and/or IMPALA, and the Z-score is calculated through the ‘remount’ process. Final results are decided and reported using this plane.

The server is available at http://www.cbrc.jp/pilot/.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
2. Schäffer A.A. et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15 (12), 1000-1011.
3. Ota M. et al. (2001) Knowledge-based potential defined for a rotamer library to design protein sequences. Protein Eng. 14 (8), 557-564.
4. Ota M. et al. (1999) Feasibility in the inverse protein folding protocol. Protein Science 8 (5), 1001-1009.
5. Lo Conte L. et al. (2002) SCOP database in 2002: refinements accommodate structural genomics. Nucleic Acids Res. 30 (1), 264-267.
6. Noguchi T. et al. (2001) PDB-REPRDB: a database of representative protein chains from the Protein Data Bank (PDB). Nucleic Acids Res. 29 (1), 219-220.

POMI (P0465) - 46 predictions: 46 3D

Building Full Atom Models by Using Alignment from Fold Recognition Methods.

Rubinstein Rotem Ben-Gurion University [email protected]

Full atom models were built using the CAFASP fold-recognition servers' results and SwissPdb Viewer. The goal was to test the accuracy of a modeling strategy based on alignments taken from the CAFASP servers. The focus was on those targets that cannot be modeled automatically by Swiss-Model because of their low sequence similarity to known structures, but that had relatively confident scores reported by the fold-recognition servers.

Comparative modeling methods give us models for less than 20% of protein sequences. The fully automatic Swiss-Model server supplied results for twenty targets in CASP5. It is therefore important to use other methods, such as fold recognition and ab initio prediction, which broaden the number of targets for which we get results. However, many of these methods do not supply full atom models. I tried to address this problem by modeling with alignments from fold recognition servers. Alignments were selected by comparing the scores of the servers’ models with the confidence thresholds compiled from LiveBench4. The modeling was carried out with the SwissPdb Viewer.

1. Guex N. and Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling. Electrophoresis 18, 2714-2723.
2. http://www.expasy.org/spdbv/
3. http://www.cs.bgu.ac.il/~dfischer/CAFASP3/summaries/thresholds.html

Preissner (P0488) - 20 predictions: 20 3D

Loop Modeling with the Aid of LIP

E. Michalsky, A. Goede and R. Preissner Berlin Center of Genome Based Bioinformatics, Charité, Medical Faculty of the Humboldt University Berlin, Germany [email protected]

We participated in this year’s CASP experiment to evaluate the applicability and quality of our new loop construction procedure. Here we focus on this algorithm and sketch the remainder of the conventional homology modeling procedure.

One of the most important and challenging parts of protein modeling is the prediction of loops, as can be seen from the large variety of existing approaches. For example, van Vlijmen et al. [1] present a knowledge-based approach in which a set of loops is selected from a database, followed by a constrained optimization of the loop orientation and ranking by means of an energy function. Another class of approaches is the so-called ab initio methods. Fiser et al. [2] optimize the positions of all nonhydrogen atoms with respect to a pseudo energy function, supplemented with statistical preferences for dihedral angles and for nonbonded atomic contacts. The algorithm of Tosatto et al. [3] is based on a divide and conquer approach in which the target loop is recursively decomposed until the conformations of the resulting segments can be compiled analytically, and it uses a database of possible conformations for loop segments. The conformations were anticipated using a list of (phi,psi)-angle pairs extracted from the PDB. CODA, an algorithm presented by Deane et al. [4], combines a knowledge-based and an ab initio method by clustering the predictions of the two algorithms and making a consensus prediction using a set of filters.

To handle gaps, i.e. insertions as well as deletions in the alignment, we used the tool LIP (Loops In Proteins) [5]. The program LIP is based on a comprehensive compilation of approximately 108 backbone conformations from a recent version of the Protein Data Bank [6]. In the first step, protein segments are selected that fit approximately into the gap in the protein structure and that have the required number of amino acids. To evaluate the fit, a goodness value is calculated for each segment. The goodness is defined as the RMSD between a loop candidate and the gap in the protein structure with respect to the distance between the stem residues and certain dihedral angles. This extraction procedure takes at most 15 seconds on a standard PC.

Thereafter, the selected protein segments are evaluated using an optimized scoring function. Besides the goodness, it includes additional values, i.e. the RMSD between the stem residues as well as a sequence alignment score based on a modified BLOSUM mutation matrix. In particular, exchanges of glycine and proline with other amino acids are treated individually. Clashes of the new loop with the core of the protein are avoided. The best-ranked segment is inserted into the gap between adjacent secondary structures, followed by a local geometry optimization. For the homology modeling approach LIP is combined with several publicly available tools.
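The ranking step above can be illustrated with a small sketch. It assumes a plain weighted sum of the three quantities named in the text and a hard filter on clashes; the class `LoopCandidate`, the weights and the example values are all hypothetical, and the actual LIP scoring function is not reproduced here.

```python
from dataclasses import dataclass

@dataclass
class LoopCandidate:
    name: str
    goodness: float        # fit of the segment to the gap (lower is better)
    stem_rmsd: float       # RMSD over the stem residues (lower is better)
    sequence_score: float  # substitution-matrix score (higher is better)
    clashes: bool          # True if the loop collides with the protein core

def rank_candidates(candidates, w_good=1.0, w_stem=1.0, w_seq=0.1):
    """Drop clashing loops, then sort by a weighted score (lower is better)."""
    usable = [c for c in candidates if not c.clashes]

    def score(c):
        # Hypothetical linear combination; the weights are placeholders.
        return w_good * c.goodness + w_stem * c.stem_rmsd - w_seq * c.sequence_score

    return sorted(usable, key=score)

candidates = [
    LoopCandidate("a", goodness=0.5, stem_rmsd=0.4, sequence_score=10.0, clashes=False),
    LoopCandidate("b", goodness=0.3, stem_rmsd=0.3, sequence_score=2.0, clashes=True),
    LoopCandidate("c", goodness=0.6, stem_rmsd=0.5, sequence_score=4.0, clashes=False),
]
ranked = rank_candidates(candidates)
```

The first element of `ranked` would be the segment inserted into the gap before local geometry optimization.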

The first step in homology modeling is to identify suitable templates. For this purpose, we searched the Protein Data Bank [6] with the alignment search tools BLAST and PSI-BLAST [7]. A proper alignment between the target and template amino acid sequences is one of the major components in a structure constructed by comparative modeling. To obtain reasonable alignments using all available protein family information, we used STRAP, a tool for generating multiple structure based alignments that was developed in our research group [8]. Mutations, side chain rotamer selection and energy minimizations were performed by means of the protein visualization and modeling tool Swiss-PdbViewer, version 3.7b [9].

LIP is embedded in a graphical user interface and will be available after publication. A demo version for Windows can be downloaded from http://www.protein-design.com.

1. van Vlijmen H.W.T. et al. (1997) PDB-based Protein Loop Prediction: Parameters for Selection and Methods for Optimization. J. Mol. Biol. 267 (4), 975-1001.
2. Fiser A. et al. (2000) Modeling of loops in protein structures. Protein Sci. 9 (9), 1753-1773.
3. Tosatto S.C.E. et al. (2002) A divide and conquer approach to fast loop modeling. Protein Eng. 15 (4), 279-286.
4. Deane C.M. et al. (2001) CODA: A combined algorithm for predicting the structurally variable regions of protein models. Protein Sci. 10 (3), 599-612.
5. Michalsky E. et al. (in preparation).
6. Berman H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res. 28, 235-242.
7. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.
8. Gille C. et al. (2001) STRAP: editor for STRuctural Alignments of Proteins. Bioinformatics 17 (4), 377-378.
9. Guex N. et al. (1997) SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling. Electrophoresis 18, 2714-2723.

Protfinder (P0282) - 222 predictions: 222 3D

Sequence-Structure Alignments With the Protfinder Algorithm

U. Bastolla Centro de Astrobiologia (INTA_CSIC), Madrid, Spain [email protected]

The Protfinder algorithm predicts protein structures by aligning the query sequence to candidate structures in the PDB. Alignments are evaluated through a minimal model of protein folding, which reproduces approximately some key features of protein thermodynamics and is very convenient for rapid computation.

Information on sequence homology is not used in the scoring function. Nevertheless, when sequence homologs are present in the structure database, they are in almost all cases predicted as the best scoring alignment.

Protein structures are represented as contact maps and their effective intramolecular interactions are modeled as a sum of contact interactions. We use the contact energy function optimized in Ref. [1], which assigns lowest energy to the experimentally known native structure for almost every sequence of monomeric protein whose structure has been determined by X-ray crystallography, except small fragments and chains with large cofactors. Moreover, it generates well-correlated energy landscapes, in the sense that structures very dissimilar from the native one have energies much higher than the native energy. This property is crucial for protein structure prediction. The effective energy function is also able to estimate the folding free energies of a set of small proteins folding with two-state thermodynamics, with reasonable agreement with experimental data [2].

The scoring function consists of three elements: the effective energy function described above, a chain entropy term estimated in Ref. [2] and a term penalizing gaps in the alignment. Gaps in secondary structure elements are strictly forbidden. Gaps in the structure are allowed only if the two residues that are shortcut are close in space and the angles characterizing their pseudo-peptidic bond lie within a predefined range. Gaps in the sequence are allowed only on the surface of the protein, which is identified by the fact that the number of contacts per residue is smaller than a threshold. Allowed gaps receive an energetic penalty G0 plus a penalty G1 for each residue in the gap.
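The gap term lends itself to a short worked example. The sketch below assumes the three elements are simply summed and that lower totals are better; the function names and the numerical values of G0 and G1 are placeholders, since the text does not give them.

```python
def gap_penalty(gap_lengths, g0, g1):
    """Penalty G0 for each allowed gap plus G1 for each residue in it."""
    return sum(g0 + g1 * length for length in gap_lengths)

def alignment_score(contact_energy, chain_entropy_term, gap_lengths,
                    g0=2.0, g1=0.5):
    """Sum of the three elements of the scoring function.

    Assumption: the terms are additive and lower is better. The default
    G0 and G1 values are illustrative only.
    """
    return contact_energy + chain_entropy_term + gap_penalty(gap_lengths, g0, g1)

# Two gaps, of 3 and 1 residues: (2.0 + 1.5) + (2.0 + 0.5) = 6.0
penalty = gap_penalty([3, 1], g0=2.0, g1=0.5)
total = alignment_score(-40.0, 5.0, [3, 1])
```

Forbidden gaps (for example inside secondary structure elements) would never reach this function; they are excluded when alignments are generated.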

To speed up the computation, each structure in the PDBSELECT [3] non-redundant subset of the PDB was preprocessed to produce its contact map and the list of allowed shortcuts in the structure. Secondary structure was obtained from the DSSP file [4] when available, otherwise from the PDB file. The few structures for which no secondary structure assignment could be obtained were discarded. Preprocessing, together with the fact that the code uses mostly integer arithmetic, speeds up the computation considerably.

To search for the optimal alignment, we use a stochastic version of the deterministic Build-up algorithm developed by Park and Levitt to look for low energy configurations of discrete protein models [5]. The algorithm is very efficient at finding high-scoring alignments, although it is not guaranteed to find the global optimum.

The algorithm starts by generating all possible gapless alignments of length l between the query sequence and the test structure and stores the M alignments with maximum score. At each subsequent step, an attempt is made to add the residue at position k in the sequence to the M alignments. There are three possibilities: either the residue is aligned to the next structural position, or it is aligned introducing a gap in the structure (if allowed), or the residue is not aligned, initiating a gap in the sequence. All possible continuations are generated, and the M best scoring alignments are stored in memory and used as seeds for the next step. The algorithm is iterated until no other residue can be added.

Some tricks are used to improve the efficiency of the algorithm: 1) The algorithm is first applied using a small value M=50 to rapidly scan the whole database. The 200 proteins with the best alignments are then stored in memory and used for a second, more accurate search with M=800. 2) Instead of using the deterministic algorithm described above, we select the M alignments at each step based on the sum of their score plus a random number. The relative importance of the randomness is large in the first steps, allowing the algorithm to visit a larger fraction of the alignment space instead of constructing very similar alignments. The randomness then decreases as the alignments get longer, so that the choice of the complete alignment is made on the basis of the deterministic score. 3) Since the construction of the starting fragment is the most delicate step, the algorithm is applied using two or three different values of l.
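The stochastic build-up search can be sketched as a beam search with decaying noise. This is a simplified illustration, not the Protfinder code: it starts from every structural offset with an empty alignment instead of gapless fragments of length l, it treats higher scores as better, it allows every gap, and `pair_score`, `g0`, `g1` and `noise0` are hypothetical stand-ins for the real scoring terms.

```python
import random

def build_up(query, n_positions, pair_score, M=50, g0=2.0, g1=0.5,
             noise0=1.0, seed=0):
    """Keep the M best partial alignments at each step (higher is better)."""
    rng = random.Random(seed)
    # A partial alignment is (score, next free structure position, pairs),
    # where pairs[k] is the structure position of residue k, or None.
    beam = [(0.0, start, []) for start in range(n_positions)]
    for k, residue in enumerate(query):
        candidates = []
        for score, j, pairs in beam:
            # (a) Residue k stays unaligned: open or extend a sequence gap.
            extending = bool(pairs) and pairs[-1] is None
            cost = g1 if extending else g0 + g1
            candidates.append((score - cost, j, pairs + [None]))
            # (b) Residue k is aligned to the next structural position.
            if j < n_positions:
                candidates.append(
                    (score + pair_score(residue, j), j + 1, pairs + [j]))
            # (c) One structural position is skipped: a gap in the structure.
            if j + 1 < n_positions:
                candidates.append(
                    (score + pair_score(residue, j + 1) - (g0 + g1),
                     j + 2, pairs + [j + 1]))
        # The noise amplitude decreases linearly as residues are placed.
        noise = noise0 * (1.0 - k / len(query))
        candidates.sort(key=lambda c: c[0] + rng.uniform(-noise, noise),
                        reverse=True)
        beam = candidates[:M]
    # The final choice uses the deterministic score only.
    return max(beam, key=lambda c: c[0])

# Toy scoring: +1 for an identical letter, -1 otherwise.
structure = "XXACDEXX"

def toy_pair_score(residue, position):
    return 1.0 if structure[position] == residue else -1.0

best = build_up("ACDE", len(structure), toy_pair_score, M=50, noise0=0.0)
```

Setting `noise0=0.0` recovers the deterministic build-up; a two-pass scan as described above would call `build_up` once with a small `M` over all structures and again with a larger `M` on the best hits.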

Each candidate structure receives the score of its best alignment. The best scoring structure is used as the prediction. The goodness of the prediction is estimated through the normalized energy gap, a parameter measuring the difference between the best score and the score of an alternative structure in units of the best score, divided by the structural distance between the best scoring structure and the alternative structure. If the minimal value of the normalized energy gap over all alternative structures is large (larger than 0.2), the prediction is reliable. If it is small, alignments with very different structures have scores quite similar to the best one, and reliability is very low.
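A minimal sketch of this reliability estimate follows. The 0.2 threshold comes from the text; the function names, the use of absolute values, and the assumption that the structural distance is a positive number supplied by the caller are illustrative choices.

```python
def normalized_energy_gap(best_score, alternatives):
    """Minimum over alternatives of (|best - alt| / |best|) / distance.

    `alternatives` is a list of (score, structural_distance_to_best) pairs.
    """
    gaps = []
    for alt_score, distance in alternatives:
        if distance <= 0:
            continue  # skip structures indistinguishable from the best one
        relative_gap = abs(best_score - alt_score) / abs(best_score)
        gaps.append(relative_gap / distance)
    return min(gaps)

def is_reliable(best_score, alternatives, threshold=0.2):
    """True when the minimal normalized gap exceeds the threshold."""
    return normalized_energy_gap(best_score, alternatives) > threshold

# Hypothetical scores and distances for two alternative structures.
alternatives = [(-30.0, 0.8), (-45.0, 0.25)]
gap = normalized_energy_gap(-50.0, alternatives)
reliable = is_reliable(-50.0, alternatives)
```

Dividing by the distance means that a close competitor only lowers the reliability if it is also structurally different from the best hit.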

1. Bastolla U. et al. (2000) A statistical mechanical method to optimize energy functions for protein folding. Proc. Natl. Acad. Sci. USA 97, 3977-3981.
2. Bastolla U. Testing the thermodynamics of a minimal model of protein folding, in preparation.
3. Hobohm U. and Sander C. (1994) Enlarged representative set of protein structures. Protein Sci. 3, 522-524.
4. Kabsch W. and Sander C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22 (12), 2577-2637.
5. Park B.H. and Levitt M. (1995) The complexity and accuracy of discrete state models of protein structure. J. Mol. Biol. 249, 493-507.

PROTINFO-AB (P0140) - 260 predictions: 260 3D

An Automated Approach for De Novo Structure Prediction

Ram Samudrala, Shing-Chung Ngan University of Washington [email protected], [email protected]

Our general paradigm for predicting structure involves sampling the conformational space (or generating "decoys") such that native-like conformations are observed, and then selecting them using a hierarchical filtering technique using many different scoring functions. Our goal was to devise a method that would combine the best aspects of the more successful ab initio methods at the previous CASP experiments. There are three stages to our approach:

1. SECONDARY STRUCTURE PREDICTION: The consensus of the secondary structure predictions from the various servers at the CAFASP meta-server was used as the secondary structure prediction.

2. SEARCHING PROTEIN CONFORMATIONAL SPACE: We initially start with an all-atom conformation where residues predicted to be in helix/sheet by the consensus secondary structure prediction are set to idealised helix and sheet values. The remaining φ/ψ values are set in an extended conformation. Side chain conformations are predicted by simply using the most frequently observed rotamer in a database of protein structures [1]. New conformations are generated by perturbing the existing conformation at a random residue position using either a value from a 14-state φ/ψ model, or replacing three φ/ψ values for three residues with identical sequence which are obtained from a database of known structures. The optimization function used is primarily a combination of an all-atom distance-dependent conditional probability discriminatory function (rapdf) and a hydrophobic compactness function (hcf) [2,3]. The fitness of the conformations was optimised by using two different protocols: a straightforward Monte Carlo/simulated annealing approach [4] combined with a Genetic Algorithm strategy, and a conformational space annealing approach [5]. A combination of minimisation parameters and scoring functions was used to generate a large pool of conformations.
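The Monte Carlo/simulated annealing part of this search can be sketched as follows. This is a generic Metropolis loop, not the authors' protocol: the three torsion states stand in for the 14-state model, the toy score stands in for the rapdf/hcf combination, the cooling schedule is an assumption, and the fragment moves, Genetic Algorithm and conformational space annealing are omitted.

```python
import math
import random

def anneal(initial, states, score, steps=2000, t_start=2.0, t_end=0.05, seed=0):
    """Metropolis Monte Carlo with geometric cooling (lower score is better)."""
    rng = random.Random(seed)
    current = list(initial)
    current_score = score(current)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric interpolation from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        trial = list(current)
        # Perturb one random residue with a state from the discrete model.
        trial[rng.randrange(len(trial))] = rng.choice(states)
        trial_score = score(trial)
        delta = trial_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann odds.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score

# Hypothetical (phi, psi) states: helix-like, extended, left-handed.
STATES = [(-60.0, -45.0), (-120.0, 130.0), (60.0, 45.0)]

def toy_score(conformation):
    """Placeholder objective: number of residues not in the helix-like state."""
    return sum(1 for state in conformation if state != STATES[0])

start = [STATES[1]] * 10  # fully extended starting conformation
best, best_score = anneal(start, STATES, toy_score)
```

Running the loop many times with different seeds, temperatures and scoring weights is one way to build the large pool of conformations that the selection stage then filters.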

3. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations generated were minimised using ENCAD [6] and scored using a combination of scoring functions that hierarchically reduces the total number of conformations produced to five which are used for the final submissions. The scoring functions used for the final filtering include a simple residue-residue contact function (Shell), a density-scoring function that is based on the distance of a conformation to all its relatives in the conformation pool, a secondary structure based scoring function that evaluates the match between the predicted structure and the secondary structure of a final energy-minimised conformation, and standard physics-based electrostatics and van der Waals terms.
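One plausible reading of the density-scoring function is the average distance from each conformation to every other conformation in the pool. The sketch below assumes exactly that; the text only says the score is based on these distances, so the use of the mean, the function name and the example matrix are assumptions.

```python
def density_scores(distance):
    """Mean distance from each conformation to all others in the pool.

    `distance` is a symmetric n x n matrix of pairwise distances (for
    example RMSD values). A lower mean indicates a denser neighbourhood.
    """
    n = len(distance)
    return [
        sum(distance[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]

# Hypothetical pairwise distances between three conformations.
pool = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 3.0],
    [4.0, 3.0, 0.0],
]
scores = density_scores(pool)
most_central = min(range(len(scores)), key=scores.__getitem__)
```

In a hierarchical filter, a score like this would be one of several criteria applied in turn to shrink the pool.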

This work is an attempt at combining the best de novo prediction methods from the previous CASP experiments [2-5]. In addition, there are components that are unique to this approach primarily in the form of the hierarchical filtering methodology employed, the density scoring function, and in subtle variations of each of the search methods.

1. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain construction on near-native main chains for ab initio protein structure prediction. Prot Eng 7: 453-457.
2. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol 300: 171-185.
3. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind protein structure predictions. BMC Structural Biology 2: 3-18.
4. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J Mol Biol 268: 209-225.
5. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein folding by conformational space annealing and an off-lattice united-residue force field: application to the 10-55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030.
6. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.

PROTINFO-CM (P0138) - 251 predictions: 251 3D

An Automated Approach for the Comparative Modeling of Protein Structure

Ram Samudrala University of Washington [email protected]

The interconnected nature of interactions in protein structures, thorough sampling of side chain and main chain conformations, and devising a discriminatory function that can distinguish between correct and incorrect conformations are the major hurdles preventing the construction of accurate homology models. We present an algorithm that uses graph theory to handle the problem of interconnectedness. Sampling of side chain and main chain conformations is accomplished by exhaustively enumerating all possible choices using a discrete state model, including fragments from a database of protein structures. The optimal combination of these possibilities is selected using an all-atom scoring function aided by the graph-theoretic approach.

Following is a brief description of the components and steps of this method, which can be divided into: discriminatory function, identification of template and generation of alignment, initial model building, construction of variable main chain and side chain regions, and moving models closer to the native conformation.

1. DISCRIMINATORY FUNCTION: The function used throughout is generally an all-atom distance-dependent conditional probability discriminatory function based on a statistical analysis of known protein structures. The negative log of the conditional probability of observing two atoms interact given a particular distance is used as a "pseudo-energy" term [1].
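The idea of turning observed frequencies into a pseudo-energy by taking a negative log can be shown with a toy table. This is a deliberately simplified sketch: it uses plain smoothed frequencies of distance bins for one atom-type pair, whereas the normalisation of the actual function is described in reference [1]. The table layout, the pseudocount and the counts are all hypothetical.

```python
import math

def pseudo_energy(counts, pair, distance_bin, pseudocount=1.0):
    """Negative log of the smoothed frequency of a distance bin for a pair.

    `counts[pair]` is a list of observation counts, one per distance bin.
    The pseudocount avoids log(0) for bins that were never observed.
    """
    bins = counts[pair]
    total = sum(bins) + pseudocount * len(bins)
    probability = (bins[distance_bin] + pseudocount) / total
    return -math.log(probability)

# Hypothetical counts for one atom-type pair over three distance bins.
counts = {("CA", "CB"): [8, 1, 1]}
common = pseudo_energy(counts, ("CA", "CB"), 0)
rare = pseudo_energy(counts, ("CA", "CB"), 2)
```

Frequently observed distances receive a low pseudo-energy and rare ones a high pseudo-energy; a model's total score would sum such terms over its atom pairs.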

2. IDENTIFICATION OF TEMPLATE AND GENERATION OF ALIGNMENT: The CAFASP meta-server data (http://bioinfo.pl/cafasp) were used to identify the template proteins that a given target sequence was related to (based on a consensus of all the hits produced by the different servers). The templates were then fed into a multiple sequence alignment method (CLUSTALW [2]) and the pairwise alignments between the target and each of the templates were used to construct initial models. The initial models were then ranked by our discriminatory function and the models that ranked highest were used for further model-building. In addition to these initial models, a model based on the alignment derived from a structure comparison of the best scoring model output from our de novo fold generation method (see the abstract for PROTINFO-FR) to the corresponding template structure was also used.

3. INITIAL MODEL BUILDING: Following the sequence alignment, for each parent structure, an initial model was generated by copying atomic coordinates for the main chain (excluding any insertions) and for the side chains of residues that are identical in the target and parent structures. Residues that differ in type were constructed using a minimum perturbation technique. The MP method changes a given amino acid to the target amino acid preserving the values of equivalent chi angles between the two side chains, where available. The other chi angles are constructed by the MP method using an internally developed library based on residue type.

4. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN REGIONS: Main chain sampling is performed using an exhaustive enumeration technique based on discrete states of φ/ψ angles. For longer main chain regions, we use fragments (3-tuples) from a database of protein structures to generate the discrete φ/ψ angles.

Side chain possibilities are generated by selecting the most probable side chain rotamers based on the interactions of a given rotamer with the local main chain (evaluated using the discriminatory function above) [3]. Side chain possibilities were also constructed using the program SCWRL [4].

We then use a graph-theoretic approach to assemble the sampled side chain and main chain conformations together in a consistent manner. Each possible conformation of a residue is represented as a node in a graph. Each node is given a weight based on the degree of the interaction between its side chain atoms and the local main chain atoms. The weight is computed using an all-atom conditional probability discriminatory function. Edges are then drawn between pairs of residues/nodes that are consistent with each other (i.e., clash-free and satisfying geometrical constraints). The edges are also weighted according to the probability of the interaction between atoms in the two residues. Once the entire graph is constructed, all the maximal sets of completely connected nodes (cliques) are found using a clique-finding algorithm. The cliques with the best probabilities represent the optimal combinations of mixing and matching between the various possibilities, taking the respective environments into account [5]. Clique-finding is accomplished using the Bron and Kerbosch algorithm [6]. All models used were refined using ENCAD [7].
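The clique-finding step can be illustrated with the basic Bron and Kerbosch recursion (without pivoting). The tiny graph, node names and weights below are hypothetical, cliques are ranked by node weights only (edge weights are left out for brevity), and higher weights are taken to be better.

```python
def bron_kerbosch(r, p, x, adjacency, out):
    """Collect all maximal cliques extending r (basic version, no pivoting)."""
    if not p and not x:
        out.append(set(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adjacency[v], x & adjacency[v], adjacency, out)
        p = p - {v}  # v has been fully explored
        x = x | {v}  # exclude v from later cliques at this level

def maximal_cliques(adjacency):
    cliques = []
    bron_kerbosch(set(), set(adjacency), set(), adjacency, cliques)
    return cliques

def best_clique(adjacency, weight):
    """Maximal clique with the largest summed node weight."""
    return max(maximal_cliques(adjacency),
               key=lambda clique: sum(weight[v] for v in clique))

# Two residues (A, B) with two candidate conformations each. Nodes of the
# same residue are never connected; A2 and B1 clash, so they share no edge.
adjacency = {
    "A1": {"B1", "B2"},
    "A2": {"B2"},
    "B1": {"A1"},
    "B2": {"A1", "A2"},
}
weight = {"A1": 0.5, "A2": 0.9, "B1": 0.2, "B2": 0.7}
cliques = maximal_cliques(adjacency)
chosen = best_clique(adjacency, weight)
```

Because conformations of the same residue are never linked, every clique contains at most one conformation per residue, which is what makes a clique a self-consistent combination of choices.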

5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION:

Once we had generated a final model for each parent, we used an off-lattice 14-state φ/ψ model and a sequential build-up algorithm to generate structures around the conformational space of the final model. We then used our scoring function to select the best ranking ones. The goal here is that some of the conformations sampled will actually be closer to the native conformation and that our scoring function will be able to select them.

We test how the above approach works in a comparative modelling scenario and assess the predictive power of this method by applying it to properly controlled blind tests as part of the fifth meeting on the Critical Assessment of protein Structure Prediction methods (CASP5). Compared to CASP2-4, where a similar approach was used [8], we have improved the method used to sample main chains and have made minor enhancements to the other components of this approach including the scoring function. It remains to be seen how the improvements in methodology correlate with model accuracy.

1. Samudrala R., Moult J. (1998) An all-atom distance dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 275: 893-914.
2. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.
3. Samudrala R., Moult J. (1998) Determinants of side chain conformational preferences in protein structures. Prot Eng 11: 991-997.
4. Bower M.J., Cohen F.E., Dunbrack R.L. (1997) Prediction of side-chain orientations from a backbone-dependent rotamer library: A new homology modelling tool. J Mol Biol 267: 1268-1282.
5. Samudrala R., Moult J. (1998) A graph-theoretic algorithm for comparative modelling of protein structure. J Mol Biol 279: 287-302.
6. Bron C., Kerbosch J. (1973) Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM 16: 575-577.
7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
8. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind protein structure predictions. BMC Structural Biology 2: 3-18.

PROTINFO-FR (P0139) - 325 predictions: 325 3D

An Automated Approach for De Novo Fold Generation

Ram Samudrala University of Washington [email protected]

This is a completely novel and automated approach, based on the idea of driving a protein folding simulation towards a particular fold. The idea was derived from the observation that, among distant homology recognition programs, at least one could identify the correct template for every CASP target, even if the alignment was incorrect. The logic here is that if a template fold can be identified, we can use our de novo simulation approach to guide the conformation towards the fold, in conjunction with our scoring functions.

The advantage of this approach is that it completely does away with the issues of alignment, the building of non-conserved side chains and main chains, and the use of a fixed template to construct a model. This enables us to circumvent explicit bias to the homologous parent structure (usually a problem in comparative modelling/fold recognition methods since there is no easy approach to move a model based on a template closer to its native structure).

We expect that this approach will perform well on cases where the sequence similarity between two proteins is not very high, in terms of improving the alignment, as well as obtaining a better conformation for the global fold.

Our general paradigm for predicting a fold involves sampling the conformational space (or generating "decoys") such that native-like conformations are observed, and then selecting them using a hierarchical filtering technique using many different scoring functions. There are four stages to our approach:

1. IDENTIFICATION OF THE TEMPLATE: The CAFASP meta-server data were used to identify the template proteins that a given target sequence was related to (based on a consensus of all the hits produced by the different servers). The templates were then fed into a multiple sequence alignment method (CLUSTALW; [1]) and the pairwise alignments between the target and each of the templates were used to construct initial models. The initial models were then ranked by our discriminatory function and the models that ranked highest were considered candidates for the template structure.

2. SECONDARY STRUCTURE PREDICTION: The consensus of the secondary structure predictions from the various servers at the CAFASP meta-server was used as the secondary structure prediction.

3. FITTING THE TARGET SEQUENCE TO THE TEMPLATE FOLD: We start with an all-atom conformation where residues predicted to be in helix/sheet by the consensus secondary structure prediction are set to idealised helix and sheet values. The remaining φ/ψ values are set in an extended conformation. Side chain conformations are predicted by simply using the most frequently observed rotamer in a database of protein structures [2]. New conformations are generated by perturbing the existing conformation at a random residue position using either a value from a 14-state φ/ψ model, or replacing three φ/ψ values for three residues with identical sequence which are obtained from a database of known structures. The optimization function used is primarily the CA RMSD between the conformation being generated and the template structure, along with a combination of an all-atom distance-dependent conditional probability discriminatory function (rapdf) and a hydrophobic compactness function (hcf) [3,4]. The fitness of the conformations was optimised using two different protocols: a straightforward Monte Carlo/simulated annealing approach [5] combined with a Genetic Algorithm strategy, and a conformational space annealing approach [6]. A combination of minimization parameters and scoring functions was used to generate a large pool of conformations.

4. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations generated were minimised using ENCAD [7] and scored using a combination of scoring functions that hierarchically reduces the total number of conformations produced to five, which are used for the final submissions. The scoring functions used for the final filtering include a simple residue-residue contact function (Shell), a density-scoring function that is based on the distance of a conformation to all its relatives in the conformation pool, a secondary structure based scoring function that evaluates the match between the predicted secondary structure and the secondary structure of a final energy-minimised conformation, and standard physics-based electrostatics and van der Waals terms.
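
The Monte Carlo/simulated annealing part of the fitting stage can be illustrated with a minimal sketch. It is not the PROTINFO-FR implementation: the objective `toy_score` is a hypothetical stand-in for the combined CA RMSD, rapdf and hcf terms, each perturbation draws a uniformly random torsion value rather than a value from the 14-state model or a fragment database, and the cooling schedule and all parameter values are assumptions chosen only for illustration.

```python
import math
import random

def toy_score(angles, target):
    """Hypothetical objective: squared angular deviation (degrees) from target torsions."""
    total = 0.0
    for a, t in zip(angles, target):
        d = (a - t + 180.0) % 360.0 - 180.0  # wrap the difference into [-180, 180)
        total += d * d
    return total

def anneal(start, target, steps=5000, t_start=1000.0, t_end=1.0, seed=0):
    """Metropolis simulated annealing over a flat list of torsion angles."""
    rng = random.Random(seed)
    current = list(start)
    current_score = toy_score(current, target)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(len(current))          # perturb one random position
        old = current[i]
        current[i] = rng.uniform(-180.0, 180.0)  # assumed move set
        new_score = toy_score(current, target)
        delta = new_score - current_score
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_score = new_score            # accept the move
            if current_score < best_score:
                best, best_score = list(current), current_score
        else:
            current[i] = old                     # reject and restore
    return best, best_score

extended = [180.0] * 20                 # extended starting conformation
helix_like = [-60.0, -45.0] * 10        # toy target torsions
best, score = anneal(extended, helix_like)
```

In this sketch the best conformation seen so far is tracked separately from the current one, so the returned score can never be worse than that of the starting conformation.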

As we note above, this is a completely novel approach that combines aspects of all three major modelling approaches (comparative modelling, fold recognition, de novo prediction) to handle the most difficult targets. This method can also be used to generate alignments based on a structure comparison between the final models and the template structures, which we can feed into a more traditional comparative modelling procedure (see abstract for PROTINFO-CM). We expect that this approach will perform best on proteins where the evolutionary relationship between two proteins is not apparent from sequence comparison methods.

1. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680. 2. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain construction on near-native main chains for ab initio protein structure prediction. Prot Eng 7: 453-457. 3. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol 300: 171-185. 4. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind protein structure predictions. BMC Structural Biology 2: 3-18. 5. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 268: 209-225. 6. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein folding by conformational space annealing and an off-lattice united-residue force field: application to the 10-55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030. 7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.

Pushchino (P0203) - 263 predictions: 263 3D

Threading Using Multiple Homology and Secondary Structure Prediction Information

M.Yu.Lobanov1, I.Litvinov2, N.S.Bogatyreva1, O.V.Galzitskaya1, M.S.Kondratova1, S.A.Garbuzynskiy1, D.N.Ivankov1, M.A.Roytberg2, A.V.Finkelstein1 1 - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia, 2 - Institute of Mathematical Problems of Biology, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia [email protected]

For dividing long targets into domains we used our new program PROFILE (to be published) and the results of PSI-BLAST [1].

To predict secondary structure of targets we used programs JPRED [2], PSIPRED [3] and ALB [4].

To obtain initial information on the target's possible fold we used the standard programs PSI-BLAST, HMMer (on the HMM profile libraries PFAM [5] and SUPERFAMILY [6]) and PROSITE [7].

Threading (of sets of reliably homologous sequences onto 3D templates) was done by our program SCF_THREADER [8] with a scoring function that takes into account: homology of sequences, homology of predicted (for target) and real (for template) secondary structures, 3D-structure dependent gap penalties, and 3D constraints on gaps in sequences threaded onto a template.

When the target-template homology was unambiguously detected by PSI-BLAST or HMMer, the SCF_THREADER usually gave a confident prediction of the same template. In such cases we generated the final model from all available alignments. When PSI-BLAST and HMMer gave nothing, we checked the best SCF_THREADER models for compact structures having hydrophobic cores.

Finally, we visually inspected the best results presented by the programs and selected the most reasonable ones. This inspection sometimes also allowed us to merge several good predictions into a common model. We preferred to take as templates the representatives of the most frequent fold in the programs' outputs, i.e. we did a kind of clustering of the obtained predictions. In all cases we corrected the final alignments manually to get the resulting set of models.

The authors acknowledge the support of the Howard Hughes Medical Institute, the Russian Foundation for Basic Research, INTAS, and the Netherlands Organization for Scientific Research.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 2. Cuff J.A. et al. (1998) JPred: a consensus secondary structure prediction server. Bioinformatics. 14 (10), 892-893. 3. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 4. Ptitsyn O.B., Finkelstein A.V. (1983) Theory of protein secondary structure and algorithm of its prediction. Biopolymers. 22 (1), 15-25. 5. Sonnhammer E.L.L., Eddy S.R., Durbin R. (1997) Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins: Structure, Function and Genetics. 28 (3), 405-420. 6. Gough J., Karplus K., Hughey R., Chothia C. (2001) Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313 (4), 903-919.

7. Falquet L. et al. (2002) The PROSITE database, its status in 2002. Nucleic Acids Res. 30 (1), 235-238. 8. Rykunov D.S. (2000) Search for the most stable folds of protein chains: III improvement in fold recognition by averaging over homologous sequences and 3D structures. Proteins. 40 (3), 494-501.

Raghava-Gajendra (P0054) - 482 predictions: 224 3D, 258 SS

RPFOLD: Recognition of Protein Fold from Sequence and Predicted Secondary Structure Using Standard Sequence Searching Methods

G. P. S. Raghava Institute of Microbial Technology, Chandigarh, INDIA [email protected]

RPFOLD uses the following steps for fold recognition: (1) the secondary structure of the query protein is predicted using PSIPRED; (2) the predicted secondary structure is searched using FASTA against a database of secondary structures generated from the PDB by DSSP; (3) the query protein sequence is searched against a non-redundant database using 3 iterations of PSIBLAST and a profile is generated; (4) the profile is used to search for similar sequences in the PDB using PSIBLAST; (5) SSEARCH is used to search the query sequence against the PDB; (6) all the hits obtained above are ranked based on score and weightage; (7) Clustal-W is used to align the query sequence, with its predicted secondary structure information, against the target protein in the PDB, with its assigned secondary structure information, to obtain the final alignment and a re-ranking of hits.

APSSP/Raghava-Gajendra (P0137) - 65 predictions: 65 SS

APSSP: Automatic Method for Protein Secondary Structure Prediction

G. P. S. Raghava Institute of Microbial Technology, Chandigarh, INDIA [email protected]

This method is similar to APSSP2 in that it uses three steps for protein secondary structure prediction. First, JNET is used to predict the secondary structure of proteins. In the second step, it predicts the secondary structure of proteins using a modified example based learning (EBL) technique. Modification of the standard EBL is a major step in this study. In the third step, the secondary structures predicted in the above two steps are combined to predict the final structure. The combination of the two is based on a reliability score. The modified EBL approach is the same as in APSSP2 except that it uses only one distance matrix instead of the 8000 distance matrices used in APSSP2.

APSSP2/Raghava-Gajendra (P0055) - 65 predictions: 65 SS

APSSP2: A Combination Method for Protein Secondary Structure Prediction Based on Neural Network and Example Based Learning

G. P. S. Raghava Institute of Microbial Technology, Chandigarh, INDIA [email protected]

This method uses three steps for protein secondary structure prediction. First, it uses a standard neural network with a multiple sequence alignment generated by PSIBLAST instead of a single sequence. In the second step, it predicts the secondary structure of proteins using a modified example based learning (EBL) technique. Modification of the standard EBL is a major step in this study. In the third step, the secondary structures predicted in the above two steps are combined to predict the final structure. The combination of the two is based on a reliability score.

To implement EBL, we first select all proteins in the PDB that have a resolution better than 2.8 Å and a minimum length of 50 residues. Then we assign the secondary structure of each protein using DSSP and generate patterns of length 17 residues, each labelled with the secondary structure state of its central residue. This gave more than 6,000,000 patterns, of which more than 1,300,000 are non-redundant. We trained our EBL method on these 1,300,000 unique patterns for prediction of secondary structure. One of the major limitations of the standard EBL method is its speed, because one needs to compare each query pattern against all 1,300,000 patterns. For example, to predict the secondary structure of a protein of 200 amino acids, one needs nearly 200x1,300,000 comparisons at the pattern level and 17x200x1,300,000 at the residue level (considering pattern length 17). This takes hours on a reasonably powerful machine, so it is not practical to use the standard EBL method in fully automatic servers such as APSSP2.
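
The pattern generation and the cost estimate above can be sketched as follows. This is an illustration rather than the APSSP2 code: the function name `windows` is hypothetical, and padding the chain termini with "X" is an assumption, since the text does not say how windows at the ends of a chain are handled.

```python
def windows(sequence, states, width=17):
    """Return (pattern, central_state) pairs, one per residue.

    Termini are padded with 'X' so that every residue is the centre of a
    full-width window (an assumption made for this sketch).
    """
    assert len(sequence) == len(states)
    half = width // 2
    padded = "X" * half + sequence + "X" * half
    return [(padded[i:i + width], states[i]) for i in range(len(sequence))]

# Toy example with a short window so the output is easy to read.
patterns = windows("ACDEFGHIK", "CCHHHHECC", width=5)

# Cost of brute-force EBL for a 200-residue query against 1,300,000 patterns.
pattern_comparisons = 200 * 1_300_000            # 260,000,000
residue_comparisons = 17 * pattern_comparisons   # 4,420,000,000
```

The two products make the scale of the problem explicit: roughly 2.6 x 10^8 pattern comparisons, or 4.4 x 10^9 residue comparisons, for a single 200-residue query.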

We divide the patterns for training and prediction into 8000 sets based on the three central residues (the central residue and its left and right neighbours). This hashing is similar to that used in BLAST. Then we create a distance matrix for each of the 8000 sets, which improves the speed 8000 times. The partitioning may affect the performance of the method, but this is compensated by using 8000 matrices instead of one matrix.
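
The figure of 8000 sets follows from the 20 amino acid types at each of the three central positions (20 x 20 x 20 = 8000). A minimal sketch of such an index is given below; the function names are hypothetical, and the per-set distance matrices are not modelled here.

```python
from collections import defaultdict

def central_triplet(pattern):
    """Return the central residue with its left and right neighbours."""
    mid = len(pattern) // 2
    return pattern[mid - 1:mid + 2]

def build_index(patterns):
    """Group (pattern, state) pairs by their central triplet."""
    index = defaultdict(list)
    for pattern, state in patterns:
        index[central_triplet(pattern)].append((pattern, state))
    return index

def candidates(index, query_pattern):
    """Return only the stored patterns that share the query's central triplet."""
    return index.get(central_triplet(query_pattern), [])

number_of_sets = 20 ** 3  # 8000 possible triplets of standard amino acids

training = [("ACDEF", "H"), ("GCDEK", "H"), ("ACGEF", "E")]
index = build_index(training)
hits = candidates(index, "WCDEY")  # central triplet "CDE"
```

A query pattern is then compared only against the patterns in its own set, which is where the speed-up comes from if the sets are of roughly equal size.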

The author believes that in future we will have no alternative but to combine generalized methods (such as neural networks) with homology-based methods (such as EBL), because more and more known examples will become available. In that case the EBL method is more accurate than generalized methods. It has been observed in EVA that for a number of proteins APSSP2 was able to predict secondary structure with 100% accuracy where other methods failed to do so.

RAPTOR (P0144) - 227 predictions: 227 3D

Protein Threading by Integer Programming

J. Xu and M. Li Department of Computer Science, University of Waterloo [email protected], [email protected]

Protein three-dimensional structure prediction by threading has been extensively studied, and various models and algorithms have been proposed. To further explore ways to improve the accuracy and efficiency of the threading process, our program RAPTOR investigates the effectiveness of a new method: protein threading via integer programming. RAPTOR minimizes the energy function (i.e. seeks the optimal alignment between sequence and template) by integer programming. The energy function used by RAPTOR rigorously takes the pair-wise contact potential into account. Based on the contact map models of protein 3D structural templates, we formulate the threading problem as a large-scale integer programming problem, relax it to a linear programming problem, and finally solve the integer program by the branch-and-cut method. In solving the linear programs, 99% of the real data instances yield an integral optimal solution directly without branching, which means that 99% of instances can be solved in polynomial time although the problem itself is NP-hard. The final solution is optimal with respect to energy functions incorporating pair-wise interaction potentials and allowing variable gaps. After the optimal alignments are found, a raw z score is calculated by randomly shuffling the query sequence. Fourteen features, including the raw z score, are extracted from the optimal alignment. A Support Vector Machine (SVM) is employed to do fold recognition using these features. In our experimental tests, the SVM recognized many more folds than the raw z score alone. The algorithm has been implemented as the software package RAPTOR (Rapid Protein Threading predictOR). Experimental results for fold recognition show that RAPTOR significantly outperforms other programs at the fold similarity level. The RAPTOR server is available at http://www.cs.uwaterloo.ca/~j3xu/RAPTOR_form.htm. For a more detailed description, please refer to our papers [1-2].
No manual intervention is used to adjust the final models. All submissions are generated directly from RAPTOR.
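
The raw z score obtained by shuffling the query sequence can be sketched as below. This is not the RAPTOR code: `toy_energy` is a hypothetical burial-matching score standing in for the optimal threading energy (which in RAPTOR comes from the integer program), and the number of shuffles, the use of the population standard deviation and the sign convention (negative z means better than random) are assumptions.

```python
import random
import statistics

HYDROPHOBIC = set("AVLIMFWC")

def toy_energy(sequence, burial):
    """Hypothetical energy: minus the number of residues whose hydrophobicity matches burial."""
    matches = sum((res in HYDROPHOBIC) == (site == "B")
                  for res, site in zip(sequence, burial))
    return -float(matches)

def raw_z_score(query, energy, n_shuffles=100, seed=0):
    """z = (E(query) - mean(E(shuffled))) / sd(E(shuffled))."""
    rng = random.Random(seed)
    observed = energy(query)
    residues = list(query)
    shuffled = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # preserves composition, destroys order
        shuffled.append(energy("".join(residues)))
    sd = statistics.pstdev(shuffled)
    if sd == 0:
        return 0.0
    return (observed - statistics.mean(shuffled)) / sd

burial = "BBBBBEEEEE" * 2          # toy template: buried (B) / exposed (E) sites
query = "LLLLLKKKKK" * 2           # toy query that matches the template perfectly
z = raw_z_score(query, lambda s: toy_energy(s, burial))
```

Because shuffling keeps the amino acid composition fixed, the z score measures how much of the energy comes from the order of the residues rather than from composition alone.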

1. Jinbo Xu, Ming Li, Ying Xu et al. (2003) Protein Threading By Linear Programming. PSB2003. 2. Jinbo Xu, Ming Li, Ying Xu. (2003) On the Power of Integer Programming Approach to Protein Threading. Submitted to RECOMB2003.

Rokko (P0327) - 109 predictions: 109 3D

De Novo Protein Structure Prediction Using Simfold; Physico-Chemical Approach

Yoshimi Fujitsuka1, George Chikenji1, Nobuyasu Koga1, Akira R. Kinjo2, and Shoji Takada1,2 1Kobe University, 2Japan Science and Technology Corporation [email protected]

For CASP5, we used SimFold, a protein simulation program that we have been developing recently [1,2]. We briefly describe a) the energy function, b) the sampling method in SimFold, and c) how we applied it in CASP5.

a) SimFold uses a coarse-grained protein model that has explicit backbone atoms and a sphere at the center of mass of each sidechain. Each sidechain can take one of several rotamer states. The energy function is based on physico-chemical considerations and consists of many terms such as hydrophobic interactions, hydrogen bonds, vdW interactions, and so on. In particular, hydrogen bond interactions include a dependence on the local dielectric constant and a correlation between two neighboring bonds in a beta sheet. Many of the length parameters are determined from a database survey. The energetic parameters that need to be accurate were optimized on the basis of the energy landscape theory. For each structure in a training set of 40 proteins, we maximize the |Z| score, the normalized difference between the native energy and the average energy of decoy structures.

b) For conformational sampling, SimFold uses either the fragment assembly (FA) method or the replica exchange MD method. We emphasize that, uniquely, both FA and MD methods are available in the single program SimFold. Our FA differs from that developed by Baker's group in two respects. First, for most of the calculations, we use only three-residue fragments instead of nine-residue ones. Second, we have developed a "reversible FA method" algorithm (Chikenji, Fujitsuka, & Takada unpublished). We note that the typical FA protocol does not obey detailed balance, but our algorithm does. Thanks to this property, we could even combine the FA with the multi-canonical ensemble approach, which was indeed used in CASP5 and helps conformational sampling very significantly. Structures either with the lowest energy or at the center of large clusters are chosen as predicted models. We also perform MD-based replica exchange simulation, where each replica holds the same protein at a different temperature and exchanges of replicas are attempted at a certain frequency. The lowest energy structure is searched for in the replica at the lowest temperature, while the high temperature replicas are useful for escaping from misfolded traps.

c) In CASP5, for all targets that have no homologous sequences of known structure, we submitted structures predicted by SimFold. For chains shorter than ~120 residues, starting from random structures, we performed FA sampling either with the multi-canonical ensemble method or with simulated annealing. We chose either structures with low energies or those at the center of large clusters. For longer sequences, we started from models in the CAFASP server, performed replica exchange MD for sampling, and chose structures in the lowest temperature replica. For some targets, other information such as annotation was used too.
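
The exchange step of a replica exchange simulation can be illustrated with the standard Metropolis-type criterion for swapping two replicas at different temperatures. The sketch below is generic and does not come from SimFold; the function name, the reduced units (Boltzmann constant set to 1) and the example numbers are assumptions.

```python
import math

def swap_probability(energy_i, temp_i, energy_j, temp_j, k_b=1.0):
    """Acceptance probability for exchanging the configurations of replicas i and j.

    Standard criterion: min(1, exp((beta_i - beta_j) * (E_i - E_j))),
    with beta = 1 / (k_b * T).
    """
    beta_i = 1.0 / (k_b * temp_i)
    beta_j = 1.0 / (k_b * temp_j)
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)

# Cold replica (T=1) currently holds the higher-energy structure: always swap.
p_downhill = swap_probability(10.0, 1.0, 5.0, 2.0)

# Cold replica already holds the lower-energy structure: swap only occasionally.
p_uphill = swap_probability(5.0, 1.0, 10.0, 2.0)
```

With this rule low-energy structures tend to migrate to the low-temperature replicas, while structures stuck in a trap can move to higher temperatures and escape.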

1. Takada S. (2001) Protein Folding Simulation With Solvent-Induced Force Field: Folding Pathway Ensemble of Three-Helix-Bundle Proteins, Proteins 42, 85-98. 2. Fujitsuka Y., Takada S., Luthey-Schulten Z.A., & Wolynes, P.G. (2002) Optimizing Physical Energy Functions for Protein Folding submitted.

Ron-Elber (P0300) - 259 predictions: 259 3D

Protein Structure Prediction With Threading Using the LOOPP2 Algorithm

T. Galor1, C. Lowe1, J. Meller2, J. Pillardy3, O. Teodorescu1 and R. Elber1 1 –Department of Computer Science, Cornell University, Ithaca, N.Y., 14853; 2 –Cincinnati Children’s Medical Center, Pediatric Informatics, 3333 Burnet Avenue, Cincinnati, OH 4522; 3– Computational Biology Service Unit, Cornell University, Ithaca, N.Y., 14853 [email protected]

The structures of the target proteins were predicted using a new version of our LOOPP (Learning, Observing and Outputting Protein Patterns) threading algorithm [1]. The algorithm is centered on threading, while sequence similarity and secondary structure prediction were used in order to improve search scope and accuracy.

Compared to the earlier version of LOOPP that participated in CASP4 the following enhancements were introduced: (i) deeper Z-score searches, (ii) use of multiple sequence alignment, and (iii) secondary structure filtering.

Calculations of the Z score (for global and local alignments) are expensive, and in LOOPP we limited our Z score calculations to the 50 global and 250 local lowest-energy scoring sequences; only the sequences occurring in both lists were considered for final scoring. Here we extend the depth of the search to include all the sequences occurring in either (local or global) list of alignments. We use BLAST [2] to detect other sequences of significant similarity (between 40 and 80 percent sequence identity). At most 10 homologous sequences are used and compared against the database of annotated protein sequences and structures. Every hit of the related sequences counts as a prediction. We use the secondary structure predictor JNET [3] to eliminate false positives by removing predictions with negative correlation to the JNET prediction.

An extensive test of 68 probe sequences against a threading database will be presented. The threading database includes 132 proteins homologous to the probe sequences and 692 decoys. On average there are 2 homologous sequences for each probe sequence in the database. The set was prepared using the CE program [4], ensuring structural similarity of homologous pairs and structural dissimilarity of decoys. It is used to compare the performance of the old LOOPP with the new fully automated protocol of LOOPP2. In brief, the old LOOPP found 30 sequences in the top 5 while LOOPP2 found 57, which is an increase of 90 percent. When LOOPP2 is used with only one sequence, 54 pairs are found in the top five. Hence, the multiple sequence approach gave an enhancement of about 6 percent. If the top 10 hits are considered, the single sequence version of LOOPP2 found 70 pairs while the multi sequence version found 83, an increase of 18.6 percent.
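
The quoted percentages can be reproduced from the counts in the paragraph above with a few lines of arithmetic; the helper name `percent_increase` is introduced only for this check.

```python
def percent_increase(old, new):
    """Relative increase of new over old, in percent."""
    return 100.0 * (new - old) / old

top5_old_vs_new = round(percent_increase(30, 57), 1)       # old LOOPP vs LOOPP2
top5_single_vs_multi = round(percent_increase(54, 57), 1)  # single vs multiple sequence
top10_single_vs_multi = round(percent_increase(70, 83), 1)

print(top5_old_vs_new, top5_single_vs_multi, top10_single_vs_multi)
# prints: 90.0 5.6 18.6
```

The middle value, 5.6 percent, is what the text rounds to "about 6 percent".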

1. Elber R., Meller J. (2001) Linear programming optimization and a double statistical filter for protein threading protocol. Proteins: structure, function and genetics 45 (3) 241-261. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 3. Cuff J.A., Barton G.J. (2000) Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins: structure, function and genetics 40 (3) 502-511. 4. Shindyalov I.N., Bourne P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11 (9) 739-747.

Rykunov-Reva-Tarakanov (P0529) - 198 predictions: 198 3D

Fold Recognition by Threading with Energy Averaging over Homologous Sequences and 3D Structures

D. Rykunov1, B. Reva2, and A. Tarakanov1 2Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Puschino, Moscow region, 142292; 1Institute of Theoretical and Experimental Biophysics, Russian Academy of Sciences, Puschino Moscow region, 142292

[email protected]

In fold predictions, we tried to implement the results of recent publications on developing the threading method [4-6]. For the threading database, we determined a diverse set of protein domain structures for available protein families and superfamilies based on the SCOP [1] classification. Each of the SCOP families was represented by sequences with pairwise similarity less than 80%. This analysis resulted in 5887 individual protein structures representing 1790 protein families. We used PSI-BLAST [2] to determine homologs of target sequences; only those with sequence similarity of 50-80% were retained for computations. Each of the target sequences and the corresponding homologs were threaded over the 5887 individual protein backbones using the threading model of [6] with the distance-dependent phenomenological potentials of [3]. Sequence-to-structure alignments were computed in the “external field” approximation [6], i.e. interactions between residues remote along the chain were substituted by interactions with the template protein; short-range interactions between neighboring residues were taken into account explicitly. For the obtained sequence-to-structure alignments the actual energies were computed [6]. These energies were averaged both over target homologs and over templates within structural families [4,5]. The sequence-to-structure alignments for top-ranking families were selected for expert inspection, which included comparison of the secondary structure implied by threading with the prediction obtained from PredictProtein [7] and visual inspection of the computed 3D structures. The least contradictory models were chosen for submission.
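
The double averaging of threading energies, first over the target and its homologs and then over templates within a structural family, can be sketched as follows. The data layout, the function names and the use of plain unweighted means are assumptions made for illustration; the published method [4,5] may weight or normalise the energies differently.

```python
from collections import defaultdict
from statistics import mean

def family_scores(energies, family_of):
    """Average energies over sequences per template, then over templates per family.

    energies: {(sequence_id, template_id): threading energy}
    family_of: {template_id: family_id}
    """
    per_template = defaultdict(list)
    for (_seq_id, template_id), energy in energies.items():
        per_template[template_id].append(energy)
    template_avg = {t: mean(values) for t, values in per_template.items()}

    per_family = defaultdict(list)
    for template_id, energy in template_avg.items():
        per_family[family_of[template_id]].append(energy)
    return {family: mean(values) for family, values in per_family.items()}

def rank_families(scores):
    """Lowest averaged energy first."""
    return sorted(scores, key=scores.get)

# Toy data: one target plus one homolog, threaded over three templates.
energies = {
    ("target", "t1"): -10.0, ("hom1", "t1"): -8.0,
    ("target", "t2"): -6.0,  ("hom1", "t2"): -4.0,
    ("target", "t3"): -1.0,  ("hom1", "t3"): -3.0,
}
family_of = {"t1": "famA", "t2": "famA", "t3": "famB"}
scores = family_scores(energies, family_of)
ranking = rank_families(scores)
```

Averaging in this way is intended to suppress noise specific to one sequence or one template, so that a family is ranked by its consistent signal.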

1. Murzin A. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 3. Reva B.A. et al. (1997) Residue-residue mean-force potentials for protein structure recognition. Protein Engineering 10 (8), 865-876. 4. Reva B.A. et al. (1999) Averaging interaction energies over homologs improves protein fold recognition in gapless threading. Proteins 35, 353-359. 5. Rykunov D.S. et al. (2000) Search for the most stable folds of protein chains. III. Improvement in fold recognition by averaging over homologous sequences and 3D structures. Proteins 40, 494-501. 6. Reva B.A. et al. (2002) Threading with Chemostructural Restrictions Method for Predicting Fold and Functionally Significant Residues: Application to Dipeptidylpeptidase IV (DPP-IV). Proteins 47, 180-193. 7. Rost B. (1996) Methods in Enzymology 266, 525-539.

SAM-T02-human (P0001) - 203 predictions: 138 3D, 65 SS

SAM-T02: Protein Structure Prediction with Neural Nets, Hidden Markov Models, and Fragment Packing

Kevin Karplus, Rachel Karchin, Richard Hughey, Jenny Draper, Yael Mandel-Gutfreund, Jonathan Casper, and Mark Diekhans Center for Biomolecular Science and Engineering, University of California, Santa Cruz [email protected]

The SAM-T02 human predictions start with the same method as the SAM-T02 server:

Use the SAM-T2K method for finding homologs of the target and aligning them.

Make local structure predictions using neural nets and the multiple alignment. We currently have 5 local-structure alphabets:

DSSP
STRIDE
STR: an extended version of DSSP that splits the beta strands into multiple classes (parallel/antiparallel/mixed, edge/center)
ALPHA: a discretization of the alpha torsion angle: CA(i-1), CA(i), CA(i+1), CA(i+2)
DSSP_EHL2: CASP's collapse of the DSSP alphabet

DSSP_EHL2 is not predicted directly by a neural net, but is computed as a weighted average of the other 4 networks (each probability vector output is multiplied by conditional probability matrix P(E|letter) P(H|letter) P(L|letter)). The weights for the averaging are the mutual information between the local structure alphabet and the DSSP_EHL2 alphabet in a large training set.
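
The computation of the DSSP_EHL2 prediction can be sketched for a single residue as below. This is an illustration, not the SAM-T02 code: the function names, the toy two-letter and one-letter alphabets, the conditional probabilities and the weights are all invented for the example, and normalising by the sum of the weights is an assumption.

```python
def collapse(probs, cond):
    """Map one network's output to (E, H, L) probabilities.

    probs: {letter: P(letter)} for one residue
    cond:  {letter: (P(E|letter), P(H|letter), P(L|letter))}
    """
    ehl = [0.0, 0.0, 0.0]
    for letter, p in probs.items():
        for k in range(3):
            ehl[k] += p * cond[letter][k]
    return ehl

def combine(predictions, conds, weights):
    """Weighted average of the collapsed predictions of several networks."""
    total_weight = sum(weights)
    out = [0.0, 0.0, 0.0]
    for probs, cond, weight in zip(predictions, conds, weights):
        ehl = collapse(probs, cond)
        for k in range(3):
            out[k] += weight * ehl[k] / total_weight
    return out

# Toy example with two hypothetical alphabets and made-up numbers.
predictions = [{"a": 0.6, "b": 0.4}, {"x": 1.0}]
conds = [
    {"a": (1.0, 0.0, 0.0), "b": (0.0, 0.5, 0.5)},
    {"x": (0.2, 0.3, 0.5)},
]
weights = [3.0, 1.0]  # stand-ins for the mutual-information weights
ehl = combine(predictions, conds, weights)
```

If every row of each conditional probability matrix sums to one, the combined vector is again a probability distribution over E, H and L.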

We make four 2-track HMMs (1.0 amino acid + 0.3 local structure) and use them to score a template library of about 6200 templates. We also used a single-track HMM to score not just the template library, but a non-redundant copy of the entire PDB. [Difference from server: the web server did not include the ALPHA alphabet in either the DSSP_EHL2 computation or the 2-track HMMs.]

One-track HMMs built from the template library multiple alignments were used to score the target sequence.

All the logs of e-values were combined in a weighted average (with rather arbitrary weights, since we did not have time to optimize them), and the best templates ranked.
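
Combining the logs of the e-values in a weighted average and ranking the templates can be sketched as follows. The HMM names, weights and e-values below are invented for the example (the text itself describes the real weights as rather arbitrary), and the function names are hypothetical.

```python
import math

def combined_score(e_values, weights):
    """Weighted average of log e-values over the HMMs that scored this template."""
    total_weight = sum(weights[name] for name in e_values)
    weighted = sum(weights[name] * math.log(e) for name, e in e_values.items())
    return weighted / total_weight

def rank_templates(table, weights):
    """Lowest combined log e-value (most significant) first."""
    return sorted(table, key=lambda template: combined_score(table[template], weights))

# Hypothetical HMM names, weights and e-values.
weights = {"aa_str": 2.0, "aa_dssp": 1.0, "template_hmm": 1.0}
table = {
    "templateA": {"aa_str": 1e-10, "aa_dssp": 1e-5, "template_hmm": 1e-8},
    "templateB": {"aa_str": 1e-2, "aa_dssp": 1e-1, "template_hmm": 1.0},
}
ranking = rank_templates(table, weights)
```

Averaging logs rather than raw e-values keeps one very small e-value from completely dominating the combined score.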

Alignments of the target to the top templates were made using several different alignment methods (all using the SAM hmmscore program).

After the large set of alignments was made, the "human" methods and the server diverge significantly. The server just picks the best-scoring templates (after removing redundancy) and reports the local posterior-decoding alignments made with the 2-track AA+STR target HMM.

The hand method used SAM's "fragfinder" program and the 2-track AA+STR HMM to find short fragments (9 residues long) for each position in the sequence (6 fragments were kept for each position).

Then the "undertaker" program (named because it optimizes burial) is used to try to combine the alignments and the fragments into a consistent 3D model. No single alignment or parent template was used, though in many cases one had much more influence than the others. The alignment scores were not passed to undertaker, but were used only to pick the set of alignments and fragments that undertaker would see.

A genetic algorithm with about 16 different operators was used to optimize a score function. The score function was hand-tweaked for each target (mainly by adding constraints to keep beta sheets together, but also by adjusting what terms were included in the score function and what weights were used). Undertaker was undergoing extensive modification during CASP season, so may have had quite different features available for different targets.

Bower and Dunbrack's SCWRL was run on some of the intermediate conformations generated by undertaker, but the final conformation was chosen entirely by the undertaker score function.

Optimization was generally done in many passes, with hand inspection of the best conformation after each pass, followed (often) by tweaking the score function to move the conformation in a direction we desired. In a few cases, when we started getting a decent structure that did not correspond well to our input alignments, we submitted the structure to VAST to get structure-structure alignments, to try to find some other possible templates to use as a base. In some cases, when several conformations had good parts, different conformations were manually cut-and-pasted, with undertaker run to try to smooth out the transitions.

Because undertaker does not (yet) handle multimers, we often added "scaffolding" constraints by hand to try to retain structure in dimerization interfaces. This is a crude hack that we hope to get rid of when we have multimers implemented. Because undertaker does not (yet) have a hydrogen-bond scoring function, we often had to add constraints to hold beta sheets together. In some cases where the register was not obvious, we had to guess or try several different registers. In some cases, when we got desperate for initial starting points, we threw the Robetta ab-initio models into the undertaker pool, and optimized from them as well as the ones undertaker started with.

For multiple-domain models, we generally broke the sequence into chunks (often somewhat arbitrary overlapping chunks), and did the full SAM-T02 method for each subchain. The alignments found were all tossed into the undertaker conformation search. In some cases, we performed undertaker runs for the subchains, and cut-and-pasted the pieces into one PDB file (with bad breaks) and let undertaker try to assemble the pieces.

SAM-T02-server (P0189) - 221 predictions: 221 3D

SAM-T02 Protein Structure Prediction Webserver

Kevin Karplus, Rachel Karchin, and Richard Hughey Center for Biomolecular Science and Engineering, University of California, Santa Cruz [email protected]

SAM-T02 predicts the fold and secondary structure of a target protein sequence using multi-track hidden Markov models and neural nets trained on multiple alignments generated by the SAM-T2K iterated search procedure [1-4].

As a first step, we build a multiple alignment of homologs to the target sequence. Next, neural nets and the multiple alignment are used to make local structure predictions. SAM-T02 is currently using these local-structure alphabets: DSSP [5]; STRIDE [6]; STR [1], an extended version of DSSP that splits the beta strands into multiple classes (parallel/antiparallel/mixed/edge/center); and DSSP-EHL2, CASP’s collapse of the DSSP alphabet into three states. DSSP-EHL2 is not predicted directly by a neural net, but is computed as a weighted average of the other three networks: each network's output probability vector is multiplied by a conditional probability matrix with entries P(E|letter), P(H|letter), and P(L|letter). The weights for the averaging are the mutual information between each local structure alphabet (DSSP, STRIDE, STR) and the DSSP-EHL2 alphabet in a large training set.
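The weighted average described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the dictionary layout, and all numbers in the usage note are hypothetical, and the real conditional probability matrices and mutual-information weights come from the authors' training set.

```python
def combine_to_ehl2(predictions, to_ehl2, weights):
    """Combine per-alphabet predictions into one (E, H, L) probability vector.

    predictions: {alphabet: {letter: probability}} for one residue
    to_ehl2:     {alphabet: {letter: (P(E|letter), P(H|letter), P(L|letter))}}
    weights:     {alphabet: weight}, e.g. mutual information with DSSP-EHL2
    All three structures are hypothetical layouts chosen for this sketch.
    """
    total_weight = sum(weights[name] for name in predictions)
    combined = [0.0, 0.0, 0.0]
    for name, probs in predictions.items():
        for letter, p in probs.items():
            cond = to_ehl2[name][letter]
            for k in range(3):
                # project the letter probability onto E/H/L, then weight it
                combined[k] += weights[name] * p * cond[k]
    return [c / total_weight for c in combined]
```

For example, with two toy alphabets, `combine_to_ehl2({"DSSP": {"E": 0.2, "H": 0.7, "T": 0.1}, "STR": {"A": 0.5, "H": 0.5}}, {"DSSP": {"E": (1, 0, 0), "H": (0, 1, 0), "T": (0, 0, 1)}, "STR": {"A": (0.9, 0, 0.1), "H": (0, 1, 0)}}, {"DSSP": 1.0, "STR": 3.0})` yields a vector that still sums to one, because each conditional row sums to one and the weights are normalised.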

We make four 2-track HMMs (1.0 amino acid + 0.3 local structure) and use them to score a template library of about 6200 templates. We also use a single-track HMM to score not just the template library, but a non-redundant copy of the entire PDB. One-track HMMs built from the template library multiple alignments are also used to score the target sequence.

The HMM scores for each sequence, with respect to the four 2-track HMMs and one template HMM, are converted to e-values, and the logs of all e-values are combined in a weighted average (with rather arbitrary weights, since we did not have time to optimize them). The combined scores are used to rank the best templates.
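A minimal sketch of this score combination, assuming one e-value per HMM for each template and a fixed weight per HMM. The function names, the template identifiers, and the weights are hypothetical; the abstract does not give the actual weights.

```python
import math


def combined_log_evalue(evalues, weights):
    """Weighted average of the natural logs of several e-values."""
    total = sum(weights)
    return sum(w * math.log(e) for e, w in zip(evalues, weights)) / total


def rank_templates(template_evalues, weights):
    """Return template ids ordered from best (lowest combined score) to worst.

    template_evalues: {template_id: [e-value from each HMM]} (hypothetical layout)
    """
    scores = {t: combined_log_evalue(ev, weights)
              for t, ev in template_evalues.items()}
    return sorted(scores, key=scores.get)
```

Averaging logs rather than raw e-values keeps one very large e-value from swamping several small ones, which is one plausible reason for working in log space.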

Alignments of the target to the top templates are made using several different alignment methods (all using the SAM hmmscore program [8]).

After the large set of alignments is made, the server picks the best-scoring templates (after removing redundancy) and reports the local posterior-decoding alignments made with the 2-track AA+STR target HMM. The AA+STR HMM has produced the best quality alignments in our tests on sets of structurally similar protein pairs with low to moderate sequence identity [1].

The server also provides secondary structure predictions for the target sequence in a variety of formats, and sequence logos [7] for the multiple alignment and secondary-structure predictions.

SAM-T02 is available at: http://www.soe.ucsc.edu/research/compbio/HMM-apps/T02-query.html

Stand-alone SAM programs (free for academics, government labs and non-profits) are available at: http://www.soe.ucsc.edu/research/compbio/sam.html

1. Karchin R. et al. (2002) Hidden Markov models that use predicted local structure for fold recognition: alphabets of backbone geometry. Submitted to Proteins: Structure, Function, and Genetics.
2. Karplus K. et al. (2001) What is the value added by human intervention in protein structure prediction? Proteins: Structure, Function, and Genetics, 45 (S5), 86-91.
3. Karplus K. et al. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14 (10), 846-856.
4. Park J. et al. (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol., 284 (4), 1201-1210.
5. Kabsch W. and Sander C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22 (12), 2577-2637.
6. Frishman D. and Argos P. (1995) Knowledge-based Protein Secondary Structure Assignment. Proteins: Structure, Function, and Genetics, 23 (4), 566-579.
7. Schneider T.D. and Stephens R.M. (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res, 18 (10), 6097-100.
8. Hughey R. and Krogh A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. CABIOS, 12 (2), 95-107.

SAMUDRALA-COMPARATIVE-MODELLING (P0053) - 248 predictions: 248 3D

An Automated Approach for the Comparative Modeling of Protein Structure

Ram Samudrala University of Washington [email protected]

The interconnected nature of interactions in protein structures, thorough sampling of side chain and main chain conformations, and devising a discriminatory function that can distinguish between correct and incorrect conformations are the major hurdles preventing the construction of accurate homology models. We present an algorithm that uses graph theory to handle the problem of interconnectedness. Sampling of side chain and main chain conformations is accomplished by exhaustively enumerating all possible choices using a discrete state model, including fragments from a database of protein structures. The optimal combination of these possibilities is selected using an all-atom scoring function aided by the graph-theoretic approach.

Following is a brief description of the components and steps of this method, which can be divided into: discriminatory function, identification of template and generation of alignment, initial model building, construction of variable main chain and side chain regions, and moving models closer to the native conformation.

1. DISCRIMINATORY FUNCTION: The function used throughout is an all-atom distance-dependent conditional probability discriminatory function based on a statistical analysis of known protein structures. The negative log of the conditional probability of observing two atoms interacting at a particular distance is used as a "pseudo-energy" term [1].
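As a rough illustration of how such a pseudo-energy can be accumulated over a model, the sketch below sums the negative log of a tabulated conditional probability over atom-pair contacts. The table layout, the distance binning, and the probability floor for unseen entries are assumptions made for this example; they are not the published parameterisation of [1].

```python
import math


def pseudo_energy(contacts, cond_prob, floor=1e-6):
    """Sum -ln(P) over observed atom-pair contacts (lower is better).

    contacts:  iterable of (atom_type_a, atom_type_b, distance_bin)
    cond_prob: {(type_a, type_b, distance_bin): probability}, with the two
               atom types stored in sorted order (hypothetical table layout)
    floor:     probability assumed for entries missing from the table
    """
    score = 0.0
    for a, b, dist_bin in contacts:
        key = (min(a, b), max(a, b), dist_bin)
        score += -math.log(cond_prob.get(key, floor))
    return score
```

Contacts that are frequently observed in known structures contribute little to the sum, while rarely observed ones are penalised heavily.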

2. IDENTIFICATION OF TEMPLATE AND GENERATION OF ALIGNMENT: The CAFASP meta-server data (http://bioinfo.pl/cafasp) were used to identify the template proteins that a given target sequence was related to (based on a consensus of all the hits produced by the different servers). The templates were then fed into a multiple sequence alignment method (CLUSTALW [2]) and the pairwise alignments between the target and each of the templates were used to construct initial models. The initial models were then ranked by our discriminatory function and the models that ranked highest were used for further model-building. In addition to these initial models, a model based on the alignment derived from a structure comparison of the best scoring model output from our de novo fold generation method (see the abstract for SAMUDRALA-FOLD-RECOGNITION) to the corresponding template structure was also used.

3. INITIAL MODEL BUILDING: Following the sequence alignment, for each parent structure, an initial model was generated by copying atomic coordinates for the main chain (excluding any insertions) and for the side chains of residues that are identical in the target and parent structures. Residues that differ in type were constructed using a minimum perturbation (MP) technique. The MP method changes a given amino acid to the target amino acid while preserving the values of equivalent chi angles between the two side chains, where available. The other chi angles are constructed by the MP method using an internally developed library based on residue type.

4. CONSTRUCTION OF VARIABLE MAIN CHAIN AND SIDE CHAIN REGIONS: Main chain sampling is performed using an exhaustive enumeration technique based on discrete states of φ/ψ angles. For longer main chain regions, we use fragments (3-tuples) from a database of protein structures to generate the discrete φ/ψ angles.

Side chain possibilities are generated by selecting the most probable side chain rotamers based on the interactions of a given rotamer with the local main chain (evaluated using the discriminatory function above) [3]. Side chain possibilities were also constructed using the program SCWRL [4].

We then use a graph-theoretic approach to assemble the sampled side chain and main chain conformations together in a consistent manner. Each possible conformation of a residue is represented as a node in a graph. Each node is given a weight based on the degree of the interaction between its side chain atoms and the local main chain atoms. The weight is computed using an all-atom conditional probability discriminatory function. Edges are then drawn between pairs of residues/nodes that are consistent with each other (i.e., clash-free and satisfying geometrical constraints). The edges are also weighted according to the probability of the interaction between atoms in the two residues. Once the entire graph is constructed, all the maximal sets of completely connected nodes (cliques) are found using a clique-finding algorithm. The cliques with the best probabilities represent the optimal combinations of mixing and matching between the various possibilities, taking the respective environments into account [5]. Clique-finding is accomplished using the Bron and Kerbosch algorithm [6]. All models used were refined using ENCAD [7].
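The clique-finding step can be illustrated with the basic (non-pivoting) form of the Bron and Kerbosch algorithm. This is a sketch, not the authors' implementation: the graph layout and node labels are hypothetical, and for brevity cliques are ranked by summed node weights only, whereas the method described above also weights the edges.

```python
def bron_kerbosch(adjacency):
    """Yield every maximal clique of an undirected graph.

    adjacency: {node: set of neighbouring nodes}; here a node stands for one
    candidate conformation of one residue, and an edge means the two
    conformations are mutually consistent (hypothetical encoding).
    """
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            yield set(clique)
            return
        for v in list(candidates):
            yield from expand(clique | {v},
                              candidates & adjacency[v],
                              excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adjacency), set())


def best_clique(adjacency, node_weight):
    """Pick the maximal clique with the lowest summed node pseudo-energy."""
    return min(bron_kerbosch(adjacency),
               key=lambda c: sum(node_weight[n] for n in c))
```

With two residues that each have conformations "a" and "b", where "1a" is compatible with "2a" and "2b" but "1b" only with "2b", the search yields three maximal cliques, and `best_clique` returns whichever has the lowest total weight.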

5. MOVING MODELS CLOSER TO THE NATIVE CONFORMATION:

Once we had generated a final model for each parent, we used an off-lattice 14-state φ/ψ model and a sequential build-up algorithm to generate structures in the conformational space around the final model. We then used our scoring function to select the best-ranking ones. The hope is that some of the conformations sampled will actually be closer to the native conformation and that our scoring function will be able to select them.

We test how the above approach works in a comparative modelling scenario and assess the predictive power of this method by applying it to properly controlled blind tests as part of the fifth meeting on the Critical Assessment of protein Structure Prediction methods (CASP5). Compared to CASP2-4, where a similar approach was used [8], we have improved the method used to sample main chains and have made minor enhancements to the other components of this approach including the scoring function. It remains to be seen how the improvements in methodology correlate with model accuracy.

Note: This method is completely automated and the models are generated using the same process as the corresponding PROTINFO server (http://protinfo.compbio.washington.edu). The difference between the predictions submitted as part of the server registration and the ones submitted under this group code is that, without the time limit we had for CAFASP (48 hours), we could make more predictions. Also, in cases where we noticed clearly egregious output, we re-ran the automated methods with different parameter weights (i.e., there was an additional step involving interactive observation for a small number of the targets).

1. Samudrala R., Moult J. (1998) An all-atom distance dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol 275: 893-914.
2. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.
3. Samudrala R., Moult J. (1998) Determinants of side chain conformational preferences in protein structures. Prot Eng 11: 991-997.
4. Bower M.J., Cohen F.E., Dunbrack R.L. (1997) Prediction of side-chain orientations from a backbone-dependent rotamer library: A new homology modelling tool. J Mol Biol 267: 1268-1282.
5. Samudrala R., Moult J. (1998) A graph-theoretic algorithm for comparative modelling of protein structure. J Mol Biol 279: 287-302.
6. Bron C., Kerbosch J. (1973) Algorithm 457: Finding all cliques of an undirected graph. Communications of the ACM 16: 575-577.
7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.
8. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind protein structure predictions. BMC Structural Biology 2: 3-18.

SAMUDRALA-FOLD-RECOGNITION (P0052) - 315 predictions: 315 3D

An Automated Approach for De Novo Fold Generation

Ram Samudrala University of Washington [email protected]

This is a completely novel and automated approach, based on the idea of driving a protein folding simulation towards a particular fold. The idea was derived from the observation that, among distant homology recognition programs, at least one could identify the correct template for every CASP target, even if the alignment was incorrect. The logic is that if a template fold can be identified, we can use our de novo simulation approach, in conjunction with our scoring functions, to guide the conformation towards that fold.

The advantage of this approach is that it completely does away with the issues of alignment, the building of non-conserved side chains and main chains, and the use of a fixed template to construct a model. This enables us to circumvent explicit bias to the homologous parent structure (usually a problem in comparative modelling/fold recognition methods since there is no easy approach to move a model based on a template closer to its native structure).

We expect that this approach will perform well on cases where the sequence similarity between two proteins is not very high, in terms of improving the alignment, as well as obtaining a better conformation for the global fold.

Our general paradigm for predicting a fold involves sampling the conformational space (or generating "decoys") such that native-like conformations are observed, and then selecting them using a hierarchical filtering technique using many different scoring functions. There are four stages to our approach:

1. IDENTIFICATION OF THE TEMPLATE: The CAFASP meta-server data were used to identify the template proteins that a given target sequence was related to (based on a consensus of all the hits produced by the different servers). The templates were then fed into a multiple sequence alignment method (CLUSTALW; [1]) and the pairwise alignments between the target and each of the templates were used to construct initial models. The initial models were then ranked by our discriminatory function and the models that ranked highest were considered candidates for the template structure.

2. SECONDARY STRUCTURE PREDICTION: The consensus of the secondary structure predictions from the various servers at the CAFASP meta-server was used as the secondary structure prediction.

3. FITTING THE TARGET SEQUENCE TO THE TEMPLATE FOLD: We start with an all-atom conformation in which residues predicted to be in helix/sheet by the consensus secondary structure prediction are set to idealised helix and sheet values. The remaining φ/ψ values are set to an extended conformation. Side chain conformations are predicted by simply using the most frequently observed rotamer in a database of protein structures [2]. New conformations are generated by perturbing the existing conformation at a random residue position, using either a value from a 14-state φ/ψ model or three φ/ψ values taken from a database of known structures for three residues with identical sequence. The optimization function used is primarily the CA RMSD between the conformation being generated and the template structure, along with a combination of an all-atom distance-dependent conditional probability discriminatory function (rapdf) and a hydrophobic compactness function (hcf) [3,4]. The fitness of the conformations was optimised using two different protocols: a straightforward Monte Carlo/simulated annealing approach [5] combined with a Genetic Algorithm strategy, and a conformational space annealing approach [6]. A combination of minimisation parameters and scoring functions was used to generate a large pool of conformations.

4. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations generated were minimised using ENCAD [7] and scored using a combination of scoring functions that hierarchically reduces the total number of conformations produced to five, which are used for the final submissions. The scoring functions used for the final filtering include a simple residue-residue contact function (Shell), a density-scoring function that is based on the distance of a conformation to all its relatives in the conformation pool, a secondary structure based scoring function that evaluates the match between the predicted structure and the secondary structure of a final energy-minimised conformation, and standard physics-based electrostatics and Van der Waals terms.
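The Monte Carlo/simulated annealing search over φ/ψ states used in stage 3 can be sketched as below. Everything specific here is an assumption for illustration: the four torsion states stand in for the 14-state model (whose values are not given in this abstract), the cooling schedule is a generic geometric one, and the scoring function is supplied by the caller rather than being the RMSD-plus-rapdf/hcf combination described above. Fragment moves and the genetic algorithm component are omitted.

```python
import math
import random

# Hypothetical stand-in for the 14-state phi/psi model (degrees).
STATES = [(-60.0, -45.0), (-120.0, 130.0), (-75.0, 150.0), (60.0, 45.0)]


def anneal(conformation, score, steps=2000, t_start=10.0, t_end=0.1, seed=0):
    """Metropolis search over per-residue (phi, psi) states; lower score is better."""
    rng = random.Random(seed)
    current = list(conformation)
    current_score = score(current)
    best, best_score = list(current), current_score
    for i in range(steps):
        # geometric cooling from t_start down to t_end
        t = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        trial = list(current)
        trial[rng.randrange(len(trial))] = rng.choice(STATES)
        trial_score = score(trial)
        accept = (trial_score <= current_score
                  or rng.random() < math.exp((current_score - trial_score) / t))
        if accept:
            current, current_score = trial, trial_score
            if trial_score < best_score:
                best, best_score = list(trial), trial_score
    return best, best_score
```

A toy call such as `anneal([(180.0, 180.0)] * 10, lambda c: sum(1 for x in c if x != STATES[0]))` starts from an extended chain and searches for an all-helical one; running the search many times with different seeds and settings is one way to build up a pool of conformations.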

As we note above, this is a completely novel approach that combines aspects of all three major modelling approaches (comparative modelling, fold recognition, de novo prediction) to handle the most difficult targets. This method can also be used to generate alignments based on a structure comparison between the final models and the template structures, which we can feed into a more traditional comparative modelling procedure (see abstract for SAMUDRALA-COMPARATIVE-MODELLING). We expect that this approach will perform best on proteins where the evolutionary relationship between two proteins is not apparent from sequence comparison methods.

Note: This method is completely automated and the models are generated using the same process as the corresponding PROTINFO server (http://protinfo.compbio.washington.edu). The difference between the predictions submitted as part of the server registration and the ones submitted under this group code is that, without the time limit we had for CAFASP (48 hours), we could make more predictions. Also, in cases where we noticed clearly egregious output, we re-ran the automated methods with different parameter weights (i.e., there was an additional step involving interactive observation for a small number of the targets).

1. Thompson J.D., Higgins D.G., Gibson T.J. (1994) CLUSTALW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673-4680.
2. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain construction on near-native main chains for ab initio protein structure prediction. Prot Eng 7: 453-457.
3. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol 300: 171-185.
4. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind protein structure predictions. BMC Structural Biology 2: 3-18.
5. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J Mol Biol 268: 209-225.
6. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein folding by conformational space annealing and an off-lattice united-residue force field: application to the 10-55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030.
7. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.

SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D

An Automated Approach for De Novo Structure Prediction

Ram Samudrala, Shing-Chung Ngan University of Washington [email protected], [email protected]

Our general paradigm for predicting structure involves sampling the conformational space (or generating "decoys") such that native-like conformations are observed, and then selecting them using a hierarchical filtering technique using many different scoring functions. Our goal was to devise a method that would combine the best aspects of the more successful ab initio methods at the previous CASP experiments. There are three stages to our approach:

1. SECONDARY STRUCTURE PREDICTION: The consensus of the secondary structure predictions from the various servers at the CAFASP meta-server was used as the secondary structure prediction.

2. SEARCHING PROTEIN CONFORMATIONAL SPACE: We start with an all-atom conformation in which residues predicted to be in helix/sheet by the consensus secondary structure prediction are set to idealised helix and sheet values. The remaining φ/ψ values are set to an extended conformation. Side chain conformations are predicted by simply using the most frequently observed rotamer in a database of protein structures [1]. New conformations are generated by perturbing the existing conformation at a random residue position, using either a value from a 14-state φ/ψ model or three φ/ψ values taken from a database of known structures for three residues with identical sequence. The optimization function used is primarily a combination of an all-atom distance-dependent conditional probability discriminatory function (rapdf) and a hydrophobic compactness function (hcf) [2,3]. The fitness of the conformations was optimised using two different protocols: a straightforward Monte Carlo/simulated annealing approach [4] combined with a Genetic Algorithm strategy, and a conformational space annealing approach [5]. A combination of minimisation parameters and scoring functions was used to generate a large pool of conformations.

3. SELECTION OF NATIVE-LIKE CONFORMATIONS: The conformations generated were minimised using ENCAD [6] and scored using a combination of scoring functions that hierarchically reduces the total number of conformations produced to five, which are used for the final submissions. The scoring functions used for the final filtering include a simple residue-residue contact function (Shell), a density-scoring function that is based on the distance of a conformation to all its relatives in the conformation pool, a secondary structure based scoring function that evaluates the match between the predicted structure and the secondary structure of a final energy-minimised conformation, and standard physics-based electrostatics and Van der Waals terms.
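The selection stage can be pictured as successive filters, each keeping the best-scoring part of the surviving pool until five conformations remain. The sketch below assumes lower scores are better, that each stage keeps a fixed fraction, and that the density score is the mean distance of a conformation to all others in the pool given a precomputed distance matrix. The function names, the keep fractions, and the order of stages are hypothetical, since the abstract does not specify them.

```python
def density_scores(dist):
    """Mean distance from each conformation to every other one in the pool.

    dist: symmetric matrix (list of lists) of pairwise structural distances,
    e.g. RMSD values computed elsewhere. A low value means the conformation
    sits in a densely populated region of the pool.
    """
    n = len(dist)
    return [sum(dist[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


def hierarchical_filter(pool, stages, final_count=5):
    """Apply (score_fn, keep_fraction) stages in order; lower score is better."""
    survivors = list(pool)
    for score_fn, keep_fraction in stages:
        survivors.sort(key=score_fn)
        n_keep = max(final_count, int(len(survivors) * keep_fraction))
        survivors = survivors[:n_keep]
    return survivors[:final_count]
```

In use, the cheaper functions (for instance a contact count) would typically come first and the more expensive terms last, so that costly scores are only evaluated on a small surviving set.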

This work is an attempt at combining the best de novo prediction methods from the previous CASP experiments [2-5]. In addition, there are components that are unique to this approach primarily in the form of the hierarchical filtering methodology employed, the density scoring function, and in subtle variations of each of the search methods.

Note: This method is completely automated and the models are generated using the same process as the corresponding PROTINFO server (http://protinfo.compbio.washington.edu). The difference between the predictions submitted as part of the server registration and the ones submitted under this group code is that, without the time limit we had for CAFASP (48 hours), we could make more predictions. Also, in cases where we noticed clearly egregious output, we re-ran the automated methods with different parameter weights (i.e., there was an additional step involving interactive observation for a small number of the targets).

1. Samudrala R., Huang E.S., Koehl P., Levitt M. (2000) Side chain construction on near-native main chains for ab initio protein structure prediction. Prot Eng 7: 453-457.
2. Xia Y., Huang E.S., Levitt M., Samudrala R. (2000) Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol 300: 171-185.
3. Samudrala R., Levitt M. (2002) A comprehensive analysis of 40 blind protein structure predictions. BMC Structural Biology 2: 3-18.
4. Simons K.T., Kooperberg C., Huang E., Baker D. (1997) Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and bayesian scoring functions. J Mol Biol 268: 209-225.
5. Lee J., Liwo A., Scheraga H.A. (1999) Energy-based de novo protein folding by conformational space annealing and an off-lattice united-residue force field: application to the 10-55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc Natl Acad Sci USA 96: 2025-2030.
6. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential energy function and parameters for simulations of the molecular dynamics of proteins and nucleic acids in solution. Comp Phys Comm 91: 215-231.

Sasson-Iris (P0265) - 66 predictions: 66 3D

Full-Atom Modeling using INBGU and 3D-SHOTGUN

I. Sasson

Ben-Gurion University [email protected]

Full atom models were generated using initial alignments from the INBGU server as multiple template input to Modeller.

The goal was to test the power of the 3D-SHOTGUN selection procedure carried out by the SHGU method and at the same time to generate protein-like, full-atom models, without the fragmentation and collisions present in the C-alpha only SHGU models.

Most of the models were generated in a fully automated manner.

We describe the preliminary approach used. We are in the process of incorporating a fully automated procedure into the INBGU server.

SBC (P0084) - 94 predictions: 94 3D

The Pcons and Pmodeller Consensus Fold Recognition Servers

Björn Wallner, Fang Huisheng and Arne Elofsson Stockholm Bioinformatics Center, Stockholm University, 106 91 Stockholm, Sweden [email protected]

In the CASP and CAFASP processes it has been shown that human experts are better at predicting the fold of an unknown protein than fully automated methods. The best manual predictions seem to be made by authors using a wide range of different methods, and the most obvious similarity between them is that they have worked on fold recognition for years. Several of these experts also develop methods; however, these methods do not perform as well as the experts themselves. What are the secrets that the human experts possess but are not able to put into a computer?

We have recently shown that one such secret is the use of a "consensus" approach in fold recognition. By using several different methods, the same method with different parameters, or searches with several homologous sequences, a "consensus" prediction can be made. The consensus analysis can also be done using only a single sequence and a single method, by searching for similar hits among the top-scoring hits. In contrast, most automatic methods use only a single sequence and a single set of parameters, and do not use the top-scoring hits to search for "consensus" predictions. We describe a new method for fold recognition, Pcons [1], that utilizes this "consensus analysis" to improve automatic fold recognition.
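To make the consensus idea concrete, the sketch below counts how many servers report the same fold among their top-ranked hits. This is deliberately simpler than Pcons, which the authors describe as a neural network based consensus predictor [1]; plain vote counting is swapped in here for illustration, and the server names, fold labels, and `top_n` cutoff are hypothetical.

```python
from collections import Counter


def consensus_fold(server_hits, top_n=5):
    """Return (fold, votes) for the fold seen in the most servers' top hits.

    server_hits: {server_name: [fold labels ordered from best to worst hit]}
    Each server votes at most once per fold, however often it lists it.
    """
    votes = Counter()
    for hits in server_hits.values():
        for fold in set(hits[:top_n]):
            votes[fold] += 1
    return votes.most_common(1)[0]
```

For instance, `consensus_fold({"s1": ["a.1", "b.2"], "s2": ["c.3", "a.1"], "s3": ["a.1"]})` picks `"a.1"` with three votes even though one server ranked another fold first, which is the kind of agreement a single-method, top-hit-only pipeline would miss.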

Further, the ability to separate correct models of protein structures from less correct models is of the greatest importance for protein structure prediction methods. Several studies have examined the ability of different types of energy function to detect the native, or native-like, protein structure from a large set of decoys. In contrast to earlier studies, we examine here the ability to detect models that show only some structural similarity to the native structure. These correct models are defined by the existence of a fragment that shows significant similarity between the model and the native structure. Further, it has been shown that the existence of such fragments is useful for comparing the performance of different fold recognition methods and that this performance correlates well with performance in fold recognition.

We developed a neural network based method to predict the quality of a protein model (ProQ). ProQ extracts structural features, such as the frequency of atom-atom contacts, and predicts the quality of a model, as measured either by LGscore or MaxSub. We show that ProQ performs at least as well as other measures when identifying the native structure, and better at detecting correct models. This performance is maintained over several different test sets.

ProQ [2] can also be combined with the Pcons[1] fold recognition predictor to increase its performance. However, the improvement is quite marginal, with the main advantage being the elimination of a few high-scoring false positive models.

ProQ is freely available as a standalone web server at http://www.sbc.su.se/~bjorn/ProQ/, and is incorporated into the Pcons consensus server, available at http://www.sbc.su.se/~arne/pcons/, as Pmodeller. Current results in LiveBench indicate that Pmodeller performs significantly better than Pcons.

1. Lundström et al. (2001) Pcons: A neural network based consensus predictor that improves fold recognition. Protein Science 10(11):2354-6
2. Wallner and Elofsson (2002) Can correct protein models be identified? Submitted.

Scheraga-Harold (P0314) - 135 predictions: 135 3D

Physics-Based Protein-Structure Prediction Using the UNRES and ECEPP/3 Force Fields - Test on CASP5 Targets

C. Czaplewski1,2, D.R. Ripoll1,3, St. Ołdziej1,2, R. Kaźmierkiewicz1,2, J.A. Vila1,4, A. Liwo1, J. Pillardy1,3, J.A. Saunders1, M. Chinchio1, M. Nanias1, M. Khalili1, Y.A. Arnautova1, A. Jagielska1, Y.K. Kang1,5, K.D. Gibson1 and H.A. Scheraga1* 1 - Baker Laboratory of Chemistry and Chemical Biology, Cornell University, Ithaca, NY, 14853-1301, 2 - Faculty of Chemistry, University of Gdańsk, ul. Sobieskiego 18, 80-952 Gdańsk, Poland, 3 - Cornell Theory Center, Cornell University, Ithaca, NY, 14853-1301, 4 - Escuela de Fisica, Facultad de Ciencias Fisico Matematicas y Naturales, Universidad Nacional de San Luis, Ejercito de los Andes 950, 5700 San Luis, Argentina, 5 - Department of Chemistry, Chungbuk National University, Cheongju, Chungbuk 361-763, Korea *[email protected]

The structures of the target proteins were predicted using a hierarchical algorithm consisting of three major stages, in which the tertiary structure is predicted at low resolution and then refined.

In stage 1, the protein is represented by a simplified low-resolution united residue (UNRES) model, in which the atoms of the peptide group and side chain of each amino-acid residue are replaced with two centers of interaction: the united peptide group (p), located midway between two consecutive α-carbon atoms, and the united side chain (SC). The lengths of the virtual Cα…Cα and Cα…SC bonds are held fixed, but the virtual-bond angles and the orientations of the Cα…SC virtual bonds are variable. The interactions in this simplified model are described by the UNRES potential, derived from the generalized cumulant expansion of the restricted free energy (RFE) function of polypeptide chains [1]. The cumulant expansion enabled us to determine the functional forms of the multibody terms in the UNRES potential.
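The geometry of the united peptide group can be shown with a few lines of code: each p center is simply the midpoint of two consecutive α-carbon positions. The function name and the coordinates in the usage note are hypothetical, and the placement of the united side chains (SC) is not covered by this sketch.

```python
def peptide_group_centers(ca_coords):
    """Return the midpoint of every pair of consecutive C-alpha positions.

    ca_coords: list of (x, y, z) tuples, one per residue, in chain order.
    A chain of N residues has N - 1 united peptide groups.
    """
    return [tuple((a + b) / 2.0 for a, b in zip(p, q))
            for p, q in zip(ca_coords, ca_coords[1:])]
```

For a trace such as `[(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]`, the two centers lie halfway along each virtual bond; because the virtual bond lengths are fixed, the chain conformation is then determined by the angular variables alone.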

The UNRES potential was parameterized using RFE surfaces of systems modeling interacting fragments of polypeptide chains, calculated at the quantum-mechanical ab initio level [using Møller-Plesset perturbation theory up to the second order (MP2) with the 6-31G* basis set], as well as correlation and distribution functions determined from the Protein Data Bank (PDB). The folding property of the potential function was achieved by applying our novel hierarchical optimization method, targeted at decreasing the energy while increasing the native-likeness of a structure of benchmark protein(s) [2].

Our conformational space annealing (CSA) method [3] was used to search for the lowest-energy families of UNRES conformations. To speed up the search in the case of larger proteins, information from secondary structure prediction was used in the generation of the initial structures and/or to restrict the conformational search. However, unrestricted search was also performed in most of the cases. The five families with the lowest UNRES energy were chosen as models 1-5; the structures of these models were then refined in stages 2 and 3, as described below.

In stage 2, the low-resolution UNRES conformations of a target protein were converted to all-atom models by using our energy-based method for the reconstruction of an all-atom polypeptide chain from its Cα-trace and side-chain-centroid coordinates [4,5]. Finally, in stage 3, the all-atom structures were refined by minimizing their energies with the all-atom ECEPP/3 force field [6] subject to Cα-distance constraints of the parent UNRES models.

1. Liwo A. et al. (2001) Cumulant-based expressions for the multibody terms for the correlation between local and electrostatic interactions in the united-residue force field. J. Chem. Phys. 115 (5), 2323-2347. 2. Liwo A. et al. (2002) A method for optimizing potential-energy functions by a hierarchical design of the potential-energy landscape: Application to the UNRES force field. Proc. Natl. Acad. Sci. USA. 99 (4), 1937-1942. 3. Lee J. et al. (1999) Conformational space annealing and an off-lattice united-residue force field: application to the 10-55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc. Natl. Acad. Sci. USA. 96 (5), 2025-2030. 4. Kaźmierkiewicz R. et al. (2001) Energy-based reconstruction of a protein backbone from its α-carbon trace by a Monte Carlo method. J. Comput. Chem. 23 (7), 715-723. 5. Kaźmierkiewicz R. et al. (2002) Addition of side chains to a known backbone with defined side-chain centroids. Biophys. Chem. in press. 6. Nemethy G. et al. (1992) Energy parameters in polypeptides. 10. Improved geometrical parameters and nonbonded interactions for use in the ECEPP/3 algorithm with application to proline-containing peptides. J. Phys. Chem. 96 (15), 6472-6484.

Schulten-Wolynes (P0093) - 118 predictions: 118 3D

Bioinformatics Based Threading for Protein Structure Prediction

P. O’Donoghue1, F. Autenrieth1, R. Amaro1, M. Januszyk1, T. Pogorelov1, P. G. Wolynes2, and Z. Luthey-Schulten1 1 - University of Illinois at Urbana-Champaign 2 - University of California, San Diego [email protected]

We used a combination of methods to select scaffolds for the target sequences. These methods included: 1) using our in-house sequence-structure threading alignment algorithm (termed the local Hamiltonian) [4] to thread the target sequence against a PDB select database, PDB25 and/or PDB90, [2]; 2) a PSI-BLAST simultaneous search against the PDB database and sequence databases from several organisms from NCBI or the Biology Workbench [1,9]; 3) a PubMed literature search [12] for functionally related proteins to the target sequence. If CAFASP3 [11] reported scaffolds not related to those found by the first three search methods, then these scaffolds were included in our analysis. Large proteins were subdivided into putative domains using a variety of methods including analysis of multiple sequence alignments and exon prediction algorithms.

Three procedures were used to generate sequence alignments of the target sequence to the scaffold: our in-house sequence-structure threading alignment algorithm [4], a sequence to structure profile-profile alignment procedure as described in [5], and a hybrid method that uses the threading alignment over some regions and the profile-profile alignment over other regions of the target sequence. This hybrid method is also described in [5]. Sequence profiles were generated using Clustal-W [8] on sequences obtained from a PSI-BLAST search of the Swiss-Prot or Non-Redundant sequence databases. Structure profiles were generated using the CE algorithm for structural alignment [7]. We manually checked complete alignments for agreement between the PsiPred [3] secondary structure prediction for the target and the secondary structure of the scaffold. We also checked for correct alignment of homologous functional sites.

The complete 3-dimensional models of the target proteins with side chains were made using the Modeller package as implemented in Insight II [6,10]. Starting from the alignments mentioned above, three models were generated using the highest level of optimization.

Our in-house sequence-structure threading alignment algorithm [4] was used to rank the model structures constructed for a given target sequence. After threading the target sequence onto the model structures of the target, the model with the best local Hamiltonian energy was chosen as our “MODEL 1”. Additional models followed this energy ranking.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402. 2. Hobohm U. et al. (1992) Selection of representative protein data sets. Prot. Sci. 1, 409-417. 3. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 4. Koretke K.K. et al. (1996) Self-consistently optimized statistical mechanical energy functions for sequence structure alignment. Prot. Sci. 5, 1043-1059. 5. O'Donoghue P. et al. (2001) On the Structure of hisH: Protein Structure Prediction in the Context of Structural and Functional Genomics. J. Struct. Biol. 134, 257-268. 6. Sali A. et al. (1993) J. Mol. Biol. 234, 779-815. 7. Shindyalov I.N. et al. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Prot. Eng. 11, 739-747. 8. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680. 9. http://workbench.sdsc.edu 10. http://www.accelrys.com 11. http://www.cs.bgu.ac.il/~dfiischer/CAFASP3/ 12. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi

SDSC2:Reddy-Bourne (P0347) - 54 predictions: 54 3D

Using Combination of PSI-BLAST, Expdb, 3D-PSSM, SAM-T02, FUGUE, Multalin, Swissmodel and Swiss-PDB Viewer Web Servers with Human Input to Model Protein Structures.

V. Boojala B. Reddy1 and Philip E. Bourne1,2 1San Diego Supercomputer Center, 2Department of Pharmacology, University of California, San Diego, CA 92093 - 0537

Template Selection:

Target sequences were taken from the CASP5 target site in FASTA format and submitted to a PSI-BLAST [1] search at the NCBI BLAST site, using the PDB sequence database for template identification. If one or more templates were identified with an expectation value below 10⁻⁵, the single best-resolved structure with good sequence coverage was selected as the basis structure for modeling the target sequence. The ExPDB [6] template search was used for this purpose.
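
The selection rule in this step can be sketched as follows. The `Hit` record, the function name and the coverage cutoff of 0.8 are assumptions made for illustration; the abstract only says "good sequence coverage" and gives no numeric threshold for it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    pdb_id: str        # hypothetical identifier of the PDB entry
    e_value: float     # PSI-BLAST expectation value
    resolution: float  # resolution of the structure in angstroms
    coverage: float    # fraction of the target covered by the alignment


def select_template(hits: List[Hit], max_e: float = 1e-5,
                    min_coverage: float = 0.8) -> Optional[Hit]:
    """Pick the best-resolved hit among significant, well-covering hits."""
    candidates = [h for h in hits
                  if h.e_value < max_e and h.coverage >= min_coverage]
    if not candidates:
        # No significant template: the caller would fall back to
        # fold recognition servers.
        return None
    # Lower resolution value means a better-resolved structure.
    return min(candidates, key=lambda h: h.resolution)
```

Returning `None` marks the case where no template passes the expectation-value cutoff.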

If no template was identified among the NCBI PDB sequences, we submitted the sequence to the fold recognition servers 3D-PSSM [5], SAM-T02 [4] and FUGUE [7]. We took the top five templates suggested by each of these servers for further consideration as possible templates. We also considered their structural homologues from FSSP as a possible template set. We submitted the target sequence to a PSI-BLAST [1] search against the NR sequence database to identify its available sequence homologues. Functional information from the NR homologue sequences was matched with the possible templates from the fold recognition servers above to decide on a single basis structure for modeling the target structure.

Template Target Alignment:

We randomly chose a few (at least 4 each) distant homologues (e-value 10⁻²⁰ or more) of the template and target sequences (a total of 8 including the template and the target sequences). All these sequences were aligned together using the MultAlin [2] server. The resulting alignment between the template and target sequences, obtained along with their homologues, was used as the final alignment between template and target sequence.

Model Building and Visualization:

We used Swiss-PDB Viewer (DeepView) [3] to load the template structure and the target sequence. The two sequences were aligned according to the MultAlin [2] alignment, and the resulting alignment was submitted to Swiss-Model through the web submission form. The model built by Swiss-Model was inspected and the coordinate file was edited as required by the CASP5 TS submission format. Only one model was submitted for each target. In all our models we used only the single best available template to model a target structure.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389 - 3402. 2. Corpet F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16 (22), 10881 - 10890. 3. Guex N., Peitsch M.C. (1997) SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling. Electrophoresis. 18 (15): 2714 - 2723. 4. Karplus K., Karchin R., Barrett C., Tu S., Cline M., Diekhans M., Grate L., Casper J., Hughey R. (2001) What is the value added by human intervention in protein structure prediction? Proteins. Suppl 5: 86 - 91. 5. Kelley L.A., MacCallum R.M., Sternberg M.J. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J Mol Biol. 299 (2): 499 - 520. 6. Peitsch M.C., Schwede T. and Guex N. (2000) Automated protein modeling – the proteome in 3D. Pharmacogenomics 1: 257 – 266. 7. Shi J., Blundell T.L., Mizuguchi K. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol. 310 (1): 243 - 257.

Shakhnovich-Eugene (P0459) - 26 predictions: 26 3D

Structure Prediction: a Synthesis of Threading and Folding

W. Chen1, E. Kussell1, F. Zhang2, B. Shakhnovich3, I. Hubner2, E.I. Shakhnovich2 1 - Dept of Biophysics, Harvard University , 2 - Dept of Chemistry and Chemical Biology, Harvard University, 3 –Bioinformatics Program, Boston University [email protected]

We present a method for protein structure prediction that is a synthesis of bioinformatic and all-atom physical approaches. Our method is a combination of threading, model-building, specific potential derivation, and full-atom folding. This novel structure prediction protocol is flexible in that it provides for varying levels of detail in both input and output. At the same time, the method is efficiently automated and human intervention is controllable at each step. The combination of informatic and physical concepts in this method also means that it draws upon strengths inherent to each approach.

We begin with a query sequence, the structure of which is unknown. Fold recognition begins with either of two methods: threading or an ELISA query [1]. The ELISA database is a database of protein domains sorted by structural and sequence similarity, and further annotated by function. Each set of domains is connected to others in a graph structure on the basis of a threshold Z-score. We submitted the query sequence to ELISA, which yielded domain hits and a set of graph neighbors, the number of which can be selected by adjusting the threshold Z-score. This small set of structures was then subjected to threading for fold recognition refinement, or to direct modeling if the initial ELISA hit was significant enough.

Threading was also used for fold recognition, or for generating an alignment if a significant ELISA hit was retrieved. The query sequence is subjected to Monte Carlo threading through a very large set of templates [2]. Appropriate bioinformatic constraints, in the form of structural profiles or restrictions on the number of alignment fragments, give accuracy and significance to threading hits. The most significant template hit and alignment is judged by the magnitude of the  -parameter, and is selected as a template for model building [3-4].

Using the template and threading alignment, a full-atom backbone is generated with appropriate proline geometries. The backbone secondary structure is obtained directly from the alignment and the template. Loop regions (non-aligned regions) are of appropriate length and are random in conformation. The backbone is minimized with a dRMS function to reflect the tertiary structure of the template. Sidechains with random rotameric states are built onto the final dRMS-minimized model.
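
The abstract does not give the exact form of the dRMS function. As a minimal sketch, assuming the common definition (the root mean square difference between corresponding pairwise distances in the model and the template), it could be computed as below; the function name and coordinate layout are illustrative only.

```python
import math
from itertools import combinations
from typing import List, Sequence

Point = Sequence[float]


def drms(model: List[Point], template: List[Point]) -> float:
    """Distance RMS between two equally long lists of 3D coordinates."""
    if len(model) != len(template):
        raise ValueError("coordinate sets must have the same length")
    if len(model) < 2:
        raise ValueError("need at least two points")
    sq_sum = 0.0
    n_pairs = 0
    for i, j in combinations(range(len(model)), 2):
        # Compare the i-j distance in the model with the same
        # distance in the template.
        diff = (math.dist(model[i], model[j])
                - math.dist(template[i], template[j]))
        sq_sum += diff * diff
        n_pairs += 1
    return math.sqrt(sq_sum / n_pairs)
```

Because only internal distances enter, the value is unchanged by translating or rotating either structure, so no superposition is needed before minimization.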

To refine the model in both rotamer and backbone states, we derive a family-specific full-atom potential for use in full-atom refinement. The fold family is obtained from relatives of the threading template hits. These related proteins are used for derivation of a full-atom potential. We use an atom-typing scheme based on six residue types: aliphatic, aromatic, positively charged, negatively charged, polar, and special. Atoms within each residue type then have unique types based loosely on similarity of chemical connectivity. This results in 28 atom types for the 20 amino acids. The potential is derived using the μ-potential procedure [5], allowing for the parameters Nab and Ñab to be summed across the family.

The dRMS-minimized model with sidechains is refined using the derived potential and a generic geometric hydrogen bond potential. Refinement is done by Monte Carlo annealing at low temperatures to bring the protein to low free energy conformations. [6] For next-stage refinement, conformations can be automatically selected on the basis of low free energy, or by hand after visual examination with a molecule viewer. The penultimate structure is put through a final round of refinement by fixing the backbone and allowing sampling of sidechain rotamers from a rotamer library, using the derived potential. This "repacking" of sidechains ensures both improved packing of the interior of the model and also realistic rotamer conformations. [7] The final structure is inspected using the Protein Health Facility in Quanta.
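
The abstract does not spell out the acceptance rule used in the Monte Carlo annealing. A minimal sketch, assuming the standard Metropolis criterion and energy units in which the Boltzmann constant is 1, is given below; the function name and signature are hypothetical.

```python
import math
import random


def metropolis_accept(delta_e: float, temperature: float,
                      rng: random.Random) -> bool:
    """Decide whether to accept a trial move.

    Downhill moves (delta_e <= 0) are always accepted; uphill moves are
    accepted with probability exp(-delta_e / temperature).
    """
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / temperature)
```

At low temperature the exponential becomes very small for any appreciable energy increase, so the walk is dominated by downhill moves, which matches the refinement role described above.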

The method we have outlined draws its strengths from the combination of bioinformatic and realistic full-atom approaches. Threading constrains the query chain to a small subset of possible conformations, and full-atom folding further refines the coarse results of threading. The method, being multi-stage and open-ended to iterations, is flexible and can be adapted to other sorts of input at each stage. Ab initio models can be generated by elimination of the threading stage, or BLAST and ELISA and other fold-selection criteria can be substituted for threading as the input to the model-building stage. [8]

1. Shakhnovich B. et al. (2002) Functional fingerprints of folds: evidence for correlated structure-function evolution. (JMB, submitted). 2. Mirny L.A. & Shakhnovich E.I. (1998) Protein structure prediction by threading. Why it works and why it does not. J. Mol. Biol. 283 (2), 507-526. 3. Chen W. et al. (2002) Fold recognition with minimal gaps. (submitted). 4. Mirny L.A. et al. (2000) Statistical significance of protein structure prediction by threading. Proc. Natl. Acad. Sci. 97 (18), 9978-9983. 5. Kussell E. et al. (2002) A structure-based method for derivation of all-atom potentials for protein folding. Proc. Natl. Acad. Sci. 99 (8), 5343-5348. 6. Shimada J. et al. (2001) The folding thermodynamics and kinetics of Crambin using an all-atom Monte Carlo simulation. J. Mol. Biol. 308 (1), 79-95. 7. Kussell E. et al. (2001) Excluded volume in protein side-chain packing. J. Mol. Biol. 311 (1), 183-193. 8. Shakhnovich B. et al. (2002) In preparation.

SHESTOPALOV (P0044) - 159 predictions: 79 3D, 80 SS

Protein Fold Recognition and Secondary Structure Prediction by Doublet Code Method in the CASP5 Experiment

B.V. Shestopalov, G.R. Mavropulo-Stolyarenko, A.M. Lebedev Institute of Cytology, Russian Academy of Sciences [email protected]

We present a new version of the doublet code model of protein secondary structure and its application to fold recognition and secondary structure prediction in the CASP5 experiment. For the basis of the model, see [1-2]. This version allows more accurate prediction (by about 4-5%). A prediction is obtained for 98% of the protein chain. On the basis of the model, we formulate the hypothesis that the secondary structure predicted by the doublet code method is the secondary structure of the unfolded state of the protein chain. On the basis of this hypothesis, it is possible to search for parents of a target using the predicted secondary structures of the target and of proteins from the Protein Data Bank [3]. If the secondary structures of the unfolded states are similar, then the secondary structures of the folded states and the three-dimensional structures are also similar. To avoid mistakes in prediction, secondary structures of homologs of these proteins are used. For easy targets, BLAST results [4-5] and the Conserved Domain Database and Search Service v1.58 [6] are used. Secondary structure was predicted for all 65 proteins, and parents were suggested for 48 proteins.

1. Shestopalov B.V. (1990) Prediction of protein secondary structure by doublet code method. Mol. Biol. (Moscow), Engl. Transl. 24 (4) 900-907 2. Shestopalov B.V. (2000) Doublet Code of Protein Secondary Structure and its application for Secondary Structure Prediction and Fold recognition. Abstracts submitted to the CASP4 meeting 3. Berman H.M. et al. (2000) The Protein Data Bank. Nucleic Acids Res. 28(1), 235-242 4. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 5. Gish W. (1996-1999) http://blast.wustl.edu 6. Marchler-Bauer A. et al. (2002) CDD: a database of conserved domain alignments with links to domain three-dimensional structure. Nucleic Acids Research 30 (1), 281-283

Shortle (P0349) - 32 predictions: 32 3D

Protein Structure Prediction Using Fragment Ensembles with Highly Favorable Ramachandran / Rotamer Propensities

Qiaojun Fang and David Shortle Department of Biological Chemistry The Johns Hopkins University School of Medicine [email protected]

Three-dimensional models of the backbone plus CB atom for targets in the fold recognition and new fold categories were constructed in three steps: (i) prediction of secondary structure; (ii) definition of turn directions between elements of secondary structure; (iii) recombination of helix/strand – turn – helix/strand fragments to generate longer pieces (60-120 amino acids) of the target protein. At each step, extensive use was made of ensembles of fragments from the PDB to identify the predominant low-resolution patterns.

Secondary structure was predicted by threading overlapping segments of 6 to 12 residues from the target sequence through known protein structures, selecting for conformations that optimize Ramachandran, rotamer, and solvent exposure propensities [1]. Each position was assigned the most common secondary structure after averaging over approximately 100 fragments. Final decisions on ambiguous secondary structure, which could not be resolved by analysis of a small number of homologues, were often arbitrated by the results from PSIPRED and other CAFASP servers.
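
The per-position assignment can be sketched as a simple majority vote. This is a simplified illustration: it assumes the secondary structure states of the selected fragments have already been mapped onto a common window of target positions, and the three-letter alphabet (H, E, C) and the function name are hypothetical.

```python
from collections import Counter
from typing import List


def consensus_secondary_structure(fragment_states: List[str]) -> str:
    """Assign each position its most common state across fragments.

    fragment_states: equal-length strings, one per fragment, over an
    assumed alphabet such as 'H' (helix), 'E' (strand), 'C' (coil).
    """
    if not fragment_states:
        raise ValueError("at least one fragment is required")
    length = len(fragment_states[0])
    if any(len(states) != length for states in fragment_states):
        raise ValueError("all fragments must cover the same window")
    consensus = []
    for column in zip(*fragment_states):
        # most_common(1) yields the single most frequent state.
        consensus.append(Counter(column).most_common(1)[0][0])
    return "".join(consensus)
```

Positions where the vote is close are the ambiguous cases that, per the text, were arbitrated with homologues and server predictions.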

Segments of the target sequence corresponding to helix/strand – turn – helix/strand elements were threaded through approximately 5000 PDB structures, selecting for conformations that optimize Ramachandran, rotamer and solvent exposure propensities and that also have negative energies as assessed by empirical pair potentials. A small number of homologues (3 to 10) were also threaded at this step, with clustering of conformations between conformational sets from different homologous sequences to reduce noise from individual sequences and to identify the most common turn features (i.e., those with the highest entropy,[2]). A final set of 20 to 40 conformations, recovered from one or more homologues, was saved for recombination to form larger fragments.

Conformation sets overlapping by one helix/strand were recombined, selecting for relatively compact, relatively bump free chains with low energies (empirical pair potentials). Visual inspection of superposed sets of recombinants by both investigators was used extensively to infer the most common topology of these long chain fragments. In the last step, individual recombinant chains were manually reworked to reduce bumps, achieve greater compactness, and enforce protein-like patterns of tertiary interactions within the inferred topology.

1. Shortle D. (2002) Composites of local structural propensities: evidence for local encoding of long range structure. Protein Science 11, 18-26. 2. Shortle D., Simons K.T., Baker D. (1998) Clustering of low energy conformations near the native structures of small proteins. PNAS 95, 11158-11162.

sk-lab (P0403) - 2 predictions: 2 3D

Knowledge based consensus method for structure prediction

S. Krishnaswamy, Preeti Mehta, P.D. Kumar, A.V.S.K. Mohan Katta Bioinformatics Centre, School of Biotechnology, Madurai Kamaraj University, Madurai 625 021, India [email protected]

The modeling of three-dimensional structures of proteins is rendered difficult by the sequence-structure-function degeneracy. Thus the most successful methods have relied upon comparative modeling techniques [1-2]. These comparative modeling techniques rely on the convergence of structure based on sequence identity [3]. The sequence to structure degeneracy has led to the use of fold prediction or threading methods [4-5]. These methods rely on the idea that known structures can be used as a knowledge base from which one extracts information for modeling. There are many examples of groups of proteins with a similar fold but with no sequence similarity [6]. The method that we have adopted assumes that sub-structures can be stitched together to form larger structures [7]. The method, which we refer to as the knowledge based consensus method, was developed [8] and refined [9] for the prediction of a protein called McrA from E. coli, whose structure we are in the process of determining. McrA has less than 20% identity to known proteins in the database. We have used this method in the CASP5 experiment to predict the structure of two targets, T0192 and T0176. The modules InsightII, Homology and Discover of the Biosym package were used for the model building and energy minimization. The CVFF forcefield available in the Biosym software was used for the energy minimization. The method involves fragment selection based on searches against the PDB. The matches are selected not only on the basis of the E-value but also on predicted secondary structure matches, possible functional similarity of the template protein and hydrophobic characteristics. The final decision took into account the need to preserve contiguity of secondary structures and was arrived at on the basis of consensus of these selection criteria. Wherever possible, preference was given to selection from the same set of templates in order to minimize the number of template structures used. Thus this selection has a certain amount of subjectivity. The longest region with the best consensus was chosen for the start. A homology-based approach was then used to assign coordinates to the sub-sequence (termed here a ‘pseudo SCR’) based on the template structure. This process was continued with the selection of a new region until all the possible pseudo SCRs were assigned coordinates. Each time it was ensured that the coordinates of the previous and new fragments were in the same reference frame. The template was discarded if there were contact or topology problems, and a new template was chosen at the next consensus level. Once this process was complete, the intervening regions were assigned coordinates using the loop search algorithm. The joint or splice regions were repaired, side chain conformations were optimized based on rotamer libraries and the model was subjected to cycles of energy minimization using Steepest Descent and Conjugate Gradient methods. The resulting model was examined graphically for inconsistencies such as buried charged residues and for the distribution in the allowed regions of the Ramachandran plot. These were corrected, if possible, or the model was discarded, a different template structure was chosen for the problem region, and the modeling process was restarted from that position.

1. Johnson N.S. et al. (1994) Knowledge based protein modeling. Crit. Rev. Biochem. Mol. Biol. 29, 1-68. 2. Tramontano A. (1998) Homology modeling with low sequence identity. Methods: A companion to Methods in Enzymology. 14, 293-300. 3. Chothia C. and Lesk A.M. (1986) The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823-826. 4. Jones D.T. et al. (1992) A new approach to protein fold recognition. Nature 358, 86-89. 5. Godzik A. et al. (1992) Topology fingerprint approach to inverse protein folding problem. J. Mol. Biol. 227, 227-238. 6. Murzin A.G. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540. 7. Jones A.T. and Thirup S. (1986) Using known substructures in protein model building and crystallography. EMBO J. 5, 819-822. 8. Krishnaswamy S. et al. (1995) Knowledge based consensus approach to molecular modeling of McrA. Protein Science 4 (suppl) 2, 86. 9. Deva T. (2000) Structural analysis of type II restriction endonucleases and the atypical modified cytosine restriction endonuclease McrA from E. coli. Ph.D. thesis submitted to Madurai Kamaraj University, Madurai, India.

Skolnick-Kolinski (P0010) - 361 predictions: 361 3D

TOUCHSTONE: A Unified Approach To Protein Structure Prediction

Y. Zhang1, A. Arakaki1, D. Kihara1, M. Boniecki2, A. Szilagyi1, A. Kolinski1,2 and J. Skolnick1 1Center of Excellence in Bioinformatics, University at Buffalo, 2Faculty of Chemistry, Warsaw University, Poland [email protected]

We have applied the TOUCHSTONE [1] folding algorithm, which spans the range from homology modeling to ab initio folding, to all the protein targets in CASP5. Using our threading algorithm PROSPECTOR [2], one first threads against a representative set of PDB templates. If a template is significantly hit, generalized comparative modeling is done using a number of variants. These variants involve freezing the aligned regions either on lattice (using a new lattice model, CABS, that represents each residue by the Cα, the Cβ, and the side chain center of mass) or off lattice, and relaxing the remaining structure to accommodate insertions or deletions with respect to the template. Alternatively, if multiple templates are identified, both local and long-range distance restraints are extracted and used in the CABS lattice-based structure assembly algorithm. The generalized comparative modeling component is designed to span the range from proteins closely related to the template to distantly related ones. If a significant template is not identified, then consensus contacts from weakly threading templates are pooled and incorporated into our ab initio folding algorithm. In addition, for both generalized comparative modeling and ab initio cases, predicted secondary structure from PSIPRED [3] as well as consensus local distance restraints from PROSPECTOR are used. For ab initio folding, the CABS model is used exclusively. In all cases, conformational space is sampled by replica exchange Monte Carlo [1,4-5]. The resulting structures are clustered [6-7] and ranked according to cluster diversity, population and, where applicable, functional considerations. On this basis, the top five candidates are submitted to CASP.
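
The replica exchange sampling mentioned above relies on periodic attempts to swap conformations between replicas held at different temperatures. A minimal sketch of the textbook swap acceptance probability follows, assuming energy units in which the Boltzmann constant is 1; the function name is hypothetical, and the schedule and parameters used in TOUCHSTONE are not given in the abstract.

```python
import math


def swap_probability(e_i: float, t_i: float,
                     e_j: float, t_j: float) -> float:
    """Probability of exchanging the conformations of replicas i and j.

    Uses min(1, exp((beta_i - beta_j) * (e_i - e_j))) with beta = 1/T.
    """
    if t_i <= 0.0 or t_j <= 0.0:
        raise ValueError("temperatures must be positive")
    beta_i, beta_j = 1.0 / t_i, 1.0 / t_j
    exponent = (beta_i - beta_j) * (e_i - e_j)
    # A non-negative exponent means the swap is always accepted.
    return 1.0 if exponent >= 0.0 else math.exp(exponent)
```

A swap that moves the lower-energy conformation to the colder replica is always accepted, which is what lets low-energy structures found at high temperature migrate down the temperature ladder.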

1. Kihara D., et al. (2001) TOUCHSTONE: an ab initio protein structure prediction method that uses threading-based tertiary restraints. Proc Natl Acad Sci U S A 98(18), 10125-10130. 2. Skolnick J. and Kihara D. (2001) Defrosting the frozen approximation: PROSPECTOR--a new approach to threading. Proteins 42(3), 319-331. 3. McGuffin L.J., Bryson K., and Jones D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16(4), 404-405. 4. Swendsen R.H. and Wang J.S. (1986) Replica Monte Carlo simulations. Phys. Rev. Lett. 57, 2607-2609. 5. Ferrenberg A.M. and Swendsen R.H. (1988) New Monte Carlo technique for studying phase transitions. Phys. Rev. Lett. 61, 2635-2637. 6. Betancourt M.R. and Skolnick J., (2001) Universal similarity measure for comparing protein structures. Biopolymers 59(5), 305-309. 7. Betancourt M.R. and Skolnick J. (2001) Finding the needle in a haystack: Educing native folds from ambiguous ab initio protein structure prediction. J Comput Chem. 22, 339-353.

SMD-CCS (P0249) - 4 predictions: 4 3D

Protein Modeling at CASP5

F.Giordanetto1, M. Saqi2, S. Jha1 and P.V. Coveney1 1 – Centre for Computational Sciences, Department of Chemistry, Queen Mary, University of London, Mile End Road, E1 4NS, London, 2 – Bioinformatics, Dept. of Medical Microbiology, Barts and The London, Queen Mary’s School of Medicine and Dentistry, University of London, 32 Newark St., London E12AA [email protected]

PSI-BLAST [1] and GenTHREADER [2] were employed in order to identify possible three-dimensional templates for the target sequences. Search and evaluation of structural neighbours and structural comparison were carried out with DALI [3]. Multiple sequence alignments between the target and the probable templates were performed using T-COFFEE [4] and CLUSTALW [5]. Secondary structure predictions were carried out using PSIpred V2.0 [6] and PredictProtein [7]. Three-dimensional models of the targets were built using MODELLER 6v2 [8]. Loop fragments or uncertain regions arising from the alignment were sampled using the loop-searching routines implemented in MODELLER [9].

All the molecular mechanics calculations have been performed employing the Amber 98 force field [10], as previously ported to the Large-scale Atomistic/Molecular Massively Parallel Simulator (LAMMPS) [11]. The final homology-built structures were subjected to 1000 steps of energy minimization in vacuo. Subsequently, the systems were neutralized by adding sodium counter-ions and solvated by TIP3P water. Solvent and ions were energy-minimized and then evolved using molecular dynamics for 20 ps holding the protein atoms fixed. Both solvent and solute were energy-minimized again and sampled for 300 ps with positional constraints on the main chain atoms of the residues which displayed a “good” alignment with the template structures. Structural evaluation of the overall three-dimensional structures was accomplished using the package Procheck [12].

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 2. Jones D.T. (1999) GenTHREADER: An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815. 3. Holm L. et al. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol. 233, 123-128. 4. C. Notredame et al. (2000) T-Coffee: A novel method for multiple sequence alignments. J. Mol. Biol. 302, 205-217. 5. Higgins D. et al. (1994) Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, 4673-4680. 6. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 7. Rost B. et al. (1993) Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol. 232, 584-599. 8. Šali A. et al. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815. 9. Fiser A. et al. (2000) Modeling of loops in protein structures. Protein Sci. 9, 1753-1773. 10. Cornell W.D. et al. (1995) A second generation force field for the simulation of proteins, nucleic acids and organic molecules. J. Am. Chem. Soc. 117, 5179-5197. 11. Plimpton S.J. et al. (1996) A New Parallel Method for Molecular-Dynamics Simulation of Macromolecular Systems. J. Comput. Chem. 17, 326-337. 12. Laskowski R.A. et al. (1993) PROCHECK: a program to check the stereochemical quality of protein structures. J. Appl. Cryst. 26, 283-291.

Solovyev-Softberry (P0270) - 242 predictions: 177 3D, 65 SS

SoftPM: Softberry tools for protein structure modelling

V. Solovyev, D. Affonnikov, A. Bachinsky, I. Titov Ivanisenko and Y. Vorobjev Softberry Inc., 116 Radio Circle, Suite 400 Mount Kisco, NY 10549, USA [email protected]

A suite of new programs, SoftPM (software for protein modeling and prediction of 3-D protein structure), has been developed recently by the Softberry research team (www.softberry.com). It includes the Ffold, Getatoms, Hmod3Dmm, Hmod3Dmd, Cover3D and Abini3D programs and an upgraded version of the SSPALm program developed earlier. The programs were designed to cover all aspects of analyzing new sequences and elucidating their 3D structures.

SSPALm (Secondary Structure Prediction by Alignments) is a new version of the secondary structure prediction program SSPAL, which is based on a nearest-neighbor approach. This m-version uses local alignments [1] with a non-redundant database of ~4000 proteins of known tertiary structure and takes a multiple alignment of the target sequence as input. The knowledge database of secondary structure and environment parameters for known 3D structures was computed by the Softberry SSENVID program.

Ffold (Find fold) is a fold recognition program that produces a ranked list of the structurally closest proteins of known 3D structure by aligning the target sequence with a database of these proteins. The alignment score is a combination of environment potential, secondary structure (predicted by SSPALm) and amino acid sequence similarity.

Getatoms is a program for modeling the atomic coordinates of a protein with unknown 3D structure. It uses main-chain coordinates from the 3D structure of a similar protein whose sequence is aligned with the sequence of the query protein. Restoration of loops in alignment will be added later. The program has an option to provide coordinates of H atoms. Getatoms computes 3D coordinates of the query protein and estimates the quality of the resulting 3D structure using several scores. Initially, Getatoms selects the most typical side-chain conformations, and then the conformations are optimized using a soft steric potential. Using Monte Carlo generation of initial coordinates and a set of rules, Getatoms can restore loop structures and adjust gap boundaries.

Hmod3Dmm (Homology MODeling of 3D with Molecular Mechanics) finds the minimum-energy geometry of a protein structure derived by Getatoms on the basis of a similar protein. It uses an AMBER-like force field and the conjugate-gradient method of energy optimization. The program is useful for removing large forces on atoms before applying molecular dynamics optimization programs. The current version of Hmod3Dmm includes a model of the water environment in the energy computation.

Hmod3Dmd (Homology MODeling of 3D with Molecular Dynamics) performs the final refinement of the protein structure via MD simulation of the protein model in an implicit solvent with a simulated annealing protocol. The AMBER force field [2] was used to calculate the internal protein energy, i.e. covalent bond and angle deformation, torsion and improper torsion energies, and the van der Waals and electrostatic non-bonded interactions. The water solvent was modeled implicitly via the solvation energy density model of Lazaridis and Karplus [3]. The final protein models were ranked according to total free energy in the implicit solvent, calculated by averaging over a series of snapshots.

Cover3D uses Ffold results to generate coverage of the target sequence by similar protein fragments of known 3D structure. It outputs several variants of such coverage to be used in Abini3D to compute a putative 3D model of the target sequence. Abini3D finds the optimal conformation of a set of 3D fragments representing the target sequence. It uses a simplified model of amino acid residues and contact potentials derived from statistics on known tertiary structures.
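The ranking step described for Hmod3Dmd (ordering final models by free energy averaged over a series of snapshots) can be illustrated with a minimal sketch. The function name, the model names and the energy values are hypothetical and are not taken from the Softberry software; the sketch assumes that the per-snapshot free energies have already been computed elsewhere.

```python
from statistics import mean


def rank_models(snapshot_energies):
    """Return model names sorted by mean snapshot energy, lowest first.

    snapshot_energies maps a model name to a list of free-energy values,
    one per MD snapshot (hypothetical input format).
    """
    averages = {name: mean(values) for name, values in snapshot_energies.items()}
    return sorted(averages, key=averages.get)


# Hypothetical energies for three candidate models.
energies = {
    "model_1": [-100.0, -102.0],
    "model_2": [-110.0, -108.0],
    "model_3": [-90.0, -95.0],
}
ranking = rank_models(energies)
```

Averaging over snapshots rather than scoring a single frame reduces the sensitivity of the ranking to one unusually favorable or unfavorable conformation.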

Any CASP5 target sequence with similarity to a known 3D structure (found by Ffold) was modeled by Getatoms. We then used Hmod3Dmm and, in many cases, Hmod3Dmd to generate the submitted structure. Other target sequences (with no significant long similarity found) were analyzed by Cover3D, after which we applied Getatoms and, in most cases, Hmod3Dmm to generate the submitted 3D coordinates.

1. Salamov A.A., Solovyev V.V. (1997) Protein secondary structure prediction using local alignments. J. Mol. Biol. 268, 1, 31-36. 2. Cornell W. et al. (1995) A second generation force field for the simulation of proteins, nucleic acids and organic molecules. J. Am. Chem. Soc. 117, 5179-5197. 3. Lazaridis T., Karplus M. (1999) Effective energy function for proteins in solution. Proteins 35, 133-152. 4. Vorobjev Y.N., Almagro J.C., Hermans J. (1998) Discrimination between native and intentionally misfolded conformations of proteins. Proteins 32, 399-413.

SPAM1 (P0400) - 87 predictions: 87 3D

Protein Structure Prediction Using Multiple Methods in the Advanced Selectivity/Sensitivity Benchmarking Protocol

S. Veretnik1, W. Li 1, P.E. Bourne1,2 and I.N. Shindyalov1 1 – San Diego Supercomputer Center, UCSD, MC0537, 9500 Gilman Dr, La Jolla, CA 92093-0537, 2 – Department of Pharmacology UCSD, MC0537, 9500 Gilman Dr, La Jolla, CA 92093-0537 [email protected]

We present a new approach to structure prediction, SPAM1 (“Systematic Protein Annotation and Modeling 1”). SPAM1 is a protein annotation pipeline comprising a number of structure prediction methods and a rigorous benchmarking technology [1]. SPAM1 was developed for use in genome annotation. Predictions obtained from SPAM1 were further refined by human experts on the basis of biological systematics and functional properties, and by combining non-overlapping predictions (Fig. 1).

The following methods were incorporated into the pipeline: WU-BLAST [2], NCBI-BLAST [3], PSIBLAST [3], 123D [4], TMHMM [5], COILS [6], SIGNALP [7].

Recognition of sequence similarity between an uncharacterized protein (target) and a characterized protein (template) is the main principle of structure prediction. The above-mentioned methods characterize the reliability of the similarity between target and template by estimating the probability that such similarity occurs by chance. The principal problem is that the statistical models embedded in these methods do not adequately describe the actual statistics of the relationship between the targets and templates involved in annotation. Thus, so-called “recommended” thresholds, e.g. BLAST e-values, are often used. Sometimes thresholds are obtained from a benchmark performed on some “gold standard” that is unrelated to the real targets and templates subsequently used. In [1] it was demonstrated how misleading reliability estimates can be when they rely on these concepts. A new approach to benchmarking the methods used in annotation was introduced in [1]; it provides substantially more accurate reliability estimates based on the principle of prediction consistency, evaluated for a given library of templates and targets (typically the complete set of proteins from a given genome).

1. Alexandrov N.N. et al. Reliability of sequence comparison assessed by functional, structural, and expression benchmarks, in preparation. 2. Gish W. and States D.T. (1993) Identification of protein coding regions by database similarity search. Nature Genetics 3, 266-72. 3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 4. Alexandrov N.N. and Luethy R. (1998) Alignment algorithm for homology modeling and threading. Protein Sci. 7(2), 254-258. 5. Sonnhammer E.L. et al. (1998) A hidden Markov model for predicting transmembrane helices in protein sequences. Proc Int Conf Intell Syst Mol Biol., 175-182. 6. Lupas A. et al. (1991) Predicting Coiled Coils from Protein Sequences. Science 252, 1162-1164. 7. Nielsen H. et al. (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Engineering 10, 1-6.

Figure 1. Predictions and their reliabilities (A-0.999, B-0.99, C-0.9, D-0.5, E-0.1) by SPAM1 for CASP5 targets.

SRBI (P0331) - 109 predictions: 109 3D

Bayesian Fold Recognition

P. Cherukuri, G. McAllister, and J. Bienkowska Serono Reproductive Biology Institute One Technology Pl., Rockland, MA 02370 [email protected]

Our method uses a set of structural Hidden Markov Models automatically designed for protein domains present in SCOP. Models for all non-redundant domains are built and grouped according to the fold classification. All models representing a given fold constitute a fold model. Bayesian statistics is used for solving the first problem of protein structure prediction: the recognition of the correct fold for a given sequence. The probability of observing a given structural model for a sequence is not associated with the lowest free energy. According to Bayes, fold recognition is measured by an a posteriori probability of a model given the query sequence. In our approach alternative models of protein structure are regarded as generators of a protein sequence and for a given model the a priori probability of generating a sequence is equal to a sum over probabilities of all sequence-to-model alignments [1]. Fold recognition is reported only if the top ranking fold has a posterior probability higher than 0.5. The next 4 alternative folds are also reported if their probability is higher than 0.01. Once the fold model is identified for a sequence, the optimal alignment to the sequence of the target functional domain is generated using the sequence profile alignment software PIMA [2]. Sequence profiles for each functional domain are generated automatically by selecting a set of diverse functional homologs (profile defining set) and creating a profile using PIMA. We add the query sequence to the profile defining set and generate the alignment. In case this attempt at generating the alignment fails, we align just a pair of sequences: the query sequence and the target sequence.
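The posterior-probability reporting rule described above can be sketched as follows. This is only an illustration under stated assumptions: the per-fold probabilities of generating the sequence are taken as given inputs (in the method they come from summing over all sequence-to-model alignments), the prior over folds is assumed uniform unless supplied, and the function names and numerical values are hypothetical.

```python
def fold_posteriors(likelihoods, priors=None):
    """Return P(fold | sequence) for each fold using Bayes' rule.

    likelihoods maps a fold name to P(sequence | fold model).
    priors maps a fold name to P(fold); uniform if omitted (assumption).
    """
    folds = list(likelihoods)
    if priors is None:
        priors = {fold: 1.0 / len(folds) for fold in folds}
    joint = {fold: likelihoods[fold] * priors[fold] for fold in folds}
    total = sum(joint.values())
    return {fold: joint[fold] / total for fold in folds}


def report_folds(posteriors, top_cutoff=0.5, alt_cutoff=0.01, max_alternatives=4):
    """Apply the reporting rule: top fold above 0.5, then up to 4 above 0.01."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    if not ranked or ranked[0][1] <= top_cutoff:
        return []  # no confident fold recognition is reported
    report = [ranked[0]]
    for fold, prob in ranked[1:1 + max_alternatives]:
        if prob > alt_cutoff:
            report.append((fold, prob))
    return report


# Hypothetical likelihoods for four fold models.
example = {"fold_a": 8e-10, "fold_b": 1.5e-10, "fold_c": 5e-11, "fold_d": 1e-13}
posteriors = fold_posteriors(example)
reported = [fold for fold, _ in report_folds(posteriors)]
```

With these values the fourth fold falls below the 0.01 cutoff and is left out of the report, while the top fold passes the 0.5 threshold.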

1. Bienkowska J, He H., Rogers R.G. Jr. and Yu L. (2002), Bayesian Approach to Fold Recognition Protein Structure Prediction: Bioinformatics Approach edt. I.Tsigielny. IUL 2. Das S. and Smith T.F. (2000) Identifying Nature’s Protein Lego Set in Analysis of Amino Acid Sequences. Adv. Prot. Chem. 54 159-183.

Sternberg (P0105) - 71 predictions: 71 3D

Fold Recognition Using 3D-PSSM and Human Intervention and Its Application to Comparative Modeling

L.A. Kelley and M.J.E. Sternberg Structural Bioinformatics Group, Imperial College of Science, Technology and Medicine, London, United Kingdom [email protected]

Our program 3D-PSSM [1-2] was developed for fold recognition and our main objective at CASP5 was to test its performance both as a fully automated server and in combination with human intervention. In addition, the performance of fold recognition methods in generating alignments at comparable levels of accuracy with those from comparative modeling was observed at CASP4 and consequently we used 3D-PSSM to generate models for comparative modeling at CASP5. The methodologies we have used for each type of target overlap significantly. Regardless of target type we initially run the target sequence through our fold recognition system 3D-PSSM [1-2] which uses a weekly updated representative fold library containing approximately 8000 structures or domains at the time of writing.

The 3D-PSSM hits are examined for highly confident or high sequence identity hits. If a structural template has been confidently found by the PSI-Blast [3] component of 3D-PSSM it is treated as a comparative modeling target. Otherwise it is treated as a fold recognition target. In addition, matches over subsequences of the target are examined and the target is manually chopped into separate domains if required. Each domain is subsequently treated and modeled separately, with the exception of cases where a highly similar protein with the same sequence of domains is present in the structure database. Often such determination of domain boundaries is guided by either PSI-Blast multiple alignments or highly confident hits from 3D-PSSM over a region of a target sequence.

Comparative modeling targets: The specific template used for modeling is chosen by manually evaluating the length and quality of the alignment and the percentage sequence identity between template and target. Targets are also scanned against the non-redundant sequence database to detect closer homologues not present in the 3D-PSSM fold library. The alignment is adjusted as described later in this abstract, and insertions and deletions are treated using the Loopy [4] algorithm. Sidechains are automatically modeled using SCWRL [5]. Generally, large insertions are not modeled as the accuracy of modeled loops decays rapidly with length.

Fold recognition: In cases where no confident 3D-PSSM hit has been detected, the top 20 highest scoring 3D-PSSM hits are manually examined, homologues of the target are submitted to the server, and the results from other automatic servers participating in the CAFASP experiment are investigated.

Importantly, we would make judgments about lower scoring matches from the 3D-PSSM top 20 based on the SAWTED [6] text score and on the keywords shared between query and template. Although this feature is automatically included in the server results, below threshold SAWTED scores were often taken into account when choosing a fold or superfamily. Also, literature related to the target sequence and potential structural templates was retrieved and examined.

Once a fold had been chosen on the basis of the above analysis, the automatic alignments produced by the 3D-PSSM server were often manually adjusted to meet a variety of criteria:

1. Maintenance of a hydrophobic core based on three-dimensional models generated from the alignments.
2. Equivalencing of known core residues (as precalculated by using a mutual contact algorithm) with hydrophobic residues in the target.
3. Preservation of the continuity of secondary structure elements.
4. Maintenance of the spatial arrangements of residues suspected to form the active site.
5. Alignment of known motifs (such as the Walker A and B motifs in P-loops, or known conserved residue types in specific folds as determined by literature searches).
6. Maintenance of the spatial distances between cysteine residues believed to form disulfide bridges.

In the relatively few cases in which we were presented with more than one high scoring, or otherwise viable template from the same fold or superfamily, we would analyze and interactively adjust the alignment to each template, looking for the template that fulfills as many of the above criteria as possible.

1. Kelley L.A. et al. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299(2), 501-522 2. Bates P.A. et al (2001) Enhancement of protein modeling by human intervention in applying the automatic programs 3D-Jigsaw and 3D-PSSM. Proteins Suppl. 5, 39-46. 3. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 4. Xiang Z. and Honig B. (2002) Evaluating configurational free energies: the colony energy concept and its application to the problem of protein loop prediction. Proc. Natl. Acad. Sci. USA. 99(11), 7432-7437. 5. Bower M. et al. (1997) Sidechain prediction from a backbone-dependent rotamer library: A new tool for homology modeling. J. Mol. Biol. 267, 1268-1282. 6. MacCallum R.M. et al. (2000) SAWTED: Structure assignment with text description - enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics 16(2), 125-129.

SUNDARAMS (P0381) – 0 predictions

Protein 3D Structure from Primary Sequence Data (A Constrained Simulated Annealing Approach)

K.Sundaram1 and Shyam Sundaram2 1 – S.A. Engineering College, Chennai 600077, India, 2 – Bioinformatics Developer, Virginia, USA [email protected]

Our objective is to derive the full three-dimensional structure of a protein, including H atoms, from sequence information. The distinctive feature of our approach is the belief that the native 3D fold of a protein is largely determined by the geometrical constraints imposed by chain connectivity, compactness, the avoidance of steric clashes, etc. This view is supported by a couple of recent studies [1-2]. Our technique for nudging the protein toward a compact 3D fold during refinement is to constrain it within a rigid enclosure of appropriate size, to be determined by trial and error. Water and/or CCl4 molecules will also be floated around to fill the gaps and to simulate the polar and non-polar environments of the native cellular environment.

As an alternative to large-scale simulation using supercomputer power or distributed computing over the Internet, we have tried to first derive a sampling of tangible structures that can be used as initial states in short sequence Monte Carlo simulations. In this process we make full use of the renowned services for the prediction of secondary structures, motifs, domains, etc. Typically, we have first used the residue sequence to query PROF [3-4] to generate a companion conformation sequence file consisting of letters E, H, or L to represent the local conformation at each α carbon atom. As the two linear sequences pass through (in tandem and residue by residue) specially designed interactive modeling software, the φ,ψ angle pair chosen at each residue position is presented on a Ramachandran plot template. Similarly, slider bars appear for choosing the χ angles appropriate for the residue in question. The initial φ,ψ pair is chosen randomly within the specified region (H, E, or L). The chosen angles can be varied manually by clicking on the desired point or by pressing one of several buttons that choose the best energy-based position, relaxing the side chain alone, the residue alone, or the whole molecule. The total non-bonded interaction energy and a 3D model are presented to aid in the choice of a desired conformation. Among the initial structures generated as described here, information derived from predictions of motifs [5] and domains [6-7] as well as other intuitive deductions based on protein function, etc., will be used to shortlist probable candidate structures.
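The random choice of an initial backbone angle pair within the region named by each conformation letter can be sketched as below. The rectangular region bounds in degrees are rough illustrative assumptions, not the templates used by the authors' software, and the function and variable names are hypothetical.

```python
import random

# Assumed rectangular (phi, psi) ranges in degrees for each conformation
# letter. These bounds are illustrative only.
REGIONS = {
    "H": ((-90.0, -30.0), (-70.0, -10.0)),     # helical region
    "E": ((-150.0, -90.0), (90.0, 180.0)),     # extended (strand) region
    "L": ((-180.0, 180.0), (-180.0, 180.0)),   # loop: unrestricted here
}


def initial_torsions(conformation, rng=None):
    """Pick a random (phi, psi) pair per residue inside its assigned region."""
    rng = rng or random.Random()
    angles = []
    for state in conformation:
        (phi_lo, phi_hi), (psi_lo, psi_hi) = REGIONS[state]
        angles.append((rng.uniform(phi_lo, phi_hi), rng.uniform(psi_lo, psi_hi)))
    return angles


# A seeded generator makes the starting structure reproducible.
start = initial_torsions("HHHEEL", random.Random(0))
```

Passing an explicit generator keeps each sampled starting structure reproducible, which is convenient when several initial states are later compared.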

The shortlisted structures will be subjected to Monte Carlo simulation within a rigid spherical enclosure. The simulation module uses sophisticated potential functions, including valence electronic polarizabilities, but some of the energy components can be selectively turned off for computational efficiency in the initial stages, or if they are found not to influence the protein fold significantly.

We have been using this method to derive structures for CASP5 targets T0180, T0188, and T0190 and TMW target 8.

1. Banavar J.R. et al. (2002) Geometry and physics of proteins. Proteins 47, 315-322. 2. Hartl F.U. (1996) Molecular chaperones in cellular protein folding. Nature 381, 571-580. 3. Rost B. and Sander C. (1993) Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol. 232, 584-599. 4. Rost B. et al. (1996) Topology prediction for helical transmembrane proteins at 86% accuracy. Prot Science, 7, 1704-1718. 5. Hofmann K. et al. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res. 27, 215-219. 6. Corpet F. et al. (1998) The ProDom database of protein domain families. Nucleic Acids Res 26, 323-326. 7. Corpet F. et al. (2000) ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28, 267-269.

SUPERFAMILY (P0065) - 925 predictions: 925 3D

Structural Domain Predictions For All Genomes

J. Gough Structural Biology, School of Medicine, Stanford University, CA94305-5126, U.S.A. [email protected]

The SUPERFAMILY library of hidden Markov models (HMMs) [1-2] was designed to provide structural assignments to protein sequences on a genomic scale, and the extensive web site [http://supfam.org] aims to facilitate an analysis of the results. The library has been applied to all completely sequenced genomes and other large data sets, such as SwissProt + TrEMBL and nrdb90.

An obvious by-product of the work is three-dimensional structure prediction, which also offers the possibility of independent assessment and comparison to other methods. The predictions submitted to CASP were obtained by a method identical to that used for genome assignments, and are therefore indicative of their quality. However, since only significant hits are used in genome assignments, and since the assignments require an overall error rate of less than 1%, the CASP server was “forced” to produce several models for each target, regardless of the confidence score.

The full details of how the model library was created are described elsewhere [1], but it should be noted here that the procedure uses the latest public release of the SAM package of profile HMM programs [3]. The library is generated with expert intervention and offers advantages over the default SAM T99 procedure included in the release, but does not contain improvements due to the more advanced but as yet unreleased versions of the software. The SUPERFAMILY database is also based on the SCOP classification of proteins [4] and so structures added to the PDB since the latest release (1.59) are not included.

Because this server was designed for genome analysis, the underlying method satisfies three criteria that go beyond what is required of other CASP methods: it is fast and can be used on large datasets; it can deal with multi-domain proteins, including domains that are non-contiguous in their gene sequence, and is robust with respect to gene prediction errors; it has a reliable confidence score. In addition, the website provides a platform for browsing sequence alignments and comparing the distributions of protein families and their domain combinations across genomes. The server has been used in the annotation of several genomes and a number of other research projects of biological importance.

The database and model library are available for download from the web site. A development version of the next generation server was also submitted to CASP as SUPERFAMILY profile-profile.

1. Gough J. et al. (2001) Assignment of genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313 (4), 903-919. 2. Gough J. and Chothia C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments, and genome assignments. Nucl. Acids Res. 30 (1), 268-272. 3. Karplus K. et al. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics. 14 (10), 846-856. 4. Murzin A. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247 (4), 536-540.

SUPFAM_PP (P0086) - 728 predictions: 728 3D

The Next Generation of Structural Genome Analysis

J. Gough1 and M. Madera2 1 - Structural Biology, School of Medicine, Stanford University, CA94305-5126, U.S.A., 2 - Structural Studies, MRC Laboratory of Molecular Biology, Cambridge, CB22QH, UK [email protected]

This automatic server is a development version of the next-generation replacement for SUPERFAMILY [1-2] (which was also submitted to CASP, see the corresponding abstract). Although it is currently based on the same library of profile hidden Markov models (HMMs), this server employs a number of new techniques that make it fundamentally different from the production server. However, since it is intended for genome analysis, the modifications have been restricted to those techniques that can realistically be applied on a genomic scale.

We are planning to incorporate two key enhancements into the next generation of our server. Firstly, we intend to improve remote homology recognition by comparing profiles to profiles (rather than profiles to sequences) in a manner pioneered by [3], and secondly, we aim to provide additional biological information via a family-level classification of the query sequence. Development versions of both enhancements have been implemented in this server.

Regarding the first enhancement, in the current production version of SUPERFAMILY a query sequence is searched directly against a library of profile HMMs that represent all proteins of known structure. This model library is pre-generated using expert intervention, which is feasible because of the limited number of structural superfamilies. By contrast, this CASP server first uses an automated method to build a profile HMM from the query sequence, and then compares the profile HMM (rather than the query sequence) to all models in the library. Because it is built using an automated method, the query profile may not be as good as models in the curated library, but the extra information contained in the profile is enough to achieve a marked improvement in performance.

As far as family-level classification is concerned, SUPERFAMILY was originally designed to provide a structural classification of protein domains at the SCOP [4] superfamily level. However, most large superfamilies are diverse and contain a number of distinct families with different biological functions. From the point of view of sequence annotations it would therefore be exceedingly useful if we could determine the family of the query sequence, in addition to its superfamily. Our approach to this problem was motivated by the following question: “Given that a domain belongs to a particular superfamily, which is the most similar structure?” The SUPERFAMILY profile HMMs are bad at answering this question because they aim to represent the entire superfamily. On the other hand, pairwise sequence comparison methods are often unable to detect distant similarities, including ones within more divergent families. The method used here was therefore a hybrid of the two. It is based on a direct comparison of a pair of sequences, but uses a profile HMM as a guide. The profile provides an alignment of the two sequences, but the alignment is scored using a conventional substitution matrix and gap penalties. However, to capture more of the information contained in the profile, the scores at each position are weighted by the degree of conservation shown by the profile.
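The hybrid scoring idea (an alignment supplied by the profile, scored with a substitution matrix and gap penalties, with per-position conservation weights) can be sketched as follows. The abstract does not give the weighting scheme, matrix or penalties, so everything concrete here is an assumption: the toy identity matrix, the gap penalties, the choice to weight only substitution scores, and all names.

```python
def weighted_alignment_score(aligned_a, aligned_b, conservation, subst,
                             gap_open=-10.0, gap_extend=-1.0):
    """Score a fixed pairwise alignment with conservation-weighted columns.

    aligned_a, aligned_b: equal-length strings with '-' for gaps.
    conservation: one weight per alignment column (e.g. in [0, 1]).
    subst: dict mapping (residue, residue) to a substitution score.
    Simplification: consecutive gap columns count as one gap, even if the
    gap switches from one sequence to the other.
    """
    if not (len(aligned_a) == len(aligned_b) == len(conservation)):
        raise ValueError("sequences and weights must have equal length")
    score = 0.0
    in_gap = False
    for a, b, weight in zip(aligned_a, aligned_b, conservation):
        if a == "-" and b == "-":
            continue  # column carries no information for this pair
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += weight * subst[(a, b)]
            in_gap = False
    return score


# Toy matrix over four residue codes: +2 for identity, -1 otherwise.
toy = {(x, y): (2.0 if x == y else -1.0) for x in "ACGT" for y in "ACGT"}
score = weighted_alignment_score("AC-GT", "ACCGA", [1.0, 0.5, 0.2, 1.0, 0.1], toy)
```

In this toy case the mismatch in the last column contributes little because the assumed conservation weight there is low, which is the effect the weighting is meant to capture.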

The novel tools used by this server are currently under development, but will be made public and applied to genome analysis in the near future.

1. Gough J. et al. (2001) Assignment of genome sequences using a library of hidden Markov models that represent all proteins of known structure. J. Mol. Biol. 313 (4), 903-919. 2. Gough J. and Chothia C. (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments, and genome assignments. Nucl. Acids Res. 30 (1), 268-272. 3. Rychlewski L. et al. (2000) Comparison of sequence profiles. Strategies for structural prediction using sequence information. Protein Sci. 9, 232-241. 4. Murzin A. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247 (4), 536-540.

Szed-Asmat (P0515) - 6 predictions: 6 3D

Comparative Modeling of Six Randomly Selected Target Proteins by MODELLER 6

A. Salim and S. Zarina Department of Biochemistry, University of Karachi, Karachi 75270, Pakistan [email protected]

Three-dimensional structure predictions were made for 6 target proteins by comparative modeling using the protein structure modeling program MODELLER 6v2 (Windows version) [1], which constructs protein models by satisfaction of spatial restraints. The targets for 3D modeling were (i) hypothetical cytosolic protein yckF (T0167), (ii) hypothetical protein HP0162 (T0177), (iii) spermidine synthase homolog (T0179), (iv) TM1478 (T0182), (v) TM1816 (T0188) and (vi) transthyretin-related protein (T0190). These targets have sequence similarities of 30-42% with their respective templates, which were identified by a BLAST [2] search against the Protein Data Bank. Between 10 and 20 models were constructed for each target protein with MODELLER, and the model that satisfied most of the stereochemical criteria on evaluation with the program PROCHECK [3] was selected as the best. The Ramachandran plots of these models showed no residues in the disallowed region, except in the case of protein TM1478 (T0182), where a single residue was located in the disallowed region of the plot. The corresponding template of this target protein has the same residue in the disallowed region. Structural superpositions of the Cα atoms between the models and the corresponding experimental structures showed root mean square deviations (RMSD) of between 0.2 and 1.5 Å, showing that sequence similarity >30% produces models of greater accuracy.
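For reference, the RMSD quoted above is the square root of the mean squared distance between equivalent atoms. The sketch below computes it for two coordinate lists that are assumed to be already optimally superposed and listed in the same residue order; it does not perform the superposition itself, and the function name and coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of
    (x, y, z) coordinates that have already been superposed."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Two hypothetical C-alpha traces, offset by 1 Å along z.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
reference = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
deviation = rmsd(model, reference)
```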

1. Sali A., Blundell T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815. 2. Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. (1990) Basic local alignment search tool. J Mol Biol 215, 403-410. 3. Laskowski R.A., MacArthur M.W., Moss D.S., Thornton J.M. (1993) PROCHECK: A program to check the stereochemical quality of protein structures. J Appl Cryst 26, 283-291.

Taylor (P0423) - 113 predictions: 113 3D

CASP Modelling Methods

W.R. Taylor and K.X. Lin NIMR, London NW7 1AA, UK [email protected]

The target sequence was fed through an automatic databank search protocol over the non-redundant protein sequence databank. This involved a pre-scan with locally installed psiBLAST (4 cycles, p=0.001). The sequence fragments hit by psiBLAST were then extracted and realigned by MULTAL (automatically removing homologues closer than 90%) before being fed to the databank search program QUEST. QUEST is able to 'push the envelope' further than psiBLAST, as it has a built-in multiple alignment stage that reins in iterations that hit too many sequences. It often found useful sequences where psiBLAST had found none (or just trivial variants). The final iteration reduced the selected sequences (if there were enough) to a non-homologous set in which no pair had more than 60% identity. This produced the target family, for which the secondary structure was then predicted by psiPRED. Rather than run each member of the family against the databases again, a local BLAST directory was set up containing just the family members, and this was used by a locally installed psiPRED. The resulting predictions, along with the sequences coloured by their amino acids, were written as PostScript and displayed using "gv", which allows easy magnification and browsing over large alignments.

The CAFASP results were taken as a prefilter for the selection of proteins to model. The CAFASP summary results for each target were downloaded and all sequence fragments extracted. These were fed to MULTAL, which produced multiple (sub)alignments for each protein. As above, each alignment was fed to QUEST and scanned against the NR sequence DB. QUEST anticipates a consensus domain size, so any small fragments or overlong members become regularised. This results in a set of families, at least one member of which has a known structure and has been seen by the CAFASP methods. These families, called the PDBseq. families, were treated in the same way as the target family to produce secondary structure predictions. The known secondary structure was also calculated and stored.
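The reduction of a family to a set in which no pair of sequences exceeds 60% identity can be illustrated with a simple greedy filter. This is a sketch of the general idea only, not the procedure implemented in QUEST: the greedy order, the identity definition (matches over columns where both sequences have a residue) and all names are assumptions.

```python
def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def non_redundant(aligned, max_identity=60.0):
    """Greedily keep sequences so that no kept pair exceeds max_identity.

    aligned is a list of (name, aligned_sequence) tuples taken from one
    multiple alignment; earlier entries take priority over later ones.
    """
    kept = []
    for name, seq in aligned:
        if all(percent_identity(seq, other) <= max_identity for _, other in kept):
            kept.append((name, seq))
    return kept


# Hypothetical family: s2 is 90% identical to s1 and is dropped.
family = [
    ("s1", "ACDEFGHIKL"),
    ("s2", "ACDEFGHIKV"),
    ("s3", "ACDEFGWWWW"),
    ("s4", "MNPQRSTVWY"),
]
kept_names = [name for name, _ in non_redundant(family)]
```

A greedy pass depends on input order, so placing the target sequence first guarantees that it is retained.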

The target family was aligned using MST (Multiple sequence Threading) with each PDBseq. family in turn. MST uses a combination of 3D packing with predicted/observed secondary structure matching and profile/profile alignment to produce an alignment of the two families. This was then visualised in gv (with predicted secondary structures plus motif colouring) alongside the model of the target sequence on the known structure (also coloured by predicted secondary structure).

When all went well (i.e. there were homologues for both sides of the alignment) the full process was completely automatic and modelled structures could be 'flicked' through one after the other. If there was a reasonably clear match with a number of known structures, then the structure with the top CAFASP jury score was taken (as this should make comparison with other models more direct).

If there were no homologues for the target, the QUEST search was rerun by hand, allowing the envelope to be pushed a little further before the junk flooded in. Occasionally some members were deleted from the search because of an improbable connection based on functional keywords. If there were no homologues for the template, then the structural neighbours of the template were aligned using SAP (and fed back into MST). If the MST alignment was only good in parts, then marker points were inserted to hold the good part while the sequences were realigned. If there were still no homologues, then a novel ab initio method was run. This uses a secondary structure lattice but also wanders off-lattice. It generates thousands of models that are then filtered by shape, fold and packing (Taylor, unpublished).

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 2. Taylor W.R. (1998) Dynamic databank searching with templates and multiple alignment. J. Mol. Biol. 280, 375-406. 3. Taylor W.R. (1997) Multiple sequence threading: an analysis of alignment quality and stability. J. Mol. Biol. 269, 902-943.

TCS-Bioinformatics (P0404) - 40 predictions: 40 SS

Highly Trained Neural Network Predictors

Dilip Antony Joseph1, M. Vidyasagar2 and Sharmila Mande2 1Indian Institute of Technology, Madras, 2Tata Consultancy Services [email protected]

Artificial Neural Network based predictors have proven to be very effective in protein secondary structure prediction. The neural network is able to predict the state of the central residue of a window of amino acids. These predictors have obtained prediction accuracies of over 75%. The number of neurons in a predictor is often more than 20000. To effectively train a network with such a large number of changeable parameters requires a very large training set. The predictor developed here attempts to use a large amount of the available secondary structure data in training.

A standard feed forward neural network was used in the predictor. The input to the neural network consisted of a window of 15 amino acids [3]. Each amino acid in the window was represented by the 20 numbers obtained from the PSIBLAST [1, 2] profile of the sequence. The network classifies the central residue of the window as either in the Alpha Helix, Beta Strand or Coil state. A second network was used to ‘clean up’ the secondary structure sequence produced by the first network. The training set consisted of over 6500 protein sequences from the PDB SELECT [4, 5] database, which gave 1436264 input-output pairs for training the network. The effectiveness of the above training set is diminished by the similarity (up to 90 %) between some of the sequences in the set. However, this training set did give better prediction accuracies than the networks trained on a smaller number of sequences. Nine neural networks trained independently on the same training set (randomly shuffled) were constituted into a jury. This also led to a small increase in prediction accuracy.
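The input encoding and the jury step described above can be sketched as follows. This is an illustrative sketch, not the authors' code: zero-padding at the sequence termini and averaging of jury outputs are assumptions (the abstract does not say how either was done), the profile values are random stand-ins for a real PSIBLAST profile, and the member outputs are invented.

```python
import random

WINDOW = 15          # residues per input window (from the text)
PROFILE_WIDTH = 20   # PSIBLAST profile values per residue (from the text)
STATES = ("H", "E", "C")  # alpha helix, beta strand, coil


def window_inputs(profile, center):
    """Flatten the profile rows of the window centred on `center` into one vector.

    Positions beyond either end of the sequence are zero-padded (an assumption).
    """
    half = WINDOW // 2
    vector = []
    for i in range(center - half, center + half + 1):
        if 0 <= i < len(profile):
            vector.extend(profile[i])
        else:
            vector.extend([0.0] * PROFILE_WIDTH)
    return vector


def jury_predict(member_outputs):
    """Average per-state scores over jury members and return the top state."""
    n = len(member_outputs)
    mean = [sum(out[s] for out in member_outputs) / n for s in range(len(STATES))]
    return STATES[mean.index(max(mean))]


# Toy profile for a 30-residue sequence; random numbers replace real profile values.
random.seed(0)
profile = [[random.random() for _ in range(PROFILE_WIDTH)] for _ in range(30)]
x = window_inputs(profile, center=0)
print(len(x))  # 15 residues x 20 profile values per window

# Hypothetical outputs of three jury members for one residue, ordered as STATES.
print(jury_predict([[0.6, 0.3, 0.1], [0.4, 0.5, 0.1], [0.7, 0.2, 0.1]]))
```

Each window therefore presents 300 numbers to the first network. A majority vote over the members' top states would be an equally plausible way to combine the jury.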

It has been observed that the larger and more varied the training set, the better the prediction accuracy. As more structural data become known in the future, it is important to include those sequences in the training set. However, retraining the whole network is a time-consuming exercise. The effectiveness of retraining the network with only the new sequences is studied.

A jury consisting of highly trained predictors along with specialized alpha and beta strand predictors can effectively increase the prediction accuracies.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 2. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 3. Rost B. and Sander C. (1993) Improved prediction of protein secondary structure by use of sequence profiles and neural networks. Proc. Natl. Acad. Sci. USA, 90, 7558-7562 4. Hobohm U., Scharf M., Schneider R., Sander C. (1992) Selection of a representative set of structures from the Brookhaven Protein Data Bank. Protein Science 1, 409-417 5. Hobohm U. and Sander C. (1994) Enlarged representative set of protein structures, Protein Science 3, 522

THW-FR (P0377) - 241 predictions: 241 3D

Net Charge Center for Protein Fold Recognition

I. Y. Torshin1,2,3 1 – Chair of Physical Chemistry, Chem. Dept., Moscow State University, 2 – Comp. Sci. Dept., GSU, Atlanta, GA, 3 – Biol. Dept., GSU [email protected]

Net charge center (NCC) is a novel physico-chemical model developed for analysis of the relationship between protein structure and function. Various “quantitative” models (often built, perhaps, as attempts to imitate the amazingly accurate mathematical apparatus of modern physics [1]) may allow experimental data to be fitted to calculations using a number of arbitrary empirical parameters, but they do not appear to have clear physical significance. The NCC model does not include any empirical parameters whatsoever and is calculated solely on the basis of the three-dimensional structure of the protein using an extremely simple formula [unpublished]. The physical significance of the model is that NCC describes the spatial distribution of the charged/ionized residues in a protein molecule at given physico-chemical conditions. The biological significance of the NCC model is that the NCC is very often located in the biologically important sites of protein molecules of different biochemistry [unpublished data]. This last fact allows the NCC model to be applied to the problem of fold recognition by generating a library of template structures.

Sequences around positive and negative charge centers (PNCC) are likely to be folding cores or folding intermediates [2-4]. The NCC model, as noted above, is likely to determine the location of functional regions and sequences. These two properties of a native protein were used to compile a template library for fold recognition. The library was based on the domain database GTDD (Gestalt Theory Domain Database [unpublished]). Gestalt theory [5], although proposed over 50 years ago, is still one of the best theories describing the principles of perception. The theory explains a large number of experimental facts pertaining to perception (human perception, in particular) without the over-complicated or purely statistical explanations characteristic of modern behaviorism. The gestalt principles can be computerized for many purposes, and in this study they were used to generate a database of domains from the non-redundant PDB. In short, NCC + PNCC allows selection of potential templates from the GTDD library of templates (that is, it performs fold recognition).

The non-redundant GTDD for fold recognition was built using a non-redundant set of PDB sequences selected at a BLAST E-value of 10e-7 [6]. The program FoldRec-CC generated a set of models, which were then refined by energy minimization using AMMP [7]. Several modifications of the method were used: 1. the full FoldRec-CC method; 2. NCC only; 3. PNCC only; and 4. the full FoldRec-CC method but using a multiple sequence alignment (BLAST [8]) for the target. Preliminary data (as judged by proteins with definitely known templates such as T0137, T0144, etc.) suggest that using the NCC model alone (as in modification 2 above) often leads to correct fold recognition, though the full FoldRec-CC method (modification 1 above) is more reliable.

This method was also applied during the TMW-1 (Ten Most Wanted) experiment and, although the structures of the TMW proteins are not known, there is circumstantial evidence that at least the correct fold was predicted for a number of the TMW targets. Although the method is fully automated, visual inspection of the final 10-20 models for a target, as well as the use of secondary structure predictions [9], is likely to improve the results of fold recognition.

1. Köhler W. (1947), Gestalt Psychology: An introduction to New Concepts in Modern Psychology, Liveright Publishing, NY, p.42. 2. Torshin I. et al. (2002) Charge centers and formation of the protein folding core. Proteins, 43:353-364. 3. Torshin I. et al. (2002) Identification of protein folding cores and nuclei using charge center model of protein structure. TheScientificWorld Journal, 2:84-86. 4. Torshin I. et al. (2002) Protein folding: search for basic physical models, submitted. 5. Köhler W. (1947), Gestalt Psychology, 136-279. 6. Madej T. et al. (1995) Threading a database of protein cores. Protein Struct. Funct. Genet. 23, 356-369. 7. Harrison R. et al. (1995). Analysis of six protein structures predicted by comparative modeling techniques. Proteins 23, S463-471. 8. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402. 9. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195-202.

TOME (P0450) - 260 predictions: 260 3D

Evaluation of a New Protein Structure Modelling Pipeline TOME

G. Labesse, V. Catherinot, J.-L. Pons, L. Martin and D. Douguet 1 - Centre de Biochimie Structurale (CNRS), Montpellier, France [email protected]

The fold compatibility between the targets and PDB entries was analyzed using our recently developed meta-server [1]. Query sequences are sent automatically to six distinct fold recognition or protein structure prediction servers: 3D-PSSM[2], PDB-BLAST (http://bioinformatics.burnham-inst.org/pdb_blast/), FUGUE[3], GenTHREADER[4], SAM-T99[5] and J-PRED2[6], with default parameters except for PDB-BLAST (10 iterations). A consensus ranking was estimated. Fold recognition searches were resumed for multi-domain targets to reassess template ranking and scoring or to highlight structural similarities. In a few cases several runs were necessary for proper domain delimitation. As most “threaders” use the “frozen approximation”, each structural alignment was further evaluated using T.I.T.O. [7]. Sequence identity, threading scores and ranking, as well as the percentage of target sequence and template structure overlap, were taken into account for validation of the proposed fold. The presence of hits with homologous structures (large family and/or easy target) or with related folds (small family and/or difficult targets) was also checked. Template-related structures were searched using FSSP [8]. For easy targets, models were built directly using MODELLER 6.0 [9]. Models were evaluated using PROSA [10] and Verify3D [11]. Indel modelling was carefully analyzed by visual inspection using XmMol [12]. Structural alignments were manually refined locally. Side chain modelling in the common core (as defined by the target-template alignment) was also performed using SCWRL 2.8 [13] and similarly evaluated but not further refined. For difficult targets, additional evaluations of the fold compatibility were performed through extensive modelling using a dozen distinct templates as well as careful structural alignment refinement through T.I.T.O. Additional restraints to be used in MODELLER 6.0 were deduced from template secondary structure assignment using P-SEA [14] and mixed with predictions, for example from J-Pred. At least three models were deposited for each target (except for a few likely in the NEW_FOLD class): one built by MODELLER (the most complete), another built using SCWRL (no indel building) and a third derived from the template by T.I.T.O. (common core only, backbone of aligned residues + side chains of conserved residues). Additional models were sometimes added when distinct models were obtained (sub-optimal alignment or loop building with equivalent validation score).

1. Douguet D. et al. (2001) Easier threading through web-based comparisons and cross-validations. Bioinformatics 17, 752-753. 2. Kelley L.A. et al. (2000) Enhanced Genome Annotation using Structural Profiles in the Program 3D-PSSM. J. Mol. Biol. 299, 501-522. 3. Shi J. et al. (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243-257. 4. McGuffin L.J. et al. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405. 5. Karplus K. et al. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics 14, 846-856. 6. Cuff J.A. et al. (1998) Jpred: A Consensus Secondary Structure Prediction Server. Bioinformatics 14, 892-893. 7. Labesse G. et al. (1998) A Tool for Incremental Threading Optimization (T.I.T.O.) to help alignment and modelling of remote homologs. Bioinformatics 14, 206-211. 8. Holm L. et al. (1994) The FSSP database of structurally aligned protein fold families. Nucleic Acids Res. 22(17):3600-3609. 9. Sali A. et al. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815. 10. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins 17, 355-362. 11. Eisenberg D. et al. (1997) VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol 277, 396-404. 12. Tuffery P. (1995) XmMol: an X11 and motif program for macromolecular visualization and modeling. J. Mol. Graph. 13, 67-72. 13. Dunbrack R.L. et al. (1993) Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J Mol Biol. 230, 543-574. 14. Labesse G. et al. (1997) P-SEA: a new efficient assignment of secondary structure from Ca trace of proteins. CABIOS 13, 291-295.

UCLA-DOE (P0301) - 59 predictions: 59 3D

The Directional Atomic Solvation Energy: An Atom-Based Empirical Potential for the Assignment of Protein Sequences to Known Folds

Parag Mallick1, Charlotte Deane1, Robert Weiss1 and David Eisenberg1 1 – UCLA-DOE Center for Genomics and Proteomics & Howard Hughes Medical Institute [email protected]

The Directional Atomic Solvation Energy (DASEY) is an atom-based description of the environment of an amino acid position within a known 3D protein structure. DASEY has been developed to align and score a probe amino acid sequence against a library of template protein structures for fold assignment. The DASEY is computed by summing the atomic solvation parameters [1] of atoms falling within a tetrahedral sector, or petal, extending 16Å along each of the four bond axes of each alpha-carbon atom of the protein. Unlike some previous descriptors of protein environments, such as area buried, the DASEY is able to discriminate between pairs of structurally equivalent positions and random pairs in protein structures that share a fold but belong to different superfamilies. Furthermore, DASEY values have characteristic patterns of residue replacement, an essential feature of a successful fold assignment method. Benchmarking fold assignment with DASEY scoring achieves coverage of 56% of sequences with 90% accuracy when probe sequences are matched to protein structural templates belonging to the same fold but to a different superfamily, an improvement of greater than 200% over a previous method of sequence-derived properties.

For each CASP target, models were built by first identifying a candidate fold family, refining the fold prediction to a family prediction, generating multiple alignments to potential templates, and then building and refining molecular models. PHD [2], PSI-PRED [3] and JPRED [4] were used to predict the secondary structure of all CASP targets. Next, each prediction was used to identify likely fold candidates by DASEY, the Method of Sequence Derived Properties [5], PSI-BLAST [6] and a simple composition-based filter. DASEY was then used to identify which superfamilies within a fold class were most similar to the target and which templates were most likely. Alignments to the selected templates were generated by DASEY and then visually inspected within SeaView [6]. Whenever possible, profile-profile alignments and PFAM-A [7] alignments were also generated for comparison. MODELLER [8] was used to build and refine 10 models of each alignment. Models were evaluated by MODELLER Energy, ERRAT [9], Verify3D [10] and the SwissPDBViewer Threading Potential. In some cases, alternate alignments and templates were used if no suitable candidate models were initially generated.

1. Eisenberg D., and McLachlan A.D. (1986). Solvation energy in protein folding and binding. Nature 319, 199-203. 2. Rost B. (1996). PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol 266, 525-539. 3. McGuffin L.J., Bryson K., and Jones D.T. (2000). The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405. 4. Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M., and Barton G.J. (1998). JPred: a consensus secondary structure prediction server. Bioinformatics 14, 892-893. 5. Fischer D., and Eisenberg D. (1996). Protein fold recognition using sequence-derived predictions. Protein Science 5, 947-955. 6. Altschul S.F., Madden T.L., Schäffer A.A., Zhang J., Zhang Z., Miller W., and Lipman D.J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 25, 3389-3402. 7. Bateman A., Birney E., Cerruti L., Durbin R., Etwiller L., Eddy S.R., Griffiths-Jones S., Howe K.L., Marshall M., and Sonnhammer E.L. (2002). The Pfam protein families database. Nucleic Acids Res 30, 276-280. 8. Sali A., and Blundell T.L. (1993). Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol 234, 779-815. 9. Colovos C., and Yeates T.O. (1993). Verification of protein structures: patterns of nonbonded atomic interactions. Protein Sci 2, 1511-1519. 10. Eisenberg D., Luthy R., and Bowie J.U. (1997). VERIFY3D: assessment of protein models with three-dimensional profiles. Methods Enzymol 277, 396-404.

VENCLOVAS (P0425) - 20 predictions: 20 3D

Comparative Modeling Based on a Combination of Sequence Comparison and Assessment of Structural Fitness

C. Venclovas Lawrence Livermore National Laboratory, Livermore, California [email protected]

The comparative modeling approach used to build models for CASP5 is in many respects similar to the one used at CASP4 and described in more detail in the special Proteins issue [1].

Template selection

PDB templates were identified by running either BLAST or PSI-BLAST [2] searches against the non-redundant NCBI sequence database. Usually more than one template was used to build models.

Sequence-structure alignments

Sequence-structure alignments were generated and assessed at both the sequence level and the 3D level. For high homology targets, where structural template(s) were among closely related sequences, the alignment was derived directly from PSI-BLAST results with some manual adjustments around insertions/deletions. In the case of distant homology targets, results of an initial PSI-BLAST search were used in an intermediate sequence search procedure (PSI-BLAST-ISS) [1]. In this procedure, a set of sequences that bridge sequence space between the target sequence and template(s) were used as additional probes for searching the non-redundant sequence database. Target-template sequence alignments were extracted from the resulting search data and their consistency was analyzed. Regions where one dominant alignment variant was produced were considered reliable, while regions where the target-template alignment lacked consistency were deemed unreliable. If unreliable regions were present in the alignment, multiple models were built to explore alternative alignment variants. In some of these cases, to increase selectivity, models for less-distant homologs of the target were also built. Alignments for some regions that were expected to be structurally conserved, but could not be aligned by PSI-BLAST, were derived manually using PSIPRED [3] secondary structure predictions as a guide.
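The consistency analysis of target-template alignments can be illustrated with a per-position tally. This is a minimal sketch, not the PSI-BLAST-ISS implementation: the data layout (one dict per alignment, mapping target residue index to template residue index), the function name and the 0.8 dominance cut-off are assumptions made for the example, and the three alignments are invented.

```python
from collections import Counter


def alignment_consistency(alignments, dominance=0.8):
    """Label each target position as reliable or unreliable.

    A position is called reliable when a single template residue accounts for
    at least `dominance` of the alignments that cover it (assumed cut-off).
    Returns {target_pos: (most_common_template_pos, is_reliable)}.
    """
    positions = sorted({pos for aln in alignments for pos in aln})
    report = {}
    for pos in positions:
        votes = Counter(aln[pos] for aln in alignments if pos in aln)
        partner, count = votes.most_common(1)[0]
        share = count / sum(votes.values())
        report[pos] = (partner, share >= dominance)
    return report


# Three hypothetical target-template alignments from different intermediate probes.
alns = [
    {1: 11, 2: 12, 3: 13, 4: 14},
    {1: 11, 2: 12, 3: 14, 4: 15},
    {1: 11, 2: 12, 3: 13, 4: 16},
]
for pos, (partner, reliable) in alignment_consistency(alns).items():
    print(pos, partner, "reliable" if reliable else "unreliable")
```

In this toy case positions 1 and 2 agree across all alignments, whereas positions 3 and 4 have competing variants and would be candidates for building alternative models.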

The final target-template alignment was selected by taking into account structural fitness of each of the alternative alignments. Structural fitness (quality) was assessed by several methods including visual inspection, ProsaII [4] profiles and Z-scores and reports from the WHATIF [5] quality evaluation module (Whatcheck).

Loop modeling

Most of the loops for distant homology targets were assigned automatically during model-building. In other cases loops were modeled after suitable fragments from PDB structures. Preference was given to evolutionarily related protein structures. In their absence, the conformation that was dominant in the results of fragment searches was assigned to the targeted region.

Generating 3D structures

Models were generated with MODELLER [6]. In most cases side chains were rebuilt using SCWRL [7]. Any strong side chain clashes after this step were removed manually. No energy minimization procedures were used.

1. Venclovas C. (2001) Comparative modeling of CASP4 target proteins: Combining results of sequence search with three-dimensional structure assessment. Proteins, Suppl. 5, 47-54. 2. Altschul S.F., Madden T.L., Schaffer A.A., Zhang J., Zhang Z., Miller W. and Lipman D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25, 3389-3402. 3. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol, 292, 195-202. 4. Sippl M.J. (1993) Recognition of errors in three-dimensional structures of proteins. Proteins, 17, 355-362. 5. Vriend G. (1990) WHAT IF: a molecular modeling and drug design program. J Mol Graph, 8, 52-56. 6. Sali A. and Blundell T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol, 234, 779-815. 7. Bower M.J., Cohen F.E. and Dunbrack R.L., Jr. (1997) Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J Mol Biol, 267, 1268-1282.

Wolynes-Schulten (P0294) - 42 predictions: 42 3D

Ab initio Structure Prediction with Associative Memory Hamiltonians

Corey Hardin1, Michael Prentiss2, Michael P. Eastwood2, Zan Luthey-Schulten1, and Peter Wolynes2 1- University of Illinois - Urbana Champaign 2- University of California - San Diego [email protected]

We initially selected sequences for ab initio prediction if no obvious scaffold for threading/comparative modeling was found by the automated comparative modeling servers. For the selected sequences, we used an Associative Memory Hamiltonian (AMH), with parameters chosen by optimization. The optimization aims to produce an energy landscape of the AMH that is as close to an ideal funnel as our reduced model allows without using homology information. The AMH has been optimized separately for all-alpha and alpha-beta proteins [1-3]. We averaged the AMH potential over multiple sequence homologues when available. Information from secondary structure prediction was included via a potential biasing the phi-psi angles to the appropriate region of a Ramachandran plot. A sequence-dependent hydrogen bond term was used to improve beta sheet formation. Molecular dynamics simulations were used to select low energy candidate structures. Subsequently, a smaller subset of structures was selected for submission using several filters. These include agreement with secondary structure predictions and available biochemical information, as well as the energy from a second energy function designed for threading that includes pairwise contact, structural profile and backbone hydrogen bonding terms [4].

1. Eastwood M. P. et al. (2002) Statistical Mechanical Refinement of Protein Structure Prediction Schemes: Cumulant Expansion Approach. J. Chem. Phys. 117 (9), 4602-4615. 2. Hardin C. et al. (2000) Associative Memory Hamiltonians for Structure Prediction Without Homology: Alpha-Helical Proteins. Proc. Nat. Acad. Sci. U.S.A. 97 (26), 14235-14240. 3. Hardin C. et al. (2002) Associative Memory Hamiltonians for Structure Prediction Without Homology: Alpha-Beta Proteins. Proc. Nat. Acad. Sci. U.S.A. (accepted). 4. Koretke K.K. et al. Self-consistently Optimized Statistical Mechanical Energy Functions For Sequence Structure Alignment. Protein Science 5, 1043-1059.

Yan-Research (P0069) - 60 predictions: 60 SS

The Prime Number Code: A Method of Protein Structure Prediction Derived from the Genetic Code

Johnson F. Yan and Benjamin C. Yan Yan Research [email protected]

A novel algorithm has been derived from number-theory principles that can predict the secondary structures of proteins, given the primary structure (amino acid sequence). Numerically, the amino acids are represented by 20 “z-numbers” (mostly prime numbers) with which sequence patterns may be calculated [1]. Rather than being arbitrarily assigned, an amino acid’s z-number is derived from the unique algebraic properties of the three deoxyribonucleotides in its codons [2].

Whereas amino acids have assigned z-numbers, peptides and structural motifs in proteins are characterized by z-sums. The z-sum of a peptide or a secondary structural motif is the sum of the z-numbers of its constituent amino acids. If the z-sum is a prime number, then the corresponding structural motif or peptide tends to recur as a unit. The indivisibility of a peptide unit is therefore analogous to the indivisibility of a prime number.

Secondary structures of proteins can be expressed in terms of their α-helices and β-sheets. α-helices are expressed in units of heptapeptides, while β-sheet strands are expressed in units of tripeptides [3].

Identification of α-helices (heptapeptides) and β-sheet strands (tripeptides) in a protein is accomplished by scanning the amino acid sequence for prime z-sums. A heptapeptide (α-helix) scan is performed by scanning the z-sums of every seven residues for prime numbers. The scan is carried out for all seven reading frames. Similarly, a tripeptide (β-strand) scan is performed by calculating the z-sums of consecutive, non-overlapping tripeptides; the scan is repeated for all three reading frames. Heptapeptides and tripeptides with prime z-sums (called “prime heptapeptides” and “prime tripeptides”) tend to present as recurring motifs.
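The scanning procedure can be sketched in a few lines. The sketch below is illustrative only: the z-numbers in `Z_DEMO` are placeholders invented for the example (the actual values are defined in refs. [1, 2] and are not given in this abstract), the function names are hypothetical, and the heptapeptide scan is assumed to use non-overlapping units per reading frame in the same way as the tripeptide scan.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test, sufficient for small z-sums."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def prime_units(sequence, z, unit):
    """Scan non-overlapping peptides of length `unit` in every reading frame.

    Returns (frame, start, peptide, z_sum) for each peptide whose z-sum is prime.
    Use unit=7 for the heptapeptide scan and unit=3 for the tripeptide scan.
    """
    hits = []
    for frame in range(unit):
        for start in range(frame, len(sequence) - unit + 1, unit):
            peptide = sequence[start:start + unit]
            z_sum = sum(z[aa] for aa in peptide)
            if is_prime(z_sum):
                hits.append((frame, start, peptide, z_sum))
    return hits


# PLACEHOLDER z-numbers, invented for this example only (not the published values).
Z_DEMO = {"A": 2, "G": 3, "L": 5, "K": 7, "E": 11}

print(prime_units("AGLKEAG", Z_DEMO, unit=3))
```

With the placeholder table, only the tripeptide LKE in the third reading frame has a prime z-sum (5 + 7 + 11 = 23); real predictions would of course depend on the published z-numbers.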

This method is based upon intrinsic properties of the DNA sequence, which prescribes the amino acid sequence—also intrinsic—through the genetic code. Sequence analyses carried out using this method were supplemented with statistical data compiled for individual amino acids to detail extrinsic properties. The higher-order structure of a protein is therefore dependent on both intrinsic (sequence) and extrinsic (environmental) factors.

The algorithm described above was applied to the amino acid sequences of the Full Sequence Design protein of Dahiyat and Mayo [4], and of Arabidopsis cellulose synthase [5].

1. Yan J.F. 1999, U.S. Patent No. 5,856,928. 2. Yan J.F. et al. (1991) Prime numbers and the amino acid code: analogy in coding properties. J. Theor. Biol. 151, 333-351. 3. Yan B.C. and Yan J.F. (1999) Size and folding in globular proteins. Internatl. J. Biol. Macromol. 24, 65-67. 4. Dahiyat B.I. and Mayo S.L. (1997) De novo protein design: fully automated sequence selection. Science 278, 82-87. 5. Arioli T. et al. (1998) Molecular analysis of cellulose biosynthesis in Arabidopsis. Science 279, 717-720.

Yasara-Pushchino (P0202) - 192 predictions: 192 3D

WHAT IF YASARA Folds a Protein?

E. Krieger1, D.N. Ivankov2, A. Finkelstein2 and G. Vriend1 1 - CMBI, Center for Molecular and Biomolecular Informatics, University of Nijmegen, the Netherlands. 2 - Institute of Protein Research, Puschino, Russia [email protected]

Summary

Two thirds of the 48 submitted models were built fully automatically combining three newly developed approaches to structure prediction:

1) Self-parameterizing force fields. To achieve maximum accuracy, force field parameters were not derived from small molecules and then applied to proteins. Instead, the force fields were allowed to parameterize themselves while energy-minimizing high resolution X-ray structures[1]. Model refinement was done with the new YAMBER II force field, which uses the same energy function as AMBER[2], but different parameters optimized in crystal space.

2) Eliza, an "artificial modeling intelligence". Previous CASPs have shown that humans still do better than automated servers; we therefore tried to teach Eliza the human way of thinking when correcting an alignment.

3) Distributed computing with the Models@Home screensaver[3] (available from www.yasara.com/models) made it possible to run hundreds of parallel molecular dynamics simulations of models built from various possible alignments. The resulting trajectories were clustered[4,5] to avoid false positives and to pick out truly improved models.

Abstract

The YASARA/WHAT IF modeling pipeline integrates functions provided by both programs and a variety of fold recognition and secondary structure prediction servers into a fully automatic method for protein structure prediction. During CASP5, human expert alignments were additionally fed into the pipeline. These were contributed by Dmitry N. Ivankov from Alexei Finkelstein's group, and in one third of the cases they gave higher scores than Eliza's suggestions. Their method is described separately under the group name "Puschino".

Initial alignments were collected from the following fold recognition servers as summarized on the CAFASP website: SAM-T02[6], FORTE1 (www.cbrc.jp/htbin/forte1-cgi/forte1_form.pl), ORFeus, BasicC (grdb.bioinfo.pl), 3D-PSSM[7], GenTHREADER[8], mGenTHREADER, FUGUE2.1[9], INBGU (www.cs.bgu.ac.il/~bioinbgu), and MPALIGN (sunflower.kuicr.kyoto-u.ac.jp/mpalign). A consensus secondary structure prediction was obtained from PHD[10], JPRED[11], PSIPRED[12], and SAM-T02[6].

All alignments were analyzed and potentially modified by Eliza. Then the loops and structured N- and C-termini were added with YASARA's loop modeler, and side chains were completed by WHAT IF[13]. The models obtained for the various alignments were scored and the best one picked for further optimization.

In the refinement stage, the conformational space available to the model was sampled with Bert de Groot's CONCOORD program[14], then 100 parallel all-atom molecular dynamics simulations in aqueous solution (Particle Mesh Ewald electrostatics) were run with YASARA to 'home in' further on the target. This was done with the new YAMBER II force field (Yet Another Model Building and Energy Refinement force field), a second-generation self-parameterizing force field optimized in crystal space. Models were ranked based on a variety of WHAT IF quality checks[15] and clustered to avoid isolated false positives with artificially high scores[4,5].

Due to the huge computational requirements, the entire procedure was run in parallel using the Models@Home distributed computing system. Thanks to everyone working here at the CMBI in Nijmegen, Netherlands, for choosing the Models@Home screensaver. More information about the programs used is available at www.yasara.com and www.cmbi.nl/whatif.

1. Krieger E., Koraimann G. & Vriend G. (2002). Increasing the precision of comparative models with YASARA NOVA - a self-parameterizing force field. Proteins 47, 393-402. 2. Wang J., Cieplak P. & Kollman P. A. (2000). How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J. Comp. Chem. 21, 1049-1074. 3. Krieger E. & Vriend G. (2002). Models@Home: distributed computing in bioinformatics using a screensaver based approach. Bioinformatics 18, 315-318. 4. Shortle D., Simons K. T. & Baker D. (1998). Clustering of low-energy conformations near the native structures of small proteins. Proc. Natl. Acad. Sci. USA 95, 11162- 5. Xiang Z. & Honig B. (2002). Evaluating conformational free energies: the colony energy and its application to the problem of loop prediction. Proc. Natl. Acad. Sci. USA 99, 7432-7437. 6. Karplus K., Barrett C., Cline M., Diekhans M., Grate L. & Hughey R. (1999). Predicting protein structure using only sequence information. Proteins 37(S3), 121-125. 7. Kelley L.A., MacCallum R.M. & Sternberg M.J.E. (2000). Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 499-520. 8. Jones D.T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815. 9. Shi J., Blundell T.L. & Mizuguchi K. (2001). FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 310, 243-257. 10. Rost B. (1996). PHD: predicting one-dimensional protein structure by profile-based neural networks. Methods Enzymol. 266, 525-539. 11. Cuff J.A., Clamp M.E., Siddiqui A.S., Finlay M. & Barton G.J. (1998). JPred: a consensus secondary structure prediction server. Bioinformatics 14, 892-893. 12. McGuffin L.J., Bryson K. & Jones D.T. (2000). The PSIPRED protein structure prediction server. Bioinformatics 16, 404-405. 13. Chinea G., Padron G., Hooft R.W.W., Sander C. & Vriend G. (1995). The use of position specific rotamers in model building by homology. Proteins 23, 415-421. 14. de Groot B. L., van Aalten D. M., Scheek R. M., Amadei A., Vriend G. & Berendsen H.J. (1997). Prediction of protein conformational freedom from distance constraints. Proteins 29, 240-251. 15. Hooft R.W.W., Vriend G., Sander C. & Abola E. E. (1996). Errors in protein structures. Nature 381, 272-272.

Yoon (P0262) - 35 predictions: 35 3D

Simulation of the Protein Folding Structures

Jin Kak Lee, Taesung Moon and Chang No Yoon Korea Institute of Science and Technology [email protected]

To simulate the folding of a protein, we used a simple off-lattice model in which each amino acid is represented by a single unified-residue point located at its alpha carbon. The model has two angle variables: the angle between two consecutive virtual bonds (residues i to j and j to k), and the rotational (dihedral) angle defined by the virtual bonds connecting residues i, j, k and l. Protein conformations were generated by the Monte Carlo method, starting from random coil conformations. During this procedure the i-j-k angle was restricted to the range of 60 to 150 degrees. The potential energy surface was obtained from a knowledge-based potential derived from known protein structures. Of the trajectory data obtained while navigating this surface, about half were accepted and stored. The total number of accepted conformations was about 10E3, and one run comprised about 10E8 steps. Finally, all conformations were clustered using the energy and the cRMS between the alpha carbon traces, and the resulting representative conformations were minimized with the potential energy.
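
The two angle variables of such a model are enough to rebuild an alpha carbon trace. The following is a minimal sketch of that conversion, not the authors' code: the virtual bond length of 3.8 Å, the function names, and the dihedral sign convention are assumptions made for illustration; only the 60 to 150 degree limit on the i-j-k angle comes from the abstract.

```python
import math

CA_BOND = 3.8  # assumed CA-CA virtual bond length in angstroms (not stated in the abstract)


def _sub(a, b):
    return [a[0] - b[0], a[1] - b[1], a[2] - b[2]]


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def _unit(a):
    norm = math.sqrt(sum(x * x for x in a))
    return [x / norm for x in a]


def place_next(a, b, c, bond_angle, dihedral, length=CA_BOND):
    """Place point d so that the angle b-c-d equals bond_angle and the
    rotation about the b-c bond relative to a equals dihedral (radians)."""
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))  # normal to the a-b-c plane
    m = _cross(n, bc)                  # completes the local orthonormal frame
    x = -length * math.cos(bond_angle)
    y = length * math.sin(bond_angle) * math.cos(dihedral)
    z = length * math.sin(bond_angle) * math.sin(dihedral)
    return [c[k] + x * bc[k] + y * m[k] + z * n[k] for k in range(3)]


def build_trace(bond_angles_deg, dihedrals_deg):
    """Build a CA trace from N-2 bond angles and N-3 dihedral angles."""
    if len(bond_angles_deg) != len(dihedrals_deg) + 1:
        raise ValueError("need exactly one more bond angle than dihedral angles")
    if any(not 60.0 <= t <= 150.0 for t in bond_angles_deg):
        raise ValueError("bond angles must lie between 60 and 150 degrees")
    t0 = math.radians(bond_angles_deg[0])
    # The first three points are fixed in the xy plane.
    trace = [[0.0, 0.0, 0.0],
             [CA_BOND, 0.0, 0.0],
             [CA_BOND - CA_BOND * math.cos(t0), CA_BOND * math.sin(t0), 0.0]]
    for theta, phi in zip(bond_angles_deg[1:], dihedrals_deg):
        trace.append(place_next(trace[-3], trace[-2], trace[-1],
                                math.radians(theta), math.radians(phi)))
    return trace
```

A Monte Carlo move in this representation would perturb one angle, rebuild the trace, and evaluate the potential on the new coordinates.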

Yoon (P0262) - 35 predictions: 35 3D

Prediction of Protein Structure Using Homology Modeling Technique

Taesung Moon, Jin Kak Lee and Chang No Yoon Korea Institute of Science and Technology [email protected]

The homology modeling technique predicts the three-dimensional structure of a given protein sequence (target) based on an alignment of the protein to one or more homologous proteins (templates) of known structure. This technique is becoming increasingly important as the amount of structural information from X-ray crystallography and NMR grows. In this study we carried out conventional homology modeling approaches. The target protein was aligned with templates selected by a FASTA search against the PDB (Protein Data Bank) database. The coordinates of the template residues in the aligned regions were then transferred to the target. Coordinates for the unaligned regions were assigned using a library of small amino acid fragments. If no matching fragment was found, a conformational search was carried out. Energy minimization and molecular dynamics simulation were then performed to refine the model structure.

Zhou-HX (P0056) - 134 predictions: 69 3D, 65 SS

Improving Fold Recognition and Query-Template Alignment by Combining PSI-Blast and Sequence-Structure Threading

H. Chen1, 2 and H.-X. Zhou1 1 – Florida State University, 2 – Drexel University [email protected]

Both PSI-Blast [1] and sequence-structure threading have strengths and weaknesses in structure prediction. Our COBLATH [2] program was designed to exploit the complementarity of the two methodologies through judicious combination. The powerful sequence-alignment algorithm of PSI-Blast can generate a sequence profile that is highly informative, even when it cannot by itself identify a structural template. In particular, this sequence profile can be incorporated into sequence-structure threading to improve the success rate of fold recognition and the accuracy of query-template alignment.

The COBLATH program has modules for predicting the secondary structure and the solvent accessibility. Predictions for both structural features are based on a neural network, with sequence profile as the input. The predicted secondary structure and solvent accessibility in turn are used as part of the fitness function for sequence-structure threading. In addition, for a given query, the predicted secondary structure is used to screen a pool of proteins (consisting of ~3000 chains in the FSSP library) to obtain a reduced set of 200 potential templates. Threading between the query and the potential templates is carried out in both directions.

Of the 67 CASP5 targets, 40 had templates identified by PSI-Blast. The threading module was used to identify templates for the other 27 targets. In some cases (e.g., T0149), the PSI-Blast template leaves a significant portion of the query sequence uncovered. This portion of the sequence is singled out for further investigation by the threading module.

Regardless of how the template was identified, a round of threading specifically designed for query-template alignment was used to obtain the optimal alignment. This was based on the recognition that the objectives in fold recognition and in query-template alignment are different. In the former the objective is to discriminate the true template from decoys. This does not necessarily match the objective of finding the best alignment between the query and a particular template. In particular, the gap penalty was reduced in the threading for query-template alignment.

With the fully automated COBLATH program, the identified templates had high confidence levels for all but five of the 67 CASP5 targets.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 2. Shan Y., Wang G., and Zhou H.-X. (2001) Fold recognition and accurate query-template alignment by a combination of PSI-BLAST and threading. Proteins 42 (1), 23-37.

CASP5 Poster Abstracts

Accelrys (P0210) - 24 predictions: 24 3D

Structural Prediction and Functional Annotation of Proteomic Sequences using GeneAtlasTM

Dana Haley-Vicente, Velin Spassov, Tina Yeh, Ken Butenhof, Christoph Schneider, Lisa Yan Accelrys Inc., 9685 Scranton Road, San Diego, CA 92121, USA [email protected]

We have used GeneAtlas™ to provide functional annotation of proteomic sequence data including structural prediction. GeneAtlas is an automated, high-throughput pipeline for the prediction of protein structure and function using sequence similarity detection, homology modeling, and fold recognition methods. Using template searching, GeneAtlas searches for relationships between query sequences and known protein structures, motifs, and folds. Subsequent inferences about, and assignment of, the target protein’s function are based on its homology to the experimentally derived template protein and on the models generated as part of the pipeline.

Using CASP5 targets as query sequences, we demonstrate that GeneAtlas detects additional relationships, via its high-throughput modeling component, in comparison with the sequence searching method PSI-BLAST only. Furthermore, functionally related proteins with sequence identity below the twilight zone can be recognized correctly.

In addition, some targets were selected to test two new methods that we have developed, ChiRotor and Looper, for side-chain and loop prediction. ChiRotor is a fast algorithm that predicts the conformation of all or part of amino-acid side chains with an average RMSD of about 1Å for the core residues. The loop-modeling program, Looper, produces a number of energy minimized loop backbone conformations ranked according to force-field energy terms. Both algorithms are a combination of a discrete search in dihedral angle space and CHARMm energy minimization.

1. Kitson et al. (2002) Functional annotation of proteomic sequences based on consensus of sequence and structural analysis. Briefings in Bioinformatics 3(1), 1-13.

ALAX (P0234) - 39 predictions: 39 3D

Sequence Alignment Method for Automatic Homology Modeling With Low Sequence Identity

Atsushi Hijikata1, Tosiyuki Noguti 2 and Mitiko Go1 1 Division of Biological Science, Graduate School of Science, Nagoya University, 2 Saga Medical School [email protected]

The quality of homology modeling depends on the accuracy of sequence alignment between the target and the template proteins. When the sequence identities are low (below 30%), the indel (insertion/deletion) regions in the target or template sequences increase, and accurate alignment of indels is thus more critical for homology modeling than in pairs with high identity. It was reported that amino acid residues with high solvent accessibility appear more frequently in indel regions than those with low solvent accessibility [1]. We re-analyzed this feature using recently accumulated data and confirmed the previous result. To obtain correct assignment of indels in sequence alignment, we developed a new sequence alignment method using a smaller gap penalty for surface residues and the Position Specific Scoring Matrix (PSSM) of the PSI-BLAST program. We termed the method ALAX (ALignment based on solvent ACCessibility). To evaluate the quality of ALAX, we compared the alignment obtained by ALAX with that obtained by PSI-BLAST [2], taking the superposition of the 3D structures of the target and template as the correct alignment. We show that ALAX is 11% better than the alignment obtained by PSI-BLAST. Furthermore, indel regions of PSI-BLAST alignments often lie in the interior of template 3D structures, whereas such cases rarely occur with ALAX. These results indicate that ALAX is useful for fully automatic homology modeling, particularly when the sequence identity between target and template proteins is low.
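
The idea of a smaller gap penalty for surface residues can be illustrated with a small global alignment. This is only a sketch under stated assumptions, not the ALAX implementation: it uses a simple match/mismatch score in place of a PSSM, linear gap penalties, and a hypothetical relative-accessibility cutoff of 0.25; all function and parameter names are invented for the example.

```python
def align_with_accessibility(query, template, access, match=2, mismatch=-1,
                             gap_buried=-4, gap_exposed=-1, cutoff=0.25):
    """Global alignment in which a gap next to an exposed template residue
    is cheaper than a gap next to a buried one.

    access[j] is the relative solvent accessibility of template residue j.
    All scores and the cutoff are illustrative assumptions.
    """
    def gap(j):
        # Penalty is chosen from the template residue nearest to the gap.
        j = min(max(j, 0), len(template) - 1)
        return gap_exposed if access[j] > cutoff else gap_buried

    n, m = len(query), len(template)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    move = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], move[i][0] = score[i - 1][0] + gap(0), "up"
    for j in range(1, m + 1):
        score[0][j], move[0][j] = score[0][j - 1] + gap(j - 1), "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == template[j - 1] else mismatch
            score[i][j], move[i][j] = max(
                (score[i - 1][j - 1] + s, "diag"),
                (score[i - 1][j] + gap(j - 1), "up"),    # gap in the template
                (score[i][j - 1] + gap(j - 1), "left"),  # gap in the query
                key=lambda option: option[0],
            )

    # Trace back from the bottom-right corner to recover the alignment.
    q_aln, t_aln, i, j = [], [], n, m
    while i > 0 or j > 0:
        step = move[i][j]
        if step == "diag":
            i, j = i - 1, j - 1
            q_aln.append(query[i])
            t_aln.append(template[j])
        elif step == "up":
            i -= 1
            q_aln.append(query[i])
            t_aln.append("-")
        else:
            j -= 1
            q_aln.append("-")
            t_aln.append(template[j])
    return score[n][m], "".join(reversed(q_aln)), "".join(reversed(t_aln))
```

With the query `ACDE` and the template `ACGDE`, marking the extra glycine as exposed makes the single-gap alignment cheap, whereas marking every residue as buried lowers the score of the same alignment.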

1. Zhu Z.Y. et al. (1992) A variable gap penalty function and feature weights for protein 3-D structure comparisons. Protein Eng. 5(1), 43-51. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.

Aligners (P0064) - 31 predictions: 31 3D

Fold Recognition Using Only Boilerplate Methods of Database Search and Multiple Sequence Alignment

Arcady Mushegian Stowers Institute for Medical Research [email protected]

See methods section

arby-scai (P0183) - 68 predictions: 68 3D

The Arby Automated Structure Prediction Server

Niklas von Öhsen2, Ingolf Sommer1 1 – Max-Planck-Institute for Informatics, 2 – Fraunhofer Institute for Scientific Computing and Algorithms [email protected]

See methods abstract (Ingolf Sommer and Niklas von Öhsen)

BAKER (P0002) - 377 predictions: 377 3D

De Novo Structure Predictions Using Rosetta

P. Bradley1+, J. Meiler1+, K.M.S. Misura1+, W.R. Schief1+, J. Schonbrun1+, W.J. Wedemeyer1+, O. Schueler-Furman1, M. Kuhn1, P. Murphy1, C.E.M. Strauss2, and D. Baker1 1 - University of Washington, 2 - Los Alamos National Laboratory, + - authors contributed equally [email protected]

See methods section

BAKER (P0002) - 377 predictions: 377 3D

Comparative Modeling Using Rosetta

D. Chivian1+, C.A. Rohl1+, C.E.M. Strauss2, P. Murphy1, and D. Baker1 1 - University of Washington, 2 - Los Alamos National Laboratory, + - authors contributed equally [email protected]

See methods section

Biogen (P0440) - 28 predictions: 28 3D

Consensus Scoring Approach to Fold Recognition of CASP5 Targets

A. Lugovskoy, D. Gottlieb, H. van Vlijmen, and J. Singh Structural Informatics Group, Biogen Inc.

[email protected]

To increase the strength of our predictions we used a consensus scoring approach to fold recognition of CASP5 targets. For 30 fold recognition targets we combined the predictions of several algorithms (Discrete State Model (DSM), pattern-embedded DSM, Genefold, Seqfold, and Loopp) [1-4] and selected the template folds found by multiple methods, or by any one of the methods if its score was significantly higher than for other folds. A set of uniform non-redundant fold libraries based on the SCOP v1.59 classification [5] was constructed to ensure equal representation of all structural families. We defined fold recognition targets as single domain molecules that showed no more than 25% sequence identity to molecules in the PDB and yielded hits in the algorithms. For subsequent homology modeling we performed partial manual realignments of the target and the template sequences to maximize the continuity of secondary structure elements. Homology models were built using MODELLER [6], and minimized using CHARMM [7] with harmonic constraints on the backbones. We believe that a consensus scoring approach lowers the rate of false positive hits and increases the confidence in fold recognition solutions.
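
The selection rule described above (a fold found by several methods, or a single-method hit that clearly stands out) can be sketched as follows. The vote threshold, the score margin, the assumption that scores are comparable within a method, and all names are hypothetical; the abstract does not give the actual criteria.

```python
from collections import Counter


def consensus_fold(hits, min_votes=2, margin=2.0):
    """Pick a template fold from per-method ranked hits.

    hits maps a method name to a list of (fold, score) tuples, best first.
    min_votes and margin are illustrative assumptions.
    """
    votes = Counter(ranked[0][0] for ranked in hits.values() if ranked)
    if not votes:
        return None, "none"
    fold, count = votes.most_common(1)[0]
    if count >= min_votes:
        return fold, "consensus"
    # No agreement: accept a top hit only if it clearly beats the runner-up
    # within the same method.
    best = None
    for ranked in hits.values():
        if len(ranked) >= 2:
            lead = ranked[0][1] - ranked[1][1]
            if lead >= margin and (best is None or lead > best[0]):
                best = (lead, ranked[0][0])
    if best is None:
        return None, "none"
    return best[1], "single-method"
```

Comparing the lead within each method, and not raw scores across methods, sidesteps the fact that different threading programs report scores on different scales.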

1. Bienkowska J.R. et al. (2000) Protein fold recognition by total alignment probability. Proteins 40 (3), 451-62. 2. Jaroszewski L et al. (1998) Fold predictions by a hierarchy of sequence, threading and modeling methods. Prot. Sci. 7 (6), 1431-1440. 3. Olszewski K.A. et al. (1999) SeqFold - fully automated fold recognition and modeling software -- validation and application. Theor. Chem. Acc. 11, 57-66. 4. Meller J. and Elber R. (2001) Linear programming optimization and a double statistical filter for protein threading protocols. Proteins 45 (3), 241-261. 5. Murzin A. G. et al. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540. 6. Šali A and Blundell T.L. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779-815. 7. Brooks B.R. et al. (1983) CHARMM: A Program for Macromolecular Energy, Minimization, and Dynamics Calculations, J. Comp. Chem. 4, 187-217.

Bion (P0474) - 63 predictions: 63 SS

Secondary Structure Prediction with Shuffled Training by SPAM

R. Shigeta and J.P. LeFlohic Bion Bioinformatics Consulting [email protected]

This instance of the Structure Prediction Application Metatool (SPAM) uses two sequential neural networks. The first is a 15-75-3 sequence-to-structure network which takes as input the actual residue and a PSIBLAST [2] position specific sequence profile (PSSM). Similar to the JNET architecture [1], a window of output from the first neural network is fed into a 15-55-3 structure-to-structure network. SPAM also feeds a copy of the original residue and the PSSM probability data to the second network.

A non-redundant set of 504 protein sequences and structures from the Protein Data Bank [3] was used as the training set, with a random 114 set aside as a non-trained test set. Proteins without any identified secondary structure were discarded. Upon loading, the sequences are broken into window-length training patterns and shuffled such that the neural network is presented with each class of secondary structure at each training step and with a similar number of examples of each structure.

Training proceeds in epochs until all the errors from the neural networks in the application cease to change more than an epsilon value which must be assigned by hand, between 1e-3 and 1e-5. A training epoch is defined as the presentation of 10,000 patterns, and so the training cycles do not contain exactly the same data.

The final prediction of beta, helix, or coil is selected by choosing the highest of the three outputs for each residue. No weighting is applied to the outputs.

The confidence is calculated in the standard way as the difference between the highest float output and the second highest one. CASP entries were then edited by hand for improbable patterns in secondary structure.
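
The final selection and confidence steps described above amount to an argmax and a difference of the two largest outputs. A minimal sketch, assuming the three network outputs are ordered beta, helix, coil and labeled `E`, `H`, `C` (the ordering and labels are assumptions, not taken from the abstract):

```python
def predict_state(outputs, labels=("E", "H", "C")):
    """Return (label, confidence) for one residue.

    outputs holds the three unweighted network outputs; the confidence is
    the highest output minus the second highest.
    """
    ranked = sorted(zip(outputs, labels), reverse=True)
    (top, label), (second, _) = ranked[0], ranked[1]
    return label, top - second


def predict_sequence(per_residue_outputs):
    """Apply predict_state to every residue of a sequence."""
    states = [predict_state(o) for o in per_residue_outputs]
    return "".join(s for s, _ in states), [c for _, c in states]
```

For example, outputs of `[0.25, 0.75, 0.5]` give a helix call with a confidence of 0.25.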

1. Cuff J. A and Barton G.J (1999) Application of enhanced multiple sequence alignment profiles to improve protein secondary structure prediction, Proteins 40:502-511. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 3. Berman H.M., Westbrook J., Feng Z., Gilliland G., Bhat T.N., Weissig H., Shindyalov I.N., Bourne P.E.: The Protein Data Bank. Nucleic Acids Research, 28 pp. 235-242 (2000)

Braun-Werner (P0024) - 65 predictions: 65 3D

Automated Generation of Property Based Motifs to Search for Functional Neighbors and to Improve Sequence Alignments

Venkatarajan Mathura, Catherine H Schein, Numan Oezguen, Ovidiu Ivanciuc, Yuan Xu and Werner Braun Sealy Center for Structural Biology, Department of Human Biological Chemistry and Genetics, University of Texas Medical Branch, Galveston, TX 77555-1157 [email protected]

We have developed a novel automatic method, based on patterns of conservation of physical-chemical properties (PCPs) of amino acids in aligned protein sequences, to find distantly related proteins with low sequence identity. Conservation of PCPs among sequences of protein families can be conveniently defined in terms of five descriptors, E1 to E5, which represent a large number (237) of different physical-chemical properties [1]. PCP motifs, i.e., contiguous residues that are conserved in E1- E5, are automatically generated by our MASIA Web server [2,3].

The MASIA tool was used to identify 12 motifs, areas of significant sequence conservation, in an alignment of 42 apurinic/apyrimidinic endonucleases (APE's) [4]. APE's are part of the base excision repair pathway to replace damaged sites in DNA resulting from ionizing radiation or oxidation. The sequence motifs contain all the residues previously shown to be essential for APE1 function, but we also detected new motifs distinctive for APEs that are not directly involved in cleavage, but establish protein-DNA interactions 3’ to the abasic site. These additional bonds enhance both specific binding to damaged DNA and the processivity of APE1.

Five of the sequence motifs of the APE family are also structurally conserved in DNase-1 and the IPP family. We call the structural segments corresponding to the sequence motifs "molegos", molecular legos. Correcting the sequence alignment to match the residues at the ends of two of the molegos that are absolutely conserved in each of the three families greatly improved the local structural alignment of APEs, DNase-1 and synaptojanin. The shared molegos have a similar metal and DNA binding function in both APE and DNase-1 [4].

Large-scale data mining for APE motifs in the ASTRAL40 database was then performed using a Bayesian scoring function to identify similar motifs in all proteins of the database. All of the previously identified distantly related members of the DNase-I superfamily scored highly. Other high scoring proteins had no overall sequence or structural similarity to the APEs. However, all were phosphatases and/or had a similar metal ion binding active site [3]. To test the ability of our method to functionally annotate novel protein sequences, the PCP-motif profiles of the APE family were then used to scan the Drosophila genome. We anticipate that our sequence and structural decomposition of APE related proteins from different genomes would help us to understand functional and evolutionary aspects of this protein.

In CASP 5 we tested our method based on physical-chemical property motifs for improving and ranking different alignments. For each target we prepared multiple alignments of the target sequence with similar sequences from other organisms as identified in BLAST/PSIBLAST. Our MASIA program generated motif profiles for each target, which were then used by our program ALIGNSCORER to find high scoring templates and alignments from all fold recognition servers that participated in CAFASP. For some of the targets we combined several alignments with high scoring motifs from different fold recognition servers and from different templates. The highest scoring sequences were then modeled with the distance geometry based modeling suite MPACK.

1. Venkatarajan M.S. and Braun W. (2001) New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. J. Mol. Modeling 7, 445-453. 2. Zhu H., Schein C.H and Braun W. (2000) MASIA: recognition of common patterns and properties in multiple aligned protein sequences. Bioinformatics 16:950-951. 3. Venkatarajan S.M., Schein C.H. and Braun W. (2002) Identifying Property Based Sequence Motifs in Protein Families and Superfamilies: Application to APE. Submitted. 4. Schein C.H., Oezguen N., Izumi T. and Braun W. (2002) Total sequence decomposition distinguishes functional modules, “molegos” in apurinic/apyrimidinic endonucleases, BMC-Bioinformatics (In press).

Burnham (P0516) - 68 predictions: 68 3D

Automated Modeling Pipeline

M.Grotthuss1, L.Knizewski1, P.Szczesny1, L.Jaroszewski2 and A.Godzik1 1 – The Burnham Institute, 2 – JCSG Bioinformatics, UCSD [email protected]

The semi-automated modeling used for the CASP5 predictions was based on the FFAS03 server (see the abstract "FFAS03: automated profile-profile distant homology recognition server applied to fold recognition" by L. Jaroszewski and A. Godzik in the same volume).

Target sequences were submitted to the FFAS03 fold prediction server and the top 20 predictions were analyzed, as explained below. In cases when no high reliability FFAS03[1],[2] predictions were available, predictions and alignments from other servers included in the Metaserver[3] were included as well.

Top predictions from the FFAS server (or from all servers in the Metaserver) were clustered by the structural similarity of the predicted templates, based on the SCOP classification. For each cluster, all PDB structures with the same SCOP superfamily classification were also aligned with the target. All alignments were then compared and analyzed for consistency. The most consistent core of the alignment was used for modeling with the program NEST[4] from the Jackal package. Loops were added by the loop building procedure LOOPY[5] from the Jackal package. In most cases, several alternative alignments were explored.

The entire group of models based on alternative alignments with each superfamily of potential templates was then evaluated using several energy-based methods. The PSQS[6] server was used to calculate the average energy of the models. Models were also analyzed to check whether the functions predicted from homology, database annotation and genomic context analysis were compatible. The final model was accepted based on a jury system, in which various criteria (energy, agreement with function prediction, completeness of the model, etc.) were weighted into a single scoring system.
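
A jury system of this kind can be pictured as a weighted sum over per-model criteria. The sketch below is purely illustrative: the criterion names, the weights, and the assumption that every criterion is already scaled to 0..1 with higher meaning better are all hypothetical and not taken from the abstract.

```python
def jury_score(criteria, weights):
    """Weighted sum of criteria for one model (all values assumed in 0..1)."""
    return sum(weights[name] * criteria[name] for name in weights)


def pick_model(models, weights):
    """Return the name of the model with the highest jury score."""
    return max(models, key=lambda name: jury_score(models[name], weights))
```

In practice an energy term would first have to be converted to such a normalized, higher-is-better scale before it could be combined with the other criteria.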

1. Rychlewski L., Jaroszewski Ł., Li W. & Godzik A. (2000) Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science 9, 232-241 2. Jaroszewski Ł., Rychlewski L. & Godzik A. (2000). Improving the quality of twilight-zone alignments. Protein Science 9, 1487-1496 3. Bujnicki J.M., Elofsson A., Fischer D., Rychlewski L. Structure prediction meta server. Bioinformatics. 2001 Aug;17(8):750-1 4. Xiang Z.; Honig B. Homology model building with artificial evolution. (in preparation). 5. Xiang Z; Honig B. Evaluating configurational free energies: the colony energy concept and its application to the problem of protein loop prediction. Proc. Natl. Acad. Sci. USA 99:7432-7437. 6. Protein Sequence Quality Score – model evaluation server, http://www.jcsg.org/psqs/.

Bystroff (P0131) - 132 predictions: 45 3D, 40 SS, 45 RR, 2 DR

Contact Map Threading Using HMMSTR

Y. Shao and C. Bystroff Department of Biology, Rensselaer Polytechnic Institute [email protected], [email protected]

See methods section

Camacho-Carlos (P0098) - 46 predictions: 46 3D

Automated Consensus Method of Alignment for Confident Comparative Modeling

Jahnavi C. Prasad, Sandor Vajda, Carlos J. Camacho Bioinformatics Program, Boston University, Boston, MA 02215 [email protected]

We have developed an algorithm that consistently gives a high quality alignment for comparative modeling, and identifies the regions of this alignment that are reliable and structurally similar between the template and target. To find a consistent way of obtaining an accurate alignment, ten popular alignment methods were tested against a set of 79 pairs of homologous proteins for alignment accuracy in the context of comparative modeling. The top five performing methods were selected, and a method for generating a consensus by combining the alignments from these five methods was subsequently developed. Building on the strength of the consensus alignment, we have identified a set of criteria that remove alignment zones corresponding to structurally dissimilar regions and poor alignment reliability. When applied to an independent set of 49 homologous protein structure pairs, the average RMS deviation of the structures obtained with this consensus-based alignment is on the order of 2.5 Å, while the length of the alignment is about 80% of that found by standard structural superposition methods. While 20-40% of the alignments from the selected top five methods would yield predicted structures with RMS deviations of 6 Å or more from the native structure, there were no such cases at all with our method. In our tests, the method performs consistently over a range of target-template sequence identity spanning 5-30%. The algorithm is currently available as a server at http://structure.bu.edu/cgi-bin/consensus.cgi
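
One simple way to picture a consensus of five alignments is a majority vote over aligned residue pairs, followed by a filter that drops short isolated zones. This is a sketch under assumptions and not the published procedure: the support threshold of three methods, the minimum zone length, and the function names are all hypothetical, and the abstract does not state the actual criteria.

```python
from collections import Counter


def consensus_pairs(alignments, min_support=3):
    """Keep (target_index, template_index) pairs found by at least
    min_support of the input alignments."""
    counts = Counter(pair for aln in alignments for pair in set(aln))
    return sorted(pair for pair, c in counts.items() if c >= min_support)


def reliable_zones(pairs, min_len=3):
    """Group pairs into gap-free diagonal runs and drop runs shorter
    than min_len."""
    zones, run = [], []
    for pair in sorted(pairs):
        if run and pair[0] == run[-1][0] + 1 and pair[1] == run[-1][1] + 1:
            run.append(pair)
        else:
            if len(run) >= min_len:
                zones.append(run)
            run = [pair]
    if len(run) >= min_len:
        zones.append(run)
    return zones
```

With five methods and a support threshold of three, no target residue can be assigned to two different template residues, because two conflicting pairs cannot both be backed by a majority.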

1. Prasad J.C., Comeau S.R., Vajda S., Camacho C.J. Confident Homology Modeling Based On Consensus Alignment. Submitted for publication.

CaspIta (P0108) - 133 predictions: 70 3D, 63 SS

Fast Loop Modeling of Insertions and Deletions with Integrated Side Chain Placement and Energy Minimization

S. C. E. Tosatto1, F. Fogolari2 , A. Cestaro1 and G. Valle1 1 - CRIBI Biotechnology Centre, Universita' di Padova 2 - Science and Technology Dept., Universita' di Verona [email protected]

An extended protocol of the fast divide and conquer loop modelling method of Tosatto et al. [1] is used as a basis to construct insertions and deletions in models built by homology. The initial target to template alignment is modified to optimize the distances between the flanking regions of insertions and deletions. For insertions, the flanking regions are preferably distant and exposed to the solvent. Deletions are shifted to minimize the distance between the flanking regions. In both cases a number of residues on either flank are selected to be modelled together with the insertion or deletion. Regular secondary structure elements were generally chosen as boundaries for the loops to be modelled.

Segments of the amino acid backbone chosen in this way are first generated and ranked using the divide and conquer method [1]. This method uses a series of artificial fragment databases generated from a Ramachandran plot distribution of (phi,psi) torsion angles found in loops to generate different loop conformations. Conformations showing strong steric clashes or amino acids in disallowed regions of the Ramachandran map (e.g. Proline) are eliminated. The remaining conformations are ranked according to a combination of geometric fit to the flanking regions and a knowledge-based potential.

Each of the top twenty solutions is then subjected to the following steps. Side chains are placed for the entire protein using SCWRL [2] to account for changes in side chain rotamers induced by different loop conformations. The CHARMM force field [3] without electrostatics is then used to minimize the loop in the context of the protein. One hundred steps of steepest descent and five hundred steps of conjugate gradient minimization are performed to relax the initial model. The final models are ranked according to the CHARMM energy, and the one with the lowest energy is selected as the most probable loop conformation.

1. Tosatto S.C.E. et al. (2002) A divide and conquer approach to fast loop modeling. Protein Eng. 15(4), 279-286. 2. Bower M.J. et al. (1997). Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: A new homology modeling tool. J. Mol. Biol. 267, 1268-1282. 3. MacKerell J.A.D. et al. (1998) All-hydrogen empirical potential for molecular modeling and dynamics studies of proteins using the CHARMM22 force field. J. Phys. Chem. B 102, 3586-3616.

CBC-FOLD (P0008) - 151 predictions: 151 3D

What’s so Good About Real Proteins?

Ajay K. Royyuru, Ruhong Zhou, Prasanna Athma, B. David Silverman, Gelonia Dent and Rosalia Tungaraza Computational Biology Center, IBM Thomas J. Watson Research Center, Yorktown Heights, NY 10598, USA [email protected]

See methods section

CHIMERA (P0153) - 94 predictions: 94 3D

Comparative Modeling Using CHIMERA Modeling System

Mayuko Takeda-Shitaka, Chieko Chiba, Hirokazu Tanaka, Daisuke Takaya and Hideaki Umeyama

Kitasato University [email protected]

See methods section

CHIMERAX (P0170) - 74 predictions: 74 3D

Full Length Protein Modeling Using CHIMERA eXtending Procedure

Genki Terashi, Ryota Yamatsu, Youji Kurihara, Mayuko Takeda-Shitaka, Mitsuo Iwadate and Hideaki Umeyama Kitasato University [email protected]

See methods section

CIRB (P0397) - 263 predictions: 200 3D, 63 RR

Detecting High Quality Profile-Profile Alignments Using Shannon Entropy.

E. Capriotti1,3, P. Fariselli2, I. Rossi2,3, and R. Casadio2 1 - Dept. of Physics/CIRB, University of Bologna, Italy, 2 - Dept. of Biology/CIRB, University of Bologna, Italy, 3 - BioDec srl, Bologna, Italy [email protected], [email protected]

We analyze the quality of the alignments generated by the profile-profile comparison algorithm known as BASIC [1] and compare the results with those obtained with a structural alignment code. From this comparison we find that a Shannon entropy value > 0.5 gives a sequence-to-sequence alignment of the target/template couple comparable to that obtained with the structural alignment performed with CE.

In our fold recognition/threading code Tangram, the BASIC profile-profile alignment is implemented as follows:

1) The composition profiles $P_A$ and $P_B$ for the target and template are generated by multiple alignment of the sequences obtained from a three-iteration PSI-BLAST [2] search on the Non-Redundant database (the inclusion threshold is $E = 10^{-3}$).
2) The dot matrix $D$ for the profile comparison of two protein sequences, $D = P_A\, S\, P_B^{T}$ (with $S$ = BLOSUM62 [3] substitution matrix), is computed using linear algebra routines.
3) The $D$ matrix is searched for high-scoring alignments by means of the local Smith-Waterman dynamic programming algorithm [4].
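
The three steps can be illustrated on a toy scale. The sketch below is not the Tangram code: it computes the dot matrix as the product of profile A, a substitution matrix, and the transpose of profile B in plain Python, then scores it with a local Smith-Waterman recursion using a linear gap penalty. The two-letter alphabet, the small substitution matrix standing in for BLOSUM62, and the gap penalty of 1.0 are assumptions chosen to keep the example short.

```python
def dot_matrix(profile_a, subst, profile_b):
    """D[i][j] = sum over a, b of PA[i][a] * S[a][b] * PB[j][b]."""
    size = len(subst)
    pa_s = [[sum(row[a] * subst[a][b] for a in range(size)) for b in range(size)]
            for row in profile_a]
    return [[sum(x * y for x, y in zip(pa_s_row, pb_row)) for pb_row in profile_b]
            for pa_s_row in pa_s]


def smith_waterman_score(dot, gap=1.0):
    """Best local alignment score over the dot matrix (linear gap penalty)."""
    rows, cols = len(dot), len(dot[0])
    h = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            h[i][j] = max(0.0,
                          h[i - 1][j - 1] + dot[i - 1][j - 1],  # align i with j
                          h[i - 1][j] - gap,                    # gap in profile B
                          h[i][j - 1] - gap)                    # gap in profile A
            best = max(best, h[i][j])
    return best
```

Each profile is a list of per-position frequency vectors, so a 20-letter alphabet with the real BLOSUM62 matrix would use the same functions unchanged.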

The test set used for the evaluation is composed of 185 template/target couples of PDB structures that share the same SCOP label but have less than 30% sequence identity.

When the top-scoring alignments for each target protein in the test set are considered, our BASIC implementation detects the full SCOP label for 125 couples (68%) and generates 114 (62%) alignments with a MaxSub [5] score >=1.

Interestingly, it is found that nearly all of the high-quality alignments share a common feature: the average Shannon entropy for the profile sections aligned together is greater than 0.5 for both the template and the target.
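
The entropy criterion can be sketched as follows. The abstract does not state the logarithm base or any normalization, so the use of base-2 logarithms (entropy in bits) is an assumption, as are the function names; only the 0.5 threshold applied to both target and template comes from the text.

```python
import math


def column_entropy(column):
    """Shannon entropy (in bits, by assumption) of one profile column."""
    return -sum(p * math.log2(p) for p in column if p > 0)


def mean_entropy(columns):
    """Average entropy over the profile columns that take part in an alignment."""
    return sum(column_entropy(c) for c in columns) / len(columns)


def passes_entropy_filter(target_columns, template_columns, threshold=0.5):
    """True when both aligned profile sections exceed the entropy threshold."""
    return (mean_entropy(target_columns) > threshold
            and mean_entropy(template_columns) > threshold)
```

A column in which a single residue type has frequency 1 has zero entropy, so profile sections built from very few or near-identical sequences would tend to fail this filter.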

If only the top-scoring alignments for which this condition holds are considered, a subset of 119 alignments is selected. For 116 of them (97%) the full SCOP label can be assigned to the target, while 108 (91%) get a nonzero MaxSub score, with an average MaxSub score of 4.6 on the subset. On the same 119 couples, the structural alignment program CE [6] computes a nonzero MaxSub score for 116 of them, with an average of 5.7 points.

These results indicate that the Shannon entropy value can be used to discriminate a subset of sequence profile-profile alignments of quality comparable to that obtained by means of a structural alignment program.

1. Rychlewski L. et al. (1998) Fold and function predictions for Mycoplasma genitalium proteins. Fold. Des. 3, 229-238 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 3. Henikoff, S. et al. (1998). Superior performance in protein homology detection with the BLOCKS database server. Nucleic Acids Res. 26, 309-312. 4. Smith T. F. and Waterman M. S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197 5. Siew N. et al (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics 16(9) 776-785. 6. Shindyalov I. N. and Bourne P. E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path Prot. Eng. 11(9) 739-74

DelCLAB (P0050) - 310 predictions: 310 3D

Protein Folding Prediction by Spectral Analysis Methods

Carlos A. Del Carpio-Muñoz, Hideto Shirasawa, and Kensuke Hagino Lab. for BioInformatics. Dept. of Ecological Eng. Toyohashi University of Technology. Tempaku. Toyohashi. 441-8580 [email protected]

We present a novel technique for protein fold recognition that applies a well-known front-end processing technique from robust automatic speech recognition (ASR). This analysis-synthesis technique is based on the transformation of a signal into its cepstrum, which is a measure of the periodic wiggliness of a frequency response plot. The cepstrum is calculated from the logarithm of the power spectrum of a signal, which here is the primary structure of a protein expressed through the physicochemical characteristics of its constituent amino acids. This leads to a logarithmic periodogram for which the spectral envelope is obtained as a smooth curve connecting the main local peaks of the fine structure of the frequency spectrum [1-2].

Applied to the profile of physicochemical features of the amino acid sequence, the technique extracts information in the form of the spectral envelope, which is used to model the relationship between the primary and tertiary structures of a protein.
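As an illustration of the signal processing involved, the following Python sketch computes a real cepstrum and a cepstrally smoothed spectral envelope for a sequence. It is not the authors' implementation. It assumes the standard definition of the real cepstrum (inverse Fourier transform of the log power spectrum), uses the Kyte-Doolittle hydropathy scale as one example of a physicochemical property, and obtains the envelope by keeping only low-quefrency coefficients, which is a common smoothing choice rather than the peak-connecting procedure described above. The example sequence is arbitrary.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values, used here only as an example property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def dft(values, inverse=False):
    """Plain O(n^2) discrete Fourier transform, adequate for short sequences."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t, v in enumerate(values))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def log_power_spectrum(signal, eps=1e-12):
    """Logarithm of the power spectrum; eps guards against log(0)."""
    return [math.log(abs(c) ** 2 + eps) for c in dft(signal)]

def real_cepstrum(signal):
    """Inverse transform of the log power spectrum."""
    return [c.real for c in dft(log_power_spectrum(signal), inverse=True)]

def spectral_envelope(signal, n_coeffs):
    """Smooth the log power spectrum by keeping only low-quefrency coefficients."""
    n = len(signal)
    cep = real_cepstrum(signal)
    # Keep coefficients symmetrically so that the envelope stays real-valued.
    kept = [c if (t < n_coeffs or t > n - n_coeffs) else 0.0
            for t, c in enumerate(cep)]
    return [c.real for c in dft(kept)]

sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary example sequence
signal = [HYDROPATHY[aa] for aa in sequence]
envelope = spectral_envelope(signal, n_coeffs=4)
```

Smaller values of `n_coeffs` give a smoother envelope; keeping all coefficients reproduces the log power spectrum exactly.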

Spectra are aligned using dynamic programming algorithms, and amino acid sequences are expressed using a set of dominant physicochemical parameters that are able to model a super-family in the SCOP database [3].

The fold recognition technique is complemented with an analysis of the secondary structure of the target, aligning it with the secondary structure obtained as consensus of several secondary structure predicting methods.

The threading of the target sequence on the most plausible template is performed by a genetic algorithm whose penalty function is the deviation introduced by cutting and inserting fragments of structure into the template structure. The methodology presented here introduces several new concepts which can be directly related to the function of the molecule. Thus, in recognizing a particular fold by identifying a characteristic spectrum representing the entire super-family to which the target may belong, the methodology works under the assumption that the function, and not only the structure, has been encoded as a spectrum (since the sequence homology considered here lies approximately in the twilight zone), from which not only structural homology but also protein function may be read.

1. Del Carpio C.A. and Yoshimori A. Fully Automated Protein Tertiary Structure Prediction Using Fourier Transform Spectral Methods. Protein Structure Prediction: Bioinformatic Approach. Edited by: Igor Tsigelny. University of California. International University Line Inc. 173-197 (2002). 2. Del Carpio-Muñoz C.A. Folding Pattern Recognition in Proteins Using Spectral Analysis Methods. Genome Informatics. In Press (2002). 3. SCOP( http://scop.mrc-lmb.cam.ac.uk/scop/ )

Dunbrack (P0329) - 46 predictions: 46 3D

New Algorithms for Loop And Side-Chain Prediction

A. A. Canutescu, A. A. Shelenkov, and R. L. Dunbrack, Jr. Fox Chase Cancer Center, Philadelphia PA USA [email protected]

We present two new algorithms that can be used in comparative modeling of protein structures. The first is a new method to solve the “loop closure problem”. In many methods of loop prediction, random loop conformations are generated and must be adjusted to connect the N- and C-terminal anchors in the secondary structures neighboring the loop to be predicted. Current algorithms based on calculating the Jacobian, such as “random tweak” [1], are slow and sometimes do not converge. They also require a matrix inversion, which may sometimes lead to singularities. One algorithm used in robotics that is flexible in allowing constraints to be placed at each step, easy to program, conceptually simple and elegant, and computationally fast is “cyclic coordinate descent” (CCD). This algorithm was originally developed by Li-Chun Tommy Wang et al. in 1991 [2] as an improved method for solving inverse kinematics problems in robotics. It involves adjusting one degree of freedom at a time to move the end effector toward the target object. This results in one equation in one unknown for each degree of freedom, and hence is analytically very simple and computationally fast. The method is free of singularities and does not involve matrix inversion. It proceeds in iterative fashion along a chain of degrees of freedom, modifying each joint so that the end effector gets as close as possible to the desired position. The equations provide both an optimum setting for the variable and the first and second derivatives of the change at the current position, so that small increments can be made in preference to large changes, if desired. Given that the calculation of a parameter in one joint does not depend on parameters of the other joints, one can also place constraints on any degree of freedom, choosing to restrict its allowed values or to place a probability distribution on it.

We show that CCD can close loops from nearly any starting configuration as long as the chain is long enough to reach from the N-terminal anchor residue to the C-terminal anchor residue. In tests of over 250,000 random conformations, CCD was able to close 99.95% of them. It fails only on a few very short, extended loop conformations. In this case, a Monte Carlo step that moves the end effector away from the anchor can be implemented. We have also explored the use of Ramachandran probability maps as constraints in the CCD closure procedure, and show that they do not affect the success rate of loop closure by CCD.
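The one-joint-at-a-time idea behind CCD can be shown with a minimal Python sketch for a planar chain of rigid links. This is a simplification for illustration, not the loop closure code described above: it works in two dimensions, moves a single end point rather than a set of anchor atoms, and applies no Ramachandran or other constraints. The function names and the example chain are hypothetical.

```python
import math

def forward(angles, lengths):
    """Joint positions of a planar chain; each angle is relative to the previous link."""
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for a, length in zip(angles, lengths):
        heading += a
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        points.append((x, y))
    return points

def ccd(angles, lengths, target, tol=1e-3, max_sweeps=1000):
    """Cyclic coordinate descent: adjust one joint at a time, from last to first.

    For each joint, the chain downstream of it is rotated so that the end
    point lies on the line from the joint to the target, which is the
    rotation that minimizes the end-to-target distance for that joint alone.
    Returns the final angles and a flag telling whether the target was reached.
    """
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in reversed(range(len(angles))):
            points = forward(angles, lengths)
            pivot, end = points[i], points[-1]
            to_end = math.atan2(end[1] - pivot[1], end[0] - pivot[0])
            to_target = math.atan2(target[1] - pivot[1], target[0] - pivot[0])
            angles[i] += to_target - to_end
        end = forward(angles, lengths)[-1]
        if math.hypot(end[0] - target[0], end[1] - target[1]) < tol:
            return angles, True
    return angles, False

# Five unit-length links, target well within reach of the chain.
closed_angles, closed = ccd([0.3] * 5, [1.0] * 5, target=(2.0, 1.5))
```

Each update is a single closed-form rotation, so no Jacobian or matrix inversion is needed. A per-joint constraint could be added by clamping or re-sampling `angles[i]` immediately after the update.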

We also present a new algorithm for our side-chain prediction program SCWRL [3]. SCWRL relies on the fact that sidechains in proteins are "sparsely connected." If we represent the residues of a protein as the vertices in a graph and an edge between two residues as a potential steric clash for some pair of rotamers, then the number of edges per residue is much smaller than the number of vertices (residues) in the graph. In practice, this graph is rarely connected. It will consist of several clusters of interacting residues with no connecting edges between the clusters. If the clusters are not large, they are searched combinatorially in a branch-and-bound procedure. Frequently, the clusters become too large to handle combinatorially. In this case, SCWRL searches for a single residue (the "keystone") whose removal from the cluster will break up the cluster into two graphs which are not interconnected. If such a residue can be found, then each subgraph can be solved once for each rotamer of the "keystone residue."

We propose a new algorithm for SCWRL that breaks up clusters of interacting sidechains into the biconnected components of an undirected graph. Biconnected graphs are those that cannot be broken apart by removal of a single vertex. In practice, residue clusters in proteins are broken up entirely into clusters of size less than 10. These clusters can be solved in a second or less. By contrast, in the "old" SCWRL, clusters of size 15-20 occur in some proteins (and more frequently in homology modeling situations), and these can take minutes and occasionally hours to solve.
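The graph decomposition can be sketched in Python with the standard depth-first search algorithm for biconnected components. This illustrates the graph-theoretic step only, not SCWRL itself: the residue interaction graph is assumed to be given as a dictionary mapping each residue to the set of residues it may clash with, and the residue labels in the example are invented.

```python
def biconnected_components(graph):
    """Split an undirected graph into biconnected components.

    graph maps each vertex to the set of its neighbors (no self-loops or
    duplicate edges). Returns (components, articulation_points), where each
    component is a set of vertices and an articulation point is a vertex
    whose removal disconnects its cluster (a "keystone" in the text).
    Vertices without any edge do not appear in a component.
    """
    index, low = {}, {}
    edge_stack, components, articulation = [], [], set()
    counter = [0]

    def dfs(v, parent):
        index[v] = low[v] = counter[0]
        counter[0] += 1
        children = 0
        for w in graph[v]:
            if w == parent:
                continue
            if w not in index:
                edge_stack.append((v, w))
                children += 1
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] >= index[v]:
                    # v separates the subtree rooted at w from the rest.
                    if parent is not None or children > 1:
                        articulation.add(v)
                    component = set()
                    while True:
                        edge = edge_stack.pop()
                        component.update(edge)
                        if edge == (v, w):
                            break
                    components.append(component)
            elif index[w] < index[v]:
                # Back edge to an ancestor.
                edge_stack.append((v, w))
                low[v] = min(low[v], index[w])

    for v in graph:
        if v not in index:
            dfs(v, None)
    return components, articulation

# Two triangles of mutually clashing residues that share residue "K".
clashes = {
    "A": {"B", "K"}, "B": {"A", "K"},
    "C": {"D", "K"}, "D": {"C", "K"},
    "K": {"A", "B", "C", "D"},
}
parts, keystones = biconnected_components(clashes)
```

In this toy graph the shared residue is the only articulation point, and each triangle forms one component that could be solved separately for each rotamer of the shared residue. The recursive search is fine for clusters of the sizes mentioned in the text; very large graphs would need an iterative version.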

1. Shenkin P.S. et al. (1987) Predicting antibody hypervariable loop conformation. I. Ensembles of random conformations for ringlike structures. Biopolymers 26 (12), 2053-2085. 2. Wang L. T. and Chen C. C. A combined optimization method for solving the inverse kinematics problem of mechanical manipulators. IEEE Trans. Robotics and Automation. 7 (4), 489-499. 3. Bower M.J., Cohen F.E., and Dunbrack R.L. Jr. Prediction of protein side-chain rotamers from a backbone-dependent rotamer library: a new homology modeling tool. J. Mol. Biol. 267 (5), 1268-1282.

Dunbrack (P0329) - 46 predictions: 46 3D

Comparative Modeling of CASP5 Targets

G. Wang and R. L. Dunbrack, Jr. Fox Chase Cancer Center, Philadelphia PA USA [email protected]

We have developed two new scoring mechanisms for profile-profile alignments. The first is a Dirichlet mixture substitution matrix (DIMSUM) analogous to ordinary amino acid substitution matrices, but in which the scores represent probabilities of substituting profile columns for one another. The columns in the profiles are represented as components of a Dirichlet mixture developed from multiple sequence alignments and structural characteristics (secondary structure and surface exposure). The DIMSUM matrices were developed from structure alignments of homologous proteins using the CE program [1] in a manner similar to the BLOSUM matrices [2]. The profile-profile alignments are performed with a standard local-alignment dynamic programming algorithm.

The second scoring method is a combination of an amino acid substitution matrix and a matrix that represents the probability of predicted secondary structure in one profile (the CASP target) aligning to known secondary structure in the PDB entry. This matrix (SSAAC) was also developed from structure alignments by determining the substitution rates of predicted secondary structure in one protein in each structural alignment versus known secondary structure in the other protein. We combined both DIMSUM and SSAAC with a structure-derived amino acid substitution matrix (SDM) [3], applied to the two profile columns, such that the score is $\sum_{i,j} p_i q_j S_{ij}$, where $p_i$ and $q_j$ are the probabilities of amino acid types i and j in the two columns and $S_{ij}$ is the corresponding element of the substitution matrix. We use a gap penalty scheme that is dependent on the evolutionary distance of the two profiles. The scoring schemes were optimized at 50% SDM/50% DIMSUM for the DIMSUM method and 65% SDM/35% SSAAC for the SSAAC method.
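The column-column term and the weighted combination can be written out as a short Python sketch. It is meant only to make the arithmetic concrete: the two-letter alphabet and matrix values are invented, the DIMSUM or SSAAC contribution is passed in as an already computed number, and treating the quoted percentages as linear weights is an assumption of this sketch.

```python
def column_score(p, q, S):
    """Expected substitution score of two profile columns: sum_ij p_i * q_j * S_ij."""
    return sum(p[i] * q[j] * S[i][j]
               for i in range(len(p))
               for j in range(len(q)))

def combined_score(p, q, S_sdm, other_score, w_sdm):
    """Weighted mix of the SDM column score and a second score term.

    other_score stands for the DIMSUM or SSAAC contribution at this pair of
    columns; w_sdm would be 0.5 or 0.65 for the two schemes quoted in the text.
    """
    return w_sdm * column_score(p, q, S_sdm) + (1.0 - w_sdm) * other_score

# Toy two-letter alphabet with an invented substitution matrix.
S_toy = [[2.0, -1.0],
         [-1.0, 3.0]]
p_col = [0.5, 0.5]
q_col = [0.5, 0.5]
mixed = combined_score(p_col, q_col, S_toy, other_score=1.0, w_sdm=0.65)
```

With real profiles the columns would have 20 entries each and the matrix would be 20 by 20; the computation is otherwise unchanged.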

We show that both methods compare very well with other profile-profile alignment schemes published by other groups in terms of alignment accuracy and search sensitivity.

1. Shindyalov I.N. and Bourne P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 11 (9), 739-747. 2. Henikoff S. and Henikoff J.G. (1993) Performance evaluation of amino acid substitution matrices. Proteins 17 (1), 49-61. 3. Prlic A., Domingues F.S. and Sippl M.J. Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng. 13 (8), 545-550.

evolutionaries (P0180) - 99 predictions: 99 3D

A Phylogenomic Approach to Fold Prediction

Kimmen Sjölander1, Emma Hill1, David Konerding1, Steven Brenner1, Andrej Sali2 and Andras Fiser2 1 – UC Berkeley, 2 – Rockefeller University [email protected]

See methods section

GeneSilico.PL-servers-only (P0242) - 68 predictions: 66 3D, 2 SS
Bujnicki-Janusz (P0020) - 215 predictions: 67 3D, 58 SS, 49 RR, 41 DR
GeneSilico (P0517) - 195 predictions: 86 3D, 64 SS, 45 RR

From Automated Models, to Refinement by a Human Expert, to Combination of Alternative Solutions Obtained by Independent Predictors

M. Feder, I. Cymerman, J. Kosinski, J. Sasin, M. Kurowski, J.M. Bujnicki International Institute of Molecular and Cell Biology (IIMCB) in Warsaw. Trojdena 4, 01-109 Warsaw, Poland [email protected]

The results of the last two CASP and CAFASP assessments in the fold-recognition (FR) category revealed that most of the top groups use the fully automated predictions generated by their own servers and/or other CAFASP servers as the starting point for protein model building and refinement. Interestingly, the performance difference between the human experts and computer predictors continues to narrow, which suggests that most of the refinement procedures used by humans can be fully automated. Several attempts have been made to quantify and monitor the “gap” between the quality of automated predictions and the models refined by humans, but to date no comparison has been made on a “case-to-case” basis.

In CASP4, J.M. Bujnicki participated as a member of the BioInfo duumvirate, as well as one of four experts of the CAFASP-consensus group. Within BioInfo, he was responsible for building and refinement of all targets in the HM and FR categories, while in CAFASP-consensus, he participated in identification of the best automated models and in inference of a rational consensus between them. While the unrefined predictions gave CAFASP-consensus an overall ranking of 7, the refined predictions gave BioInfo an even better score. However, it was not always clear whether the improvement stemmed from the refinement or from the application of different criteria for selection of the best automated models by the two groups.

In CASP5, we attempted to assess, in a systematic way, the value added to the automated model by the refinement carried out by a single human expert, as well as the (possible) value of additional input from less experienced predictors. Therefore, the GeneSilico team of the Bioinformatics Laboratory at the International Institute of Molecular and Cell Biology in Warsaw submitted predictions as three independent groups: i) GeneSilico-servers-only (selection of unrefined FR models), ii) Bujnicki-Janusz (models refined by a single experienced predictor), and iii) GeneSilico (consensus obtained after careful evaluation and comparison of models generated independently by all members of the team).

Selection of the best automated model by GeneSilico-servers-only in CASP5 was carried out in a similar manner to that of CAFASP-consensus in CASP4, only in a more disciplined way. Both groups relied on automatic models generated by the CAFASP servers. In CAFASP-consensus, modeling involved (at least in some cases) shifting of insertions and deletions to the surface-exposed regions and limited refinement of loops, without any changes of the target-template alignment in the core regions. On the other hand, the human intervention of GeneSilico-servers-only in CASP5 involved only selection of one of the FR alignments or one of the atomic models generated by HM or ab initio servers. Automated FR models were based on single templates and were preferably submitted in the AL format, without explicit modeling of insertions or deletions, to avoid the distortion of the raw data that automatic homology modeling introduces in some cases, such as disruption of the protein core.

Bujnicki-Janusz used the selected FR models as a starting point to generate refined models. Here, no limits were placed on modification of the original alignment and inclusion of additional templates. In several cases, large parts of the models (>10 aa) were (re)built by hand. The major constraint on the level of refinement was the limited amount of time available for each model, given the large number of targets in CASP5. Five additional members of the GeneSilico team explored alternative ways of refinement. In many cases, alternative models, different from that submitted by Bujnicki-Janusz, could be obtained. After evaluation of all models by knowledge-based potentials, the best model or a hybrid comprising the best fragments of several models was submitted. Bujnicki-Janusz was rather stringent in rejecting uncertain predictions, resulting in submission of models for only a fraction of the CASP5 targets. On the other hand, GeneSilico attempted to submit models for all targets for which at least a plausible prediction could be made. In addition to the comparison of automated and refined models, evaluation of the relative performance of a single expert (Bujnicki-Janusz) and the expert aided by a team of less experienced predictors (GeneSilico) will allow us to assess the influence of the time constraints (person-hours/model) as well as of additional, though somewhat naïve, sampling of the alignment/model space on the quality of the final prediction.

GERLOFF (P0240) - 9 predictions: 9 3D

Incorporation of Constraints Derived from Active/Functional Site Predictions in Protein Tertiary Structure Assembly

R. Schmid, D. C. Soares, Z. A. M. Hussein, B. J. Mitchell, R. S. Hamilton and D. L. Gerloff Biocomputing Research Unit & Structural Biochemistry Group, Institute of Cell and Molecular Biology,University of Edinburgh, UK [email protected]

We submitted tertiary structure predictions for five CASP5 target proteins in order to investigate the potential of knowledge and/or predictions about functional sites in these proteins for being used in combination with established structure prediction methods. The degrees of difficulty assigned to the prediction targets, and the categories in which our predictions are considered, vary - the monomer of T0132 is clearly homologous to its template whereas we could not find any suitable template structures for T0129*. Similarly, the way in which functional site information is used, and its impact on the final model varies slightly from target to target.

Our primary postulates are that:

(a) the interchange between structure and function prediction (or knowledge) leads to improvement at both ends;
(b) formulation/adaptation of systematic fold-specific heuristics and function-specific heuristics is possible, at least for certain folds and functions;
(c) prediction of structure/function can go beyond trying to find re-occurrences of known cases.

While we found little opportunity within the set of CASP5 targets to demonstrate and/or test postulate (b) (CASP4 T0100 was a good example), we attempted to use function prediction/knowledge in all predictions we submitted. Primarily, we used predicted key residues in proteins presumed to function as enzymes to “anchor” threading alignments (in T0130, T0173, and to an extent in T0136 and T0132) so that their arrangement in the model would allow catalysis. We could not find a suitable fold template for T0129 and used the presumed proximity of presumed functional residues to guide the assembly of helices ab initio.

Prediction of key residues from multiple sequence alignments was generally based on complete, or high, conservation of functional type amino acids, sometimes taking into consideration patterns of conservation similar to those described in [1]. The choice of template structures used in our predictions was often influenced by the publicly available CAFASP2 predictions by automated servers, albeit not exclusively. Here again, the compatibility between the folds and biologically sensible arrangements of predicted key residues was our primary criterion in non-obvious cases. Secondary structure predictions by CAFASP2 servers were used by default but often refined according to [1] and in the course of modeling.

On our poster, we discuss the value of our predictions in light of the experimental structures. Besides the structural discussion, it will be particularly interesting to reassess the speculative functional roles of individual predicted key residues that we attempted to assign in most of our submissions. These blind predictions of functional aspects are influenced by the structure predictions as much as vice versa.

While the “manual component” in our CASP predictions is obviously significant, our goal is to identify systematic aspects in the way biochemists’ knowledge influences (and quite often improves) tertiary structure predictions, with the aim of providing “refinement modules” for existing automated methods. Besides functional site assembly, consideration of the usually observed pseudo-symmetry in protein quaternary structures is under-explored in our field, and we believe that the prediction of (non-transient) quaternary structure besides tertiary structure would be a highly relevant addition to future CASPs. Interesting quaternary structure cases in the targets we considered were T0132 and T0136. Again, further efforts in this direction could benefit both tertiary and quaternary structure prediction.

(* Please note that this abstract was written before we had the chance to see the experimental structures in order to comply with the deadlines.)

1. Benner S.A., Cannarozzi G.M., Gerloff D., Turcotte M. and Chelvanayagam G. (1997) Bona fide predictions of protein secondary structure using transparent analyses of multiple sequence alignments. Chem. Reviews 97, 2725-2843

Ho-Kai-Ming (P0437) - 129 predictions: 129 3D

Three Dimensional Threading Approach to Protein Structure Recognition

Kai-Ming Ho, Haibo Cao, Yungok Ihm, Zhong Gao, Cai-Zhuang Wang and Drena Dobbs Iowa State University [email protected]

See methods section

HOGUE-SLRI (P0267) - 254 predictions: 254 3D

Semi-Automated Homology Modeling of 38 CASP5 Targets Using a Modified TRADES Algorithm

M. Dumontier1,2, H. J. Feldman1,2 and C. W. V. Hogue1,2 1 - Samuel Lunenfeld Research Institute, 600 University Ave. Toronto, Ontario, Canada M5G 1X5, 2 - Department of Biochemistry, University of Toronto [email protected]

Homology modeling is a powerful method for predicting the three-dimensional structure of biological macromolecules from their primary sequence given even weak sequence similarity to a biomolecule with an experimentally determined structure. High-quality models can provide important information regarding the function and mechanism of a biomolecule and could be used for rationalizing experimental data or guiding the design of new experiments. Here, we present a modified version of the TRADES algorithm [1], used in the blind prediction of 38 protein structures from sequence for the Critical Assessment of techniques for protein Structure Prediction (CASP) competition.

Template protein structures for homology modeling of CASP targets were identified using BLAST against the protein structure database (PDB) and the conserved domain database (CDD). Templates with significant sequence similarity across the longest segment with the fewest indels and closest functional annotation were favorably considered. In the case of multi-domain proteins, the best hit for each domain was used as template. Where possible, alignments were modified to ensure that indels fell on loop regions rather than across elements of secondary structure.

Next, a new target trajectory distribution was built from the template backbone Cα trajectory using a modification of the TRADES algorithm. A slightly flexible single fragment from the recorded trace replaced each structurally conserved (gapless) region of alignment. Gap-spanning fragments for variable regions were created from 'takeoff angles' starting from one residue prior to the gap and ending one residue following the gap. These fragments consisted of six degrees of freedom - the distance between the start and end of the gap, two virtual Cα angles and three virtual Cα dihedrals. Three atoms from each side of the gap were placed in space, according to the takeoff angles. Alpha carbons required to fill the gap were given arbitrary starting co-ordinates within the gap region, and a steepest descent energy minimization consisting of virtual Cα bond length restraints, virtual Cα angle restraints, and a van der Waals term was carried out. The three anchoring atoms on either side of the gap were held fixed during the minimization. Finally, the resulting loop was incorporated as a fragment using its own Cα trace.

Roughly 1000 structures were generated using the fragments obtained from the previous steps and our Foldtraj software, with bump checking disabled. Using a modified version of a statistical residue-based potential [2], which we have termed 'crease energy', the best five structures were chosen. These were then refined with a steepest-descent minimization using the CHARMM EEF1 force field to resolve steric clashes but without significantly changing the structure (typically 1 Å RMSD between the refined and unrefined structures).

The modified TRADES algorithm generates realistic, all-atom protein structure homology models of non-idealized geometry as it incorporates side chains from a backbone dependent rotamer library and produces reasonable bond lengths, bond angles, torsion angles, as well as minimized electrostatics and van der Waals forces. Moreover, this method models loops for insertions and deletions and compensates for missing template atoms.

1. Feldman H.J. and Hogue C.W.V. (2000). A fast method to sample real protein conformational space. Proteins. 39 (2), 112-31. 2. Bryant S.H. and Lawrence C.E. (1993) An empirical energy function for threading protein sequence through the folding motif. Proteins. 16 (1), 92-112.

Huber-Torda (P0351) - 83 predictions: 83 3D

Fold Recognition and Sequence to Structure Alignment: Brobdingnagian Approximations and Lilliputian Success

T. Huber1, B. J.B. Procter2 and A.E. Torda2 1 - Department of Mathematics, The University of Queensland, Australia, 2 - Zentrum fuer Bioinformatik, University of Hamburg, Germany [email protected], [email protected], [email protected]

Protein fold recognition can be regarded as a search for the most compatible known structure with a sequence of interest. Common compatibility measures may be based on “knowledge-based” force fields, sequence profiles or hidden Markov models, but this usually implies simple interaction models and parameters reverse-engineered from collections of known structures. Unfortunately, finding good models and parameters to describe protein properties at low resolution often relies on weak assumptions and gross approximations.

We applied data mining methods, without prior assumptions, to extract the information from a large set of proteins down to a most parsimonious set of protein fragments. We began with a large number of protein fragments in a high dimensional space which was then modeled as a mixture of conditionally independent classes. The result is a collection of sequence and structure probability distributions. These could then be used as a score function for sequence to structure alignments.

Gap penalties were adjusted by a numerical optimization process using a penalty function which measured the structural quality of models. Finally, structures were ranked after including a contribution from a z-score optimized, low resolution quasi-energy function.

Models were built for all sequences ranging from the highly homologous to the totally exotic and the results are assessed in terms of model and fold recognition quality.

jive (P0506) - 37 predictions: 37 3D

JIVE: Protein structure prediction by the assembly of local supersecondary structural motifs

David F. Burke, and Tom L Blundell Department of Biochemistry, University of Cambridge,80 Tennis Court Road, Cambridge, CB2 1GA, United Kingdom [email protected]

In the CASP5 experiment, targets for which the CAFASP3 servers gave low confidence values were selected to be modelled by JIVE.

JIVE predicts the structure of small continuous domains of proteins by the assembly of fragments of local supersecondary motifs. Initially, homologous sequences were identified using PSI-BLAST [1] and secondary structure prediction was performed using PHD [2]. The conformational classes of the supersecondary fragments were predicted using SLOOP [3-5]. SLOOP uses sequence/structure profiles derived from a database of loops clustered on the conformation of the loop and surrounding secondary structures. These fragments were then assembled using a Monte Carlo simulation. Unsuitable models were rejected based on excluded volume and a distance-dependent conditional probability function [6].

The generated structures were then searched against protein structures from both the HOMSTRAD [7] database of homologous families and the CAMPASS [8] database using the program SEA [9]. Potential hits were then analysed further for validation. In all, 17 targets were submitted.

1. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17):3389-402. 2. Rost B. et al. (1994) PHD-an automatic mail server for protein secondary structure prediction. Comput Appl Biosci. 10(1):53-60. 3. Donate L.E. et al. (1996) Conformational analysis and clustering of short and medium size loops connecting regular secondary structures: a database for modeling and prediction. Protein Sci. 5(12):2600-16. 4. Rufino S.D. et al. (1997) Predicting the conformational class of short and medium size loops connecting regular secondary structures: application to comparative modelling. J Mol Biol. 267(2):352-67. 5. Burke D.F. et al. (2001) Improved Loop prediction from sequence alone. Protein Engineering 14(7):473-478. 6. Samudrala R. et al. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J Mol Biol. 275(5):895-916. 7. Mizuguchi K. et al. (1998) HOMSTRAD: a database of protein structure alignments for homologous families. Protein Science 7:2469-2471. 8. Sowdhamini R. et al. (1996) A database of globular protein structural domains: clustering of representative family members into similar folds. Fold Des 1(3):209-20. 9. Rufino S.D. et al. (1994) Structure-based identification and clustering of protein families and superfamilies. J Comput Aided Mol Des 8(1):5-27.

Jones (P0067) - 121 predictions: 68 3D, 53 SS

Assessing the Reliability of Transmembrane Protein Topology Assignments by Homology

M. Pellegrini-Calace and D.T. Jones Bioinformatics Unit, Department of Computer Science University College London, Gower St, WC1E 6BT, London (UK) [email protected]

Membrane proteins make up a wide and important class of biological macromolecules and are interesting targets for medicinal chemistry. Moreover, helical membrane proteins represent 20%-25% of the proteins in a typical genome, and the key role they play in cells makes it crucial to improve the detection of homologous membrane proteins as a quick way to understand their functional features [1].

PSI-BLAST (Position-Specific Iterated BLAST) is one of the most popular and powerful homology search programs currently available and has been shown to be more effective than most other methods in the detection of distantly related globular proteins. However, because unrelated transmembrane (TM) segments are more similar to each other than unrelated globular regions, PSI-BLAST has not shown comparable effectiveness when applied to membrane proteins [1-2].

To minimize the number of false hits obtained in membrane protein homology searches with PSI-BLAST, we performed a systematic optimization of the E-value to determine a restrictive cut-off for the inclusion of proteins in the iterative BLAST search underlying the PSI-BLAST method.

Calculations were performed on a data set built by comparing three membrane protein databases: the MPtopo database (92 α-helical proteins) [3]; the TMPDB database (189 α-helical proteins) [4]; and the membrane protein database from Moeller et al. available at the EBI ftp site (148 non-redundant α-helical membrane proteins) [5]. The databases were compared by default BLAST searches, at an E-value cut-off of 10^-3. Entries from each of the three databases were the query sequences for 2 different runs, in which the 2 remaining databases were searched (the total number of BLAST searches was therefore 6). Topologies of TM helices (TMHs) from the homologues found were analyzed and compared with the topologies of TMHs from the query sequences. The number of TMPDB entries showing at least one homologue with agreeing TMH topology in both of the other two databases was only 48, probably because of the small size of the MPtopo database. Therefore, entries from TMPDB showing at least one homologue with agreeing TMH topology in either the MPtopo or the EBI database were chosen as the benchmark data set (149 proteins: 94 from prokaryotes, 48 from eukaryotes and 7 from viruses).

Two BLAST calculations with default parameters were performed on the data set, scanning the non-redundant sequence database, which also contained the 149 sequences, in both filtered and unfiltered forms. The number of true hits (i.e. hits among the 149 sequences with agreeing topology and sequence identity above 30%) was calculated at each E-value, and the E-value with the lowest percentage of error was chosen as the restrictive cut-off value for PSI-BLAST calculations.
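The cut-off selection described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the study: the hit representation (an E-value paired with a true/false flag) and the definition of error as the fraction of false hits among all hits accepted at a given cut-off are both hypothetical.

```python
def pick_cutoff(hits):
    """Pick the E-value cut-off with the lowest error rate.

    hits: list of (evalue, is_true_hit) tuples (assumed representation).
    Error is assumed to be false hits / accepted hits at each cut-off.
    Returns (cutoff, error_rate).
    """
    best = None
    for cutoff in sorted({evalue for evalue, _ in hits}):
        accepted = [is_true for evalue, is_true in hits if evalue <= cutoff]
        error = accepted.count(False) / len(accepted)
        # Prefer the lower error; on ties keep the looser cut-off,
        # because it retains more true hits.
        if best is None or error < best[1] or (error == best[1] and cutoff > best[0]):
            best = (cutoff, error)
    return best


example_hits = [
    (1e-30, True), (1e-20, True), (1e-14, True),
    (1e-10, False), (1e-5, True), (1e-3, False),
]
best_cutoff, best_error = pick_cutoff(example_hits)
```

With the toy hits above, the loosest cut-off that still admits no false hit is selected.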

Finally, it was shown that the most reliable PSI-BLAST searches for membrane protein query sequences are obtained with the number of iterations set to 2 and an E-value cut-off of 10⁻¹⁴ for inclusion in the profile.
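For readers who want to try these settings, a command line can be assembled as below. This sketch assumes the current NCBI BLAST+ `psiblast` program and its `-num_iterations` and `-inclusion_ethresh` options, not the program version used in the study; the file and database names are placeholders.

```python
def psiblast_command(query_fasta, database, iterations=2, inclusion_evalue=1e-14):
    """Build a BLAST+ psiblast argument list with a restrictive inclusion threshold.

    query_fasta and database are placeholder names supplied by the caller.
    """
    return [
        "psiblast",
        "-query", query_fasta,
        "-db", database,
        "-num_iterations", str(iterations),            # two search rounds
        "-inclusion_ethresh", str(inclusion_evalue),   # profile inclusion cut-off
    ]


command = psiblast_command("query.fasta", "nr")
```

The list can be passed to `subprocess.run` on a machine where BLAST+ and a formatted database are installed.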

1. Hedman M. et al. (2002) Improved detection of homologous membrane proteins by inclusion of information from topology predictions. Protein Sci. 11(3), 652-658. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17), 3389-3402. 3. Jayasinghe S. et al. (2001) MPtopo: A database of membrane protein topology. Protein Sci. 10, 455-458. 4. Ikeda M. et al. (2002) Transmembrane topology prediction methods: a re-assessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topologies. In Silico Biol. 2, 19-33. 5. Moeller S. et al. (2000) A collection of well characterised integral membrane proteins. Bioinformatics 16, 1159-1160.

Jones (P0067) - 121 predictions: 68 3D, 53 SS

A Distributed Pipeline for Structure-based Proteome Annotation Using Grid Technology

L. McGuffin1, S. Sorensen1, C. Orengo2, D. Jones1, J. Cuff3, E. Birney4, A. Robinson4, J. Thornton4, K. Fleming5, A. Mueller5, L. Kelley5, S. Newhouse6, J. Darlington6, M. Sternberg5 1-Department of Computer Science, University College London, 2-Department of Biochemistry, University College London, 3-Sanger Centre, Cambridge, 4-European Bioinformatics Institute, Cambridge, 5-Department of Biological Sciences, Imperial College, London, 6-Department of Computer Science, Imperial College, London [email protected]

In order to benefit from the wealth of information contained in recently sequenced genomes it is essential that we have structure based annotation of the proteins in terms of their 3-D conformations and their functions. This project aims to provide a structure-based annotation of the proteins encoded by the major genomes by linking resources at University College London (UCL), Imperial College London (IC) and the European Bioinformatics Institute (EBI) in Cambridge using Grid technology.

The objectives are: i - to establish local databases with structural and functional annotations, ii - to disseminate our proteome annotation to the biological community via a single web-based distributed annotation [1], iii - to share computing power transparently between sites using GLOBUS, iv - to use the developed system to compare alternative approaches for annotation and thereby identify methodological improvements, v - to establish a pre-prototype at 6 months for demonstration purposes, vi - to provide a working system after two years, and vii - to link to relevant bioinformatics and Grid resources that will be integrated into this project.

At IC, the approach will be to use PSI-BLAST [2] to detect homologies between the proteome and other sequences and protein structures (characterised domains in SCOP [3]). This is followed by fold recognition using 3D-PSSM [4] to recognise remote homologies missed by PSI-BLAST. At UCL, GenTHREADER [5] will directly analyse the sequences to detect both obvious and remote homologies. In addition, sequence motifs encoding the CATH structural domains [6] will be scanned against the proteomes using IMPALA [7], Hidden Markov Models [8] and Gene-3D [9]. To maintain links with the widely-used sequence-based annotation methods, the proteomes will also be scanned against INTERPRO.

The results of the above analyses will identify those regions (i.e. domains) of the proteomes for which there is a functional annotation [10]. For regions with sequence identity to known structures of >30%, three-dimensional models will be constructed using 3D-JIGSAW [11] and the co-ordinate construction package being developed within GenTHREADER.

We intend to disseminate the results of this project to both the biological community interested in proteome annotation and the scientific community interested in Grid technology. Within the organisation of our project, we will explore including other groups with either relevant biological databases or suitably developed Grid technology. Industrial connections will be made through the EBI's Industry Programme, which will allow the work to be presented to bioinformatics representatives from the pharmaceutical and biotech industry.

1. Hubbard T. & Birney E. (2000) Open annotation offers a democratic solution to genome sequencing. Nature 403, 825. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402. 3. Conte L.L. et al. (2000) SCOP: a structural classification of proteins database. Nucleic Acids Res. 28, 257-259. 4. Kelley L.A. et al. (2000) Enhanced genome annotation using structural profiles in the program 3D-PSSM. J. Mol. Biol. 299, 501-522. 5. Jones D.T. (1999) An efficient and reliable protein fold recognition method for genomic sequences. J. Mol. Biol. 287, 797-815. 6. Orengo C.A. et al. (1997) CATH - a hierarchic classification of protein domain structures. Structure 5, 1093-1108. 7. Schaffer A.A. et al. (1999) IMPALA: matching a protein sequence against a collection of PSI-BLAST-constructed position-specific score matrices. Bioinformatics 15, 1000-1011. 8. Hughey R. & Krogh A. (1996) Hidden Markov models for sequence analysis. CABIOS 12, 95-107. 9. Buchan D.W.A. et al. (2002) Gene3D: Structural assignment for whole genes and genomes using the CATH domain structure database. Genome Res. 12, 503-514. 10. Thornton J.M. et al. (2000) From structure to function: Approaches and limitations. Nature Struct. Biol. Suppl: 991-994.

KIAS (P0531) - 479 predictions: 176 3D, 303 SS

Prediction of Protein Tertiary Structure using PROFESY, a Novel Method based on Pattern Matching and Fragment Assembly

Julian Lee, Seung-Yeon Kim, Keehyung Joo, Ilsoo Kim, Saejoon Kim, and Jooyoung Lee School of Computational Sciences, Korea Institute for Advanced Study [email protected]

We introduce a novel method for tertiary structure prediction, PROFESY (PROFile Enumerating SYstem). This method utilizes secondary structure prediction information and fragment assembly. The secondary structure prediction is first performed using PREDICT (PRofile Enumeration DICTionary), a method recently developed by our group that uses a concept of distance between patterns. For a given protein sequence, this method uses PSI-BLAST to generate profiles, which define patterns for amino acid residues. Each pattern is compared with those in a pattern database generated from the PDB, and the patterns closest to the query pattern are selected to determine the secondary structure of the query residue. In order to construct the tertiary structure, we also collect the backbone dihedral angles along with these patterns. These constitute a library of fragments for a given protein sequence.

By construction, the secondary structure of the tertiary structure obtained from PROFESY agrees with that predicted by PREDICT. In order to obtain the optimal tertiary packing of these secondary structure elements, we define a score function based on the number of long-range hydrogen bonds, burial of hydrophobic residues and exposure of hydrophilic residues, the radius of gyration, and inter-residue Lennard-Jones interactions to avoid steric clashes. Fragments are replaced by ones from the library so that the score function is minimized. The minimization is performed with conformational space annealing (CSA) [1], a powerful global optimization method that enables one to sample diverse low-lying minima of the score function.

1. Lee J. et al. (1997) New optimization method for conformational energy calculations on polypeptides: Conformational Space Annealing. J. Comp. Chem. 18 (9), 1222-1232; 2. Lee J. et al. (1998) Conformational analysis of the 20-residue membrane-bound portion of Melittin by Conformational Space Annealing. Biopolymers 46, 103-115; 3. Lee J. et al. (1999) Conformational Space Annealing by parallel computations: extensive conformational search of Met-enkephalin and the 20-residue membrane-bound portion of Melittin. Int. J. Quant. Chem. 75, 255-265; 4. Lee J. et al. (1999) Energy-based de novo protein folding by conformational space annealing and an off-lattice united-residue force field: Application to the 10-55 fragment of staphylococcal protein A and to apo calbindin D9K. Proc. Natl. Acad. Sci. USA 96, 2025-2030; 5. Liwo A. et al. (1999) Protein structure prediction by global optimization of a potential energy function. Proc. Natl. Acad. Sci. USA 96, 5482-5485; 6. Lee J. et al. (1999) Calculation of protein conformation by global optimization of a potential energy function. PROTEINS: Structure, Function, and Genetics 3:204-208; 7. Lee J. et al. (2000) Hierarchical energy-based approach to protein-structure prediction: Blind-test evaluation with CASP3 targets. Int. J. Quant. Chem. 77, 90-117

KIAS (P0531) - 479 predictions: 176 3D, 303 SS

Prediction of Protein Secondary Structure Using PREDICT, a Novel Method Based on Pattern Matching

Keehyung Joo1, Ilsoo Kim1, Julian Lee1, Seung-Yeon Kim1, Sung Jong Lee1,2, and Jooyoung Lee1 1 School of Computational Sciences, Korea Institute for Advanced Study 2 Department of Physics, Suwon University [email protected]

We introduce a novel method for secondary structure prediction, PREDICT (PRofile Enumeration DICTionary). This method uses a concept of distance between patterns. For a given protein sequence, this method uses PSI-BLAST (Position Specific Iterative Basic Local Alignment Search Tool) to generate profiles, which define patterns for amino acid residues. Each pattern is compared with those in a pattern database generated from the PDB (Protein Data Bank), and the patterns closest to the query pattern are selected to determine the secondary structure of the query residue. This method combines the idea of the nearest-neighbor method of Yi and Lander [1] with the profile-generating technology of PSI-BLAST [2]. We tested the method on CB513, a set of 513 non-homologous proteins, and applied it to the CASP5 targets as a blind test. In preliminary results, the Q3 value for secondary structure prediction on the CB513 set, using a set of 7777 proteins as the database, is about 80%.
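The nearest-neighbor idea can be illustrated with a small sketch. It is not the PREDICT implementation: the Euclidean distance, the majority vote, the value of k and the two-number toy patterns are all assumptions made for the example, whereas real patterns would be windows of PSI-BLAST profile columns.

```python
import math
from collections import Counter


def pattern_distance(p, q):
    """Euclidean distance between two equal-length profile patterns (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def predict_state(query_pattern, database, k=3):
    """Assign the majority state among the k database patterns closest to the query.

    database: list of (pattern, state) pairs, state being 'H', 'E' or 'C'.
    """
    nearest = sorted(database, key=lambda entry: pattern_distance(query_pattern, entry[0]))[:k]
    votes = Counter(state for _, state in nearest)
    return votes.most_common(1)[0][0]


toy_database = [
    ([1.0, 0.0], "H"), ([0.9, 0.1], "H"),
    ([0.0, 1.0], "E"), ([0.1, 0.9], "E"),
    ([0.5, 0.5], "C"),
]
helix_call = predict_state([0.95, 0.05], toy_database)
strand_call = predict_state([0.05, 0.95], toy_database)
```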

1. Yi T. et al. (1993) Protein Secondary Structure Prediction using Nearest-Neighbor Methods, J. Mol. Biol. 232, 1117-1129 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402.

LAMBERT-Christophe (P0035) - 131 predictions: 131 3D

Improving Target-Template Alignment Using Neural Networks

C. Lambert, E. Depiereux Unité de Recherche en Biologie Moléculaire, Facultés Universitaires Notre-Dame de la Paix, rue de Bruxelles 61, 5000 Namur, Belgium [email protected]

The aim of our work is to propose a reliable automatic method for homology modeling (ESyPred3D[1]), especially when the protein of interest shares a low percentage of identities (20-30%) with the chosen template.

Our strategy consists of the usual steps for homology modeling: search for the template in databanks, target-template alignment and modeling. Currently, our method does not provide any assessment of the model.

To search for a template, we used four iterations of PSI-BLAST [2] on the non-redundant protein database (nr) of the NCBI. All sequences with an expected value (E-value) lower than 0.001 are included in the profile building. The template is chosen as the sequence of known structure (PDB) that has the lowest expected value. The search in the nr databank also gives us a large number of similar sequences.

As far as possible, two sets of sequences are built. The first contains the 50 best hits below the expected value cutoff of 0.001. The second contains a subset of the sequences, obtained by removing overly redundant ones. The aim is to create different conditions for running the multiple alignment programs and to extract different consensuses in order to raise the confidence of the sequence-structure alignment.

The two sets are then submitted to five alignment programs: ClustalW[3], Dialign2[4], Match-Box[5], Multalin[6] and T-Coffee[7]. A pairwise alignment between the target and template sequences is extracted from each multiple alignment. All the pairwise alignments, including the one provided by PSI-BLAST, are used to generate a database of aligned positions (boxes). A neural network is used to assign a score to each box. The most confident boxes are taken as anchor points for building the final sequence-structure alignment. A three-dimensional model is built using MODELLER [8] on this final alignment.
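The collection of aligned positions can be sketched as follows. This is a simplified illustration, not the ESyPred3D code: a plain count of how many pairwise alignments agree on a position stands in for the neural network score, and the gapped rows are toy data.

```python
from collections import Counter


def aligned_pairs(target_row, template_row):
    """Return the (target_index, template_index) pairs aligned in two gapped rows."""
    pairs = set()
    i = j = 0  # residue counters for target and template
    for a, b in zip(target_row, template_row):
        if a != "-" and b != "-":
            pairs.add((i, j))
        if a != "-":
            i += 1
        if b != "-":
            j += 1
    return pairs


def box_support(alignments):
    """Count, for each aligned pair, how many pairwise alignments contain it.

    The count is a stand-in for the neural network score described in the text.
    """
    counts = Counter()
    for target_row, template_row in alignments:
        counts.update(aligned_pairs(target_row, template_row))
    return counts


toy_alignments = [("ACD-E", "A-DFE"), ("ACDE", "ADFE")]
support = box_support(toy_alignments)
```

Pairs supported by every program, such as the first and last positions here, would be natural anchor candidates.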

We tested the performance of our alignment step by evaluating several alignment programs and comparing them with our alignment procedure. The results show that the multiple alignment quality improves by 30% on average, especially in cases where the target and template sequences share a low percentage of identities.

ESyPred3D web site: http://www.fundp.ac.be/urbm/bioinfo/esypred

1. Lambert C. et al. (2002) ESyPred3D: Prediction of proteins 3D structures. Bioinformatics. 18 (9), 1250-1256 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 3. Thompson J.D. et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22, 4673-4680 4. Morgenstern B. et al. (1998) DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14, 290-294 5. Depiereux E. et al. (1997) Match-Box server: a multiple sequence alignment tool placing emphasis on reliability. Comput. Appl. Biosci. 13, 249-256 6. Corpet F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res. 16, 10881-10890 7. Notredame C. et al. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302(1), 205-217 8. Sali A. et al. (1993) Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234(3), 779-815.

Lomize-Andrei (P0288) - 76 predictions: 76 3D

New Energy Functions For Protein Modeling Derived From ΔΔG Values

A.L. Lomize, M.Y. Reibarkh, and I.D. Pogozheva College of Pharmacy, University of Michigan, Ann Arbor, MI [email protected]

Efficient methods for protein structure prediction require energy optimization. An especially important goal here is the correct evaluation of free energy differences, rather than the enthalpy in vacuum that is usually calculated with molecular mechanics potentials. The required energy functions must take into account conformational entropy, solvation free energy, and the dependence of interatomic interactions on the environment. They must also be tested against experimental thermodynamic stabilities of proteins or protein-ligand complexes. We have determined van der Waals (vdW) interaction energies between different atom types, energies of hydrogen bonds, and atomic solvation parameters from the published free-energy differences for 106 mutants with replacements of buried uncharged residues and available crystal structures [1]. The resulting energies of interatomic interactions differ from those in molecular mechanics in three important respects: (1) they describe interactions in the protein interior rather than in vacuum; (2) they are generally weaker and follow the "like dissolves like" rule; (3) they are related to the enthalpy of melting rather than to the enthalpy of sublimation. The developed potentials can be applied to side-chain packing, fold recognition, computational de novo design, estimation of ligand-binding constants, and modeling of nonregular loops.

1. Lomize A.L., Reibarkh M.Y., and Pogozheva I.D. (2002). Interatomic potentials and solvation parameters from protein engineering data for buried residues. Protein Sci. 11 (8), 1984-2000.

Lund-Ole (P0391) - 39 predictions: 39 3D

X3M – a Computer Program to Extract 3D Models

O. Lund, M. Nielsen, C. Lundegaard and P. Worning. Center for Biological Sequence Analysis, Biocentrum-DTU, Building 208, Technical University of Denmark, DK-2800 Lyngby, Denmark [email protected]

See methods section

Levitt (P0016) - 350 predictions: 350 3D

Ab Initio Structure Prediction of Target Proteins in CASP5

T.M. Raschke*, C.M. Summa*, R. Kolodny and M. Levitt Stanford University, Department of Structural Biology, Fairchild Building, Room D-109, Stanford, CA 94305 [email protected]

We applied the following ab initio prediction method to target proteins that received low scores from the CAFASP3 comparative modeling servers. Models were generated by assembling regularized backbone segments of length 9 (derived from a 2000-protein library) using Monte Carlo swap moves, as per Jones’ method used in CASP2 [1] and Baker’s method in CASP3 and CASP4 [2-3]. The energy function used for annealing consisted of terms representing cooperative hydrogen bonds (as done by Keasar & Levitt in CASP4), residue-based hydrophobic burial propensity, and residue-based hydrophobic pair interactions. After 50,000 steps of annealing with the segment replacement method, the models were annealed with “refinement moves,” consisting of small 2° rotations of the backbone torsion angles, for 10,000 steps. This process was used to model the native sequence and homologous sequences (where appropriate) using the predicted secondary structures from several automated servers [4-6]. For some targets, the most likely emitted sequence from a Hidden Markov Model built from the target sequence family was also used [7]. A set of 1000 decoys was generated for each sequence/secondary structure combination, and all models were combined into one large dataset for selection. This dataset was pruned to 3000 members using a “colony energy” score [8] that combined several energy functions (atom cluster energy, electrostatic energy, RAPDF [9], and the energy from the decoy generation procedure) with a measure of the structural similarities between the decoys. The 3000 best models were clustered with a hierarchical clustering method using a Floyd distance metric (distance along the graph) [10]. Decoys in the top 5 clusters were evaluated by manual inspection, and typically one decoy from each of the top 5 clusters was submitted.
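A distance "along the graph" can be illustrated with the Floyd-Warshall shortest-path algorithm, as in the sketch below. The construction of the graph is an assumption: here each decoy is linked to its k nearest neighbours by direct distance, following the approach of reference [10], and the distance matrix is toy data rather than real decoy RMSDs.

```python
import math


def floyd_distances(pairwise, k=2):
    """All-pairs shortest-path lengths over a k-nearest-neighbour graph.

    pairwise: symmetric matrix of direct distances between decoys.
    The k-nearest-neighbour linking rule is an assumption of this sketch.
    """
    n = len(pairwise)
    graph = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        graph[i][i] = 0.0
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: pairwise[i][j])[:k]
        for j in neighbours:
            graph[i][j] = graph[j][i] = pairwise[i][j]
    # Floyd-Warshall relaxation: allow paths through each intermediate node m.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = graph[i][m] + graph[m][j]
                if via < graph[i][j]:
                    graph[i][j] = via
    return graph


# Four decoys arranged like a horseshoe: 0 and 3 are close directly (1.2)
# but far apart along the chain 0-1-2-3.
direct = [
    [0.0, 1.0, 1.5, 1.2],
    [1.0, 0.0, 1.0, 1.5],
    [1.5, 1.0, 0.0, 1.0],
    [1.2, 1.5, 1.0, 0.0],
]
along_graph = floyd_distances(direct, k=1)
```

The resulting matrix can be handed to any hierarchical clustering routine in place of the direct distances.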

* These authors contributed equally to this work.

1. Jones D.T. (1997) Successful ab initio prediction of the tertiary structure of NK-lysin using multiple sequences and recognized supersecondary structural motifs. Proteins: Struct. Funct. Genet. S1, 185-191. 2. Simons K.T., Bonneau R., Ruczinski I. and Baker D. (1999) Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins: Struct. Funct. Genet. S3, 171-176. 3. Bonneau R. et al. (2001) Rosetta in CASP4: Progress in ab initio protein structure prediction. Proteins: Struct. Funct. Genet. S5, 119-126. 4. PHD, http://www.embl-heidelberg.de/predictprotein/predictprotein.html 5. PSIPRED, http://bioinf.cs.ucl.ac.uk/psipred/ 6. SAM-T02-STRIDE, http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html 7. Gough J. and Madera M. (2002) The next generation of structural genome analysis. CASP5 Abstract. 8. Xiang Z.X., Soto C.S. and Honig B. (2002) Evaluating conformational free energies: The colony energy and its application to the problem of loop prediction. Proc. Natl. Acad. Sci. USA 99 (11), 7432-7437. 9. Samudrala R. and Moult J. (1998) An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275 (5), 895-916. 10. Tenenbaum J.B., de Silva V. and Langford J.C. (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), 2319.

Levitt (P0016) - 350 predictions: 350 3D

Comparative Modeling Using Structural Alignments and Self-Consistent Mean-Field for Sidechain/Loop Prediction

E. Lindahl, M. Levitt, and P. Koehl Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305 USA [email protected]

For comparative modeling at CASP5, our group has focused on improving sequence alignments and on the prediction of sidechains and loop regions in proteins. All target sequences submitted to CASP5 were first screened using the results from the CAFASP3 servers and comparative modeling was only attempted for targets where at least one server showed intermediate or high scores. The remaining sequences were considered ab initio targets, for which a different method was used [1].

A consensus secondary structure prediction was derived from all the servers available at the CAFASP3 website, giving additional weight to the PsiPred [2] method. For all significant fold recognition hits we extracted both the original structures and other structures in the same SCOP superfamily [3] with good SPACI scores [4] to get high quality templates. We computed a sequence profile based on the structural alignments of these templates, derived from the FSSP database [5]. Position-dependent gap penalties were introduced based on the experimental and predicted secondary structures, FSSP fragments, and the distance between endpoints in the template structures for deletions.

We used both our alignments derived from the structural profiles and automated alignments from CAFASP3 to create a set of manually tweaked alignments for each target. The emphasis in this tuning process was not mainly on matching features, but rather on manual discrimination, correcting possible mismatches, and taking any additional knowledge about the sequence/structure into account. For large insertions or changes in secondary structure we first altered the backbone structure of the template and applied energy minimization with ENCAD/SegMod [6] and Gromacs [7] to bring the structure to a reasonable state. Starting from the manual alignments, a model backbone framework was built by removing two residues on each side of insertions/deletions in the template. Candidate loop fragments were selected from a set of geometrically compatible backbone fragments. Similar fragment sets were generated for positions where there were PRO and GLY mutations between the template and query sequences. This approach is limited to insertions shorter than about 15 residues, and in a couple of cases we had to apply manual modeling using the O program [8] to generate potential loops.

In the final modeling step, we select a set of rotamers for each sidechain, and use a self-consistent mean-field approach [9-10] to simultaneously optimize sidechains and the altered backbone fragments. Manual inspection and the energy of the resulting all-atom models were used to select which predictions to submit.

1. Raschke T., Summa C., Levitt M. (2002) Ab Initio Structure Prediction of Targets in CASP5. CASP5 Abstract 2. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 3. Murzin A.G., Brenner S.E., Hubbard T., Chothia C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536-540 4. Brenner S.E., Koehl P., Levitt M. (2000) The Astral compendium for protein structure and sequence analysis. Nucleic Acids Res. 28, 254-256 5. Holm L., Sander C. (1996) Mapping the protein universe. Science 273, 595-602 6. Levitt M., Hirshberg M., Sharon R., Daggett V. (1995) Potential Energy Functions and Parameters for Simulations of Molecular Dynamics of Proteins and Nucleic Acids in Solution. Comp. Phys. Comm. 91, 215-231 7. Lindahl E., Hess B., van der Spoel D. (2001) GROMACS 3.0: A package for molecular simulation and trajectory analysis. J. Mol. Mod. 7(8), 306 http://www.gromacs.org 8. Jones T.A., Kjeldgaard M. (1998) Essential O, Software manual, Uppsala University. http://xray.bmc.uu.se/alwyn/o_related.html 9. Koehl P., Delarue M. (1994) Application of a self-consistent mean field theory to predict protein side-chains conformation and estimate their conformational entropy. J. Mol. Biol. 239, 249-275 10. Koehl P., Delarue M. (1995) A self consistent mean field approach to simultaneous gap closure and side-chain positioning in homology modeling. Nature Struct. Biol. 2, 163-170

nexxus-delrio (P0370) - 7 predictions: 7 3D

Protein Structure Assessment by Matching Residues Function and Centrality

G. del Rio1, A. Garciarrubio2 and D.E. Bredesen1 1 – Buck Institute, 2 – Biotechnology Institute (UNAM) [email protected]

See methods section

ORNL-PROSPECT (P0012) - 330 predictions: 330 3D

Protein Domain Decomposition Using Network Flow Algorithms and Neural Networks

J. Guo, D. Xu, D. Kim, and Y. Xu Oak Ridge National Laboratory [email protected]

Structural domains are considered the basic units of protein folding, function, evolution, and design. Automatic decomposition of protein structures into structural domains remains a challenging and unsolved problem despite many years of investigation. Manual inspection still plays a big part in the domain decomposition of protein structures when constructing domain databases. We have previously developed a computer program, DomainParser, which uses network flow algorithms for protein domain decomposition. The algorithm partitions a protein structure into domains accurately when the number of domains is known. However, its performance drops when this number is unclear. By utilizing various types of structural information, including the hydrophobic moment profile, we have developed an effective method for assessing the most probable number of domains a structure may have. The core of this method is a neural network, which is trained to rank different possible domain decompositions. By combining this neural network with our previous network flow algorithms, our new algorithm achieves 82% decomposition accuracy on a data set of 1317 protein chains, compared with 75% for the old one, when measured against the manual decomposition results given in the SCOP database.
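The use of network flow for a two-domain split can be illustrated with a minimum cut on a residue contact graph. The sketch below is a generic augmenting-path max-flow (Edmonds-Karp), not DomainParser itself; the edge weights, the toy graph and the choice of source and sink residues are all hypothetical.

```python
from collections import deque


def min_cut_partition(capacity, source, sink):
    """Return the nodes on the source side of a minimum cut.

    capacity: dict {(u, v): weight} describing an undirected contact graph.
    """
    residual = {}
    nodes = set()
    for (u, v), w in capacity.items():
        residual[(u, v)] = residual.get((u, v), 0) + w
        residual[(v, u)] = residual.get((v, u), 0) + w
        nodes.update((u, v))
    adjacency = {n: set() for n in nodes}
    for u, v in residual:
        adjacency[u].add(v)

    def reachable():
        """Breadth-first search over edges with remaining capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = reachable()
        if sink not in parent:
            return set(parent)  # no augmenting path left: this side is one domain
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(residual[edge] for edge in path)
        for u, v in path:
            residual[(u, v)] -= flow
            residual[(v, u)] += flow


# Two tightly packed groups of residues joined by one weak contact (2-3).
contacts = {
    (0, 1): 5, (1, 2): 5, (0, 2): 5,
    (3, 4): 5, (4, 5): 5, (3, 5): 5,
    (2, 3): 1,
}
domain_a = min_cut_partition(contacts, source=0, sink=5)
```

The cut falls on the weak contact, so the two dense groups come out as separate domains. Deciding how many such cuts to make is the part the abstract assigns to the neural network.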

ORNL-PROSPECT (P0012) - 330 predictions: 330 3D

A Computational Pipeline for Large-Scale Protein Structure Predictions

Manesh Shah1, Sergei Passovets1, Li Wang1, Dongsup Kim1, Kyle Ellrott1,3, Dawei Lin4, Bi-Cheng Wang4, Dong Xu1,3, Ying Xu1,2,3 1Life Sciences Division, 2Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831 3UT-ORNL Graduate School of Genome Science and Technology, Oak Ridge, TN 37831 4Department of Biochemistry & Molecular Biology, University of Georgia, Athens, GA 30602 [email protected]

We have recently developed a computational pipeline for automated protein structure predictions. It can be used for high-throughput protein structure prediction, including genome-scale prediction. The pipeline has capacities in both homology modeling and threading-based protein fold recognition.

The main components of the pipeline are: (a) a toolkit consisting of essential protein analysis tools, (b) a client/server system which provides access to the tools, (c) a pipeline manager which coordinates the processing tasks for a given analysis request, and (d) a web interface for query submission. The pipeline can be used through the command line or a Web interface. Most of the pipeline's computation is carried out on a 64-node supercomputer at Oak Ridge National Laboratory.

The pipeline operations can be categorized into three distinct phases: (1) protein triage, (2) threading-based structure prediction and (3) sequence-based function determination. The protein triage phase uses PRODOM (for domain parsing), SOSUI (for classification into globular or membrane protein), SignalP (for identifying signal peptide cleavage sites) and PSI-BLAST (for sequence homology in PDB, Swissprot and other databases). The structure prediction phase uses SSP (a secondary structure prediction tool that we developed), PROSPECT (a protein fold recognition tool that we developed), MODELLER (for atomic model construction) and WHATIF (for structure quality assessment). The sequence-based function determination phase (not yet implemented) will use the protein family classification tools Pfam, Motif and PRINTS. The pipeline manager invokes different tools depending on the user input and the logic of the prediction process, and controls the data and analysis flow of the pipeline. XML technology is used for data exchange between the web interface, the pipeline manager and the tools.

We used this pipeline for the CAFASP2 predictions, and the results also helped us with the CASP5 predictions. We have applied the pipeline to Rhodopseudomonas palustris, for which 799 soluble proteins and 281 membrane proteins were classified and predicted. This took only about a day on the 64-node supercomputer at Oak Ridge National Laboratory. In addition, we predicted structures for more than 2500 proteins in Pyrococcus furiosus for the SouthEast Collaboratory for Structural Genomics (SECSG). The predicted structures are used for target selection and as initial models for structure determination.

Protfinder (P0282) - 222 predictions: 222 3D

POSTER: Sequence-structure Alignments with the Protfinder Algorithm

U. Bastolla Centro de Astrobiologia (INTA_CSIC), Madrid, Spain [email protected]

The Protfinder algorithm predicts protein structures by aligning the query sequence to candidate structures in the PDB. Alignments are evaluated through a minimal model of protein folding, which reproduces approximately some key features of protein thermodynamics and is very convenient for rapid computation. Information on sequence homology is not used in the scoring function.

Protein structures are represented as contact maps and their effective intramolecular interactions are modeled as a sum of contact interactions. We use the contact energy function optimized in Ref. [1], which assigns lowest energy to the experimentally known native structure for almost every sequence of monomeric protein whose structure has been determined by X-ray crystallography, except small fragments and chains with large cofactors. Moreover, it generates well-correlated energy landscapes, in the sense that structures very dissimilar from the native one have energies much higher than the native energy. This property is crucial for protein structure prediction. The effective energy function is also able to estimate the folding free energies of a set of small proteins folding with two-state thermodynamics, with reasonable agreement with experimental data [2].
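A sum of contact interactions over a contact map can be written in a few lines. The sketch below only illustrates the form of such an energy; the interaction values are invented for the example and are not the parameters of Ref. [1].

```python
def contact_energy(sequence, contacts, interaction):
    """Sum pairwise interaction energies over the residue pairs in contact.

    contacts: iterable of (i, j) index pairs taken from a contact map.
    interaction: dict keyed by alphabetically sorted residue-type pairs
                 (hypothetical values in this example).
    """
    total = 0.0
    for i, j in contacts:
        pair = tuple(sorted((sequence[i], sequence[j])))
        total += interaction[pair]
    return total


toy_interaction = {("L", "L"): -1.0, ("K", "L"): 0.5, ("K", "K"): 1.0}
energy_one_contact = contact_energy("LKL", [(0, 2)], toy_interaction)
energy_two_contacts = contact_energy("LKL", [(0, 2), (0, 1)], toy_interaction)
```

Threading a different sequence onto the same contact list only changes the lookup, which is why the contact map can be precomputed once per structure.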

The scoring function consists of three elements: the effective energy function described above, a chain entropy term estimated in Ref. [2] and a term penalizing gaps in the alignment. Gaps in secondary structure elements are strictly forbidden. Gaps in the structure are allowed only if the two residues that are shortcut are close in space and the angles characterizing their pseudo-peptidic bond lie within a predefined range. Gaps in the sequence are allowed only on the surface of the protein, which is identified by the fact that the number of contacts per residue is smaller than a threshold. Allowed gaps receive an energetic penalty G0 plus a penalty G1 for each residue in the gap.
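The gap term described above is an affine penalty. As a minimal sketch, assuming the total is simply summed over all allowed gaps and using made-up values for G0 and G1:

```python
def gap_penalty(gap_lengths, g0, g1):
    """Affine gap term: each allowed gap costs g0 plus g1 per residue in the gap."""
    return sum(g0 + g1 * length for length in gap_lengths)


# Two gaps of 1 and 3 residues with illustrative parameters.
example_penalty = gap_penalty([1, 3], g0=2.0, g1=0.5)
```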

To speed up the computation, each structure in the PDBSELECT [3] non-redundant subset of the PDB was preprocessed to produce its contact map and the list of allowed shortcuts in the structure. Secondary structure was obtained from the DSSP file [4] when available, otherwise from the PDB file. The few structures for which no secondary structure assignment could be obtained were discarded. Preprocessing, together with the mostly integer arithmetic used by the code, speeds up the computation considerably.

To search for the optimal alignment, we use a stochastic version of the deterministic Build-up algorithm developed by Park and Levitt to look for low energy configurations of discrete protein models [5]. The algorithm is very efficient at finding high-scoring alignments, although it is not guaranteed to find the global optimum.

The algorithm starts by generating all possible gapless alignments of length l between the query sequence and the test structure and stores the M alignments with maximum score. At each subsequent step, an attempt is made to add a new residue to each alignment. There are three possibilities: the residue is aligned to the next structural position, or it is aligned introducing a gap in the structure (if allowed), or the residue is not aligned, initiating a gap in the sequence. All possible continuations are generated, and the M best scoring alignments are stored in memory and used as seeds for the next step. The algorithm is iterated until no more residues can be added.

To improve the efficiency, instead of using the deterministic algorithm described above, we select the M alignments at each step based on the sum of their score plus a random number. The relative importance of the randomness is large in the first steps, allowing the algorithm to visit a larger fraction of the alignment space. The randomness decreases as the alignments get longer, so that the complete alignment is chosen on the basis of the deterministic score. The algorithm is first applied with a small value, M=50, to rapidly scan the whole database. The 200 proteins with the best alignments are then stored in memory and used for a second, more accurate search with M=800.
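
The noisy selection step can be sketched as follows. This is an illustration only: the function name, the uniform noise, and the linear decay of the noise amplitude are assumptions, since the abstract states only that the randomness shrinks as the alignments grow.

```python
import random


def select_seeds(candidates, m, step, total_steps, noise0=1.0, rng=random):
    """Keep the m candidates with the highest (score + noise).

    candidates: list of (score, alignment) pairs.
    The noise amplitude (assumed schedule) decays linearly and reaches zero
    at the last step, so the final selection uses the deterministic score.
    """
    amplitude = noise0 * max(0.0, 1.0 - step / total_steps)
    noisy = [
        (score + amplitude * rng.random(), score, alignment)
        for score, alignment in candidates
    ]
    # Sort on the noisy score only; ties keep their original order.
    noisy.sort(key=lambda item: item[0], reverse=True)
    return [(score, alignment) for _, score, alignment in noisy[:m]]
```

Calling this once per build-up step with M=50 for the fast scan and M=800 for the refined search would mirror the two-pass protocol described in the text.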

Each candidate structure receives the score of its best alignment. The best scoring structure is used as the prediction. The quality of the prediction is estimated through the normalized energy gap, which measures the difference between the best score and the score of an alternative structure in units of the best score, divided by the structural distance between the best scoring structure and the alternative structure. If the minimal value of the normalized energy gap over all alternative structures is large, the prediction is considered reliable. If it is small, alignments with very different structures have scores quite similar to the best one, and the reliability is very low.
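
A minimal sketch of the reliability measure follows, assuming that higher scores are better and that some positive structural distance between two structures is available. The function name and the input layout are hypothetical, and the abstract does not specify the distance measure.

```python
def normalized_energy_gap(best_score, alternatives):
    """Minimum normalized gap over all alternative structures.

    alternatives: iterable of (score, distance) pairs, where distance is the
    structural distance (> 0) between the alternative and the best structure.
    """
    gaps = [
        (best_score - score) / abs(best_score) / distance
        for score, distance in alternatives
    ]
    return min(gaps)
```

A large return value means that every alternative is either much worse in score or structurally close to the best structure; a small value flags a dissimilar structure that scores almost as well.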

1. Bastolla U. et al. (2000) A statistical mechanical method to optimize energy functions for protein folding. Proc. Natl. Acad. Sci. USA 97, 3977-3981 2. Bastolla U. Testing the thermodynamics of a minimal model of protein folding, in preparation 3. Hobohm U. and Sander C. (1994) Enlarged representative set of protein structures. Protein Sci. 3, 522-524 4. Kabsch W. and Sander C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22 (12), 2577-2637 5. Park B.H. and Levitt M. (1995) The complexity and accuracy of discrete state models of protein structure. J. Mol. Biol. 249, 493-507

Pushchino (P0203) - 263 predictions: 263 3D

Cunning Simplicity of a Hierarchical Folding and of Protein Folding Funnels

A.V. Finkelstein Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia [email protected]

A hierarchic scheme of protein folding and simple funnel models of protein folding do not solve the Levinthal paradox, since they cannot simultaneously explain the major features observed for protein folding: (i) folding within non-astronomical time, (ii) independence of the native structure from large variations in the folding rates of a given protein under different conditions, and (iii) co-existence, in visible quantity, of only the native and the unfolded molecules during folding of moderate-size (single-domain) proteins. In contrast, a nucleation mechanism of folding can account for all these major features simultaneously and resolves the Levinthal paradox.

The author is grateful to N.S. Bogatyreva for discussions and assistance, and acknowledges the support of an International Research Scholar's Award from the Howard Hughes Medical Institute and of the Russian Foundation for Basic Research.

Pushchino (P0203) - 263 predictions: 263 3D

Common Features in Structures and Sequences of Sandwich-Like Proteins

A.V. Finkelstein1, A.E. Kister2 and I.M. Gelfand2 1 - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia, 2 - Department of Mathematics, Rutgers University, Piscataway, NJ, 08854, USA [email protected]

The goal of this work is to define the structural and sequence features common to sandwich-like proteins (SP) – a group of very different proteins currently comprising 69 superfamilies in 38 protein folds. Analysis of the arrangements of strands within main sandwich sheets revealed a rigorously defined constraint on the supersecondary substructure that holds true for 94% of known SP structures. The invariant substructure consists of two interlocked pairs of neighboring β-strands. It is even more typical for centers of SP than the well-known ‘Greek key’ strand arrangement [1] is for their edges.

As homology among these proteins is usually not detectable even with the most powerful sequence-comparing algorithms, we employ a structure-based approach to sequence alignment. Within the interlocked strands we found 12 positions with fixed structural roles in SP. A residue at any of these positions has structural properties similar to those of residues at the same position in other SP. The 12 positions lie at the center of the interface between the β-sheets and form the common geometrical core of SP. Of the 12 positions, 8 are occupied by only four hydrophobic residues in 80% of all SP.

The authors are grateful to C. Chothia, P. Ehrlich, M. Goldman and Yu. Vasiliev for stimulating discussions, and to L. Pogost and N.S. Bogatyreva for assistance. A.V.F. acknowledges the support of an International Research Scholar's Award from the Howard Hughes Medical Institute and of the Russian Foundation for Basic Research.

1. Richardson J.S. (1981) The Anatomy and Taxonomy of Protein Structure. Adv. Prot. Chem. 34, 167-339.

Pushchino (P0203) - 263 predictions: 263 3D

Protein Folding: Theoretical and Experimental Study

A.V. Finkelstein1, O.V. Galzitskaya1, D.N. Ivankov1, N.S. Bogatyreva1, S.A. Garbuzinskii1, M.Yu. Lobanov1, D.A. Dolgikh2 and M. Oliveberg3 1 - Institute of Protein Research, Russian Academy of Sciences, 142290, Pushchino, Moscow Region, Russia, 2 - Shemyakin and Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, 117871, Moscow, Russia, 3 - Department of Biochemistry, Umeå University, S-901 87 Umeå, Sweden [email protected]

We present a theory for calculating refolding and unfolding rates and for finding the folding nuclei of globular proteins from their 3D structures and stabilities.

The method is based on the solution of kinetic equations for networks of folding-unfolding pathways. The theoretical results obtained for a large set of small and medium-size proteins under various conditions correlate well with the available experimental observations.

On this basis, we predicted the folding and unfolding rates for protein S6 and two of its engineered circular permutants which have been designed so as to have increased rates of transitions between the folded and unfolded forms. The experimental study of these proteins confirmed the predictions.

The obtained results emphasize a combined action of protein topology and stability in controlling the rate of protein folding.

The work was supported by the Russian Foundation for Basic Research and by an International Research Scholar’s Award from the Howard Hughes Medical Institute.

Rokko (P0327) - 109 predictions: 109 3D

Method of Team Rokko: Multicanonical Ensemble Reversible Fragment Assembly and Physico-chemical Energy Function

Yoshimi Fujitsuka1, George Chikenji1, Nobuyasu Koga1, Akira R. Kinjo2, and Shoji Takada1,2 1 - Kobe University, 2 - Japan Science and Technology Corporation [email protected]

For CASP5, we use SimFold, a protein simulation program that we have been developing recently [1,2]. We briefly describe a) the energy function, b) the sampling method in SimFold, and c) how we applied it in CASP5.

a) SimFold uses a coarse-grained protein model that has explicit backbone atoms and a sphere at the center of mass of each sidechain. Each sidechain can take one of several rotamer states. The energy function is based on physico-chemical considerations and consists of many terms, such as hydrophobic interactions, hydrogen bonds, and vdW interactions. In particular, hydrogen bond interactions include a dependence on the local dielectric constant and a correlation between two neighboring bonds in a beta sheet. Many of the length parameters are determined from a database survey. The energetic parameters that need to be accurate are optimized on the basis of the energy landscape theory: for each structure in a training set of 40 proteins, we maximize the |Z| score, the normalized difference between the native energy and the average energy of decoy structures.

b) For conformational sampling, SimFold uses either the fragment assembly (FA) method or the replica exchange MD method. We emphasize that, unusually, both FA and MD methods are available in the single program SimFold. Our FA differs from that developed by Baker's group in two respects. First, we use only three-residue fragments instead of nine-residue ones. Second, we have developed a "reversible FA method" (Chikenji, Fujitsuka, & Takada, unpublished). We note that the typical FA protocol does not obey detailed balance, but our algorithm does. In the reversible FA method, we prepare new fragment libraries that contain the original fragment library structures and hybrid ones. A hybrid fragment structure for the (i-1, i, i+1) residue segment consists of the latter half of a fragment structure for the (i-2, i-1, i) segment and the first half of one for the (i, i+1, i+2) segment.
Because reversible FA fulfills the detailed balance condition, we could combine it with the multicanonical ensemble Monte Carlo method, which is known to be a highly powerful conformational sampling method for protein systems. Indeed, this approach is used in CASP5 and improves conformational sampling very significantly. Structures either with the lowest energy or at the center of large clusters are chosen as predicted models. We also perform MD-based replica exchange simulation, where each replica holds the same protein at a different temperature and exchanges of replicas are attempted at a certain frequency. The lowest energy structure is searched for in the replica at the lowest temperature, while the high temperature replicas are useful for escaping from misfolded traps.

c) In CASP5, for all targets that have no homologous sequences of known structure, we submitted structures predicted by SimFold. For chains shorter than ~120 residues, starting from random structures, we performed FA sampling either with the multicanonical ensemble method or with simulated annealing. We chose either structures with low energies or those at the center of large clusters. For longer sequences, we started from models from the CAFASP servers, performed replica exchange MD for sampling, and chose structures in the lowest temperature replica. For some targets, other information such as annotation was used too.
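
For readers unfamiliar with replica exchange, the swap test between two replicas can be sketched with the standard Metropolis-type exchange criterion. This is the textbook rule, not code from SimFold; the function name and the reduced units (Boltzmann constant set to 1 by default) are assumptions.

```python
import math


def swap_probability(energy_i, temp_i, energy_j, temp_j, k_b=1.0):
    """Probability of exchanging the configurations of replicas i and j.

    Standard criterion: min(1, exp((beta_i - beta_j) * (E_i - E_j))),
    with beta = 1 / (k_b * T).
    """
    beta_i = 1.0 / (k_b * temp_i)
    beta_j = 1.0 / (k_b * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)
```

A swap that moves the lower-energy configuration to the colder replica is always accepted, which is how low-energy structures accumulate in the lowest temperature replica.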

1. Takada S., (2001) Protein Folding Simulation With Solvent-Induced Force Field: Folding Pathway Ensemble of Three-Helix-Bundle Proteins, Proteins 42, 85-98. 2. Fujitsuka Y., Takada S., Luthey-Schulten Z.A., & Wolynes P.G., (2002) Optimizing Physical Energy Functions for Protein Folding, submitted.

Ron-Elber (P0300) - 259 predictions: 259 3D

Protein Structure Prediction With Threading Using the LOOPP2 Algorithm

T. Galor1, C. Lowe1, J. Meller2, J. Pillardy3, O. Teodorescu1 and R. Elber1 1 –Department of Computer Science, Cornell University, Ithaca, N.Y., 14853; 2 –Cincinnati Children’s Medical Center, Pediatric Informatics, 3333 Burnet Avenue, Cincinnati, OH 4522; 3– Computational Biology Service Unit, Cornell University, Ithaca, N.Y., 14853 [email protected]

See methods section

SAM-T02-server (P0189) - 221 predictions: 221 3D

SAM-T02 Protein Structure Prediction Webserver

Kevin Karplus, Rachel Karchin, and Richard Hughey Center for Biomolecular Science and Engineering, University of California, Santa Cruz [email protected]

See methods section

SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D

A Hybrid Method Combining Sequence and Chemical Shift Data to Predict Secondary Structure

L-H. Hung and R. Samudrala Dept of Microbiology, University of Washington [email protected]

The recent increase in the amount of available experimental structural data has been of considerable help to the structure prediction field. Strangely, the converse is not true - sequence and homology based methods have had relatively little impact on experimental methodologies. We are in the process of developing hybrid methods using de novo techniques to facilitate and automate NMR protein structure determinations.

As a first step towards this goal, we describe a new method for assigning secondary structure by using neural networks to combine sequence-based prediction (Psipred [1]) with chemical shift information. The resulting hybrid method (PsiCSI) achieves a Q3 accuracy of 89%, an increase of 5.5% and 6.2% (or equivalently, 33% and 36% fewer errors) over methods that use sequence information (Psipred) or chemical shifts (CSI [2]) alone. In addition, errors made by PsiCSI almost exclusively involve the interchange of helix or strand with coil, not of helix with strand. The increased accuracy and automation of PsiCSI will be of use in NMR experiments where assignment of secondary structure is a useful intermediate step towards the final determination of the tertiary structure. The increased accuracy and the elimination of gross errors should also be of use in protein structure prediction.
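
The equivalence between the accuracy gains and the error reductions quoted above can be checked with a few lines of arithmetic. The helper name is hypothetical; the inputs are the percentages given in the abstract.

```python
def relative_error_reduction(new_accuracy, gain):
    """Percentage of errors removed when accuracy rises by `gain` points.

    new_accuracy and gain are both in percentage points of Q3.
    """
    old_error = 100.0 - (new_accuracy - gain)  # error rate of the baseline method
    new_error = 100.0 - new_accuracy           # error rate of the hybrid method
    return 100.0 * (old_error - new_error) / old_error


vs_psipred = relative_error_reduction(89.0, 5.5)  # baseline error 16.5%
vs_csi = relative_error_reduction(89.0, 6.2)      # baseline error 17.2%
```

Rounded to whole percentages, the two results match the 33% and 36% stated in the text.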

1. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 2. Wishart D. S., et al. (1992) The chemical shift index: A fast and simple method for the assignment of protein secondary structure through NMR spectroscopy. Biochemistry 31, 1647-1651

SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D

An Extension to the All-Atom Distance-Dependent Potential For Ab Initio Protein Structure Prediction Based on Local Sequence Similarity

Shing-Chung Ngan and Ram Samudrala Dept of Microbiology, University of Washington [email protected]

Knowledge-based statistical potentials based on pairwise distances between residues (e.g. [1-2]) and atoms (e.g. [3-4]) have been widely used in protein structure prediction. The determination of parameters for these potentials involves extracting the distance distributions for pairs of residue-types (or atom-types) from a set of proteins with known structures. A common drawback in the construction of the potentials is that the connectivity of residues in the protein chains is usually ignored. Hence, the influences of residues not local to the residue (or atom) pair under consideration are not fully captured by the resulting statistical model.

To provide a partial remedy to this shortcoming, the all-atom distance-dependent potential as described in [4] is extended in a manner analogous to the procedure described in [5], where a residue-residue distance-dependent potential was augmented. Essentially, in determining the distance distribution of a pair of atom-types that is present in a given protein sequence whose structure is to be predicted, a window of amino acid sequence surrounding each of the two atoms is noted. Among the same pairs of atom types that are present in the set of proteins with known structures, only those pairs with local amino acid sequences similar to the noted amino acid sequences are to be used in forming the distance distribution. The similarity measure is defined through the BLOSUM 62 substitution matrix.
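
The window comparison can be sketched as below. This is an illustration under stated assumptions: the function names, the window half-width, and the handling of chain ends are hypothetical, and the toy scoring function stands in for a BLOSUM 62 lookup, which the caller would supply in practice.

```python
def sequence_window(sequence, index, half_width):
    """Residues within half_width positions of index (truncated at chain start)."""
    start = max(0, index - half_width)
    return sequence[start:index + half_width + 1]


def window_similarity(window_a, window_b, substitution_score):
    """Sum of per-position substitution scores for two equal-length windows."""
    if len(window_a) != len(window_b):
        raise ValueError("windows must have the same length")
    return sum(substitution_score(a, b) for a, b in zip(window_a, window_b))


def toy_score(a, b):
    """Placeholder for a substitution-matrix lookup: +1 identity, -1 mismatch."""
    return 1 if a == b else -1
```

Atom pairs from the known structures would contribute to the distance distribution only when the similarity of their windows to the query windows exceeds some cutoff; the abstract does not state the cutoff.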

To evaluate the utility of the extended all-atom potential, it is tested on decoy sets from the Decoys 'R' Us database [6]. Performance of the potential is measured using the standard receiver-operating characteristic (ROC) analysis. We observe that the new potential outperforms the all-atom potential in most of the decoy sets. Further analysis of the extended potential on more decoy sets is currently under way.

1. Wodak S. and Rooman M. (1993). Generating and testing protein folds. Curr. Opin. Struct. Biol. 3, 247-259. 2. Sippl M. (1995). Knowledge based potentials for proteins. Curr. Opin. Struct. Biol. 5, 229-235. 3. Subramaniam S., Tcheng D.K. and Fenton J.M. Knowledge-based methods for protein structure refinement and prediction. In Proceedings of the Fourth International Conference on Intelligent Systems in Molecular Biology, St. Louis, 1996, Ed. David States et al., AAAI Press, California. p. 218-229, 1996. 4. Samudrala R. and Moult J. (1998). An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 275, 895-916. 5. Skolnick J., Kolinski A. and Ortiz A. (2000). Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins 38:3-16. 6. Samudrala R, Levitt M. (2000). Decoys 'R' Us: A database of incorrect protein conformations for evaluating scoring functions. Protein Science, 9: 1399-1401.

SAMUDRALA-NEWFOLD (P0051) - 410 predictions: 410 3D

The Bioverse: a Framework for Exploring the Relationships among the Molecular and Organismal Worlds

Jason McDermott and Ram Samudrala University of Washington, Department of Microbiology [email protected]

The large number of sequencing efforts underway has driven the need for ways to better organize, visualize and use the vast amounts of genomic data being generated. The Bioverse is an extensible framework for representing the structural and functional data pertaining to single protein sequences in a genome and the relationships between these proteins in inter- and intra-genomic contexts. Predictions in the Bioverse are assigned confidence values that can be used to combine information from different sources using neural network-based approaches. For example, secondary structure in the Bioverse is predicted by combining standard methods such as Psipred [1], sequence similarity with proteins of known structure, and transmembrane region prediction methods, then using a neural network to derive the final prediction. The framework allows functional annotation of proteins using standard sequence similarity methods (BLAST [2], HMMer [3], PROSITE [4]) as well as through protein-protein interaction and evolutionary network context. Prediction of protein structure using comparative modeling and/or ab initio methods and structural comparison with databases of known structures provides another powerful tool. In this way we are able to provide information about proteins that show little or no sequence similarity with proteins of known function. The Bioverse currently includes sequence, structure and function information and protein-protein interaction and evolutionary network representations of 12 genomes at http://bioverse.compbio.washington.edu.

1. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 2. Altschul S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25 (17), 3389-3402 3. Sonnhammer E.L.L. et al. (1997) Pfam: A comprehensive database of protein domain families based on seed alignments. PROTEINS: Structure, Function and Genetics 28: 405-420 4. Hoffman K. et al. (1999) The PROSITE database, its status in 1999. Nucleic Acids Res. 27: 215-219

SBC (P0084) - 94 predictions: 94 3D

The Pcons And Pmodeller Consensus Fold Recognition Servers

Björn Wallner, Fang Huisheng and Arne Elofsson Stockholm Bioinformatics Center, Stockholm University, 106 91 Stockholm, Sweden [email protected]

See methods section

Solovyev-Softberry (P0270) - 242 predictions: 177 3D, 65 SS

SoftPM: Softberry tools for protein structure modelling

V. Solovyev, D. Affonnikov, A. Bachinsky, I. Titov Ivanisenko and Y. Vorobjev Softberry Inc., 116 Radio Circle, Suite 400 Mount Kisco, NY 10549, USA [email protected]

See methods section

SUPERFAMILY (P0065) - 925 predictions: 925 3D

Structural domain predictions for all genomes

J. Gough Structural Biology, School of Medicine, Stanford University, CA94305-5126, U.S.A. [email protected]

See methods section

THW-FR (P0377) - 241 predictions: 241 3D

Net Charge Center for Protein Fold Recognition: the Large-Scale Fold Recognition in the Framework of CASP-5.

I. Torshin1,2,3, R. Harrison2 and I. Weber3 1 – Chair of Physical Chemistry, Chem. Dept., Moscow State University, 2 – Comp. Sci. Dept., GSU, Atlanta, GA, 3 – Biol. Dept., GSU [email protected]

Net charge center (NCC) is a novel physico-chemical model developed for analysis of the relationship between protein structure and function [unpublished]; it is likely to determine the location of functional regions and sequences if the spatial structure of a protein is known. Sequences around positive and negative charge centers (PNCC) are likely to be folding cores or folding intermediates [1-3]. These two properties of a native protein can be used for fold recognition using a pre-compiled non-redundant “library” of templates. As the template library we have used the non-redundant domain database GTDD (Gestalt Theory Domain Database [unpublished]). Gestalt theory [4], though proposed over 50 years ago, is still one of the best theories describing the principles of perception. The gestalt principles can be computerized and were applied to construct a database of domains. In short, using NCC + PNCC allows one to select potential templates from the GTDD library of templates (that is, to perform fold recognition).

Complete 3D models for all of the CASP-5 targets (T0129-T0195, 67 proteins) were prepared, and 2-5 models for each target were submitted. Although structural data were not available at the time of preparation of this abstract, some preliminary conclusions can still be made. Many targets had distinct sequence identities or otherwise apparent similarities to a known structure: T0137 T0143 T0144 T0150 T0151 T0153 T0154 T0155 T0158 T0160 T0165 T0168 T0169 T0171 T0175-T0179 T0183 T0188 T0189 T0193 (24 proteins) and thus our submitted fold predictions for those targets are likely to be reliable. The 24/67 ratio gives 37% as a lower estimate of the reliability of the FoldRec-CC method. This minimal 37% reliability is, at least, comparable to the CASP-4 results. Analysis of the preliminary results for these 24 proteins also suggests that application of the NCC model alone can predict the correct fold for at least 15 of these 24 proteins. An analysis of the same set of proteins also suggests that, unexpectedly, using a multiple sequence alignment of the target protein for fold recognition does not significantly improve these preliminary results. Some of the targets were recognized as being very likely to belong to a new fold: T0129, T0148, T0161, T0184, T0186 (5 proteins). Although the results for the rest of the targets cannot be assessed at this moment, we have preferred, in general, to submit at least some of the prepared models rather than to increase the number of “new fold” (or, in other words, empty) submissions. The method is fully automated, though, of course, visual inspection of the final 10-20 models for each target, as well as the use of secondary structure predictions [5], can improve the results of fold recognition.

1. Torshin I. et al. (2002) Charge centers and formation of the protein folding core. Proteins, 43:353-364. 2. Torshin I. et al. (2002) Identification of protein folding cores and nuclei using charge center model of protein structure. TheScientificWorld Journal, 2:84-86. 3. Torshin I. et al. (2002) Protein folding: search for basic physical models, submitted. 4. Köhler W. (1947) Gestalt Psychology, 136-279. 5. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195-202.

Tsai (P0061) - 105 predictions: 105 3D

Using a Clustered 9mer Fragment Library and Evaluating Tertiary Interactions in a De Novo, Fragment Based Prediction Method

J. B. Holmes, H. C. Hodges, R. Swanson, D. Schell, R. Bliss, and J. Tsai Texas A&M University, Department of Biochemistry & Biophysics [email protected]

Our approach to de novo structure prediction was naturally influenced by the successful procedures of the Rosetta algorithm created in the Baker lab; however, we developed our own method named Mosaix. Our methods differed from Rosetta in using 1) only a clustered, 9mer move set, and 2) tertiary motif scoring both internally, in the evaluation of moves, and externally, in the final filtering of the decoys.

Fragment library: Our fragment library was created starting with the culled pdb structure list (now PISCES) [1]. Each pdb file in the list was split into all of its possible overlapping 9mers (9 adjacent residues) and the 9mers were grouped initially into super-clusters based on a ProMotif determination of 2° structure [2]. We considered three types of 2° structure: helix, sheet, and other (coil & turn). This created 135,298 9mers in 1,982 super-clusters, and in all, clustered into 34,952 clusters.
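
The splitting of a chain into all overlapping 9mers can be sketched as a sliding window. The function name is hypothetical, and a plain string of one-letter residue codes stands in for the residues parsed from a pdb file.

```python
def overlapping_fragments(residues, length=9):
    """All contiguous fragments of `length` residues, shifted one residue at a time."""
    return [residues[i:i + length] for i in range(len(residues) - length + 1)]
```

A chain of n residues yields n - 8 overlapping 9mers, and a chain shorter than nine residues yields none.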

Tertiary motifs: We have constructed a library of tertiary motifs (TerMo) and have used them in two ways. For quick analysis during construction, we developed a gross TerMo score. Based on statistics of mean distances and vector torsion angles from the TerMo library, we assumed a normal distribution for both of these measures and installed a Gaussian scoring function to score the tertiary structure around contacts for similarity to a database entry for each newly built structure. In post-filtering of structures, we compared directly to the TerMo Library for a refined TerMo score. The peptide was cleaved between 2° structure components, and contacts were determined. These contact pairs were then compared via RMSD to the contact pairs in our local tertiary motif database. Similarity to the database structures conferred a better score, and we chose the structure with the highest tertiary motif score as our primary submission.
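
A Gaussian scoring term of the kind described for the gross TerMo score might look like the sketch below. The function names are hypothetical, and the abstract does not say how the distance and angle terms are combined or normalized, so only the individual terms are shown.

```python
import math


def gaussian_score(value, mean, std):
    """Unnormalized Gaussian: 1.0 at the library mean, falling off with deviation."""
    z = (value - mean) / std
    return math.exp(-0.5 * z * z)


def angular_difference(angle_a, angle_b):
    """Signed difference between two angles in degrees, wrapped to [-180, 180)."""
    return (angle_a - angle_b + 180.0) % 360.0 - 180.0
```

For torsion angles the deviation from the mean should be taken with `angular_difference`, so that 170° and -170° count as 20° apart rather than 340°.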

Heuristic approach: We took 2° structure predictions from the CAFASP results (PHD [3], PSIPRED [4], Sam-T99-2d [5]), and we split long sequences into smaller pieces because of a power-law dependence of calculation time on sequence length. Long sequences were split within regions that lacked strong helical or strand propensities. In deciding how to split sequences we also looked at the CAFASP results and some PSI-Blast results [6]. Tertiary structures were reassembled from the split sequences manually (see below). While Mosaix is a derivative of the Rosetta method, we used only the clustered 9mer library described above. Based on the 2° structure predictions, we chose 150 fragments (50/prediction) and added fifty more fragments that represented all types of 2° structure elements for wild-card rescue from possible error in 2° structure prediction. For each prediction target, Mosaix was run 1000 or more times for ne/e random fragment insertions (n=sequence length). Each new decoy created was kept or thrown out based on a Boltzmann-like Monte Carlo system. The potential function used incorporated the Rosetta environment, pair and bump check, along with the gross TerMo score for 2° structure pairing. After every 1% of the total insertion iterations, the current best-scored structure was output in pdb format, as long as it met fairly relaxed contact order requirements, yielding about 50 decoys per run. In total, ~50K decoys were generated for each target, which were clustered by a multi-centered clustering method [7]. Linked lists and intensive use of memory allowed the clustering algorithm to process 84,000 decoys of 109 residues in 5.5 hours. The 30 cluster centers with the most members were minimized using ENCAD [8] and scored with the refined TerMo score. The top-scoring model was our primary submission, and the remaining ones were reviewed and scored manually by intuition. 
The final four submissions were based on a combination of the intuition-based ranking and the refined TerMo score. For sequences that were split into domains, the structures were then joined by first calculating the phi-psi angles and then building the protein in phi-psi space according to these angles.
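
The keep-or-discard decision for each fragment insertion is described only as "Boltzmann-like". A common form is the Metropolis criterion, sketched here as an assumption about what such a rule looks like rather than as the Mosaix implementation; the function name and the temperature parameter are hypothetical.

```python
import math
import random


def accept_move(old_energy, new_energy, temperature, rng=random):
    """Metropolis rule: always accept downhill moves, sometimes accept uphill ones."""
    if new_energy <= old_energy:
        return True
    boltzmann_factor = math.exp(-(new_energy - old_energy) / temperature)
    return rng.random() < boltzmann_factor
```

Lowering the temperature makes uphill moves rarer, so the search becomes progressively greedier.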

1. Wang G. and Dunbrack R.L. Jr. (2002) PISCES: a protein sequence culling server. Bioinformatics (submitted). 2. Hutchinson E.G. and Thornton J.M. (1996) PROMOTIF--a program to identify and analyze structural motifs in proteins. Protein Sci. 5(2): p. 212-20. 3. Przybylski D. and Rost B. (2002) Alignments grow, secondary structure prediction improves. Proteins 46(2): p. 197-205. 4. McGuffin L.J., Bryson K., and Jones D.T. (2000) The PSIPRED protein structure prediction server. Bioinformatics 16(4): p. 404-5. 5. Karplus K. and Hu B. (2001) Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 17(8): p. 713-20. 6. Altschul S.F. and Koonin E.V. (1998) Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends Biochem Sci 23(11): p. 444-7. 7. Shortle D., Simons K.T., and Baker D. (1998) Clustering of low-energy conformations near the native structures of small proteins. Proc Natl Acad Sci U S A 95(19): p. 11158-62. 8. Levitt M. and Lifson S. (1969) Refinement of protein conformations using a macromolecular energy minimization procedure. J Mol Biol 46(2): p. 269-79.

Zhou-HX (P0056) - 134 predictions: 69 3D, 65 SS

Improving Fold Recognition and Query-Template Alignment by Combining PSI-Blast and Sequence-Structure Threading

H. Chen1, 2 and H.-X. Zhou1 1 – Florida State University, 2 – Drexel University [email protected]

See methods section

CASP5 Software Demonstration Abstracts

harrison (P0188) - 43 predictions: 43 3D

Robust Molecular Modeling

John Petock1, Ping Liu1, Irene T. Weber1, and Robert W. Harrison2,1 1- Department of Biology, 2- Department of Computer Science, Georgia State University [email protected]

Robust molecular modeling is the problem of ensuring that a molecular model can be built that both satisfies a minimal set of input data and explores the range of possible structures that meet those data. Input data typically consist of interatomic distances and partial structures in homology modeling, but can consist solely of distance data in NMR structure determination and ab initio modeling.

Two randomized algorithms for robust modeling are implemented in the computer program AMMP. These include both a self-assembling neural network[1], and simulated annealing distance geometry. The self-assembling neural network uses a Kohonen neural network to mimic the natural self-assembly of polymers. It is quite capable of taking a limited description of a polymer and generating sets of models that satisfy those data. The other algorithm uses Floyd's algorithm (iterated triangle inequality) to fill in the full distance matrix for distance geometry. Standard distance geometry algorithms perform this calculation to estimate interatomic distances. However, unlike standard distance geometry algorithms, it is capable of treating the distances derived via Floyd's algorithm as strict upper bounds rather than distance estimates. The distance geometry equations are solved by a straightforward Metropolis simulated annealing algorithm.
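
The triangle-inequality step can be illustrated with the classic all-pairs update applied to a matrix of upper distance bounds. This is a generic sketch, not code from AMMP; the function name and the list-of-lists input with `math.inf` marking unknown bounds are assumptions.

```python
import math


def smooth_upper_bounds(bounds):
    """Tighten upper distance bounds with the triangle inequality.

    bounds: square symmetric matrix (list of lists); math.inf marks pairs
    with no known bound. Returns a new matrix in which
    upper[i][j] <= upper[i][k] + upper[k][j] holds for every i, j, k.
    """
    n = len(bounds)
    upper = [row[:] for row in bounds]  # copy so the input is left untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = upper[i][k] + upper[k][j]
                if via_k < upper[i][j]:
                    upper[i][j] = via_k
    return upper
```

For three atoms with bounds of 1.5 on the pairs (0, 1) and (1, 2) and no bound on (0, 2), the procedure fills in an upper bound of 3.0 for the missing pair.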

1. Harrison R.W. (1999) A self-assembling neural network for modeling polymer structure. J. Math. Chem 26, 125-137.

Head-Gordon (P0271) - 93 predictions: 93 3D

ProtoShop: Interactive Design of Protein Structures

O. Kreylos1, N. Max1 and S. Crivelli2 1 Department of Computer Science, University of California, Davis, 2 NERSC, Lawrence Berkeley National Laboratory [email protected]

We demonstrate ProtoShop, a software tool that geometrically creates protein structures from amino acid sequence and secondary structure prediction files and allows interactive visualization and manipulation of those structures to design protein configurations. The program has two major stages: In the first stage, an initial protein structure is created from an input file; in the second stage, protein structures are visualized and can be manipulated interactively.

Input to the program is either a PDB file, or a "prediction file" in FASTA format. Prediction files contain the amino acid sequence for a given protein, and specify each residue's secondary structure type as one of α-helix, β-strand or coil. When reading a prediction file, protein structures are created one amino acid residue at a time.

Each residue's type is read from the input file, and atom positions are read from residue template files. As the protein is assembled, the program sets the dihedral angles of each added residue according to its specified secondary structure type, and attaches the created residue to the end of the existing protein. This way, proteins are created with secondary structures already assembled. Typically, creating a protein is instantaneous.
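The assembly step described above can be sketched as a per-residue lookup of dihedral angles. This is an illustration only, not ProtoShop code: the three-line record layout (header, sequence, structure string of H/E/C) is a hypothetical stand-in for the prediction file format, and the angle values are commonly quoted approximations chosen for the example (the coil value is an arbitrary placeholder).

```python
# Illustrative backbone dihedral angles (phi, psi) in degrees.
# These values are assumptions for the sketch, not ProtoShop's settings.
DIHEDRALS = {
    "H": (-57.0, -47.0),   # alpha-helix
    "E": (-120.0, 130.0),  # beta-strand
    "C": (-60.0, 140.0),   # coil: arbitrary placeholder
}

def read_prediction(text):
    """Parse a hypothetical record: '>name', sequence line, structure line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 3 or not lines[0].startswith(">"):
        raise ValueError("expected a header, a sequence line and a structure line")
    name, sequence, structure = lines[0][1:], lines[1], lines[2]
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure strings differ in length")
    return name, sequence, structure

def build_chain(sequence, structure):
    """Append residues one at a time, each with angles set by its structure type."""
    chain = []
    for residue, ss_type in zip(sequence, structure):
        phi, psi = DIHEDRALS[ss_type]
        chain.append((residue, ss_type, phi, psi))
    return chain

example = ">demo\nAEELLKKGVVV\nHHHHHHCCEEE\n"
name, seq, ss = read_prediction(example)
chain = build_chain(seq, ss)
```

A real builder would also place atoms from residue templates and apply the angles as rotations; the sketch stops at the angle assignment because that is the part the text specifies.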

Once a protein structure has been created, the program can visualize it using several rendering styles. The main purpose for visualization is to aid a user in the manipulation of the protein structure. Therefore, visualization includes manipulation guides that are not part of the protein itself, such as indicators for hydrogen bonds. Also, since collisions between atoms are not prohibited during manipulation, they are visualized to call them to the user's attention.

Protein structures can be manipulated in two main ways: First, structure types can be changed for individual residues on-the-fly, and secondary structures can be manipulated as a whole, e.g., by twisting or curling beta strands or re-forming alpha helices. Second, entire partial structure assemblies or secondary structures can be dragged to form tertiary structure. Dragging is achieved by automatically adjusting dihedral angles of selected coil regions using an Inverse Kinematics (IK) method [1]. IK allows the manipulator to translate a user’s six-degree-of-freedom motions into changes of a chain segment’s dihedral angles φ and ψ. This gives a user great flexibility in aligning parts of proteins without breaking the entire protein. The main application of dragging is to form arbitrary beta sheet alignments, either manually, or assisted by selecting residues for automatic bonding. Manipulation guides such as hydrogen bond indicators and potential bond site indicators were specifically designed for the purpose of forming beta sheets. These guides are updated in real-time during manipulation, including forming/breaking of hydrogen bonds and visualization of atom collisions.

Created protein structures can be saved in PDB format at any time during manipulation, either to serve as input to other programs, or to reload structures at a later time. ProtoShop has been used to create initial configurations for the global optimization method used by the Head-Gordon group. The tool has allowed this group to tackle proteins of any size and topology.

1. Welman C. (1993) Inverse kinematics and geometric constraints for articulated figure manipulation. Master’s Thesis, Simon Fraser University, Vancouver, Canada.

HOGUE-SLRI (P0267) - 254 predictions: 254 3D

The Distributed Folding Project

H.J. Feldman1,2 and C.W.V. Hogue1,2 1 – Samuel Lunenfeld Research Institute, Mount Sinai Hospital 2 – Department of Biochemistry, University of Toronto [email protected]

The number of users connected to the internet is growing faster than ever before. High-speed connections are becoming more and more common, and will soon be the norm in many countries across the world as the modem goes the way of the dinosaurs. This, combined with the fact that the average computing power of a home machine is now comparable to that of supercomputers just a few decades ago, means that there are massive amounts of computing resources becoming available, all linked through one common medium – the internet.

A total of 13 targets were predicted with the help of distributed computing using an ab initio approach. Using a modified version of our highly parallelizable TRADES algorithm [1] we developed a distributed computing application, The Distributed Folding Project, to sample protein conformational space. We incorporated secondary structure prediction from PsiPred [2] and performed all-atom kinetic random walks in Ramachandran space on client CPUs, biased by the 3-state secondary structure prediction. Sidechains were placed probabilistically using Dunbrack's backbone-dependent rotamer library [3]. All residues are chirally and sterically valid, having a minimum of non-hydrogen van der Waals collisions.

Users download and run the software in the form of a Windows screensaver or a Windows/UNIX ASCII art text client. The software generates probabilistic conformers as described above and submits the results to our central server for analysis and storage. Every time a new protein is begun, the software automatically updates itself, downloading the information on the new protein from our server, and the data is digitally signed for security purposes.

Up to one billion structures were generated for each target using the Distributed Folding Project framework (http://www.distributedfolding.org/). This allowed us to make use of spare CPU cycles on thousands of computers from volunteers across the world to sample vast amounts of conformational space.

From the pool of generated structures various statistics were collected including radius of gyration, exposed surface area, exposed hydrophobic surface area, and energy score according to three different scoring functions: the EEF1 solvation term, a modified version of a statistical residue-based potential [4] which also compared actual secondary structure content to predicted content, and a species-specific contact potential developed in our lab. Structures with radii of gyration greater than 120% * 2.59 * N^0.346, where N is the number of residues in the protein, were all discarded. This ensured only compact structures were retained. The best structures were chosen based on their energy scores.
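The compactness filter above can be written out as a short sketch. The function names are hypothetical, and two details are assumptions because the text does not state them: the radius of gyration is computed unweighted over one coordinate per residue, and the cutoff is taken to be in the same length unit as the coordinates.

```python
import math

def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum(
        (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2 for p in coords
    ) / n
    return math.sqrt(mean_sq)

def rg_cutoff(n_residues, tolerance=1.2):
    """Cutoff from the text: 120% * 2.59 * N^0.346."""
    return tolerance * 2.59 * n_residues ** 0.346

def is_compact(coords, n_residues):
    """Keep a structure only if its radius of gyration does not exceed the cutoff."""
    return radius_of_gyration(coords) <= rg_cutoff(n_residues)

# Hypothetical two-point example: centroid at the origin, both points at distance 1.
example = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)]
example_rg = radius_of_gyration(example)
```

For a 100-residue protein the cutoff evaluates to roughly 15.3 in the units of the formula, so any conformer with a larger radius of gyration would be discarded before scoring.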

The Distributed Folding Project serves as a rapid testing ground for evaluation of new sampling algorithms and scoring functions, limited only in that they must remain parallelizable. When used with proteins of known structure, it can reveal how well different scoring functions are able to distinguish near-native structures from a large pool of decoy conformers, and how quickly different sampling algorithms converge towards native-like structure.

1. Feldman H.J. and Hogue C.W.V. (2000) A Fast Method to Sample Real Protein Conformational Space. Proteins 39 (2), 112-131. 2. Jones D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292 (2), 195-202. 3. Dunbrack R.L., Jr. and Karplus M. (1993) Backbone-dependent rotamer library for proteins. Application to side-chain prediction. J.Mol.Biol. 230 (2), 543-574. 4. Bryant S.H. and Lawrence C.E. (1993) An Empirical Energy Function for Threading Protein Sequence through the Folding Motif. Proteins 16 (1), 92-112.

Osgdj (P0292) - 100 predictions: 100 3D

The PROTSCAPE Protein Folding WEB Server

D.J. Osguthorpe & N. WhiteLegg University of Bath in Swindon [email protected]

The PROTSCAPE protein folding WEB server is a CGI interface to the PROTSCAPE protein folding algorithm. The sequence is simply pasted into the submission page and the generated conformations are returned by e-mail.

A Beowulf cluster based on 20 dual 1.2 GHz Athlon processors is used to perform the simulated annealing calculations required to generate the structures. The resulting simplified model is converted to an all-heavy-atom model using a combined procedure of RMS fitting and building. The RMS fitting generates the all-atom backbone and is followed by a side chain building procedure that builds the all-atom side chains using the simplified model side chain atoms as guide points. Because the problem is computationally heavy, a run takes about 24 hours per 100 residues.

CASP5 Other Abstracts

Evaluation of Blind Predictions of Protein-Protein Interactions Made in the CAPRI Experiment

Raul Mendez, Raphael Laplae, Leonardo DeMaria and Shoshana J. Wodak Service de Conformation de Macromolécules Biologiques et Bioinformatique, Cp263, Université libre de Bruxelles, Blv du Triomphe, 1050 Brussels, Belgium. [email protected]

Tens of thousands of gene products are known or suspected to interact with many others, based on genetic, biochemical or bioinformatics methods, forming millions of putative complexes. Only a very small fraction of these complexes will be characterised, let alone have their 3D structure determined, in the near future. Docking procedures, which predict the modes of association from the structures of the components, have therefore received renewed attention recently. But before predicted modes of association can serve as a guide in genetic and biochemical experiments, the performance of the prediction methods must be systematically assessed. The Critical Assessment of PRedicted Interactions experiment (CAPRI) is a community-wide blind test, similar to CASP, but devoted to docking procedures (http://capri.ebi.ac.uk/Charleston.html). It aims at assessing the state of the art of methods for predicting protein-protein interactions from the 3D structure of the unbound components.

Here we report the results of the evaluation that we conducted on 535 predictions submitted in two rounds of the CAPRI experiment by 19 different groups for a total of 7 target complexes. Several of the complexes were large multisubunit assemblies, and some featured conformational changes between the bound and unbound species. We recently assessed these predictions and presented the results to the predictors during the 1st CAPRI evaluation meeting, held in France on Sep. 19-21, 2002; here we present highlights from this evaluation. To perform the evaluation we computed, for each predicted complex, the fraction of native residue-residue contacts (those observed between the interacting molecules in the target complex) that is recovered in the predicted structure and the fraction of native interface residues that is recovered on each face of the contacting proteins. We also quantified the rigid-body transformation (center of coordinates translation and rigid-body rotation) that is required to bring the predicted complex into register with the target and computed two different rmsd values: one between the main chain of interface residues in the target versus the prediction, and another between the main chain of molecules B in the prediction versus the target, after molecules A of both complexes were optimally superimposed.

Although the predictions overall cannot be qualified as successful, for each target a few groups succeeded in coming close, and sometimes very close, to the right answer. But different groups contributed successful predictions for different targets. It was very encouraging that near-correct predictions were also made in difficult cases where the components undergo some conformational changes upon binding. It appeared that in these and some other cases, using biochemical information to guide the calculations provided a clear advantage, but in other cases such information was misleading. The evaluation and ensuing discussions were also useful in pointing out directions for future progress. Amongst the different docking procedures, a few were clearly computationally very efficient in sampling potential docking solutions, whereas others had better criteria for scoring these solutions. Approaches that combine the more efficient search algorithms with the best scoring functions should therefore be a good way forward. Another avenue for progress will undoubtedly come from several novel procedures, tested for the first time in CAPRI, with quite promising results. The results of the CAPRI experiment will be published in a special issue of Proteins: Structure, Function, and Genetics in the spring of 2003.
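The contact-recovery measure described above can be sketched as a set comparison. This is not the evaluators' code: the data layout (a dictionary from residue identifier to atom coordinates), the function names and the 5.0 distance cutoff are all assumptions, since the abstract does not give the contact definition.

```python
import math

def residue_contacts(mol_a, mol_b, cutoff=5.0):
    """Return the set of (residue_a, residue_b) pairs in contact.

    mol_a and mol_b map a residue identifier to a list of (x, y, z) atom
    coordinates. Two residues are in contact if any atom pair lies within
    the cutoff (an assumed definition for this sketch).
    """
    contacts = set()
    for res_a, atoms_a in mol_a.items():
        for res_b, atoms_b in mol_b.items():
            if any(math.dist(p, q) <= cutoff for p in atoms_a for q in atoms_b):
                contacts.add((res_a, res_b))
    return contacts

def fraction_native_contacts(native, predicted):
    """Fraction of native residue-residue contacts recovered in the prediction."""
    if not native:
        raise ValueError("the target complex has no contacts")
    return len(native & predicted) / len(native)

# Hypothetical toy complex: molecule A is fixed, molecule B moves in the prediction.
mol_a = {"A10": [(0.0, 0.0, 0.0)], "A11": [(4.0, 0.0, 0.0)]}
target_b = {"B5": [(3.0, 0.0, 0.0)], "B6": [(20.0, 0.0, 0.0)]}
predicted_b = {"B5": [(8.0, 0.0, 0.0)], "B6": [(20.0, 0.0, 0.0)]}

native = residue_contacts(mol_a, target_b)
predicted = residue_contacts(mol_a, predicted_b)
fraction = fraction_native_contacts(native, predicted)
```

In the toy example the target has two contacts and the prediction keeps one of them, so half of the native contacts are recovered. The interface-residue fraction mentioned in the text could be computed the same way on the residues of each molecule that appear in any contact.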

The Pittsburgh Supercomputing Center and Hewlett Packard Support of CASP5

Troy Wymore1, Angela Loh2, David Deerfield II1 , Ralph Roskies1 and Ken Hackworth1 1Pittsburgh Supercomputing Center, 2Hewlett Packard Life and Material Sciences Division [email protected]

The National Science Foundation Partnerships for Advanced Computational Infrastructure program allocated computing time on PSC’s Terascale Computing System (TCS) for researchers participating in CASP5. The TCS comprises 3,000 HP Alpha Server EV68 processors and has a peak capability of six teraflops (six trillion operations per second), making it the most powerful system in the world for open research. This large-scale computational resource was intended to advance structure predictions and refinements by allowing researchers to use more accurate potentials and/or better sample conformational space. This presentation will detail the computational resources made available for the prediction season, the process by which time was allocated, the resource usage and plans for future CASP experiments.

A-274 A-275 Abstract Contents (by abstract type & group)

A-276 CAMACHO-CARLOS (P0099) - 184 PREDICTIONS: 184 3D...... 32 Methods Abstracts CASPITA (P0108) - 133 PREDICTIONS: 70 3D, 63 SS...... 33 CASPITA (P0108) - 133 PREDICTIONS: 70 3D, 63 SS...... 34 CBC-FOLD (P0008) - 151 PREDICTIONS: 151 3D...... 36 123D_SERVER (P0476) - 68 PREDICTIONS: 68 3D...... 3 CBRC (P0041) - 385 PREDICTIONS: 279 3D, 105 SS, 1 DR...... 37 ACCELRYS (P0210) - 24 PREDICTIONS: 24 3D...... 3 CBSU (P0417) - 173 PREDICTIONS: 173 3D...... 38 ADVANCED-ONIZUKA (P0214) - 92 PREDICTIONS: 92 3D...... 4 CELLTECH (P0028) - 347 PREDICTIONS: 347 3D...... 39 ALAX (P0234) - 39 PREDICTIONS: 39 3D...... 5 CHEN-WENDY (P0264) - 37 PREDICTIONS: 37 3D...... 41 ALIGNERS (P0064) - 31 PREDICTIONS: 31 3D...... 6 CHIMERA (P0153) - 94 PREDICTIONS: 94 3D...... 41 ARBY-SCAI (P0183) - 68 PREDICTIONS: 68 3D...... 7 CHIMERAX (P0170) - 74 PREDICTIONS: 74 3D...... 42 AS2TS (P0081) - 26 PREDICTIONS: 26 3D...... 8 CIRB (P0397) - 263 PREDICTIONS: 200 3D, 63 RR...... 43 ATOME (P0464) - 318 PREDICTIONS: 318 3D...... 9 CIRB (P0397) - 263 PREDICTIONS: 200 3D, 63 RR...... 44 AVBELJ-FRANC (P0341) - 25 PREDICTIONS: 25 3D...... 10 DELCLAB (P0050) - 310 PREDICTIONS: 310 3D...... 45 BAKER (P0002) - 377 PREDICTIONS: 377 3D...... 11 DONIACH (P0401) - 42 PREDICTIONS: 42 3D...... 47 BAKER (P0002) - 377 PREDICTIONS: 377 3D...... 12 DOROTA (P0589) - 1 PREDICTION: 1 3D...... 47 BAKER-ROBETTA (P0029) - 199 PREDICTIONS: 199 3D...... 13 DUNBRACK (P0329) - 46 PREDICTIONS: 46 3D...... 48 BALDI (P0021) - 61 PREDICTIONS: 61 3D...... 14 DUNKER-KEITH (P0355) - 195 PREDICTIONS: 195 DR...... 49 BALDI-CONPRO (P0022) - 62 PREDICTIONS: 62 RR...... 14 ESYPRED3D (P0034) - 36 PREDICTIONS: 36 3D...... 51 BALDI-SSPRO (P0023) - 63 PREDICTIONS: 63 SS...... 14 EVOLUTIONARIES (P0180) - 99 PREDICTIONS: 99 3D...... 52 CMAP23DPRO (P0253) - 1 PREDICTION: 1 3D...... 14 FAMS (P0168) - 324 PREDICTIONS: 324 3D...... 54 CMAPPRO (P0255) - 0 PREDICTIONS...... 14 FAMSD (P0169) - 322 PREDICTIONS: 322 3D...... 
55 SSPRO2 (P0254) - 65 PREDICTIONS: 65 SS...... 14 FFAS03 (P0309) - 314 PREDICTIONS: 314 3D...... 56 BASS-MICHAEL (P0384) - 51 PREDICTIONS: 51 3D...... 16 FLOHIL (P0545) - 3 PREDICTIONS: 3 3D...... 56 BATES-PAUL (P0096) - 72 PREDICTIONS: 72 3D...... 16 FLOHIL (P0545) - 3 PREDICTIONS: 3 3D...... 57 BENNER-STEVE (P0524) - 35 PREDICTIONS: 18 3D, 17 SS...... 18 FISCHER (P0427) - 161 PREDICTIONS: 161 3D...... 57 BENNER-STEVE (P0524) - 35 PREDICTIONS: 18 3D, 17 SS...... 19 FLOUDAS-C.A. (P0011) - 15 PREDICTIONS: 15 3D...... 58 BILAB (P0080) - 200 PREDICTIONS: 200 3D...... 20 FM-AF (P0571) - 17 PREDICTIONS: 17 3D...... 59 BIOINFO.PL (P0006) - 75 PREDICTIONS: 75 3D...... 21 FORTE1 (P0290) - 276 PREDICTIONS: 276 3D...... 61 BIOKOL (P0258) - 23 PREDICTIONS: 23 3D...... 22 FRIESNER (P0112) - 174 PREDICTIONS: 174 3D...... 62 BION (P0474) - 63 PREDICTIONS: 63 SS...... 23 FROST-MIG (P0047) - 72 PREDICTIONS: 72 3D...... 65 BIONOMIX (P0475) - 61 PREDICTIONS: 61 3D...... 24 FUGUE2 (P0014) - 330 PREDICTIONS: 330 3D...... 66 BRAUN-WERNER (P0024) - 65 PREDICTIONS: 65 3D...... 25 FUGUE3 (P0226) - 330 PREDICTIONS: 330 3D...... 66 BROOKS (P0373) - 252 PREDICTIONS: 252 3D...... 26 GARNIER-KLOCZKOWSKI (P0396) - 91 PREDICTIONS: 91 SS...... 67 BUJNICKI-JANUSZ (P0020) - 215 PREDICTIONS: 67 3D, 58 SS, 49 RR, 41 GEM (P0359) - 76 PREDICTIONS: 76 3D...... 68 DR...... 27 GEM (P0359) - 76 PREDICTIONS: 76 3D...... 69 BYSTROFF (P0131) - 132 PREDICTIONS: 45 3D, 40 SS, 45 RR, 2 DR. 28 GEM (P0359) - 76 PREDICTIONS: 76 3D...... 70 CAM-BIOCHEM (P0447) - 74 PREDICTIONS: 74 3D...... 30 GENESILICO (P0517) - 195 PREDICTIONS: 86 3D, 64 SS, 45 RR...... 71 CAMACHO-CARLOS (P0098) - 46 PREDICTIONS: 46 3D...... 31 GERLOFF (P0240) - 9 PREDICTIONS: 9 3D...... 73

GINALSKI (P0453) - 71 PREDICTIONS: 71 3D ...... 74
HARRISON (P0188) - 43 PREDICTIONS: 43 3D ...... 75
HEAD-GORDON (P0271) - 93 PREDICTIONS: 93 3D ...... 76
HMMSPECTR (P0025) - 285 PREDICTIONS: 285 3D ...... 78
HO-KAI-MING (P0437) - 129 PREDICTIONS: 129 3D ...... 80
HOGUE-SLRI (P0267) - 254 PREDICTIONS: 254 3D ...... 81
HOLM (P0090) - 38 PREDICTIONS: 38 3D ...... 82
HONIG (P0110) - 113 PREDICTIONS: 113 3D ...... 83
HUBER-TORDA (P0351) - 83 PREDICTIONS: 83 3D ...... 85
I-SITES/BYSTROFF (P0132) - 64 PREDICTIONS: 64 3D ...... 86
INFORMAX (P0326) - 24 PREDICTIONS: 24 3D ...... 87
IRBACK (P0559) - 20 PREDICTIONS: 20 3D ...... 88
IST-ZORAN (P0454) - 195 PREDICTIONS: 195 DR ...... 88
JAGER (P0582) - 7 PREDICTIONS: 7 3D ...... 89
JIVE (P0506) - 37 PREDICTIONS: 37 3D ...... 90
JONES (P0067) - 121 PREDICTIONS: 68 3D, 53 SS ...... 91
JONES-NEWFOLD (P0068) - 214 PREDICTIONS: 87 3D, 63 SS, 64 DR ...... 92
KAZNESSIS (P0548) - 15 PREDICTIONS: 15 3D ...... 93
KEASAR (P0429) - 90 PREDICTIONS: 90 3D ...... 94
KGI-QMW (P0015) - 19 PREDICTIONS: 19 3D ...... 95
KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS ...... 96
KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS ...... 97
KIM-PARK (P0442) - 65 PREDICTIONS: 65 SS ...... 98
LAMBERT-CHRISTOPHE (P0035) - 131 PREDICTIONS: 131 3D ...... 99
LIBELLULA (P0230) - 216 PREDICTIONS: 216 3D ...... 100
LOMIZE-ANDREI (P0288) - 76 PREDICTIONS: 76 3D ...... 101
LUETHY (P0419) - 240 PREDICTIONS: 240 3D ...... 101
LUND-OLE (P0391) - 39 PREDICTIONS: 39 3D ...... 102
MACCALLUM (P0393) - 130 PREDICTIONS: 130 SS ...... 104
MARTIN-ANDREW (P0471) - 55 PREDICTIONS: 55 3D ...... 106
MELLER-ADAMCZAK (P0441) - 23 PREDICTIONS: 23 3D ...... 107
LEVITT (P0016) - 350 PREDICTIONS: 350 3D ...... 108
MPALIGN (P0135) - 327 PREDICTIONS: 327 3D ...... 110
MURZIN (P0448) - 21 PREDICTIONS: 21 3D ...... 110
MZ-BRUSSELS (P0246) - 54 PREDICTIONS: 54 3D ...... 112
NEXXUS-DELRIO (P0370) - 7 PREDICTIONS: 7 3D ...... 114
ORNL-PROSPECT (P0012) - 330 PREDICTIONS: 330 3D ...... 116
OSGDJ (P0292) - 100 PREDICTIONS: 100 3D ...... 117
PAN (P0032) - 164 PREDICTIONS: 99 3D, 65 SS ...... 120
PAS (P0513) - 73 PREDICTIONS: 73 3D ...... 121
PILOT (P0378) - 146 PREDICTIONS: 146 3D ...... 122
POMI (P0465) - 46 PREDICTIONS: 46 3D ...... 123
PREISSNER (P0488) - 20 PREDICTIONS: 20 3D ...... 123
PROTFINDER (P0282) - 222 PREDICTIONS: 222 3D ...... 124
PROTINFO-AB (P0140) - 260 PREDICTIONS: 260 3D ...... 126
PROTINFO-CM (P0138) - 251 PREDICTIONS: 251 3D ...... 127
PROTINFO-FR (P0139) - 325 PREDICTIONS: 325 3D ...... 129
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D ...... 130
RAGHAVA-GAJENDARA (P0054) - 482 PREDICTIONS: 224 3D, 258 SS ...... 131
APSSP/RAGHAVA-GAJENDRA (P0137) - 65 PREDICTIONS: 65 SS ...... 131
APSSP2/RAGHAVA-GAJENDRA (P0055) - 65 PREDICTIONS: 65 SS ...... 132
RAPTOR (P0144) - 227 PREDICTIONS: 227 3D ...... 132
ROKKO (P0327) - 109 PREDICTIONS: 109 3D ...... 133
RON-ELBER (P0300) - 259 PREDICTIONS: 259 3D ...... 134
RYKUNOV-REVA-TARAKANOV (P0529) - 198 PREDICTIONS: 198 3D ...... 135
SAM-T02-HUMAN (P0001) - 203 PREDICTIONS: 138 3D, 65 SS ...... 136
SAM-T02-SERVER (P0189) - 221 PREDICTIONS: 221 3D ...... 137
SAMUDRALA-COMPARATIVE-MODELLING (P0053) – ...... 139
SAMUDRALA-FOLD-RECOGNITION (P0052) - 315 PREDICTIONS: 315 3D ...... 140
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 142
SASSON-IRIS (P0265) - 66 PREDICTIONS: 66 3D ...... 143
SBC (P0084) - 94 PREDICTIONS: 94 3D ...... 144
SCHERAGA-HAROLD (P0314) - 135 PREDICTIONS: 135 3D ...... 145
SCHULTEN-WOLYNES (P0093) - 118 PREDICTIONS: 118 3D ...... 146
SDSC2:REDDY-BOURNE (P0347) - 54 PREDICTIONS: 54 3D ...... 147
SHAKHNOVICH-EUGENE (P0459) - 26 PREDICTIONS: 26 3D ...... 148
SHESTOPALOV (P0044) - 159 PREDICTIONS: 79 3D, 80 SS ...... 149
SHORTLE (P0349) - 32 PREDICTIONS: 32 3D ...... 150
SK-LAB (P0403) - 2 PREDICTIONS: 2 3D ...... 150
SKOLNICK-KOLINSKI (P0010) - 361 PREDICTIONS: 361 3D ...... 152
SMD-CCS (P0249) - 4 PREDICTIONS: 4 3D ...... 152
SOLOVYEV-SOFTBERRY (P0270) - 242 PREDICTIONS: 177 3D, 65 SS ...... 153
SPAM1 (P0400) - 87 PREDICTIONS: 87 3D ...... 154
SRBI (P0331) - 109 PREDICTIONS: 109 3D ...... 156
STERNBERG (P0105) - 71 PREDICTIONS: 71 3D ...... 156

SUNDARAMS (P0381) - 0 PREDICTIONS ...... 158
SUPERFAMILY (P0065) - 925 PREDICTIONS: 925 3D ...... 159
SUPFAM_PP (P0086) - 728 PREDICTIONS: 728 3D ...... 160
SZED-ASMAT (P0515) - 6 PREDICTIONS: 6 3D ...... 161
TAYLOR (P0423) - 113 PREDICTIONS: 113 3D ...... 161
TCS-BIOINFORMATICS (P0404) - 40 PREDICTIONS: 40 SS ...... 162
THW-FR (P0377) - 241 PREDICTIONS: 241 3D ...... 163
TOME (P0450) - 260 PREDICTIONS: 260 3D ...... 164
UCLA-DOE (P0301) - 59 PREDICTIONS: 59 3D ...... 165
VENCLOVAS (P0425) - 20 PREDICTIONS: 20 3D ...... 166
WOLYNES-SCHULTEN (P0294) - 42 PREDICTIONS: 42 3D ...... 167
YAN-RESEARCH (P0069) - 60 PREDICTIONS: 60 SS ...... 168
YASARA-PUSHCHINO (P0202) - 192 PREDICTIONS: 192 3D ...... 169
YOON (P0262) - 35 PREDICTIONS: 35 3D ...... 170
YOON (P0262) - 35 PREDICTIONS: 35 3D ...... 171
ZHOU-HX (P0056) - 134 PREDICTIONS: 69 3D, 65 SS ...... 171

Poster Abstracts

ACCELRYS (P0210) - 24 PREDICTIONS: 24 3D ...... 175
ALAX (P0234) - 39 PREDICTIONS: 39 3D ...... 175
ALIGNERS (P0064) - 31 PREDICTIONS: 31 3D ...... 176
ARBY-SCAI (P0183) - 68 PREDICTIONS: 68 3D ...... 176
BAKER (P0002) - 377 PREDICTIONS: 377 3D ...... 176
BAKER (P0002) - 377 PREDICTIONS: 377 3D ...... 176
BIOGEN (P0440) - 28 PREDICTIONS: 28 3D ...... 177
BION (P0474) - 63 PREDICTIONS: 63 SS ...... 177
BRAUN-WERNER (P0024) - 65 PREDICTIONS: 65 3D ...... 178
BURNHAM (P0516) - 68 PREDICTIONS: 68 3D ...... 179
BYSTROFF (P0131) - 132 PREDICTIONS: 45 3D, 40 SS, 45 RR, 2 DR ...... 180
CAMACHO-CARLOS (P0098) - 46 PREDICTIONS: 46 3D ...... 180
CASPITA (P0108) - 133 PREDICTIONS: 70 3D, 63 SS ...... 181
CBC-FOLD (P0008) - 151 PREDICTIONS: 151 3D ...... 181
CHIMERA (P0153) - 94 PREDICTIONS: 94 3D ...... 182
CHIMERAX (P0170) - 74 PREDICTIONS: 74 3D ...... 182
CIRB (P0397) - 263 PREDICTIONS: 200 3D, 63 RR ...... 182
DELCLAB (P0050) - 310 PREDICTIONS: 310 3D ...... 183
DUNBRACK (P0329) - 46 PREDICTIONS: 46 3D ...... 184
DUNBRACK (P0329) - 46 PREDICTIONS: 46 3D ...... 185
EVOLUTIONARIES (P0180) - 99 PREDICTIONS: 99 3D ...... 186
GENESILICO.PL-SERVERS-ONLY (P0242) - 68 PREDICTIONS: 66 3D, 2 SS ...... 186
BUJNICKI-JANUSZ (P0020) - 215 PREDICTIONS: 67 3D, 58 SS, 49 RR, 41 DR ...... 186
GENESILICO (P0517) - 195 PREDICTIONS: 86 3D, 64 SS, 45 RR ...... 186
GERLOFF (P0240) - 9 PREDICTIONS: 9 3D ...... 187
HO-KAI-MING (P0437) - 129 PREDICTIONS: 129 3D ...... 188
HOGUE-SLRI (P0267) - 254 PREDICTIONS: 254 3D ...... 189
HUBER-TORDA (P0351) - 83 PREDICTIONS: 83 3D ...... 190
JIVE (P0506) - 37 PREDICTIONS: 37 3D ...... 190
JONES (P0067) - 121 PREDICTIONS: 68 3D, 53 SS ...... 191
JONES (P0067) - 121 PREDICTIONS: 68 3D, 53 SS ...... 192
KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS ...... 193

KIAS (P0531) - 479 PREDICTIONS: 176 3D, 303 SS ...... 194
LAMBERT-CHRISTOPHE (P0035) - 131 PREDICTIONS: 131 3D ...... 195
LOMIZE-ANDREI (P0288) - 76 PREDICTIONS: 76 3D ...... 196
LUND-OLE (P0391) - 39 PREDICTIONS: 39 3D ...... 196
LEVITT (P0016) - 350 PREDICTIONS: 350 3D ...... 196
LEVITT (P0016) - 350 PREDICTIONS: 350 3D ...... 197
NEXXUS-DELRIO (P0370) - 7 PREDICTIONS: 7 3D ...... 198
ORNL-PROSPECT (P0012) - 330 PREDICTIONS: 330 3D ...... 198
ORNL-PROSPECT (P0012) - 330 PREDICTIONS: 330 3D ...... 199
PROTFINDER (P0282) - 222 PREDICTIONS: 222 3D ...... 200
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D ...... 201
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D ...... 202
PUSHCHINO (P0203) - 263 PREDICTIONS: 263 3D ...... 202
ROKKO (P0327) - 109 PREDICTIONS: 109 3D ...... 203
RON-ELBER (P0300) - 259 PREDICTIONS: 259 3D ...... 204
SAM-T02-SERVER (P0189) - 221 PREDICTIONS: 221 3D ...... 204
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 204
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 205
SAMUDRALA-NEWFOLD (P0051) - 410 PREDICTIONS: 410 3D ...... 206
SBC (P0084) - 94 PREDICTIONS: 94 3D ...... 206
SOLOVYEV-SOFTBERRY (P0270) - 242 PREDICTIONS: 177 3D, 65 SS ...... 206
SUPERFAMILY (P0065) - 925 PREDICTIONS: 925 3D ...... 207
THW-FR (P0377) - 241 PREDICTIONS: 241 3D ...... 207
TSAI (P0061) - 105 PREDICTIONS: 105 3D ...... 208
ZHOU-HX (P0056) - 134 PREDICTIONS: 69 3D, 65 SS ...... 209

Demonstration Abstracts

HARRISON (P0188) - 43 PREDICTIONS: 43 3D ...... 213
HEAD-GORDON (P0271) - 93 PREDICTIONS: 93 3D ...... 213
HOGUE-SLRI (P0267) - 254 PREDICTIONS: 254 3D ...... 214
OSGDJ (P0292) - 100 PREDICTIONS: 100 3D ...... 215

Other Abstracts

EVALUATION OF BLIND PREDICTIONS OF PROTEIN-PROTEIN INTERACTIONS MADE IN THE CAPRI EXPERIMENT ...... 219
The Pittsburgh Supercomputing Center and Hewlett Packard Support of CASP5 ...... 220
