New Methods for the Prediction and Classification of Protein Domains

New Methods for the Prediction and Classi¯cation of Protein Domains Jan Erik Gewehr MÄunchen2007 New Methods for the Prediction and Classi¯cation of Protein Domains Jan Erik Gewehr Dissertation an der FakultÄatfÄurMathematik und Informatik der Ludwig{Maximilians{UniversitÄat MÄunchen vorgelegt von Jan Erik Gewehr aus LÄubeck MÄunchen, den 01.10.2007 Erstgutachter: Prof. Dr. Ralf Zimmer Zweitgutachter: Prof. Dr. Dmitrij Frishman Tag der mÄundlichen PrÄufung:19.12.2007 Contents 1 Motivation and Overview 1 1.1 The Bene¯t of Protein Structure Prediction ................. 1 1.2 Thesis Outline .................................. 3 2 Introduction to Protein Structure Prediction 7 2.1 Protein Structures and Related Databases .................. 7 2.1.1 From Primary to Tertiary Structure .................. 7 2.1.2 Structure-Related Databases ...................... 8 2.2 Assignment of Protein Domains ........................ 10 2.2.1 An Old Problem ............................ 10 2.2.2 Comparison of Domain Assignments . 11 2.3 Structure Prediction Categories ........................ 13 2.3.1 Comparative Modeling ......................... 13 2.3.2 Fold Recognition ............................ 14 2.3.3 Ab Initio ................................. 15 2.4 Community-Wide E®orts ............................ 15 2.4.1 Community-Wide Experiments .................... 15 2.4.2 Structural Genomics and Structure Prediction . 16 2.5 Alignment Methods ............................... 16 2.5.1 Sequence Alignment .......................... 16 2.5.2 Alignment Methods used in this Work . 19 3 Selection of Fold Classes based on Secondary Structure Elements 23 3.1 Introduction ................................... 23 3.2 Material ..................................... 25 3.2.1 Training and Test Data ......................... 25 3.2.2 Quoted Methods ............................ 27 3.3 Preselection of Fold Classes .......................... 28 3.3.1 Secondary Structure Element Alignment (SSEA) . 28 3.3.2 Selection Strategies based on SSEA . 29 3.4 Re¯nement with Pro¯le-Pro¯le Alignment . 31 3.5 Results ...................................... 31 3.5.1 Preselection Performance on ASTRAL25 . 32 vi CONTENTS 3.5.2 Fold Recognition Accuracy ....................... 33 3.5.3 Speed-Up Evaluation .......................... 34 3.6 Discussion .................................... 34 4 AutoSCOP: Unique Mapping of Patterns to SCOP 37 4.1 Introduction ................................... 38 4.2 Material ..................................... 40 4.2.1 InterPro and its Member Databases . 40 4.2.2 ASTRAL Asteroids and Family HMMs . 41 4.2.3 Training Data .............................. 41 4.2.4 Test Data ................................ 41 4.3 The AutoSCOP Approach ........................... 42 4.3.1 Motivation ................................ 42 4.3.2 Unique Patterns ............................. 42 4.3.3 Extension: Pattern Combinations ................... 44 4.3.4 AutoSCOP¤: Inclusion of Further Data Sources . 45 4.4 Results ...................................... 45 4.4.1 Mapping of Training Domains ..................... 45 4.4.2 Prediction of SCOP 1.67 Domains ................... 47 4.4.3 Comparison of InterPro Entries and AutoSCOP Mappings . 51 4.4.4 Fold Prediction of CAFASP4 Targets . 52 4.4.5 Performance in the Sequence Twilight Zone . 54 4.4.6 Using AutoSCOP¤ as a Filter ..................... 54 4.5 AutoPSI DB ................................... 55 4.5.1 AutoPSI Database Content and Methods . 55 4.5.2 Towards Large-Scale Protein Domain Prediction . 59 4.6 Conclusion .................................... 61 5 SSEP-Domain: Template-Based Protein Domain Prediction 65 5.1 Introduction ................................... 66 5.2 The Domain Prediction Pipeline ........................ 67 5.2.1 Preliminaries .............................. 69 5.2.2 Step 1: Finding Potential Domain Boundaries . 69 5.2.3 Step 2: Scoring of Domain Regions . 72 5.2.4 Step 3: Combining Multiple Domain Regions . 75 5.3 Results ...................................... 76 5.3.1 CAFASP 4 and CASP 6 Results .................... 76 5.3.2 Current Version under CAFASP Conditions . 77 5.3.3 Evaluation of InterPro and Combination with AutoSCOP . 86 5.3.4 SSEP-Align: An Extension towards Structure Prediction . 87 5.3.5 Other Possible Extensions ....................... 89 5.4 Two Years After: CASP 7 and its Lessons . 90 5.4.1 Analysis of CASP 7 Results ...................... 90 CONTENTS vii 5.4.2 Using Alternative De¯nitions for SCOP Domains . 91 5.4.3 Discontinuous Domains ......................... 94 5.5 Independent Applications and Evaluations . 95 5.6 Discussion .................................... 96 6 Environment-Speci¯c Alignment Computation and Scoring 99 6.1 The QUASAR Framework . 100 6.1.1 Methods ................................. 100 6.1.2 Use Cases ................................ 102 6.2 Optimized Score Combinations . 103 6.2.1 Preliminaries .............................. 103 6.2.2 Generation of Linear Combinations . 105 6.2.3 Results .................................. 106 6.3 Optimized Matrices for Alignment Ranking . 108 6.3.1 Comparison Matrices . 108 6.3.2 Range-Adaptive Genetic Algorithm . 109 6.3.3 Results .................................. 111 6.4 Optimized Pro¯le-Pro¯le Alignments . 112 6.4.1 Training and Test Data . 113 6.4.2 Modi¯ed Genetic Algorithm . 113 6.4.3 Results .................................. 114 6.5 Discussion .................................... 115 7 Additional Tools 117 7.1 Vorolign: Structural Alignment and SCOP Classi¯cation Prediction . 117 7.2 Representation of Protein Information in ProML . 119 7.3 BioWeka: Extending the Weka Framework for Bioinformatics . 121 7.3.1 Motivation ................................ 121 7.3.2 The Weka Framework . 122 7.3.3 The BioWeka Library . 124 7.3.4 Example Applications . 126 7.3.5 Discussion ................................ 128 8 Concluding Remarks 129 Acknowledgements 146 viii CONTENTS List of Figures 1.1 Introduction: Thesis Overview ......................... 6 2.1 Background: Homology-based Protein Structure Prediction . 14 3.1 Preselection: Overview ............................. 26 3.2 Preselection: Preselection Performance .................... 30 3.3 Preselection: Fold Recognition Accuracy on Three Evaluation Sets. 35 4.1 AutoSCOP: Pattern-Class Graph. ....................... 43 4.2 AutoSCOP: Pattern Combinations. ...................... 43 4.3 AutoSCOP: Screenshot of the AutoPSI Database. 56 4.4 AutoSCOP: Annotation Process for the AutoPSI Database. 57 4.5 AutoSCOP Example: 1a0p. .......................... 60 4.6 AutoSCOP Example: 1jwlc. .......................... 60 4.7 AutoSCOP Example: 1a79a. .......................... 60 5.1 SSEP-Domain: Overview ............................ 68 5.2 SSEP-Domain: Domain Length Histogram . 70 5.3 SSEP-Domain: Alignment Score Histogram . 74 5.4 SSEP-Domain: Overlap Score Example .................... 83 5.5 SSEP-Domain: Results Plot .......................... 85 6.1 QUASAR: Overview .............................. 101 6.2 Comparison of GA Matrices with Well-Known Matrices. 112 7.1 Vorolign: Similarity Computation . 118 7.2 BioWeka: Overview ............................... 124 x List of Figures List of Tables 4.1 AutoSCOP: Quantitative Analysis of Unique InterPro Patterns. 45 4.2 AutoSCOP: Coverage Analysis on Training Data. 46 4.3 AutoSCOP: Inﬂuence of InterPro databases. 46 4.4 AutoSCOP: Sensitivity and Speci¯city .................... 49 4.5 AutoSCOP: Reduced Training Data ...................... 51 4.6 AutoSCOP: Non-Trivial Targets ........................ 53 5.1 SSEP-Domain: Correctly Predicted Targets . 80 5.2 SSEP-Domain: Speci¯city ........................... 81 5.3 SSEP-Domain: Average Overlap Score .................... 84 5.4 SSEP-Domain: SSEP-Align Results ...................... 88 5.5 SSEP-Domain: Comparison with SSEP-Domain* on CAFASP 4 and CASP 7 94 5.6 SSEP-Domain: Algorithmic Ingredients .................... 96 6.1 Evaluated Scoring Schemes and Corresponding Parameter Settings . 107 6.2 Results of Optimized PPA Parameters . 114 xii List of Tables Summary Proteins play a central role in organisms as they perform many important tasks in their cells. Accordingly, the better we understand how proteins are built, the better we can deal with many common diseases. In particular, information on structural properties of proteins can give insight into the way they work and how mutations, for instance, may a®ect their operability. Such knowledge can therefore help and inﬂuence modern medicine and drug development. This work is situated in the ¯eld of protein structure prediction. Here, the aim is to determine a three-dimensional structure from a protein's amino acid sequence. Depending on their quality, the resulting structure models can be used for a variety of purposes. At the moment there are millions of known protein sequences but only tens of thousands of known protein structures available. As it cannot be expected that it will be possible to experimentally determine a structure for each protein in the near future, protein structure prediction is an important task of current bioinformatics research. In particular, so-called template-based modeling can result in very good predicted structures. Here, given a target sequence with unknown structure, a simple modeling approach (1) searches for similar sequences with known structures (so-called templates), (2) com- putes mappings from the target sequence to the template sequences (so-called alignments), (3) takes the atom coordinates from the mapped parts

New Methods for the Prediction and Classification of Protein Domains

A Stochastic Point Cloud Sampling Method for Multi-Template Protein Comparative Modeling

Emodel-BDB: a Database of Comparative Structure Models Of

(12) United States Patent (10) Patent No.: US 8,715,988 B2 Arnold Et Al

From Directed Evolution to Computational Enzyme Engineering—A Review

A Domain Based Protein Structural Modelling Platform Applied in the Analysis of Alternative

An Ultra-Fast Approach for Large-Scale Protein Structure Similarity Searching Lei Deng1, Guolun Zhong1, Chenzhe Liu1, Judong Luo2* and Hui Liu3*

Probabilistic Protein Homology Modeling

Engineering Enzyme Systems by Recombination

On the Engineering of Proteins: Methods and Applications for Carbohydrate-Active Enzymes

An XML Based Semantic Protein Map

STRUCTURAL MODELING of PROTEIN-PROTEIN INTERACTIONS USING MULTIPLE-CHAIN THREADING and FRAGMENT ASSEMBLY by SRAYANTA MUKHERJEE

How Significant Is a Protein Structure Similarity with TM-Score = 0.5?