New Methods for the Prediction and Classification of Protein Domains
Total Page:16
File Type:pdf, Size:1020Kb
New Methods for the Prediction and Classi¯cation of Protein Domains Jan Erik Gewehr MÄunchen2007 New Methods for the Prediction and Classi¯cation of Protein Domains Jan Erik Gewehr Dissertation an der FakultÄatfÄurMathematik und Informatik der Ludwig{Maximilians{UniversitÄat MÄunchen vorgelegt von Jan Erik Gewehr aus LÄubeck MÄunchen, den 01.10.2007 Erstgutachter: Prof. Dr. Ralf Zimmer Zweitgutachter: Prof. Dr. Dmitrij Frishman Tag der mÄundlichen PrÄufung:19.12.2007 Contents 1 Motivation and Overview 1 1.1 The Bene¯t of Protein Structure Prediction ................. 1 1.2 Thesis Outline .................................. 3 2 Introduction to Protein Structure Prediction 7 2.1 Protein Structures and Related Databases .................. 7 2.1.1 From Primary to Tertiary Structure .................. 7 2.1.2 Structure-Related Databases ...................... 8 2.2 Assignment of Protein Domains ........................ 10 2.2.1 An Old Problem ............................ 10 2.2.2 Comparison of Domain Assignments . 11 2.3 Structure Prediction Categories ........................ 13 2.3.1 Comparative Modeling ......................... 13 2.3.2 Fold Recognition ............................ 14 2.3.3 Ab Initio ................................. 15 2.4 Community-Wide E®orts ............................ 15 2.4.1 Community-Wide Experiments .................... 15 2.4.2 Structural Genomics and Structure Prediction . 16 2.5 Alignment Methods ............................... 16 2.5.1 Sequence Alignment .......................... 16 2.5.2 Alignment Methods used in this Work . 19 3 Selection of Fold Classes based on Secondary Structure Elements 23 3.1 Introduction ................................... 23 3.2 Material ..................................... 25 3.2.1 Training and Test Data ......................... 25 3.2.2 Quoted Methods ............................ 27 3.3 Preselection of Fold Classes .......................... 28 3.3.1 Secondary Structure Element Alignment (SSEA) . 28 3.3.2 Selection Strategies based on SSEA . 29 3.4 Re¯nement with Pro¯le-Pro¯le Alignment . 31 3.5 Results ...................................... 31 3.5.1 Preselection Performance on ASTRAL25 . 32 vi CONTENTS 3.5.2 Fold Recognition Accuracy ....................... 33 3.5.3 Speed-Up Evaluation .......................... 34 3.6 Discussion .................................... 34 4 AutoSCOP: Unique Mapping of Patterns to SCOP 37 4.1 Introduction ................................... 38 4.2 Material ..................................... 40 4.2.1 InterPro and its Member Databases . 40 4.2.2 ASTRAL Asteroids and Family HMMs . 41 4.2.3 Training Data .............................. 41 4.2.4 Test Data ................................ 41 4.3 The AutoSCOP Approach ........................... 42 4.3.1 Motivation ................................ 42 4.3.2 Unique Patterns ............................. 42 4.3.3 Extension: Pattern Combinations ................... 44 4.3.4 AutoSCOP¤: Inclusion of Further Data Sources . 45 4.4 Results ...................................... 45 4.4.1 Mapping of Training Domains ..................... 45 4.4.2 Prediction of SCOP 1.67 Domains ................... 47 4.4.3 Comparison of InterPro Entries and AutoSCOP Mappings . 51 4.4.4 Fold Prediction of CAFASP4 Targets . 52 4.4.5 Performance in the Sequence Twilight Zone . 54 4.4.6 Using AutoSCOP¤ as a Filter ..................... 54 4.5 AutoPSI DB ................................... 55 4.5.1 AutoPSI Database Content and Methods . 55 4.5.2 Towards Large-Scale Protein Domain Prediction . 59 4.6 Conclusion .................................... 61 5 SSEP-Domain: Template-Based Protein Domain Prediction 65 5.1 Introduction ................................... 66 5.2 The Domain Prediction Pipeline ........................ 67 5.2.1 Preliminaries .............................. 69 5.2.2 Step 1: Finding Potential Domain Boundaries . 69 5.2.3 Step 2: Scoring of Domain Regions . 72 5.2.4 Step 3: Combining Multiple Domain Regions . 75 5.3 Results ...................................... 76 5.3.1 CAFASP 4 and CASP 6 Results .................... 76 5.3.2 Current Version under CAFASP Conditions . 77 5.3.3 Evaluation of InterPro and Combination with AutoSCOP . 86 5.3.4 SSEP-Align: An Extension towards Structure Prediction . 87 5.3.5 Other Possible Extensions ....................... 89 5.4 Two Years After: CASP 7 and its Lessons . 90 5.4.1 Analysis of CASP 7 Results ...................... 90 CONTENTS vii 5.4.2 Using Alternative De¯nitions for SCOP Domains . 91 5.4.3 Discontinuous Domains ......................... 94 5.5 Independent Applications and Evaluations . 95 5.6 Discussion .................................... 96 6 Environment-Speci¯c Alignment Computation and Scoring 99 6.1 The QUASAR Framework . 100 6.1.1 Methods ................................. 100 6.1.2 Use Cases ................................ 102 6.2 Optimized Score Combinations . 103 6.2.1 Preliminaries .............................. 103 6.2.2 Generation of Linear Combinations . 105 6.2.3 Results .................................. 106 6.3 Optimized Matrices for Alignment Ranking . 108 6.3.1 Comparison Matrices . 108 6.3.2 Range-Adaptive Genetic Algorithm . 109 6.3.3 Results .................................. 111 6.4 Optimized Pro¯le-Pro¯le Alignments . 112 6.4.1 Training and Test Data . 113 6.4.2 Modi¯ed Genetic Algorithm . 113 6.4.3 Results .................................. 114 6.5 Discussion .................................... 115 7 Additional Tools 117 7.1 Vorolign: Structural Alignment and SCOP Classi¯cation Prediction . 117 7.2 Representation of Protein Information in ProML . 119 7.3 BioWeka: Extending the Weka Framework for Bioinformatics . 121 7.3.1 Motivation ................................ 121 7.3.2 The Weka Framework . 122 7.3.3 The BioWeka Library . 124 7.3.4 Example Applications . 126 7.3.5 Discussion ................................ 128 8 Concluding Remarks 129 Acknowledgements 146 viii CONTENTS List of Figures 1.1 Introduction: Thesis Overview ......................... 6 2.1 Background: Homology-based Protein Structure Prediction . 14 3.1 Preselection: Overview ............................. 26 3.2 Preselection: Preselection Performance .................... 30 3.3 Preselection: Fold Recognition Accuracy on Three Evaluation Sets. 35 4.1 AutoSCOP: Pattern-Class Graph. ....................... 43 4.2 AutoSCOP: Pattern Combinations. ...................... 43 4.3 AutoSCOP: Screenshot of the AutoPSI Database. 56 4.4 AutoSCOP: Annotation Process for the AutoPSI Database. 57 4.5 AutoSCOP Example: 1a0p. .......................... 60 4.6 AutoSCOP Example: 1jwlc. .......................... 60 4.7 AutoSCOP Example: 1a79a. .......................... 60 5.1 SSEP-Domain: Overview ............................ 68 5.2 SSEP-Domain: Domain Length Histogram . 70 5.3 SSEP-Domain: Alignment Score Histogram . 74 5.4 SSEP-Domain: Overlap Score Example .................... 83 5.5 SSEP-Domain: Results Plot .......................... 85 6.1 QUASAR: Overview .............................. 101 6.2 Comparison of GA Matrices with Well-Known Matrices. 112 7.1 Vorolign: Similarity Computation . 118 7.2 BioWeka: Overview ............................... 124 x List of Figures List of Tables 4.1 AutoSCOP: Quantitative Analysis of Unique InterPro Patterns. 45 4.2 AutoSCOP: Coverage Analysis on Training Data. 46 4.3 AutoSCOP: Influence of InterPro databases. 46 4.4 AutoSCOP: Sensitivity and Speci¯city .................... 49 4.5 AutoSCOP: Reduced Training Data ...................... 51 4.6 AutoSCOP: Non-Trivial Targets ........................ 53 5.1 SSEP-Domain: Correctly Predicted Targets . 80 5.2 SSEP-Domain: Speci¯city ........................... 81 5.3 SSEP-Domain: Average Overlap Score .................... 84 5.4 SSEP-Domain: SSEP-Align Results ...................... 88 5.5 SSEP-Domain: Comparison with SSEP-Domain* on CAFASP 4 and CASP 7 94 5.6 SSEP-Domain: Algorithmic Ingredients .................... 96 6.1 Evaluated Scoring Schemes and Corresponding Parameter Settings . 107 6.2 Results of Optimized PPA Parameters . 114 xii List of Tables Summary Proteins play a central role in organisms as they perform many important tasks in their cells. Accordingly, the better we understand how proteins are built, the better we can deal with many common diseases. In particular, information on structural properties of proteins can give insight into the way they work and how mutations, for instance, may a®ect their operability. Such knowledge can therefore help and influence modern medicine and drug development. This work is situated in the ¯eld of protein structure prediction. Here, the aim is to determine a three-dimensional structure from a protein's amino acid sequence. Depending on their quality, the resulting structure models can be used for a variety of purposes. At the moment there are millions of known protein sequences but only tens of thousands of known protein structures available. As it cannot be expected that it will be possible to experimentally determine a structure for each protein in the near future, protein structure prediction is an important task of current bioinformatics research. In particular, so-called template-based modeling can result in very good predicted struc- tures. Here, given a target sequence with unknown structure, a simple modeling approach (1) searches for similar sequences with known structures (so-called templates), (2) com- putes mappings from the target sequence to the template sequences (so-called alignments), (3) takes the atom coordinates from the mapped parts