Development of Novel Strategies for Template-Based Protein Structure Prediction

Imperial College London
Department of Life Sciences

PhD Thesis

Stefans Mezulis
Supervised by Prof. Michael Sternberg and Prof. William Knottenbelt
September 2016

Abstract

The most successful methods for predicting the structure of a protein from its sequence rely on identifying homologous sequences with a known structure and building a model from these structures. A key component of these homology modelling pipelines is a model combination method, responsible for combining homologous structures into a coherent whole. Presented in this thesis is poing2, a model combination method using physics-, knowledge- and template-based constraints to assemble proteins using information from known structures. By combining intrinsic bond length, angle and torsional constraints with long- and short-range information extracted from template structures, poing2 assembles simplified protein models using molecular dynamics algorithms. Compared to the widely-used model combination tool MODELLER, poing2 is able to assemble models of approximately equal quality. When supplied only with poor quality templates or templates that do not cover the majority of the query sequence, poing2 significantly outperforms MODELLER.

Additionally presented in this work is PhyreStorm, a tool for quickly and accurately aligning the three-dimensional structure of a query protein with the Protein Data Bank (PDB). The PhyreStorm web server provides comprehensive, current and rapid structural comparisons to the protein data bank, providing researchers with another tool from which a range of biological insights may be drawn. By partitioning the PDB into clusters of similar structures and performing an initial alignment to the representatives of each cluster, PhyreStorm is able to quickly determine which structures should be excluded from the alignment.
For a benchmarking set of 100 proteins of diverse structure, PhyreStorm is capable of finding over 90% of all high-scoring structures in the PDB, and over 80% of all structures of moderate alignment score.

Declaration of originality

This thesis, and the work described within this thesis, is entirely my own work unless explicitly specified otherwise.

Copyright

The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No-Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Acknowledgements

I would like to thank the whole of the Sternberg group for their support throughout this PhD. In particular, I would like to thank Mike for pushing me to actually get things finished, Ioannis and Lawrence for the endless ideas (and all the beer), and Suhail for keeping all the software running despite my best efforts.

This work was supported by the EPSRC (Grant EP/K502856/1).

Contents

1 Introduction
  1.1 Motivation and Objectives
  1.2 Homology modelling
    1.2.1 Remote homology
    1.2.2 Hidden Markov models
  1.3 MODELLER
    1.3.1 Statistically-determined features
    1.3.2 Stereochemical features
    1.3.3 Building basis pdfs
    1.3.4 Building the molecular pdf
    1.3.5 Optimisation of the molecular pdf
    1.3.6 Shortcomings and extensions
  1.4 I-TASSER
    1.4.1 I-TASSER
    1.4.2 Zhang-server
  1.5 Robetta
    1.5.1 Domain parsing
    1.5.2 Comparative modelling pipeline
    1.5.3 De novo modelling pipeline
    1.5.4 Model selection
    1.5.5 Domain assembly
  1.6 Phyre2
  1.7 Structural alignment
    1.7.1 Root-mean-square deviation
    1.7.2 Global Distance Test
    1.7.3 TM-score
2 PhyreStorm
  2.1 One-vs-many structural alignments
    2.1.1 Dali
    2.1.2 VAST
    2.1.3 FATCAT
    2.1.4 SSM
  2.2 Materials and Methods
    2.2.1 The PhyreStorm database
    2.2.2 Searching the database
    2.2.3 Infrastructure
  2.3 Interface
  2.4 Results
    2.4.1 Coverage
    2.4.2 Speed
    2.4.3 Precision and coverage of SCOP
  2.5 Conclusion
3 Poing2
  3.1 Methods
    3.1.1 Protein model
    3.1.2 Model synthesis
    3.1.3 Query-based restraints
    3.1.4 Template-based constraints
    3.1.5 Drag and solvent bombardment
    3.1.6 Explicit secondary structure
  3.2 Results
    3.2.1 Training
    3.2.2 Target T0600-D2
    3.2.3 Testing
    3.2.4 Speed
    3.2.5 Side-chain rotamers
  3.3 Conclusion
4 Conclusions and future work
A PhyreStorm benchmarking set
B Poing2 parameters

List of Figures

1.1 Sequence identity vs TM-score for CASP11 targets
1.2 Alignment of a sequence to an HMM
2.1 Nearest-neighbour tree
2.2 PhyreStorm infrastructure
2.3 PhyreStorm submission page
2.4 PhyreStorm chain picker
2.5 PhyreStorm results page
2.6 Coverage of PhyreStorm
2.7 Top-scoring coverage of PhyreStorm
2.8 PhyreStorm search durations
2.9 Precision-coverage AUC for Dali and PhyreStorm
2.10 Selected precision-coverage curves
2.11 Comparison of AUCs for the same number of hits
2.12 Structures found for target d3kyha_
2.13 Coverage of PhyreStorm and Dali
3.1 Atom synthesis positions
3.2 Spring constraint handedness
3.3 Significance testing for training runs
3.4 Change in TM-score for every model compared with TM-score of best template
3.5 Change in TM-score for top-ranked model compared with TM-score of best template
3.6 Target T0600-D2 with templates and models
3.7 Cumulative error in ϕ and ψ for T0600-D2
3.8 Segments of worst model aligned to target T0600-D2
3.9 Improvement in poing2 models over MODELLER against template quality
3.10 Improvement in model quality over templates for poing2 and MODELLER
3.11 Comparison between improvements by poing2 and MODELLER
3.12 Models produced for target T0680-D2
3.13 Alignments of T0663-D2 with templates and models
3.14 Hydrogen bond precision and recall
3.15 Target T0685-D2 structure and models
3.16 T0685-D2 β-sheet
3.17 Distribution of MODELLER and poing2 run times
3.18 Comparison of rotamer accuracies

List of abbreviations

AUC: Area under curve
BLAST: Basic Local Alignment Search Tool
CASP: Critical Assessment of protein Structure Prediction
HMM: Hidden Markov Model
MSA: Multiple Sequence Alignment
PDB: Protein Data Bank
PDF or pdf: Probability Density Function
PSI-BLAST: Position-Specific Iterated BLAST
R3: Residue-rotamer-reduction
RCSB: Research Collaboratory for Structural Bioinformatics
RMSD or rmsd: Root Mean Square Distance
SCOP: Structural Classification of Proteins
SCOPe: Structural Classification of Proteins—extended
TM-score: Template Modelling score

Chapter 1
Introduction

The function of a protein is determined by its tertiary structure, which is in turn determined by its primary structure, a sequence of amino acids. The process of experimentally determining the structure of a protein is difficult, time-consuming and costly, if it is even possible for a particular structure. As such, the number of publicly-available protein sequences now exceeds the number of determined protein structures by more than two orders of magnitude [1]. It is clear, then, that the development of protein structure prediction methods that are capable of producing models of sufficient quality to glean functional insight is of paramount importance in advancing our understanding of the biological processes governing every living moment. Examples for the uses of protein structure can be found in the field of drug

[1] Compare, for example, the number of sequences in the UniProtKB/TrEMBL database (http://www.ebi.ac.uk/uniprot/TrEMBLstats)
