Algorithms for Protein Comparative Modelling and Some Evolutionary Implications

Bruno Contreras-Moreira

2003

Biomolecular Modelling Laboratory
Cancer Research UK, London Research Institute
44 Lincoln’s Inn Fields, London, WC2A 3PX

and

Department of Biochemistry and Molecular Biology
University College London
Gower Street, London, WC2E 6BT

Submitted in part fulfilment of the requirements for the degree of Doctor of Philosophy in Biochemistry of the University of London.

UMI Number: U602512. Published by ProQuest LLC 2014. Copyright in the Dissertation held by the Author.

To my family, including Pili

Abstract

Protein comparative modelling (CM) is a predictive technique to build an atomic model for a polypeptide chain, based on the experimentally determined structures of related proteins (templates). It is widely used in Structural Biology, with applications ranging from mutation analysis, protein and drug design to function prediction and analysis, particularly when there are no experimental structures of the protein of interest. CM is therefore an important tool for processing the large amount of data generated by genomic projects. Several problems affect the performance of CM, and solutions to them are needed to increase its applicability.
In this work different algorithms and approaches were tested with this aim, particularly to help in template selection and alignment, and some useful insights were obtained. First, this work describes the development of DomainFishing, a tool to split protein sequences into functionally and structurally defined domains and to align each of them to the available templates. The performance of our approach is benchmarked and some problems and possible developments are identified. When comparing different alignment procedures none of them is found to be consistently superior, suggesting that a combination of several could be an advantage. Driven by these ideas and the fact that selecting templates can be a difficult problem, a new modelling approach is designed and tested. This algorithm uses crossover, mutation and selection within populations of protein models generated from different templates and alignments to obtain recombinant structures optimised in terms of fitness. Despite our simple definition of fitness, the procedure is shown to be robust to some alignment errors while simplifying the task of selecting templates, making it a good candidate for automatic building of reliable protein models. In-house benchmarks of the method show its strengths and limitations. The method was also benchmarked during the fifth Critical Assessment of techniques for protein Structure Prediction (CASP5), in which its performance was encouraging for both comparative modelling and fold recognition targets, ranking among the top 20 predictors. Finally, we present some data to support a possible evolutionary feedback mechanism between protein structure and gene structure, using human and murine genomic data, structural data from the Protein Data Bank and the protein recombination methodology. This may have some implications for understanding protein evolution and protein design, which are discussed.
Acknowledgements

Replying by e-mail to an advertisement that I found in a magazine almost four years ago had unexpected consequences: I ended up in London. Just like in Rayuela (a book by Julio Cortázar) now I must acknowledge people from both this and the other side of the British Channel.

From this side (el lado de acá), I am especially grateful to Paul, my supervisor, for all our discussions and arguments. Thanks for your patience and guidance. From the Biomolecular Modelling Lab (BMM) I must also thank my friend Paul Fitzjohn for his generous help and for teaching me so many things. Graham also helped me a lot, particularly with Maths, Molecular Dynamics, English and even Spanish. Thanks for all that and for your support. I should also acknowledge the latest signings, Chris, Pall and Jose, for our joint projects, their advice and for sharing their readPDB routines. From Mike Sternberg’s old BMM, I am indebted to Arne for many things, including introducing me to LaTeX and Bagpuss and also for TeXMed. I am sure I still owe you and Aengus a couple of pints. This is probably true for Adrian, Lawrence and Ben as well. I should also mention here Raphael and Dave for everything, including reading drafts of this thesis and for being my loyal friends outside the lab (pubs are among the best places to talk about anything). Of course I should thank my overseas family (María, Afroditula, Joana, Nick, Themis and Oscar) just for being there every day and for the Spanish, Greek, Portuguese and Lancashire food. Sofiki should also be here, but she preferred to live at Baron’s Court. I would also like to thank Cancer Research UK (and its volunteers) for the funding, the good working environment and the great time I had here. Neil McDonald and Nancy Hogg were especially kind to me. Thanks for your time and support.

From the other side (el lado de allá), I must remember first my family. Living with them has so far been the best school.
Alex, Edu, Elena, Javivi, Jevi and Arancha have always been my friends despite the distance. Agus put me on this crazy track and still inspires me. On a more scientific side of things, I thank here Antonio García-Bellido and Alfonso Valencia (and their groups, including Cassandrex, Damien and ‘cubanito’) for helping me reach this page. My colleagues from www.precarios.org deserve my recognition for all their work to make life easier for Spanish scientists. Finally, Pili deserves everything for being at both sides and never surrendering.

List of abbreviations

BLAST       Basic Local Alignment Search Tool
BLOCKS      ‘alignment Blocks’ (no abbreviation)
BLOSUM      Blocks Substitution Matrix
CASP        Critical Assessment of techniques for protein Structure Prediction
CM          Comparative Modelling
DNA         Deoxyribonucleic Acid
FASTA       Fast Alignment Search Tool
FR          Fold Recognition
HMM         Hidden Markov Model
HSP         High-scoring Segment Pair
HTML        Hypertext Markup Language
IEB         Intron-Exon Boundary
IMPALA      Integrating Matrix Profiles And Local Alignments
NCBI        National Center for Biotechnology Information
NMR         Nuclear Magnetic Resonance
nr          Non-Redundant Protein Database compiled by the NCBI
mRNA        messenger Ribonucleic Acid
ORF         Open Reading Frame
PAM         Point Accepted Mutation
PDB         Protein Data Bank
PFAM        Protein Family database of alignments and HMMs
PIR         Protein Information Resource
ProDom      Protein Domain
PROSITE     not an abbreviation (protein sequence pattern database)
PSI-BLAST   Position Specific Iterated BLAST
PSI-PRED    Position Specific Iterated Prediction
PSSM        Position Specific Scoring Matrix
RMSD        Root Mean Square Deviation
RNA         Ribonucleic Acid
SCOP        Structural Classification Of Proteins
URL         Uniform Resource Locator

Table 1: List of abbreviations used throughout this document.

Contents

Abstract
Acknowledgements
List of abbreviations
1 Introduction
  1.1 Protein structure and function
    1.1.1 Primary structure
    1.1.2 Secondary structure
    1.1.3 Tertiary and quaternary structure
    1.1.4 Fibrous and membrane proteins
    1.1.5 Evolution of proteins: introns and exons
  1.2 Experimental methods for determination of protein structure
    1.2.1 X-ray crystallography
    1.2.2 NMR
  1.3 Theoretical methods to model protein structure and dynamics
    1.3.1 Databases
    1.3.2 Introduction to algorithms
    1.3.3 Overview of algorithms in protein structure prediction
    1.3.4 Overview of protein minimisation and dynamics
  1.4 Comparative modelling of proteins (CM)
    1.4.1 Finding the best templates
    1.4.2 Aligning the templates to the query
    1.4.3 Modelling by satisfaction of spatial restraints
    1.4.4 Modelling by fragment building approaches
    1.4.5 Optimisation: selection of side-chains and loops
    1.4.6 Energy refinement and molecular dynamics
    1.4.7 Error analysis
