Exploring the Function and Evolution of Proteins Using Domain Families

Exploring the Function and Evolution of Proteins Using Domain Families Adam James Reid Research Department of Structural and Molecular Biology University College London A thesis submitted to University College London for the degree of Doctor of Philosophy January 2009 1 Declaration I, Adam James Reid, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. Adam James Reid January 2009 2 Abstract Proteins are frequently composed of multiple domains which fold independently. These are often evolutionarily distinct units which can be adapted and reused in other proteins. The classification of protein domains into evolutionary families facilitates the study of their evolution and function. In this thesis such classifications are used firstly to examine methods for identifying evolutionary relationships (homology) between protein domains. Secondly a specific approach for predicting their function is developed. Lastly they are used in studying the evolution of protein complexes. Tools for identifying evolutionary relationships between proteins are central to computational biology. They aid in classifying families of proteins, giving clues about the function of proteins and the study of molecular evolution. The first chapter of this thesis concerns the effectiveness of cutting edge methods in identifying evolutionary relationships between protein domains. The identification of evolutionary relationships between proteins can give clues as to their function. The second chapter of this thesis concerns the development of a method to identify proteins involved in the same biological process. This method is based on the concept of domain fusion whereby pairs of proteins from one organism with a concerted function are sometimes found fused into single proteins in a different organism. Using protein domain classifications it is possible to identify these relationships. Most proteins do not act in isolation but carry out their function by binding to other proteins in complexes; little is understood about the evolution of such complexes. In the third chapter of this thesis the evolution of complexes is examined in two representative model organisms using protein domain families. In this work, protein domain superfamilies allow distantly related parts of complexes to be identified in order to determine how homologous units are reused. 3 This work was generously supported by the Biotechnology and Biological Sciences Research Council. 4 Acknowledgements Firstly, many, many thanks to Professor Christine Orengo for allowing me to join the group, guiding me through my PhD and helping me to produce a good body of work. It has been an honour to work in such a prestigious group. Thanks go to Orengo and Martin group members past and present who have helped me out with problems of all sorts and provided an excellent working environment: Mark Dibley, Oliver Redfern, Timothy Dallman, Ian Sillitoe, Sarah Addou, Michael Maibaum, Russell Marsden, Tony Lewis, Alison Cuff, Andy Clegg, Jon Lees, Benoit Dessailly, David Lee, Stathis Sideris, Stephano Lise, Phil Carter, Robert Rentzsch, James Perkins, Lisa McMillan, Abhinandan Raghavan, Anya Baresic, Jakob Hurst and Ilhem Diboun. Thanks to Michael Wright for his top notch administration. Many thanks to Jesse Oldershaw, Tom Knight, Jahid Ahmed, Donovan Binns and Duncan McKenzie, members of the IT team past and present for keeping our farms and other machines running. Special thanks to Ian Sillitoe for, amongst other things, his dedication to CATH and his awesome script for generating figures based on SSAP alignments. Tim Dallman for inviting me to join his awesome band (Party in Hiroshima!). Oliver Redfern for long, interesting chats about science and helping revise my thesis. Corin Yeats for important early guidance in my work. Many thanks to Juan Antonio Garcia Ranea for a great deal of advice and guidance on analyses, statistics and biology generally. Thanks to David Jones, chair of my thesis committee and Andrew Martin, member of my thesis committee for guidance on my work. 5 Many thanks and kind regards to Alison Cranage for helping me through from start to finish, when it was fun and easy and when it got really hard. Unending thanks to my parents Martyn and Teresa Reid for giving me an amazing start and funding me well into my twenties. To anyone I may have forgotten, I apologise. A final thanks again to the BBSRC for funding this work and continuing to fund basic life sciences research in the UK. 6 List of Abbreviations BDBH Bi-Directional Best Hit BLAST Basic Local Alignment Search Tool BLOSUM BLOcks of Amino Acid SUbstitution Matrix CATH Class Architecture Topology Homology CODA Co-Occurrence of Domains Analysis COGS Cluster of Orthologous Groups COMPASS COmparison of Multiple Protein sequence Alignments with assessment of Statistical Significance CSA Catalytic Site Atlas DAG Directed Acyclic Graph DDI Domain-Domain Interaction DFT Domain Fusion Template DNA DeoxyriboNucleic Acid EC Enzyme Commission EM Expectation Maximisation EPQ Errors Per Query EVD Extreme Value Distribution FDR False Discovery Rate FP False Positive FSSP Fold classification based on Structure-Structure alignment of Proteins GO Gene Ontology GOSS GO Semantic Similarity GTD Genomic Threading Database HMM Hidden Markov Model IPI International Protein Index 7 LAMA Local Alignment of Multiple-Alignments MDA Multi-Domain Architecture N-W Needleman-Wunsch OPHID Online Predicted Human Interactions Database PAM Percent Accepted Mutation PDB Protein DataBank PFP Protein Function Prediction PIN Protein Interaction Network PPI Protein-Protein Interaction PPV Positive Predictive Value PQS Protein Quaternary Structure PRC PRofile Comparer PSI-BLAST Position-Specific Iterated BLAST PSSM Position Specific Scoring Matrix PTM Post-Translational Modification RMSD Root Mean Square Deviation SAM Sequence Alignment and Modelling system SAS Structural Alignment Score SCOP Structural Classification Of Proteins SMART Simple Modular Architecture Research Tool SSAP Sequential Structure Alignment Program STRING Search Tool for the Retrieval of INteracting Genes/proteins SVM Support Vector Machine S-W Smith-Waterman SYSTERS SYSTEmatic Re-Searching TAP-MS Tandem Affinity Purification linked to Mass Spectrometry TIGR The Institute for Genome Research TP True Positive 8 Y2H Yeast Two-Hybrid 9 Table of Contents Abstract...................................................................................................................3 Acknowledgements ...............................................................................................5 List of Abbreviations .............................................................................................7 Table of Contents .................................................................................................10 List of Figures.......................................................................................................17 List of Tables.........................................................................................................24 List of Equations ..................................................................................................27 Chapter 1 Introduction...................................................................................30 1.1. Molecular Biology as an Information Science....................................30 1.2. Proteins..................................................................................................34 1.2.1. Protein Sequence Resources.........................................................39 1.2.2. Protein Structure Resources.........................................................39 1.3. Detecting Evolutionary Relationships between Proteins..................41 1.3.1. Homology, Orthology and Paralogy...........................................41 1.3.2. Homologue Detection...................................................................41 1.3.3. Single Sequence Methods.............................................................44 1.3.4. Profile Methods.............................................................................48 1.3.5. Profile Hidden Markov Models...................................................49 1.3.6. Profile-Profile Methods ................................................................55 1.3.7. Structure-Based Homology Detection ........................................59 1.3.8. Algorithms for Clustering Proteins.............................................63 1.4. Protein Domain Classification Resources...........................................65 1.4.1. Sequence-Based Classifications ...................................................65 1.4.2. Structure-Based Classifications....................................................66 1.4.3. Identifying Domains in Protein Sequences.................................72 1.4.4. Whole-Chain Protein Classifications...........................................73 1.5. Evolution of Protein Domain Families ...............................................75 1.6. Protein Function Classifications..........................................................77 1.6.1. Gene Ontology ..............................................................................77 1.6.2. FunCat............................................................................................80 1.6.3. Enzyme Commission....................................................................80 1.6.4.

Exploring the Function and Evolution of Proteins Using Domain Families

Improving the Prediction of Transcription Factor Binding Sites To

Applied Category Theory for Genomics – an Initiative

A Community Proposal to Integrate Structural

SD Gross BFI0403

Functional Effects Detailed Research Plan

Bioinformatics Approaches to Identify Pain Mediators, Novel Lncrnas and Distinct Modalities of Neuropathic Pain

Prolango: Protein Function Prediction Using Neural~ Machine

100000 Protein Structures for the Biologist

Gaëlle GARET Classification Et Caractérisation De Familles Enzy

Glycoprotein Hormone Receptors in Sea Lamprey

Proteome-Scale Prediction of Structure and Func- Tion

The History of the CATH Structural Classification of Protein Domains