Protocols to Capture the Functional Plasticity of Protein Domain Superfamilies

Research Department of Structural and Molecular Biology

Protocols to Capture the Functional Plasticity of Protein Domain Superfamilies

Robert Rentzsch

October 2011

Submitted in support of the degree of Doctor of Philosophy

Declaration

I, Robert Rentzsch, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Robert Rentzsch
October 2011

Abstract

Most proteins comprise several domains, segments that are clearly discernible in protein structure and sequence. Over the last two decades, it has become increasingly clear that domains are often also functional modules that can be duplicated and recombined in the course of evolution. This gives rise to novel protein functions. Traditionally, protein domains are grouped into homologous domain superfamilies in resources such as SCOP and CATH. This is done primarily on the basis of similarities in their three-dimensional structures. A biologically sound subdivision of the domain superfamilies into families of sequences with conserved function has so far been missing. Such families form the ideal framework to study the evolutionary and functional plasticity of individual superfamilies. In the few existing resources that aim to classify domain families, a considerable amount of manual curation is involved. Whilst immensely valuable, the latter is inherently slow and expensive. It can thus impede large-scale application.

This work describes the development and application of a fully automatic pipeline for identifying functional families within superfamilies of protein domains. This pipeline is built around a method for clustering large-scale sequence datasets in distributed computing environments.
In addition, it implements two different protocols for identifying families on the basis of the clustering results: a supervised and an unsupervised protocol. These are used depending on whether or not high-quality protein function annotation data are associated with a given superfamily. The results attained for more than 1,500 domain superfamilies are discussed in both a qualitative and quantitative manner. The use of domain sequence data in conjunction with Gene Ontology protein function annotations and a set of rules and concepts to derive families is a novel approach to large-scale domain sequence classification. Importantly, the focus lies on domain, not whole-protein function.

Acknowledgements

Thanks are due to several people who helped to make my time in the Orengo lab and in London in general a valuable experience. From very early on until the now nearing end, my fellow PhD student and friend Jim Perkins played a big role in this – from quite practical issues, such as moving house, sleeping in other people’s houses and playing pool in semi-accordance with the rules, to long and inspiring conversations about basically everything imaginable, including science. I apologise for having destroyed his belief in German punctuality, though. Benoit Dessailly was a great postdoctoral colleague and will hopefully remain a friend; his special sense of humour is certainly beaten by no other Belgian. Adam Reid and Ollie Redfern, who both left after my first year, were inspiring people to talk to, each in their own very special way. Thanks to Ian Sillitoe, Corin Yeats and Jon Lees for helping me to squeeze the relevant bits and bytes out of ‘their’ databases. Further, thanks to everyone else in the Orengo and Martin groups for the many random acts of kindness, most of which involved food.
I would particularly like to thank my supervisor, Christine Orengo, for her help along the way, reading manuscripts on the weekend or whilst (technically) being on holiday, and for giving me the opportunity to work in such a friendly and prestigious group. Thanks are also due to Kate Bowers, chair of my thesis committee, and David Jones, member of my thesis committee, for their guidance. Tom Knight, Jesse Oldershaw and Jahid Ahmed, known in the Darwin building as ‘the IT guys’, were of great help on several occasions. If I were Fox Mulder, these would be my personal Lone Gunmen. And I may well hire Tristan Clark from UCL Computer Science too, as he was incredibly helpful on short notice when it came to cluster troubles.

Back to the lower ranks, it should be noted here that I never met a fellow student in the Darwin building who was not friendly, helpful and easy to get along with. Actually, I think this holds for everyone I have ever met there – apart from maybe this one person, but to be honest, who would not get a bit grumpy maintaining a building that poses unprecedented challenges to climate and pest control? Concerning the failure of the latter, maybe I am even glad that the mice have never really left. Their jumping around like mad in the bins was a comforting noise in the long, otherwise solitary nights that were spent at the office during ‘peak’ times.

It would be difficult to mention all the people and things outside work that made my time in London worthwhile. Yet, I will try to name a few important ones. I enjoyed every single fruit I ever bought on my way to and from work from the guy in front of the ‘Euston Supermarket’ – ten bananas for a pound! More importantly, he was always friendly and…well, there. Rain or shine. The same goes for the Hare Krishna people who give out free food to students, homeless people and everyone else who wants it. Again, value for (no) money paired with exceptional friendliness.
Finally, there is the Poetry Library on the Southbank. This is probably the most serene and friendly place in London. Last but not least, I want to thank the European Union and all its taxpayers for funding my research.

As acknowledgements tend to be written at a point where the word ‘deadline’ becomes perceptible in all its nasty metaphorical power, and memory weakens, there will inevitably be people who should have been mentioned here but are not. To those I apologise. Please do not think you were less important or helpful than the banana man.

One person I will certainly not forget to thank is Anna. It is never easy to live with someone who is constantly working towards a self-imposed goal, but, at the same time, it is great to know there will be someone to share the comfort with when the goal is achieved. To doing this with her, my close friends and my family, I look forward.

Abbreviations

aaRS    Aminoacyl-tRNA synthetase
AMP     Adenosine monophosphate
ATP     Adenosine triphosphate
BP      Biological Process (in GO)
CC      Cellular Component (in GO)
DAG     Directed acyclic graph
DNA     Deoxyribonucleic acid
EM      Electron microscopy
GMP     Guanosine monophosphate
GO      Gene Ontology
GTP     Guanosine triphosphate
HIV     Human Immunodeficiency Virus
HMM     Hidden Markov model
HPC     High-performance computing
I/O     Input/Output
LUCA    Last Universal Common Ancestor (organism)
MF      Molecular Function (in GO)
NAD     Nicotinamide adenine dinucleotide
NADP    Nicotinamide adenine dinucleotide phosphate
NCBI    National Center for Biotechnology Information
NMR     Nuclear magnetic resonance (spectroscopy)
ORF     Open Reading Frame
RAM     Random Access Memory
RMSD    Root Mean Square Deviation
RNA     Ribonucleic acid
UNIX    UNiplexed Information and Computing System

Contents

Chapter 1. Introduction
  1.1 Protein domains and superfamilies
    1.1.1 Origin and definition of concepts
      1.1.1.1 Superfamilies
      1.1.1.2 Protein domains
    1.1.2 The evolution of multi-domain proteins
    1.1.3 Existing superfamily studies
  1.2 Relationships between protein sequences
    1.2.1 Pairwise relationships
      1.2.1.1 Homology
      1.2.1.2 Orthology and Paralogy
    1.2.2 Group concepts
      1.2.2.1 Superfamily
      1.2.2.2 Orthologue cluster
      1.2.2.3 Family
    1.2.3 Application to proteins and domains
  1.3 Protein function annotation
    1.3.1 The Gene Ontology
    1.3.2 Other systems
  1.4 Bioinformatics methods
    1.4.1 Sequence alignment
      1.4.1.1 Pairwise sequence alignment
      1.4.1.2 Multiple sequence alignment
    1.4.2 Alignment