Communities and Homology in Protein-Protein Interactions

Total Page:16

File Type:pdf, Size:1020Kb

Communities and Homology in Protein-Protein Interactions Communities and Homology in Protein-protein Interactions Anna Lewis Balliol College University of Oxford A thesis submitted for the degree of Doctor of Philosophy Michaelmas 2011 2 3 This thesis is dedicated to the inspirational Jen Weterings. Wishing her well in her ongoing recovery. Acknowledgements With thanks to Charlotte, Nick and Mason, for being the best supervisors a girl could ask for. With thanks to all those who have helped me with the work in this thesis. In particular Sumeet Agarwal, Rebecca Hamer, Gesine Reinert, James Wakefield, Simon Myers, Steve Kelly, Nicholas Lewis, members of the Oxford Protein Informatics Group and members of the Oxford/Imperial Systems and Signalling group. With thanks to all those who have made my DPhil years a wonderful experience, in particular my family, my Balliol family, and my DTC family. Abstract Knowledge of protein sequences has exploded, but knowledge of protein function is needed to make use of sequence information, and this lags behind. A protein’s function must be understood in context and part of this is the network of interactions between proteins. What are the relationships between protein function and the structure of the interaction network? In the first part of my thesis, I investigate the functional relevance of clusters, or communities, of proteins in the yeast protein interaction network. Communities are candidates for biological modules. The work I present is the first to systematically investigate this structure at multiple scales in such networks. I develop novel tests to assess whether communities are functionally homogeneous, and demonstrate that almost every protein is found in a functionally homogeneous community at some scale. The evolution of protein sequences is well-studied, but comparatively little is known about the evolution of protein function. Such knowledge is needed to un- derstand when it is appropriate to annotate newly sequenced proteins by transferring functional information from homologs–i.e. evolutionarily related proteins. In the sec- ond part of my thesis, I assess the success of transferring protein-protein interactions across species and use this to estimate the rate at which interactions are lost in evolu- tion. At levels of sequence similarity associated with functional annotation transfer, I demonstrate that protein-protein interaction transfer is unreliable. The relevance of community structure for understanding protein function and the low conservation of individual interactions, suggests a possible role for communities in the evolution of cellular function. I discuss this possibility in my conclusions. Contents 1 Introduction 1 1.1 Overview.................................. 1 1.1.1 Systems biology and protein-protein interactions . ...... 1 1.1.2 Communities ........................... 2 1.1.3 Homology ............................. 3 1.1.4 Communities and homology of protein-protein interactions .. 3 1.2 Protein-proteininteractiondata . ... 4 1.2.1 Proteins .............................. 4 1.2.2 What is a protein-protein interaction? . 5 1.2.3 Experimentalprotocols. 6 1.2.3.1 Yeast-two-hybrid screens . 6 1.2.3.2 Purification and identification of complexes . 7 1.2.3.3 Othermethods ..................... 9 1.2.3.4 Literature curation of small-scale experiments . .. 9 1.2.3.5 Databases........................ 10 1.2.4 Choiceofdataset......................... 11 1.2.5 Estimatesoferrorrates. 13 1.2.6 Consideredasanetwork . 15 1.3 Protein-protein interactions and function . ...... 17 1.3.1 Predictingproteinfunction. 17 i 1.3.2 Degreedistribution . 18 1.3.3 Motifs ............................... 20 1.3.4 Communities ........................... 21 1.4 Protein-protein interactions and evolution . ...... 22 1.4.1 Evolutionary claims in the literature . 22 1.4.2 Interactions effecting rate of evolution . .. 23 1.4.3 Co-evolution of interacting proteins . .. 24 1.4.4 Models for protein-protein interaction network evolution ... 27 1.4.5 Between-species comparisons . 31 1.4.5.1 Interologs ........................ 31 1.4.5.2 PINalignment ..................... 35 1.5 Protein-protein interactions, function, and evolution.......... 35 2 Tools and techniques 38 2.1 Communitydetection. .. .. 38 2.1.1 Introduction............................ 38 2.1.2 Choosingamethod........................ 39 2.1.3 Modularitymaximisation. 40 2.1.3.1 Resolutionlimit. 42 2.1.3.2 A multi-resolution generalisation . 42 2.1.3.3 Algorithms for maximising modularity . 43 2.1.4 Other multi-resolution approaches . 44 2.1.4.1 Traditional clustering techniques . 44 2.1.4.2 k-cliquepercolation. 44 2.1.4.3 MCL........................... 45 2.1.4.4 Hierarchicalrandomgraphs . 45 2.1.4.5 Dynamical communities . 46 2.1.5 What next after detecting communities? . 46 ii 2.2 Comparingpartitions........................... 49 2.3 Assessing predictive ability . .. 53 2.4 Similarity measures based on structured ontologies . ....... 55 3 The Function of Communities in Protein-Protein Interaction Net- works at Multiple Scales 60 3.1 Introduction................................ 60 3.2 Datasetsanddataprocessing . 62 3.2.1 Protein-protein interaction data sets . .. 62 3.2.2 GO................................. 64 3.2.3 MIPS ............................... 65 3.2.4 Chemoinformatics screen of growth rates data . .. 65 3.3 Pairwisepropertiesofproteins . .. 66 3.4 Communities ............................... 67 3.4.1 Robustnessofpartitionsfound. 70 3.5 Examples of communities found at multiple resolutions . ....... 72 3.6 The standard assessment of biological relevance: functional enrichment 75 3.7 Functional homogeneity of communities . ... 76 3.7.1 Beyondpairwisemeasures . 83 3.8 Use of topological properties to select functionally homogeneous com- munities .................................. 85 3.9 Tracing the community membership of a particular protein...... 92 3.10 Investigating particular protein functions . ........ 94 3.10.1 Interactions and functional classes . ... 96 3.10.2 Distribution of functional classes in communities . ....... 98 3.10.3 Co-occurrence of functional classes in communities ...... 102 3.10.4 Functionally homogeneous communities and functional classes 103 3.11Conclusions ................................ 103 iii 4 What evidence is there for the homology of protein-protein interac- tions? 109 4.1 Introduction................................ 109 4.2 Datasetsanddataprocessing . 111 4.2.1 Protein-proteininteractiondata . 111 4.2.2 Homologydata .......................... 113 4.3 Interactions conserved across species . .... 118 4.3.1 Theevidence ........................... 118 4.3.2 Errorsintheinteractomedata. 128 4.3.2.1 Falsepositives . 130 4.3.2.2 Coverage of the source species interactions . 133 4.3.2.3 Coverage of the target species interactions . 133 4.3.3 Probability per million years that a duplicated interaction is lost141 4.3.4 Can one select the conserved interactions? . 143 4.4 Interactions conserved within species . .... 152 4.5 Conclusions ................................ 156 5 Conclusions and future directions 159 5.1 Communities ............................... 159 5.2 Homology ................................. 161 5.3 Communities and homology of protein-protein interactions ...... 162 A Methods for predicting protein interactions 165 A.1 Assessing prediction methods . 166 A.2 Network-basedmethods . 167 A.3 Genome-basedmethods. 168 A.3.1 Geneneighbourhood . 168 A.3.2 Genefusion ............................ 169 iv A.3.3 Phylogeneticprofile. 169 A.4 Sequence-based methods . 170 A.4.1 Interologs ............................. 170 A.4.2 Phylogenetictreemethods . 170 A.5 Structure-basedmethods . 170 A.6 Domain-basedmethods. 171 A.7 Summary ................................. 173 B Examples of community membership 175 C Examples of community membership following individual proteins 198 D Alternative sequence similarity scores for inferring interactions 207 Bibliography 213 v List of Figures 1.1 Growth of the IntAct protein-protein interaction database ...... 12 1.2 PINevolutionmodels. .......................... 28 1.3 ConceptofanInterolog.......................... 33 2.1 StructureoftheGeneOntology. 57 3.1 Communities identified in the A and P Networks ........... 68 3.2 Community structure of an Erd˝os-R´enyi random network and a syn- thetic network with strong community structure . .. 69 3.3 Robustness of community detection algorithm . .... 71 3.4 Examples of communities found . 73 3.5 Literature standard test for functional enrichment of communities . 77 3.6 Functional homogeneity of communities found . .... 79 3.7 Correlation of similarity measures . ... 81 3.8 Correlation of normalised and un-normalised GO similarity measures 82 3.9 Functional homogeneity of communities found: chains . ....... 86 3.10 Correlation of similarity measures: chains . ....... 87 3.11 Correlation of similarity measures: chains-based test compared to interactors- basedtest ................................. 88 3.12 Using topological properties to pick out functionally homogeneous com- munities .................................. 91 vi 3.13 Tracing the community membership of a particular protein: example 94 3.14 Tracing the community membership of a particular protein: further examples.................................. 95 3.15 GOslimandinteractingproteins . 99 3.16 The distribution of four GO slim terms throughout communities . 101 3.17 The fraction of interactions connecting proteins of different types found withincommunities. .. .. 104 3.18 Fraction of proteins of particular types in functionally homogeneous
Recommended publications
  • (PWP2) Gene Mapping to 21Q22.3 Ronald G
    Downloaded from genome.cshlp.org on October 2, 2021 - Published by Cold Spring Harbor Laboratory Press LETTER Isolation and Genomic Structure of a Human Homolog of the Yeast Periodic Tryptophan Protein 2 (PWP2) Gene Mapping to 21q22.3 Ronald G. Lafrenii~re, 1'4 Daniel L. Rochefort, 1 Nathalie Chr~tien, ~ Catherine E. Neville, 2 Robert G. Korneluk, 2 Lin Zuo, 3 Yalin Wei, 3 Jay Lichter, 3 and Guy A. Rouleau ~ 1Centre for Research in Neuroscience, McGill University, and Department of Neurology, Montreal General Hospital Research Institute, Montreal, Canada H3G 1A4; 2DNA Sequencing Core Facility, Canadian Genetic Diseases Network, Children's Hospital of Eastern Ontario, Ottawa, Canada K1H 8L1; 3Sequana Therapeutics, Inc., La Jolla, California 92037 As part of efforts to identify candidate genes for diseases mapping to the 21q22.3 region, we have assembled a 770-kb cosmid and BAC contig containing eight tightly linked markers. These cosmids and BACs were restriction mapped using eight rare cutting enzymes, with the goal of identifying CpG-rich islands. One such island was identified by the clustering of lqotl, Eagl, Sstll, and BssHIl sites, and corresponded to the Nod linking clone LJI04 described previously. A 7.6-kb l-lindlll fragment containing this CpG-rich island was subcloned and partially sequenced. A homology search using the sequence obtained from either side of the Nod site identified an expressed sequence tag with homology to the yeast periodic tryptophan protein 2 (PWP2). Several cDNAs corresponding to the human PWP2 gene were identified and partially sequenced. Northern blot analysis revealed a 3.3-kb transcript that was well expressed in all tissues tested.
    [Show full text]
  • Assembly and Annotation of an Ashkenazi Human Reference Genome
    bioRxiv preprint doi: https://doi.org/10.1101/2020.03.18.997395; this version posted March 18, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY 4.0 International license. Assembly and Annotation of an Ashkenazi Human Reference Genome Alaina Shumate1,2,† Aleksey V. Zimin1,2,† Rachel M. Sherman1,3 Daniela Puiu1,3 Justin M. Wagner4 Nathan D. Olson4 Mihaela Pertea1,2 Marc L. Salit5 Justin M. Zook4 Steven L. Salzberg1,2,3,6* 1Center for Computational Biology, Johns Hopkins University, Baltimore, MD 2Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 3Department of Computer Science, Johns Hopkins University, Baltimore, MD 4National Institute of Standards and Technology, Gaithersburg, MD 5Joint Initiative for Metrology in Biology, Stanford University, Stanford, CA 6Department of Biostatistics, Johns Hopkins University, Baltimore, MD †These authors contributed equally to this work. *Corresponding author. Email: [email protected] Abstract Here we describe the assembly and annotation of the genome of an Ashkenazi individual and the creation of a new, population-specific human reference genome. This genome is more contiguous and more complete than GRCh38, the latest version of the human reference genome, and is annotated with highly similar gene content. The Ashkenazi reference genome, Ash1, contains 2,973,118,650 nucleotides as compared to 2,937,639,212 in GRCh38. Annotation identified 20,157 protein-coding genes, of which 19,563 are >99% identical to their counterparts on GRCh38. Most of the remaining genes have small differences.
    [Show full text]
  • Simplified Computational Model for Generating Bio- Logical Networks
    Simplified Computational Model for Generating Bio- logical Networks† Matthew H J Bailey,∗ David Ormrod Morley,∗ and Mark Wilson∗ A method to generate and simulate biological networks is discussed. An expanded Wooten- Winer-Weaire bond switching methods is proposed which allows for a distribution of node degrees in the network while conserving the mean average node degree. The networks are characterised in terms of their polygon structure and assortativities (a measure of local ordering). A wide range of experimental images are analysed and the underlying networks quantified in an analogous manner. Limitations in obtaining the network structure are discussed. A “network landscape” of the experimentally observed and simulated networks is constructed from the underlying metrics. The enhanced bond switching algorithm is able to generate networks spanning the full range of experimental observations. Two dimensional random networks are observed in a range of 17 contexts across considerably different length scales in nature: to be limited to avoid damaging the delicate networks . How- from nanometres, in the form of amorphous graphene; to me- ever, even when a high-quality image is obtained, there is further tres, in the form of the Giant’s causeway; to tens of kilometres, in difficulty in analysis. For example, each edge in a network is a the form of geopolitical borders 1–3. A framework for describing complex molecule made up of tens of thousands of atoms which these continuous random networks for chemical systems was first may not lie strictly in a single plane. In addition, the most inter- introduced by Zachariasen to describe silica-like glasses, and has esting dynamic behaviour often occurs over very long timescales proved to be extremely versatile in the years since 4, being used to – potentially decades.
    [Show full text]
  • Identification and Characterization of TPRKB Dependency in TP53 Deficient Cancers
    Identification and Characterization of TPRKB Dependency in TP53 Deficient Cancers. by Kelly Kennaley A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Molecular and Cellular Pathology) in the University of Michigan 2019 Doctoral Committee: Associate Professor Zaneta Nikolovska-Coleska, Co-Chair Adjunct Associate Professor Scott A. Tomlins, Co-Chair Associate Professor Eric R. Fearon Associate Professor Alexey I. Nesvizhskii Kelly R. Kennaley [email protected] ORCID iD: 0000-0003-2439-9020 © Kelly R. Kennaley 2019 Acknowledgements I have immeasurable gratitude for the unwavering support and guidance I received throughout my dissertation. First and foremost, I would like to thank my thesis advisor and mentor Dr. Scott Tomlins for entrusting me with a challenging, interesting, and impactful project. He taught me how to drive a project forward through set-backs, ask the important questions, and always consider the impact of my work. I’m truly appreciative for his commitment to ensuring that I would get the most from my graduate education. I am also grateful to the many members of the Tomlins lab that made it the supportive, collaborative, and educational environment that it was. I would like to give special thanks to those I’ve worked closely with on this project, particularly Dr. Moloy Goswami for his mentorship, Lei Lucy Wang, Dr. Sumin Han, and undergraduate students Bhavneet Singh, Travis Weiss, and Myles Barlow. I am also grateful for the support of my thesis committee, Dr. Eric Fearon, Dr. Alexey Nesvizhskii, and my co-mentor Dr. Zaneta Nikolovska-Coleska, who have offered guidance and critical evaluation since project inception.
    [Show full text]
  • Prospects & Overviews Orthology Prediction Methods: a Quality Assessment Using Curated Protein Families
    Prospects & Overviews Methods, Models & Techniques Orthology prediction methods: A quality assessment using curated protein families Kalliopi Trachana1), Tomas A. Larsson1)2), Sean Powell1), Wei-Hua Chen1), Tobias Doerks1), Jean Muller3)4) and Peer Bork1)5)Ã The increasing number of sequenced genomes has Introduction prompted the development of several automated orthology prediction methods. Tests to evaluate the accuracy of pre- The analysis of fully sequenced genomes offers valuable insights into the function and evolution of biological systems dictions and to explore biases caused by biological and [1]. The annotation of newly sequenced genomes, comparative technical factors are therefore required. We used 70 man- and functional genomics, and phylogenomics depend on ually curated families to analyze the performance of five reliable descriptions of the evolutionary relationships of public methods in Metazoa. We analyzed the strengths protein families. All the members within a protein family and weaknesses of the methods and quantified the impact are homologous and can be further separated into orthologs, of biological and technical challenges. From the latter part which are genes derived through speciation from a single ancestral sequence, and paralogs, which are genes resulting of the analysis, genome annotation emerged as the largest from duplication events before and after speciation (out- and single influencer, affecting up to 30% of the performance. in-paralogy, respectively) [2, 3]. The large number of fully Generally, most methods did well in assigning orthologous sequenced genomes and the fundamental role of orthology group but they failed to assign the exact number of genes in modern biology have led to the development of a plethora of forhalfofthegroups.Thepublicly available benchmark set methods (e.g.
    [Show full text]
  • Supporting Information
    Supporting Information Figure S1. The functionality of the tagged Arp6 and Swr1 was confirmed by monitoring cell growth and sensitivity to hydeoxyurea (HU). Five-fold serial dilutions of each strain were plated on YPD with or without 50 mM HU and incubated at 30°C or 37°C for 3 days. Figure S2. Localization of Arp6 and Swr1 on chromosome 3. The binding of Arp6-FLAG (top), Swr1-FLAG (middle), and Arp6-FLAG in swr1 cells (bottom) are compared. The position of Tel 3L, Tel 3R, CEN3, and the RP gene are shown under the panels. Figure S3. Localization of Arp6 and Swr1 on chromosome 4. The binding of Arp6-FLAG (top), Swr1-FLAG (middle), and Arp6-FLAG in swr1 cells (bottom) in the whole chromosome region are compared. The position of Tel 4L, Tel 4R, CEN4, SWR1, and RP genes are shown under the panels. Figure S4. Localization of Arp6 and Swr1 on the region including the SWR1 gene of chromosome 4. The binding of Arp6- FLAG (top), Swr1-FLAG (middle), and Arp6-FLAG in swr1 cells (bottom) are compared. The position and orientation of the SWR1 gene is shown. Figure S5. Localization of Arp6 and Swr1 on chromosome 5. The binding of Arp6-FLAG (top), Swr1-FLAG (middle), and Arp6-FLAG in swr1 cells (bottom) are compared. The position of Tel 5L, Tel 5R, CEN5, and the RP genes are shown under the panels. Figure S6. Preferential localization of Arp6 and Swr1 in the 5′ end of genes. Vertical bars represent the binding ratio of proteins in each locus.
    [Show full text]
  • WO 2019/079361 Al 25 April 2019 (25.04.2019) W 1P O PCT
    (12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) (19) World Intellectual Property Organization I International Bureau (10) International Publication Number (43) International Publication Date WO 2019/079361 Al 25 April 2019 (25.04.2019) W 1P O PCT (51) International Patent Classification: CA, CH, CL, CN, CO, CR, CU, CZ, DE, DJ, DK, DM, DO, C12Q 1/68 (2018.01) A61P 31/18 (2006.01) DZ, EC, EE, EG, ES, FI, GB, GD, GE, GH, GM, GT, HN, C12Q 1/70 (2006.01) HR, HU, ID, IL, IN, IR, IS, JO, JP, KE, KG, KH, KN, KP, KR, KW, KZ, LA, LC, LK, LR, LS, LU, LY, MA, MD, ME, (21) International Application Number: MG, MK, MN, MW, MX, MY, MZ, NA, NG, NI, NO, NZ, PCT/US2018/056167 OM, PA, PE, PG, PH, PL, PT, QA, RO, RS, RU, RW, SA, (22) International Filing Date: SC, SD, SE, SG, SK, SL, SM, ST, SV, SY, TH, TJ, TM, TN, 16 October 2018 (16. 10.2018) TR, TT, TZ, UA, UG, US, UZ, VC, VN, ZA, ZM, ZW. (25) Filing Language: English (84) Designated States (unless otherwise indicated, for every kind of regional protection available): ARIPO (BW, GH, (26) Publication Language: English GM, KE, LR, LS, MW, MZ, NA, RW, SD, SL, ST, SZ, TZ, (30) Priority Data: UG, ZM, ZW), Eurasian (AM, AZ, BY, KG, KZ, RU, TJ, 62/573,025 16 October 2017 (16. 10.2017) US TM), European (AL, AT, BE, BG, CH, CY, CZ, DE, DK, EE, ES, FI, FR, GB, GR, HR, HU, ΓΕ , IS, IT, LT, LU, LV, (71) Applicant: MASSACHUSETTS INSTITUTE OF MC, MK, MT, NL, NO, PL, PT, RO, RS, SE, SI, SK, SM, TECHNOLOGY [US/US]; 77 Massachusetts Avenue, TR), OAPI (BF, BJ, CF, CG, CI, CM, GA, GN, GQ, GW, Cambridge, Massachusetts 02139 (US).
    [Show full text]
  • Biological Network Approaches and Applications in Rare Disease Studies
    G C A T T A C G G C A T genes Review Biological Network Approaches and Applications in Rare Disease Studies Peng Zhang 1,* and Yuval Itan 2,3 1 St. Giles Laboratory of Human Genetics of Infectious Diseases, Rockefeller Branch, The Rockefeller University, New York, NY 10065, USA 2 The Charles Bronfman Institute for Personalized Medicine, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA; [email protected] 3 Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA * Correspondence: [email protected]; Tel.: +1-646-830-6622 Received: 3 September 2019; Accepted: 10 October 2019; Published: 12 October 2019 Abstract: Network biology has the capability to integrate, represent, interpret, and model complex biological systems by collectively accommodating biological omics data, biological interactions and associations, graph theory, statistical measures, and visualizations. Biological networks have recently been shown to be very useful for studies that decipher biological mechanisms and disease etiologies and for studies that predict therapeutic responses, at both the molecular and system levels. In this review, we briefly summarize the general framework of biological network studies, including data resources, network construction methods, statistical measures, network topological properties, and visualization tools. We also introduce several recent biological network applications and methods for the studies of rare diseases. Keywords: biological network; bioinformatics; database; software; application; rare diseases 1. Introduction Network biology provides insights into complex biological systems and can reveal informative patterns within these systems through the integration of biological omics data (e.g., genome, transcriptome, proteome, and metabolome) and biological interactome data (e.g., protein-protein interactions and gene-gene associations).
    [Show full text]
  • Spatial Protein Interaction Networks of the Intrinsically Disordered Transcription Factor C(%3$
    Spatial protein interaction networks of the intrinsically disordered transcription factor C(%3$ Dissertation zur Erlangung des akademischen Grades Doctor rerum naturalium (Dr. rer. nat.) im Fach Biologie/Molekularbiologie eingereicht an der Lebenswissenschaftlichen Fakultät der Humboldt-Universität zu Berlin Von Evelyn Ramberger, M.Sc. Präsidentin der Humboldt-Universität zu Berlin Prof. Dr.-Ing.Dr. Sabine Kunst Dekan der Lebenswissenschaftlichen Fakultät der Humboldt-Universität zu Berlin Prof. Dr. Bernhard Grimm Gutachter: 1. Prof. Dr. Achim Leutz 2. Prof. Dr. Matthias Selbach 3. Prof. Dr. Gunnar Dittmar Tag der mündlichen Prüfung: 12.8.2020 For T. Table of Contents Selbstständigkeitserklärung ....................................................................................1 List of Figures ............................................................................................................2 List of Tables ..............................................................................................................3 Abbreviations .............................................................................................................4 Zusammenfassung ....................................................................................................6 Summary ....................................................................................................................7 1. Introduction ............................................................................................................8 1.1. Disordered proteins
    [Show full text]
  • Complementary Symbiont Contributions to Plant Decomposition in a Fungus-Farming Termite
    Complementary symbiont contributions to plant decomposition in a fungus-farming termite Michael Poulsena,1,2, Haofu Hub,1, Cai Lib,c, Zhensheng Chenb, Luohao Xub, Saria Otania, Sanne Nygaarda, Tania Nobred,3, Sylvia Klaubaufe, Philipp M. Schindlerf, Frank Hauserg, Hailin Panb, Zhikai Yangb, Anton S. M. Sonnenbergh, Z. Wilhelm de Beeri, Yong Zhangb, Michael J. Wingfieldi, Cornelis J. P. Grimmelikhuijzeng, Ronald P. de Vriese, Judith Korbf,4, Duur K. Aanend, Jun Wangb,j, Jacobus J. Boomsmaa, and Guojie Zhanga,b,2 aCentre for Social Evolution, Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark; bChina National Genebank, BGI-Shenzen, Shenzhen 518083, China; cCentre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, DK-1350 Copenhagen, Denmark; dLaboratory of Genetics, Wageningen University, 6708 PB, Wageningen, The Netherlands; eFungal Biodiversity Centre, Centraalbureau voor Schimmelcultures, Royal Netherlands Academy of Arts and Sciences, NL-3584 CT, Utrecht, The Netherlands; fBehavioral Biology, Fachbereich Biology/Chemistry, University of Osnabrück, D-49076 Osnabrück, Germany; gCenter for Functional and Comparative Insect Genomics, Department of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark; hDepartment of Plant Breeding, Wageningen University and Research Centre, NL-6708 PB, Wageningen, The Netherlands; iDepartment of Microbiology, Forestry and Agricultural Biotechnology Institute, University of Pretoria, Pretoria SA-0083, South Africa; and jDepartment of Biology, University of Copenhagen, DK-2100 Copenhagen, Denmark Edited by Ian T. Baldwin, Max Planck Institute for Chemical Ecology, Jena, Germany, and approved August 15, 2014 (received for review October 24, 2013) Termites normally rely on gut symbionts to decompose organic levels-of-selection conflicts that need to be regulated (12).
    [Show full text]
  • Peregrine and Saker Falcon Genome Sequences Provide Insights Into Evolution of a Predatory Lifestyle
    LETTERS OPEN Peregrine and saker falcon genome sequences provide insights into evolution of a predatory lifestyle Xiangjiang Zhan1,7, Shengkai Pan2,7, Junyi Wang2,7, Andrew Dixon3, Jing He2, Margit G Muller4, Peixiang Ni2, Li Hu2, Yuan Liu2, Haolong Hou2, Yuanping Chen2, Jinquan Xia2, Qiong Luo2, Pengwei Xu2, Ying Chen2, Shengguang Liao2, Changchang Cao2, Shukun Gao2, Zhaobao Wang2, Zhen Yue2, Guoqing Li2, Ye Yin2, Nick C Fox3, Jun Wang5,6 & Michael W Bruford1 As top predators, falcons possess unique morphological, 16,263 genes were predicted for F. peregrines, and 16,204 were pre- physiological and behavioral adaptations that allow them to be dicted for F. cherrug (Supplementary Table 6 and Supplementary successful hunters: for example, the peregrine is renowned as Note). Approximately 92% of these genes were functionally annotated the world’s fastest animal. To examine the evolutionary basis of using homology-based methods (Supplementary Table 7). predatory adaptations, we sequenced the genomes of both the Comparative genome analysis was carried out to assess evolution peregrine (Falco peregrinus) and saker falcon (Falco cherrug), and innovation within falcons using related genomes with comparable All rights reserved. and we present parallel, genome-wide evidence for evolutionary assembly quality (Supplementary Table 8). Orthologous genes were innovation and selection for a predatory lifestyle. The genomes, identified in the chicken, zebra finch, turkey, peregrine and saker assembled using Illumina deep sequencing with greater than using the program TreeFam3. For the five genome-enabled avian spe- 100-fold coverage, are both approximately 1.2 Gb in length, cies, a maximum-likelihood phylogeny using 861,014 4-fold degen- with transcriptome-assisted prediction of approximately 16,200 erate sites from 6,267 single-copy orthologs confirmed that chicken America, Inc.
    [Show full text]
  • De Novo Genome Assembly and Analysis of Non- Allelic Recombination in Pathogenic Yeast Candida Glabrata
    DE NOVO GENOME ASSEMBLY AND ANALYSIS OF NON- ALLELIC RECOMBINATION IN PATHOGENIC YEAST CANDIDA GLABRATA by Zhuwei Xu A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy Baltimore, Maryland November 2019 © 2019 Zhuwei Xu All Rights Reserved Abstract Candida glabrata is an opportunistic pathogen in humans, responsible for approximately 20% of disseminated candidiasis. C. glabrata’s ability to adhere to host tissue is mediated by GPI- anchored cell wall proteins (GPI-CWPs); the corresponding genes contain long tandem repeat regions and form large gene families. These tandem repeats cause mis-assemblies of GPI-CWP genes in C. glabrata genome. Subtelomeres of C. glabrata are particularly rich in GPI-CWP genes and share homology with each other. Consequently, the subtelomeres are mis-assembled in genome sequences assembled from short sequencing reads. In this thesis, we used the long single-molecule real time (SMRT) reads and performed de novo genome assembly of the C. glabrata genome to establish the correct structure of GPI-CWP genes and the subtelomeres. We assembled the genome of six C. glabrata strains: the type strain, CBS138; our lab strain, BG2; four serial clinical isolates, BG3993-96 to assess genome changes during infection. With high quality sequences in hand, we then assess recombinational exchange between GPI-CWP genes by non-allelic mitotic recombination. This question is difficult to address with normal aligners, and we developed a k-mer based method to identify recombination. Our assembly established the correct subtelomere structure of Candida glabrata and provides correct structure of the GPI-CWP gene families.
    [Show full text]