Species Tree Inference and Update on Very Large Datasets Using Approximation, Randomization, Parallelization, and Vectorization

Total Page:16

File Type:pdf, Size:1020Kb

Species Tree Inference and Update on Very Large Datasets Using Approximation, Randomization, Parallelization, and Vectorization Species tree inference and update on very large datasets using approximation, randomization, parallelization, and vectorization Siavash Mirarab Electrical and Computer Engineering University of California at San Diego 1 Phylogenetic reconstruction from data Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG Gorilla ACTGCACACCG Human ACTGC-CCCCG Chimpanzee AATGC-CCCCG Orangutan -CTGCACACGG D 2 Phylogenetic reconstruction from data CTGCACACCG CTGCACACCG CTGCACACGG Gorilla Human Chimpanzee Orangutan ACTGCACACCG ACTGCCCCCG AATGCCCCCG CTGCACACGG Orangutan Chimpanzee Gorilla ACTGCACACCG Human ACTGC-CCCCG Chimpanzee AATGC-CCCCG Orangutan -CTGCACACGG Gorilla Human D P (D T ) T | 2 Applications: HIV forensic Texas case Washington case Scaduto et al., PNAS, 2010 3 Applications: microbiome https://www.nytimes.com/2017/11/06/well/live/ unlocking-the-secrets-of-the-microbiome.html 4 Applications: microbiome https://www.nytimes.com/2017/11/06/well/live/ unlocking-the-secrets-of-the-microbiome.html Morgan, Xochitl C., Nicola Segata, and Curtis Huttenhower. "Trends in genetics (2013) 4 Applications: food safety Tracking the source of a listeriosis outbreak Jackson, Brendan R., et al. Reviews of Infectious Diseases (2016) 5 Fig. 3. Molecular dating of the 2014 outbreak. (A) BEAST dating of the separation of the 2014 lineage from Middle African lineages (SL = Sierra Leone; GN = Guinea; DRC = Democratic Republic of Congo; tMRCA: Sep 2004, 95% HPD: Oct 2002 - May 2006). (B) BEAST dating of the tMRCA of the 2014 West African outbreak (tMRCA: Feb 23, 95% HPD: Jan 27 - Mar 14) and the tMRCA of the Sierra Leone lineages (tMRCA: Apr 23, 95% HPD: Apr 2 - May 13); probability distributions for both 2014 divergence events overlayed below. Posterior support for major nodes is shown. Applications: ebola source: Gire et al., Science, 2014 6 Sequence data growth data source: NCBI (http://www.ncbi.nlm.nih.gov/genbank/statistics) 500 400 300 200 sequences (millions) 100 0 1984 1987 1990 1993 1996 1999 2002 2005 2008 2011 2014 2017 • Rapid growth in the available sequences 7 Sequence data growth data source: NCBI (http://www.ncbi.nlm.nih.gov/genbank/statistics) 500 Whole Genomes (sequences) ACTGCACACCG Individual Genes (sequences) 400 Chromosome (hundredsGene 300 of letters) 200 sequences (millions) 100 0 Genome 1984 1987 1990 1993 1996 1999 2002 2005 2008 2011 2014 2017 (billions of letters) • Rapid growth in the • Our focus has shifted available sequences to “whole genomes” 7 Billions of More genomic regions columns More sequences More A million rows 8 Phylogenetic reconstruction : a computational nightmare • Almost all problems are NP-Hard! • The search space is huge. • Focusing on topology alone, there are (2n-5)!! trees with n leaves • We also care about branch lengths: ℝ2n-3 • We are interested in n in 101…106 range Tandy Warnow 9 Dealing with uncertainty • Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support 10 Dealing with uncertainty • Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support • Genome evolution is complex. We need complex statistical models for accurate inference 10 Dealing with uncertainty • Really no “experimental” way to validate results. Thus, we use computation-heavy procedures to calculate statistical support • Genome evolution is complex. We need complex statistical models for accurate inference • We need extensive simulations to test methods 10 To scale to large datasets … • Approximate and heuristic solutions 11 To scale to large datasets … • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search 11 To scale to large datasets … • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search • Develop optimized code. 11 How about accuracy? Increased data can make problems easier, but … 12 How about accuracy? Increased data can make problems easier, but … • Larger datasets often • Are used to answer harder problems • Allow the use of more complex models • Riddled with erroneous data 12 How about accuracy? Increased data can make problems easier, but … • Larger datasets often • Are used to answer harder problems • Allow the use of more complex models • Riddled with erroneous data • Often, methods lose their accuracy on large datasets 12 Examples of improving scalability 13 To scale to large datasets • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search • Develop optimized code. 14 ASTRAL • Optimization problem (NP-Hard): Find the median tree among a set of input trees Distance between two trees := the number of quartet trees they do not share 15 ASTRAL • Optimization problem (NP-Hard): Find the median tree among a set of input trees Distance between two trees := the number of quartet trees they do not share • ASTRAL: an exact solution using a dynamic programming algorithm • Solves the problem exactly on a constrained search space 15 Choosing Constraints • Restricted search space should • Not compromise statistical consistency • Be large enough that accuracy is not sacrificed • Be small for scalability Speciestreeerror Number of genes 16 ASTRAL evolution • Search space should ideally grow polynomially with the dataset size • Tandy talked about ASTRAL-I and ASTRAL-II, which do not such have guarantees • ASTRAL-III [Zhang et al., 2018] • Guarantees polynomial size search space • Increases the speed of the dynamic programming 17 Changing the number of species (n) ● ● 1M 1000 500k 52 hours ● ● ● ● 200k 100 100k 1.09 1.83 Size of X Size 50k ● ● ● 10 ● Running time (minutes) 20k ● Search space size Search 10k ● Running time (minutes) 1 200 500 1000 2000 5000 10000 200 500 1000 2000 5000 10000 #Species #Species Simulations: n = 200 to 10,000 ,k = 1000 gene trees Empirical running time seems to increase close to O(n 1.8) [Zhang et al., 2018] 18 Changing the number of genes (k) 18 2 215 217 17 hours 216 0.777 2.09 210 15 e of X 2 z si 14 2 25 Running time (seconds) 213 Search space size Search Running time (minutes) 28 29 210 211 212 213 214 28 29 210 211 212 213 214 #Genes #Genes Simulations: n = 48 species, k = 256 to 16,384 gene trees Empirical running time seems to increase close to O(k 2) [Zhang et al., 2018] 19 To scale to large datasets • Approximate and heuristic solutions • Make the problem easier • Divide-and-conquer • Constrained search • Develop optimized code. 20 Profiling ASTRAL-III Running time (minutes) Running time (minutes) Many species Many genes 21 Further scaling improvements Chao Zhang • Developed a randomized algorithm for a key step in the dynamic programming. For a set X of subsets of an alphabet, find: {(A, B)|A, B ∈ X, A ∪ B ∈ X, A ∩ B = ∅} • Using, Abelian group theory, we can do compute this in time quadratic in |X|. • Has an astronomically low probability of failure 22 Further scaling improvements Chao Zhang • Developed a randomized algorithm for a key step in the dynamic programming. For a set X of subsets of an alphabet, find: {(A, B)|A, B ∈ X, A ∪ B ∈ X, A ∩ B = ∅} • Using, Abelian group theory, we can do compute this in time quadratic in |X|. • Has an astronomically low probability of failure • Implemented the kernel in C++ and added vectorization. 22 Improved scalability Running time (minutes) Running time (minutes) Many species Many genes 23 Parallelize the main step of the dynamic programming (blue) Running time (minutes) Running time (minutes) Many species Many genes 24 Scaling + GPU John Yin 108 ● • Can analyze datasets with III − 10,000 species and 1000 ● ● 73 ● genes in less than a day ● given 24 cores and a GPU ● ● 31 ● • ● No GPU https://github.com/ ● ● 1 GPU Speedup versus ASTRAL Speedup versus ● smirarab/ASTRAL/tree/ 4 ● 1 2 4 6 8 12 16 20 24 MP-similarity CPU cores Many genes 25 Bacteroidia Thermodesulfobacteria, Ca. Dadabacteria Epsilonproteobacteria Ca. Dependentiae Nitrospirae_1, Nitrospinae, Ca. Schekmanbacteria, Ca. NC10 Flavobacteriia Archaea Ca. Calescamantes Asgard group Ca.Aminicenantes, Ca. Fischerbacteria TACK group Cytophagia Crenarchaeota ChitinophagiaSphingobacteriia Alphaproteobacteria Euryarchaeota Saprospiria DPANN group Phycisphaerae, Planctomycetia_1, Planctomycetia_2 Acidobacteriia Ca. Omnitrophica, Ca. Firestonebacteria, Ca. Desantisbacteria SAR324 CPR Deferribacteres, Chrysiogenetes Ca. Marinimicrobia, Calditrichaeota Nitrospirae_2 Verrucomicrobiae_1, Opitutae, Lentisphaerae Microgenomates group Ca. Tectomicrobia Gemmatimonadetes, Ca. Glassbacteria, Ca. Eisenbacteria Aquificae ChlamydiaeFibrobacteria Parcubacteria group Ca. WOR-3, Ca. Edwardsbacteria Ca. Delongbacteria Balneolaeota Qiyun Zhu Rob Knight Ca. Latescibacteria, Ca. Zixibacteria Ignavibacteria Eubacteria Ca. Cloacimonetes Chlorobi Betaproteobacteria Ca. Hyd24-12 Ca. Kryptonia G Terrabacteria group a m Actinobacteria m Spirochaetes Acidithiobacillia a Firmicutes p Ca. Hydrogenedentes r o Chloroflexi Ca. Coatesbacteria t e Cyanobacteria Elusimicrobia
Recommended publications
  • Systema Naturae 2000 (Phylum, 6 Nov 2017)
    The Taxonomicon Systema Naturae 2000 Classification of Domain Bacteria (prokaryotes) down to Phylum Compiled by Drs. S.J. Brands Universal Taxonomic Services 6 Nov 2017 Systema Naturae 2000 - Domain Bacteria - Domain Bacteria Woese et al. 1990 1 Genus †Eoleptonema Schopf 1983, incertae sedis 2 Genus †Primaevifilum Schopf 1983, incertae sedis 3 Genus †Archaeotrichion Schopf 1968, incertae sedis 4 Genus †Siphonophycus Schopf 1968, incertae sedis 5 Genus Bactoderma Tepper and Korshunova 1973 (Approved Lists 1980), incertae sedis 6 Genus Stibiobacter Lyalikova 1974 (Approved Lists 1980), incertae sedis 7.1.1.1.1.1 Superphylum "Proteobacteria" Craig et al. 2010 1.1 Phylum "Alphaproteobacteria" 1.2.1 Phylum "Acidithiobacillia" 1.2.2.1 Phylum "Gammaproteobacteria" 1.2.2.2.1 Candidate phylum Muproteobacteria (RIF23) Anantharaman et al. 2016 1.2.2.2.2 Phylum "Betaproteobacteria" 2 Phylum "Zetaproteobacteria" 7.1.1.1.1.2 Phylum "Deltaproteobacteria_1" 7.1.1.1.2.1.1.1 Phylum "Deltaproteobacteria" [polyphyletic] 7.1.1.1.2.1.1.2.1 Phylum "Deltaproteobacteria_2" 7.1.1.1.2.1.1.2.2 Phylum "Deltaproteobacteria_3" 7.1.1.1.2.1.2 Candidate phylum Dadabacteria (CSP1-2) Hug et al. 2015 7.1.1.1.2.2.1 Candidate phylum "MBNT15" 7.1.1.1.2.2.2 Candidate phylum "Uncultured Bacterial Phylum 10 (UBP10)" Parks et al. 2017 7.1.1.2.1 Phylum "Nitrospirae_1" 7.1.1.2.2 Phylum Chrysiogenetes Garrity and Holt 2001 7.1.2.1.1 Phylum "Nitrospirae" Garrity and Holt 2001 [polyphyletic] 7.1.2.1.2.1.1 Candidate phylum Rokubacteria (CSP1-6) Hug et al.
    [Show full text]
  • Libros Sobre Enfermedades Autoinmunes: Tratamientos, Tipos Y Diagnósticos- Profesor Dr
    - LIBROS SOBRE ENFERMEDADES AUTOINMUNES: TRATAMIENTOS, TIPOS Y DIAGNÓSTICOS- PROFESOR DR. ENRIQUE BARMAIMON- 9 TOMOS- AÑO 2020.1- TOMO VI- - LIBROS SOBRE ENFERMEDADES AUTOINMUNES: TRATAMIENTOS, TIPOS Y DIAGNÓSTICOS . AUTOR: PROFESOR DR. ENRIQUE BARMAIMON.- - Doctor en Medicina.- - Cátedras de: - Anestesiología - Cuidados Intensivos - Neuroanatomía - Neurofisiología - Psicofisiología - Neuropsicología. - 9 TOMOS - - TOMO VI - -AÑO 2020- 1ª Edición Virtual: (.2020. 1)- - MONTEVIDEO, URUGUAY. 1 - LIBROS SOBRE ENFERMEDADES AUTOINMUNES: TRATAMIENTOS, TIPOS Y DIAGNÓSTICOS- PROFESOR DR. ENRIQUE BARMAIMON- 9 TOMOS- AÑO 2020.1- TOMO VI- - Queda terminantemente prohibido reproducir este libro en forma escrita y virtual, total o parcialmente, por cualquier medio, sin la autorización previa del autor. -Derechos reservados. 1ª Edición. Año 2020. Impresión [email protected]. - email: [email protected].; y [email protected]; -Montevideo, 15 de enero de 2020. - BIBLIOTECA VIRTUAL DE SALUD del S. M.U. del URUGUAY; y BIBLIOTECA DEL COLEGIO MÉDICO DEL URUGUAY. 0 0 0 0 0 0 0 0. 2 - LIBROS SOBRE ENFERMEDADES AUTOINMUNES: TRATAMIENTOS, TIPOS Y DIAGNÓSTICOS- PROFESOR DR. ENRIQUE BARMAIMON- 9 TOMOS- AÑO 2020.1- TOMO VI- - TOMO V I - 3 - LIBROS SOBRE ENFERMEDADES AUTOINMUNES: TRATAMIENTOS, TIPOS Y DIAGNÓSTICOS- PROFESOR DR. ENRIQUE BARMAIMON- 9 TOMOS- AÑO 2020.1- TOMO VI- - ÍNDICE.- - TOMO I . - - ÍNDICE. - PRÓLOGO.- - INTRODUCCIÓN. - CAPÍTULO I: -1)- GENERALIDADES. -1.1)- DEFINICIÓN. -1.2)- CAUSAS Y FACTORES DE RIESGO. -1.2.1)- FACTORES EMOCIONALES. -1.2.2)- FACTORES AMBIENTALES. -1.2.3)- FACTORES GENÉTICOS. -1.3)- Enterarse aquí, como las 10 Tipos de semillas pueden mejorar la salud. - 1.4)- TIPOS DE TRATAMIENTO DE ENFERMEDADES AUTOINMUNES. -1.4.1)- Remedios Naturales. -1.4.1.1)- Mejorar la Dieta.
    [Show full text]
  • A Metagenomics Roadmap to the Uncultured Genome Diversity in Hypersaline Soda Lake Sediments Charlotte D
    Vavourakis et al. Microbiome (2018) 6:168 https://doi.org/10.1186/s40168-018-0548-7 RESEARCH Open Access A metagenomics roadmap to the uncultured genome diversity in hypersaline soda lake sediments Charlotte D. Vavourakis1 , Adrian-Stefan Andrei2†, Maliheh Mehrshad2†, Rohit Ghai2, Dimitry Y. Sorokin3,4 and Gerard Muyzer1* Abstract Background: Hypersaline soda lakes are characterized by extreme high soluble carbonate alkalinity. Despite the high pH and salt content, highly diverse microbial communities are known to be present in soda lake brines but the microbiome of soda lake sediments received much less attention of microbiologists. Here, we performed metagenomic sequencing on soda lake sediments to give the first extensive overview of the taxonomic diversity found in these complex, extreme environments and to gain novel physiological insights into the most abundant, uncultured prokaryote lineages. Results: We sequenced five metagenomes obtained from four surface sediments of Siberian soda lakes with a pH 10 and a salt content between 70 and 400 g L−1. The recovered 16S rRNA gene sequences were mostly from Bacteria,evenin the salt-saturated lakes. Most OTUs were assigned to uncultured families. We reconstructed 871 metagenome-assembled genomes (MAGs) spanning more than 45 phyla and discovered the first extremophilic members of the Candidate Phyla Radiation (CPR). Five new species of CPR were among the most dominant community members. Novel dominant lineages were found within previously well-characterized functional groups involved in carbon, sulfur, and nitrogen cycling. Moreover, key enzymes of the Wood-Ljungdahl pathway were encoded within at least four bacterial phyla never previously associated with this ancient anaerobic pathway for carbon fixation and dissimilation, including the Actinobacteria.
    [Show full text]
  • Metabolic Roles of Uncultivated Bacterioplankton Lineages in the Northern Gulf of Mexico 2 “Dead Zone” 3 4 J
    bioRxiv preprint doi: https://doi.org/10.1101/095471; this version posted June 12, 2017. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC 4.0 International license. 1 Metabolic roles of uncultivated bacterioplankton lineages in the northern Gulf of Mexico 2 “Dead Zone” 3 4 J. Cameron Thrash1*, Kiley W. Seitz2, Brett J. Baker2*, Ben Temperton3, Lauren E. Gillies4, 5 Nancy N. Rabalais5,6, Bernard Henrissat7,8,9, and Olivia U. Mason4 6 7 8 1. Department of Biological Sciences, Louisiana State University, Baton Rouge, LA, USA 9 2. Department of Marine Science, Marine Science Institute, University of Texas at Austin, Port 10 Aransas, TX, USA 11 3. School of Biosciences, University of Exeter, Exeter, UK 12 4. Department of Earth, Ocean, and Atmospheric Science, Florida State University, Tallahassee, 13 FL, USA 14 5. Department of Oceanography and Coastal Sciences, Louisiana State University, Baton Rouge, 15 LA, USA 16 6. Louisiana Universities Marine Consortium, Chauvin, LA USA 17 7. Architecture et Fonction des Macromolécules Biologiques, CNRS, Aix-Marseille Université, 18 13288 Marseille, France 19 8. INRA, USC 1408 AFMB, F-13288 Marseille, France 20 9. Department of Biological Sciences, King Abdulaziz University, Jeddah, Saudi Arabia 21 22 *Correspondence: 23 JCT [email protected] 24 BJB [email protected] 25 26 27 28 Running title: Decoding microbes of the Dead Zone 29 30 31 Abstract word count: 250 32 Text word count: XXXX 33 34 Page 1 of 31 bioRxiv preprint doi: https://doi.org/10.1101/095471; this version posted June 12, 2017.
    [Show full text]
  • A Cellulose Loosening Protein Is One of the Most 1 Widely Distributed Tools
    bioRxiv preprint doi: https://doi.org/10.1101/637728; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 1 From morphogenesis to pathogenesis: A cellulose loosening protein is one of the most 2 widely distributed tools in nature 3 4 William R. Chase1 5 Olga Zhaxybayeva2,3 6 Jorge Rocha4,† 7 Daniel J. Cosgrove1 8 Lori R. Shapiro4,§ 9 10 Corresponding author: Lori Shapiro ([email protected]) 11 12 1 Department of Biology, Pennsylvania State University, University Park, PA, USA, 16801 13 2 Department of Biological Sciences, Dartmouth College, Hanover, NH, USA 03755 14 3 Department of Computer Science, Dartmouth College, Hanover, NH, USA 03755 15 4 Department of Microbiology and Immunology, Harvard Medical School, 77 Avenue Louis 16 Pasteur, Boston MA, USA 02115 17 † Current Address: Conacyt-Centro de Investigación y Desarrollo en Agrobiotecnología 18 Alimentaria, San Agustin Tlaxiaca, Mexico 42162 19 § Current Address: Department of Applied Ecology, North Carolina State University, 100 Brooks 20 Avenue, Raleigh, NC, USA 27695 21 bioRxiv preprint doi: https://doi.org/10.1101/637728; this version posted May 16, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.
    [Show full text]
  • De Bruijn Graphs Face Harsh Realities of Assembly What’S Inside the Cell Nucleus? Contents of the Nucleus
    How Do Biologists Assemble Genomes? ! Graph Algorithms Phillip Compeau and Pavel Pevzner Bioinformatics Algorithms: An Active Learning Approach ©2015 by Compeau and Pevzner. All rights reserved. The Newspaper Problem The Newspaper Problem The Newspaper Problem The Newspaper Problem The Newspaper Problem The Newspaper Problem Outline • What Is Genome Sequencing? • The String Reconstruction Problem • String Reconstruction as a Walk in the Overlap Graph • Another Graph for String Reconstruction • The Seven Bridges of Konigsberg • Euler’s Theorem • De Bruijn Graphs Face Harsh Realities of Assembly What’s inside the cell nucleus? Contents of the Nucleus • The nucleus contains chromosomes. • Humans have 23 pairs of! chromosomes (one in each! pair comes from each parent).! • But what are chromosomes! made of? DNA: The Building Block of Life • One more zoom, and we reach the molecular level. • Early 1950s: Researchers start! uncovering properties of ! chromosomal substance,! now called “deoxyribose! nucleic acid”: DNA • 1953: Watson and Crick publish! “double helix” structure of DNA. Molecular Structure of DNA DNAs Double Helix DNAs Molecular Structure Molecular Structure of DNA • Nucleotide: Half of one! “rung” of DNA. • Four choices for the nucleic acid of a nucleotide: 1. Adenine (A) 2. Cytosine (C) 3. Guanine (G)—bonds to C 4. Thymine (T)—bonds to A DNAs Molecular Structure Why is DNA Important? Central Dogma of Molecular Biology: DNA is transcribed into RNA, which is then translated into protein (chain of amino acids). Courtesy: Rachel Raynes!
    [Show full text]
  • Lawrence Berkeley National Laboratory Recent Work
    Lawrence Berkeley National Laboratory Recent Work Title Genomic resolution of a cold subsurface aquifer community provides metabolic insights for novel microbes adapted to high CO2 concentrations. Permalink https://escholarship.org/uc/item/400782gj Journal Environmental microbiology, 19(2) ISSN 1462-2912 Authors Probst, Alexander J Castelle, Cindy J Singh, Andrea et al. Publication Date 2017-02-01 DOI 10.1111/1462-2920.13362 Peer reviewed eScholarship.org Powered by the California Digital Library University of California Genomic resolution of a cold subsurface aquifer community provides metabolic insights for novel microbes adapted to high CO2 concentrations Alexander J. Probst,1 Cindy J. Castelle,1 Andrea Singh,1 Christopher T. Brown,2 Karthik Anantharaman,1 Itai Sharon,1† Laura A. Hug,1† David Burstein,1 Joanne B. Emerson,1§ Brian C. Thomas1 and Jillian F. Banfield1,3,4* 1Department of Earth and Planetary Sciences, University of California, Berkeley, 307 McCone Hall, CA 94720, USA. 2Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA. 3Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA, USA. 4Earth Science Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA, USA. *For correspondence. E-mail [email protected]; Tel. (+1) 510 316 4334; Fax (+1) 510 643 9980 Present addresses: † Migal – Galilee Research Institute, South Industrial Zone, Kiryat Shmona 11016, Israel, and Tel Hai College, M.P. Upper Galilee 12210, Israel; ‡ Department of Biology, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1; § Department of Microbiology, The Ohio State University, 496 West 12th Ave., Columbus, OH 43210, USA.
    [Show full text]
  • 290 Metagenome-Assembled Genomes from the Mediterranean Sea: Ongoing Effort to 2 Generate Genomes from the Tara Oceans Dataset 3 4 Authors 5 Benjamin J
    bioRxiv preprint doi: https://doi.org/10.1101/069484; this version posted August 16, 2016. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 290 Metagenome-assembled Genomes from the Mediterranean Sea: Ongoing Effort to 2 Generate Genomes from the Tara Oceans Dataset 3 4 Authors 5 Benjamin J. Tully1‡, Rohan Sachdeva2, Elaina D. Graham2, and John F. Heidelberg1,2 6 7 ‡ - corresponding author, [email protected] 8 1 – Center for Dark Energy Biosphere Investigations, University of Southern California, Los 9 Angeles, CA 90089 10 2 – Department of Biological Sciences, University of Southern California, Los Angeles, CA 11 90089 12 13 Abstract 14 The Tara Oceans Expedition has provided large, publicly-accessible microbial metagenomic 15 datasets from a circumnavigation of the globe. Utilizing several size fractions from the samples 16 originating in the Mediterranean Sea, we have used current assembly and binning techniques to 17 reconstruct 290 putative high-quality metagenome-assembled bacterial and archaeal genomes, 18 with an estimated completion of ≥50%, and an additional 2,786 bins, with estimated completion 19 of 0-50%. We have submitted our results, including initial taxonomic and phylogenetic 20 assignments for the putative high-quality genomes, to open-access repositories (iMicrobe and 21 FigShare) for the scientific community to use in ongoing research. 22 23 Introduction 24 Microorganisms are a major constituent of the biology within the world’s oceans and act as the 25 important linchpins in all major global biogeochemical cycles1.
    [Show full text]
  • This Is a Pre-Copyedited, Author-Produced Version of an Article Accepted for Publication, Following Peer Review
    This is a pre-copyedited, author-produced version of an article accepted for publication, following peer review. Spang, A.; Stairs, C.W.; Dombrowski, N.; Eme, E.; Lombard, J.; Caceres, E.F.; Greening, C.; Baker, B.J. & Ettema, T.J.G. (2019). Proposal of the reverse flow model for the origin of the eukaryotic cell based on comparative analyses of Asgard archaeal metabolism. Nature Microbiology, 4, 1138–1148 Published version: https://dx.doi.org/10.1038/s41564-019-0406-9 Link NIOZ Repository: http://www.vliz.be/nl/imis?module=ref&refid=310089 [Article begins on next page] The NIOZ Repository gives free access to the digital collection of the work of the Royal Netherlands Institute for Sea Research. This archive is managed according to the principles of the Open Access Movement, and the Open Archive Initiative. Each publication should be cited to its original source - please use the reference as presented. When using parts of, or whole publications in your own work, permission from the author(s) or copyright holder(s) is always needed. 1 Article to Nature Microbiology 2 3 Proposal of the reverse flow model for the origin of the eukaryotic cell based on 4 comparative analysis of Asgard archaeal metabolism 5 6 Anja Spang1,2,*, Courtney W. Stairs1, Nina Dombrowski2,3, Laura Eme1, Jonathan Lombard1, Eva Fernández 7 Cáceres1, Chris Greening4, Brett J. Baker3 and Thijs J.G. Ettema1,5* 8 9 1 Department of Cell- and Molecular Biology, Science for Life Laboratory, Uppsala University, SE-75123, 10 Uppsala, Sweden 11 2 NIOZ, Royal Netherlands Institute for Sea Research, Department of Marine Microbiology and 12 Biogeochemistry, and Utrecht University, P.O.
    [Show full text]
  • Genomic Resolution of a Cold Subsurface Aquifer Community Provides Metabolic Insights for Novel Microbes Adapted to High
    Genomic resolution of a cold subsurface aquifer community provides metabolic insights for novel microbes adapted to high CO2 concentrations Alexander J. Probst,1 Cindy J. Castelle,1 Andrea Singh,1 Christopher T. Brown,2 Karthik Anantharaman,1 Itai Sharon,1† Laura A. Hug,1† David Burstein,1 Joanne B. Emerson,1§ Brian C. Thomas1 and Jillian F. Banfield1,3,4* 1Department of Earth and Planetary Sciences, University of California, Berkeley, 307 McCone Hall, CA 94720, USA. 2Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA. 3Department of Environmental Science, Policy, and Management, University of California, Berkeley, CA, USA. 4Earth Science Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA, USA. *For correspondence. E-mail [email protected]; Tel. (+1) 510 316 4334; Fax (+1) 510 643 9980 Present addresses: † Migal – Galilee Research Institute, South Industrial Zone, Kiryat Shmona 11016, Israel, and Tel Hai College, M.P. Upper Galilee 12210, Israel; ‡ Department of Biology, University of Waterloo, 200 University Avenue West, Waterloo, Ontario, Canada N2L 3G1; § Department of Microbiology, The Ohio State University, 496 West 12th Ave., Columbus, OH 43210, USA. Summary As in many deep underground environments, the microbial communities in subsurface high‐CO2 ecosystems remain relatively unexplored. Recent investigations based on single‐gene assays revealed a remarkable variety of organisms from little studied phyla in Crystal Geyser (Utah, USA), a site where deeply sourced CO2‐saturated fluids are erupted at the surface. To provide genomic resolution of the metabolisms of these organisms, we used a novel metagenomic approach to recover 227 high‐quality genomes from 150 microbial species affiliated with 46 different phylum‐level lineages.
    [Show full text]
  • Differential Depth Distribution of Microbial Function and Putative Symbionts Through Sediment-Hosted Aquifers in the Deep Terrestrial Subsurface
    UC Berkeley UC Berkeley Previously Published Works Title Differential depth distribution of microbial function and putative symbionts through sediment-hosted aquifers in the deep terrestrial subsurface. Permalink https://escholarship.org/uc/item/95x122xv Journal Nature microbiology, 3(3) ISSN 2058-5276 Authors Probst, Alexander J Ladd, Bethany Jarett, Jessica K et al. Publication Date 2018-03-01 DOI 10.1038/s41564-017-0098-y Peer reviewed eScholarship.org Powered by the California Digital Library University of California ARTICLES https://doi.org/10.1038/s41564-017-0098-y Differential depth distribution of microbial function and putative symbionts through sediment-hosted aquifers in the deep terrestrial subsurface Alexander J. Probst1,5,7, Bethany Ladd2,7, Jessica K. Jarett3, David E. Geller-McGrath1, Christian M. K. Sieber1,3, Joanne B. Emerson1,6, Karthik Anantharaman1, Brian C. Thomas1, Rex R. Malmstrom3, Michaela Stieglmeier4, Andreas Klingl4, Tanja Woyke 3, M. Cathryn Ryan 2* and Jillian F. Banfield 1* An enormous diversity of previously unknown bacteria and archaea has been discovered recently, yet their functional capacities and distributions in the terrestrial subsurface remain uncertain. Here, we continually sampled a CO2-driven geyser (Colorado Plateau, Utah, USA) over its 5-day eruption cycle to test the hypothesis that stratified, sandstone-hosted aquifers sampled over three phases of the eruption cycle have microbial communities that differ both in membership and function. Genome- resolved metagenomics, single-cell genomics and geochemical analyses confirmed this hypothesis and linked microorganisms to groundwater compositions from different depths. Autotrophic Candidatus “Altiarchaeum sp.” and phylogenetically deep- branching nanoarchaea dominate the deepest groundwater. A nanoarchaeon with limited metabolic capacity is inferred to be a potential symbiont of the Ca.
    [Show full text]
  • The Rich Repertoire of Antibiotic Resistance Genes in Candidate Phyla Radiation Genomes
    bioRxiv preprint doi: https://doi.org/10.1101/2021.07.02.450847; this version posted July 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 1 Title: Small and equipped: the rich repertoire of antibiotic resistance genes in Candidate 2 Phyla Radiation genomes 3 Running title: Microbial competition: CPR are players 4 Authors list: Mohamad Maatouk1,2, Ahmad Ibrahim1,2, Jean-Marc Rolain1,2, Vicky 5 Merhej1,2*, Fadi Bittar1,2* 6 Affiliations : 1 Aix-Marseille Univ, IRD, APHM, MEPHI, 27 boulevard Jean Moulin, 13385 Marseille 7 CEDEX 05, France. 8 2IHU Méditerranée Infection, 19-21 boulevard Jean Moulin, 13385 Marseille CEDEX 05, France 9 * Corresponding author: Fadi Bittar. Email: [email protected]; ORCID: 0000-0003-4052- 10 344X. 11 Vicky Merhej. Email: [email protected] 12 13 14 15 Abstract: 256 words 16 Text: 4,005 words 17 Reference: 40 18 Figures: 5 figures and 2 supplementary figures 19 Tables: 1 supplementary table 20 21 22 1 bioRxiv preprint doi: https://doi.org/10.1101/2021.07.02.450847; this version posted July 2, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. 23 ABSTRACT 24 Microbes belonging to Candidate Phyla Radiation (CPR) have joined the tree of life as a new 25 unique branch, thanks to the intensive application of metagenomics and advances of 26 sequencing technologies.
    [Show full text]