Lawrence Berkeley National Laboratory Recent Work

Title: 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life.
Permalink: https://escholarship.org/uc/item/7cx5710p
Journal: Nature biotechnology, 35(7)
ISSN: 1087-0156
Authors: Mukherjee, Supratim; Seshadri, Rekha; Varghese, Neha J; et al.
Publication Date: 2017-07-01
DOI: 10.1038/nbt.3886
Peer reviewed

RESOURCE OPEN

Supratim Mukherjee¹,¹⁰, Rekha Seshadri¹,¹⁰, Neha J Varghese¹, Emiley A Eloe-Fadrosh¹, Jan P Meier-Kolthoff², Markus Göker², R Cameron Coates¹,⁹, Michalis Hadjithomas¹, Georgios A Pavlopoulos¹, David Paez-Espino¹, Yasuo Yoshikuni¹, Axel Visel¹, William B Whitman³, George M Garrity⁴,⁵, Jonathan A Eisen⁶, Philip Hugenholtz⁷, Amrita Pati¹,⁹, Natalia N Ivanova¹, Tanja Woyke¹, Hans-Peter Klenk⁸ & Nikos C Kyrpides¹

We present 1,003 reference genomes that were sequenced as part of the Genomic Encyclopedia of Bacteria and Archaea (GEBA) initiative, selected to maximize sequence coverage of phylogenetic space. These genomes double the number of existing type strains and expand their overall phylogenetic diversity by 25%. Comparative analyses with previously available finished and draft genomes reveal a 10.5% increase in novel protein families as a function of phylogenetic diversity. The GEBA genomes recruit 25 million previously unassigned metagenomic proteins from 4,650 samples, improving their phylogenetic and functional interpretation. We identify numerous biosynthetic clusters and experimentally validate a divergent phenazine cluster with potential new chemical structure and antimicrobial activity. This Resource is the largest single release of reference genomes to date.
Bacterial and archaeal isolate sequence space is still far from saturated, and future endeavors in this direction will continue to be a valuable resource for scientific discovery.

Systematic surveys of the diversity of cultivated microorganisms have lagged behind improvements in sequencing technologies. Traditionally, most isolate sequencing projects are chosen based on the clinical or biotechnological relevance of the target organisms or their physiology¹. In 2015, 43% of sequenced bacterial genomes comprised just ten human pathogenic species. While sequencing different strains of the same species aided our understanding of pathogenesis, the focus on specific bacterial species results in a biased phylogenetic representation of sequence space. This skewed phylogeny narrowed our view of the functional and evolutionary diversity of microbial life. There is a direct correlation between phylogenetic distance and novel function discovery²,³, which suggests that filling the gaps in the phylogenetic tree might result in a substantial increase in new genes, protein families and pathways⁴.

Reference genomes can fill phylogenetic gaps, but also serve as anchors for the identification of sequence fragments from metagenomic studies. Previous efforts to expand the bacterial and archaeal reference genomes by targeted sequencing of phylogenetically underrepresented lineages have enabled vast improvements in taxonomic assignment in metagenomic data sets⁵. Furthermore, access to completed genomes enables more accurate whole-genome-based taxonomic assignments⁶,⁷ and improved phylogenies⁸,⁹.

Bacterial and archaeal type strains are the representative unit of a microbial species, and are chosen when the species name is established. Type strains are maintained in at least two different culture collections and provide easy access to source strain material for subsequent experiments. Typically, a type strain has well-characterized taxonomic and phenotypic data, isolation source metadata, and other criteria, as defined by the International Code of Nomenclature of Prokaryotes (ICNP)¹⁰. As of December 5, 2015, there were 12,981 bacterial and archaeal species with valid, published names, with 650 new type strains added (on average) every year¹¹,¹². However, despite their importance, the genomes of only 826 type strains were publicly available at the start of this study.

The Genomic Encyclopedia of Bacteria and Archaea (GEBA) pilot project presented the analysis of 56 type-strain genomes and validated the usefulness of a phylogeny-driven 'encyclopedia' of bacteria and archaea³. We now present a substantially expanded data set (GEBA-I) comprising 1,003 reference genomes from 974 bacterial and 29 archaeal type strains. Our objectives were to provide an expanded reference genome catalog of broad phylogenetic and physiological diversity, to determine how this catalog facilitates the discovery of protein families and expands the diversity of known functions, and to ascertain whether these type-strain genomes improve the recruitment and phylogenetic assignment of existing metagenomic sequences.

¹Department of Energy, Joint Genome Institute, Walnut Creek, California, USA.
²Leibniz Institute DSMZ - German Collection of Microorganisms and Cell Cultures, Braunschweig, Germany.
³Department of Microbiology, University of Georgia, Athens, Georgia, USA.
⁴Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, Michigan, USA.
⁵NamesforLife, LLC, East Lansing, Michigan, USA.
⁶University of California Davis Genome Center, Davis, California, USA.
⁷Australian Centre for Ecogenomics, The University of Queensland, Brisbane, Queensland, Australia.
⁸School of Biology, Newcastle University, Newcastle upon Tyne, UK.
⁹Present addresses: Zymergen Inc., Emeryville, California, USA (R.C.C.) and Roche Molecular Systems Inc., Pleasanton, California, USA (A.P.).
¹⁰These authors contributed equally to this work.
Correspondence should be addressed to N.C.K. ([email protected]).
Received 8 November 2016; accepted 21 April 2017; published online 12 June 2017; doi:10.1038/nbt.3886
Nature Biotechnology, advance online publication.
© 2017 Nature America, Inc., part of Springer Nature. All rights reserved.

RESULTS

Increased phylogenetic diversity of microbial genomes

974 bacterial and 29 archaeal genomes (from 579 genera in 21 phyla and 43 classes) were sequenced as part of the GEBA Initiative (GEBA-I), using a phylogeny-based scoring system for strain selection⁶,¹³. Of the 1,003 genomes presented, 396 GEBA-I genomes were the first sequenced representative of a genus (Fig. 1a). The Caldithrixae, Deferribacteres, Synergistetes and Thermodesulfobacteria (Fig. 1a) phyla have the most new genera. The most populous phyla, in terms of numbers of genomes sequenced, were the Proteobacteria (with 330 genomes), Firmicutes (178), Bacteroidetes (163) and Actinobacteria (157). The remaining 175 genomes belonged to 17 additional phyla, including the only sequenced representative of the Caldithrixae phylum (Supplementary Table 1). The GEBA-I strains originate from a multitude of habitats including extreme environments, terrestrial biomes, industrial waste and human body sites (Supplementary Fig. 1) and unsurprisingly have diverse physiology, genome size and average G+C content (Supplementary Fig. 2). GEBA-I is a high-quality reference resource with 99.4% (on average) genome completeness (assessed using CheckM¹⁴; Supplementary Table 1). Annotation of the 1,003 GEBA-I genomes resulted in 3,472,483 predicted genes from 3.75 Gbp of assembled sequence data (Supplementary Fig. 3 and Supplementary Table 1). All GEBA-I genomes are publicly available through the Integrated Microbial Genomes with Microbiomes (IMG/M) system¹⁵ and GenBank, and the corresponding strains through the respective culture collection (Supplementary Table 1).

To quantify the increase in phylogenetic diversity contributed by GEBA-I genomes compared with all previously available, validly named archaeal and bacterial species (i.e., type strains), we measured the diversity distance of all sequenced type strains in a comprehensive 16S rRNA gene tree⁶. The GEBA-I genomes increased the phylogenetic distance threefold, expanding the overall diversity of the type-strain sequence space by ~24% (Fig. 1b). Further, we applied a whole-genome comparative analysis based on the average nucleotide identity to verify the relative novelty of the GEBA-I genomes compared to a set of 14,625 control genomes. We found that the vast majority (845/1,003) of the GEBA-I genomes were 'singletons' on the basis of the proposed criteria for defining a "species group"⁷, verifying that no other sequenced representative of that species is available.

[Figure 1. Panel a: phylum-level tree; legend entries: phylum containing GEBA-I genome, phylum without GEBA-I genome, prior to GEBA-I, new genus added by GEBA-I. Panel b: cumulative percentage of total bRPD (0 to 100); labels: before GEBA-I, other type strains.]

Expanding the universe of known proteins

A total of 3,402,887 protein-coding sequences were predicted from
