A Phylogeny-Driven Genomic Encyclopaedia of Bacteria and Archaea
Vol 462 | 24/31 December 2009 | doi:10.1038/nature08656
LETTERS

Dongying Wu1,2, Philip Hugenholtz1, Konstantinos Mavromatis1, Rüdiger Pukall3, Eileen Dalin1, Natalia N. Ivanova1, Victor Kunin1, Lynne Goodwin4, Martin Wu5, Brian J. Tindall3, Sean D. Hooper1, Amrita Pati1, Athanasios Lykidis1, Stefan Spring3, Iain J. Anderson1, Patrik D’haeseleer1,6, Adam Zemla6, Mitchell Singer2, Alla Lapidus1, Matt Nolan1, Alex Copeland1, Cliff Han4, Feng Chen1, Jan-Fang Cheng1, Susan Lucas1, Cheryl Kerfeld1, Elke Lang3, Sabine Gronow3, Patrick Chain1,4, David Bruce4, Edward M. Rubin1, Nikos C. Kyrpides1, Hans-Peter Klenk3 & Jonathan A. Eisen1,2

1DOE Joint Genome Institute, Walnut Creek, California 94598, USA. 2University of California, Davis, Davis, California 95616, USA. 3DSMZ, German Collection of Microorganisms and Cell Cultures, 38124 Braunschweig, Germany. 4DOE Joint Genome Institute-Los Alamos National Laboratory, Los Alamos, California 87545, USA. 5University of Virginia, Charlottesville, Virginia 22904, USA. 6Lawrence Livermore National Laboratory, Livermore, California 94550, USA.
©2009 Macmillan Publishers Limited.

Sequencing of bacterial and archaeal genomes has revolutionized our understanding of the many roles played by microorganisms1. There are now nearly 1,000 completed bacterial and archaeal genomes available2, most of which were chosen for sequencing on the basis of their physiology. As a result, the perspective provided by the currently available genomes is limited by a highly biased phylogenetic distribution3–5. To explore the value added by choosing microbial genomes for sequencing on the basis of their evolutionary relationships, we have sequenced and analysed the genomes of 56 culturable species of Bacteria and Archaea selected to maximize phylogenetic coverage. Analysis of these genomes demonstrated pronounced benefits (compared to an equivalent set of genomes randomly selected from the existing database) in diverse areas including the reconstruction of phylogenetic history, the discovery of new protein families and biological properties, and the prediction of functions for known genes from other organisms. Our results strongly support the need for systematic ‘phylogenomic’ efforts to compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria and Archaea’ in order to derive maximum knowledge from existing microbial genome data as well as from genome sequences to come.

Since the publication of the first complete bacterial genome, sequencing of the microbial world has accelerated beyond expectations. The inventory of bacterial and archaeal isolates with complete or draft sequences is approaching the two thousand mark2. Most of these genome sequences are the product of studies in which one or a few isolates were targeted because of an interest in a specific characteristic of the organism. Although large-scale multi-isolate genome sequencing studies have been performed, they have tended to be focused on particular habitats or on the relatives of specific organisms. This overall lack of broad phylogenetic considerations in the selection of microbial genomes for sequencing, combined with a cultivation bottleneck6, has led to a strongly biased representation of recognized microbial phylogenetic diversity3–5. Although some projects have attempted to correct this (for example, see ref. 5), they have all been small in scope. To evaluate the potential benefits of a more systematic effort, we embarked on a pilot project to sequence approximately 100 genomes selected solely for their phylogenetic novelty: the ‘Genomic Encyclopedia of Bacteria and Archaea’ (GEBA).

Organisms were selected on the basis of their position in a phylogenetic tree of small subunit (SSU) ribosomal RNA, the best sampled gene from across the tree of life7. Working from the root to the tips of the tree, we identified the most divergent lineages that lacked representatives with sequenced genomes (completed or in progress)8 and for which a species has been formally described9 and a type strain designated and deposited in a publicly accessible culture collection10. From hundreds of candidates, 200 type strains were selected both to obtain broad coverage across Bacteria and Archaea and to perform in-depth sampling of a single phylum. The Gram-positive bacterial phylum Actinobacteria was chosen for the latter purpose because of the availability of many phylogenetically and phenotypically diverse cultured strains, and because it had the lowest percentage of sequenced isolates of any phylum (1% versus an average of 2.3%)11. Of the 200 targeted isolates, 159 were designated as ‘high’ priority primarily on the basis of phylum-level novelty and the ability to obtain microgram quantities of high quality DNA. The genomes of these 159 are being sequenced, assembled, annotated (including recommended metadata12) and finished, and relevant data are being released through a dedicated Integrated Microbial Genomes database portal13 and deposited into GenBank. Currently, data from 106 genomes (62 of which are finished) are available.

To assess the ramifications of this tree-based selection of organisms, we focused our analyses on the first 56 genomes for which the shotgun phase of sequencing was completed. The 53 bacteria and 3 archaea (Supplementary Table 1) represent both a broad sampling of bacterial diversity and a deeper sampling of the phylum Actinobacteria (26 GEBA genomes). An initial question we addressed was whether selection on the basis of phylogenetic novelty of SSU rRNA genes reliably identifies genomes that are phylogenetically novel on the basis of other criteria. This question arises because it is known that single genes, even SSU rRNA genes, do not perfectly predict genome-wide phylogenetic patterns14,15. To investigate this, we created a ‘genome tree’ (ref. 16) of completed bacterial genomes (Fig. 1) and then measured the relative contribution of the GEBA project using the phylogenetic diversity metric17. We found that the 53 GEBA bacteria accounted for 2.8–4.4 times more phylogenetic diversity than randomly sampled subsets of 53 non-GEBA bacterial genomes. A similar degree of improvement in phylogenetic diversity was seen for the more intensively sampled actinobacteria (Table 1). These analyses indicate that although SSU rRNA genes are not a perfect indicator of organismal evolution, their phylogenetic relationships are a sound predictor of phylogenetic novelty within the universal gene core present in bacterial genomes.
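The comparison of phylogenetic diversity between the targeted genomes and random subsets of other genomes can be illustrated with a minimal Python sketch. It assumes the metric is the total branch length of the rooted subtree spanning a set of taxa; the text does not spell out the exact formulation, so this is an assumption. The toy tree, the taxon names (`g1`, `e1`, and so on) and both function names are hypothetical and are not data or code from the study.

```python
import random

# Hypothetical toy tree (not data from the paper): child -> (parent, branch length).
# e1..e4 stand in for previously sequenced genomes, g1 and g2 for targeted ones.
TREE = {
    "n1": ("root", 0.5),
    "e1": ("n1", 0.25),
    "e2": ("n1", 0.25),
    "e3": ("n1", 0.25),
    "e4": ("root", 0.75),
    "g1": ("root", 2.0),
    "g2": ("root", 1.5),
}


def phylogenetic_diversity(tree, taxa):
    """Total length of all branches on the paths from the given taxa to the root.

    Each branch is counted once, even when several taxa share it.
    """
    counted = set()
    total = 0.0
    for taxon in taxa:
        node = taxon
        # Walk towards the root; stop at the root or at a branch already counted,
        # because everything above a counted branch has been counted as well.
        while node in tree and node not in counted:
            counted.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def mean_random_pd(tree, pool, k, trials=1000, seed=0):
    """Average diversity of `trials` random subsets of size k drawn from `pool`."""
    rng = random.Random(seed)
    values = [phylogenetic_diversity(tree, rng.sample(pool, k)) for _ in range(trials)]
    return sum(values) / trials


if __name__ == "__main__":
    targeted = ["g1", "g2"]
    existing = ["e1", "e2", "e3", "e4"]
    pd_targeted = phylogenetic_diversity(TREE, targeted)
    pd_random = mean_random_pd(TREE, existing, k=len(targeted))
    print("targeted PD:", pd_targeted)
    print("mean random PD:", pd_random)
    print("ratio:", pd_targeted / pd_random)
```

In this toy case the targeted taxa sit on long, unshared branches, so their diversity exceeds that of equally sized random subsets of the clustered taxa. That is the kind of ratio the text reports, here on invented numbers.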
[Figure: phylogenetic tree of bacterial genomes, apparently the genome tree cited as Fig. 1. The taxon labels (for example Clostridium, Mycoplasma, Lactobacillus and Streptococcus species) are not recoverable from the extracted text.]