<<

Vol 462 | 24/31 December 2009 | doi:10.1038/nature08656 LETTERS

A phylogeny-driven genomic encyclopaedia of and Archaea

Dongying Wu1,2, Philip Hugenholtz1, Konstantinos Mavromatis1,Ru¨diger Pukall3, Eileen Dalin1, Natalia N. Ivanova1, Victor Kunin1, Lynne Goodwin4, Martin Wu5, Brian J. Tindall3, Sean D. Hooper1, Amrita Pati1, Athanasios Lykidis1, Stefan Spring3, Iain J. Anderson1, Patrik D’haeseleer1,6, Adam Zemla6, Mitchell Singer2, Alla Lapidus1, Matt Nolan1, Alex Copeland1, Cliff Han4, Feng Chen1, Jan-Fang Cheng1, Susan Lucas1, Cheryl Kerfeld1, Elke Lang3, Sabine Gronow3, Patrick Chain1,4, David Bruce4, Edward M. Rubin1, Nikos C. Kyrpides1, Hans-Peter Klenk3 & Jonathan A. Eisen1,2

Sequencing of bacterial and archaeal genomes has revolutionized our gene from across the tree of life7. Working from the root to the tips of understanding of the many roles played by microorganisms1.There the tree, we identified the most divergent lineages that lacked repre- are now nearly 1,000 completed bacterial and archaeal genomes sentatives with sequenced genomes (completed or in progress)8 and available2, most of which were chosen for sequencing on the basis for which a has been formally described9 and a type strain of their physiology. As a result, the perspective provided by designated and deposited in a publicly accessible culture collection10. the currently available genomes is limited by a highly biased phylo- From hundreds of candidates, 200 type strains were selected both to genetic distribution3–5. To explore the value added by choosing obtain broad coverage across Bacteria and Archaea and to perform microbial genomes for sequencing on the basis of their evolutionary in-depth sampling of a single phylum. The Gram-positive bacterial relationships, we have sequenced and analysed the genomes of 56 phylum was chosen for the latter purpose because of culturable species of Bacteria and Archaea selected to maximize the availability of many phylogenetically and phenotypically diverse phylogenetic coverage. Analysis of these genomes demonstrated cultured strains, and because it had the lowest percentage of sequenced pronounced benefits (compared to an equivalent set of genomes isolates of any phylum (1% versus an average of 2.3%)11.Ofthe200 randomly selected from the existing database) in diverse areas includ- targeted isolates, 159 were designated as ‘high’ priority primarily on the ing the reconstruction of phylogenetic history, the discovery of basis of phylum-level novelty and the ability to obtain microgram quan- new protein families and biological properties, and the prediction tities of high quality DNA. The genomes of these 159 are being of functions for known genes from other organisms. Our results sequenced, assembled, annotated (including recommended metadata12) strongly support the need for systematic ‘phylogenomic’ efforts to and finished, and relevant data are being released through a dedicated compile a phylogeny-driven ‘Genomic Encyclopedia of Bacteria Integrated Microbial Genomes database portal13 and deposited into and Archaea’ in order to derive maximum knowledge from exist- GenBank. Currently, data from 106 genomes (62 of which are finished) ing microbial genome data as well as from genome sequences to are available. come. To assess the ramifications of this tree-based selection of organisms, Since the publication of the first complete bacterial genome, we focused our analyses on the first 56 genomes for which the shotgun sequencing of the microbial world has accelerated beyond expecta- phase of sequencing was completed. The 53 bacteria and 3 archaea tions. The inventory of bacterial and archaeal isolates with complete or (Supplementary Table 1) represent both a broad sampling of bacterial draft sequences is approaching the two thousand mark2. Most of these diversity and a deeper sampling of the phylum Actinobacteria (26 genome sequences are the product of studies in which one or a few GEBA genomes). An initial question we addressed was whether selec- isolates were targeted because of an interest in a specific characteristic tion on the basis of phylogenetic novelty of SSU rRNA genes reliably of the organism. Although large-scale multi-isolate genome sequen- identifies genomes that are phylogenetically novel on the basis of other cing studies have been performed, they have tended to be focused on criteria. This question arises because it is known that single genes, even particular habitats or on the relatives of specific organisms. This over- SSU rRNA genes, do not perfectly predict genome-wide phylogenetic all lack of broad phylogenetic considerations in the selection of micro- patterns14,15. To investigate this, we created a ‘genome tree’ (ref. 16) of bial genomes for sequencing, combined with a cultivation bottleneck6, completed bacterial genomes (Fig. 1) and then measured the relative has led to a strongly biased representation of recognized microbial contribution of the GEBA project using the phylogenetic diversity phylogenetic diversity3–5. Although some projects have attempted to metric17. We found that the 53 GEBA bacteria accounted for 2.8–4.4 correct this (for example, see ref. 5), they have all been small in scope. times more phylogenetic diversity than randomly sampled subsets of To evaluate the potential benefits of a more systematic effort, we 53 non-GEBA bacterial genomes. A similar degree of improvement in embarked on a pilot project to sequence approximately 100 genomes phylogenetic diversity was seen for the more intensively sampled acti- selected solely for their phylogenetic novelty: the ‘Genomic nobacteria (Table 1). These analyses indicate that although SSU rRNA Encyclopedia of Bacteria and Archaea’ (GEBA). genes are not a perfect indicator of organismal evolution, their phylo- Organisms were selected on the basis of their position in a phylo- genetic relationships are a sound predictor of phylogenetic novelty genetic tree of small subunit (SSU) ribosomal RNA, the best sampled within the universal gene core present in bacterial genomes.

1DOE Joint Genome Institute, Walnut Creek, California 94598, USA. 2University of California, Davis, Davis, California 95616, USA. 3DSMZ, German Collection of Microorganisms and Cell Cultures, 38124 Braunschweig, Germany. 4DOE Joint Genome Institute-Los Alamos National Laboratory, Los Alamos, California 87545, USA. 5University of Virginia, Charlottesville, Virginia 22904, USA. 6Lawrence Livermore National Laboratory, Livermore, California 94550, USA. 1056 ©2009 Macmillan Publishers Limited. All rights reserved NATURE | Vol 462 | 24/31 December 2009 LETTERS

365

A C A Clost aldicellulosiruptor saccharolyticus

Clostridium bot B

Clostridium C Candidatus Ther 5334 C Carbox ridium botulinum T C

1435 Clostridium thermocellum Clostridium phytofermentans A Al K11 moanaer 1 B4 LC

kaliph 3 SC 0 Pelotomaculum thermopropioni 26 S S Clostridium kluyve Clostridium p cus s 5 i ydothermus hydrogenoforman Alkalip magna 7 r 1 JC ri

Symbiobacterium thermophilum KBA 0 Natranaerob 4 o Desulfotomaculum reducens 255 15 str Desulfotom

arius s ga acetobutyli 3K i Desulforudis audaxvi ilus metalliredigens m HTE8 2 ulinum HTA4 6b M Heliobacterium m obacter tengc cus 6 ns 3 9 bul De NCIMB 8052 C3 hilus oremland 125 oo Clostrid 1 sis

cre yti 83 Clostridium difficile Anaer ne sulfitobact Clostridium novy ocald 5 H3 r C sakei 4571 ella Y MGAS1 ibiricum yen V A3 str Loch Maree ha 745 s B str Eklund 17B bsp e UA15 33323 p 5 s ophilus e subsp 118 erfringens SM K lis ius thermophilus acid t t aemol i C SAFR 032 hermoacetica a c su K haericus 05Z sub ococcus prevotii ium tet h en DPC urans s ih culum acetoxidans um CC ii TCC 334 is tans g nste eca cki 3956 Ureaplasma parvum d TC s sp subsp A us U 9 M Myc ri TCC 2 illu A u ct i e suis yo ue agalactiaes ATCC 29328 ATCC 824 s yco ongensis cus mu 1 ihe tic DSM 555 FS1 r alo A acillus la p cterium KM20 IFO Mycoplasma ium hafn c la iu ATCC 27405 umilus br Mycoplasma arthritidi h shimeri serovar eri oplasma pulmo c i l s p wolfei o DSM 8903 ase ani us plasma i p lve

ba 1 OhILA e SU desticaldum str 13 cillus co Mycoplasma synoviae obac F275 WC ator MP104C Q c s del i us cus ccus cu a o ass livar clob w r acillus kaus ccus ISDg P 1 CC 2377 E88 uo c i YMF ill an e As i illus V ib V lus M NT co o lus sa t um i illu TCC 367 Mycoplasma mobile st ATCC 39073 c e illus we hyl co lus 630 ococcus fa il osace

ter yellows witches mycoides b Mycopla eillonell licy coc cillus sakei AT ycoplasma pneumoniae MB ac JW c c o c llus he iense s Z 2901 eria a A ph s 1 ac o c citreum 21 cum Clostridium beijerinckii A r Goett s Exig B s u Ba ermentum O ysin ter 13 MI 1 Geob st M B Ba ct epto 74 Fusobacterium nucleatum hyopneumoniae IAM 14863 L n ob us oen pent lantar mo 2 Stap sto us reu evi sp BA DSM 1394 BP serovary 3 str NM W 4 Li a baci l p J 10 fl a coplasma E toba bacillus g l s rrenum Y51 SI L treptoc o i My Str c us s e e 1017 nis a Ice1 Strept act ctobacil o no B sma genitalium S a t u olzii subsp o PCC 3 coplasm parvula ingen L a r ther 2 L lus br urantiac 2 IC1 101 UAB CTIP L ctobacillus ngatus S Candi N LF Lact a um t us Lac nococc acill l tenh e M s PG2 L b u JA elo 158L3 1 Leuc c MBs I 2 Mesoplasmamy galliseptic Oe a laidlawii 53 Lactobac obacil cas sp 75 roomdatus ph a penetr Lactobacillus f iolac 1 3 coides ATCC 7009 163K Pediococcacto v ccu osiphonus a r PCC 7310 Streptobaci 7448 L mobt x C 700 43 Lact e oco 8 630 MP1 Phytoplasm M129 G37 Dehalococcoid S rpe 1142 Therma s SC str PG1 Sphaerobactee oris marinaiforme 5 r CC Det ytoplasma AYW ans um Ther PCC t u H loroflexus aurantiacus sp PC NIE s 1 s b nct 1 s sp Leptotrichia buccali florum HF 2 R Roseifl ochl tu 2 u hiosulfovibrio nucleatum peptidovor 70 Ch echococcusy u sp PCC 6803 307 naerovibrio acidaminovorans Sebaldellallu termitidis Gloeobacter C T 9 arin s m rmosynech p ATCC str CCMP1986 Fervidobacterium a Syn e coccus RC m L1 Aca o s oris Thermotoga oniliformi m Th str MIbsp t T Thermo PG 8A al s aeruginosaus elonga sp CC9902C9311 u hermosipho melanesiens i Nostoc pech cc C s NATL2Apas ATCC 25586 B Trichodesmiumthece erythraeum sp s 1 Syn co sp str marinus toga maritima s Synechocystis coccus s arinu subsp Cyano cho o cus inus lettingae s e s m ar no Microcysti hococcusoc u s DSM 994 Petrotoga mobilis Syn c oc arinus dosum ans ococcu B Synechyne ch cocc uchnera aphidicola TMO S lor o Thermus Meiothermu thermophilu Meiothermus ruber Rt1 MSB8 ococcus m Buchnera aphidicola Deinococcus radiodurans Syneroch hlor is 7 B1 P er xylanophilulum Wigglesworthia glossinidia BI429 Proc ct u ducens Buch SJ95 Prochlor re s m parv an ner str APS s silvanus Prochlorococcusu motrini 7 a aphidicola Rubroba nta 8 2 str heli e 705 Buchnera aphidicola Acyrthosiphon p s Conexibacter woeseia l TW0 Sg HB8 Atopobickia hell NCC2411 end str BpSchizaphis Sla ert ium ferrooxidpplei K Candidatus R1 gg i DSM C7109 13129 Baumannia cicade osymbiont E icrob m Ba ryptobacteriumu mcurtum longum NCT Candidatus izongia graminumpistaciaeisum C m jeikeium Blochmannia penn str Cc Acidim cteri of Glossina m urealyticu YS 3140216 llinicola BlochmanniaCinar flo Tropheryma wh iphtheriae Bifidoba SRS3 str Hc brevipalpisa cedri orynebacteriuebacteriu sylvanicus str C bacterium d Homalod Coryn ActinobacillusShigel succinoge ridanus nae isca coagulata Coryne caver la flexne BPEN Corynebacteriumoccus radiotoleransa efficiens ducreyiri Kineoc bergi losilytica 2a str 301 ellu Beuten nas c nes 130Z Cellulomonas flavigena Aeromonas salmoni 35000HP ylanimo PhotobacteriumVibrio pro fische X Sanguibacter keddieiientarius O395 bsp michiganensis cida subsp ri ES114 Jonesiaoccus denitrificans sedum faecium su Shewanella sediminissalmonicidafundum Kytoc iganensis S SS9 str CTCB07 hewan A449 Brachybacteri xyli ella frigidimarina HAW EB3 Clavibacter mich subsp Psychromonas ingrahamii ila DC2201 ATCC 33209 NCIMB 400 Leifsoniaia rh xyliizoph Colwellia psyc Kocur Pseudoalterom 37 ibacterium salmoninarum onas hrerythraea Ren sp FB24 KPA171202 haloplanktis 34H Arthrobacter rium acnes Idiomarina ioihiensis TAC125 Propionibacte Pseudoalteromonas atlantica L2TR a flavida Kribbell sp JS614 T6c Nocardioides a Kangiella k oreensis Catenulispora acidiphil A3 2 Marinomonas sp MWYL1 Chromoha Streptomyces coelicolor us 11B lobacter salexigens DSM 3043 Acidothermus cellulolytic Hahella chejuensis KCTC 2396 Thermobispora bispora Marinobacter aquaeolei VT8 Streptosporangium roseum Cellvibrio japonicus Ueda107 Thermomonospora curvata 2 40 Thermobifida fusca YX Saccharophagus degradans Nocardiopsis pv phaseolicola 1448A dassonvillei Pseudomonas syringae s 273 4 Frankia sp CcI3 Salinispor Psychrobacter arcticu sp PRwf 1 a arenicola C Psychrobacter Stackebrandtia nassauensNS 205 sp ADP1 Geodermatophilus obscurusis Acinetobacter SK2 Nakamurell Alcanivorax borkumensis RSA 331 Actinosynnemaa mult mirumipartita str Lens Saccharomonospora viridis anii HA gionella pneumophila Saccharopolyspor Le XCL 2 Mycobacte a eryth Vesicomyosocius okutholarctica Mycobacteriumrium gilvumabscessusr Candidatus 3A aea NRRL 2338 Thiomicrospira crunogena subsp VCS170 Myco tularensis Rh bacter ncisella er nodosus Temecula1279a odoco ium lepra PYR Fra lobact sa K Nocard ccus e GCK Diche philia Bath Go ia fa sp RHA1 TN Xylella fastidio malto str SL1 rdon rcini as sulatus hila Tsukamurellaia b p ca homon ap op HE 1 Lep ronc IFM 10 otrop s c hal ML tospira bifl hialis 15 Sten lococcu ichei 197071 Le aur 2 ethy hodospira l CC H Bracpto ometa M ehr AT SP spir exa Halor eani 1 2 Trepohy a interr se bola lilimnicolac vorans EF0 J2 spira rova lka cus o ido C Tre ne murdooga r P A ac rans 118 pon ma ns ato sococ lftia r eiseniae T Bo ema de chii sero c strain itro De te ens Bo rrelia nt va N c halenivo i SP 6 rrel h p ico r Copenha Pa ht educ PM1 Rhodo er allidu la A toc ephrobanap rrir odni 1 Pla ia g msii m TC 1 A s e chol ari D s C geni me ermin ona rax f x s STIR M nct pirel nii AH ubs 35405 str F s V m riu 3 et om lu PB p i ro dofe tothri nensis 38 Akkermh y la i pal ocruz Pola o Lep petroleiphilum yla ce ba lidu L1 Rh cessa taiwa sp O cidi s lim ltica m s ia ydansN C pi p str Ni ibiumter ne er ox 1 a tutusansia mhi nophilusSH 1 ethyl c old c 197 Chl ndi lum chols M oba priavidu d te i le Cu Chlamydopam atus rr u nfern Burkh a avium sp EbN RCB Chl ydi ae cin nuc nas arseni s C91 Pr P ip or o cu atica Chl amy a tr B h um Poly m pha 59 oto 90 ila nii ordetell 251962 T Ch o ac c 1 ATCC BAAV4 8 i B Azoar ro 5 K C lorobro dophilah h hla s aromeut 2 s hl he ila omatism Herm Chloroo rpe abort ydi s ATCC atu 1945 P ro iu mona i ATCCll 1 1 ro bac m ton thalapneumo A a a ro monas Chlorobs phaeobact us H moeb 35 lo o CP 12472283 R t bi ulum S2 A h tiform flageC 9 hecochum pha R ros ul N M Salihod niae6 1 o Dec Nit illus TCCC 03 C ss 3 3 p a m itrificansc ae A J sp 4 46 a n o iu pa ium hil oe C 9 Py2 Cytophagandi iba hthe m loris vieobacterv T a UW pir den C us D um er AT W os s loba 5 Spiryad d c rmusc oid 18 s u erans AT P a t hl NCIB C E o aceum BisB18 e t e or b e C 3 25 itr onorrhol otol s edo o u r oc ri s 351 N Methyg rophic b 25u 3 Chi o b s rub m of ro BS1 ia N lo Bactero somaacter l A h ides m vi radi u 58 tin b moeer ar romatiiorm 83 10 Thiobacill indica sp ORS278 P a ut D inus 27 isserteriu um ethylobacterium KC 16M P ara o c c is D e r To 3099 1 Canorp phagater f bo SM DS SM N ac M adskyi 4 Capn bact in e h subsp st NC1 gu rme ins p CaD3 acter autot 8 Flavoba hyromonide h hil 13 M 26 2 ca b itensis 3 G didatus eparial onii 6 inograna l eroidess t d ntan us 85 6 romob di ho t sp B DS 1 E ra oc p e 5 t domonasyrhizobium palustrisw in cilliformis iae 3 Su lusimicrob he in ATCC asia 5 Ch n a me MAFF30ic Aq m n u um v ans yt ensi us s Xa ter q lla N lf e o taio Methylobacterickia in Brad oti or SS W itr ui u l c t n ac e l MCS10sp K31 114 H la forsetiiterium p psSu a icu ri c bv v D r ha s s 33 u izobi ris i 22 Helic el olinellaat s f ihydro t ije r Sulfurovum ex ae lc gin is ao s trob rh ATCC 15444 4 1 Arco i irupt ga oc tas 40 5 e i tonellaB b oy Sul cobacter hep ia m B Rhodopseu rtonellar o OCh S gi 6 a2 N arum DFL 12 2 ium minut mue icron sp CCS1 Ca obac valisonis Ba es ns 2444 Campyloba ulfu o ge Ba lis ma mer a PD12 Campyl furi o M u C b l unium s D m r i KT0803ychroph inos a ZM4 a ucci sp cus n rac po RW1 Acidoba am ll A sorhizobium S en p ro monasc i e W V m IH1 B t bium TC m lavamentiept Caulobacter roides 621H Ha Myxo ter but r P u An o y e e i u RB2256 PAl 5 d s VF5 83 ter e HTCC2594 JF 5 S itrov I 548 Pe li py l r pylori SB155 2 a GWS Me es Geo o piril n

Desulfomicrobium baculatum C ifican DSM 1 HZ Ge a G ello Maric Pelobacter p ora b s D S um Sy Desulfo oge m De a bacter h h leg 11170 lia o p NB 85 acte lobac eo ilum acul y l erom bac GDN u es s mobilis ma o bact a Jannaschi AMB 1 ob l 2 b n coccus x de p n ngium ochra ibrio vibrio u S 0 ci sulf ng ticu P i oralis gton bacter b c cter zle n YO3 vi C t b t ulfohalob 3 r sph phicus m e ro roph acter e t Mari G20 2 JIP02 86 a t acter lov oxydans DP4 r u ni C i191 Sil cter denitrifica o str Jake ium ce e r s vorans c te e Par crypt C1062 d i 66 s N100 ph o yx DS izobium cte r tri 3 is te ia bac a r f ATCC ticum jej eleyianum RM4018 subsp c s h a er str ta r c 7 AT AOP1 gia malayi om Boryong cet sp MC 1 o b i fi 95 R o u r sulfurre etus ob tatu Hyphomonas n u 1 lea u acter lit ne str S bact can M 1740 u c acterio C seoba c s o RML369 C ni i yxis alaskensis arb ant ac r iphilus n Br c nc C 5 a bilis esdensis tr Wilmin an l p a can ropionicus s str Miyaya ps lulosu i us Dinoroseobacter shibae vulgaris Ro c s te s ium s hodob o agocytophilum le subsp aracoccus denitr s ellii i u A DSM 1 t PHE M er hus id i i s str Welgevonden ce E 1 R h e i nol r chi y bs P r ium us ychro o ll T 4 cter diazotr r sp itroph ed m aromatici i v fu in le 49 Wolbachia pipientis ococcus um CC B Erythrobu hing retb SZ o 1 p Sphingomonas wittichii duc ic m D 6 p Acidiphilium o ucens rus 3826 Ehrli marox doylei subsp us Fw10 K 16 0 Ellin3 S Gluconobact vo f S e 7 2 desulfuricans utsugamushi phila aens e t

ic 6 AA 381 5 rans o us D HD10 ns bacter beth D 1 c u SM 2 22

nacetoba tospirillumaplasma mag p Magnet 45 s strain TRS of SM P Rf 9 82 b e 56 2 idan

n L e 69 9 Zymomonas mo S 5 ne C co Hxd3 A 4 subsp S 40 g B 0 23 A Granuli Rhodospirillum rubrum v 38 s

Glu Anaplasma marginale 54 7 MPOB Novosphingobi Ma Orientia ts 79 0 Pelagibacter ubique HTC

Neorickettsia sennetsu ruminantium

Lawsonia intracellularis

Desulfovibrio vulgaris

Gammaproteobacteria Aquificae Candidatus Actinobacteria

achia endosymbiont

vibrio desulfuricans

Betaproteobacteria Wolb

Desulfo Chlorobi Chloroflexi Deinococcus/Thermus / Figure 1 | Maximum-likelihood phylogenetic tree of the bacterial based on a concatenated alignment of 31 broadly conserved protein-coding genes16. Phyla are distinguished by colour of the branch and GEBA genomes are indicated in red in the outer circle of species names.

The discovery and characterization of new gene families and their as ecological niche; nevertheless, higher rates of novel protein family associated novel functions provide one incentive for sequencing discovery were found in the more phylogenetically diverse taxa additional genomes, analysis of which has helped to redefine the (Fig. 2). In addition, of the 16,797 families identified in the 56 protein family universe18. We explored the quantitative effect of GEBA genomes, 1,768 showed no significant sequence similarity to tree-based genome selection on the pace of discovery of novel proteins any proteins, indicating the presence of novel functional diversity. and functions. Specifically, we compared the rate of discovery of These results highlight the utility of tree-based genome selection as novel protein families when progressively adding more closely related a means to maximize the identification of novel protein families and genomes versus when adding more distantly related ones (Fig. 2). argues against lateral gene transfer significantly redistributing genetic Granted, many factors contribute to protein family diversity, such novelty between distantly related lineages. 1057 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS NATURE | Vol 462 | 24/31 December 2009

Table 1 | Effect of SSU rRNA tree-based selection of organisms on compar- families, respectively. Halorhabdus utahensis, a halophilic archaeon ative genomic metrics known to have b-xylanase and b-xylosidase activities21,hasachromo- Comparative genomic metric GEBA set Random sets Fold somal cluster including two GH10 family b-xylanases and six novel (number of resamplings) improvement GH5 family proteins of unknown specificity. The enrichment of genetic diversity is also seen within families of 17 Genome tree phylogenetic diversity non-coding RNAs, transposable elements, and other cellular compo- Bacteria (domain) 11.03.2 6 0.7 (100) 2.8–4.4 Actinobacteria (phylum) 4.31.4 6 0.3 (100) 2.5–3.9 nents. For example, the genome of the marine myxobacterium ochraceum contains 807 CRISPR (clustered regularly New protein family links 46 3 6 4 (5) 6.6 to .15.3 interspaced short palindromic repeats) units including the largest Genes in new chromosomal cassettes 71,579 16,579 6 5,523 (20) 3.2–6.5 single CRISPR array known, comprising 382 spacer/repeat units. New gene fusions 433 65 6 31 (20) 4.5–12.7 CRISPR is a newly recognized, but ancient and widespread, system in bacteria and archaea that confers resistance to viruses and other GEBA genomes were compared to equivalently sized random sets of reference genomes to 22 quantify the effect of phylogenetic selection. invading foreign DNAs . Results from the GEBA pilot project challenge our current under- Novel proteins also can serve to link distantly related homologues standing for the taxonomic distribution of known gene families. The whose relatedness would otherwise go undetected. Forty-six such most striking example of which is the discovery of an actin homo- links were identified in the 56 GEBA genomes compared to an average logue in H. ochraceum. Actin and its close relatives are structural of only three new links in equivalent sets of randomly sampled non- components of the eukaryotic cytoskeleton that are found in every GEBA genomes (Table 1). A useful complement to homology-based eukaryote and only in eukaryotes. Bacteria and archaea encode predictions of gene function are ‘non-homology methods’ (ref. 19) instead the shape-determining protein MreB. Although MreBs have such as gene context-based inference that relies on the conserved some functional and structural similarities to eukaryotic actins, they clustering of functionally related genes across multiple genomes, often are regarded, at best, distantly related homologues23 and possibly not in operons or as gene fusions20. We identified over 70,000 genes in new chromosomal cassettes of two or more genes in the GEBA genomes. This represents a three- to sixfold increase over equivalent sets of non- ab GEBA genomes (Table 1). Similarly, the number of new gene fusions Peptidase M4 identified in the GEBA genomes is 4 to ,13 times greater than in thermolysin

randomly selected genome sets (Table 1). Because the GEBA data Markers DNA RNA RNase treated set produced a several-fold improvement over random sets for all metrics examined (Table 1), we predict that other aspects of Hypothetical protein sequence-based biological discovery will similarly benefit from tree- based genome sequencing. Hypothetical protein 1.6 kb The GEBA genomes also show significant phylogenetic expansions 1.0 kb within known protein families. For example, although only two of the Hypothetical protein 0.5 kb 56 GEBA organisms are known cellulose degraders, we identified in the set of genomes a variety of glycoside hydrolase (GH) genes that may BARP participate in the breakdown of cellulose and hemicelluloses. Among Phosphatase c these are 28 and 7 phylogenetically divergent members of the endo- 1 kb glucanase- and processive exoglucanase-containing GH6 and GH48 Oxidoreductase

70,000 Conserved hypothetical protein Hypothetical protein Methyltransferase 60,000 Bacteria from GEBA project (1,060.6 new families/genome) d 1C0F_A DGEDVQAL VI DNGSGMC K AGF A GDDA P RAVFPSIVGR P R HTG KDSYV... 50,000 BARP SS....PI II HPGSDTL Q AGL A DEEH P GSIFPNIVGR H K LAG LMEWVDQR

1C0F_A ....G D EA QS KRGIL T LKYPIEHGIVTNWDDMEKIW HHTFYNELR VAPEE BARP VLCVG Q EA ID QSATV L LRHPVWSGIVGDWEAFAAVL RHTFYRAL W VAPEE 40,000 Actinobacteria 1C0F_A HPVLLTEA P L...N PKA N REK MTQ IM FETFNTP A MY VAIQAVLSLYAS G R HPIVVTES P HVYRS FQL R REQ LTR LL FETFHAP Q VA VCSEAAMSLYAC G L (649.6 new families/genome) BARP 30,000 1C0F_A T TGIVMDS GD G VSHT VPIYE G Y AL PH AI LRLDLA GRD LTDYMM KILT ERG BARP D TGLVVSL GD F VSYV APVHR G A IV DA GL TFLEPD GRS ITEYLS RLLL E RG 1C0F_A Y S FT TTAEREIVRDIKEK LAYVAL D FEA E MQTA A SSSALEK SYE LPD G QV Protein family number 20,000 (307.7 new families/genome) BARP H V FT SPEALRLVRDIKET LCYVAD D VAK E ..AA R NADSVEA TYL LPN G ET

1C0F_A I T IGNERF RC PEALFQ P SFLGM ES AG IH E TTYN S IMK CD VD I RKDLYGNV 10,000 BARP L V LGNERF RC PEVLFH P DLLGW ES PG LT D AVCN A IMK CD PS L QAELFGNI (including families with single members) Streptococcus agalactiae 1C0F_A VLSGGT TM FP GIA DRMNKELTALAPSTM K IKIIAPPERKYS VWIG GSILA (121.3 new families/genome) BARP VVTGGG SL FP GLS ERLQRELEQRAPAEA P VHLLTRDDRRHL PWKG AARFA 0 0 10 20 30 40 50 60 70 80 1C0F_A SLSTFQQMWISKEEYDESGPS IVHRK CF RDAQFAGFALTRQAYERHGAE LIYQM .. Genome number BARP Figure 2 | Rate of discovery of protein families as a function of Figure 3 | A bacterial homologue of actin. a, Genomic context of the bacterial phylogenetic breadth of genomes. For each of four groupings (species, actin-related protein (BARP) gene within the genome of the marine different strains of Streptococcus agalactiae; family, Enterobacteriaceae; Deltaproteobacterium H. ochraceum. Red, gene encoding BARP; white, genes phylum, Actinobacteria; domain, GEBA bacteria), all proteins from that encoding hypothetical proteins; black, genes with functional annotations. group were compared to each other to identify protein families. Then the b, RT–PCR demonstration of expression of the gene encoding BARP in total number of protein families was calculated as genomes were H. ochraceum. c, Ribbon plot of the putative structure of BARP. d,Alignment progressively sampled from the group (starting with one genome until all of BARP with actin from Dictyostelium discoideum29 with similarities in black were sampled). This was done multiple times for each of the four groups shaded text. Secondary structure elements (arrows, beta-strands; bars, alpha- using random starting seeds; the average and standard deviation were then helices) are colour-coded as in c. A phylogenetic tree including this protein is in plotted. Supplementary Figure 1. 1058 ©2009 Macmillan Publishers Limited. All rights reserved NATURE | Vol 462 | 24/31 December 2009 LETTERS even homologous. Like other bacteria, H. ochraceum encodes a bona tree of life. The pilot study presented here is a dedicated first step in fide MreB protein, but in addition, it encodes a protein that is clearly this direction. a member of the actin family, which we have named BARP (bacterial actin-related protein; Fig. 3). Although we do not yet have evidence METHODS SUMMARY for its precise function, BARP is expressed in H. ochraceum (Fig. 3b). Starting with a phylogenetic tree of SSU rRNA genes7, we identified major Assuming that the H. ochraceum mreB orthologue performs the same branches that had no available genome sequences but for which cultured isolates function as in other bacteria, and given that the , to were available in the DSMZ or ATCC culture collections. Selected isolates which this species belongs, are known to synthesize actin-targeting (Supplementary Table 1a, b) from these branches were grown and DNA isolated 24 (Supplementary Table 1c) and quality checked. DNA was then used for shotgun , we propose that this BARP may be a dominant-negative genome sequencing by Sanger/ABI, Roche/454 and/or Illumina/Solexa technolo- inhibitor of eukaryotic actin polymerization. Regardless of its precise gies (Supplementary Table 2). Sequence reads were assembled separately with function, this first—and so far only—discovery of an expressed different assembly methods and the best draft assembly was used for annotation homologue of eukaryotic actin in a member of the Bacteria highlights and as a starting point for genome completion (current genome status is in the potential for novel and surprising biological discoveries given a Supplementary Table 2). Annotation (gene identification, functional prediction, wider genomic sampling of the tree of life. etc.) was performed using the IMG system (http://img.jgi.doe.gov/geba); this was We conclude that targeting microorganisms for genome sequencing done both after shotgun sequencing and again after genome completion. For ‘whole solely on the basis of phylogenetic considerations offers significant far- genome tree’ analysis, a PHYML maximum likelihood phylogenetic tree of a con- catenatedalignment of 31 marker genes wasbuilt using AMPHORA16. Phylogenetic reaching benefits in diverse areas. Furthermore, the benefits of phylo- diversity was calculated as the sum of branch lengths in this and other trees. Protein genetically driven genome sequencing show no sign of saturating with families were built for various genome sets by using the Markov clustering algo- these first 56 genomes. A key question then lies in determining how rithm (MCL)28 to group proteins on the basis of ‘all versus all’ blastp searches. For much bacterial and archaeal diversity remains to be sampled. Using analysis of phylogenetic diversity of organisms, a phylogenetic tree was built for a SSU rRNA gene sequences as a proxy for organismal diversity (Fig. 4), combined alignment of SSU rRNA sequences from published genomes and a non- we estimate that sequencing the genomes of only 1,520 phylogenetically redundant subset of greengenes SSU rRNA7. Further analysis of the genomes was selected isolates could encompass half of the phylogenetic diversity done using IMG database queries and new computational analyses as described in represented by known cultured bacteria and archaea. Given the continu- the main text, legends and Supplementary Methods. ing reductions in both the cost and difficulty in sequencing genomes25, Received 3 June; accepted 30 October 2009. this is certainly a tractable target in the next few years. 1. Fraser, C. M., Eisen, J. A. & Salzberg, S. L. Microbial genome sequencing. Nature However, the great majority of recognized bacterial and archaeal 406, 799–803 (2000). diversity is not represented by pure cultures and an additional 9,218 2. Liolios, K., Mavromatis, K., Tavernarakis, N. & Kyrpides, N. C. The Genomes On genome sequences from currently uncultured species would be Line Database (GOLD) in 2007: status of genomic and metagenomic projects and required to capture 50% of this recognized diversity (Fig. 4). Such their associated metadata. Nucleic Acids Res. 36 (database issue), D475–D479 (2008). an undertaking will require new approaches to culturing or proces- 3. Hugenholtz, P. Exploring prokaryotic diversity in the genomic era. Genome Biol. 3, 26 sing of multi-species samples using methods such as metagenomics REVIEWS0003.1–REVIEWS0003.8 (2002). or physical isolation of cells from mixed populations followed by 4. Eisen, J. A. Assessing evolutionary relationships among microbes from whole- whole genome amplification methods27. Obtaining reference gen- genome analysis. Curr. Opin. Microbiol. 3, 475–480 (2000). 5. Wu, D. et al. Complete genome sequence of the aerobic CO-oxidizing omes for the uncultured microbial majority will be a natural exten- thermophile Thermomicrobium roseum. PLoS One 4, e4207 (2009). sion of the GEBA project, the ultimate goal of which is to provide a 6. Pace, N. R. A molecular view of microbial diversity and the biosphere. Science 276, phylogenetically balanced genomic representation of the microbial 734–740 (1997). 7. DeSantis, T. Z. et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl. Environ. Microbiol. 72, 5069–5072 (2006). 1,200 GEBA genomes 8. Bernal, A., Ear, U. & Kyrpides, N. Genomes OnLine Database (GOLD): a monitor of Pre-GEBA genomes genome projects world-wide. Nucleic Acids Res. 29, 126–127 (2001). Organisms from the greengenes database 9. Lapage, S. P. et al., International Code of Nomenclature of Bacteria, 1990 Revision. 1,000 Organisms from the greengenes database (American Society for Microbiology, 1992). (excluding environmental samples) 10. Ward, N., Eisen, J., Fraser, C. & Stackebrandt, E. Sequenced strains must be saved from extinction. Nature 414, 148 (2001). 800 11. Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11, 120 551–553 (2009). 12. Field, D. et al. The minimum information about a genome sequence (MIGS) 100 specification. Nature Biotechnol. 26, 541–547 (2008). 600 80 13. Markowitz, V. M. et al. The integrated microbial genomes (IMG) system in 2007: data content and analysis tool extensions. Nucleic Acids Res. 36 (database issue), 60 D528–D533 (2008). 400 14. Achtman, M. & Wagner, M. Microbial diversity and the genetic nature of Phylogenetic diversity 40 microbial species. Nature Rev. Microbiol. 6, 431–440 (2008). 20 15. Beiko, R. G., Doolittle, W. F. & Charlebois, R. L. The impact of reticulate evolution 200 on genome phylogeny. Syst. Biol. 57, 844–856 (2008). 0 0 400 800 1,200 16. Wu, M. & Eisen, J. A. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 9, R151 (2008). 17. Pardi, F. & Goldman, N. Resource-aware taxon selection for maximizing 0 0 5,000 10,000 15,000 20,000 25,000 30,000 35,000 40,000 phylogenetic diversity. Syst. Biol. 56, 431–444 (2007). Number of organisms 18. Kunin, V., Cases, I., Enright, A. J., de Lorenzo, V. & Ouzounis, C. A. Myriads of protein families, and still counting. Genome Biol. 4, 401 (2003). Figure 4 | Phylogenetic diversity of bacteria and archaea on the basis 19. Marcotte, E. M. et al. Detecting protein function and protein-protein interactions of SSU rRNA genes. Using a phylogenetic tree of unique SSU rRNA gene from genome sequences. Science 285, 751–753 (1999). sequences7, phylogenetic diversity was measured for four subsets of this tree: 20. Enright, A. J., Iliopoulos, I., Kyrpides, N. C. & Ouzounis, C. A. Protein interaction organisms with sequenced genomes pre-GEBA (blue), the GEBA organisms maps for complete genomes based on gene fusion events. Nature 402, 86–90 (1999). (red), all cultured organisms (dark grey), and all available SSU rRNA genes 21. Wainø, M. & Ingvorsen, K. Production of b-xylanase and b-xylosidase by the (light grey). For each subtree, taxa were sorted by their contribution to the extremely halophilic archaeon Halorhabdus utahensis. Extremophiles 7, 87–93 (2003). 30 subtree phylogenetic diversity and the cumulative phylogenetic diversity 22. Barrangou, R. et al. CRISPR provides acquired resistance against viruses in was plotted from maximal (left) to the least (right). The inset magnifies the . Science 315, 1709–1712 (2007). first 1,500 organisms. Comparison of the plots shows the phylogenetic ‘dark 23. Doolittle, R. F. & York, A. L. Bacterial actins? An evolutionary perspective. matter’ left to be sampled. Bioessays 24, 293–296 (2002). 1059 ©2009 Macmillan Publishers Limited. All rights reserved LETTERS NATURE | Vol 462 | 24/31 December 2009

24. Sasse, F., Kunze, B., Gronewold, T. M. & Reichenbach, H. The chondramides: Author Contributions D.W. (rRNA analysis, gene families, actin tree, manuscript cytostatic agents from myxobacteria acting on the actin cytoskeleton. J. Natl. preparation), P.H. (selection of strains, analysis, manuscript preparation, project Cancer Inst. 90, 1559–1563 (1998). coordination), L.G. and D.B. (project management), R.P., B.J.T., E.L., S.G., S.S. (strain 25. Shendure, J. & Ji, H. Next-generation DNA sequencing. Nature Biotechnol. 26, curation and growth), K.M., N.N.I., I.J.A., S.D.H., A.P., A.Ly. (annotation, genome 1135–1145 (2008). analysis), V.K. (CRISPRs, actin), M.W. (whole genome tree), P.D., C.K., A.Z. and 26. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A M.S. (actin studies), M.N., S.L., J.-F.C., F.C. and E.D. (sequencing), C.H., A.La., M.N. bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 557–578 and A.C. (finishing), P.C. (analysis), E.M.R. (manuscript preparation), N.C.K. (2008). (selection of strains, annotation, analysis), H.-P.K. (strain selection and growth, 27. Ishoey, T., Woyke, T., Stepanauskas, R., Novotny, M. & Lasken, R. S. Genomic DNA preparation, manuscript preparation), J.A.E. (project lead and coordination, sequencing of single microbial cells from environmental samples. Curr. Opin. analysis, manuscript preparation). Microbiol. 11, 198–204 (2008). 28. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for Author Information Genome sequence and annotation data is available at the JGI large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002). IMG-GEBA page http://img.jgi.doe.gov/geba and has been submitted to GenBank 29. Matsuura, Y. et al. Structural basis for the higher Ca21-activation of the regulated with accessions ABSZ00000000, ABTA00000000, ABTB00000000, actin-activated myosin ATPase observed with Dictyostelium/Tetrahymena actin ABTC00000000, ABTD00000000, CP001618, ABTF00000000, chimeras. J. Mol. Biol. 296, 579–595 (2000). ABTG00000000, ABTH00000000, ABTI00000000, ABTJ00000000, 30. Moulton, V., Semple, C. & Steel, M. Optimizing phylogenetic diversity under ABTK00000000, ABTM00000000, NZ_ABTN00000000, constraints. J. Theor. Biol. 246, 186–194 (2007). NZ_ABTO00000000, ABTP00000000, ABTQ00000000, NZ_ABTR00000000, NZ_ABTS00000000, NZ_ABTT00000000, Supplementary Information is linked to the online version of the paper at NZ_ABTU00000000, NZ_ABTV00000000, NZ_ABTW00000000, www.nature.com/nature. NZ_ABTX00000000, NZ_ABTY00000000, NZ_ABTZ00000000, Acknowledgements . We thank the following people for assistance in aspects of the NZ_ABUA00000000, NZ_ABUB00000000, NZ_ABUC00000000, project including planning and discussions (R. Stevens, G. Olsen, R. Edwards, NZ_ABUD00000000., NZ_ABUE00000000, NZ_ABUF00000000, J. Bristow, N. Ward, S. Baker, T. Lowe, J. Tiedje, G. Garrity, A. Darling, S. Giovannoni), NZ_ABUG00000000, NZ_ABUH00000000, NZ_ABUI00000000, analysis of genomes whose work could not be included in this report (B. Henrissat, NZ_ABUJ00000000, NZ_ABUK00000000, ABUL00000000, G. Xie, J. Kinney, I. Paulsen, N. Rawlings, M. Huntemann), project management NZ_ABUM00000000, NZ_ABUO00000000, NZ_ABUP00000000, (M. Miller, M. Fenner, M. McGowen, A. Greiner), sequencing and finishing (K. Ikeda, ABUQ00000000, NZ_ABUR00000000, ABUS00000000, M. Chovatia, P. Richardson, T. Glavinadelrio, C. Detter), culture growth, DNA NZ_ABUT00000000, NZ_ABUU00000000, NZ_ABUV00000000, extraction, and metadata (D. Gleim, E. Brambilla, S. Schneider, M. Schro¨der, NZ_ABUW00000000, NZ_ABUX00000000, NZ_ABUZ00000000, M. Jando, G. Gehrich-Schro¨ter, C. Wahrenburg, K. Steenblock, S. Welnitz, M. Kopitz, NZ_ABVA00000000, NZ_ABVB00000000 and NZ_ABVC00000000. All R. Fa¨hnrich, H. Pomrenke, A. Schu¨tze, M. Rohde, M. Go¨ker), and manuscript editing strains that have been sequenced are available from the DSMZ culture collection (M. Youle). This work was performed under the auspices of the US Department of and culture accessions are available in Supplementary Information. This paper is Energy’s Office of Science, Biological and Environmental Research Program, and by distributed under the terms of the Creative Commons the University of California, Lawrence Berkeley National Laboratory under contract Attribution-Non-Commercial-Share Alike licence, and is freely available to all no. DE-AC02-05CH11231, Lawrence Livermore National Laboratory under contract readers at www.nature.com/nature. Reprints and permissions information is no. DE-AC52-07NA27344, and Los Alamos National Laboratory under contract no. available at www.nature.com/reprints. Further details on sequencing and genome DE-AC02-06NA25396. Support for J.A.E., D.W. and M.W. was provided by the properties of each organism are being published in the journal Standards in Gordon and Betty Moore Foundation Grant no. 1660 to J.A.E. Support for work at Genomic Sciences (SIGS) (http://standardsingenomics.org/). Correspondence DSMZ was provided under DFG INST 599/1-1. and requests for materials should be addressed to J.A.E. ([email protected]).

1060 ©2009 Macmillan Publishers Limited. All rights reserved