<<

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved. Tyne, UK. for material strain source to access easy provide and collections culture different two least at in maintained are strains Type lished. estab is name the when chosen are and species, microbial a between between phylogenetic distance and novel function discovery correlation direct a is There life. microbial of diversity evolutionary space. This skewed phylogeny narrowed our view of the functional and terial species results in a biased phylogenetic representation of sequence cies aided our understanding of pathogenesis, the focus on specific bac pathogenic species. While sequencing different strains of the same spe In 2015, 43% of sequenced bacterial genomes comprised just ten human State State University, East Lansing, Michigan, USA. 1 biotechnological relevance of the target or clinical the or on their physiologybased chosen are projects sequencing isolate most lagged behind improvements in sequencing technologies. Traditionally, Systematicof diversity the of surveys cultivated microorganisms have Supratim Mukherjee isolates expand coverage of the tree of life 1,003 reference genomes of bacterial and archaeal OPEN u o s e R 676 to to be a valuable resource for scientific discovery. date. Bacterial and archaeal isolate sequence space is still far from saturated, and future endeavors in this direction will continue potential new chemical structure and activity.antimicrobial This Resource is the largest single release of reference genomes to Weinterpretation. identify numerous biosynthetic clusters and validate experimentally a divergent phenazine cluster with 25 million previously unassigned metagenomic proteins from 4,650 samples, improving their phylogenetic and functional genomes reveal a 10.5% increase in novel protein families as a function of phylogenetic diversity. The GEBA genomes recruit strains and expand their overall phylogenetic diversity by 25%. Comparative analyses with previously available finished and draft initiative, selected to maximize sequence coverage of phylogenetic space. These genomes double the number of existing type We present 1,003 reference genomes that were sequenced as part of the Genomic Encyclopedia of and (GEBA) USA. USA. completed genomes enables more accurate whole-genome-based whole-genome-based assignments taxonomic accurate more enables genomes completed substantial increase in new genes, protein families and pathways phylogeneticgaps the the mightin tree result suggeststhata in filling Braunschweig, Germany.Braunschweig, sinet n eaeoi dt sets data metagenomic in assignment taxonomic in improvements vast enabled have lineages represented reference genomes by of targeted under sequencing phylogenetically archaeal and bacterial the expand to efforts Previous studies. nomic metage from fragments sequence of identification the for anchors Yasuo Yoshikuni Markus Göker Received Received 8 November 2016; accepted 21 April 2017; published online 12 June 2017; corrected after print 28 February 2018; authors contributed equally to this work. should Correspondence be addressed to N.C.K. ( Amrita Amrita Pati Department Department of Energy, Joint Genome Institute, Walnut Creek, California, USA. Bacterial and archaeal type strains are the representative unit of of unit representative the are strains type archaeal and Bacterial Reference genomes can fill phylogenetic gaps, but also serve as as serve also but gaps, phylogenetic fill can genomes Reference

7 Australian Australian Centre for Ecogenomics, The University of Queensland, Brisbane, Queensland, Australia. 9 Present Present addresses: Zymergen Inc., Emeryville, California, USA (R.C.C.) and Roche Molecular Systems Inc., Pleasanton, California, USA (A.P.). 1 , 9 , , Natalia N Ivanova 2

1 , Axel Visel

, R Cameron Coates 3 Department Department of Microbiology, University of Georgia, Athens, Georgia, USA. 6 , 7 1,10 and improved phylogenies improved and r , Rekha Seshadri 1 e c

, William B Whitman 5 NamesforLife, LLC, NamesforLife, East Lansing, Michigan, USA. 5 1 . Furthermore, access to to access Furthermore, . , , Tanja Woyke 1 , 9 , Michalis Hadjithomas 1,10 8 , Neha J Varghese , 9 . 2,3 1 , , Hans-Peter Klenk , which 4 3 . , George M Garrity 2 1 - - - - - Leibniz Leibniz Institute DSMZ - German Collection of and Microorganisms Cell Cultures, .

using a phylogeny-based scoring system for strain selection strain for system scoring phylogeny-based a using (GEBA-I), Initiative GEBA the of part as sequenced were classes) 43 974 bacterial and 29 archaeal genomes (from 579 genomes genera microbial of in diversity 21 phylaphylogenetic Increased and RESULTS sequences. metagenomic existing of assignment phylogenetic and ascertain whether these type-strain genomes improve the recruitment of protein and families expands the diversity of known and discovery functions, to the facilitates catalog this how determine to diversity, physiological and phylogenetic broad of catalog genome expanded 29 reference an provide and to were objectives bacterial Our 974 strains. type from archaeal genomes reference 1,003 comprising archaea and bacteria of ‘encyclopedia’ phylogeny-driven a of usefulness the project presented the analysis of 56 type-strain genomes and validated study. this of start at the available were publicly strains type of 826 only genomes the importance, their (ICNP) of of Nomenclature Code by International the as defined criteria, other and metadata, source isolation data, phenotypic and taxonomic ized well-character has strain type Typically, a experiments. subsequent new type strains added (on year new added strains type average) every 650 with names, published valid, with species archaeal and bacterial The Genomic Encyclopedia of Bacteria and pilot (GEBA) Archaea of Bacteria Encyclopedia The Genomic [email protected] 1 VOLUME 35 VOLUME 1 , Emiley , A Emiley Eloe-Fadrosh , Georgios A Pavlopoulos 3 . . We now present a data expanded substantially set (GEBA-I) 6 8 4 University University of California Davis Genome Center, Davis, California, Department Department of Microbiology and Molecular Genetics, Michigan & Nikos C Kyrpides 4

, 5 NUMBER 7 NUMBER , Jonathan A Eisen 8 School School of Biology, Newcastle University, Newcastle upon, ). 1 0 . As of December 5, 2015, there were 12,981 12,981 were there 2015, 5, . As of December

JULY 2017 JULY 1 1 , Jan P Meier-Kolthoff d

oi:10.1038/nbt.388 1

6 , David Paez-Espino

, Philip Hugenholtz nature biotechnology nature 11 , 1 2 . . However, despite 6 10 These These 6 , 1 3 . 2 7 1

, ,

, -

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved. (assessed using CheckM using (assessed completeness genome average) (on 99.4% with resource reference ( content G+C age aver and size genome physiology, diverse have unsurprisingly and over time (i.e., in chronological order of their date of release; date of order of their release; in chronological (i.e., over time genomes archaeal and bacterial sequenced newly by added families ( genomes) 1,000 protein (per new families of rate growth the calculated we sequencing, genome whole of advent the since ongoing been has that trend a of tinuation diversity. sequence protein known GEBA-I genomes led to a in increase substantial the number of added ( genomes at 15,000 around plateaued large of with the but addition genomes, 5,000 the first initially almost was time over added families protein of number the that found we family novelty initially with observed the first 2,000 genomes. Second, protein the to equivalent families, protein new of rate growth the in of the GEBA-I genomes (noted in red) resulted in a dramatic increase Addition genomes. sequenced 2,000 the first after declined markedly that rate the growth of First, we new protein families observed inset). biomes, industrial waste and human terrestrial body sites environments, ( extreme including habitats of multitude a ( lum of the representative only sequenced the including phyla, additional 17 to belonged genomes 175 remaining The (157). genomes), of numbers of genomes were sequenced, the (with 330 terms in phyla, populous most The genera. new most the have phyla ( only genomes GEBA-I 436,840 from and proteins of clusters composed were 55,105 singletons these, Of singletons. million 2.6 and sequences) two least at (containing clusters protein million 1.89 in resulted KClust using length alignment 80% over identity sequence at proteins total ~30% ~26 million Clustering genomes. and archaeal bacterial control (14,625) available all from proteins non-redundant 23,470,984 with set data this We compared genomes. GEBA-I 1,003 A total of 3,402,887 protein-coding sequences were predicted from the proteins known of universe the Expanding available. is species that of representative sequenced other no that Deferribacteres of a ( representative sequenced first nature biotechnology nature basis basis of the proposed criteria for defining a “species group” of the GEBA-I were (845/1,003) genomes majority ‘singletons’ on the We vast genomes. the that control found of 14,625 a to set compared genomes GEBA-I the of novelty relative the verify to identity otide nucle average the on based analysis comparative whole-genome a of the sequence space type-strain by ~24% ( diversity overall the expanding threefold, distance phylogenetic hensive 16S rRNA gene tree gene rRNA 16S hensive meas in a compre strains type we sequenced of all distance diversity the strains), ured type (i.e., species validly bacterial and available, archaeal named previously all with compared genomes GEBA-I ( collection culture respective the through IGM system (IMG/M) Microbiomes with Genomes Microbial Integrated the through able 1 Table Supplementary ( data sequence assembled of Gbp 3.75 the 1,003 GEBA-I genomes resulted in 3,472,483 predicted genes from upeetr Tbe 2 Table Supplementary To test if this represents a meaningful increase, or a mere con mere a or increase, meaningful a represents this if test To Of the 1,003 genomes presented, 396 GEBA-I genomes were the the were genomes GEBA-I 396 presented, genomes 1,003 the Of To quantify the increase in phylogenetic diversity contributed by contributed diversity phylogenetic in increase the Toquantify Supplementary Table 1 Table Supplementary , , 1 Supplementary Fig. 2 Fig. Supplementary 5 and GenBank, and the corresponding strains strains corresponding the and GenBank, and (178), (178), 1 ). All GEBA-I genomes are publicly avail publicly are genomes GEBA-I All ). 4 ), corresponding to a 10.5% increase in in increase 10.5% a to corresponding ), ; ;

Supplementary Table 1 Supplementary VOLUME 35 VOLUME 6 and and . The GEBA-I genomes increased the the increased genomes GEBA-I The . ). The GEBA-I strains originate from from originate strains GEBA-I The ). Fig. 2 Fig. a Fig. 2 Fig. ), and the number of protein protein of number the and ),

Fig. Fig. 1 Supplementary Fig. 3 Fig. Supplementary ). GEBA-I is a high-quality high-quality a is GEBA-I ). Fig. 1 Fig. NUMBER 7 NUMBER (163) and and (163) Supplementary Table 1 Supplementary a , inset). The addition of addition The , inset). Supplementary Fig. 1 b ). ). Further, we applied a ). The The ). Caldithrixae ). Annotation of Annotation ).

Actinobacteria JULY 2017 JULY Caldithrixae 7 ( , , verifying i. 1 Fig. Fig. 2 Fig. phy and and

a ). a ------) ) , ,

proportional downweighting of internal branches internal of downweighting proportional by followed tree the in node root and leaf each between lengths branch adding by calculated was (bRPD) diversity phylogenetic relative Balanced sequence. a genome lacking strains type remaining the denotes gray and genomes GEBA-I the by covered diversity the denotes red GEBA-I, before strains type of genomes 828 by covered diversity genetic the denotes Blue strains. type the all to relative diversity gene rRNA 16S in circles on nodes with robust phylogenetic support. ( support. phylogenetic robust with nodes on circles values support Bootstrap charts. pie the to next displayed is per GEBA-I by added genera new of number The (blue). phylum per genera of number total the to (red) genomes GEBA-I by contributed genera of fraction the represent gray.charts Pie colored are phyla other all while red, colored are genome a GEBA-I containing Phyla phyla. cultivated all from genomes representative from markers protein conserved 56 of alignment concatenated on based tree likelihood Figure 1 Figure within the cultivated genome space and suggests that continued phythat continued and genome space suggests the cultivated within gene novelty remains esis functional to that substantial be discovered ( protein families lated with specific phylogenetic lineages, we examined the we minimum examined lineages, phylogenetic lated with specific families. protein diverse of catalog expanded an in result will efforts sequencing logeny-driven

b 2 a

In order to explore whether increased functional novelty is corre is novelty functional increased whether explore to order In 2 4

Cumulative percentage of total bRPD 2 Chrysiogenetes

100 Deferribacteres AcidobacteriaAc 10 20 30 40 50 60 70 80 90 0 ThermodesulfobacteriaTh 12

GEBA-I strain phylogeny and distribution. ( distribution. and phylogeny strain GEBA-I

Euryarchaeota

0 GEBA-1 Before 144 1,000 Thaumarchaeota

ProteobacteriaPro 4

Fig. Fig. 2 2,000 Crenarchaeota

3

3,000 -Thermus a , inset). Together, , inset). the hypoth reinforces this

3

4,000 2

Spirochaetes New genusaddedbyGEBA-I Prior toGEBA-I Phylum withoutGEBA-Igenome Phylum containingGEBA-Igenome

70

Number ofstrains 5,000

Actinobacteria A

Elusimicrobia 6,000

Planctomycetes

Other typestrains 2 7,000 4 Lentisphaerae ≥

50% are shown with small small with shown are 50%

8,000 platensis Coprothermobacter ae 6

2 . 2

Thermodesulfobium narugense Thermodesulfobium 9,000 Caldiserica b a u o s e R ) Overall increase increase ) Overall

) Maximum ) Maximum Dictyoglomi

10,000

IgnavibacteriaeCaldithrixae Firmicutes/

Aquificae 11,000 Synergistetes

Bacteroidetes

Cyanobacteria Armatimonates

12,000 0 Chlorobi 1

r 9 72

e c

677 52 - - - © 2018 Nature America, Inc., part of Springer Nature. All rights reserved. ( species disparate four in present 2968370) ID: Cluster identity, acid ters ( bers of two or more phyla (designated as “hyper-heterogeneous” clus hindgut isolate isolate hindgut the from 3082022) 4221619, 809557, 2586264, IDs: (Cluster ( TCS existing from style-specific expansion style-specific For example, example, For ( available previously were representatives sequenced no or phylogenetic few for to which phyla belonged of genes novel and number greatest distance the with genomes the of many families. expected, As protein novel of number greatest the encoded reference) from 16S distance greatest (i.e., distance phylogenetic increased with GEBA-I ( for genome each clusters tein mophilum pro new of number total the to compared distance gene rRNA 16S u o s e R 678 class ( in tified terms of membership within the same genus, family, order or “heterogeneous clusters”), varying levels of “heterogeneity” were iden duplication. gene ( arrays preferred is preferred for discrimination species spacer transcribed this internal 16S-23S for the example, gene for morphism, rRNA 16S the of rate of a poly higher revealing markers nature sequenced group, other with conserved highly the by Table 2 1,327 and 2,038 novel genes, respectively ( kroppenstedtii Promicromonospora were outliers striking most The genes. novel of preponderance a encoded also sequences) 16S rRNA gene near-identical (i.e., genomes related reference closely ( singletons Chloroflexi grated elements like phage or transposons ( like phage or elements grated transposons inte of proliferation from or expansion, gene lifestyle-specific from result possibly and clusters), paralogous or “homogeneous” as here (designated genome single a from arising proteins contained in total) (13,371 clusters these of 25% Approximately genomes. GEBA-I A of total of exclusively clusters were 55,105 proteins composed from clusters protein GEBA-I-only Exploring reduction. genome or streamlining undergone have may they suggests host-associated) but reference, to (e.g., distance novel few genes 16S high with relatively genomes the of sizes Conversely, actual smaller genome. of this indicator for good distance a not evolutionary and underestimation, an likely is 0.018, = (distance distance gene for relationship from three phyla (Thermodesulfobacteria, Nitrospirae and and Nitrospirae (Thermodesulfobacteria, phyla three from multiple sensory PAS folds sensory multiple involving configuration novel a have 2509672) ID: Cluster the novel, not are themselves TCS Although stimuli. environmental to rapidly ing respond and sensing in involved (TCS) systems transduction signal two-component as such functions, regulatory in implicated are ters - clus of most GEBA-I these genomes; of analyzed the size all genome to proportional number largest the clusters, homogeneous 411 ing of genome 13.6-Mbp Thermodesulfobacterium hveragerdense For the remaining 41,734 clusters in GEBA-I genomes (designated as Fig. 2 Fig. 2 ). ). For the Supplementary Fig. 5 Fig. Supplementary , , c , contributed 5,102 genes to GEBA-I-only clusters and and clusters GEBA-I-only to genes 5,102 contributed , c Thermodesulfovibrio thiophilus Thermodesulfovibrio ). One of these clusters is a four-protein cluster (66% amino Fig. 2 Fig. ). We found a subset of clusters that originated from mem r Ktedonobacter racemifer Ktedonobacter K. racemifer K. e c Sphaerochaeta coccoides Sphaerochaeta M. genavense M. M. M. genavense b ). However, a handful of GEBA-I genomes with with genomes GEBA-I of handful a However, ). upeetr Fg 4 Fig. Supplementary Mycoplasma elephantis Ktedonobacter racemifer Ktedonobacter genavense Mycobacterium 2 0 , with some clusters arranged as tandem tandem as arranged clusters some with , 1 9 encoded genes (e.g., Histidine Kinase, Kinase, Histidine (e.g., genes encoded , and high levels of sequence divergence divergence of sequence levels , and high ), suggesting gene expansion by recent expansion gene suggesting ), implied by this minimum 16S rRNA rRNA 16S minimum this by implied genome, this observation is explained is explained observation this genome, Mycobacterium parascrofulaceum Mycobacterium RS16, DSM 19349, contributing contributing 19349, DSM RS16, 17 , 1 , 8 1 Fig. 2 Fig. Thermodesulfobacterium ther . . Thus, the close evolutionary 6 may represent another life another represent may , a member of the phylum phylum the of member a , Fig. 2 , , Fig. Fig. 2 acetivorans Desulfurella , , ). Four related clusters clusters related Four ). b Allofustis seminis Allofustis ). In general, genomes genomes ). In general, b contributed a strik a contributed and c ATCC 51234 and and 51234 ATCC ). ). For the example, Supplementary Supplementary Fig. 1 Fig. , both , both a ). ). ------) ) -

correlation between the number of predicted BCs and genome size size genome and BCs predicted of number the between correlation clear a Weobserved lantipeptides, others. and as thiopeptides, well ectoine as bacteriocins, synthetases, GEBA-I polyketide all synthetases, among tide BCs of number ( genomes total greatest the encoded nations for this observation (as reviewed by Galtier and Daubin and Galtier by reviewed (as - expla observation this Plausible for history. nations species known the from different is gene individual an of history phylogenetic the when is, that discordance, export. quinolone in functions possible with archaeon), marinus using the IMG-ABC system IMG-ABC the using GEBA-Iin the were predicted genomes ( a BCs large new of bounty metabolites, potential of ducers secondary of the selected type strains in this study were known to be prolific pro are referred to gene as clusters” “biosynthetic While only (BCs). a few ondary metabolites are on typically co-localized the chromosome and sec of synthesis the for enzymes biosynthetic encoding Genes tions. auxiliary functions such as defense, communication and other interac directly involved in primary growth and development, but rather have are organic metabolites compounds that are not secondary Microbial metabolites secondary for clusters Biosynthetic helices. transmembrane more or two or peptide signal a either possess 31% of these, and length, in acids amino >100 are singletons ( motifs structural other or signaling of presence and distribution size their assessed we pipelines, prediction gene of artifacts not are singletons ( counterparts eukaryotic its of residues conserved the all containing peptide) signal a on (based E. elysicola slug. Although pepsin-like sea enzymes are mollusk commonly a found of in eukaryotes, tract the gastrointestinal the from isolated 2238, by A ple is encoded a pepsin putative tion of functional novelty still remains to be captured. One such exam that a propor large and confirms functions new potential represents evolution. convergent or history, divergence species of pendent of the duplication history the the phylogeny reflects gene partly inde paralogs, paralogy—for hidden occurs; speciation second the when lineages monophyletic two into resolved fully not is polymorphism ancestral the is, that events, speciation rapid to due sorting lineage by the number and nature of transfers that have transpired; incomplete influenced is phylogeny the where transfer, gene horizontal include: ( two domains of life, namely, namely, life, of domains two from respectively) identity, acid amino 43% and 49% with 4403394 and 4177102 IDs: (Cluster clusters three-gene of pair co-localized a is physiology or niche ecological known of terms in theme unifying to stress resistance to sulfur metabolism sulfur to resistance stress to progression cycle cell chemistry from ranging functions cellular persulfide accomplish to using proteins sulfotrans versatile as described themselves—rhodanese-like ferases, proteins the of function putative the be may speculation this for support Further species. ing cohabit possibly these among transfer gene horizontal suggesting curious, is Proteobacteria as such phyla well-saturated relatively of genomes from membership cluster of lack the databases, sequence in represented well be may not groups taxonomic higher or their era gen particular these of members thermophilic While reduction. of sulfur anaerobic physiology common a share that Proteobacteria) suooada acaciae Hyperheterogeneous clusters Hyperheterogeneous are of instances curious phylogenetic A total of 23,839 BCs were predicted from 1,003 GEBA-I genomes 1,003 from were predicted BCs of A 23,839 total GEBA-I the in genomes identified of singletons number large The VOLUME 35 VOLUME Supplementary Table 2 Table Supplementary (both Proteobacteria) and and Proteobacteria) (both candidate is the first instance of a pepsin secreted bacterial Fig. 3 Fig.

a NUMBER 7 NUMBER ). These included numerous nonribosomal pep nonribosomal numerous included These ). , , . spinosispora P.

2 Supplementary Fig. 6 Fig. Supplementary 3 JULY 2017 JULY Maritalea myrionectae Maritalea . Three Three . ). Based on this, more than 70% of of 70% than more this, on Based ). Endozoicomonas elysicola Endozoicomonas ehnlbs tindarius

Supplementary TableSupplementary 3 2 nature biotechnology nature 1 and and . A case with no apparent apparent no with case A . csinla marina Sciscionella ). To verify that that Toverify ). , , Cucumibacter Cucumibacter genomes genomes DSM (an (an ). 2 2 ------) ) © 2018 Nature America, Inc., part of Springer Nature. All rights reserved. contributing to that cluster. that to contributing type, cluster the by colored is and cluster relative. Outliers, defined as points beyond 90% of the data with the smallest absolute residuals with a linear model, are depicted as red open circles. circles. open red as depicted are model, linear a with residuals absolute smallest the with data the of 90% beyond points as defined non-GEBA Outliers, closest its to relative. genome GEBA-I each of distance rRNA 16S minimum the and singletons and clusters protein in genes of number between ( il EAI eoe aprinn a aeae 65 ( 16.5% average an apportioning actinobacte genomes with GEBA-I rial biosynthesis, metabolite secondary to genome products. natural discovering for fruitful prove may clades these around focused efforts sequencing future families, two above the to belong GEBA-I in genomes BC-rich ten top the of six that Given discovery. gene BC for targeted intensively been not had therefore study,and this before extensively sequenced been not and increase in number of new protein families over time, as new genomes were sequenced and added to public databases (inset). ( (inset). databases public to added and sequenced were genomes new as time, over families protein new of number in increase and with an average of 6.41 ( of an 6.41 average with 2 Figure nature biotechnology nature ( of their genome. Among the actinobacterial GEBA-I soil isolates, isolates, soil GEBA-I actinobacterial the Among genome. their of ers of antibiotics and other natural products natural other and antibiotics of ers While ment). P. acaciae (e.g., microbes multiple cohabiting with involving interactions niches antagonistic) (perhaps ecological particular their of reflective ( 9.58 of average an c Supplementary Fig. 7 Supplementary ) Maximum 16S rRNA distance of genomes contributing a GEBA-I-only protein cluster. Each data point represents a single GEBA-I-only protein protein GEBA-I-only single a represents point data Each cluster. protein GEBA-I-only a contributing genomes of distance rRNA 16S Maximum ) On average, the GEBA-I genomes devote nearly 10% of their their of 10% nearly devote genomes GEBA-I the average, On b a

New clusters per 1,000 genomes Protein clusters identified using GEBA-I genomes. ( genomes. GEBA-I using identified clusters Protein was isolated from was environ isolated a plant competitive rhizosphere 1,000

No.of novel genes in protein clusters 100 200 300 400 500 600 700 800 900 0 and and singletons Streptomyces

1,000 2,000 3,000 4,000 5,000 0 M. genavense Pseudonocardiaceae P. kroppenstedtii 0 ± 1,000 3.4 s.d.) BCs per Mb. This observation is likely likely is observation This Mb. per BCs s.d.) 3.4 .000 .001 0.20 0.15 0.10 0.05 0.00 ). Actinobacterial genomes were with outliers ). Actinobacterial ± 2.4 s.d.) BCs predicted per Mb of sequence Mb of per sequence predicted BCs s.d.) 2.4

species are known to be prolific produc prolific be to known are species 2,000 VOLUME 35 VOLUME 16S distancetoclosestreferencegenome

3,000 families of families x axis is the total number of genes in each cluster, and and cluster, each in genes of number total the is axis

NUMBER 7 NUMBER 4,000 2 4 , genomes from the the from genomes , Actinobacteria 5,000

K. racemifer JULY 2017 JULY a

) Change in growth rate of protein families identified per 1,000 genomes over the years years the over genomes 1,000 per identified families protein of rate growth in Change ) 6,000 ±

8% s.d.) s.d.) 8% Clusters 1,000,000 1,500,000 2,000,000 2,500,000 3,000,000 3,500,000 4,000,000 4,500,000 No. sequencedgenomesovertime 500,000 had

7,000 0 0 - - - 1,000

2,000 antimicrobial and antifungal activity, and are produced by a wide wide a by potent produced are and have activity, that antifungal and metabolites antimicrobial secondary heterocyclic containing in the GEBA-I genomes GEBA-I the in zine pathways with novel operon and structures genes were identified phena new nine example, For GEBA-I. by contributed capabilities of biosynthetic resource genomic rich the and products natural ized character for available information limited the both reflecting fied, unclassi were products BC predicted Most of the GEBA-I genomes. the across BC by each synthesized metabolite of secondary class the compounds. bioactive of classes new contribute might respectively, spring, oil an and soil discovery are vigorously Actinobacteria pursued for new product antimicrobial the previous record for for record This previous the respectively. 36%, trumping genome, any for and far so 39% reported percentage at highest the is BCs of fraction greatest the cyanogriseus GEBA-I Non-GEBA-I 8,000

c 3,000 In addition to predicting biosynthetic gene clusters, we annotated we annotated clusters, gene biosynthetic to predicting In addition Max 16S distance per cluster 4,000 0.30 0.35 0.40 0.00 0.05 0.10 0.15 0.20 0.25

5,000 No. sequencedgenomesovertime 9,000 2 6,000 6 , these , two these previously genera unrepresented isolated from 7,000 21 02 83 640 36 32 28 24 20 16 12 8 4

8,000 10,000 9,000 y

axis is the maximum 16S distance of genomes genomes of distance 16S maximum the is axis 10,000

11,000 11,000 2 Streptomyces bingchenggensis Streptomyces No. ofgenespercluster 12,000 3 . Phenazines are a large class of nitrogen- of class large a are Phenazines .

13,000

and 14,000 12,000

15,000 Smaragdicoccus niigatensis

13,000 Hyper-heterogeneous Homogeneous Heterogeneous

14,000 u o s e R b ) Relationship Relationship )

15,000 2 5 . Given that that Given . encode r e c 679 - - -

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved. peaks. ( peaks. of griseoluteic acid as well as as well as acid griseoluteic of in found tecture archi pathway same the exhibited that genes phenazine-modifying dicarbo (PDC); 1,6 acid phenazine xylic and (PCA), acid phenazines core 1-carboxylic two the (phenazine produce to known taxa all across found P. halotolerans ( observed not was acid seoluteic however, acid; griseoluteic and PDC PCA, phenazines of extract A crude families the in ity halotolerans of and and of placement phylogenetic the indicates star red The genome. per BCs of percentage greatest the with genomes GEBA-I highlight stars Blue clusters. gene biosynthetic encoding genome of percentage the representing bars horizontal with genes range of bacteria. The phenazine pathways encoded in the genomes genomes the in encoded pathways phenazine The bacteria. of range 3 Figure u o s e R 680 ( by produced known) antibiotic (structure phenazine pelagiomicin the for biosynthetic responsible the likely genes identified also we Furthermore, phenazine. new in peaks (unknown metabolites prominent other of the Some different. likely is domain adenylation acid amino the by incorporated acid amino the yet genome, 18316 in to acid griseoluteic to modify known genes Supplementary Fig. 8 Fig. Supplementary a P. agglomerans P. irblie variabilis c . . ( b c

) Liquid chromatography–mass spectrometry (LC/MS) chromatogram from a crude extract of of extract a crude from chromatogram (LC/MS) spectrometry chromatography–mass ) Liquid ) Phenazine operon in in operon ) Phenazine Distribution of biosynthetic clusters (BCs) in GEBA-I genomes. ( genomes. GEBA-I in (BCs) clusters biosynthetic of Distribution DSM 18316 are the first observations of this capabil this of observations first the are 18316 DSM r DSM 18316 included all of the core phenazine genes genes phenazine core the of all included 18316 DSM e c Pantoea agglomerans Pantoea Eh1087 are present in the the in present are Eh1087 P. halotolerans Fig. 3 Fig. ). c

). This operon also contained additional additional contained also operon This ). Proteobacteria Bacteroidetes Firmicutes Actinobacteria ATCC 700307 and and 700307 ATCC P. halotolerans d -alanylgriseoluteic acid -alanylgriseoluteic Fig. 3 Fig. M. variabilis M. DSM 18316 produced three known three produced DSM 18316 Fig. 3 Fig. and and b Transporter ) may contain this potentially ) may potentially this contain Eh1087, a known producer producer known a Eh1087, b Photobacterium halotolerans ). The phenazine operon in in operon phenazine The ). c b DSM 18316 compared to those from from those to compared 18316 DSM d ATCC 700307 (ref. (ref. ATCC 700307 -alanylgriseoluteic acid acid -alanylgriseoluteic Lactoglutathione lyase A/B P. halotolerans P. A/B Pseudomonas fluorescens I Pantoea agglomeransEh1087

D Abundance Photobacterium Photobacterium D 100,000,000 150,000,000 200,000,000 300,000,000 350,000,000 400,000,000 250,000,000 50,000,000 , respectively. , respectively. 2 d 7 . The three three The . B A R -alanylgri E E 0 DSM DSM DSM 18316 1 0 2 HO 8 2–79 - - - - ) a

PDC ) Maximum likelihood phylogenetic tree using 56 conserved single-copy single-copy conserved 56 using tree phylogenetic likelihood ) Maximum F F O N N C O ment rather than numerical dominance within certain environmental samples. gut termite from were fraction tiny a and aquatic, 6.5% associated, host plant 34% terrestrial, 50% were hits new these metagenome proteins ( (ref. samples from these particular habitats. is attributed to finding primarily the of proportion high metagenome ( (21%) samples associated plant- and habitats (28%) aquatic (32%), terrestrial of metagenomes Table 4 ( metagenomes 4,650 from proteins metagenomic unassigned previously 25,576,559 recruited set protein GEBA-I The tein sequences derived from 4,948 metagenomes in the IMG database. GEBA-I proteins were compared to 2,664,695,939 non-redundant pro as phylogenetic anchors for metagenomic studies. A total of 3,402,887 data have improvementdramatic of yielded in metagenomic classification lineages underrepresented phylogenetically of inclusion through set genomes. microbial to Previous efforts expand the genomic reference reference upon dependent largely is data metagenomic to sification clas taxonomic provide and analyze phylogenetically to ability The sequences metagenomic of assignment taxonomic Improved G G OH Although GEBA-I strain selection was based on phylogenetic place VOLUME 35 VOLUME E D 3 2 5 Griseoluteic 2 Acetyl-CoA . Here, we evaluated whether the GEBA-I genomes could serve serve could genomes GEBA-I the whether evaluated we Here, . Pseudomonas fluorescens Pseudomonas 9 Acetyl-CoA synthetase O ), a ginseng field soil isolate, recruited the highest number of number highest the recruited isolate, soil field ginseng a ), acid ). The majority of newly recruited ). proteins The of were from majority recruited derived newly synthetase OH N N O

Aldehyde OH Aldehyde dehydrogenase Photobacterium halotolerans Photobacterium dehydrogenase

6 5 4 PCA NUMBER 7 NUMBER N N O MT MT Time (min) P. halotolerans OH Supplementary Fig. Supplementary 9 GF

Transporter

Fig. 4 Fig. Transporter

2-79 and and 2-79 JULY 2017 JULY Sort chain

a Sort chain dehydrogenase and and

DSM 18316 with labeled phenazine phenazine labeled with 18316 DSM dehydrogenase 10 9 8 7

Solirubrobacter soli Acyl-CoA Supplementary Fig. 9 Fig. Supplementary Pantoea agglomerans Pantoea Acyl-CoA dehydrogenase

DSM 18316 described in in described 18316 DSM dehydrogenase nature biotechnology nature ); ); habitat distribution of

AA adenylation AA adenylation modification Griseoluteic acid biosynthesis Griseoluteic acid biosynthesis Core phenazine

Supplementary Supplementary PPTase PPTase

DSM 22325 ACP ACP Eh1087. ). This This ). C b

- - -

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved. x rhizosphere samples are included (switchgrass (IMG taxon_oid: 3300002128), and and 3300002128), taxon_oid: (IMG (switchgrass included are samples rhizosphere other two against hits top contrast, For 3300001904). taxon_oid: (IMG rhizoplane corn from sample metagenomic against hits blast protein top CDS contiguous scaffold available for this genome. ( genome. this for available scaffold contiguous the on ordered are CDS 3300002702). taxon_oid: (IMG pipeline oil a corroded of biofilm from sample metagenomic against to a habitat. ( a habitat. to hits total of fraction by weighted is color of intensity The follow. that circles concentric colored the in given is hits these for distribution habitat The the total number of metagenomic sequences with protein blast hits to that GEBA-I genome ( genome GEBA-I that to hits blast protein with sequences metagenomic of number total the denotes chart bar a black bearing circle outermost The genome. GEBA-I given the for habitat source isolation the indicates branch tree every of terminus the decorating dots colored The (100%) green to (60%) red from a range in colored is 60% above support branch Internal approach. Phylogeny Distance Genome-Blast the of version high-throughput the using conducted were sequences genome whole of analyses Phylogenetic genomes. GEBA-I baculatum reducer, sulfate anaerobic an is example notable Another University). State Michigan Tiedje, M. James munication, com (personal years in corn subsequent from unstressed in samples present not is it because growth), its for substrate plentiful provided probably years previous from decomposition root (where plot corn continuous drought-stressed a from taken samples rhizoplane corn that Wehypothesize read depth of but ~25), not other plant samples ( rhizosphere two corn rhizoplane samples, at high abundance on (based an average from (CDS)) sequence coding of total coverage 87% (over sequences degrading soil isolate, isolate, soil degrading For cellulose- example, potential. metabolic and encoded abundance of in terms community of microbial the members as important serve may proteins metagenomic recruited significantly that genomes the of number a that evidence found we Furthermore, samples. mental environ individual 1,204 from sequences protein recruit notably to “top recruiters,” designated genomes, 282 were about found samples, 4 Figure nature biotechnology nature a axis by position on six discrete scaffolds (which are themselves ordered by descending sequence length) available for this genome. this for available length) sequence descending by ordered themselves are (which scaffolds discrete six on position by axis

Recruitment of metagenomic sequences by GEBA-I genomes. ( genomes. GEBA-I by sequences metagenomic of Recruitment DSM 4028 (over 85% coverage of total isolate CDS), which b ) Protein recruitment plot showing amino acid percent identity ( identity percent acid amino showing plot recruitment ) Protein

R. cellulosilytica R. Actinobacteria Rudaea cellulosilytica Rudaea

Firmicutes

VOLUME 35 VOLUME is an opportunist in senescing senescing in opportunist an is

NUMBER 7 NUMBER c 3 ) Protein recruitment plot showing percent identity ( identity percent showing plot recruitment ) Protein 0 preferentially recruits recruits preferentially Animal host Waste Human host Plant host Aquatic Terrestrial Other Engineered improved hits New & Desulfomicrobium Desulfomicrobium

JULY 2017 JULY Bacteroidetes Fig. Fig. 4

c a ). ). ) Overview of metagenomic protein sequence recruitment by individual individual by recruitment sequence protein metagenomic of ) Overview - - Tree scale:0.1 to further underscore the impact of broadening the phylogenomic phylogenomic the broadening of impact the underscore further to mechanisms mechanisms for ofinfection animal and plant hosts such bacteria as (and additional examples discussed in in discussed examples additional (and umes of leg of wild root in nodules 16S rRNA surveys reported previously Indeed, community. microbial root plant related closely (e.g., samples metagenome rhizosphere plant several from proteins of recruitment showed ogy) fibrosis (CF) patients (although not known to cause disease or pathol limosus expected. example, for as identified, were exceptions habitat, interesting Some sample metagenome the and strain GEBA-I pipeline of the failure to led that corrosion piv microbial-induced a the in role otal had likely and sample biofilm pipeline oil an in abundant is y axis) of top hits of of hits top of axis)

Overall, we found a correlation between isolation source of the the of source isolation between correlation a found we Overall, 32 Proteobacteria Miscanthus , DSM 16000, a GEBA-I strain isolated from sputum of cystic cystic of sputum from isolated strain GEBA-I a 16000, DSM 3 3 3 1 . There is mounting evidence that human-pathogenic enteric ( Fig. 4 Fig. Supplementary Fig. 7 Fig. Supplementary Inquilinus (IMG taxon_oid: 3300001991). CDS are ordered on the the on ordered are CDS 3300001991). taxon_oid: (IMG b ). Desulfomicrobium baculatum Desulfomicrobium y axis) of of axis) c b % Identity % Identity species or strains may be members of the the of members be may strains or species 100 can colonize plant and tissues, use similar 70 80 90 50 60 100 Arabidopsis 50 60 70 80 90 Rudaea cellulosilytica Rudaea and and Desulfomicrobium baculatumDSM4028 Corn Supplementary Table 4 Supplementary Supplementary Note Supplementary , corn). We hypothesize that that We hypothesize corn). , Rudaea cellulosilytica Inquilinus x Miscanthus axis by position on one one on position by axis DSM 4028 CDS CDS 4028 DSM u o s e R DSM 22992 22992 DSM 34 , spp. have been been have spp. 3 5 . . Our findings Switchgrass Oil pipeline Inquilinus ). ).

r ) serve serve ) e c 681 - - -

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved. the pa the by the research community biochemical validation biochemical logenetically divergent organisms, and underscores the urgent need for results in speculative, sequence-based predictions, particularly for phy difficult. Metagenomic data also contribute to ‘homology creep’, which comparisons and that estimations of accurate assertions, diversity are mean sequences contaminated or chimeric fragmented, highly from arising Artifacts quality. low relative by characterized are genomes these that is problem perceived widely However, a information. able genomes. microbe cultivated of Resource our ters arising from the majority uncultivated are now complemented by cations and metabolites secondary discovery efforts, particularly for biofuel and biotransformation appli in an Amazon river plume river Amazon an in samples, for example, example, for samples, GEBA-I in their prominent genomes and species member discovered of ecology. and diversity breadth microbial the explore to efforts cultivation-independent mentary analyses of biotechnologically relevant pathways, for years to come. experiments, including the development of microbial model cultivable.systems andWe hope that GEBA-I will provide a yet)foundation uncultivated for microbialan array speciesof and strains, the GEBA-I species are all of due to identity reference genomes. absence taxonomic lacking previously communities microbial of members important potentially uncovered we addition, In data. metagenomic of recruitment in improvements observe did we metagenomes), ing isolate genome space (rather than genomes likely to be present in exist the in gaps phylogenetic targeted exclusively strains type of selection amounts of data from uncultivated microorganisms. While our GEBA-I assembly, annotation and interpretation of growing the exponentially foundation to as support a solid serve could characterization, genetic and biochemical along with tion of which, reference isolate genomes, identification of novel associations between viruses and their hosts Table 5 comprising more than 28,000 novel spacer sequences ( mophilic switchgrass-adapted compost switchgrass-adapted mophilic accession codes and references, are available in the the in available are references, and codes accession statements of including Methods, and data availability any associated M to comple in adding case, in this databases, of public representation u o s e R 682 hunter-gatherers traditional brennaborense ment to identify rare ment microbes soil to identify phylogenetic diversity and have provided insights into microbial microbial into evolution and insights ecology provided have and diversity phylogenetic unexplored immense revealed have genomics single-cell or nomics microbes. cultivated of representation the at expanding aimed efforts sequencing phylogeny-driven continued for rationale the ing the number tance highest encoded of novel protein support families, dis phylogenetic increased with Wegenomes that isolates. observed increase the phylogenetic coverage of cultivated bacterial and archaeal This Resource data set is the single largest effort (to our knowledge) to DISCUSSION et Genomes reconstructed from metagenomic data contain much valu Other investigators have taken advantage of early access to the the to access early of advantage taken have investigators Other Unlikegenome sequences reconstructed frommetagenomes of(as- We also report genome features and a large set of CRISPR–Cas systems Recent studies of uncultivated bacteria and archaea using metage using and archaea bacteria of uncultivated studies Recent h pe ods and r . Supplementary Fig. 10 r in the gut microbiomes of non-human primates and and primates non-human of microbiomes gut the in e c 5 5 , rpnm succinifaciens 44– 2 . One path forward, as previously proposed proposed previously as forward, path One . 4 1 4 8 2 3 0 . Those studies have also bolstered gene gene bolstered also have studies Those . , , is the development of a saturated collec 7 , , , , thermophiles Sphaerobacter Ktedonobacter racemifer Ktedonobacter 49– 3 ). These CRISPR–Cas data enabled 8 , , 5 Coraliomargarita Coraliomargarita akajimensis 1 . . New species, strains and clus 4 2 . 3 6 onli and and Supplementary in an enrich ne version of version ne Treponema Treponema 4 1 in ther in 43 3 . 9 ------

of the US Department of Energy. ComputingScientific Center, which is supported by ofthe Office Science No. DE-AC02-05CH11231 and used resources of the National Research Energy Joint Genome Institute, a of DOE Office UserScience Facility, under Contract metagenomic samples. This work was conducted by the US Department of Energy Michigan State University for generously providing formetadata specifics his for the GBDP analysis is gratefully acknowledged. We also thank J.M. Tiedje at J. Bristow and E. Rubin for supporting the project. The use of the bwGRiD cluster N. INSDC, L. ShapiroGoodwin, and T. Tatum for project management and I. S. Dubchak, Wilson, A. Shahab for NCBI registrations and submission to integration into IMG, A. Schaumberg, E. Andersen, S. Hua, H. Nordberg, M. Pillay, J. Huang, E. Szeto and V. Markowitz for additional annotation and for annotation and submission to IMG, A. Chen, K. Chu, K. Palaniappan, for assembly, K. Mavromatis, M. Huntemann, G. Ovchinnikova and N. Mikhailova curation, C.-L. Wei for sequencing, J. Han, A. Clum, B. Bushnell and A. Copeland J. Mallajosyula, M. Isbandi, A. Thomas, D. Stamatis and J. for Bertsch metadata contributed to this project including T.B.K. Reddy, I. Pagani, E. Lobos, DNA, B. and Beck T. Lilburn from ATCC, as well as all the JGI staff that GEBA-I strains, E. Brambilla and B. Trümper, DSMZ, for preparing genomic S. Verbarg from Leibniz Institute DSMZ for contributing the cell pastes for the also like to thank B.J. Tindall, S. Spring, E. R. Pukall, Lang, S. Gronow and We thank H. Maughan for reading critical and feedback on the paper. We would online version of the pape Note: Any Supplementary Information and Source Data files are available in the 12. 11. 10. 9. 8. 7. 6. 5. 4. 3. 2. 1. visit license, this creativecommons.org/licenses/by/4.0 of copy a view holder.To copyright the from directly permission obtain to need will you use, permitted the exceeds or regulation statutory by permitted not If material is not included in the article’s article’s changes were made.The images or other third party material in the this to article link a are provide included source, in the the and author(s) claims jurisdictional in published maps and affiliations. institutional reprints/index.ht at online available is information permissions and Reprints The authors declare no competing interests.financial wrote the paper. All authors edited and approved the manuscript. direction and important feedback on the paper. S.M., E.A.E.-F.R.S., and N.C.K. experiments. N.N.I., T.W. and N.C.K. validated analysis tasks. A.V. gave project performed analysis tasks. andR.C.C. Y.Y. performed phenazine biosynthetic cluster S.M., N.J.V.,R.S., E.A.E.-F., J.P.M.-K., M.G., M.H., G.A.P., D.P.-E. and A.P. N.C.K., J.A.E., P.H., H.-P.K., G.M.G., W.B.W. and T.W. conceived the project. AUTHO A C ckn O MPETING FINANCIA MPETING Kyrpides, N.C. Kyrpides, there, we are Archaea: and Bacteria of driven genomics A G.M. Garrity, of Nomenclature of Code International G.M. Garrity, & B.J. Tindall, C.T., Parker, and Archaea of classification genome-based a to H.-P.route Klenk, En Göker,M. & Names. Species from Classification Strain Divorcing D.A. Baltrus, N.J. Varghese, genome- large-scale for selection H.-P.target Phylogeny-driven Klenk, & M. Göker, C. Rinke, and challenges the meeting genomics: microbial of years Fifteen N.C. Kyrpides, of D. Wu, Myriads C.A. Ouzounis, & V. Lorenzo, de A.J., Enright, I., Cases, V., Kunin, S. Mukherjee, feature enhancements. feature yet? (2015). Prokaryotes. Bacteria? projects. other) (and sequencing dream. the fulfilling counting. still and families, protein Nature myriad of type strains. type of myriad matter. , 431–439 (2016). 431–439 24, Nucleic Acids Res. Acids Nucleic VOLUME 35 VOLUME o C J. Clin. Microbiol. Clin. J. reative w R R C , 1056–1060 (2009). 1056–1060 462, A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. and Bacteria of encyclopaedia genomic phylogeny-driven A al. et Nature l edgments O Syst. Appl. Microbiol. Appl. Syst. Insights into the phylogeny and coding potential of microbial dark microbial of potential coding and phylogeny the into Insights al. et This article is licensed under a in any medium or format, as long as you give appropriate credit to the original License, which permits use, sharing, adaptation, distribution and reproduction NTRIBUTI C Int. J. Syst. Evol. Microbiol. m ommons license, unless indicated otherwise in a credit line to the material. l Microbial species delineation using whole genome sequences. genome whole using delineation species Microbial al. et Genomic encyclopedia of bacteria and archaea: sequencing a sequencing archaea: and bacteria of encyclopedia Genomic al. et , 431–437 (2013). 431–437 499, . . Publisher’s note: eoe OLn Dtbs (OD v6 dt udts and updates data v.6: (GOLD) Database OnLine Genomes al. et

NUMBER 7 NUMBER

, 6761–6771 (2015). 6761–6771 43, Nat. Biotechnol. Nat. r .

PLoS Biol. PLoS , 1956–1963 (2016). 1956–1963 54, Nucleic Acids Res. Acids Nucleic O L NS INTERESTS

/

. , 175–182 (2010). 175–182 33, Stand. Genomic Sci. Genomic Stand. JULY 2017 JULY C S , e1001920 (2014). e1001920 12, reative pringer Nature remains neutral with neutral Nature remains regard to pringer Genome Biol. Genome , 627–632 (2009). 627–632 27, C http://dx.doi.org/10.1099/ijsem.0.00077 8 reative C C

reative reative ommons license and your intended use is , D446–D456 (2017). D446–D456 45,

C ommons Attribution 4.0 International nature biotechnology nature C , 401 (2003). 401 4, ommons license, and indicate if if indicate and license, ommons

, 360–374 (2013). 360–374 8, http://www.nature.c TrendsMicrobiol.

http:// om/ © 2018 Nature America, Inc., part of Springer Nature. All rights reserved. 31. 30. 29. 28. 27. 26. 25. 24. 23. 22. 21. 20. 19. 18. 17. 16. 15. 14. 13. nature biotechnology nature (2013). Int. J. Syst. Evol. Microbiol. Evol. Syst. J. Int. Int. J. Syst. Evol. Microbiol. Evol. Syst. J. Int. An, D. An, H.-Y. Weon, M.K. Kim, N. Imamura, phenazine novel a of Characterization H.K. Mahanty, & Y. Feng, S.R., Giddens, metabolites. microbial Bioactive J. Bérdy, P.Cimermancic, metabolism secondary the of regulation The K.J. McDowall, & G.P. Wezel, van M. Hadjithomas, the in variations analyses. phylogenomic in and V.incongruence Daubin, with & Dealing Galtier,N. themes Common P. Visca, & P. Ascenzi, R., Cipollone, B. Abt, new the taxonomy: mycobacterial Taylor,PASI.B. Zhulin, & B.L. potential, redox oxygen, of sensors internal domains: on studies genotypic of Impact E. Tortoli, A. Roth, Y.-J.Chang, I.-M.A. Chen, C.T.,P.Tyson,Skennerton, Hugenholtz, & G.W.M., CheckM: Imelfort, D.H., Parks, N.C. Kyrpides, thousand microbial genomes (KMG-I) project. (KMG-I) genomes microbial thousand of prokaryotic biosynthetic gene clusters. gene biosynthetic prokaryotic of marine bacterium marine isnhtc ee lses n nvl eodr metabolites. secondary (2015). novel and clusters gene biosynthetic (2002). in cluster gene antibiotic rhodanese superfamily. rhodanese light. and 1990s. the of mycobacteria cells, single isolates, metagenomes. and from recovered genomes microbial of quality the assessing analysis system. analysis , 28 (2016). 28 7, bisulfite. with injected pipelines in corrosion microbial of cause type strain (SPN1(T)), reclassification in the genus the (2012). Sphaerochaeta. 194–209 genus in the and Spirochaetaceae reclassification family (SPN1(T)), strain as Sphaerochaeta type coccoides f tetmcs nw ik ad xeietl advances. experimental and links new Streptomyces: of Trans. R. Soc. Lond. B Lond. Soc. Trans.R. Sci. Genomic Stand. bacterium soil filamentous the of genus-specific amplification of the 16S-23S rRNA gene spacer andrestriction genespacer rRNA endonucleases. the 16S-23S of amplification genus-specific 1311–1333 (2011). 1311–1333 Metagenomic analysis indicates as a potential a as epsilonproteobacteria indicates analysis Metagenomic al. et opee eoe eune f h trie idu bacterium hindgut termite the of sequence genome Complete al. et Novel diagnostic algorithm for identification of mycobacteria using mycobacteria of identification for algorithm diagnostic Novel al. et Microbiol. Mol. Biol. Rev. Biol. Mol. Microbiol. et al. et Non-contiguous finished genome sequence and contextual data contextual and sequence genome finished Non-contiguous al. et t al. et e atcne atbois eaimcn, rdcd y new a by produced pelagiomicins, antibiotics anticancer New al. et IMG/M: integrated genome and metagenome comparative data comparative metagenome and genome integrated IMG/M: al. et J. Clin. Microbiol. Clin. J. eoi Eccoei o Tp Sris Pae : h one the I: Phase Strains, Type of Encyclopedia Genomic al. et Nucleic Acids Res. Acids Nucleic sp. nov., isolated from soil of a ginseng field. ginseng a of soil from isolated nov., sp. soli Solirubrobacter Insights into secondary metabolism from a global analysis global a from metabolism secondary into Insights al. et Pelagiobacter variabilis. Pelagiobacter ob nv ad mnain o the of emendations and nov. comb. coccoides Sphaerochaeta M-B: kolde ae o ul icvr of discovery fuel to base knowledge a IMG-ABC: al. et Genome Res. Genome e. o. s. o. ioae fo soil. from isolated nov., sp. nov., gen. cellulosilytica Rudaea

, 97–111 (2011). 97–111 5,

IUBMB Life IUBMB , 4023–4029 (2008). 4023–4029 363, Eh1087. herbicola Erwinia

Clin. Microbiol. Rev. Microbiol. Clin. , 2308–2312 (2009). 2308–2312 59, , 1453–1455 (2007). 1453–1455 57,

VOLUME 35 VOLUME type strain (SOSP1-21). strain type racemifer Ktedonobacter , 1043–1055 (2015). 1043–1055 25,

, 1094–1104 (2000). 1094–1104 38,

, 51–59 (2007). 51–59 59, , D507–D516 (2017). D507–D516 45, , 479–506 (1999). 479–506 63, Cell J. Antibiot. (Tokyo)Antibiot. J. J. Antibiot. (Tokyo)Antibiot. J. Stand. Genomic Sci. Genomic Stand. , 412–421 (2014). 412–421 158,

NUMBER 7 NUMBER , 319–354 (2003). 319–354 16, Mol. Microbiol. Mol. tn. eoi Sci. Genomic Stand. a. rd Rep. Prod. Nat.

, 1–26 (2005). 1–26 58,

, 8–12 (1997). 8–12 50, MBio JULY 2017 JULY Front. Microbiol. Front.

, 1278–1284 9, , 769–783 45, e00932 6,

Phil. 28,

6,

43. 45. 44. 42. 41. 40. 39. 38. 37. 36. 35. 34. 33. 52. 51. 50. 49. 48. 47. 46. 32.

(2016). enrichments and metagenomics. and enrichments Paez-Espino, D. D. Paez-Espino, uc, D.B. Rusch, S. Yooseph, D’haeseleer,P. Pati, A. B.M. Satinsky, K. Mavromatis, T.O. Delmont, microbiomes. gut A.J. Obregon-Tito, C. Han, A. Schikora, Dong, Y., Iniguez, A.L., Ahmer, B.M.M. & Triplett, E.W. Kinetics and strain specificity F. Zakhia, Anton, B.P., Kasif, S., Roberts, R.J. & Steffen, M. Objective: biochemical function. biochemical Objective: M. Steffen, & B.P.,R.J. Anton, Roberts, S., Kasif, metagenome-derived a of Characterization W.R. Streit, & H.L. Steele, S., Voget, Li, L.-L., McCorkle, S.R., Monchy, S., Taghavi, S. & van der Lelie, D. Bioprospecting C.-J. Guo, Anantharaman, K. L.A. Hug, S. Sunagawa, Deng, Z.S. host proteases. host the universe of protein families. protein of universe the (S 6022). (S microbiome. and Starkeya. and legumes in Tunisia and first report for nifH-like gene within the genera Microbacterium (6091). f hzshr ad nohtc ooiain y nei bcei o seedlings truncatula. on Medicago bacteria and (2003). enteric 1783–1790 sativa by Medicago colonization endophytic of and rhizosphere of Front. Genet. Front. cellulase. halotolerant biomass. converting for hydrolases glycosyl metagenomes: 463–475 (2011). 463–475 China. in Plateau Loess of regions different in salsula Atlantic through eastern tropical Pacific. tropical eastern through Atlantic n mttasrpoi ivnois f h Aao Rvr lm, ue 2010. June plume, River Amazon the Microbiome of inventories metatranscriptomic and igohmcl rcse i a aufr system. aquifer an in processes biogeochemical animals. adapted to deconstruct switchgrass. deconstruct to adapted (2016). 10 (2009). 10 type strain (04OKA010-24). strain type et al. Complete genome sequence of Complete genome sequence of sequence genome Complete al. et Stand. Genomic Sci. Genomic Stand. PLoS One PLoS A new view of the tree of life. of tree the of view new A al. et Stand. Genomic Sci. Genomic Stand. et al. Diversity of endophytic bacteria within nodules of the icvr o ratv mcoit-eie mtblts ht inhibit that metabolites microbiota-derived reactive of Discovery al. et ies bcei ascae wt ro ndls f spontaneous of nodules root with associated bacteria Diverse al. et , 17 (2014). 17 2, h Sree I Goa Oen apig xeiin expanding expedition: Sampling Ocean Global II Sorcerer The al. et Conservation of Salmonella infection mechanisms in plants and plants in mechanisms infection Salmonella of Conservation al. et Science h Sree I Goa Oen apig xeiin northwest expedition: Sampling Ocean Global II Sorcerer The al. et , 210 (2014). 210 5, Microb. Ecol. Microb. ca pako. tutr ad ucin f h goa ocean global the of function and Structure plankton. Ocean al. et Proteogenomic analysis of a thermophilic bacterial consortium bacterial thermophilic a of analysis Proteogenomic al. et eosrcig ae ol irba gnms sn i situ in using genomes microbial soil rare Reconstructing al. et h Aao cnium aae: uniaie metagenomic quantitative dataset: continuum Amazon The al. et Cell Complete genome sequence of sequence genome Complete al. et et al. Thousands of microbial genomes shed light on interconnected Nat. Commun. Nat. t al. et Subsistence strategies in traditional societies distinguish societies traditional in strategies Subsistence al. et , e24112 (2011). e24112 6, , 517–526.e18 (2017). 517–526.e18 168, , 1261359 (2015). 1261359 348, J. Biotechnol. J. Uncovering Earth’s virome. virome. Earth’s Uncovering , 375–393 (2006). 375–393 51,

Stand. Genomic Sci. Genomic Stand. , 361–370 (2011). 361–370 4,

, 49–56 (2010). 49–56 2, PLoS Biol. PLoS Front. Microbiol. Front. , 6505 (2015). 6505 6, PLoS One PLoS , 26–36 (2006). 26–36 126, PLoS Biol. PLoS Sphaerobacter thermophilus type strain , e16 (2007). e16 5, type strain type Treponemasuccinifaciens , e68465 (2013). e68465 8, Nat. Microbiol. Nat.

, 358 (2015). 358 6, pl Evrn Microbiol. Environ. Appl. , 290–299 (2010). 290–299 2, , e77 (2007). e77 5, Coraliomargarita akajimensis Coraliomargarita a. Commun. Nat. FEMS Microbiol. Ecol. Microbiol. FEMS u o s e R Nature Biotechnol. Biofuels Biotechnol. , 16048 (2016). 16048 1,

536 Sphaerophysa , 425–430 425–430 , 13219 7, r e c

683 69, 76, 2,

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved. followed by iTOL by followed FastTree using visualized were Trees model. LG standard the bootstrap and 100 replicates using robustness for tested were topologies Tree 7.6.3). sion (ver RAxML with methods likelihood maximum the using inferred was tree is less than 300 or greater than 1,200 (ref. (ref. 1,200 than greater or 300 than less is pair base million per genes of number the or 100% than greater or 70% than less is density coding the if or assignment taxonomic phylum-level of lack to and genomes as flagged +low-quality” by the IMG quality control pipeline due mids, genome fragments, uncultured single cells, genomes from metagenomes - plas excluded genomes High-quality of GEBA-I genomes. analysis we began GEBA genomes in high-quality IMG as (14,625) of May 2014 were used, when analysis. comparative for selection set Control (CDSs) were identified using Prodigal using identified were (CDSs) single-copy proteins in bacteria and archaea and bacteria in proteins single-copy phylogeny. gene single-copy Conserved in reported scores and and manual curation using the JGI pipeline GenePrimp at the DOE Joint Genome Institute (JGI) using Illumina technology Illumina using (JGI) Institute Joint at Genome DOE the annotation. and assembly Sequencing, sequencing. for JGI DOE to sent was which DNA, genomic isolate Each and paste centers. cell the the generated of center sites web the at described as protocols standard using ( ATCC by provided were 133 strains remaining the DSMZ Institute while by were list provided Leibniz DNA isolation. and growth annotated. and sequenced were genomes 1,000 target a as soon as started conditions was Analysis for DNA growth cells extraction. of sufficient for production the allow that screened were species (PD) scoring highest the and excluded were (GOLD) Database Online Genomes the in registered projects sequencing genome completed or ongoing with Species tree. underlying the as used was (species/subspecies) leaves 8,029 comprising 9/2010), of as able from Phylosift from (ref. HMMER3 in included hmmalign and hmmsearch using and aligned were detected genes Marker tree. phylogenetic the the CheckM diversity (PD), as inferred from a phylogenetic tree with computed branch branch lengths computed with tree phylogenetic a from inferred as (PD), diversity phylogenetic total the to species each of contribution relative the measures selection. Organism MET ONLINE nature biotechnology nature library with an average insert size of 270bp. Majority of the genomes were Velvetwere using genomes the assembled of Majority 270bp. of size insert average an with library paired-end short-insert an Illumina and sequenced we constructed genomes, as balanced Relative Phylogenetic Diversity (bRPD) as described earlier described as (bRPD) Diversity Phylogenetic Relative balanced as the to tree LTPgene rRNA the 16s from inferred strain was diversity type phylogenetic overall each of contribution The (LTP). Project Tree Living the mapped to and contained the in species subspecies the last from (s123) release strains. type all of distance 16S in Increase GEBA-I genomes. the of analysis the began we when May 2014, of as IMG in available publicly genomes from names genus available publicly found at the JGI website ( website JGI the at found be can JGI the at performed sequencing and construction library of aspects rithm algo kClust the using clustered were genomes control 14,625 and genomes clusters. Protein Genomes (IMG-ER) platform (IMG-ER) Genomes Microbial Integrated the within performed were analyses and additional tion on a modified incremental, greedy clustering strategy, where sequences are are sequences where strategy, clustering greedy incremental, modified a on relies that tool clustering sensitive and fast a is kClust cluster. the of seed or sequence longest the with length alignment 80% over identity sequence wise the DOE–JGI genome annotation pipeline annotation genome DOE–JGI the 6 6 6 using default parameters, which amounts to 20–30% maximum pair maximum to amounts 20–30% which parameters, default using . . The All-Species-Living-Tree-Project (LTP) 1 4 genome quality estimator and individual CheckM completeness 6 3 . Alignments were concatenated and filtered. A phylogenetic A phylogenetic filtered. and concatenated were . Alignments H 26,873,871 non-redundant proteins from 1,003 GEBA-I GEBA-I 1,003 from proteins non-redundant 26,873,871 6 Supplementary Table 1 Supplementary ODS 4 . The number of new genomes was calculated based on on based calculated was genomes new of number The . Target organisms were selected based on a score that that score a on based selected were organisms Target 5 5 and ALLPATHS and Supplementary Table 1 Table Supplementary http://jgi.doe.gov 1 5 . Genome completeness was estimated using using estimated was completeness Genome . 5 Most strains (870) from the GEBA-I the from (870) strains Most 9 All GEBA-I strains were sequenced sequenced were strains GEBA-I All followed by a round of automated automated of round a by followed 6 A set of 56 universally conserved conserved universally 56 of set A 57 . 5 6 5 6 2 ). , / assembly methods. All general general All methods. assembly 5 6 ). Genomes were annotated by annotated were Genomes ). ) using HMM profiles obtained obtained profiles HMM ) using GEBA-I and type strains were were strains type and GEBA-I 8 1 . Briefly, protein-coding genes genes protein-coding . Briefly, was used for construction of of construction for used was As a control set, all non- all set, control a As 5 ). Strains were cultivated cultivated were Strains ). 3 phylogenetic tree (avail 6 0 . Functional . annota Functional 5 4 . . For all 6 . - - - - - GEBA-I genome are provided in in provided are GEBA-I genome to each corresponding numbers accession GenBank (INSDC). Collaboration Database Sequence Nucleotide International The through available also are (BCs) were predicted and annotated using AntiSMASH version 3.0.4 (ref. 3.0.4 version AntiSMASH using and annotated were predicted (BCs) clusters of in Prediction biosynthetic GEBA-I. in given are lengths and sizes Table cluster 2 Supplementary of distribution A larger. much are clusters of majority the although sequences, more or two of composed is ter 61. 60. 59. 58. 57. 56. 55. 54. 53. hog te M pra ( portal IMG the through availability. Data service. web ITOL the using visualized pseudo-bootstrapping from eters except an e-value filter of 10 of filter e-value an except eters (v2.2.30) BLAST+ with conjunction put version of analyses whole genome were sequences using conducted the high-through Phylogenetic analyses using Genome-Blast Distance Phylogeny. are sample a scaffolds. long in produce to likely organisms abundant most the only that on—the being resides gene assumption the that scaffolds the of length assembled average on based conjectured is “abundance” available, not was information depth where read however, scaffold, the of depth read average on based estimated was of 282 relative abundance genomes was Where ~27%. a possible, top recruiter for coverage median the and 94% was obtained coverage maximum The size. hits 200 of could cutoff represent as this low as 2% though coverage of even total CDS based ongenes, individual genome housekeeping merely than more included evidence the that to hit was count ensure this for choosing rationale 70% alignment length to an individual metagenome CDS were considered. The GEBA-I genomes that had over 200 CDS hits at >95% amino acid identity over isolate reference may be a closer phylogenetic match). For “top recruiters,” only hand refers to that a our 20% improvement hit over (suggesting a pre-existing acid amino >30% (at identity) set were as deemed new “Improvedrecruiters. control recruitment” on the other the from protein a recruit previously not did that CDS metagenome only and genomes, control 14,625 with searched also were metagenomes recruitment, “new” Toestablish protein. longer the of length alignment 50% over of 30% identity a minimum with protein nome protein A database. from an if “recruited” genome it is isolate deemed has a LAST hit to IMG a metage the in deposited metagenomes public assembled trimming algorithm, formula d9 and 100 pseudo-bootstrap replicates. replicates. matrices distance FastME using intergenomic pseudo-bootstrap the from 100 inferred were and trees d9 Phylogenetic formula algorithm, trimming genomes were searched using LAST using searched were genomes sequences. of metagenome Recruitment default. as left were options other All options. “borderpredict” the and “inclusive” the with similar k-mers and uses those similarities to rank sequence pairs. A clus A pairs. sequence rank to similarities those uses and k-mers similar between matches computes that pre-filter an alignment-free using compared leFdoh E.A. Eloe-Fadrosh, A. Pati, D. Hyatt, Tripp, H.J. Huntemann, M. Butler, J. using assembly read short novo de for Birney,Velvet:algorithms & E. D.R. Zerbino, K. Mavromatis, Yarza,P. genomes. tree of all sequenced type strains. type sequenced all of tree de Bruijn graphs. Bruijn de (2012). their impact on microbial genome assemblies and annotation. Genome Res. Genome Stand. Genomic Sci. Genomic Stand. candidate phylum in geothermal springs. geothermal in phylum candidate v.4). (MAP Pipeline Annotation identification. GenePRIMP: a gene prediction improvement pipeline for prokaryotic for pipeline improvement prediction gene a GenePRIMP: al. et 6 The All-Species Living Tree project: a 16S rRNA-based phylogenetic rRNA-based Tree16S Living a All-Species project: The al. et et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. 9 Prodigal: prokaryotic gene recognition and translation initiation site initiation translation and recognition gene prokaryotic Prodigal: al. et Nat. Methods Nat. of Phylogeny Distance the Genome-Blast approach (GBDP) et al. Toward a standard in structural genome annotation for prokaryotes. 7 2 , and the tree from the original together with branch support support branch with together original the from tree the and , , 810–820 (2008). 810–820 18, BMC Bioinformatics BMC The fast changing landscape of sequencing technologies and technologies sequencing of landscape changing fast The al. et et al. The standard operating procedure of the DOE-JGI Metagenome All available genomic data and annotations are available available are annotations and data genomic available All Genome Res. Genome lbl eaeoi sre rvas nw bacterial new a reveals survey metagenomic Global al. et

, 45 (2015). 45 10, . , 455–457 (2010). 455–457 7, https://img.jgi.doe.gov Stand. Genomic Sci. Genomic Stand. , 821–829 (2008). 821–829 18, Supplementary Table 1 Supplementary Syst. Appl. Microbiol. Appl. Syst. −8 , 119 (2010). 119 11, 6 7 . GBDP was run with the greedy-with- the with run was GBDP . 8 1 against 2,664,695,939 CDS from 4,948 4,948 from CDS 2,664,695,939 against in BLASTP mode with default param default with mode BLASTP in Nat. Commun. Nat. 3,402,887 CDS from 1,003 GEBA-ICDS from 1,003 3,402,887 Putative clustersBiosynthetic

/ , 17 (2016). 17 11, ). The GEBA-I genomes genomes GEBA-I The ).

, 10476 (2016). 10476 7, , 241–250 (2008). 241–250 31, doi:10.1038/nbt.3886 . PLoS One Phylogenetic 7, e48837 7 0 6 in 7 - - - - )

© 2018 Nature America, Inc., part of Springer Nature. All rights reserved. 67. 66. 65. 64. 63. 62. doi:10.1038/nbt.3886 Weber, T.Weber, large of clustering sensitive and fast kClust: J. Söding, & C.E. Mayer, M., Hauser, M. Huntemann, of display and annotation online v2: Life TreeOf P.Interactive Bork, & I. Letunic, A.E. Darling, Mistry,Eddy,R.D., Finn, J., homology in Challenges M. Punta, & A. Bateman, S.R., PeerJ Genome Annotation Pipeline (MGAP v.4). (MGAP Pipeline Annotation Genome of biosynthetic gene clusters. gene biosynthetic of protein sequence databases. sequence protein easy. made trees phylogenetic regions. Res. coiled-coil of evolution convergent and HMMER3 search: , e121 (2013). e121 41, , e243 (2014). e243 2, antiSMASH 3.0-a comprehensive resource for the genome mining genome the for resource comprehensive 3.0-a antiSMASH al. et PhyloSift: phylogenetic analysis of genomes and metagenomes. and genomes of analysis phylogenetic PhyloSift: al. et The standard operating procedure of the DOE-JGI Microbial DOE-JGI the of procedure operating standard The al. et BMC Bioinformatics BMC Nucleic Acids Res. Acids Nucleic Nucleic Acids Res. Acids Nucleic Stand. Genomic Sci. Genomic Stand.

, 248 (2013). 248 14, , W237 (2015). W237 43, , W475 (2011). W475 39,

, 86 (2015). 86 10, uli Acids Nucleic 72. 71. 70. 69. 68. 421 (2009). 421 Lefort, V., Desper, R. & Gascuel, O. FastME 2.0: a comprehensive, accurate, and accurate, comprehensive, a 2.0: FastME O. Gascuel, & R. V.,Desper, Lefort, C. Camacho, distance BLAST Genome M. Göker, & B.R. Holland, S.R., Henz, A.F., Auch, parallelized Highly M. Göker, & H.-P. Klenk, A.F., Auch, J.P., Meier-Kolthoff, tame seeds Adaptive M.C. Frith, & P. Horton, K., Sato, R., Wan, S.M., Kiełbasa, (2015). program. inference phylogeny distance-based fast Bioinformatics BMC phylogenies inferred from whole plastid and whole mitochondrion genome sequences. phylogenies. genome-based large of inference comparison. sequence genomic 1715–1729 (2014). 1715–1729 BLAST+: architecture and applications. and architecture BLAST+: al. et , 350 (2006). 350 7, Genome Res. Genome , 487–493 (2011). 487–493 21, ocr. opt Pat Exp. Pract. Comput. Concurr. nature biotechnology nature Mol. Biol. Evol. Biol. Mol. BMC Bioinformatics BMC , 2798–2800 32, 10, 26, ERRATA

Erratum: 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life Supratim Mukherjee, Rekha Seshadri, Neha J Varghese, Emiley A Eloe-Fadrosh, Jan P Meier-Kolthoff, Markus Göker, R Cameron Coates, Michalis Hadjithomas, Georgios A Pavlopoulos, David Paez-Espino, Yasuo Yoshikuni, Axel Visel, William B Whitman, George M Garrity, Jonathan A Eisen, Philip Hugenholtz, Amrita Pati, Natalia N Ivanova, Tanja Woyke, Hans-Peter Klenk & Nikos C Kyrpides Nat. Biotechnol. 35, 676–683 (2017); published online 12 June 2017; corrected after print 28 February 2018

In the version of this article initially published, the wrong Creative Commons Attribution license (cc-by-nc rather than cc-by) was inserted. The error has been corrected in the HTML and PDF versions of the article. © 2018 Nature America, Inc., part of Springer Nature. All rights reserved. All rights part Nature. of Springer Inc., America, Nature © 2018

NATURE BIOTECHNOLOGY