Articles DOI: 10.1038/s41564-017-0012-7

Recovery of nearly 8,000 metagenome-assembled substantially expands the tree of life

Donovan H. Parks , Christian Rinke , Maria Chuvochina, Pierre-Alain Chaumeil, Ben J. Woodcroft, Paul N. Evans, Philip Hugenholtz * and Gene W. Tyson*

Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-inde- pendent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >​1,500 public metagenomes. All genomes are estimated to be ≥​50% complete and nearly half are ≥​90% complete with ≤​5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal trees by >30%​ and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.

equencing of microbial genomes has accelerated with reduc- phyla previously lacking genomic representatives27–29, including tions in sequencing costs, and public repositories now contain the Patescibacteria superphylum4, which has subsequently been nearly 70,000 bacterial and archaeal genomes. The majority referred to as the ‘Candidate Phyla Radiation’ (CPR) as it may con- S 1,2 10,30 of these genomes have been obtained from axenic cultures and sist of upwards of 35 candidate phyla . Notable evolutionary and disproportionately reflect microorganisms of medical importance3. metabolic insights include the discovery of eukaryotic-like cytoskel- Consequently, current genome repositories are not representative eton genes in the archaeon Lokiarchaeota31,32 and the identification of the microbial diversity known from 16S rRNA gene surveys4. of putative methane-metabolizing genes in the Bathyarchaeota and Concerted efforts are being made to address this limitation by Verstraetearchaeota phyla33,34. These initial studies demonstrate the target­ing phylogenetically distinct microorganisms for cultivation5–7 need for additional genomic representatives across the tree of life in and single- sequencing4,8. Although these approaches continue order to more fully appreciate microbial evolution and metabolism. to provide valuable reference genomes, the former is restricted to Here, we present the first large-scale initiative to recover MAGs microorganisms amenable to cultivation and the latter is hampered from publicly available metagenomes. Nearly 8,000 draft-quality by technical challenges and the need for specialised equipment9. genomes were recovered from over 1,500 metagenomes, more than Obtaining genomes from metagenomes is an emerging approach with a threefold increase over large initiatives to genomically populate the potential for large-scale recovery of near-complete genomes10–12. the tree of life such as the Genomic Encyclopedia of Bacteria and Until recently, recovering genomes from metagenomic data was Archaea35 (~2,000 genomes), the Human Microbiome Project3 restricted to samples with low microbial diversity13, but improved (~2,000) and the largest previous MAG study11 (~2,500). We refer to sequencing throughput and advances in computational techniques our set of MAGs as the Uncultivated Bacteria and (UBA) now allow metagenome-assembled genomes (MAGs) to be recov- data set. Genome-based phylogenetic analysis indicates that the ered from high diversity environments14,15. MAGs are obtained UBA genomes provide the first representatives of several major by grouping or ‘binning’ together assembled contigs with simi- bacterial and archaeal lineages and substantially expand genomic lar sequence composition, depth of coverage across one or more representation across the tree of life. related samples and taxonomic affiliations16,17. Several tools have been developed that exploit these sources of information to pro- Results duce genomes from metagenomic data18–21 and there are ongo- Genomes are readily recovered from metagenomic data. MAGs ing efforts to evaluate the effectiveness of different approaches22. were recovered from 1,550 metagenomes submitted to the Sequence Although closed genomes have been obtained using metagenomic Read Archive (SRA) before 31 December 2015 (Supplementary binning methods10,23, MAGs are typically incomplete and may con- Fig. 1). We predominantly considered environmental and non- tain contigs from multiple strains or species due to challenges in human gastrointestinal samples in order to focus on metagenomes distinguishing between related community members both in the likely to contain microbial populations from under-sampled lin- assembly and binning processes19,24. This has spurred the develop- eages (Supplementary Table 1). The completeness and contamina- ment of methods for assessing the quality of recovered MAGs in tion of each MAG was estimated from the presence and absence order to allow biological inferences to be made with regards to their of lineage-specific genes expected to be ubiquitous and single estimated completeness and contamination25,26. copy25, and these estimates, along with assembly statistics, used to Significant insights have recently been made based on the MAGs identify genomes suitable for further study. A total of 64,295 MAGs of uncultivated microorganisms. These include elucidation of several were obtained, of which 7,903 (7,280 bacterial and 623 archaeal)

Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, Queensland 4072, Australia. *e-mail: [email protected]; [email protected]

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Articles Nature Microbiology

a 12% b 83% of genomes have ≤200 sca olds 12 10 Near complete (3,438 genomes; 43.5%) Medium quality (3,459 genomes; 43.8%) 8 Partial (1,006 genomes; 12.7%) 4

8 Genomes (%) 0 0 100 200 300 400 500 No. of sca olds 6

c 75% of genomes have 4 20 ≥15 standard tRNAs Contamination (%) 15

2 10

Genomes (%) 5

0 0 50 60 70 80 90 100 33% 20 15 10 5 0 Completeness (%) Standard tRNA count

Fig. 1 | Assessment of genome quality. a, Estimated completeness and contamination of 7,903 genomes recovered from public metagenomes. Genome quality was defined as completeness −​ 5 ×​ contamination, and only genomes with a quality of ≥​50 were retained. Near-complete genomes (completeness ≥​90%; contamination ≤​5%) are shown in red, medium-quality genomes (completeness ≥​70%; contamination ≤​10%) in blue, and partial genomes (completeness ≥​50%; contamination ≤​4%) in grey. Histograms along the x and y axes show the percentage of genomes at varying levels of completeness and contamination, respectively. Notably, only 171 of the 7,903 (2.2%) UBA genomes have >​5% contamination. b, Number of scaffolds comprising each of the 7,903 genomes with colours indicating genome quality. c, Number of tRNAs for each of the 20 standard amino acids identified within each of the 7,903 genomes with colours indicating genome quality. form the UBA data set as they met our filtering criteria of having an Taxonomic distribution of UBA genomes. The phylogenetic rela- estimated quality ≥​50 (defined as the estimated completeness of a tionships of the UBA genomes were determined across bacterial genome minus five times its estimated contamination) and consist- and archaeal trees inferred from three concatenated protein sets: ing of ≤500​ scaffolds with an N50 ≥​10 kb (Fig. 1 and Supplementary (1) a syntenic block of 16 ribosomal proteins (rp1) recently used to Table 2). Over 93% of the 7,903 UBA genomes have an average cov- infer genome-based phylogenies10,30 (Supplementary Table 4), (2) 23 erage of ≥​10× ​(5th percentile, 9.2×,​ 95th percentile, 268×​) and ribosomal proteins (rp2) previously tested for lateral gene transfer4 95.8% have >​5×​ coverage over 90% of bases, providing assurance (Supplementary Table 5), and (3) 120 bacterial (bac120) and 122 of high-quality base-calling across the genomes3,36. Among the UBA archaeal (ar122) proteins we have identified as being suitable for genomes is a subset of 3,438 near-complete genomes (3,225 bacte- phylogenetic inference (Supplementary Tables 6 and 7). The trees rial and 213 archaeal) estimated to be ≥​90% complete with ≤​5% span ~19,000 bacterial and ~1,000 archaeal genomes after species- contamination (Fig. 1a). These genomes consist of ≤​100 scaffolds level dereplication of the UBA genomes and 67,479 genomes in in 70.2% of cases (≤200​ scaffolds in 92.0% genomes) and have an RefSeq/GenBank release 76 (Supplementary Table 8). average N50 of 136 kb. Comparison of near-complete UBA genomes UBA genomes were represented in the majority of bacterial that are conspecific strains of complete isolate genomes also sug- (47 of 59, Fig. 2) and archaeal (11 of 18, Fig. 3) phyla, as defined in gest that the recovered MAGs have no systematic loss of genomic the NCBI taxonomy39. In addition, they comprise the first genomic content, with the exception of extrachromosomal elements such as representatives of 17 bacterial and 3 archaeal phyla (see section plasmids (Supplementary Note 1). ‘UBA genomes are the first representatives of several phyla’). To pro- The UBA data set was also assessed relative to the criteria used vide an objective taxonomic analysis of the UBA data set, we used by the Human Microbiome Project (HMP) for defining high-qual- the phylogenetic criterion of mean branch length to extant taxa30 as ity draft genomes3,37. Of the 3,438 UBA genomes we have defined existing taxonomic classifications are not phylogenetically uniform1. as near complete, 3,201 (93.1%) pass all of the HMP criteria, with The results were highly consistent across all trees and, as expected, the only substantial exception being 4.8% of the genomes having named groups at each taxonomic rank vary substantially in their scaffolds with an N50 of <20​ kb (Supplementary Table 3). Nearly mean branch length to extant taxa (Fig. 4 and Supplementary Figs. 3 half of the remaining 4,465 UBA genomes also pass the HMP and 4). Based on the range of mean branch length values for estab- criteria for being high quality except that they are estimated to be lished taxa, the bacterial UBA genomes are exclusive representatives <​90% complete. of 20–30% of all genus- to order–level lineages, 15–30% of class-level The presence of tRNAs for the standard 20 amino acids was lineages and 5–15% of -level lineages (Fig. 4a). Similarly, the examined as a secondary measure of genome quality (Fig. 1c). archaeal UBA genomes are the only representatives within 20–30% The 3,438 near-complete UBA genomes have tRNAs that encode of genus- to order-level lineages, 15–30% of class-level lineages for an average of 17.3 ±​ 2.2 of the 20 amino acids and ≥​15 amino and around 10% of archaeal phyla (Fig. 4b). We also tabulated the acids in 90.3% of the genomes. The correlation between estimated number of UBA-exclusive lineages at the 50th, 90th and 95th per- genome completeness and identified tRNAs was positive but weak centiles of the mean branch length distribution of each taxonomic (Supplementary Fig. 2) as tRNAs are regularly present in multiple rank (Supplementary Table 9). At the conservative 90th percentile, copies and often collocated in a genome, making them poor markers the bacterial UBA genomes are the first genomic representatives for robustly estimating completeness25,38. of 766 genera (34.6%), 226 families (28.6%), 61 orders (21.6%) and

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Nature Microbiology Articles

0.1 Bacteroidetes (1,513) FCB

Aminicenantes (10) Acidobacteria (35) Nitrospinae (5) Modulibacteria Chlorobi NC10 Ignavibacteriae (29) Rokubacteria Calditrichaeota (5) Nitrospirae (23) Marinimicrobia (46), UBP11 (1) Chrysio. (1) Nitrospirae_1 (19) Fibrobacteres (17), Gemmatimonadetes (39) UBP10 (2) Latescibacteria (4) Dadabacteria (2) MBNT15 (2) Zixibacteria, UBP14 (1), TA06, UBP1 (3), UBP2 (6) Deltaproteo. (1-3) (226) Cloacimonetes (31), Hyd24-12 (5), WOR-3 (10) Verrucomicrobia (178) PVC Proteobacteria (2,101) Lentisphaerae (12), Chlamydiae, UBP17 (1) Planctomycetes (61) Hydrogenedentes (2) Patescibacteria (245) Elusimicrobia (23) Aerophobetes, UBP5 (2) 'Candidate phyla radiation' Omnitrophica (14), UBP3 (4), UBP4 (2) Poribacteria UBP7 (2) UBP8 (5) Dependentiae (1) Spirochaetes (135) UBP6 (4) Fusobacteria (8) UBP13 (1) Calescamantes Armatimonadetes (24) Aquificae (1) Deferribacteres (5) UBP15 (1) Epsilonproteobacteria (52) UBP16 (1) Atribacteria (5) Firmicutes (1,666) Synergistetes (47) UBP12 (1) Actinobacteria (336) Dictyoglomi UBP9/SHA-109 (8) Coprothermo. Chloroflexi (182) Caldiserica (5) Deinococcus-thermus (6) Acetothermia (10) Thermotogae (11) Cyanobacteria (48) Terrabacteria

Fig. 2 | Distribution of UBA genomes across 76 bacterial phyla. The maximum likelihood tree was inferred from the concatenation of 120 proteins and spans a dereplicated set of 5,273 UBA and 14,304 NCBI genomes. Phyla containing UBA genomes are shown in green with the number of UBA genomes indicated in parentheses. Candidate phyla consisting only of UBA genomes are shown in red and have been named Uncultured Bacterial Phylum 1 to 17 (UBP1–UBP17). The Tenericutes and Thermodesulfobacteria lineages are not shown as they branch within the Firmicutes and Deltaproteobacteria, respectively. The Patescibacteria/CPR is shown as a single lineage for clarity and to illustrate its similarity to well-characterized phyla. The Nitrospirae and Deltaproteobacteria were found to be polyphyletic and an underscore and numerical identifier is used to distinguish between lineages. The Deltaproteobacteria are polyphyletic due to the placement of the Dadabacteria. All named lineages have bootstrap support of ≥​75%, with the exception of the Chrysiogenetes (55%), Firmicutes (1%) and Verrucomicrobia (52%). The complete tree with support values is available in Newick format (Supplementary File 1).

38 classes (18.0%) within this domain. Similarily, the archaeal UBA Only UBP9, UAP2 and UAP3 could be further taxonomically genomes represent 59 genera (30.3%), 25 families (28.4%), 13 orders resolved as the other candidate phyla either lack genomes with a 16S (23.6%) and 3 classes (15.0%) at the 90th percentile. rRNA gene, were placed sister to named phyla, or had incongruent placements across the protein and 16S rRNA trees (Supplementary UBA genomes are the first representatives of several phyla. Tables 11 and 14). UBP9 genomes are the first genomic representa- The 17 bacterial and 3 archaeal phyla comprised exclusively of tives of the Terrabacteria candidate phylum SHA-109 (Fig. 2) and UBA genomes have been given the candidate names Uncultured were recovered from baboon faeces (five genomes), palm oil effluent Bacterial Phylum 1 to 17 (UBP1 to 17, Fig. 2) and Uncultured (one genome), a toluene degrading community (one genome) and Archaeal Phylum 1 to 3 (UAP1 to 3, Fig. 3). These candidate phyla a dechlorination bioreactor (one genome, Supplementary Table 14). form well-supported clans in all three concatenated protein trees UAP2 contains the first representatives of the Marine Hydrothermal (Supplementary Table 10) and are unaffiliated with existing phyla40 Vent Group (MHVG) and consists of three genomes recovered from (Supplementary Table 11). The 10 UBP/UAP phyla with 16S rRNA the Tara Oceans Expedition along with a single genome from the genes ≥​600 bp have inter-phyla percent identity values between 76% Beebe hydrothermal vent (Supplementary Table 14 and Fig. 3). and 86% (Supplementary Table 12), in agreement with established UAP3 is represented by a single genome recovered from a Costa phyla41 (Supplementary Fig. 5). However, these 16S rRNA results Rican marine sediment metagenome (Supplementary Table 14) and should be treated with some caution as the percent identity of is the first representative of the Ancient Archaeal Group (AAG), a incomplete 16S rRNA sequences correlates poorly with values for group adjacent to the Lokiarchaeota (Fig. 3). full-length sequences42. Because 16S rRNA genes often fail to assem- ble43 and are missing from half of the UBP/UAP lineages, we used UBA genomes substantially increase phylogenetic diversity. The average amino-acid identity (mean of 46.2%) and shared gene con- phylogenetic diversity of the UBA genomes was determined across tent (mean of 24.4%) calibrated against established phyla to further the three concatenated protein trees. The results were highly con- support the classification of the UBP/UAP as phyla (Supplementary sistent across these trees with the UBA genomes covering >​50% Fig. 5 and Supplementary Table 13). of the total branch length (phylogenetic diversity, PD) spanned To further resolve the taxonomic identity of the UBP and UAP by each domain-specific genome tree and increasing total branch genomes, 16S rRNA genes from these genomes were placed into a length (phylogenetic gain, PG) by ~30% (Fig. 5 and Supplementary tree containing genomic and environmental 16S rRNA sequences. Table 8). For comparison, the PD of the CPR (1,056 genomes,

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Articles Nature Microbiology

0.1 TACK Crenarchaeota (11) Thaumarchaeota (26)

Korarchaeota

Verstraetearchaeota (2)

Aigarchaeota (538)

Bathyarchaeota (3)

Thorarchaeota

Lokiarchaeota UAP2/MHVG(4) UAP3/AAG (1)

Diapherotrites (1)

Micrarchaeota (12) Woesearchaeota_2 (2)

Aenigmarchaeota

Pacearchaeota (3) Nanohaloarchaeota Woesearchaeota (6) Nanoarchaeota UAP1 (2) Parvarchaeota (11) DPANN

Fig. 3 | Distribution of UBA genomes across 21 archaeal phyla. The maximum likelihood tree was inferred from the concatenation of 122 proteins and spans a dereplicated set of 453 UBA and 630 NCBI genomes. Phyla containing UBA genomes are shown in green with the number of UBA genomes indicated in parentheses. Candidate phyla consisting only of UBA genomes are shown in red and have been named Uncultured Archaeal Phylum 1 to 3 (UAP1–UAP3). The phylum Woesearchaeota was found to be polyphyletic and an underscore and numerical identifier is used to distinguish between lineages. All named lineages have bootstrap support ≥​90%, with the exception of the Crenarchaeota (73%). The complete tree with support values is available in Newick format (Supplementary File 2). MHVG, Marine Hydrothermal Vent Group; AAG, Ancient Archaeal Group. including 245 UBA genomes) and Firmicutes (25,992 genomes, groups such as the Micrarchaeota, Pacearchaeota, Parvarchaeota including 1,666 UBA genomes) is 11.2% and 25.1%, respectively and Woesearchaeota which all had a PG of >​35%. (Fig. 5). Restricting results to bacterial UBA genomes meeting our near-complete or medium-quality criteria still results in PGs Improved genomic representatives within several lineages. There of ~17% and 30%, respectively. The near-complete and medium- are 12 bacterial phyla where the UBA genomes are estimated to quality archaeal UBA genomes provide a phylogenetic gain of 10% be the highest-quality representatives (Supplementary Table 17). and 29%, respectively (Supplementary Table 8). Among these is the Aminicenantes, where the number of avail- Genomic representation of several bacterial lineages was greatly able genomes increased from 37 to 47 with the addition of the expanded by the UBA genomes (Fig. 5). The bacterial phyla with the UBA genomes, and the highest-quality genome improved from largest increase in PD were the underrepresented Aminicenantes, 86.9% to 91.9% complete, with the five highest-quality genomes all Gemmatimonadetes, Lentisphaerae and Omnitrophica lineages being UBA genomes. There are currently seven Hydrogenedentes (PG of >75%).​ Over 75% of the bacterial UBA genomes belong to genomes (five NCBI and two UBA), with the two highest-quality the Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. representatives being UBA genomes and appreciably improving These genomes expand the PD of these phyla by 14–47%, despite upon the best previously available representative (88.3% complete, >​50% of existing genomic representatives belonging to these four 4.3% contaminated to 98.9% complete, 1.1% contaminated). The phyla (Fig. 5 and Supplementary Table 15). Such high levels of most substantial improvement was in the Latescibacteria, where increased phylogenetic diversity are the norm, with 56 of 77 phyla all previous representatives were derived from single cells and the and 73 of 143 classes being expanded by >​20% (Supplementary UBA genomes improve the best-quality representative from 57.6% Table 16). The UBA genomes have no representatives in only 10 to 95.6% complete. bacterial phyla, which we attribute to the narrow ecological range The UBA genomes are also the highest-quality representa- and/or low relative abundance of microorganisms belonging to tives of five archaeal phyla (Supplementary Table 17). Notably, the these lineages (for example, NC10 and Aerophobetes). Within the Parvarchaeota, previously represented by two single-cell genomes Archaea, the PD of 10 of 21 phyla and 7 of 17 classes increased by (18% and 4% complete), has 11 representatives among the UBA >​20% (Fig. 5 and Supplementary Tables 15 and 16). This includes genomes with an average completeness of 80.6 ±​ 4.2% and con- well-established archaeal groups such as the Euryarchaeota (PG of tamination of 1.7 ±​ 0.6%. The UBA genomes also improve the 34.8%) and Thaumarchaeota (PG of 40.8%) and poorly sampled completeness of the best representative within the Micrarchaeota

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Nature Microbiology Articles

abNumber of bacterial lineages Number of archaeal lineages

5,107 1,927 821 357 151 76 36 14 3 529 287 181 116 77 54 39 22 16 4 100 100 76 Phylum 4 Phylum 80 80 379 51 Class Class 60 60 867 99 Order Order 40 40 1,735 161 Lineages (%) Lineages (%) Family Family 20 20 4,150 328 Genus Genus 0 0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8 1.0 Mean branch length (substitutions per site) Mean branch length (substitutions per site)

UBA NCBI UBA and NCBI

Fig. 4 | Percentage of lineages of equal evolutionary distance represented exclusively by UBA genomes. a, Percentage of bacterial lineages represented exclusively by the UBA genomes, genomes at NCBI, or both data sets as defined on the basis of mean branch length to extant taxa. Mean branch length to extant taxa was calculated for all internal nodes and lineages defined at a specific threshold to establish lineages of equal evolutionary distance. The bottom x axis indicates the threshold used to define lineages (0 =​ extant taxa, far right =​ root of tree), the top x axis indicates the number of lineages at different mean branch length thresholds, and the left y axis indicates the percentage of lineages. The right y axis indicates different taxonomic ranks, with the 5th and 90th percentiles of the distribution of mean branch length values for these ranks shown by black lines. The mean of each distribution is shown as a black circle along with the number of lineages at this value. b, Analogous plot to a, but for archaeal lineages. The phylum distribution is a single point, as the Euryarchaeota is the only archaeal phylum with phylum-level resolution (that is, containing multiple classes). Plots were determined across the domain-specific trees inferred from the concatenation of 120 bacterial and 122 archaeal proteins, respectively.

(Micrarchaeum acidiphilum ARMAN-2) from 84% complete to 91% by >​30% through the addition of 7,280 bacterial and 623 archaeal complete, while adding 11 other genomes all estimated to be >​70% genomes obtained from over 1,500 public metagenomes (Fig. 5). complete with <​3% contamination. These MAGs span the majority of recognized bacterial and archaeal phyla and include the first genomic representatives of 17 bacte- An alternative view of the CPR. The CPR has recently been pro- rial and three archaeal phyla (Figs. 2 and 3). The 7,903 genomes posed as a major collection of candidate phyla in the bacterial reported in this study range in quality from 50% complete to meet- domain30. Under the bac120 tree and mean branch length to extant ing the HMP criteria for high-quality draft genomes3,37 (Fig. 1). taxa criterion, the addition of the UBA genomes slightly reduces the They are more complete than those typically derived using sin- percentage of phylum-level lineages represented by the CPR from a gle-cell genomics4 and are of similar quality to those reported in maximum of 29.4% to 26.3%, despite UBA genomes being the sole other studies considering MAGs10,11,44. We have focused on these representatives within a number of genus- to class-level CPR lin- genomes, which represent only ~12% of the 64,295 recovered bins, eages (Fig. 6a,b). Interestingly, the percentage of phylum-level lin- as they are of sufficient quality to inform analyses such as resolv- eages within the CPR increases substantially when considering the ing phylogenetic relationships4,30 and comparing inter- and intra- rp1 and rp2 trees where the maximum percentages are 38.7% and lineage genomic features45–47. Importantly, these results demonstrate 38.3%, respectively (Fig. 6c,d). Consequently, under the bac120 tree that a large amount of microbial diversity remains to be genomically and mean branch length to extant taxa criterion the CPR contains described across the tree of life, even within existing metagenomic approximately the same percentage of phylum-level lineages as the samples, and that this diversity is readily recovered using current Firmicutes and Actinobacteria combined, whereas under the rp1 tools and methodologies. tree the CPR is far more prominent (Fig. 6e,f). Interestingly, under MAGs often lack 16S rRNA genes due to their conserved and the bac120 tree the CPR shows a pronounced increase in the relative repetitive nature impeding assembly1,10,43. The UBA genomes are no percentage of lineages attributed to being of phylum-level diversity exception, with only 17.3% of bacterial UBA genomes containing and at more specific taxonomic ranks contains approximately the a partial 16S rRNA gene and 10.2% having a fragment of ≥​600 bp. same or fewer lineages than the Firmicutes (Fig. 6f). Recovery was more successful in the archaeal UBA genomes, with A recent genome-based tree of life depicted the CPR as represent- 32.7% containing a 16S rRNA gene fragment ≥600​ bp. We attri- ing ~50% of bacterial lineages when aiming to recapitulate named bute this discrepancy to the higher average 16S rRNA copy num- phyla under the mean branch length to extant taxa criterion30. This ber in Bacteria relative to Archaea48 (bacterial mean =​ 4.12, archaeal analysis was conducted using the same 16 ribosomal proteins com- mean =​ 1.63). Challenges in assembling and binning 16S rRNA prising the rp1 marker set and resulted in 39 of 76 (51%) phylum- genes motivated the use of protein-coding genes for the phyloge- level lineages belonging to the CPR. Here, we provide an alternative netic analyses presented in this and previous MAG studies45,46. view with lineages collapsed under the same constraints on the Recently, the diversity of the CPR was explored in the context bac120 tree. This results in the CPR being represented by only 20 of a genome tree inferred from 16 ribosomal proteins where it was of 76 (26.3%) phylum-level lineages (Supplementary Fig. 7), which divided into 36 named phyla and shown to represent approximately occurs at the mean branch length threshold (namely, 0.85 substitu- 50% of bacterial lineages of equal phylum-level evolutionary dis- tions per site) resulting in the maximum percentage of phylum-level tance30. Our analyses using a 120 concatenated proteins contrast CPR lineages (Fig. 6b). with this view, as the CPR is shown to comprise ~25% of phylum- level lineages under the same criterion (Fig. 6b and Supplementary Discussion Fig. 7). This suggests that ribosomal proteins within CPR organisms Despite considerable progress, many lineages known from 16S may be evolving atypically relative to other proteins, perhaps as a rRNA surveys still lack genomic representation4. Here, we expand result of their unusual ribosome composition and the presence of the phylogenetic diversity of bacterial and archaeal genome trees self-splicing introns and proteins being encoded within their rRNA

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Articles Nature Microbiology

NCBI PD Total Dereplicated Relative UBA PD NCBI UBANCBI UBATotal PD Taxon PD UBA PG UBA PG/PD Bacteria 67,034 7,24414,304 5,273 100.0% 33.0% UBP (all) 046046 1.1% 1.1% Acetothermia 1101 6 0.1% 71.8% Aminicenantes 410410 0.2% 81.0% Armatimonadetes 12 24 613 0.4% 67.2% Bacteroidetes 1,324 1,513 941 1,057 9.5% 47.4% Caldiserica 1514 0.2% 74.3% Chloroflexi 61 18250169 2.7% 73.1% Dadabacteria 1 2120.1% 74.8% Elusimicrobia 323314 0.3% 70.5% Firmicutes 24,326 1,666 3,433 1,296 25.1% 32.2% Gemmatimonadetes 439337 0.2% 78.6% Hyd24-12 35350.1% 61.4% Latescibacteria 14130.1% 64.7% Lentisphaerae 212212 0.3% 75.8% Omnitrophica 114114 0.3% 89.3% Patescibacteria / CPR 811 245 811 245 11.2% 25.8% Proteobacteria 28,860 2,101 4,889 1,272 20.5% 25.0% Verrucomicrobia 45 17844166 1.9% 65.5%

Archaea 825 623630 453 100.0% 30.7% UAP (all) 07071.1% 100.0% Euryarchaeota 534 538 422369 52.7% 34.8% Marine group II 12 20612206 9.2% 85.9% Micrarchaeota 112112 2.4% 73.4% Pacearchaeota 5353 2.7% 37.9% Parvarchaeota 111111 0.2% 49.2% Thaumarchaeota 49 26 44 26 6.4% 40.8% Woesearchaeota 76764.4% 42.9%

Fig. 5 | Phylogenetic diversity and gain for select bacterial and archaeal phyla. The 7,903 UBA genomes cover considerable phylogenetic diversity (PD) and contribute substantial phylogenetic gain (PG) relative to genomes from RefSeq/GenBank release 76. PD and PG were assessed on the domain-specific trees inferred from the concatenation of 120 bacterial and 122 archaeal proteins. The bar charts give (1) the total PD for each lineage relative to the PD of the entire domain, (2) the PD of the NCBI and UBA genomes relative to the PD of the lineage, and (3) the PG contributed by the UBA genomes within the lineage. The 17 UBP and 3 UAP lineages were collapsed together to show their combined phylogenetic diversity and gain. Marine Group II, a lineage within the Euryarchaeota, has been included because it comprises the majority of euryarchaeotal UBA genomes. genes10. These contrasting views of the diversity of the CPR are provide important insights into the role of microorganisms in both equally valid and probably reflect the unique biology of the organ- natural and industrial processes. We anticipate that as metagenomic isms within this group. assembly and binning methods mature we will be presented with While the SRA represents a large set of publicly available metage- the challenge and great opportunity to be able to study microbial nomic data, many additional metagenomes exist in other reposito- communities with complete, or near complete, genomic representa- ries such as the Integrated Microbial Genomes and Metagenomes49 tion in the context of a comprehensive tree of life. (IMG/M) database and Metagenomics Rapid Annotation Server50 (MG-RAST). We expect that processing these metagenomes Note added in proof: During finalization of this manuscript, a new will add tens of thousands of additional genomes to the tree of standard specifying the minimum information about a metagenome- life. Furthermore, methods for assembling and binning metage- assembled genome (MIMAG) was proposed51. The medium-qual- nomic data are continually improving, which makes it likely that ity and partial UBA MAGs meet the medium-quality criteria of systematic reprocessing of metagenomic data will result in the the MIMAG standard. However, most of the near-complete UBA recovery of new genomes and improved versions of previously MAGs do not meet the stringent rRNA and tRNA requirements obtained genomes. for high-quality draft MAGs under this standard, and we therefore The number and diversity of genomes presented in this study, deliberately refer to these MAGs as ‘near complete’. and the many similar studies we anticipate will follow, move us closer to a comprehensive genomic representation of the micro- Methods bial world. Detailed examination of such genomes will further our Recovery of cultivation-independent genomes. Metadata for metagenomes in understanding of microbial evolution and metabolic diversity, and the Sequence Read Archive52 (SRA) at the National Center for Biotechnology

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Nature Microbiology Articles

a bac120 Number of lineages b bac120 Number of lineages

3,270 1,282 583 279 133 68 34 12 2 5,107 1,927 821 357 151 76 36 14 3 100 100 76 68 Phylum Phylum 80 80 291 379 Class Class 60 60 616 867 Order Order 40 40 1,162 1,735 Family Family Lineages (%) 20 Lineages (%) 20 2,646 Genus 4,150 Genus 0 0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Mean branch length (substitutions per site) Mean branch length (substitutions per site)

c rps16 Number of lineages d rps23 Number of lineages

3,454 1,197 476 179 67 27 5 3,209 1,064 411 160 67 22 3 100 100 71 Phylum 69 Phylum 80 80 475 449 Class Class 60 60 924 Order 845 Order 40 40 2,031 Family 1,872

Lineages (%) Family Lineages (%) 20 20 4,187 Genus 4,422 Genus 0 0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 Mean branch length (substitutions per site) Mean branch length (substitutions per site) UBA exclusive Other bacteria CPR UBA exclusive CPR

e rps16 Number of lineages f bac120 Number of lineages

3,454 1,197 476 179 67 27 5 5,107 1,927 821 357 151 76 36 14 3 100 100 71 Phylum 76 Phylum 80 80 475 379 Class Class 60 60 924 867 Order Order 40 40 2,031 1,735 Family Family Lineages (%) Lineages (%) 20 20 4,187 4,150 Genus Genus 0 0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Mean branch length (substitutions per site) Mean branch length (substitutions per site) Other bacteria Proteobacteria Firmicutes Bacteroidetes Actinobacteria CPR

Fig. 6 | Percentage of lineages of equal evolutionary distance within the CPR. a, Percentage of CPR lineages within the bac120 tree without the addition of the UBA genomes. The maximum percentage of phylum-level lineages within the CPR occurs at a mean branch length of 0.85, as denoted by the dashed grey line. The remaining figure elements are the same as in Fig. 4. b, Analogous plot to a, with the addition of the UBA genomes (maximum =​ 0.85). c,d, Percentage of CPR lineages within the rp1 and rp2 trees, respectively, with the maximum percentage of phylum-level lineages within the CPR denoted with a dashed grey line (rp1 =​ 0.69; rp2 =​ 0.76) e,f, Percentage of CPR, Firmicutes, Actinobacteria, Proteobacteria and Bacteroidetes lineages within the rp1 and bac120 trees, respectively. The trees span 14,290 (a), 19,198 (b and f), 18,081 (c and e) and 18,226 (d) genomes.

Information (NCBI) were obtained from the SRAdb53. Only metagenomes lineage-specific markers genes and default parameters. For each SRA Experiment, submitted to the SRA before 31 December 2015 were considered with a only genomes recovered with the MetaBAT preset resulting in the largest number predominant focus on environmental and non-human gastrointestinal samples of bins with an estimated completeness >​70% and contamination <​5% were (for example, rumen, guinea pigs and baboon faeces; Supplementary Table 1) and considered for further refinement and validation. explicitly excluding metagenomes identified as being from studies where MAGs had already been recovered. Each of the 1,550 metagenomes were processed Merging of compatible bins. Automated binning methods can produce multiple independently, with all SRA Runs within an SRA Experiment (that is, sequences bins from the same microbial population. The ‘merge’ method of CheckM v.1.0.6 from a single biological sample) being co-assembled using the CLC de novo was used to identify pairs of bins where the completeness increased by ≥​10% and assembler v.4.4.1 (CLCBio). Assembly was restricted to contigs ≥​500 bp and the the contamination increased by ≤​1% when merged into a single bin. Bins meeting word size, bubble size and paired-end insertion size determined by the assembly these criteria were grouped into a single bin if the mean GC of the bins were within software. Assembly statistics are reported for contigs≥ ​2,000 bp (Supplementary 3%, the mean coverage of the bins had an absolute percentage difference Table 1). Reads were mapped to contigs with BWA54 v.0.7.12-r1039 using the ≤​25%, and the bins had identical taxonomic classifications as determined by their BWA-MEM algorithm with default parameters and the mean coverage of contigs placement in the reference genome tree used by CheckM. This set of criteria was obtained using the ‘coverage’ command of CheckM25 v.1.0.6. Genomes were used to avoid producing chimaeric bins. independently recovered from each SRA Experiment using MetaBAT21 v.0.26.3 under all five preset parameter settings (that is, verysensitive, sensitive, specific, Filtering scaffolds with divergent genomic properties. Scaffolds with genomic veryspecific, superspecific). The completeness and contamination of the genomes features deviating substantially from the mean GC, tetranucleotide signature, or recovered under each MetaBAT preset were estimated using CheckM using coverage of a bin were identified with the ‘outliers’ method of RefineM v.0.0.14

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Articles Nature Microbiology

(https://github.com/dparks1134/RefineM) using default parameters. This removes all proteins, had an E-value of ≤​1e−10, a percent identity of ≥​70% and an alignment scaffolds with a GC or tetranucleotide distance outside the 98th percentile of the expected length spanning ≥​70% of the isolate protein. distributions of these genomic features, as determined empirically over a set of 5,656 trusted reference genomes25,33. Scaffolds were also removed if their mean coverage had an Proteins used to infer genome trees. Bacterial and archaeal genome trees absolute percentage difference ≥​50% when compared to the mean coverage of the bin. were inferred from the concatenation of 120 (Supplementary Table 6) and 122 (Supplementary Table 7) phylogenetically informative proteins, respectively. Filtering scaffolds with incongruent taxonomic classification. Each gene within These proteins were identified as being present in ≥​90% of bacterial or archaeal a bin was assigned a taxonomic classification through homology search using genomes and, when present, single-copy in ≥​95% of genomes. Protein-coding BLASTP55 v.2.2.30+​ against a custom database of 12,321 genomes from RefSeq/ regions were identified using Prodigal v.2.6.3 (with default parameters, but with GenBank56 release 75. This database was constructed from RefSeq and GenBank Ns treated as masked sequences), translation tables determined using a coding genomes consisting of ≤​300 contigs, having an N50 ≥​20 kb and containing ≤​10 kb density heuristic25, and the ubiquity of genes determined across genomes from of ambiguous base pairs. A genome was only included in the database if it NCBI’s RefSeq release 73 annotated with the Pfam59 v.27 and TIGRFAMs60 was estimated to be ≥​90% complete, ≤​10% contaminated and had an overall v.15.0 databases. Only genomes composed of ≤​200 contigs, with an N50 of quality ≥​50 (defined as completeness −​ 5 ×​ contamination). Quality estimates ≥​20 kb and with CheckM completeness and contamination estimates of ≥​95% were determined with CheckM using the lineage-specific workflow and default and ≤​5%, respectively, were considered. Phylogenetically informative proteins parameters. Genomes meeting this set of requirements were dereplicated to were determined by filtering ubiquitous proteins whose gene trees had poor remove genomes from the same named species with an amino-acid identity (AAI) congruence with a set of subsampled concatenated genome trees. Specifically, ≥​99.5%. AAI values were calculated with CompareM v.0.0.13 (https://github.com/ the initial set of 188 bacterial (187 archaeal) proteins were randomly subsampled dparks1134/CompareM) and dereplication performed in a greedy fashion with to 132 genes (~70%) and concatenated to infer a subsampled genome tree. Gene a preference towards type strains and genomes annotated as complete at NCBI. subsampling was independently performed 100 times to establish well-supported Genes were assigned the taxonomic classification of their ‘top’ hit or designated splits, which we define as any split occurring in >​80% of the subsampled trees 2 as unclassified if the gene had no identified homologue with an E-value ≤​1e− , a and with ≥​1% of taxa contained in both bipartitions induced by the split. The percent sequence identity ≥​30% and a percent alignment length ≥​50%. congruence between a gene tree and the subsampled genome tree was measured Scaffolds with incongruent taxonomic classifications were removed from each as the fraction of well-supported split lengths compatible with the gene tree, a bin. The consensus classification of a bin at each taxonomic rank was determined measure we call the ‘normalized compatible split length’. Genes with a normalized by identifying the taxon that occurred at the highest frequency across all classified compatible split length of ≤​50% were removed, as this poor congruence may genes or designated as unclassified if no taxon was represented by ≥50%​ of the indicate the presence of lateral gene transfer events. Proteins were aligned to Pfam classified genes. Scaffolds where ≥​50% of the classified genes at each rank agreed and TIGRfam HMMs using HMMER61 v.3.1b1 with default parameters and trees with the consensus classification of the bin were designated as ‘trusted’, and a were inferred with FastTree62 v.2.1.7 under the WAG+​GAMMA models. taxon was considered to be ‘common’ if it comprised ≥​5% of the classified genes Trees were also inferred from two ribosomal protein sets: (1) 16 ribosomal across the set of trusted scaffolds. A scaffold was considered to be taxonomically proteins (Supplementary Table 4) that form a syntenic block10,30 and (2) 23 incongruent and removed from a bin if the following three conditions were met: ribosomal proteins (Supplementary Table 5) previously used for tree inference and (1) it contained ≥​5 classified genes and ≥​25% of all genes on the scaffold were tested for lateral gene transfer4. classified; (2) ≤​10% of the classified genes were contained in the set of common taxa at each classified rank; and (3) >​50% of classified genes were assigned to Inference of genome trees. Genome trees were inferred across a dereplicated set of the same taxon at each classified rank. Taxonomic classification of genes and UBA and RefSeq/GenBank release 76 (May 2016; includes 727 single cell and 1,811 identification of scaffolds with divergent taxonomic classifications were performed MAGs) genomes. All 5,192 RefSeq genomes annotated at NCBI as ‘reference’ or with the ‘taxon_profile’ and ‘taxon_filter’ methods of RefineM v.0.0.14 ‘representative’ were retained, except for a low-quality subset of 294 genomes that did (https://github.com/dparks1134/RefineM), respectively. not meet our ‘trusted’ genome criteria: composed of ≤​300 contigs, N50 ≥​20 kb, CheckM completeness and contamination estimates of ≥​90% and ≤​10%, respectively. Filtering scaffolds with incongruent 16S rRNA genes. Scaffolds were removed This set of 4,898 genomes was augmented with an additional 3,324 RefSeq genomes from a bin if they contained a complete or partial 16S rRNA gene ≥​600 bp to retain at least two genomes per species where possible. Preference was given to with a taxonomic classification incongruent with the taxonomic identity of the genomes annotated at NCBI as being a type strain and/or ‘complete’ and restricted bin. BLASTN55 was used to assign 16S rRNA genes the taxonomy of its closest to genomes meeting the ‘trusted’ genome filtering criteria. An additional 551 RefSeq homologue within a database comprising the 10,769 16S genes identified within genomes currently without a species designation at NCBI, but passing the genome the 12,321 reference genomes discussed in the previous section. The sequence quality filtering, were also added to this initial set of seed genomes. identity to the closest homologue was used to determine the set of ranks that UBA, GenBank and remaining RefSeq genomes meeting the ‘trusted’ should be examined for congruency. Specifically, previously reported median genome criteria were compared to these 8,773 seed genomes. Genomes with percent identities values were used to establish conservative thresholds for the an AAI of ≥​99.5% to a seed genome, as calculated over the 120 bacterial or taxonomic ranks to consider41: genus ≥​98.7%, family ≥​96.4%, order ≥​92.25%, class 122 archaeal marker genes used for phylogenetic inference, were clustered ≥​89.2%, phylum ≥​86.35% and domain ≥​83.68%. The taxon at each rank was then with the seed genome and do not appear as separate genomes in the genome compared to the taxonomic classification of the genes across all scaffolds in the trees. This cutoff correlated with the proposed 96.5% ANI threshold for bin and designated as incongruent if the taxon was assigned to ≤​10% of classified defining bacterial and archaeal species57 (Supplementary Fig. 6). Trusted genes. This methodology is implemented in RefineM v.0.0.14. genomes with an AAI <​99.5% were added to the seed set. All remaining genomes, regardless of quality, were compared to this final seed set using the Selection of refined genomes. Of the 64,295 bins produced by MetaBAT, same AAI clustering criteria of 99.5%. only the 7,903 genomes with an estimated quality ≥​50 (defined as Seed and unclustered genomes with an estimated genomes quality ≥​50 (defined completeness −​ 5 ×​ contamination), scaffolds resulting in an N50 of ≥​10 kb, as completeness −​ 5 ×​ contamination) were used to create an initial multiple containing <​100 kb ambiguous bases and consisting of <1,000​ contigs and sequence alignment, with the exception of the 797 CPR genomes10, which were <​500 scaffolds were considered to be of sufficient quality for further exploration retained regardless of their estimated quality. Proteins were identified and aligned and deposition in public repositories. We adopted the quality criteria of using HMMER v.3.1b1 and the resulting alignment trimmed to remove columns completeness −​ 5 ×​ contamination as it provides a good signal (completeness) represented by <​50% of taxa or without a common amino acid in ≥​25% of taxa. to noise (contamination) ratio, where higher levels of contamination are only Genomes with amino acids in <​40% of aligned columns (20% for the lenient permissible when the genome is largely complete. These genomes have been archaeal trees) were removed from consideration. The 120 concatenated bacterial deposited as assemblies in NCBI’s TPA:Assembly database along with alignment protein set consisted of 34,796 aligned columns after trimming and was inferred files indicating the mapping of SRA reads to UBA genomes. over 19,198 genomes. The 122 concatenated archaeal protein set contained 28,025 trimmed columns and spanned 1,012 genomes when using standard filtering Comparison of UBA genomes to complete conspecific strains. The 3,438 near- criteria and 27,942 columns spanning 1,070 genomes when using lenient filtering. Trees were inferred with FastTree v.2.1.7 under the WAG ​GAMMA models and complete (≥​90% complete; ≤​5% contamination) UBA genomes were compared to + complete isolate genomes in RefSeq release 76. Of these, 207 of the UBA genomes support values determined using 100 non-parametric bootstrap replicates. were determined to be conspecific strains of complete isolate genomes based on an average nucleotide identity (ANI) and alignment fraction (AF) above 96.5% and Taxonomic annotation of genome trees. Genome trees were annotated using 60%, respectively57. ANI and AF values were determined using ANI Calculator57 taxonomic information from the NCBI Taxonomy Database39. Only the canonical v.1. The genome size of the UBA genomes was adjusted to account for its seven taxonomic ranks (species to domain) were considered for each genome, estimated completeness and contamination: adjusted genome size =​ (genome size)/ and this information was used to annotate internal lineages using tax2tree63. (completeness +​ contamination). Homologues between UBA genomes and their Manual curation was performed to add in phylum information currently missing conspecific counterparts were determined by inferring genes with Prodigal58 v.2.6.3 at NCBI and to resolve polyphyletic groups. Polyphyletic groups that could not be and establishing sequence similarity with BLASTP v.2.2.30+​. A UBA protein was confidently resolved were identified using an underscore and numerical identifier considered homologous to an isolate protein if it was the top hit among all isolate (for example, Deltaproteobacteria_1).

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Nature Microbiology Articles

Inference of 16S rRNA trees. Bacterial and archaeal trees were inferred from 16S 9. Gawad, C., Koh, W. & Quake, S. R. Single-cell genome sequencing: current rRNA genes >​600 bp and >​1,200 bp within UBA and RefSeq/GenBank release state of the science. Nat. Rev. Genet. 17, 175–188 (2016). 76 genomes, respectively. The 16S rRNA genes were identified using HMMER 10. Brown, C. T. et al. Unusual biology across a group comprising more than and domain-specific SSU/LSU HMM models as implemented in the ‘ssu-finder’ 15% of domain Bacteria. Nature 523, 208–211 (2015). method of CheckM. These genes were aligned with ssu-align64 v.0.1 and trailing or 11. Anantharaman, K. et al. Thousands of microbial genomes shed light on leading columns represented by ≤​70% of taxa trimmed, which resulted in bacterial interconnected biogeochemical processes in an aquifer system. Nat. Commun. and archaeal alignments of 1,421 and 1,378 bp, respectively. Trees were inferred 7, 13219 (2016). with FastTree v.2.1.7 under the GTR+​GAMMA models and support values 12. Vanwonterghem, I., Jensen, P. D., Rabaey, K. & Tyson, G. W. Genome-centric determined using 100 non-parametric bootstrap replicates. resolution of microbial diversity, metabolism and interactions in anaerobic digestion. Environ. Microbiol. 18, 3144–3158 (2016). Similarity of 16S rRNA genes. The percent identity between 16S rRNA genes 13. Tyson, G. W. et al. Community structure and metabolism through reconstruction was calculated from the multiple sequence alignments used to infer the domain- of microbial genomes from the environment. Nature 428, 37–43 (2004). specific 16S rRNA gene trees. The ‘dist.seqs’ command of mothur65 v.1.30.2 was 14. Wrighton, K. C. et al. Fermentation, hydrogen, and sulfur metabolism in used to calculate percent identity. Default parameters were used, except that gaps multiple uncultivated bacterial phyla. Science 337, 1661–1665 (2012). at the end of sequences were ignored (countends =​ F) in order to accommodate 15. Yeoh, Y. K., Sekiguchi, Y., Parks, D. H. & Hugenholtz, P. Comparative partial 16S rRNA sequences. Inter-phylum (inter-class) 16S rRNA percent identity genomics of candidate phylum TM6 suggests that parasitism is widespread values were determined by identifying the most similar sequence to each sequence and ancestral in this lineage. Mol. Biol. Evol. 33, 915–927 (2016). within a phylum across all sequences from different phyla (classes). 16. Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K. L., Tyson, G. W. & Nielsen, P. H. Genome sequences of rare, uncultured bacteria obtained by Assessment of phylogenetic and taxonomic diversity. Phylogenetic diversity differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, (total branch length spanned by a set of taxa) and gain (additional branch length 533–538 (2013). contributed by a set of taxa) were calculated using GenomeTreeTk v.0.0.23 17. Sharon, I., Morowitz, M. J., Thomas, B. C., Costello, E. K., Relman, D. A. & (https://github.com/dparks1134/GenomeTreeTk) and verified with ARB66 v.6.0.2. Banfield, J. F. Time series community genomics analysis reveals rapid shifts in Taxonomic diversity and the percentage of lineages of equal evolutionary distance bacterial species, strains, and phage during infant gut colonization. Genome unique to the UBA genomes were determined using the mean branch length to Res. 23, 111–120 (2013). 30 extant taxa criterion . Lineages of equal evolutionary distance were related to the 18. Strous, M., Kraft, B., Bisdorf, R. & Tegetmeyer, H. E. The binning of 39 distribution of NCBI taxa as defined on 19 May 2016 and used to construct the metagenomic contigs for microbial physiology of mixed cultures. Front. phylum-level lineage view (Supplementary Fig. 7) by evaluating the number of Microbiol. 3, 410 (2012). groups formed at mean branch length values of 0.5 to 1.1 with a step size of 0.025. 19. Imelfort, M., Parks, D. H., Woodcroft, B. J., Dennis, P., Hugenholtz, P. & A value of 0.85 was selected as it most closely matched the number of bacterial Tyson, G. W. GroopM: an automated tool for the recovery of population 30 phyla when excluding the CPR. In agreement with previous analyses , we used genomes from related metagenomes. PeerJ 2, e603 (2014). this criterion to explore the taxonomic structure of phylogenetic trees and not to 20. Nielsen, H. B. et al. Identification and assembly of genomes and genetic explicitly establish taxonomic status. elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014). Genomic similarity. The AAI and shared gene content between genomes were 21. Kang, D. D., Froula, J., Egan, E. & Wang, Z. MetaBAT, an efficient tool for determined with CompareM v.0.0.21 using default parameters (homologues accurately reconstructing single genomes from complex microbial defined by an E-value ≤​0.001, a percent identity ≥​30% and an alignment length communities. PeerJ 3, e1165 (2015). ≥​70%). CompareM reports shared gene content relative to the genome with the 22. Sczyrba, A. et al. Critical assessment of metagenome interpretation—a fewest identified genes in order to accommodate incomplete genomes. Inter- benchmark of computational metagenomics software. Preprint at http://www. phylum and inter-class AAI and shared gene content values were determined by biorxiv.org/content/early/2017/06/12/099127 (2017). sampling up to 50 near-complete (completeness ≥​90%, contamination ≤​5%, 23. Kantor, R. S. et al. Small genomes and sparse metabolisms of sediment- N50 >​20 kb, total contigs ≤​200) RefSeq release 76 genomes from each named associated bacteria from four candidate phyla. mBio 4, e00708-13 (2013). lineage, taking care to sample evenly between named species. The AAI score, 24. Luo, C., Knight, R., Siljander, H., Knip, M., Xavier, R. J. & Gevers, D. defined as the sum of the AAI and shared gene content, was used to determine the ConStrains identifies microbial strains in metagenomic datasets. Nat. most similar genome to each query genome. Biotechnol. 33, 1045–1052 (2015). 25. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. Genomic and assembly properties. Genomic and assembly properties (for CheckM: assessing the quality of microbial genomes recovered from isolates, example, GC, N50, coverage) were determined using CheckM. Transfer RNAs were single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015). identified with tRNAscan-SE67 v.1.3.1 using either the bacterial or archaeal tRNA 26. Eren, A. M. et al. Anvi’o: an advanced analysis and visualization platform for model and default parameters. ‘omics data. PeerJ 3, e1319 (2015). 27. Eloe-Fadrosh, E. A. et al. Global metagenomic survey reveals a new bacterial Data availability. The UBA genomes have been deposited under NCBI BioProject candidate phylum in geothermal springs. Nat. Commun. 7, 10476 (2016). PRJNA348753. Individual genomes have been deposited at DDBJ/ENA/GenBank and accession numbers are provided in Supplementary Table 2. The initial versions 28. Sekiguchi, Y., Ohashi, A., Parks, D. H., Yamauchi, T., Tyson, G. W. & of these genomes are described in this paper. Hugenholtz, P. First genomic insights into members of a candidate bacterial phylum responsible for wastewater bulking. PeerJ 3, e740 (2015). 29. Castelle, C. J. et al. Genomic expansion of domain archaea highlights roles for Received: 28 April 2017; Accepted: 25 July 2017; organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, Published: xx xx xxxx 690–701 (2015). 30. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016). References 31. Spang, A. et al. Complex archaea that bridge the gap between prokaryotes 1. Hugenholtz, P., Sharshewski, A. & Parks, D. H. in Microbial Evolution (ed. and eukaryotes. Nature 521, 173–179 (2015). Ochman, H.) 55–65 (Cold Spring Harbor Laboratory Press, New York, 2016). 32. Williams, T. A., Foster, P. G., Cox, C. J. & Embley, T. M. An archaeal origin of 2. Solden, L., Lloyd, K. & Wrighton, K. The bright side of microbial dark eukaryotes supports only two primary domains of life. Nature 504, 231–236 matter: lessons learned from the uncultivated majority. Curr. Opin. Microbiol. (2013). 31, 217–226 (2016). 33. Evans, P. N. et al. Methane metabolism in the archaeal phylum 3. Nelson, K. E. et al. A catalog of reference genomes from the human Bathyarchaeota revealed by genome-centric metagenomics. Science 350, microbiome. Science 328, 994–999 (2010). 434–438 (2015). 4. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial 34. Vanwonterghem, I. et al. Methylotrophic methanogenesis discovered in the dark matter. Nature 499, 431–437 (2013). archaeal phylum Verstraetearchaeota. Nat. Microbiol. 1, 16170 (2016). 5. Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and 35. Whitman, W. B. et al. Genomic encyclopedia of bacterial and archaeal type Archaea. Nature 462, 1056–1060 (2009). strains, phase III: the genomes of soil and plant-associated and newly 6. Kyrpides, N. C. et al. Genomic encyclopedia of type strains, phase I: the one described type strains. Stand. Genomic Sci. 10, 26 (2015). thousand microbial genomes (KMG-I) project. Stand. Genomic Sci. 9, 36. Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth 1278–1284 (2013). and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 7. Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal isolates 121–132 (2014). expand coverage of the tree of life. Nat. Biotechnol. 35, 676–683 (2017). 37. Chain, P. S. et al. Genome project standard in a new era of sequencing. 8. Marcy, Y. et al. Dissecting biological ‘dark matter’ with single-cell genetic Science 326, 236–237 (2009). analysis of rare and uncultivated TM7 microbes from the human mouth. 38. Shepherd, J. & Ibba, M. Bacterial transfer RNAs. FEMS Microbiol. Rev. 39, Proc. Natl Acad. Sci. USA 104, 11889–11894 (2007). 280–300 (2015).

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Articles Nature Microbiology

39. Sayers, E. W. et al. Database resources of the National Center for 62. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum Biotechnology Information. Nucleic Acids Res. 37, D13–D25 (2009). evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 40. Hugenholtz, P., Goebel, B. M. & Pace, N. R. Impact of culture-independent 1641–1650 (2009). studies on the emerging phylogenetic view of bacterial diversity. J. Bacteriol. 63. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks 180, 4765–4774 (1998). for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6, 41. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria 610–618 (2012). and archaeal using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 64. Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA 635–645 (2014). alignments. Bioinformatics 25, 1335–1337 (2009). 42. Schloss, P. D. The effects of alignment quality, distance calculation method, 65. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, sequence filtering, and region on the analysis of 16S rRNA gene-based community-supported software for describing and comparing microbial studies. PLoS Comput. Biol. 6, e1000844 (2010). communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009). 43. Yuan, C., Lei, J., Cole, J. & Sun, Y. Reconstructing 16S rRNA genes in 66. Ludwig, W. et al. ARB: a software environment for sequence data. Nucleic metagenomic data. Bioinformatics 31, 35–43 (2015). Acids Res. 32, 1363–1371 (2004). 44. Haroon, M. F., Thompson, L. R., Parks, D. H., Hugenholtz, P. & Stingl, U. A 67. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection catalogue of 136 microbial draft genomes from Red Sea metagenomes. Sci. of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, Data 3, 160050 (2016). 955–964 (1997). 45. Soo, R. M. et al. An expanded genomic representation of the phylum Cyanobacteria. Genome Biol. Evol. 6, 1031–1045 (2014). Acknowledgements 46. Rahman, N. A., Parks, D. H., Vanwonterghem, I., Morrison, M., Tyson, G. W. The authors thank the many contributors to the Sequence Read Achieve for making & Hugenholtz, P. A phylogenomic analysis of the bacterial phylum their data publicly available. This study would not have been possible without this open Fibrobacteres. Front. Microbiol. 6, 01469 (2015). sharing of data. D.H.P., C.R., M.C. and P.-A.C. are supported by the Australian Centre 47. Lazar, C. S. et al. Genomic evidence for distinct carbon substrate preferences for Ecogenomics, B.J.W. by an Australian Research Council Discovery Early Career and ecological niches of Bathyarchaeota in estuarine sediments. Environ. Research Award (DE160100248) and a Genomic Science Program of the United States Microbiol. 18, 1200–1211 (2016). Department of Energy Office of Biological and Environmental Research grant (DE- 48. Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. K. & Schmidt, T. M. SC0010580), P.N.E. by an Australian Research Council Discovery Early Career Research rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and Award (DE170100428), P.H. by an Australian Research Council Laureate Fellowship archaea and a new foundation for future development. Nucleic Acids Res. 43, (FL150100038) and G.W.T. by a University of Queensland Vice Chancellor’s Research D593–D598 (2015). Focused Fellowship. 49. Markowitz, V. M. et al. IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 42, D568–D573 (2014). Author contributions 50. Wilke, A. et al. The MG-RAST metagenomics database and portal in 2015. D.H.P. designed and carried out the study and wrote the manuscript. C.R. and M.C. Nucleic Acids Res. 44, D590–D594 (2016). assisted in interpreting the phylogenies and taxonomically decorating the trees. P.-A.C. 51. Bowers, R. M. et al. Minimum information about a single amplified genome and B.J.W. helped with data collection and bioinformatic analyses. P.N.E. assisted in (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and assessing the quality of the recovered genomes. P.H. and G.W.T. guided the study and archaea. Nat. Biotechnol. 35, 725–731 (2017). helped write the manuscript. 52. Leinonen, R., Sugawara, H. & Shumway, M. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011). Competing interests 53. Zhu, Y., Stephens, R. M., Meltzer, P. S. & Davis, S. R. SRAdb: query and use The authors declare no competing financial interests. public next-generation sequencing data from within R. BMC Bioinformatics 14, 19 (2013). 54. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows– Additional information Wheeler transform. Bioinformatics 25, 1754–1760 (2009). Supplementary information is available for this paper at doi:10.1038/s41564-017-0012-7. 55. Camacho, C. et al. BLAST+:​ architecture and applications. BMC Reprints and permissions information is available at www.nature.com/reprints. Bioinformatics 10, 421 (2009). Correspondence and requests for materials should be addressed to G.W.T. 56. Tatusova, T., Ciufo, S., Fedorov, B., O’Neill, K. & Tolstoy, I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in Res. 42, D553–D559 (2014). published maps and institutional affiliations. 57. Varghese, N. J. et al. Microbial species delineation using whole genome Open Access This article is licensed under a Creative Commons sequences. Nucleic Acids Res. 43, 6761–6771 (2015). Attribution 4.0 International License, which permits use, sharing, 58. Hyatt, D., Chen, G. L., Locascio, P. F., Land, M. L., Larimer, F. W. & Hauser, adaptation, distribution and reproduction in any medium or L. J. Prodigal: prokaryotic gene recognition and translation initiation site format, as long as you give appropriate credit to the original author(s) and the source, identification. BMC Bioinformatics 11, 119 (2010). provide a link to the Creative Commons license, and indicate if changes were made. 59. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, The images or other third party material in this article are included in the article’s D222–D230 (2014). Creative Commons license, unless indicated otherwise in a credit line to the material. 60. Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein If material is not included in the article’s Creative Commons license and your intended families. Nucleic Acids Res. 31, 371–373 (2003). use is not permitted by statutory regulation or exceeds the permitted use, you will need 61. Eddy, S. R. Accelerated profile HMM searches. PLoS Comp. Biol. 7, e1002195 to obtain permission directly from the copyright holder. To view a copy of this license, (2011). visit http://creativecommons.org/licenses/by/4.0/.

Nature Microbiology | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.