
Articles DOI: 10.1038/s41564-017-0012-7

Corrected: Author correction

Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life

Donovan H. Parks, Christian Rinke, Maria Chuvochina, Pierre-Alain Chaumeil, Ben J. Woodcroft, Paul N. Evans, Philip Hugenholtz* and Gene W. Tyson*

Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-independent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from >1,500 public metagenomes. All genomes are estimated to be ≥50% complete and nearly half are ≥90% complete with ≤5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by >30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.

Sequencing of microbial genomes has accelerated with reductions in sequencing costs, and public repositories now contain nearly 70,000 bacterial and archaeal genomes1,2. The majority of these genomes have been obtained from axenic cultures and disproportionately reflect microorganisms of medical importance3. Consequently, current genome repositories are not representative of the microbial diversity known from 16S rRNA gene surveys4. Concerted efforts are being made to address this limitation by targeting phylogenetically distinct microorganisms for cultivation5–7 and single-cell sequencing4,8. Although these approaches continue to provide valuable reference genomes, the former is restricted to microorganisms amenable to cultivation and the latter is hampered by technical challenges and the need for specialised equipment9. Obtaining genomes from metagenomes is an emerging approach with the potential for large-scale recovery of near-complete genomes10–12. Until recently, recovering genomes from metagenomic data was restricted to samples with low microbial diversity13, but improved sequencing throughput and advances in computational techniques now allow metagenome-assembled genomes (MAGs) to be recovered from high diversity environments14,15. MAGs are obtained by grouping or 'binning' together assembled contigs with similar sequence composition, depth of coverage across one or more related samples and taxonomic affiliations16,17. Several tools have been developed that exploit these sources of information to produce genomes from metagenomic data18–21 and there are ongoing efforts to evaluate the effectiveness of different approaches22. Although closed genomes have been obtained using metagenomic binning methods10,23, MAGs are typically incomplete and may contain contigs from multiple strains or species due to challenges in distinguishing between related community members both in the assembly and binning processes19,24. This has spurred the development of methods for assessing the quality of recovered MAGs in order to allow biological inferences to be made with regards to their estimated completeness and contamination25,26.

Significant insights have recently been made based on the MAGs of uncultivated microorganisms. These include elucidation of several phyla previously lacking genomic representatives27–29, including the Patescibacteria superphylum4, which has subsequently been referred to as the 'Candidate Phyla Radiation' (CPR) as it may consist of upwards of 35 candidate phyla10,30. Notable evolutionary and metabolic insights include the discovery of eukaryotic-like cytoskeleton genes in the archaeon Lokiarchaeota31,32 and the identification of putative methane-metabolizing genes in the Bathyarchaeota and Verstraetearchaeota phyla33,34. These initial studies demonstrate the need for additional genomic representatives across the tree of life in order to more fully appreciate microbial evolution and metabolism.

Here, we present the first large-scale initiative to recover MAGs from publicly available metagenomes. Nearly 8,000 draft-quality genomes were recovered from over 1,500 metagenomes, more than a threefold increase over large initiatives to genomically populate the tree of life such as the Genomic Encyclopedia of Bacteria and Archaea35 (~2,000 genomes), the Human Microbiome Project3 (~2,000) and the largest previous MAG study11 (~2,500). We refer to our set of MAGs as the Uncultivated Bacteria and Archaea (UBA) data set. Genome-based phylogenetic analysis indicates that the UBA genomes provide the first representatives of several major bacterial and archaeal lineages and substantially expand genomic representation across the tree of life.

Results

Genomes are readily recovered from metagenomic data. MAGs were recovered from 1,550 metagenomes submitted to the Sequence Read Archive (SRA) before 31 December 2015 (Supplementary Fig. 1). We predominantly considered environmental and non-human gastrointestinal samples in order to focus on metagenomes likely to contain microbial populations from under-sampled lineages (Supplementary Table 1). The completeness and contamination of each MAG were estimated from the presence and absence of lineage-specific genes expected to be ubiquitous and single copy25, and these estimates, along with assembly statistics, were used to identify genomes suitable for further study. A total of 64,295 MAGs were obtained, of which 7,903 (7,280 bacterial and 623 archaeal)

Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, Queensland 4072, Australia. *e-mail: [email protected]; [email protected]

Nature Microbiology | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

[Figure 1. a, Contamination (%) plotted against completeness (%), with marginal histograms; near complete (3,438 genomes; 43.5%), medium quality (3,459 genomes; 43.8%), partial (1,006 genomes; 12.7%). b, Genomes (%) by number of scaffolds; 83% of genomes have ≤200 scaffolds. c, Genomes (%) by standard tRNA count; 75% of genomes have ≥15 standard tRNAs.]
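The filtering arithmetic described for Fig. 1 can be illustrated with a minimal Python sketch. It computes the quality score (completeness minus five times contamination), applies the retention thresholds stated in the text (quality ≥50, ≤500 scaffolds, N50 ≥10 kb) and assigns the three quality labels. The names `GenomeStats`, `n50`, `quality`, `passes_filter` and `category` are hypothetical and are not taken from the study's software; completeness and contamination estimates are assumed to be supplied by an external tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GenomeStats:
    completeness: float          # estimated completeness (%)
    contamination: float         # estimated contamination (%)
    scaffold_lengths: List[int]  # scaffold lengths (bp)


def n50(lengths: List[int]) -> int:
    """Length of the scaffold at which the running total, taken over scaffolds
    sorted from longest to shortest, first reaches half of the assembly size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def quality(g: GenomeStats) -> float:
    # Quality as defined in the text: completeness - 5 x contamination.
    return g.completeness - 5.0 * g.contamination


def passes_filter(g: GenomeStats) -> bool:
    # Retention criteria from the text: quality >= 50, <= 500 scaffolds, N50 >= 10 kb.
    return (
        quality(g) >= 50
        and len(g.scaffold_lengths) <= 500
        and n50(g.scaffold_lengths) >= 10_000
    )


def category(g: GenomeStats) -> str:
    # Labels follow the thresholds in the Fig. 1 caption, checked in order.
    if g.completeness >= 90 and g.contamination <= 5:
        return "near complete"
    if g.completeness >= 70 and g.contamination <= 10:
        return "medium quality"
    if g.completeness >= 50 and g.contamination <= 4:
        return "partial"
    return "not retained"


example = GenomeStats(completeness=95.0, contamination=2.0,
                      scaffold_lengths=[50_000] * 40)
print(quality(example), passes_filter(example), category(example))
```

The labels are checked from most to least stringent, so a genome that satisfies several sets of thresholds receives the highest applicable label.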

Fig. 1 | Assessment of genome quality. a, Estimated completeness and contamination of 7,903 genomes recovered from public metagenomes. Genome quality was defined as completeness − 5 × contamination, and only genomes with a quality of ≥50 were retained. Near-complete genomes (completeness ≥90%; contamination ≤5%) are shown in red, medium-quality genomes (completeness ≥70%; contamination ≤10%) in blue, and partial genomes (completeness ≥50%; contamination ≤4%) in grey. Histograms along the x and y axes show the percentage of genomes at varying levels of completeness and contamination, respectively. Notably, only 171 of the 7,903 (2.2%) UBA genomes have >5% contamination. b, Number of scaffolds comprising each of the 7,903 genomes with colours indicating genome quality. c, Number of tRNAs for each of the 20 standard amino acids identified within each of the 7,903 genomes with colours indicating genome quality.

form the UBA data set as they met our filtering criteria of having an estimated quality ≥50 (defined as the estimated completeness of a genome minus five times its estimated contamination) and consisting of ≤500 scaffolds with an N50 ≥10 kb (Fig. 1 and Supplementary Table 2). Over 93% of the 7,903 UBA genomes have an average coverage of ≥10× (5th percentile, 9.2×, 95th percentile, 268×) and 95.8% have >5× coverage over 90% of bases, providing assurance of high-quality base-calling across the genomes3,36. Among the UBA genomes is a subset of 3,438 near-complete genomes (3,225 bacterial and 213 archaeal) estimated to be ≥90% complete with ≤5% contamination (Fig. 1a). These genomes consist of ≤100 scaffolds in 70.2% of cases (≤200 scaffolds in 92.0% of genomes) and have an average N50 of 136 kb. Comparison of near-complete UBA genomes that are conspecific strains of complete isolate genomes also suggests that the recovered MAGs have no systematic loss of genomic content, with the exception of extrachromosomal elements such as plasmids (Supplementary Note 1).

The UBA data set was also assessed relative to the criteria used by the Human Microbiome Project (HMP) for defining high-quality draft genomes3,37. Of the 3,438 UBA genomes we have defined as near complete, 3,201 (93.1%) pass all of the HMP criteria, with the only substantial exception being 4.8% of the genomes having scaffolds with an N50 of <20 kb (Supplementary Table 3). Nearly half of the remaining 4,465 UBA genomes also pass the HMP criteria for being high quality except that they are estimated to be <90% complete.

The presence of tRNAs for the standard 20 amino acids was examined as a secondary measure of genome quality (Fig. 1c). The 3,438 near-complete UBA genomes have tRNAs that encode for an average of 17.3 ± 2.2 of the 20 amino acids and ≥15 amino acids in 90.3% of the genomes. The correlation between estimated genome completeness and identified tRNAs was positive but weak (Supplementary Fig. 2) as tRNAs are regularly present in multiple copies and often collocated in a genome, making them poor markers for robustly estimating completeness25,38.

Taxonomic distribution of UBA genomes. The phylogenetic relationships of the UBA genomes were determined across bacterial and archaeal trees inferred from three concatenated protein sets: (1) a syntenic block of 16 ribosomal proteins (rp1) recently used to infer genome-based phylogenies10,30 (Supplementary Table 4), (2) 23 ribosomal proteins (rp2) previously tested for lateral gene transfer4 (Supplementary Table 5), and (3) 120 bacterial (bac120) and 122 archaeal (ar122) proteins we have identified as being suitable for phylogenetic inference (Supplementary Tables 6 and 7). The trees span ~19,000 bacterial and ~1,000 archaeal genomes after species-level dereplication of the UBA genomes and 67,479 genomes in RefSeq/GenBank release 76 (Supplementary Table 8).

UBA genomes were represented in the majority of bacterial (47 of 59, Fig. 2) and archaeal (11 of 18, Fig. 3) phyla, as defined in the NCBI taxonomy39. In addition, they comprise the first genomic representatives of 17 bacterial and 3 archaeal phyla (see section 'UBA genomes are the first representatives of several phyla'). To provide an objective taxonomic analysis of the UBA data set, we used the phylogenetic criterion of mean branch length to extant taxa30 as existing taxonomic classifications are not phylogenetically uniform1. The results were highly consistent across all trees and, as expected, named groups at each taxonomic rank vary substantially in their mean branch length to extant taxa (Fig. 4 and Supplementary Figs. 3 and 4). Based on the range of mean branch length values for established taxa, the bacterial UBA genomes are exclusive representatives of 20–30% of all genus- to order-level lineages, 15–30% of class-level lineages and 5–15% of phylum-level lineages (Fig. 4a). Similarly, the archaeal UBA genomes are the only representatives within 20–30% of genus- to order-level lineages, 15–30% of class-level lineages and around 10% of archaeal phyla (Fig. 4b). We also tabulated the number of UBA-exclusive lineages at the 50th, 90th and 95th percentiles of the mean branch length distribution of each taxonomic rank (Supplementary Table 9). At the conservative 90th percentile, the bacterial UBA genomes are the first genomic representatives of 766 genera (34.6%), 226 families (28.6%), 61 orders (21.6%) and


[Figure 2. Maximum likelihood tree of bacterial phyla with the number of UBA genomes per phylum in parentheses; scale bar, 0.1. Labelled groups include FCB, PVC and Patescibacteria ('Candidate phyla radiation'; 245 UBA genomes).]

Fig. 2 | Distribution of UBA genomes across 76 bacterial phyla. The maximum likelihood tree was inferred from the concatenation of 120 proteins and spans a dereplicated set of 5,273 UBA and 14,304 NCBI genomes. Phyla containing UBA genomes are shown in green with the number of UBA genomes indicated in parentheses. Candidate phyla consisting only of UBA genomes are shown in red and have been named Uncultured Bacterial Phylum 1 to 17 (UBP1–UBP17). The and lineages are not shown as they branch within the Firmicutes and , respectively. The Patescibacteria/CPR is shown as a single lineage for clarity and to illustrate its similarity to well-characterized phyla. The Nitrospirae and Deltaproteobacteria were found to be polyphyletic and an underscore and numerical identifier is used to distinguish between lineages. The Deltaproteobacteria are polyphyletic due to the placement of the Dadabacteria. All named lineages have bootstrap support of ≥75%, with the exception of the Chrysiogenetes (55%), Firmicutes (1%) and Verrucomicrobia (52%). The complete tree with support values is available in Newick format (Supplementary File 1).

38 classes (18.0%) within this domain. Similarly, the archaeal UBA genomes represent 59 genera (30.3%), 25 families (28.4%), 13 orders (23.6%) and 3 classes (15.0%) at the 90th percentile.

UBA genomes are the first representatives of several phyla. The 17 bacterial and 3 archaeal phyla comprised exclusively of UBA genomes have been given the candidate names Uncultured Bacterial Phylum 1 to 17 (UBP1 to 17, Fig. 2) and Uncultured Archaeal Phylum 1 to 3 (UAP1 to 3, Fig. 3). These candidate phyla form well-supported clans in all three concatenated protein trees (Supplementary Table 10) and are unaffiliated with existing phyla40 (Supplementary Table 11). The 10 UBP/UAP phyla with 16S rRNA genes ≥600 bp have inter-phyla percent identity values between 76% and 86% (Supplementary Table 12), in agreement with established phyla41 (Supplementary Fig. 5). However, these 16S rRNA results should be treated with some caution as the percent identity of incomplete 16S rRNA sequences correlates poorly with values for full-length sequences42. Because 16S rRNA genes often fail to assemble43 and are missing from half of the UBP/UAP lineages, we used average amino-acid identity (mean of 46.2%) and shared gene content (mean of 24.4%) calibrated against established phyla to further support the classification of the UBP/UAP as phyla (Supplementary Fig. 5 and Supplementary Table 13).

To further resolve the taxonomic identity of the UBP and UAP genomes, 16S rRNA genes from these genomes were placed into a tree containing genomic and environmental 16S rRNA sequences. Only UBP9, UAP2 and UAP3 could be further taxonomically resolved as the other candidate phyla either lack genomes with a 16S rRNA gene, were placed sister to named phyla, or had incongruent placements across the protein and 16S rRNA trees (Supplementary Tables 11 and 14). UBP9 genomes are the first genomic representatives of the Terrabacteria candidate phylum SHA-109 (Fig. 2) and were recovered from baboon faeces (five genomes), palm oil effluent (one genome), a toluene degrading community (one genome) and a dechlorination bioreactor (one genome, Supplementary Table 14). UAP2 contains the first representatives of the Marine Hydrothermal Vent Group (MHVG) and consists of three genomes recovered from the Tara Oceans Expedition along with a single genome from the Beebe (Supplementary Table 14 and Fig. 3). UAP3 is represented by a single genome recovered from a Costa Rican metagenome (Supplementary Table 14) and is the first representative of the Ancient Archaeal Group (AAG), a group adjacent to the (Fig. 3).

UBA genomes substantially increase phylogenetic diversity. The phylogenetic diversity of the UBA genomes was determined across the three concatenated protein trees. The results were highly consistent across these trees with the UBA genomes covering >50% of the total branch length (phylogenetic diversity, PD) spanned by each domain-specific genome tree and increasing total branch length (phylogenetic gain, PG) by ~30% (Fig. 5 and Supplementary Table 8). For comparison, the PD of the CPR (1,056 genomes,


[Figure 3. Maximum likelihood tree of archaeal phyla with the number of UBA genomes per phylum in parentheses; scale bar, 0.1. Labelled groups include TACK and DPANN.]

Fig. 3 | Distribution of UBA genomes across 21 archaeal phyla. The maximum likelihood tree was inferred from the concatenation of 122 proteins and spans a dereplicated set of 453 UBA and 630 NCBI genomes. Phyla containing UBA genomes are shown in green with the number of UBA genomes indicated in parentheses. Candidate phyla consisting only of UBA genomes are shown in red and have been named Uncultured Archaeal Phylum 1 to 3 (UAP1–UAP3). The phylum Woesearchaeota was found to be polyphyletic and an underscore and numerical identifier is used to distinguish between lineages. All named lineages have bootstrap support ≥90%, with the exception of the Crenarchaeota (73%). The complete tree with support values is available in Newick format (Supplementary File 2). MHVG, Marine Hydrothermal Vent Group; AAG, Ancient Archaeal Group.

including 245 UBA genomes) and Firmicutes (25,992 genomes, including 1,666 UBA genomes) is 11.2% and 25.1%, respectively (Fig. 5). Restricting results to bacterial UBA genomes meeting our near-complete or medium-quality criteria still results in PGs of ~17% and 30%, respectively. The near-complete and medium-quality archaeal UBA genomes provide a phylogenetic gain of 10% and 29%, respectively (Supplementary Table 8).

Genomic representation of several bacterial lineages was greatly expanded by the UBA genomes (Fig. 5). The bacterial phyla with the largest increase in PD were the underrepresented Aminicenantes, Gemmatimonadetes, Lentisphaerae and Omnitrophica lineages (PG of >75%). Over 75% of the bacterial UBA genomes belong to the Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. These genomes expand the PD of these phyla by 14–47%, despite >50% of existing genomic representatives belonging to these four phyla (Fig. 5 and Supplementary Table 15). Such high levels of increased phylogenetic diversity are the norm, with 56 of 77 phyla and 73 of 143 classes being expanded by >20% (Supplementary Table 16). The UBA genomes have no representatives in only 10 bacterial phyla, which we attribute to the narrow ecological range and/or low relative abundance of microorganisms belonging to these lineages (for example, NC10 and Aerophobetes). Within the Archaea, the PD of 10 of 21 phyla and 7 of 17 classes increased by >20% (Fig. 5 and Supplementary Tables 15 and 16). This includes well-established archaeal groups such as the Euryarchaeota (PG of 34.8%) and Thaumarchaeota (PG of 40.8%) and poorly sampled groups such as the Micrarchaeota, Pacearchaeota and Woesearchaeota which all had a PG of >35%.

Improved genomic representatives within several lineages. There are 12 bacterial phyla where the UBA genomes are estimated to be the highest-quality representatives (Supplementary Table 17). Among these is the Aminicenantes, where the number of available genomes increased from 37 to 47 with the addition of the UBA genomes, and the highest-quality genome improved from 86.9% to 91.9% complete, with the five highest-quality genomes all being UBA genomes. There are currently seven Hydrogenedentes genomes (five NCBI and two UBA), with the two highest-quality representatives being UBA genomes and appreciably improving upon the best previously available representative (88.3% complete, 4.3% contaminated to 98.9% complete, 1.1% contaminated). The most substantial improvement was in the Latescibacteria, where all previous representatives were derived from single cells and the UBA genomes improve the best-quality representative from 57.6% to 95.6% complete.

The UBA genomes are also the highest-quality representatives of five archaeal phyla (Supplementary Table 17). Notably, the Parvarchaeota was previously represented by only four MAGs with the highest quality being 76.7% complete with 5.6% contamination, and there are 11 Parvarchaeota UBA MAGs with an average completeness and contamination of 80.6 ± 4.2% and 1.7 ± 0.6%, respectively. The UBA genomes also improve the completeness of the best


[Figure 4. a, Number of bacterial lineages; b, number of archaeal lineages. Bottom x axis: mean branch length (substitutions per site); top x axis: number of lineages; left y axis: lineages (%); right y axis: taxonomic rank (phylum, class, order, family, genus). Legend: UBA; NCBI; UBA and NCBI.]

Fig. 4 | Percentage of lineages of equal evolutionary distance represented exclusively by UBA genomes. a, Percentage of bacterial lineages represented exclusively by the UBA genomes, genomes at NCBI, or both data sets as defined on the basis of mean branch length to extant taxa. Mean branch length to extant taxa was calculated for all internal nodes and lineages defined at a specific threshold to establish lineages of equal evolutionary distance. The bottom x axis indicates the threshold used to define lineages (0 = extant taxa, far right = root of tree), the top x axis indicates the number of lineages at different mean branch length thresholds, and the left y axis indicates the percentage of lineages. The right y axis indicates different taxonomic ranks, with the 5th and 90th percentiles of the distribution of mean branch length values for these ranks shown by black lines. The mean of each distribution is shown as a black circle along with the number of lineages at this value. b, Analogous plot to a, but for archaeal lineages. The phylum distribution is a single point, as the Euryarchaeota is the only archaeal phylum with phylum-level resolution (that is, containing multiple classes). Plots were determined across the domain-specific trees inferred from the concatenation of 120 bacterial and 122 archaeal proteins, respectively.
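The lineage definition used in Fig. 4 can be sketched in Python under stated assumptions. Here the mean branch length to extant taxa of a node is taken to be the average path length from that node to its descendant leaves, and lineages at a threshold are obtained by descending from the root until a node's mean falls at or below the threshold. The cutting rule, the `Node` class and all function names are illustrative assumptions and may differ from the procedure actually used in the study.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    length: float = 0.0  # branch length to parent (substitutions per site)
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node: Node) -> List[float]:
    """Path lengths from `node` to each of its descendant leaves."""
    if not node.children:
        return [0.0]
    dists = []
    for child in node.children:
        dists.extend(child.length + d for d in leaf_distances(child))
    return dists


def mean_branch_length(node: Node) -> float:
    dists = leaf_distances(node)
    return sum(dists) / len(dists)


def lineages_at(node: Node, threshold: float) -> List[Node]:
    """Descend from `node`; a subtree is reported as one lineage as soon as
    its mean branch length to extant taxa is <= threshold (assumed rule)."""
    if mean_branch_length(node) <= threshold:
        return [node]
    found = []
    for child in node.children:
        found.extend(lineages_at(child, threshold))
    return found


def leaf_names(node: Node) -> List[str]:
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaf_names(child)]


def lineage_source(node: Node, uba: Set[str]) -> str:
    """Label a lineage by the origin of its leaves, as in the Fig. 4 legend."""
    flags = [name in uba for name in leaf_names(node)]
    if all(flags):
        return "UBA"
    if not any(flags):
        return "NCBI"
    return "UBA and NCBI"


# Toy tree: a1 and a2 stand in for UBA genomes, B for an NCBI genome.
tree = Node("root", children=[
    Node("A", 0.5, [Node("a1", 0.1), Node("a2", 0.3)]),
    Node("B", 0.4),
])
uba_genomes = {"a1", "a2"}
for lineage in lineages_at(tree, 0.25):
    print(lineage.name, lineage_source(lineage, uba_genomes))
```

A threshold of 0 returns every extant taxon as its own lineage, and a threshold at or above the root's mean returns the whole tree as one lineage, which matches the axis description in the caption.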

representative within the Micrarchaeota (Micrarchaeum acidiphilum ARMAN-2) from 84% complete to 91% complete, while adding 11 other genomes all estimated to be >70% complete with <3% contamination.

An alternative view of the CPR. The CPR has recently been proposed as a major collection of candidate phyla in the bacterial domain30. Under the bac120 tree and mean branch length to extant taxa criterion, the addition of the UBA genomes slightly reduces the percentage of phylum-level lineages represented by the CPR from a maximum of 29.4% to 26.3%, despite UBA genomes being the sole representatives within a number of genus- to class-level CPR lineages (Fig. 6a,b). Interestingly, the percentage of phylum-level lineages within the CPR increases substantially when considering the rp1 and rp2 trees where the maximum percentages are 38.7% and 38.3%, respectively (Fig. 6c,d). Consequently, under the bac120 tree and mean branch length to extant taxa criterion the CPR contains approximately the same percentage of phylum-level lineages as the Firmicutes and Actinobacteria combined, whereas under the rp1 tree the CPR is far more prominent (Fig. 6e,f). Interestingly, under the bac120 tree the CPR shows a pronounced increase in the relative percentage of lineages attributed to being of phylum-level diversity and at more specific taxonomic ranks contains approximately the same or fewer lineages than the Firmicutes (Fig. 6f).

A recent genome-based tree of life depicted the CPR as representing ~50% of bacterial lineages when aiming to recapitulate named phyla under the mean branch length to extant taxa criterion30. This analysis was conducted using the same 16 ribosomal proteins comprising the rp1 marker set and resulted in 39 of 76 (51%) phylum-level lineages belonging to the CPR. Here, we provide an alternative view with lineages collapsed under the same constraints on the bac120 tree. This results in the CPR being represented by only 20 of 76 (26.3%) phylum-level lineages (Supplementary Fig. 7), which occurs at the mean branch length threshold (namely, 0.85 substitutions per site) resulting in the maximum percentage of phylum-level CPR lineages (Fig. 6b).

Discussion

Despite considerable progress, many lineages known from 16S rRNA surveys still lack genomic representation4. Here, we expand the phylogenetic diversity of bacterial and archaeal genome trees by >30% through the addition of 7,280 bacterial and 623 archaeal genomes obtained from over 1,500 public metagenomes (Fig. 5). These MAGs span the majority of recognized bacterial and archaeal phyla and include the first genomic representatives of 17 bacterial and three archaeal phyla (Figs. 2 and 3). The 7,903 genomes reported in this study range in quality from 50% complete to meeting the HMP criteria for high-quality draft genomes3,37 (Fig. 1). They are more complete than those typically derived using single-cell genomics4 and are of similar quality to those reported in other studies considering MAGs10,11,44. We have focused on these genomes, which represent only ~12% of the 64,295 recovered bins, as they are of sufficient quality to inform analyses such as resolving phylogenetic relationships4,30 and comparing inter- and intra-lineage genomic features45–47. Importantly, these results demonstrate that a large amount of microbial diversity remains to be genomically described across the tree of life, even within existing metagenomic samples, and that this diversity is readily recovered using current tools and methodologies.

MAGs often lack 16S rRNA genes due to their conserved and repetitive nature impeding assembly1,10,43. The UBA genomes are no exception, with only 17.3% of bacterial UBA genomes containing a partial 16S rRNA gene and 10.2% having a fragment of ≥600 bp. Recovery was more successful in the archaeal UBA genomes, with 32.7% containing a 16S rRNA gene fragment ≥600 bp. We attribute this discrepancy to the higher average 16S rRNA copy number in Bacteria relative to Archaea48 (bacterial mean = 4.12, archaeal mean = 1.63). Challenges in assembling and binning 16S rRNA genes motivated the use of protein-coding genes for the phylogenetic analyses presented in this and previous MAG studies45,46.

Recently, the diversity of the CPR was explored in the context of a genome tree inferred from 16 ribosomal proteins where it was divided into 36 named phyla and shown to represent approximately 50% of bacterial lineages of equal phylum-level evolutionary distance30. Our analyses using 120 concatenated proteins contrast with this view, as the CPR is shown to comprise ~25% of phylum-level lineages under the same criterion (Fig. 6b and Supplementary Fig. 7). This suggests that ribosomal proteins within CPR may be evolving atypically relative to other proteins, perhaps as a result of their unusual ribosome composition and the presence of


[Figure 5 data. Total and Dereplicated give genome counts; Total PD is the PD of the lineage relative to the PD of the entire domain; UBA PG is the phylogenetic gain contributed by the UBA genomes within the lineage. Bar charts of NCBI PD and UBA PD relative to the lineage are not reproduced.]

Taxon                   Total NCBI  Total UBA  Dereplicated NCBI  Dereplicated UBA  Total PD  UBA PG
Bacteria                67,034      7,244      14,304             5,273             100.0%    33.0%
UBP (all)               0           46         0                  46                1.1%      1.1%
Acetothermia            1           10         1                  6                 0.1%      71.8%
Aminicenantes           4           10         4                  10                0.2%      81.0%
Armatimonadetes         12          24         6                  13                0.4%      67.2%
Bacteroidetes           1,324       1,513      941                1,057             9.5%      47.4%
Caldiserica             1           5          1                  4                 0.2%      74.3%
Chloroflexi             61          182        50                 169               2.7%      73.1%
Dadabacteria            1           2          1                  2                 0.1%      74.8%
Elusimicrobia           3           23         3                  14                0.3%      70.5%
Firmicutes              24,326      1,666      3,433              1,296             25.1%     32.2%
Gemmatimonadetes        4           39         3                  37                0.2%      78.6%
Hyd24-12                3           5          3                  5                 0.1%      61.4%
Latescibacteria         1           4          1                  3                 0.1%      64.7%
Lentisphaerae           2           12         2                  12                0.3%      75.8%
Omnitrophica            1           14         1                  14                0.3%      89.3%
Patescibacteria / CPR   811         245        811                245               11.2%     25.8%
Proteobacteria          28,860      2,101      4,889              1,272             20.5%     25.0%
Verrucomicrobia         45          178        44                 166               1.9%      65.5%

Archaea                 825         623        630                453               100.0%    30.7%
UAP (all)               0           7          0                  7                 1.1%      100.0%
Euryarchaeota           534         538        422                369               52.7%     34.8%
Marine Group II         12          206        12                 206               9.2%      85.9%
Micrarchaeota           1           12         1                  12                2.4%      73.4%
Pacearchaeota           5           3          5                  3                 2.7%      37.9%
Thaumarchaeota          49          26         44                 26                6.4%      40.8%
Woesearchaeota          7           6          7                  6                 4.4%      42.9%

Fig. 5 | Phylogenetic diversity and gain for select bacterial and archaeal phyla. The 7,903 UBA genomes cover considerable phylogenetic diversity (PD) and contribute substantial phylogenetic gain (PG) relative to genomes from RefSeq/GenBank release 76. PD and PG were assessed on the domain-specific trees inferred from the concatenation of 120 bacterial and 122 archaeal proteins. The bar charts give (1) the total PD for each lineage relative to the PD of the entire domain, (2) the PD of the NCBI and UBA genomes relative to the PD of the lineage, and (3) the PG contributed by the UBA genomes within the lineage. The 17 UBP and 3 UAP lineages were collapsed together to show their combined phylogenetic diversity and gain. Marine Group II, a lineage within the Euryarchaeota, has been included because it comprises the majority of euryarchaeotal UBA genomes.
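The PD and PG quantities reported in Fig. 5 can be illustrated with a small Python sketch. It assumes that the PD of a set of genomes is the total length of all branches on the paths from the root to those genomes, and that the PG of a set is the branch length that would be lost from the tree if the set were removed, both expressed relative to the total branch length of the tree. These definitions, the `Node` class and the function names are assumptions made for illustration; the exact definitions used in the study are given in its Methods, which are not part of this section.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    name: str
    length: float = 0.0  # branch length to parent (substitutions per site)
    children: List["Node"] = field(default_factory=list)


def spanned(node: Node, taxa: Set[str]) -> Tuple[float, bool]:
    """Return (branch length below `node` spanned by `taxa`,
    whether any member of `taxa` occurs below `node`)."""
    if not node.children:
        return 0.0, node.name in taxa
    length, found = 0.0, False
    for child in node.children:
        child_length, child_found = spanned(child, taxa)
        if child_found:
            # The branch leading to this child is needed to reach the taxa.
            length += child.length + child_length
            found = True
    return length, found


def phylogenetic_diversity(root: Node, taxa: Set[str]) -> float:
    return spanned(root, taxa)[0]


def leaf_names(node: Node) -> List[str]:
    if not node.children:
        return [node.name]
    return [n for child in node.children for n in leaf_names(child)]


def relative_pd_and_pg(root: Node, new_taxa: Set[str]) -> Tuple[float, float]:
    """PD and PG of `new_taxa` as fractions of the total branch length."""
    everything = set(leaf_names(root))
    total = phylogenetic_diversity(root, everything)
    reference = everything - new_taxa
    pd_new = phylogenetic_diversity(root, new_taxa) / total
    pg_new = (total - phylogenetic_diversity(root, reference)) / total
    return pd_new, pg_new


# Toy tree: a1 and B stand in for UBA genomes, a2 for an NCBI genome.
tree = Node("root", children=[
    Node("A", 0.5, [Node("a1", 0.1), Node("a2", 0.3)]),
    Node("B", 0.4),
])
pd_uba, pg_uba = relative_pd_and_pg(tree, {"a1", "B"})
print(f"UBA PD = {pd_uba:.1%}, UBA PG = {pg_uba:.1%}")
```

Under these assumptions PG never exceeds PD, because branches shared with reference genomes count towards the PD of the new genomes but not towards their PG.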

self-splicing introns and proteins being encoded within their rRNA genes10. These contrasting views of the diversity of the CPR are equally valid and probably reflect the unique biology of the organisms within this group.

While the SRA represents a large set of publicly available metagenomic data, many additional metagenomes exist in other repositories such as the Integrated Microbial Genomes and Metagenomes49 (IMG/M) database and Metagenomics Rapid Annotation Server50 (MG-RAST). We expect that processing these metagenomes will add tens of thousands of additional genomes to the tree of life. Furthermore, methods for assembling and binning metagenomic data are continually improving, which makes it likely that systematic reprocessing of metagenomic data will result in the recovery of new genomes and improved versions of previously obtained genomes.

The number and diversity of genomes presented in this study, and the many similar studies we anticipate will follow, move us closer to a comprehensive genomic representation of the microbial world. Detailed examination of such genomes will further our understanding of microbial evolution and metabolic diversity, and provide important insights into the role of microorganisms in both natural and industrial processes. We anticipate that as metagenomic assembly and binning methods mature we will be presented with the challenge and great opportunity to be able to study microbial communities with complete, or near complete, genomic representation in the context of a comprehensive tree of life.

Note added in proof: During finalization of this manuscript, a new standard specifying the minimum information about a metagenome-assembled genome (MIMAG) was proposed51. The medium-quality and partial UBA MAGs meet the medium-quality criteria of the MIMAG standard. However, most of the near-complete UBA MAGs do not meet the stringent rRNA and tRNA requirements for high-quality draft MAGs under this standard, and we therefore deliberately refer to these MAGs as 'near complete'.

Methods

Recovery of cultivation-independent genomes. Metadata for metagenomes in the Sequence Read Archive52 (SRA) at the National Center for Biotechnology Information (NCBI) were obtained from the SRAdb53. Only metagenomes submitted to the SRA before 31 December 2015 were considered with a predominant focus on environmental and non-human gastrointestinal samples

Nature Microbiology | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved.

[Fig. 6 plots, panels a–f. Each panel shows the percentage of lineages (%) against mean branch length (substitutions per site), with the number of lineages along the top axis and rank labels (phylum, class, order, family, genus). Panel trees: a, bac120; b, bac120; c, rps16; d, rps23; e, rps16; f, bac120. Legend categories: UBA exclusive, Other bacteria, CPR, Proteobacteria, Firmicutes, Bacteroidetes, Actinobacteria.]

Fig. 6 | Percentage of lineages of equal evolutionary distance within the CPR. a, Percentage of CPR lineages within the bac120 tree without the addition of the UBA genomes. The maximum percentage of phylum-level lineages within the CPR occurs at a mean branch length of 0.85, as denoted by the dashed grey line. The remaining figure elements are the same as in Fig. 4. b, Analogous plot to a, with the addition of the UBA genomes (maximum = 0.85). c,d, Percentage of CPR lineages within the rp1 and rp2 trees, respectively, with the maximum percentage of phylum-level lineages within the CPR denoted with a dashed grey line (rp1 = 0.69; rp2 = 0.76). e,f, Percentage of CPR, Firmicutes, Actinobacteria, Proteobacteria and Bacteroidetes lineages within the rp1 and bac120 trees, respectively. The trees span 14,290 (a), 19,198 (b and f), 18,081 (c and e) and 18,226 (d) genomes.
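To make the idea of lineages of equal evolutionary distance concrete, the sketch below cuts a toy tree wherever a node's mean branch length to its descendant tips falls at or below a chosen threshold. It then reports the percentage of the resulting lineages that consist only of members of one group. This is one plausible reading of the criterion, offered as an assumption rather than the authors' implementation. The class `Node`, the functions `leaf_distances`, `lineages_at` and `percent_in_group`, and the toy tree are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Node:
    name: str
    length: float = 0.0                      # branch length to the parent
    children: List["Node"] = field(default_factory=list)


def leaf_distances(node: Node) -> List[Tuple[str, float]]:
    """Return (tip name, path length from `node`) for every descendant tip."""
    if not node.children:
        return [(node.name, 0.0)]
    out: List[Tuple[str, float]] = []
    for child in node.children:
        for name, dist in leaf_distances(child):
            out.append((name, dist + child.length))
    return out


def lineages_at(node: Node, threshold: float) -> List[List[str]]:
    """Split the tree into the shallowest clades whose mean distance to their
    tips is <= threshold (assumed reading of the mean-branch-length criterion)."""
    dists = leaf_distances(node)
    mean = sum(d for _, d in dists) / len(dists)
    if mean <= threshold:
        return [[name for name, _ in dists]]
    lineages: List[List[str]] = []
    for child in node.children:
        lineages.extend(lineages_at(child, threshold))
    return lineages


def percent_in_group(lineages: List[List[str]], group: Set[str]) -> float:
    """Percentage of lineages made up exclusively of tips from `group`."""
    inside = sum(1 for tips in lineages if all(t in group for t in tips))
    return 100.0 * inside / len(lineages)


# Hypothetical toy tree with two clades of different depth.
TOY = Node("root", children=[
    Node("X", 0.5, [Node("a", 0.2), Node("b", 0.4)]),
    Node("Y", 0.9, [Node("c", 0.1), Node("d", 0.1)]),
])

if __name__ == "__main__":
    for t in (0.2, 0.85, 1.0):
        groups = lineages_at(TOY, t)
        print(t, len(groups), percent_in_group(groups, {"c", "d"}))
```

Sweeping the threshold over a range of values, as in the loop above, corresponds to moving along the x axis of each panel under these assumptions.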

(for example, rumen, guinea pigs and baboon faeces; Supplementary Table 1). Metagenomes from studies where MAGs had previously been recovered were excluded if the UBA MAGs did not provide appreciable improvements in genome quality or phylogenetic diversity. Each of the 1,550 metagenomes was processed independently, with all SRA Runs within an SRA Experiment (that is, sequences from a single biological sample) being co-assembled using the CLC de novo assembler v.4.4.1 (CLCBio). Assembly was restricted to contigs ≥500 bp, with the word size, bubble size and paired-end insertion size determined by the assembly software. Assembly statistics are reported for contigs ≥2,000 bp (Supplementary Table 1). Reads were mapped to contigs with BWA54 v.0.7.12-r1039 using the BWA-MEM algorithm with default parameters and the mean coverage of contigs obtained using the ‘coverage’ command of CheckM25 v.1.0.6. Genomes were independently recovered from each SRA Experiment using MetaBAT21 v.0.26.3 under all five preset parameter settings (that is, verysensitive, sensitive, specific, veryspecific, superspecific). The completeness and contamination of the genomes recovered under each MetaBAT preset were estimated with CheckM using lineage-specific marker genes and default parameters. For each SRA Experiment, only genomes recovered with the MetaBAT preset resulting in the largest number of bins with an estimated completeness >70% and contamination <5% were considered for further refinement and validation.

Merging of compatible bins. Automated binning methods can produce multiple bins from the same microbial population. The ‘merge’ method of CheckM v.1.0.6 was used to identify pairs of bins where the completeness increased by ≥10% and the contamination increased by ≤1% when merged into a single bin. Bins meeting these criteria were grouped into a single bin if their mean GC values were within 3%, the mean coverage of the bins had an absolute percentage difference ≤25%, and the bins had identical taxonomic classifications as determined by their placement in the reference genome tree used by CheckM. This set of criteria was used to avoid producing chimaeric bins.

Filtering scaffolds with divergent genomic properties. Scaffolds with genomic features deviating substantially from the mean GC, tetranucleotide signature, or coverage of a bin were identified with the ‘outliers’ method of RefineM v.0.0.14 (https://github.com/dparks1134/RefineM) using default parameters. This removes all scaffolds with a GC or tetranucleotide distance outside the 98th percentile of the expected

distributions of these genomic features, as determined empirically over a set of 5,656 trusted reference genomes25,33. Scaffolds were also removed if their mean coverage had an absolute percentage difference ≥50% when compared to the mean coverage of the bin.

Filtering scaffolds with incongruent taxonomic classification. Each gene within a bin was assigned a taxonomic classification through homology search using BLASTP55 v.2.2.30+ against a custom database of 12,321 genomes from RefSeq/GenBank56 release 75. This database was constructed from RefSeq and GenBank genomes consisting of ≤300 contigs, having an N50 ≥20 kb and containing ≤10 kb of ambiguous base pairs. A genome was only included in the database if it was estimated to be ≥90% complete, ≤10% contaminated and had an overall quality ≥50 (defined as completeness − 5 × contamination). Quality estimates were determined with CheckM using the lineage-specific workflow and default parameters. Genomes meeting this set of requirements were dereplicated to remove genomes from the same named species with an amino-acid identity (AAI) ≥99.5%. AAI values were calculated with CompareM v.0.0.13 (https://github.com/dparks1134/CompareM) and dereplication performed in a greedy fashion with a preference towards type strains and genomes annotated as complete at NCBI. Genes were assigned the taxonomic classification of their ‘top’ hit or designated as unclassified if the gene had no identified homologue with an E-value ≤1e−2, a percent sequence identity ≥30% and a percent alignment length ≥50%.

Scaffolds with incongruent taxonomic classifications were removed from each bin. The consensus classification of a bin at each taxonomic rank was determined by identifying the taxon that occurred at the highest frequency across all classified genes or designated as unclassified if no taxon was represented by ≥50% of the classified genes. Scaffolds where ≥50% of the classified genes at each rank agreed with the consensus classification of the bin were designated as ‘trusted’, and a taxon was considered to be ‘common’ if it comprised ≥5% of the classified genes across the set of trusted scaffolds. A scaffold was considered to be taxonomically incongruent and removed from a bin if the following three conditions were met: (1) it contained ≥5 classified genes and ≥25% of all genes on the scaffold were classified; (2) ≤10% of the classified genes were contained in the set of common taxa at each classified rank; and (3) >50% of classified genes were assigned to the same taxon at each classified rank. Taxonomic classification of genes and identification of scaffolds with divergent taxonomic classifications were performed with the ‘taxon_profile’ and ‘taxon_filter’ methods of RefineM v.0.0.14 (https://github.com/dparks1134/RefineM), respectively.

Filtering scaffolds with incongruent 16S rRNA genes. Scaffolds were removed from a bin if they contained a complete or partial 16S rRNA gene ≥600 bp with a taxonomic classification incongruent with the taxonomic identity of the bin. BLASTN55 was used to assign each 16S rRNA gene the taxonomy of its closest homologue within a database comprising the 10,769 16S genes identified within the 12,321 reference genomes discussed in the previous section. The sequence identity to the closest homologue was used to determine the set of ranks that should be examined for congruency. Specifically, previously reported median percent identity values were used to establish conservative thresholds for the taxonomic ranks to consider41: genus ≥98.7%, family ≥96.4%, order ≥92.25%, class ≥89.2%, phylum ≥86.35% and domain ≥83.68%. The taxon at each rank was then compared to the taxonomic classification of the genes across all scaffolds in the bin and designated as incongruent if the taxon was assigned to ≤10% of classified genes. This methodology is implemented in RefineM v.0.0.14.

Selection of refined genomes. Of the 64,295 bins produced by MetaBAT, only the 7,903 genomes with an estimated quality ≥50 (defined as completeness − 5 × contamination), scaffolds resulting in an N50 of ≥10 kb, containing <100 kb ambiguous bases and consisting of <1,000 contigs and <500 scaffolds were considered to be of sufficient quality for further exploration and deposition in public repositories. We adopted the quality criteria of completeness − 5 × contamination as it provides a good signal (completeness) to noise (contamination) ratio, where higher levels of contamination are only permissible when the genome is largely complete. These genomes have been deposited as assemblies in NCBI’s TPA:Assembly database along with alignment files indicating the mapping of SRA reads to UBA genomes.

Comparison of UBA genomes to complete conspecific strains. The 3,438 near-complete (≥90% complete; ≤5% contamination) UBA genomes were compared to complete isolate genomes in RefSeq release 76. Of these, 207 of the UBA genomes were determined to be conspecific strains of complete isolate genomes based on an average nucleotide identity (ANI) and alignment fraction (AF) above 96.5% and 60%, respectively57. ANI and AF values were determined using ANI Calculator57 v.1. The genome size of each UBA genome was adjusted to account for its estimated completeness and contamination: adjusted genome size = (genome size)/(completeness + contamination). Homologues between UBA genomes and their conspecific counterparts were determined by inferring genes with Prodigal58 v.2.6.3 and establishing sequence similarity with BLASTP v.2.2.30+. A UBA protein was considered homologous to an isolate protein if it was the top hit among all isolate proteins, had an E-value of ≤1e−10, a percent identity of ≥70% and an alignment length spanning ≥70% of the isolate protein.

Proteins used to infer genome trees. Bacterial and archaeal genome trees were inferred from the concatenation of 120 (Supplementary Table 6) and 122 (Supplementary Table 7) phylogenetically informative proteins, respectively. These proteins were identified as being present in ≥90% of bacterial or archaeal genomes and, when present, single-copy in ≥95% of genomes. Protein-coding regions were identified using Prodigal v.2.6.3 (with default parameters, but with Ns treated as masked sequences), translation tables determined using a coding density heuristic25, and the ubiquity of genes determined across genomes from NCBI’s RefSeq release 73 annotated with the Pfam59 v.27 and TIGRFAMs60 v.15.0 databases. Only genomes composed of ≤200 contigs, with an N50 of ≥20 kb and with CheckM completeness and contamination estimates of ≥95% and ≤5%, respectively, were considered. Phylogenetically informative proteins were determined by filtering ubiquitous proteins whose gene trees had poor congruence with a set of subsampled concatenated genome trees. Specifically, the initial set of 188 bacterial (187 archaeal) proteins was randomly subsampled to 132 genes (~70%) and concatenated to infer a subsampled genome tree. Gene subsampling was independently performed 100 times to establish well-supported splits, which we define as any split occurring in >80% of the subsampled trees and with ≥1% of taxa contained in both bipartitions induced by the split. The congruence between a gene tree and the subsampled genome tree was measured as the fraction of well-supported split lengths compatible with the gene tree, a measure we call the ‘normalized compatible split length’. Genes with a normalized compatible split length of ≤50% were removed, as this poor congruence may indicate the presence of lateral gene transfer events. Proteins were aligned to Pfam and TIGRfam HMMs using HMMER61 v.3.1b1 with default parameters and trees were inferred with FastTree62 v.2.1.7 under the WAG+GAMMA models.

Trees were also inferred from two ribosomal protein sets: (1) 16 ribosomal proteins (Supplementary Table 4) that form a syntenic block10,30 and (2) 23 ribosomal proteins (Supplementary Table 5) previously used for tree inference and tested for lateral gene transfer4.

Inference of genome trees. Genome trees were inferred across a dereplicated set of UBA and RefSeq/GenBank release 76 (May 2016; includes 727 single cell and 1,811 MAGs) genomes. All 5,192 RefSeq genomes annotated at NCBI as ‘reference’ or ‘representative’ were retained, except for a low-quality subset of 294 genomes that did not meet our ‘trusted’ genome criteria: composed of ≤300 contigs, N50 ≥20 kb, CheckM completeness and contamination estimates of ≥90% and ≤10%, respectively. This set of 4,898 genomes was augmented with an additional 3,324 RefSeq genomes to retain at least two genomes per species where possible. Preference was given to genomes annotated at NCBI as being a type strain and/or ‘complete’ and restricted to genomes meeting the ‘trusted’ genome filtering criteria. An additional 551 RefSeq genomes currently without a species designation at NCBI, but passing the genome quality filtering, were also added to this initial set of seed genomes.

UBA, GenBank and remaining RefSeq genomes meeting the ‘trusted’ genome criteria were compared to these 8,773 seed genomes. Genomes with an AAI of ≥99.5% to a seed genome, as calculated over the 120 bacterial or 122 archaeal marker genes used for phylogenetic inference, were clustered with the seed genome and do not appear as separate genomes in the genome trees. This cutoff correlated with the proposed 96.5% ANI threshold for defining bacterial and archaeal species57 (Supplementary Fig. 6). Trusted genomes with an AAI <99.5% were added to the seed set. All remaining genomes, regardless of quality, were compared to this final seed set using the same AAI clustering criteria of 99.5%.

Seed and unclustered genomes with an estimated genome quality ≥50 (defined as completeness − 5 × contamination) were used to create an initial multiple sequence alignment, with the exception of the 797 CPR genomes10, which were retained regardless of their estimated quality. Proteins were identified and aligned using HMMER v.3.1b1 and the resulting alignment trimmed to remove columns represented by <50% of taxa or without a common amino acid in ≥25% of taxa. Genomes with amino acids in <40% of aligned columns (20% for the lenient archaeal trees) were removed from consideration. The 120 concatenated bacterial protein set consisted of 34,796 aligned columns after trimming and was inferred over 19,198 genomes. The 122 concatenated archaeal protein set contained 28,025 trimmed columns and spanned 1,012 genomes when using standard filtering criteria and 27,942 columns spanning 1,070 genomes when using lenient filtering. Trees were inferred with FastTree v.2.1.7 under the WAG+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.

Taxonomic annotation of genome trees. Genome trees were annotated using taxonomic information from the NCBI Taxonomy Database39. Only the canonical seven taxonomic ranks (species to domain) were considered for each genome, and this information was used to annotate internal lineages using tax2tree63. Manual curation was performed to add in phylum information currently missing at NCBI and to resolve polyphyletic groups. Polyphyletic groups that could not be confidently resolved were identified using an underscore and numerical identifier (for example, Deltaproteobacteria_1).

Inference of 16S rRNA trees. Bacterial and archaeal trees were inferred from 16S rRNA genes >600 bp and >1,200 bp within UBA and RefSeq/GenBank release 76 genomes, respectively. The 16S rRNA genes were identified using HMMER


and domain-specific SSU/LSU HMM models as implemented in the ‘ssu-finder’ method of CheckM. These genes were aligned with ssu-align64 v.0.1 and trailing or leading columns represented by ≤70% of taxa were trimmed, which resulted in bacterial and archaeal alignments of 1,421 and 1,378 bp, respectively. Trees were inferred with FastTree v.2.1.7 under the GTR+GAMMA models and support values determined using 100 non-parametric bootstrap replicates.

Similarity of 16S rRNA genes. The percent identity between 16S rRNA genes was calculated from the multiple sequence alignments used to infer the domain-specific 16S rRNA gene trees. The ‘dist.seqs’ command of mothur65 v.1.30.2 was used to calculate percent identity. Default parameters were used, except that gaps at the end of sequences were ignored (countends = F) in order to accommodate partial 16S rRNA sequences. Inter-phylum (inter-class) 16S rRNA percent identity values were determined by identifying the most similar sequence to each sequence within a phylum across all sequences from different phyla (classes).

Assessment of phylogenetic and taxonomic diversity. Phylogenetic diversity (total branch length spanned by a set of taxa) and gain (additional branch length contributed by a set of taxa) were calculated using GenomeTreeTk v.0.0.23 (https://github.com/dparks1134/GenomeTreeTk) and verified with ARB66 v.6.0.2. Taxonomic diversity and the percentage of lineages of equal evolutionary distance unique to the UBA genomes were determined using the mean branch length to extant taxa criterion30. Lineages of equal evolutionary distance were related to the distribution of NCBI taxa39 as defined on 19 May 2016 and used to construct the phylum-level lineage view (Supplementary Fig. 7) by evaluating the number of groups formed at mean branch length values of 0.5 to 1.1 with a step size of 0.025. A value of 0.85 was selected as it most closely matched the number of bacterial phyla when excluding the CPR. In agreement with previous analyses30, we used this criterion to explore the taxonomic structure of phylogenetic trees and not to explicitly establish taxonomic status.

Genomic similarity. The AAI and shared gene content between genomes were determined with CompareM v.0.0.21 using default parameters (homologues defined by an E-value ≤0.001, a percent identity ≥30% and an alignment length ≥70%). CompareM reports shared gene content relative to the genome with the fewest identified genes in order to accommodate incomplete genomes. Inter-phylum and inter-class AAI and shared gene content values were determined by sampling up to 50 near-complete (completeness ≥90%, contamination ≤5%, N50 >20 kb, total contigs ≤200) RefSeq release 76 genomes from each named lineage, taking care to sample evenly between named species. The AAI score, defined as the sum of the AAI and shared gene content, was used to determine the most similar genome to each query genome.

Genomic and assembly properties. Genomic and assembly properties (for example, GC, N50, coverage) were determined using CheckM. Transfer RNAs were identified with tRNAscan-SE67 v.1.3.1 using either the bacterial or archaeal tRNA model and default parameters.

Data availability. The UBA genomes have been deposited under NCBI BioProject PRJNA348753. Individual genomes have been deposited at DDBJ/ENA/GenBank and accession numbers are provided in Supplementary Table 2. The initial versions of these genomes are described in this paper.

Received: 28 April 2017; Accepted: 25 July 2017; Published online: 11 September 2017

References

1. Hugenholtz, P., Skarshewski, A. & Parks, D. H. in Microbial Evolution (ed. Ochman, H.) 55–65 (Cold Spring Harbor Laboratory Press, New York, 2016).
2. Solden, L., Lloyd, K. & Wrighton, K. The bright side of microbial dark matter: lessons learned from the uncultivated majority. Curr. Opin. Microbiol. 31, 217–226 (2016).
3. Nelson, K. E. et al. A catalog of reference genomes from the human microbiome. Science 328, 994–999 (2010).
4. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
5. Wu, D. et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056–1060 (2009).
6. Kyrpides, N. C. et al. Genomic encyclopedia of type strains, phase I: the one thousand microbial genomes (KMG-I) project. Stand. Genomic Sci. 9, 1278–1284 (2013).
7. Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat. Biotechnol. 35, 676–683 (2017).
8. Marcy, Y. et al. Dissecting biological ‘dark matter’ with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc. Natl Acad. Sci. USA 104, 11889–11894 (2007).
9. Gawad, C., Koh, W. & Quake, S. R. Single-cell genome sequencing: current state of the science. Nat. Rev. Genet. 17, 175–188 (2016).
10. Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
11. Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).
12. Vanwonterghem, I., Jensen, P. D., Rabaey, K. & Tyson, G. W. Genome-centric resolution of microbial diversity, metabolism and interactions in anaerobic digestion. Environ. Microbiol. 18, 3144–3158 (2016).
13. Tyson, G. W. et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428, 37–43 (2004).
14. Wrighton, K. C. et al. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665 (2012).
15. Yeoh, Y. K., Sekiguchi, Y., Parks, D. H. & Hugenholtz, P. Comparative genomics of candidate phylum TM6 suggests that parasitism is widespread and ancestral in this lineage. Mol. Biol. Evol. 33, 915–927 (2016).
16. Albertsen, M., Hugenholtz, P., Skarshewski, A., Nielsen, K. L., Tyson, G. W. & Nielsen, P. H. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538 (2013).
17. Sharon, I., Morowitz, M. J., Thomas, B. C., Costello, E. K., Relman, D. A. & Banfield, J. F. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).
18. Strous, M., Kraft, B., Bisdorf, R. & Tegetmeyer, H. E. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front. Microbiol. 3, 410 (2012).
19. Imelfort, M., Parks, D. H., Woodcroft, B. J., Dennis, P., Hugenholtz, P. & Tyson, G. W. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2, e603 (2014).
20. Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
21. Kang, D. D., Froula, J., Egan, E. & Wang, Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ 3, e1165 (2015).
22. Sczyrba, A. et al. Critical assessment of metagenome interpretation—a benchmark of computational metagenomics software. Preprint at http://www.biorxiv.org/content/early/2017/06/12/099127 (2017).
23. Kantor, R. S. et al. Small genomes and sparse metabolisms of sediment-associated bacteria from four candidate phyla. mBio 4, e00708-13 (2013).
24. Luo, C., Knight, R., Siljander, H., Knip, M., Xavier, R. J. & Gevers, D. ConStrains identifies microbial strains in metagenomic datasets. Nat. Biotechnol. 33, 1045–1052 (2015).
25. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
26. Eren, A. M. et al. Anvi’o: an advanced analysis and visualization platform for ‘omics data. PeerJ 3, e1319 (2015).
27. Eloe-Fadrosh, E. A. et al. Global metagenomic survey reveals a new bacterial candidate phylum in geothermal springs. Nat. Commun. 7, 10476 (2016).
28. Sekiguchi, Y., Ohashi, A., Parks, D. H., Yamauchi, T., Tyson, G. W. & Hugenholtz, P. First genomic insights into members of a candidate bacterial phylum responsible for wastewater bulking. PeerJ 3, e740 (2015).
29. Castelle, C. J. et al. Genomic expansion of domain archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, 690–701 (2015).
30. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
31. Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
32. Williams, T. A., Foster, P. G., Cox, C. J. & Embley, T. M. An archaeal origin of eukaryotes supports only two primary domains of life. Nature 504, 231–236 (2013).
33. Evans, P. N. et al. Methane metabolism in the archaeal phylum Bathyarchaeota revealed by genome-centric metagenomics. Science 350, 434–438 (2015).
34. Vanwonterghem, I. et al. Methylotrophic methanogenesis discovered in the archaeal phylum Verstraetearchaeota. Nat. Microbiol. 1, 16170 (2016).
35. Whitman, W. B. et al. Genomic encyclopedia of bacterial and archaeal type strains, phase III: the genomes of soil and plant-associated and newly described type strains. Stand. Genomic Sci. 10, 26 (2015).
36. Sims, D., Sudbery, I., Ilott, N. E., Heger, A. & Ponting, C. P. Sequencing depth and coverage: key considerations in genomic analyses. Nat. Rev. Genet. 15, 121–132 (2014).
37. Chain, P. S. et al. Genome project standard in a new era of sequencing. Science 326, 236–237 (2009).
38. Shepherd, J. & Ibba, M. Bacterial transfer RNAs. FEMS Microbiol. Rev. 39, 280–300 (2015).


39. Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 37, D13–D25 (2009).
40. Hugenholtz, P., Goebel, B. M. & Pace, N. R. Impact of culture-independent studies on the emerging phylogenetic view of bacterial diversity. J. Bacteriol. 180, 4765–4774 (1998).
41. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 635–645 (2014).
42. Schloss, P. D. The effects of alignment quality, distance calculation method, sequence filtering, and region on the analysis of 16S rRNA gene-based studies. PLoS Comput. Biol. 6, e1000844 (2010).
43. Yuan, C., Lei, J., Cole, J. & Sun, Y. Reconstructing 16S rRNA genes in metagenomic data. Bioinformatics 31, 35–43 (2015).
44. Haroon, M. F., Thompson, L. R., Parks, D. H., Hugenholtz, P. & Stingl, U. A catalogue of 136 microbial draft genomes from Red Sea metagenomes. Sci. Data 3, 160050 (2016).
45. Soo, R. M. et al. An expanded genomic representation of the phylum Cyanobacteria. Genome Biol. Evol. 6, 1031–1045 (2014).
46. Rahman, N. A., Parks, D. H., Vanwonterghem, I., Morrison, M., Tyson, G. W. & Hugenholtz, P. A phylogenomic analysis of the bacterial phylum Fibrobacteres. Front. Microbiol. 6, 01469 (2015).
47. Lazar, C. S. et al. Genomic evidence for distinct carbon substrate preferences and ecological niches of Bathyarchaeota in estuarine sediments. Environ. Microbiol. 18, 1200–1211 (2016).
48. Stoddard, S. F., Smith, B. J., Hein, R., Roller, B. R. K. & Schmidt, T. M. rrnDB: improved tools for interpreting rRNA gene abundance in bacteria and archaea and a new foundation for future development. Nucleic Acids Res. 43, D593–D598 (2015).
49. Markowitz, V. M. et al. IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 42, D568–D573 (2014).
50. Wilke, A. et al. The MG-RAST metagenomics database and portal in 2015. Nucleic Acids Res. 44, D590–D594 (2016).
51. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).
52. Leinonen, R., Sugawara, H. & Shumway, M. The sequence read archive. Nucleic Acids Res. 39, D19–D21 (2011).
53. Zhu, Y., Stephens, R. M., Meltzer, P. S. & Davis, S. R. SRAdb: query and use public next-generation sequencing data from within R. BMC Bioinformatics 14, 19 (2013).
54. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
55. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
56. Tatusova, T., Ciufo, S., Fedorov, B., O’Neill, K. & Tolstoy, I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 42, D553–D559 (2014).
57. Varghese, N. J. et al. Microbial species delineation using whole genome sequences. Nucleic Acids Res. 43, 6761–6771 (2015).
58. Hyatt, D., Chen, G. L., Locascio, P. F., Land, M. L., Larimer, F. W. & Hauser, L. J. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
59. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
60. Haft, D. H., Selengut, J. D. & White, O. The TIGRFAMs database of protein families. Nucleic Acids Res. 31, 371–373 (2003).
61. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
62. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
63. McDonald, D. et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 6, 610–618 (2012).
64. Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
65. Schloss, P. D. et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl. Environ. Microbiol. 75, 7537–7541 (2009).
66. Ludwig, W. et al. ARB: a software environment for sequence data. Nucleic Acids Res. 32, 1363–1371 (2004).
67. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).

Acknowledgements

The authors thank the many contributors to the Sequence Read Archive for making their data publicly available. This study would not have been possible without this open sharing of data. D.H.P., C.R., M.C. and P.-A.C. are supported by the Australian Centre for Ecogenomics, B.J.W. by an Australian Research Council Discovery Early Career Research Award (DE160100248) and a Genomic Science Program of the United States Department of Energy Office of Biological and Environmental Research grant (DE-SC0010580), P.N.E. by an Australian Research Council Discovery Early Career Research Award (DE170100428), P.H. by an Australian Research Council Laureate Fellowship (FL150100038) and G.W.T. by a University of Queensland Vice Chancellor’s Research Focused Fellowship.

Author contributions

D.H.P. designed and carried out the study and wrote the manuscript. C.R. and M.C. assisted in interpreting the phylogenies and taxonomically decorating the trees. P.-A.C. and B.J.W. helped with data collection and bioinformatic analyses. P.N.E. assisted in assessing the quality of the recovered genomes. P.H. and G.W.T. guided the study and helped write the manuscript.

Competing interests

The authors declare no competing financial interests.

Additional information

Supplementary information is available for this paper at doi:10.1038/s41564-017-0012-7.

Reprints and permissions information is available at www.nature.com/reprints.

Correspondence and requests for materials should be addressed to G.W.T.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
