EVE 161: Microbial Phylogenomics

Case Study 1000s of MAGs

UC Davis, Winter 2018 Instructor: Jonathan Eisen Teaching Assistant: Cassie Ettinger

!1 • Presenters? • Topics to Cover? ARTICLES DOI: 10.1038/s41564-017-0012-7

Corrected: Author correction Recovery of nearly 8,000 metagenome-assembled genomes substantially expands the tree of life

Donovan H. Parks! !, Christian Rinke! !, Maria Chuvochina, Pierre-Alain Chaumeil, Ben J. Woodcroft, Paul N. Evans, Philip Hugenholtz! !* and Gene W. Tyson*

Challenges in cultivating microorganisms have limited the phylogenetic diversity of currently available microbial genomes. This is being addressed by advances in sequencing throughput and computational techniques that allow for the cultivation-inde- pendent recovery of genomes from metagenomes. Here, we report the reconstruction of 7,903 bacterial and archaeal genomes from > 1,500 public metagenomes. All genomes are estimated to be ≥ 50% complete and nearly half are ≥ 90% complete with ≤ 5% contamination. These genomes increase the phylogenetic diversity of bacterial and archaeal genome trees by > 30% and provide the first representatives of 17 bacterial and three archaeal candidate phyla. We also recovered 245 genomes from the Patescibacteria superphylum (also known as the Candidate Phyla Radiation) and find that the relative diversity of this group varies substantially with different protein marker sets. The scale and quality of this data set demonstrate that recovering genomes from metagenomes provides an expedient path forward to exploring microbial dark matter.

equencing of microbial genomes has accelerated with reduc- phyla previously lacking genomic representatives27–29, including tions in sequencing costs, and public repositories now contain the Patescibacteria superphylum4, which has subsequently been nearly 70,000 bacterial and archaeal genomes. The majority referred to as the ‘Candidate Phyla Radiation’ (CPR) as it may con- S 1,2 10,30 of these genomes have been obtained from axenic cultures and sist of upwards of 35 candidate phyla . Notable evolutionary and disproportionately reflect microorganisms of medical importance3. metabolic insights include the discovery of eukaryotic-like cytoskel- Consequently, current genome repositories are not representative eton genes in the archaeon Lokiarchaeota31,32 and the identification of the microbial diversity known from 16S rRNA gene surveys4. of putative methane-metabolizing genes in the Bathyarchaeota and Concerted efforts are being made to address this limitation by Verstraetearchaeota phyla33,34. These initial studies demonstrate the target ing phylogenetically distinct microorganisms for cultivation5–7 need for additional genomic representatives across the tree of life in and single-cell sequencing4,8. Although these approaches continue order to more fully appreciate microbial evolution and . to provide valuable reference genomes, the former is restricted to Here, we present the first large-scale initiative to recover MAGs microorganisms amenable to cultivation and the latter is hampered from publicly available metagenomes. Nearly 8,000 draft-quality by technical challenges and the need for specialised equipment9. genomes were recovered from over 1,500 metagenomes, more than Obtaining genomes from metagenomes is an emerging approach with a threefold increase over large initiatives to genomically populate the potential for large-scale recovery of near-complete genomes10–12. the tree of life such as the Genomic Encyclopedia of and Until recently, recovering genomes from metagenomic data was Archaea35 (~2,000 genomes), the Human Microbiome Project3 restricted to samples with low microbial diversity13, but improved (~2,000) and the largest previous MAG study11 (~2,500). We refer to sequencing throughput and advances in computational techniques our set of MAGs as the Uncultivated Bacteria and Archaea (UBA) now allow metagenome-assembled genomes (MAGs) to be recov- data set. Genome-based phylogenetic analysis indicates that the ered from high diversity environments14,15. MAGs are obtained UBA genomes provide the first representatives of several major by grouping or ‘binning’ together assembled contigs with simi- bacterial and archaeal lineages and substantially expand genomic lar sequence composition, depth of coverage across one or more representation across the tree of life. related samples and taxonomic affiliations16,17. Several tools have been developed that exploit these sources of information to pro- Results duce genomes from metagenomic data18–21 and there are ongo- Genomes are readily recovered from metagenomic data. MAGs ing efforts to evaluate the effectiveness of different approaches22. were recovered from 1,550 metagenomes submitted to the Sequence Although closed genomes have been obtained using metagenomic Read Archive (SRA) before 31 December 2015 (Supplementary binning methods10,23, MAGs are typically incomplete and may con- Fig. 1). We predominantly considered environmental and non- tain contigs from multiple strains or species due to challenges in human gastrointestinal samples in order to focus on metagenomes distinguishing between related community members both in the likely to contain microbial populations from under-sampled lin- assembly and binning processes19,24. This has spurred the develop- eages (Supplementary Table 1). The completeness and contamina- ment of methods for assessing the quality of recovered MAGs in tion of each MAG was estimated from the presence and absence order to allow biological inferences to be made with regards to their of lineage-specific genes expected to be ubiquitous and single estimated completeness and contamination25,26. copy25, and these estimates, along with assembly statistics, used to Significant insights have recently been made based on the MAGs identify genomes suitable for further study. A total of 64,295 MAGs of uncultivated microorganisms. These include elucidation of several were obtained, of which 7,903 (7,280 bacterial and 623 archaeal)

Australian Centre for Ecogenomics, School of Chemistry and Molecular Biosciences, The University of Queensland, St Lucia, Queensland 4072, Australia. *e-mail: [email protected]; [email protected]

NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology 1533 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. ARTICLES NATURE MICROBIOLOGY

a 12% b 83% of genomes have ≤200 scafolds 12 10 Near complete (3,438 genomes; 43.5%) Medium quality (3,459 genomes; 43.8%) 8 Partial (1,006 genomes; 12.7%) 4

8 Genomes (%) 0 0 100 200 300 400 500 No. of scafolds 6

c 75% of genomes have 4 20 ≥15 standard tRNAs Contamination (%) 15

2 10

Genomes (%) 5

0 0 50 60 70 80 90 100 33% 20 15 10 5 0 Completeness (%) Standard tRNA count

Fig. 1 | Assessment of genome quality. a, Estimated completeness and contamination of 7,903 genomes recovered from public metagenomes. Genome quality was defined as completeness!− !5!× !contamination, and only genomes with a quality of ≥ 50 were retained. Near-complete genomes (completeness ≥ 90%; contamination ≤ 5%) are shown in red, medium-quality genomes (completeness ≥ 70%; contamination ≤ 10%) in blue, and partial genomes (completeness ≥ 50%; contamination ≤ 4%) in grey. Histograms along the x and y axes show the percentage of genomes at varying levels of completeness and contamination, respectively. Notably, only 171 of the 7,903 (2.2%) UBA genomes have > 5% contamination. b, Number of scaffolds comprising each of the 7,903 genomes with colours indicating genome quality. c, Number of tRNAs for each of the 20 standard amino acids identified within each of the 7,903 genomes with colours indicating genome quality. form the UBA data set as they met our filtering criteria of having an Taxonomic distribution of UBA genomes. The phylogenetic rela- estimated quality ≥ 50 (defined as the estimated completeness of a tionships of the UBA genomes were determined across bacterial genome minus five times its estimated contamination) and consist- and archaeal trees inferred from three concatenated protein sets: ing of ≤500 scaffolds with an N50 ≥ 10 kb (Fig. 1 and Supplementary (1) a syntenic block of 16 ribosomal proteins (rp1) recently used to Table 2). Over 93% of the 7,903 UBA genomes have an average cov- infer genome-based phylogenies10,30 (Supplementary Table 4), (2) 23 erage of ≥ 10× (5th percentile, 9.2×, 95th percentile, 268× ) and ribosomal proteins (rp2) previously tested for lateral gene transfer4 95.8% have > 5× coverage over 90% of bases, providing assurance (Supplementary Table 5), and (3) 120 bacterial (bac120) and 122 of high-quality base-calling across the genomes3,36. Among the UBA archaeal (ar122) proteins we have identified as being suitable for genomes is a subset of 3,438 near-complete genomes (3,225 bacte- phylogenetic inference (Supplementary Tables 6 and 7). The trees rial and 213 archaeal) estimated to be ≥ 90% complete with ≤ 5% span ~19,000 bacterial and ~1,000 archaeal genomes after species- contamination (Fig. 1a). These genomes consist of ≤ 100 scaffolds level dereplication of the UBA genomes and 67,479 genomes in in 70.2% of cases (≤ 200 scaffolds in 92.0% genomes) and have an RefSeq/GenBank release 76 (Supplementary Table 8). average N50 of 136 kb. Comparison of near-complete UBA genomes UBA genomes were represented in the majority of bacterial that are conspecific strains of complete isolate genomes also sug- (47 of 59, Fig. 2) and archaeal (11 of 18, Fig. 3) phyla, as defined in gest that the recovered MAGs have no systematic loss of genomic the NCBI taxonomy39. In addition, they comprise the first genomic content, with the exception of extrachromosomal elements such as representatives of 17 bacterial and 3 archaeal phyla (see section plasmids (Supplementary Note 1). ‘UBA genomes are the first representatives of several phyla’). To pro- The UBA data set was also assessed relative to the criteria used vide an objective taxonomic analysis of the UBA data set, we used by the Human Microbiome Project (HMP) for defining high-qual- the phylogenetic criterion of mean branch length to extant taxa30 as ity draft genomes3,37. Of the 3,438 UBA genomes we have defined existing taxonomic classifications are not phylogenetically uniform1. as near complete, 3,201 (93.1%) pass all of the HMP criteria, with The results were highly consistent across all trees and, as expected, the only substantial exception being 4.8% of the genomes having named groups at each taxonomic rank vary substantially in their scaffolds with an N50 of <20 kb (Supplementary Table 3). Nearly mean branch length to extant taxa (Fig. 4 and Supplementary Figs. 3 half of the remaining 4,465 UBA genomes also pass the HMP and 4). Based on the range of mean branch length values for estab- criteria for being high quality except that they are estimated to be lished taxa, the bacterial UBA genomes are exclusive representatives < 90% complete. of 20–30% of all genus- to order–level lineages, 15–30% of class-level The presence of tRNAs for the standard 20 amino acids was lineages and 5–15% of -level lineages (Fig. 4a). Similarly, the examined as a secondary measure of genome quality (Fig. 1c). archaeal UBA genomes are the only representatives within 20–30% The 3,438 near-complete UBA genomes have tRNAs that encode of genus- to order-level lineages, 15–30% of class-level lineages for an average of 17.3 ± 2.2 of the 20 amino acids and ≥ 15 amino and around 10% of archaeal phyla (Fig. 4b). We also tabulated the acids in 90.3% of the genomes. The correlation between estimated number of UBA-exclusive lineages at the 50th, 90th and 95th per- genome completeness and identified tRNAs was positive but weak centiles of the mean branch length distribution of each taxonomic (Supplementary Fig. 2) as tRNAs are regularly present in multiple rank (Supplementary Table 9). At the conservative 90th percentile, copies and often collocated in a genome, making them poor markers the bacterial UBA genomes are the first genomic representatives for robustly estimating completeness25,38. of 766 genera (34.6%), 226 families (28.6%), 61 orders (21.6%) and

1534 NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. NATURE MICROBIOLOGY ARTICLES

0.1 (1,513) FCB

Aminicenantes (10) (35) (5) Modulibacteria Chlorobi NC10 Ignavibacteriae (29) Rokubacteria Calditrichaeota (5) (23) Marinimicrobia (46), UBP11 (1) Chrysio. (1) Nitrospirae_1 (19) (17), (39) UBP10 (2) Latescibacteria (4) Dadabacteria (2) MBNT15 (2) , UBP14 (1), TA06, UBP1 (3), UBP2 (6) Deltaproteo. (1-3) (226) Cloacimonetes (31), Hyd24-12 (5), WOR-3 (10) (178) PVC (2,101) Lentisphaerae (12), , UBP17 (1) (61) Hydrogenedentes (2) Patescibacteria (245) (23) Aerophobetes, UBP5 (2) 'Candidate phyla radiation' Omnitrophica (14), UBP3 (4), UBP4 (2) UBP7 (2) UBP8 (5) Dependentiae (1) (135) UBP6 (4) (8) UBP13 (1) Calescamantes (24) Aquificae (1) Deferribacteres (5) UBP15 (1) (52) UBP16 (1) (5) (1,666) (47) UBP12 (1) (336) Dictyoglomi UBP9/SHA-109 (8) Coprothermo. Chloroflexi (182) Caldiserica (5) -thermus (6) Acetothermia (10) (11) (48)

Fig. 2Fig. | Distribution 2 | Distribution of of UuBABA genomes genomes across across 76 bacterial 76 . The phyla. maximum The likelihood maximum tree was likelihood inferred from tree the was concatenation inferred from of 120 theproteins concatenation and of 120 proteins spans a dereplicated set of 5,273 UBA and 14,304 NCBI genomes. Phyla containing UBA genomes are shown in green with the number of UBA genomes and spansindicated a dereplicated in parentheses. set Candidate of 5,273 phyla UBA consisting and 14,304only of UBA NCBI genomes genomes. are shown Phyla in red containing and have been UBA named genomes Uncultured are Bacterial shown Phylum in green 1 to 17 with the number of UBA genomes(UBP1–UBP17). indicated The in Tenericutesparentheses. and ThermodesulfobacteriaCandidate phyla consisting lineages are only not shown of UBA as they genomes branch within are shownthe Firmicutes in red and and , have been named Uncultured Bacterial Phylumrespectively. 1 to 17 The(UBP1–UBP17). Patescibacteria/CPR The is shownTenericutes as a single and lineage for clarity and to illustrate lineages its similarity are to not well-characterized shown as they phyla. branch The Nitrospirae within the Firmicutes and Deltaproteobacteria,and Deltaproteobacteria respectively. were found The to be Patescibacteria/CPR polyphyletic and an underscore is shown and numerical as a single identifier lineage is used for to clarity distinguish and between to illustrate lineages. its similarity to well-characterized phyla.The The Deltaproteobacteria Nitrospirae and are Deltaproteobacteriapolyphyletic due to the placement were found of the Dadabacteria. to be polyphyletic All named andlineages an haveunderscore bootstrap supportand numerical of ≥ 75%, with identifier the exception is used of the to distinguish betweenChrysiogenetes lineages. (55%), The DeltaproteobacteriaFirmicutes (1%) and Verrucomicrobia are polyphyletic (52%). The due complete to the tree placement with support of values the Dadabacteria. is available in Newick All format named (Supplementary lineages have File 1). bootstrap support of ≥75%, with the exception of the Chrysiogenetes (55%), Firmicutes (1%) and Verrucomicrobia (52%). The complete tree with support values is available in Newick format (Supplementary File 1). 38 classes (18.0%) within this . Similarily, the archaeal UBA Only UBP9, UAP2 and UAP3 could be further taxonomically genomes represent 59 genera (30.3%), 25 families (28.4%), 13 orders resolved as the other candidate phyla either lack genomes with a 16S (23.6%) and 3 classes (15.0%) at the 90th percentile. rRNA gene, were placed sister to named phyla, or had incongruent placements across the protein and 16S rRNA trees (Supplementary UBA genomes are the first representatives of several phyla. Tables 11 and 14). UBP9 genomes are the first genomic representa- The 17 bacterial and 3 archaeal phyla comprised exclusively of tives of the Terrabacteria candidate phylum SHA-109 (Fig. 2) and UBA genomes have been given the candidate names Uncultured were recovered from baboon faeces (five genomes), palm oil effluent Bacterial Phylum 1 to 17 (UBP1 to 17, Fig. 2) and Uncultured (one genome), a toluene degrading community (one genome) and Archaeal Phylum 1 to 3 (UAP1 to 3, Fig. 3). These candidate phyla a dechlorination bioreactor (one genome, Supplementary Table 14). form well-supported clans in all three concatenated protein trees UAP2 contains the first representatives of the Marine Hydrothermal (Supplementary Table 10) and are unaffiliated with existing phyla40 Vent Group (MHVG) and consists of three genomes recovered from (Supplementary Table 11). The 10 UBP/UAP phyla with 16S rRNA the Tara Oceans Expedition along with a single genome from the genes ≥ 600 bp have inter-phyla percent identity values between 76% Beebe hydrothermal vent (Supplementary Table 14 and Fig. 3). and 86% (Supplementary Table 12), in agreement with established UAP3 is represented by a single genome recovered from a Costa phyla41 (Supplementary Fig. 5). However, these 16S rRNA results Rican metagenome (Supplementary Table 14) and should be treated with some caution as the percent identity of is the first representative of the Ancient Archaeal Group (AAG), a incomplete 16S rRNA sequences correlates poorly with values for group adjacent to the Lokiarchaeota (Fig. 3). full-length sequences42. Because 16S rRNA genes often fail to assem- ble43 and are missing from half of the UBP/UAP lineages, we used UBA genomes substantially increase phylogenetic diversity. The average amino-acid identity (mean of 46.2%) and shared gene con- phylogenetic diversity of the UBA genomes was determined across tent (mean of 24.4%) calibrated against established phyla to further the three concatenated protein trees. The results were highly con- support the classification of the UBP/UAP as phyla (Supplementary sistent across these trees with the UBA genomes covering > 50% Fig. 5 and Supplementary Table 13). of the total branch length (phylogenetic diversity, PD) spanned To further resolve the taxonomic identity of the UBP and UAP by each domain-specific genome tree and increasing total branch genomes, 16S rRNA genes from these genomes were placed into a length (phylogenetic gain, PG) by ~30% (Fig. 5 and Supplementary tree containing genomic and environmental 16S rRNA sequences. Table 8). For comparison, the PD of the CPR (1,056 genomes,

NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology 1535 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. ARTICLES NATURE MICROBIOLOGY

0.1 TACK Crenarchaeota (11) Thaumarchaeota (26)

Korarchaeota

Verstraetearchaeota (2)

Aigarchaeota Euryarchaeota (538)

Bathyarchaeota (3)

Thorarchaeota

Lokiarchaeota UAP2/MHVG(4) UAP3/AAG (1)

Diapherotrites (1)

Micrarchaeota (12) Woesearchaeota_2 (2)

Aenigmarchaeota

Pacearchaeota (3) Nanohaloarchaeota Woesearchaeota (6) Nanoarchaeota UAP1 (2) Parvarchaeota (11) DPANN

Fig. 3 | Distribution of UBA genomes across 21 archaeal phyla. The maximum likelihood tree was inferred from the concatenation of 122 proteins and spans a dereplicated set of 453 UBA and 630 NCBI genomes. Phyla containing UBA genomes are shown in green with the number of UBA genomes indicated in parentheses. Candidate phyla consisting only of UBA genomes are shown in red and have been named Uncultured Archaeal Phylum 1 to 3 (UAP1–UAP3). The phylum Woesearchaeota was found to be polyphyletic and an underscore and numerical identifier is used to distinguish between lineages. All named lineages have bootstrap support ≥ 90%, with the exception of the Crenarchaeota (73%). The complete tree with support values is available in Newick format (Supplementary File 2). MHVG, Marine Hydrothermal Vent Group; AAG, Ancient Archaeal Group. including 245 UBA genomes) and Firmicutes (25,992 genomes, as the Micrarchaeota, Pacearchaeota and Woesearchaeota which all including 1,666 UBA genomes) is 11.2% and 25.1%, respectively had a PG of > 35%. (Fig. 5). Restricting results to bacterial UBA genomes meeting our near-complete or medium-quality criteria still results in PGs Improved genomic representatives within several lineages. There of ~17% and 30%, respectively. The near-complete and medium- are 12 bacterial phyla where the UBA genomes are estimated to quality archaeal UBA genomes provide a phylogenetic gain of 10% be the highest-quality representatives (Supplementary Table 17). and 29%, respectively (Supplementary Table 8). Among these is the Aminicenantes, where the number of avail- Genomic representation of several bacterial lineages was greatly able genomes increased from 37 to 47 with the addition of the expanded by the UBA genomes (Fig. 5). The bacterial phyla with the UBA genomes, and the highest-quality genome improved from largest increase in PD were the underrepresented Aminicenantes, 86.9% to 91.9% complete, with the five highest-quality genomes all Gemmatimonadetes, Lentisphaerae and Omnitrophica lineages (PG being UBA genomes. There are currently seven Hydrogenedentes of > 75%). Over 75% of the bacterial UBA genomes belong to the genomes (five NCBI and two UBA), with the two highest-quality Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. These representatives being UBA genomes and appreciably improving genomes expand the PD of these phyla by 14–47%, despite > 50% upon the best previously available representative (88.3% complete, of existing genomic representatives belonging to these four phyla 4.3% contaminated to 98.9% complete, 1.1% contaminated). The (Fig. 5 and Supplementary Table 15). Such high levels of increased most substantial improvement was in the Latescibacteria, where phylogenetic diversity are the norm, with 56 of 77 phyla and 73 of all previous representatives were derived from single cells and the 143 classes being expanded by > 20% (Supplementary Table 16). UBA genomes improve the best-quality representative from 57.6% The UBA genomes have no representatives in only 10 bacterial to 95.6% complete. phyla, which we attribute to the narrow ecological range and/or low The UBA genomes are also the highest-quality representa- relative abundance of microorganisms belonging to these lineages tives of five archaeal phyla (Supplementary Table 17). Notably, the (for example, NC10 and Aerophobetes). Within the Archaea, the Parvarchaeota was previously represented by only four MAGs with PD of 10 of 21 phyla and 7 of 17 classes increased by > 20% (Fig. 5 the highest quality being 76.7% complete with 5.6% contamination, and Supplementary Tables 15 and 16). This includes well-estab- and there are 11 Parvarchaeota UBA MAGs with an average com- lished archaeal groups such as the Euryarchaeota (PG of 34.8%) and pleteness and contamination of 80.6 ± 4.2% and 1.7 ± 0.6%, respec- Thaumarchaeota (PG of 40.8%) and poorly sampled groups such tively. The UBA genomes also improve the completeness of the best

1536 NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. ARTICLES NATURE MICROBIOLOGY

0.1 TACK Crenarchaeota (11) Thaumarchaeota (26)

Korarchaeota

Verstraetearchaeota (2)

Aigarchaeota Euryarchaeota (538)

Bathyarchaeota (3)

Thorarchaeota

Lokiarchaeota UAP2/MHVG(4) UAP3/AAG (1)

Diapherotrites (1)

Micrarchaeota (12) Woesearchaeota_2 (2)

Aenigmarchaeota

Pacearchaeota (3) Nanohaloarchaeota Woesearchaeota (6) Nanoarchaeota UAP1 (2) Parvarchaeota (11) DPANN

Fig. 3 | Distribution of UBA genomes across 21 archaeal phyla. The maximum likelihood tree was inferred from the concatenation of 122 proteins and spans a dereplicated set of 453 UBA and 630 NCBI genomes. Phyla containing UBA genomes are shown in green with the number of UBA genomes indicated in parentheses. Candidate phyla consisting only of UBA genomes are shown in red and have been named Uncultured Archaeal Phylum 1 to 3 (UAP1–UAP3). The phylum Woesearchaeota was found to be polyphyletic and an underscore and numerical identifier is used to distinguish between lineages. All named lineages have bootstrap support ≥ 90%, with the exception of the Crenarchaeota (73%). The complete tree with support values is available in Newick format (Supplementary File 2). MHVG, Marine Hydrothermal Vent Group; AAG, Ancient Archaeal Group.

including 245 UBA genomes) and Firmicutes (25,992 genomes, as the Micrarchaeota, Pacearchaeota and Woesearchaeota which all including 1,666 UBA genomes) is 11.2% and 25.1%, respectively had a PG of > 35%. (Fig. 5). Restricting results to bacterial UBA genomes meeting our near-complete or medium-quality criteria still results in PGs Improved genomic representatives within several lineages. There of ~17% and 30%, respectively. The near-complete and medium- are 12 bacterial phyla where the UBA genomes are estimated to quality archaeal UBA genomes provide a phylogenetic gain of 10% be the highest-quality representatives (Supplementary Table 17). and 29%, respectively (Supplementary Table 8). Among these is the Aminicenantes, where the number of avail- Genomic representation of several bacterial lineages was greatly able genomes increased from 37 to 47 with the addition of the expanded by the UBA genomes (Fig. 5). The bacterial phyla with the UBA genomes, and the highest-quality genome improved from largest increase in PD were the underrepresented Aminicenantes, 86.9% to 91.9% complete, with the five highest-quality genomes all Gemmatimonadetes, Lentisphaerae and Omnitrophica lineages (PG being UBA genomes. There are currently seven Hydrogenedentes of > 75%). Over 75% of the bacterial UBA genomes belong to the genomes (five NCBI and two UBA), with the two highest-quality Actinobacteria, Bacteroidetes, Firmicutes and Proteobacteria. These representatives being UBA genomes and appreciably improving genomes expand the PD of these phyla by 14–47%, despite > 50% upon the best previously available representative (88.3% complete, of existing genomic representatives belonging to these four phyla 4.3% contaminated to 98.9% complete, 1.1% contaminated). The (Fig. 5 and Supplementary Table 15). Such high levels of increased most substantial improvement was in the Latescibacteria, where phylogenetic diversity are the norm, with 56 of 77 phyla and 73 of all previous representatives were derived from single cells and the 143 classes being expanded by > 20% (Supplementary Table 16). UBA genomes improve the best-quality representative from 57.6% The UBA genomes have no representatives in only 10 bacterial to 95.6% complete. phyla, which we attribute to the narrow ecological range and/or low The UBA genomes are also the highest-quality representa- relative abundance of microorganisms belonging to these lineages tives of five archaeal phyla (Supplementary Table 17). Notably, the (for example, NC10 and Aerophobetes). Within the Archaea, the Parvarchaeota was previously represented by only four MAGs with PD of 10 of 21 phyla and 7 of 17 classes increased by > 20% (Fig. 5 the highest quality being 76.7% complete with 5.6% contamination, and Supplementary Tables 15 and 16). This includes well-estab- and there are 11 Parvarchaeota UBA MAGs with an average com- lished archaeal groups such as the Euryarchaeota (PG of 34.8%) and pleteness and contamination of 80.6 ± 4.2% and 1.7 ± 0.6%, respec- Thaumarchaeota (PG of 40.8%) and poorly sampled groups such tively. The UBA genomes also improve the completeness of the best

1536 NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. NATURE MICROBIOLOGY ARTICLES

abNumber of bacterial lineages Number of archaeal lineages

NATURE MICROBIOLOGY5,107 1,927 821 357 151 76 36 14 3 529 287 181 116 77 54 39 22 16 4 100 100 ARTICLES 76 Phylum 4 Phylum 80 abNumber of bacterial lineages 80 Number of archaeal lineages 379 51 Class Class 5,107 1,927 821 357 151 76 36 14 3 529 287 181 116 77 54 39 22 16 4 60 100 10060 867 76 99 4 OrderPhylum Phylum Order 80 80 40 379 40 51 1,735 Class 161 Class Lineages (%) Lineages (%) 60 Family 60 Family 867 99 20 Order 20 Order 4,150 40 40 328 1,735 Genus 161 Genus Lineages (%) Lineages (%) Family Family 20 20 0 4,150 0 328 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Genus 0.0 0.2 0.4 0.6 Genus0.8 1.0 0 0 Mean branch0.0 length0.2 (substitutions0.4 0.6 0.8 per1.0 site)1.2 0.0 Mean0.2 branch0.4 length0.6 (substitutions0.8 1.0 per site) Mean branch length (substitutions per site) Mean branch length (substitutions per site) UBA NCBI UBA and NCBI UBA NCBI UBA and NCBI

Fig. 4 | Percentage ofFig. lineages 4 | Percentage of equal of lineages evolutionary of equal evolutionary distance distance represented represented exclusivelyexclusively by UbyBA UgenomesBA genomes. a, Percentage. a, Percentage of bacterial lineages of bacterial represented lineages represented exclusively by the UBAexclusively genomes, by the UBAgenomes genomes, at genomes NCBI, ator NCBI, both or data both data sets sets as as defined defined on on the the basis basis of mean of branchmean length branch to extant length taxa. to Mean extant branch taxa. length Mean branch length to extant taxa was calculated for all internal nodes and lineages defined at a specific threshold to establish lineages of equal evolutionary distance. The to extant taxa was calculatedbottom x axis for indicates all internal the threshold nodes used and to define lineages lineages defined (0 = extant at taxa,a specific far right =threshold root of tree), to the establish top x axis indicates lineages the of number equal of evolutionarylineages at distance. The bottom x axis indicatesdifferent the mean threshold branch length used thresholds, to define and lineages the left y axis (0 indicates= extant the taxa, percentage far right of lineages. = root The of right tree), y axis the indicates top differentx axis indicates taxonomic ranks, the numberwith of lineages at the 5th and 90th percentiles of the distribution of mean branch length values for these ranks shown by black lines. The mean of each distribution is shown different mean branchas a length black circle thresholds, along with the and number the ofleft lineages y axis at indicatesthis value. b, theAnalogous percentage plot to a, butof lineages.for archaeal Thelineages. right The y phylum axis indicates distribution isdifferent a single point, taxonomic ranks, with the 5th and 90th percentilesas the Euryarchaeota of the distribution is the only archaeal of mean phylum branchwith phylum-level length resolutionvalues for (that these is, containing ranks multipleshown classes). by black Plots lines. were determined The mean across of each the distribution is shown as a black circle alongdomain-specific with the number trees inferred of lineages from the concatenation at this value. of 120 b bacterial, Analogous and 122 plot archaeal to aproteins,, but for respectively. archaeal lineages. The phylum distribution is a single point, as the Euryarchaeota is the only archaeal phylum with phylum-level resolution (that is, containing multiple classes). Plots were determined across the domain-specific treesrepresentative inferred from within the the concatenation Micrarchaeota of (Micrarchaeum 120 bacterial acidiphi and 122- thearchaeal phylogenetic proteins, diversity respectively. of bacterial and archaeal genome trees lum ARMAN-2) from 84% complete to 91% complete, while add- by > 30% through the addition of 7,280 bacterial and 623 archaeal ing 11 other genomes all estimated to be > 70% complete with < 3% genomes obtained from over 1,500 public metagenomes (Fig. 5). contamination. These MAGs span the majority of recognized bacterial and archaeal representative within the Micrarchaeota (Micrarchaeum acidiphi- phylathe phylogenetic and include the first diversity genomic of representatives bacterial andof 17 archaeal bacte- genome trees lum ARMAN-2) fromAn alternative 84% complete view of the toCPR. 91% The complete, CPR has recently while been add pro-- rialby and 30% three througharchaeal phyla the (Figs.addition 2 and 3of). The7,280 7,903 bacterial genomes and 623 archaeal posed as a major collection of candidate phyla in the bacterial reported> in this study range in quality from 50% complete to meet- ing 11 other genomesdomain all30. estimatedUnder the bac120 to be tree > and 70% mean complete branch length with to

NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology 1537 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. ARTICLES NATURE MICROBIOLOGY

NCBI PD Total Dereplicated Relative UBA PD NCBI UBANCBI UBA Total PD Taxon PD UBA PG UBA PG/PD Bacteria 67,0347,244 14,304 5,273 100.0% 33.0% UBP (all) 046046 1.1% 1.1% Acetothermia 1101 6 0.1% 71.8% Fig. 5 | Phylogenetic diversity and gain for select Aminicenantes 410410 0.2% 81.0% bacterial and archaeal phyla. The 7,903 UBA Armatimonadetes 12 24 613 0.4% 67.2% genomes cover considerable phylogenetic diversity Bacteroidetes 1,324 1,513941 1,057 9.5% 47.4% Caldiserica 15140.2% 74.3% (PD) and contribute substantial phylogenetic gain Chloroflexi 61 182 50 169 2.7% 73.1% (PG) relative to genomes from RefSeq/GenBank Dadabacteria 12120.1% 74.8% release 76. PD and PG were assessed on the Elusimicrobia 323314 0.3% 70.5% Firmicutes 24,326 1,666 3,433 1,296 25.1% 32.2% domain-specific trees inferred from the Gemmatimonadetes 439337 0.2% 78.6% concatenation of 120 bacterial and 122 archaeal Hyd24-12 35350.1% 61.4% proteins. The bar charts give (1) the total PD for Latescibacteria 14130.1% 64.7% each lineage relative to the PD of the entire domain, Lentisphaerae 212212 0.3% 75.8% Omnitrophica 114114 0.3% 89.3% (2) the PD of the NCBI and UBA genomes relative Patescibacteria / CPR 811 245 811 245 11.2% 25.8% to the PD of the lineage, and (3) the PG contributed Proteobacteria 28,8602,101 4,889 1,272 20.5% 25.0% by the UBA genomes within the lineage. The 17 Verrucomicrobia 45 17844166 1.9% 65.5% UBP and 3 UAP lineages were collapsed together to Archaea 825 623630 453 100.0% 30.7% show their combined phylogenetic diversity and UAP (all) 07071.1% 100.0% Euryarchaeota 534538 422 369 52.7% 34.8% gain. Marine Group II, a lineage within the Marine Group II 12 20612206 9.2% 85.9% Euryarchaeota, has been included because it Micrarchaeota 112112 2.4% 73.4% comprises the majority of euryarchaeotal UBA Pacearchaeota 53532.7% 37.9% Thaumarchaeota 49 26 44 26 6.4% 40.8% genomes. Woesearchaeota 76764.4% 42.9%

Fig. 5 | Phylogenetic diversity and gain for select bacterial and archaeal phyla. The 7,903 UBA genomes cover considerable phylogenetic diversity (PD) and contribute substantial phylogenetic gain (PG) relative to genomes from RefSeq/GenBank release 76. PD and PG were assessed on the domain-specific trees inferred from the concatenation of 120 bacterial and 122 archaeal proteins. The bar charts give (1) the total PD for each lineage relative to the PD of the entire domain, (2) the PD of the NCBI and UBA genomes relative to the PD of the lineage, and (3) the PG contributed by the UBA genomes within the lineage. The 17 UBP and 3 UAP lineages were collapsed together to show their combined phylogenetic diversity and gain. Marine Group II, a lineage within the Euryarchaeota, has been included because it comprises the majority of euryarchaeotal UBA genomes.

self-splicing introns and proteins being encoded within their rRNA natural and industrial processes. We anticipate that as metagenomic genes10. These contrasting views of the diversity of the CPR are assembly and binning methods mature we will be presented with equally valid and probably reflect the unique biology of the organ- the challenge and great opportunity to be able to study microbial isms within this group. communities with complete, or near complete, genomic representa- While the SRA represents a large set of publicly available tion in the context of a comprehensive tree of life. metagenomic data, many additional metagenomes exist in other repositories such as the Integrated Microbial Genomes and Note added in proof: During finalization of this manuscript, a new Metagenomes49 (IMG/M) database and Metagenomics Rapid standard specifying the minimum information about a metagenome- Annotation Server50 (MG-RAST). We expect that processing these assembled genome (MIMAG) was proposed51. The medium-qual- metagenomes will add tens of thousands of additional genomes ity and partial UBA MAGs meet the medium-quality criteria of to the tree of life. Furthermore, methods for assembling and bin- the MIMAG standard. However, most of the near-complete UBA ning metagenomic data are continually improving, which makes it MAGs do not meet the stringent rRNA and tRNA requirements likely that systematic reprocessing of metagenomic data will result for high-quality draft MAGs under this standard, and we therefore in the recovery of new genomes and improved versions of previ- deliberately refer to these MAGs as ‘near complete’. ously obtained genomes. The number and diversity of genomes presented in this study, and the many similar studies we anticipate will follow, move us Methods closer to a comprehensive genomic representation of the micro- Recovery of cultivation-independent genomes. Metadata for metagenomes in the Sequence Read Archive52 (SRA) at the National Center for Biotechnology bial world. Detailed examination of such genomes will further our Information (NCBI) were obtained from the SRAdb53. Only metagenomes understanding of microbial evolution and metabolic diversity, and submitted to the SRA before 31 December 2015 were considered with a provide important insights into the role of microorganisms in both predominant focus on environmental and non-human gastrointestinal samples

1538 NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. NATURE MICROBIOLOGY ARTICLES

Fig. 6 | Percentage of lineages of equal a bac120 Number of lineages b bac120 Number of lineages

evolutionary distance within the CPR. a, 3,270 1,282 583 279 133 68 34 12 2 5,107 1,927 821 357 151 76 36 14 3 100 100 76 68 Phylum Phylum Percentage of CPR lineages within the 80 80 291 379 Class Class bac120 tree without the addition of the UBA 60 60 616 867 Order Order 40 40 genomes. The maximum percentage of 1,162 1,735 Family Family Lineages (%) phylum-level lineages within the CPR 20 Lineages (%) 20 2,646 Genus 4,150 Genus 0 0 occurs at a mean branch length of 0.85, as 0.0 0.2 0.4 0.6 0.8 1.0 1.2 0.0 0.2 0.4 0.6 0.8 1.0 1.2 denoted by the dashed grey line. The Mean branch length (substitutions per site) Mean branch length (substitutions per site) remaining figure elements are the same as in c rps16 Number of lineages d rps23 Number of lineages

3,454 1,197 476 179 67 27 5 3,209 1,064 411 160 67 22 3 Fig. 4. b, Analogous plot to a, with the 100 100 71 Phylum 69 Phylum 80 80 addition of the UBA genomes (maximum = 475 449 Class Class 60 60 0.85). 924 Order 845 Order 40 40 c,d, Percentage of CPR lineages within the 2,031 Family 1,872

Lineages (%) Family Lineages (%) 20 20 rp1 and rp2 trees, respectively, with the 4,187 Genus 4,422 Genus 0 0 maximum percentage of phylum-level 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 lineages within the CPR denoted with a Mean branch length (substitutions per site) Mean branch length (substitutions per site) UBA exclusive Other bacteria CPR UBA exclusive CPR

dashed grey line (rp1 = 0.69; rp2 = 0.76) e rps16 Number of lineages f bac120 Number of lineages

e,f, Percentage of CPR, Firmicutes, 3,454 1,197 476 179 67 27 5 5,107 1,927 821 357 151 76 36 14 3 100 100 Actinobacteria, Proteobacteria and 71 Phylum 76 Phylum 80 80 475 379 Class Class Bacteroidetes lineages within the rp1 and 60 60 924 867 Order Order bac120 trees, respectively. The trees span 40 40 2,031 1,735 Family Family Lineages (%) Lineages (%) 14,290 (a), 19,198 (b and f), 18,081 (c and 20 20 4,187 4,150 Genus Genus e) and 18,226 (d) genomes. 0 0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 1.2 Mean branch length (substitutions per site) Mean branch length (substitutions per site) Other bacteria Proteobacteria Firmicutes Bacteroidetes Actinobacteria CPR

Fig. 6 | Percentage of lineages of equal evolutionary distance within the CPR. a, Percentage of CPR lineages within the bac120 tree without the addition of the UBA genomes. The maximum percentage of phylum-level lineages within the CPR occurs at a mean branch length of 0.85, as denoted by the dashed grey line. The remaining figure elements are the same as in Fig. 4. b, Analogous plot to a, with the addition of the UBA genomes (maximum"= "0.85). c,d, Percentage of CPR lineages within the rp1 and rp2 trees, respectively, with the maximum percentage of phylum-level lineages within the CPR denoted with a dashed grey line (rp1"= "0.69; rp2"= "0.76) e,f, Percentage of CPR, Firmicutes, Actinobacteria, Proteobacteria and Bacteroidetes lineages within the rp1 and bac120 trees, respectively. The trees span 14,290 (a), 19,198 (b and f), 18,081 (c and e) and 18,226 (d) genomes.

(for example, rumen, guinea pigs and baboon faeces; Supplementary Table 1). of bins with an estimated completeness > 70% and contamination < 5% were Metagenomes from studies where MAGs had previously been recovered were considered for further refnement and validation. excluded if the UBA MAGs did not provide appreciable improvements in genome quality or phylogenetic diversity. Each of the 1,550 metagenomes were processed Merging of compatible bins. Automated binning methods can produce multiple independently, with all SRA Runs within an SRA Experiment (that is, sequences bins from the same microbial population. The ‘merge’ method of CheckM v.1.0.6 from a single biological sample) being co-assembled using the CLC de novo was used to identify pairs of bins where the completeness increased by ≥ 10% and assembler v.4.4.1 (CLCBio). Assembly was restricted to contigs ≥ 500 bp and the the contamination increased by ≤ 1% when merged into a single bin. Bins meeting word size, bubble size and paired-end insertion size determined by the assembly these criteria were grouped into a single bin if the mean GC of the bins were within sofware. Assembly statistics are reported for contigs ≥ 2,000 bp (Supplementary 3%, the mean coverage of the bins had an absolute percentage difference Table 1). Reads were mapped to contigs with BWA54 v.0.7.12-r1039 using the ≤ 25%, and the bins had identical taxonomic classifications as determined by their BWA-MEM algorithm with default parameters and the mean coverage of contigs placement in the reference genome tree used by CheckM. This set of criteria was obtained using the ‘coverage’ command of CheckM25 v.1.0.6. Genomes were used to avoid producing chimaeric bins. independently recovered from each SRA Experiment using MetaBAT21 v.0.26.3 under all fve preset parameter settings (that is, verysensitive, sensitive, specifc, Filtering scaffolds with divergent genomic properties. Scaffolds with genomic veryspecifc, superspecifc). Te completeness and contamination of the genomes features deviating substantially from the mean GC, tetranucleotide signature, or recovered under each MetaBAT preset were estimated using CheckM using coverage of a bin were identified with the ‘outliers’ method of RefineM v.0.0.14 lineage-specifc markers genes and default parameters. For each SRA Experiment, (https://github.com/dparks1134/RefineM) using default parameters. This removes all only genomes recovered with the MetaBAT preset resulting in the largest number scaffolds with a GC or tetranucleotide distance outside the 98th percentile of the expected

NATURE MICROBIOLOGY | VOL 2 | NOVEMBER 2017 | 1533–1542 | www.nature.com/naturemicrobiology 1539 © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. © 2017 Macmillan Publishers Limited, part of Springer Nature. All rights reserved. Recently, the diversity of the CPR was explored in the context of a genome tree inferred from 16 ribosomal proteins where it was divided into 36 named phyla and shown to represent approximately 50% of bacterial lineages of equal phylum-level evolutionary distance30. Our analyses using a 120 concatenated proteins contrast with this view, as the CPR is shown to comprise ~25% of phylum- level lineages under the same criterion (Fig. 6b and Supplementary Fig. 7). This suggests that ribosomal proteins within CPR organisms may be evolving atypically relative to other proteins, perhaps as a result of their unusual ribosome composition and the presence of self-splicing introns and proteins being encoded within their rRNA genes10. These contrasting views of the diversity of the CPR are equally valid and probably reflect the unique biology of the organ- isms within this group.