<<

Articles https://doi.org/10.1038/s41564-020-00840-5

Genome-resolved metagenomics reveals site-specific diversity of episymbiotic CPR and DPANN in groundwater ecosystems

Christine He 1, Ray Keren 2, Michael L. Whittaker3,4, Ibrahim F. Farag 1, Jennifer A. Doudna 1,5,6,7,8, Jamie H. D. Cate 1,5,6,7 and Jillian F. Banfield 1,4,9 ✉

Candidate phyla radiation (CPR) bacteria and DPANN archaea are unisolated, small-celled symbionts that are often detected in groundwater. The effects of groundwater geochemistry on the abundance, distribution, taxonomic diversity and host associa- tion of CPR bacteria and DPANN archaea has not been studied. Here, we performed -resolved metagenomic analysis of one agricultural and seven pristine groundwater microbial communities and recovered 746 CPR and DPANN in total. The pristine sites, which serve as local sources of drinking water, contained up to 31% CPR bacteria and 4% DPANN archaea. We observed little species-level overlap of metagenome-assembled genomes (MAGs) across the groundwater sites, indicating that CPR and DPANN communities may be differentiated according to physicochemical conditions and host populations. Cryogenic transmission electron microscopy imaging and genomic analyses enabled us to identify CPR and DPANN lineages that reproduc- ibly attach to host cells and showed that the growth of CPR bacteria seems to be stimulated by attachment to host-cell surfaces. Our analysis reveals site-specific diversity of CPR bacteria and DPANN archaea that coexist with diverse hosts in groundwater aquifers. Given that CPR and DPANN organisms have been identified in human and their presence is correlated with diseases such as periodontitis, our findings are relevant to considerations of drinking water quality and human health.

etagenome-enabled phylogenomic analyses have led to despite harbouring an estimated 90% of all bacterial biomass32. CPR/ the classification of two groups of organisms that lack pure DPANN organism abundance is likely to have been underestimated culture representatives—the CPR bacteria and DPANN in genomic surveys because they are small enough to pass through M1–4 archaea . Although diverse, CPR and DPANN organisms share 0.2 µm filters, which are widely used to collect cells. Furthermore, conserved traits that are indicative of a symbiotic lifestyle, being ‘universal’ primers to divergent or intron-containing 16S rRNA ultrasmall in size with small genomes and minimal biosynthetic genes2,12,15 are unlikely to detect many members of both groups. capabilities5–9. Episymbiosis (surface attachment) with bacterial or Most of the available near-complete CPR and DPANN genomes are archaeal hosts has been observed in co-cultures of from just two aquifers2,19,22. In this Article, to investigate the roles with Actinobacteria10,11, with Crenarchaeota12–14, that CPR and DPANN organisms may have in groundwater ecosys- and Nanohaloarchaeota and archaeal Richmond Mine acidophilic tems, we applied genome-resolved metagenomics to analyse eight nanoorganisms (ARMAN) with Euryarchaeota15–17, and one case groundwater communities in Northern California, and cryogenic of endosymbiosis has been reported in which a member of the transmission electron microscopy (cryo-TEM) to image the com- CPR superphylum Parcubacteria inside a protist18. CPR and munity with the highest abundance of CPR/DPANN organisms. DPANN organisms are ubiquitous and can be abundant in ground- water, in which they are predicted to contribute to biogeochemical Results cycling2,4,8,9,19–24. CPR bacteria can persist in drinking water through Metagenome sampling and MAG assembly. The planktonic frac- multiple treatment methods25–27, posing the question of whether tions of eight groundwater communities in Northern California groundwater is a source of CPR10,11,28–30 and DPANN31 organisms were sampled during 2017–2019 (Fig. 1a) using bulk filtration detected in human microbiomes. (0.1 µm filter). Some sites were also sampled using serial size fil- The variation in the abundance and distribution of CPR and tration (2.5 µm, 0.65 µm, 0.2 µm and 0.1 µm filters) in parallel. This DPANN organisms in groundwater environments, their roles and enterprise required pumping 400–1,200 l of groundwater from their relationships with host organisms are not well characterized. each site through a purpose-built sequential filtration appara- Subsurface environments such as groundwater are difficult to sample tus to recover sufficient biomass for deep sequencing of each size and are poorly characterized compared with surface environments, fraction (Methods). One site (Ag) is an agriculturally impacted,

1Innovative Genomics Institute, University of California, Berkeley, CA, USA. 2Department of Civil and Environmental Engineering, University of California, Berkeley, CA, USA. 3Energy Geoscience Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. 4Department of Earth and Planetary Sciences, University of California, Berkeley, CA, USA. 5Department of Molecular and Cell Biology, University of California, Berkeley, CA, USA. 6Department of Chemistry, University of California, Berkeley, CA, USA. 7Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. 8Howard Hughes Medical Institute, University of California Berkeley, Berkeley, CA, USA. 9Department of and Microbial Biology, University of California, Berkeley, CA, USA. ✉e-mail: [email protected]

354 Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

a c 8 8 (i) (ii) Ag (Mar 2017) Ag (Sep 2017) Ep 6 17% CPR, 16% DPANN 6 37% CPR, 24% DPANN Q KJf (i) Qv E KI Tv 4 4 J KI Ep Qv 2 2 (ii) Sacramento Total coverage (%) coverage Total Total coverage (%) coverage Total KJf 0 0 Ep m 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Abundance rank Abundance rank J Ku um KI 8 8 Ag (Nov 2017) Ag (Feb 2018) 6 6 (iii) 40% CPR, 19% DPANN 38% CPR, 19% DPANN Pr1 Pr2 4 4 San Francisco QPc (iii) Q Pr3 Pr4 2 2 Total coverage (%) coverage Total Total coverage (%) coverage Total 0 0 Modesto Pr5 Pr6 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Abundance rank Abundance rank 8 8 San Jose Pr7 Ag Ag (Jun 2018) Pr1 (Dec 2018) 6 37% CPR, 17% DPANN 6 38% CPR, 4.2% DPANN 4 4

2 2 Total coverage (%) coverage Total Total coverage (%) coverage Total 0 0 b 100 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Abundance rank Abundance rank 8 8 Pr2 (May 2019) Pr3 (Nov 2017) 6 6 80 31% CPR, 3.2% DPANN 16% CPR, 3.5% DPANN 4 4

2 2 60 Total coverage (%) coverage Total Total coverage (%) coverage Total 0 0 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Abundance rank 8 Abundance rank 40 8

Total coverage (%) coverage Total Pr3 (Sep 2018) Pr4 (Sep 2018) 6 6 2.7% CPR, 0.35% DPANN 8.5% CPR, 1.2% DPANN

20 4 4 2 2 Total coverage (%) coverage Total Total coverage (%) coverage Total 0 0 0 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Ag Ag Ag Ag Ag Pr1 Pr2 Pr3 Pr3 Pr4 Pr5 Pr6 Pr7 Abundance rank Abundance rank 8 8 Pr5 (May 2019) Pr6 (May 2019) 6 6 6.4% CPR, 0.89% DPANN 7.0% CPR, 2.7% DPANN (Mar 2017) (Sep 2017) (Nov 2017) (Feb 2018) (Jun 2018) (Dec 2018) (May 2019) (Nov 2017) (Sep 2018) (Sep 2018) (May 2019) (May 2019) (Nov 2018) 4 4

Sampling point 2 2 Total coverage (%) coverage Total Total coverage (%) coverage Total 0 0 CPR Ca. Acetothermia 0 5 10 15 20 25 30 0 5 10 15 20 25 30 DPANN Ca. Tectomicrobia Bathyarchaeota Ca. Eisenbacteria Lambdaproteobacteria Abundance rank Abundance rank Ca. Schekmanbacteria TA06 Ca. Ratteibacteria Ca. Cloacimonetes Lentisphaerae 8 Ca. Omnitrophica Ca. Handelsmanbacteria Ca. Wallbacteria Ca. Hydrogenedentes Pr7 (Nov 2018) 6 Ca. Rokubacteria Ca. Latescibacteria 4.7% CPR, 0% DPANN DeItaproteobacteria Ca. Saganbacteria Ca. Riflebacteria 4 Ignavibacteria Ca. Lindowbacteria Ca. Margulisbacteria Ca. Dependentiae Unbinned NC10 Ca. Sumerlaeota Ca. KSB1 2 Ca. Dadabacteria Ca. Blackallbacteria –Thermus Oligoflexia

Nitrospinae Ca. Absconditabacteria Ca. Coatesbacteria (%) coverage Total 0 Ca. Desantisbacteria Ca. Abawacabacteria 0 5 10 15 20 25 30 Bacteria Ca. Stahlbacteria Abundance rank

Fig. 1 | Sampling and overview of groundwater communities. a, Map of eight Northern California groundwater sites sampled in this study (image from Google Maps). Insets: geological maps (i)–(iii) show the sampled areas (black boxes). J, marine sedimentary and metasedimentary rocks (Jurassic);

KJfm, marine sedimentary and metasedimentary rocks (Cretaceous–Jurassic); Q, marine and non-marine (continental) sedimentary rocks (Pleistocene– Holocene); Qv, marine sedimentary and metasedimentary rocks (Cretaceous–Jurassic); um, plutonic (Mesozoic); K, marine sedimentary rocks (Pliocene); Kl, marine sedimentary and metasedimentary rocks (Lower Cretaceous); Ep, marine sedimentary rocks (Paleocene); E, marine sedimentary rocks (Eocene); QPc, non-marine (continental) sedimentary rocks (Pliocene–Pleistocene); Tv, volcanic rocks (Tertiary). Scale bars, 23.3 km (large map), 3.2 km (i and ii) and 9.7 km (iii). b, -level breakdown (with the exception of CPR and DPANN superphyla) of rpS3 genes detected in each site. The sampling dates for each site are indicated. c, Rank abundance curves showing the 30 rpS3 genes with highest relative coverage identified for each site. The hatched bars indicate an unbinned rpS3 gene. river sediment-hosted aquifer and the remaining sites are pristine genomes across all sites (≥70% completeness and ≤10% contamina- groundwater aquifers hosted in a mixture of sedimentary and volca- tion). The median genome completeness was >90% and up to 58% nic rocks (Pr1–Pr7, numbered in decreasing order of total CPR and of metagenomic reads mapped to each site’s dereplicated genome DPANN organism abundance). On the basis of the high abundance set (assembly and binning statistics are provided in Supplementary and diversity of CPR/DPANN organisms found at the Ag site in a Tables 1 and 2). Of this dereplicated set, 540 and 206 genomes were previous metagenomics study33, we sampled five time points over classified as CPR bacteria and DPANN archaea, respectively. 15 months. Binning of bacterial and archaeal genomes from metagenomic Abundance and diversity of CPR/DPANN organisms. We first data was performed using four different binning algorithms/tech- sought to characterize and compare compositions of the eight niques on the basis of GC content, coverage, the presence/copies of groundwater communities, with a particular focus on CPR and ribosomal proteins and single-copy genes, tetranucleotide frequen- DPANN organisms. To broadly survey microbial community com- cies and patterns of coverage across samples (Methods). The highest position, we used the ribosomal protein uS3 (encoded by rpS3) as quality bins were chosen using DASTool34 and then manually a single-copy marker gene due to its strong phylogenetic signal36. A curated. All bins for a given site were dereplicated at 99% average comparison of rpS3 genes against recovered genomes indicated that, identity (ANI)35, resulting in a dereplicated set of 2,007 with the exception of the Pr2 site, the majority of the most abundant

Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology 355 Articles NATuRE MICRObIOlOGy

a b 25 Tree scale: 0.1 Site 166 DPANN archaea Ag Pr3 Pr6 20 Pr1 Woesearchaeota Woesearchaeota Pr4 Pr7 Diapherotrites Pr2 Pr5 15 Aenigmarchaeota Pacearchaeota Pacearchaeota 10 Mamarchaeota Micrarchaeota Total coverage (%) coverage Total 30 5 4 0 0 1 0 Micrarchaeota 0 Ag Ag Ag Ag Ag Pr6 Pr7 Pr3 Pr3 Pr4 Pr5 Pr1 Pr2 (Jun 2018) (Mar 2017) (Feb 2018) (Nov 2017) (Nov 2017) (Sep 2017) (Sep 2018) (Sep 2018) (Dec 2018) (May 2019) (May 2018) (May 2019) (May 2019) Sampling point Diapherotrites Mamarchaeota Altiarchaeota Nanoarchaeota 40 51 383 CPR bacteria Aenigmarchaeota Nanohaloarchaeota Parvarchaeota 35 Tree scale: 1

3 Ca .

Ca. Pacebacteria

Collierbacteria

Ca.

30 Ca.

Chisholmbacteria Ca. Ca. Beckwithbacteria

Levybacteria RoizmanbacteriaCa. Ca. Woesebacteria Ca . Shapirobacteria

Gottesmanbacteria 25 Ca. Ca. Gottesmanbacteria Ca.

Ca . Amesbacteria Ca. Genascibacteria Ca. Blackburnbacteria Ca. Saccharibacteria Ca. Absconditabacteria

Ca. Peregrinibacteria Ca. Daviesbacteria

20 Ca. 88 Curtissbacteria Ca. Doudnabacteria Ca. KAZAN Ca. Abawacabacteria Ca. Woykebacteria Ca. Andersenbacteria

Total coverage (%) coverage Total 15 Ca. Dojkabacteria Ca. Moisslbacteria Ca. Jacksonbacteria Ca. Zambryskibacteria Ca. Kuenenbacteria 10 5 Ca. Komeilibacteria Ca. Falkowbacteria/ 1 1 Ca. Taylorbacteria Dependentiae Ca. Parcubacteria Ca. Buchananbacteria 11 Ca. Adlerbacteria Ca. Buchananbacteria Ca. Falkowbacteria 5 Ca. Campbellbacteria

Ca. Kaiserbacteria Ca. Parcubacteria 0 Ca. Yonathbacteria To other bacteria Ca. Lloydbacteria Ca. Uhrbacteria Ag Ag Ag Ag Pr1 Pr3 Pr3 Pr4 Ag Pr2 Pr5 Pr6 Pr7 Ca. Nomurabacteria Ca. Lloydbacteria Ca. Magasanikbacteria (Mar 2017) (Feb 2018) (Nov 2017) (Dec 2018) (Sep 2017) (Nov 2017) (Sep 2018) (Sep 2018) (May 2019) (Jun 2018) (May 2019) (May 2019) (May 2018) Ca. Uhrbacteria Sampling point Ca. VogelbacteriaCa. Terrybacteria Ca. Torokbacteria Ca. Tagabacteria Ca. Parcubacteria

Ca. Tagabacteria Ca. Moranbacteria Ca. Adlerbacteria Ca. Komeilibacteria Ca. Taylorbacteria Ca. Daviesbacteria Ca. Parcubacteria Ca. Niyogibacteria Ca. Portnoybacteria Ca. Andersenbacteria Ca. Kuenenbacteria Ca. Terrybacteria Ca. Dojkabacteria Ca. ParcubacteriaCa. Spechtbacteria Ca. Azambacteria Ca. Liptonbacteria Ca. Uhrbacteria Ca. Gottesmanbacteria Ca. GiovannonibacteriaCa. Ryanbacteria Ca. Sungbacteria Ca. Parcubacteria Ca. Brennerbacteria Ca. Lloydbacteria Ca. Veblenbacteria Ca. Levybacteria Ca. Azambacteria Ca. Wildermuthbacteria Ca. Nealsonbacteria Ca. Buchananbacteria Ca. Magasanikbacteria Ca. Vogelbacteria Ca. Pacebacteria Ca. Parcubacteria Ca. Montesolbacteria Ca. GribaldobacteriaCa. Staskawiczbacteria Ca. Colwellbacteria Ca. Moranbacteria Ca. Wildermuthbacteria Ca. Roizmanbacteria Ca. Yanofskybacteria

Ca. Harrisonbacteria Ca. Falkowbacteria Ca. Nealsonbacteria Ca. Wolfebacteria Ca. Shapirobacteria Ca. Colwellbacteria Ca. LiptonbacteriaCa. Wolfebacteria Ca. Giovannonibacteria Ca. Niyogibacteria Ca. Yanofskybacteria Ca. Microgenomates Ca. Azambacteria Ca. Parcubacteria Ca. Gribaldobacteria Ca. Nomurabacteria Ca. Yonathbacteria Ca. Woesebacteria Ca. Parcubacteria

Ca. Parcubacteria Ca. Harrisonbacteria Ca. Portnoybacteria Ca. Zambryskibacteria Ca. Woykebacteria Ca. Jorgensenbacteria

Ca. Howlettbacteria Ca. Ryanbacteria Ca. Amesbacteria Ca. Berkelbacteria Ca. Yanofskybacteria Ca. Jacksonbacteria Ca. Spechtbacteria Ca. Blackburnbacteria Ca. Doudnabacteria Ca. Jorgensenbacteria Ca. Staskawiczbacteria Ca. Chisholmbacteria Ca. Katanobacteria Ca. Kaiserbacteria Ca. Sungbacteria Ca. Collierbacteria Ca. Peregrinibacteria Ca. Kerfeldbacteria Ca. Tagabacteria Ca. Curtissbacteria Ca. Saccharibacteria

Fig. 2 | Distribution of CPR and DPANN organisms across groundwater sites. a, Abundances (relative coverage of scaffolds containing rpS3 marker genes) of phylum-level lineages within DPANN (above) and CPR (below) in all sites. The numbers above the bars indicate the number of draft-quality (>70% complete, <10% contamination) dereplicated DPANN or CPR genomes that were recovered from metagenomic reads. b, Maximum likelihood of the DPANN radiation (top), based on 14 concatenated ribosomal proteins, and of the CPR (bottom), based on 15 concatenated ribosome proteins. Phylum-level lineages within the CPR (as previously defined2) are collapsed. Markers next to each lineage indicate the groundwater sites where at least one representative genome from that lineage was recovered. New CPR lineages ‘Candidatus Genascibacteria’ (within the Microgenomates superphylum in green) and ‘Candidatus Montesolbacteria’ (within the Parcubacteria superphylum in purple) are highlighted in grey. organisms at each site are represented by genome bins (Fig. 1c, the archaea in Ag groundwater (10–24%) is much higher compared to hatched bars indicate unbinned rpS3 genes). We found that all of the the pristine sites, where DPANN organisms comprise <5% of the groundwater communities are distinct in phylum-level composition community. Across all of the sites, genomes were recovered from 58 (Fig. 1b,c), with a strong divide between the Ag site and the pris- out of 73 currently identified phylum-level lineages within the CPR4 tine sites on the basis of principal component analysis (Extended and from 6 out of 10 currently identified phylum-level lineages Data Fig. 1). Change over time of the Ag groundwater community within the DPANN radiation (Fig. 2b). In particular, recovered CPR is examined in further detail in the ‘An agriculturally impacted genomes from Ag groundwater span most of the diversity within groundwater site rich in CPR/DPANN’ section. the CPR (Fig. 2b, filled black circles). On the basis of the criteria Specifically, the populations of CPR and DPANN organisms for 16S rRNA gene sequence identity (<76% for phylum-level21,37) are quite distinct between sites (Fig. 2a), although a few CPR and and concatenated ribosomal protein phylogenetic placement2, we DPANN lineages are fairly ubiquitous across sites (Extended Data defined two new phylum-level lineages within the CPR, each con- Fig. 2). Across all of the sites, CPR and DPANN organisms repre- sisting of sequences from Ag and Pr1 groundwater (Supplementary sent 3–40% and 0–24% of the communities (measured by bulk fil- Table 3). We propose the names ‘Candidatus Genascibacteria’ and tration onto a 0.1 µm filter), respectively. The abundance of DPANN ‘Candidatus Montesolbacteria’ for these new phylum-level lineages

356 Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

a b Ca. Dojkabacteria (1) Ca. Katanobacteria (3) – Ca. Curtissbacteria (7) NO Ca. Daviesbacteria (30) 3 es Ca. Gottesmanbacteria (28) Organic carbon t Ca. Roizmanbacteria (6) Ca. Microgenomates (10) 23% (199) 6.9% (169) 28% (240) 2– 19% (214) Ca. Levybacteria (30) – SO4 53% (592) 9.3% (70) Ca. Chisholmbacteria (2) NO 14% (160) 3% (44) Ca. Collierbacteria (6) 2 Acetate Ethanol Ca. Pacebacteria (1) Ag (Sep 2017) 2– ogenom a Ca. Shapirobacteria (1) NO SO Ca. Amesbacteria (3) 3 4% (45) 2– ic r –1 H Ca. Blackburnbacteria (6) S O 2.8% (64) M Ca. Genascibacteria (3) <0.05 mg l NH -N 2 3 2 Percentage of genomes containing key genes 4 18% (225) Ca. Woesebacteria (9) –1 H O Ca. Berkelbacteria (13) 100 165 mg l NO -N 2 Ca. Saccharibacteria (17) 3 18% (230) N O –1 2 Ca. Abawacabacteria (1) 69 mg l SO -S 0 Ca. Peregrinibacteria (28) 4 33% (359) CH4 4.0% (65) S 2.5% Ca. Andersenbacteria (5) 1,108 genomes + 6.6% (7) Ca. Doudnabacteria (10) NH 21% (276) (46) 73% Ca. Jacksonbacteria (1) 4 N2 0.1% (2) 80 0.2% (25) (859) Ca. Kuenenbacteria (1) H2S Ca. Komeilibacteria (1) 1.0% (23) CO2 Ca. Parcubacteria (43) Ca. Falkowbacteria (9) Ca. Kerfeldbacteria (19) Ca. Brennerbacteria (3) 60 – Ca. Uhrbacteria (40) NO CPR Ca. Magasanikbacteria (13) 3 Organic carbon Ca. Moranbacteria (1) Ca. Portnoybacteria (3) Ca. Spechtbacteria (2) 7.2% (43) 48% (116) 57% (200) ia 56% (153) 2– 87% Ca. Wildermuthbacteria (14) 40

e r Ca. Nealsonbacteria (8) – SO4 (363) 27% (101) t NO 37% (100) 17% (37) Ca. Staskawiczbacteria (5) 0.1% (1) 2 Acetate Ethanol Ca. Azambacteria (8) 2– Ca. Montesolbacteria (4) SO Ca. Yanofskybacteria (12) Pr3 (Nov 2017) NO 3 2– cuba c 7.5% (59) 38% (127) Ca. Jorgensenbacteria (1) 20 –1 S2O3 H2 a r Ca. Wolfebacteria (7) 0.07 mg l NH -N 29% (107) P Ca. Liptonbacteria (13) 4 –1 H O Ca. Colwellbacteria (9) <0.05 mg l NO -N 64% (199) N O 2 Ca. Harrisonbacteria (7) 3 2 Ca. Sungbacteria (20) 65.1 mg l–1 SO -S 62% (203) 0 CH Ca. Ryanbacteria (2) 0 4 11% (55) S 6.3% 4 Ca. Giovannonibacteria (6) + 1.1% (5) (25) Ca. Niyogibacteria (9) 439 genomes NH N 66% (229) 93% Ca. Terrybacteria (3) 4 2 Ca. Vogelbacteria (10) 0.7% (4) 42% (82) (399) Ca. Nomurabacteria (6) 32% (88) H2S CO2 Ca. Lloydbacteria (2) Ca. Yonathbacteria (5) Ca. Kaiserbacteria (22) Ca. Adlerbacteria (1) Ca. Taylorbacteria (5) – Ca. Zambryskibacteria (5) NO3 Organic carbon Micrarchaeota (18) 12% (13) Diapherotrites (7) 15% (9) 52% (43) 2– 65% (53) 91% Aenigmarchaeota (43) Mamarchaeota (2) SO (72) 37% (21) ANN – 4 Pacearchaeota (18) NO 23% (11) 12% (7) Woesearchaeota (114) 2 Acetate Ethanol D P 2– DPANN (4) I I II I I h 7 L h a B R SO H I I rA NO t d 1 / I d IV oS AB 3 A Id h 2– I 22% (18) c o acs a 0 hsA rB D o Pr4 (Aug 2018) pflD oBD I ABC o ds r D p p c 37% (30) ABC au AB

H ac l nxrAB fd h m /n i

S O norBC For m nosDZ 6/abfD / or m For m ack/p t

2 m a s rAB C s r e A BC 2 3 For m / For m m f c B /s q r so x B C Y mc rA BC H hE /

–1 B/nar G 35% (26) am o Cellulase Chitinase A pm o For m fDKG/nif H .1. 5 hzoA/hzsA nirKS / p_ FD H/f ae AD H Isoamylase xa F Pullulanase /co xM x /c d 0.12 mg l NH -N fd o e /m cr/K1 5 ac d 4 H O S m 6. 2 nap A ds rA B/ s o r /s d Glucoamylase Arabinosidase x E nr F 42% (34) 2 Cellobiosidase Alpha-amylase wB / o a pr A / s t/c y CN

N O Beta-xylosidase DK/v n

–1 Hexosaminidase c cdh D 1/ E 2 Beta-glucosidase pr p 6 Beta-mannosidase co S_ d Beta-galactosidase Beta-glucuronidase CH4 met- <0.05 mg l NO -N 8 3 0 75% (55) CH oG/f d d 13% Alpha-L-rhamnosidase 12% (6) S 4 Fermen- f

–1 A/m y DKG/ni f

abolism l/K1 8 63.7 mg l SO -S + (9) m Sulfur metabolism an f 4 58% (42) tation /f r NH4 N 94% 79 genomes 2 Alpha-D-xyloside xylohydrolase (77) 466/4h b

H S 25% (14) Mannan endo-l,4-beta-mannosidase Nitrogen metabolism 27% (18) CO K1 4 2 2 fdh A/f gh A Complex carbon Carbon metabolism degradation C1 metabolism

Fig. 3 | Metabolic profile of groundwater communities. a, Biogeochemical cycling diagrams profiling the community-level metabolic potential of three groundwater sites sampled in this study (all of the sites are shown in Extended Data Fig. 4). The total relative abundance of all genomes capable of carrying out the step, as well as the number of genomes containing the capacity for that step, are listed next to each metabolic step. Arrow sizes are drawn proportional to the total relative abundance of genomes capable of carrying out the metabolic step. b, Heat map of 746 CPR and DPANN genomes from this study (rows are phylum-level lineages, and the numbers in parentheses are the number of genomes recovered), showing the percentage of genomes containing key genes required for various metabolic and biosynthetic functions (columns).

(Fig. 2b, highlighted in grey) on the basis of the two sites at which genomes against a curated set of protein hidden Markov models the representative sequences were found. (HMMs)41 (Methods) and utilized genome relative coverage values To assess groundwater community similarity at the genome level, to compare metabolic profiles of whole communities (Fig. 3a and we used ANI to cluster35 the 2,007 genomes from this study with Extended Data Fig. 4). The metabolic profile of Ag groundwater is 3,044 genomes from previous studies of two groundwater sites rich in clearly differentiated from that of the pristine sites33 (Extended Data CPR/DPANN organisms: Crystal Geyser in Utah22,38,39 and an aquifer Fig. 5). In Ag groundwater, which receives heavy nitrogen input from adjacent to the Colorado River in Rifle, Colorado2,19. At the strain neighbouring agricultural activity, ammonia is oxidized by seven level (>99% ANI), there is very little similarity between genomes of Planctomycetes that are capable of anammox (comprising 8% of the the analysed sites; most pairs of sites share one or no strains despite community), resulting in low levels of ammonia in Ag groundwater the fact that sites Pr1 to Pr6 are located in close proximity (~1 km (Fig. 3a). The Ag community encodes greater capacity for nitrite oxi- between neighbouring sites) and multiple sites are hosted in plutonic dation than nitrate reduction, consistent measurements of high nitrate −1 −1 rock. The sole pair of sites that share more than a few strains (>99% (165 mg l NO3-N) and low nitrite (<0.05 mg l NO2-N) levels (Fig. ANI) is Pr1 and Pr7, which share 44 strains, including 7 CPR bacte- 3a). Most of the groundwater communities sampled have an incom- rial strains (Extended Data Fig. 3). It is unlikely that the aquifers of plete capacity for denitrification (Extended Data Fig. 4), with far fewer these two sites are connected as they lie on separate sides of Putah genomes encoding the required genes for the final step of nitrous Creek, a major hydrological feature. Furthermore, we do not attribute oxide reduction compared with the previous steps. Pr3 and Pr4 are this observed genome similarity to index hopping during sequenc- two sites with a greater capacity for nitrous oxide reduction compared ing (Methods). Even at the species level (>95% ANI40), most pairs of with the other groundwater communities, in addition to nitrogen analysed sites share no more than one species in common (Extended fixation, thiosulfate disproportionation, sulfide oxidation and carbon Data Fig. 3). The overall lack of genomic similarity between these fixation (Fig. 3a and Extended Data Fig. 4). Although Pr3 and Pr4 ten groundwater communities—at the phylum, species and strain have little species-level overlap, their similarity in community-level levels—indicates that there is a high degree of specialization based metabolic capacities may reflect their proximity (<1 km) and similar on local hydrogeochemical conditions. groundwater chemistry (levels of NH4-N, NO3-N, NO2-N and SO4-S). We specifically examined key metabolic marker genes in CPR and Roles of CPR/DPANN organisms in biogeochemical cycling. Next, DPANN genomes to assess what metabolic roles that they may have given the abundance of CPR and DPANN organisms, we sought to (Fig. 3b). The presence of the nitrite reductase nirK in 19 CPR and investigate the potential metabolic roles these organisms have in these 4 DPANN genomes as well as the presence of nosD in 11 DPANN eight groundwater communities. As most CPR/DPANN organisms genomes across sites suggest a complementary or accessory role of are predicted to be symbionts, it is probable that their metabolic roles many CPR/DPANN lineages in denitrification (consistent with pre- within a community vary with the metabolic capacities of their host vious identification of nirK genes in Parcubacteria21,42). Furthermore, organisms. To investigate this relationship, we profiled all recovered 13 DPANN genomes in Ag groundwater encode the small subunit

Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology 357 Articles NATuRE MICRObIOlOGy

a c d e 100 0.035 Non-CPR bacteria CPR bacteria 14 6 Planctomycetes (38; 350) Ca. Harrisonbacteria (42; 30) 80 0.030 12 (58; 14) Ca. Niyogibacteria (45; 26) Ca. Omnitrophica (63; 724) 5 Ca. Roizmanbacteria (35; 236) Ca. Uhrbacteria (45; 124) Ca. Omnitrophica (63; 24) 0.025 10 60 CPR bacteria lgnavibacteriae (38; 923) 4 Non-CPR bacteria Betaproteobacteria (64; 25) 0.020 8 DPANN archaea 40 3 Non-DPANN archaea 0.015 6 over time over 2 Total coverage (%) coverage Total 20 0.010 4 Total coverage (%) Total coverage (%) 2 1 0 0.005 r.m.s.d. of percentage coverage r.m.s.d. 0 0 0 1 Mar 2017 Sep 2017 Nov 2017 Feb 2018 Jun 2018 Ag sampling point Mar 2017 Sep 2017 Nov 2017 Feb 2018 Jun 2018 Mar 2017 Sep 2017 Nov 2017 Feb 2018 Jun 2018 Ag sampling point Ag sampling point b CPR bacteria Non-CPR bacteria f Community-level biogeochemical cycling 0.8 DPANN archaea 80 Non-DPANN archaea 80 Carbon Nitrogen Nitrogen fixation Organic carbon Ammonia oxidation oxidation 60 Nitrite oxidation Carbon fixation 60 Ethanol oxidation Nitrate reduction 0.4 Acetate oxidation Nitrite reduction 40 Nitric oxide 40 Hydrogen

NMDS2 reduction generation Nitrous oxide reduction 20 Methanogenesis 20 Total coverage (%) coverage Total (%) coverage Total Nitrite ammonification Methanotrophy 0 Anammox Hydrogen oxidation 0 0

Jun 2018 Jun 2018 –0.5 0 0.5 Mar 2017 Sep 2017 Nov 2017 Feb 2018 Mar 2017 Sep 2017 Nov 2017 Feb 2018 Ag sampling point Ag sampling point NMDS1 80 80 Sulfur B-0.1 S-0.2 Other Sulfide oxidation 0.4 Feb 2018 S-2.5 S-0.65 60 60 Jun 2018 Sulfur reduction Metal reduction S-0.1 Mar 2017 Sulfur oxidation Chlorate reduction Jun 2018 Sulfite oxidation Arsenate reduction Jun 2018 40 40 Jun 2018 Sep 2017 Sulfate reduction Arsenite oxidation Feb 2018 Nov 2017 Feb 2018 Sulfite reduction Selenate reduction

NMDS2 Feb 2018 0 Mar 2017 Thiosulfate Sep 2017 20 20

Mar 2017Sep 2017 Mar 2017 (%) coverage Total oxidation (%) coverage Total Jun 2018 Sep 2017 Thiosulfate Feb 2018 disproportionation Mar 2017 0 0

–0.5 0 0.5 NMDS1 Mar 2017 Sep 2017 Nov 2017 Feb 2018 Jun 2018 Mar 2017 Sep 2017 Nov 2017 Feb 2018 Jun 2018 Ag sampling point Ag sampling point

Fig. 4 | Ag groundwater microbial community over time. a, Relative abundances of non-CPR bacteria, CPR bacteria, DPANN archaea and non-DPANN archaea genomes (1,103 in total) in Ag over time. b, NMDS analysis of 1,103 Ag genome relative abundances in all size fractions over all time points. The positions of genomes in ordination space are shown in the top graph, while the positions of the samples in ordination space are shown in the bottom graph. In the bottom graph, B-0.1 refers to bulk filtration on a 0.1 µm filter (circles), S-2.5 refers to 2.5+ µm size fractions (triangles), S-0.65 refers to 0.65–2.5 µm size fractions (crosses), S-0.2 refers to 0.2–0.65 µm size fractions (plus signs), and S-0.1 refers to 0.1–0.2 µm size fractions (closed squares). c, Box plot showing the r.m.s.d. of the relative abundance of all of the genomes in the bulk filter (whole community on a 0.1 µm filter) over time. The median r.m.s.d. (orange line) is <0.001, indicating that there is little variation in relative abundance over time for individual genomes in Ag. d,e, The relative abundance over time for non-CPR bacteria (d) and CPR bacteria (e) that have an r.m.s.d. > 0.004. Genomes are identified in the legend by phylum, percentage GC and coverage in the original time point that the representative genome was derived from (the latter two in parentheses). f, The variation in Ag community-level capacity (total relative abundance of all genomes capable of a broad metabolic function) for carbon, nitrogen, sulfur and miscellaneous element cycling over time. For d–f, the blue and orange backgrounds indicate the rainy and dry seasons in Northern California, respectively. of nitrite reductase (nirD) but lack the catalytic large subunit nirB, at 99% ANI). Inspection of abundance patterns in individual suggesting that DPANN organisms have an accessory role in nitrite genomes with an r.m.s.d. > 0.004 (Fig. 4d,e) show a coabundance pat- reduction to ammonia. At Ag, Pr1 and Pr3, we found that 30 DPANN tern between a Planctomycetes organism and several CPR bacteria genomes and 3 CPR genomes encode sulfur dioxygenase sdo, while that, although certainly not conclusive, may result from a parasitic dozens of diverse CPR and DPANN genomes encode sat, cysC and CPR–host relationship. The co-occurrence of two Ignavibacteria and cysN, which are involved in sulfate reduction, suggesting a potential Betaproteobacteria organisms with an Uhrbacteria organism (Fig. role of CPR and DPANN organisms in transformations to sulfite. 4d) may be an indication of a commensal or mutualistic CPR–host relationship. These observed temporal trends merit further investiga- An agriculturally impacted groundwater site rich in CPR/DPANN tion to determine whether they reflect symbiotic relationships. organisms. After establishing the prevalence and metabolic roles of Examination of the changes in metabolic cycling capacities in Ag CPR and DPANN organisms in groundwater communities, we per- groundwater over time indicate that there is a higher community formed temporal and size filtration sampling of Ag groundwater (Fig. capacity (5–10% relative abundance) for organic carbon oxidation, 4a) to investigate how these characteristics change with time and envi- carbon fixation, fermentation, nitrite oxidation, nitric oxide reduction ronmental factors. Non-metric multidimensional scaling (NMDS) and sulfate reduction during the rainy season (Fig. 4f, blue back- ordination (Methods) shows that, as expected, most CPR and DPANN ground) compared with the dry season (Fig. 4f, orange back- genomes cluster together and away from other bacteria and archaea, ground). Furthermore, we see a greater increase in these metabolic distinguished by prevalence in the 0.1–0.2 µm fraction (Fig. 4b). There capacities during the 2016–2017 rainy season compared with the is no observable clustering of genomes by sampling time in ordina- 2017–2018 rainy season (Fig. 4f), which may reflect a major differ- tion space and the median root mean square deviation (r.m.s.d.) of ence in rainfall (more than 25 cm more in 2016–2017 versus 2017– genome relative abundances over time is ~0.002 (Fig. 4c), indicat- 2018)43. Overall, we found that Ag groundwater is an extremely ing a very stable community at the strain level (genomes dereplicated stable incubator for high abundance and diversity of both CPR and

358 Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

a b c d d

e b

e f 3.8 nm g

c 10 nm

f

10 nm

i j k l h i k

3.2 nm 3.2 nm 3.2 nm 3.2 nm

Fig. 5 | Imaging of Ag groundwater cells concentrated with tangential flow filtration. a, Image of larger host cell (white arrow) with multiple ultrasmall cells (magnified in b and e) attached. b, Magnification of the indicated area in a (white box) showing a chain of ultrasmall cells attached to the surface of the larger host cell. c, Magnification of the indicated area in b (black box) showing the contact region between two ultrasmall cells and the pili-like appendages decorating their surfaces (white arrowheads). d, Magnification of the indicated area in b (white box) showing the contact region between an ultrasmall cell and host cell. Pili-like appendages are indicated by white arrowheads. e, Magnification of the indicated area in a (white box) showing a single ultrasmall cell decorated by pili-like appendages (white arrowheads) attached to a host cell. Attachment may be mediated by pili-like appendages that extend from the ultrasmall cell into the host cell (dashed white boxes). f,g, The membrane from the ultrasmall cell in e (black box), with the membrane structure showing a clear periodicity measured to be 10 nm (f) as well as a periodicity of 3.8 nm that is evident in Fourier space (g; white arrows indicate repeating structure spacings). h, Image of a host cell with a single ultrasmall cell attached. Several pili-like appendages extend from the ultrasmall cell into the host cell (dashed white boxes and arrowheads). i,j, Magnification of the indicated area in h (white box) showing the host cell envelope, the outer layer of which exhibits a periodicity of 3.2 nm; a Fourier-transformed image is shown (j; white arrows indicate repeating structure spacings). k,l, Magnification of the indicated area in h (black box) showing the ultrasmall cell envelope, which also exhibits a periodicity of 3.2 nm in the outer layer; a Fourier-transformed image is shown (l; white arrows indicate repeating structure spacings). For d and e, the orange arrowheads indicate lines of high density observed at the contact interface. Scale bars, 1 μm (a), 200 nm (b, e and h) and 50 nm (c, d, f, i and k).

DPANN organisms, but the microbial community is not stagnant in Fig. 6), some of which extend into the corresponding host cell, its metabolic capacities, which vary between rainy and dry seasons. potentially mediating episymbiont–host interaction (Fig. 5, white dashed boxes). At the ultrasmall cell–host contact region shown in Pili-mediated episymbiotic interactions between ultrasmall cells Fig. 5e,h, the host cell envelope appears to be thickened, whereas the and hosts. Fundamental to understanding the wider role of CPR and episymbiont cell envelope is thinned. For multiple pairs of ultrasmall DPANN organisms in groundwater communities is characterizing cells and hosts, a line of higher density is observed at the cell inter- their relationships with hosts. With few exceptions, host organisms face (Fig. 5d,e,i, orange arrows), similar to what has previously been have not been conclusively identified and only a handful of studies observed at tight interfaces between archaeal ARMAN (DPANN) have performed high-resolution microscopy to directly image the cells and their hosts44. The host in Fig. 5a has physical associations between CPR/DPANN organisms and hosts in multiple ultrasmall cells directly attached to its cell envelope that natural environments15,44,45. To observe CPR/DPANN–host interac- appear to be in the process of dividing (Fig. 5b,e,f), raising the pos- tions in the Ag groundwater community in a near-native state, we sibility that CPR/DPANN replication is correlated with host attach- used tangential flow filtration (TFF) to gently concentrate cells from ment (discussed in the next section). Overall, cryo-TEM imaging of groundwater and preserved them by cryo-plunging them in liquid TFF-concentrated groundwater shows that some ultrasmall cells in ethane on-site for later characterization by cryo-TEM (Methods). Ag groundwater—which are likely to be CPR or DPANN organisms Many ultrasmall cells (longest dimension <500 nm) were on the basis of size—are episymbionts of prokaryotic hosts, attach- observed attached to the surface of larger host cells (Fig. 5a). ing through pili-like structures. The ultrasmall cells have cell envelopes that are decorated by pili An important question regarding the biology of CPR bacteria (Fig. 5, white arrows; a magnified view is shown in Extended Data relates to the nature of their cell envelope and the degree to which

Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology 359 Articles NATuRE MICRObIOlOGy it resembles that of their host cells. Genomic analysis indicates have significantly higher cell counts in the 2.5+ µm fraction versus that CPR bacteria cannot de novo synthesize fatty acids4 but do the 0.2–0.65 µm fraction (Extended Data Fig. 8). possess fatty-acid-based membrane lipids46, raising the possibil- Cryo-EM images of dividing, host-attached ultrasmall cells ity that CPR bacteria receive or building blocks from (Fig. 5a,b) suggest that attachment to a host may stimulate CPR/ host organisms. The ultrasmall cell in Fig. 5e has a surface layer DPANN cell division. To investigate this hypothesis, we calculated with a periodicity of 3.8 nm and 10 nm, but lacks an outer mem- instantaneous replication rates (iRep values49) for CPR genomes in brane expected for Gram-negative bacteria (Fig. 5f,g), consistent Ag groundwater (archaeal genomes were excluded as archaeal repli- with previous cryo-TEM images of groundwater CPR bacteria45. cation is not generally bidirectional). For reference, iRep = 1.0 indi- Meanwhile, the host’s cell envelope appears to have two lipid lay- cates that, on average, no cells represented by a genome are actively ers, suggesting a Gram-negative structure (Fig. 5d,e). In Fig. 5h, replicating, whereas iRep = 2.0 indicates that, on average, every cell from a different ultrasmall cell–host pair, we also resolve two lipid represented by a genome is creating one copy of its genome. We layers in the host cell envelope and no outer membrane in the found that at three Ag sampling points—March 2017, February ultrasmall cell. Interestingly, in this case, a periodicity of 3.2 nm is 2018 and June 2017—CPR organisms exhibit significantly higher detected in the outermost layers of both the host cell (Fig. 5i) and replication rates in the 0.65–2.5 µm fraction than the 0.1–0.2 µm the attached ultrasmall cell (Fig. 5k), but it is unclear whether the fraction (Fig. 6c), suggesting that host-attached CPR bacteria con- outer layers of the two cells have the same structure. On the basis sistently exhibit a higher replication rate than non-host-attached of an apparent lack of an outer membrane, the ultrasmall cells CPR bacteria. We found that CPR bacteria as a whole (measured observed in Ag groundwater have cell envelopes that do not resem- in the bulk filtered community) exhibited higher replication rates ble those of Gram-negative bacteria, but seemingly can attach to during the height of the 2016–2017 rainy season (March 2017) and Gram-negative hosts. the beginning of the next rainy season (September 2018) compared with during the height of the 2018 dry season (June 2018; Fig. 6d). Host attachment and replication of CPR/DPANN cells. Imaging Significant differences in bulk filtration replication rates were not of likely CPR/DPANN organisms directly attached to host cells led observed between any point and the height of the 2017–18 rainy us to investigate how widespread physical attachment is across the season (February 2018; Fig. 6d), which may be explained by the diversity of both radiations. We analysed the distribution of CPR/ >25 cm more rainfall during the 2016–2017 rainy season compared DPANN organisms among size fractions across five sites, which with during the 2017–2018 rainy season43. Together, these findings should reflect two factors—cell size and attachment to host cells. support the deduction that CPR cell replication is stimulated by Most outside the CPR and DPANN groups are host attachment and may be more prevalent during the rainy season present in the 0.65–2.5 µm or 2.5+ µm fractions (Fig. 4b). Owing to compared with the dry season. their small cell size (average ~0.2 µm diameter47), a CPR/DPANN cell present in the 2.5+ µm or 0.65–2.5 µm fraction is probably Discussion attached to a larger organism, whereas a CPR/DPANN cell pres- We sampled one agricultural and seven pristine groundwater sites ent in the 0.1–0.2 µm fraction is probably unattached. Substantial in Northern California that are situated in a range of rock types and coverage of CPR/DPANN genomes in the 2.5+ µm and 0.65–2.5 µm sourced from multiple aquifers. We recovered a total of 746 draft fractions indicate that a fair number of CPR/DPANN cells retain quality CPR and DPANN genomes that derive from most of the host attachment throughout the filtration process, and we consider major lineages within both radiations and from two apparently new it probable that pili penetrating from ultrasmall cells into the host phylum-level lineages within the CPR, hereafter named ‘Candidatus (Fig. 5) are strong enough to resist disruption. We therefore con- Genascibacteria’ and ‘Candidatus Montesolbacteria’. To our knowl- sider the distribution of CPR/DPANN organisms among size frac- edge, only two previous studies have recovered and compared CPR tions as indicative of the degree of host attachment. bacterial genomes across multiple groundwater sites21,50, and nei- To assess the distribution of organisms among size fractions, ther reported DPANN genomes. Very little species-level overlap the absolute number of cells represented by each genome was esti- (defined as >95% ANI) exists between genomes recovered from this mated from the genome relative abundance and the mass of DNA study and previous studies of Crystal Geyser22,39 and Rifle2,19 aqui- extracted (Methods). We observed high cell counts (>1028 cells) of fers, a finding that may reflect a combination of species adaptation CPR and DPANN genomes in 2.5+ µm and 0.65–2.5 µm fractions to different geochemical conditions of the groundwater system51–53, (Fig. 6a), representing a diverse range of lineages (Supplementary bottleneck effects and/or founder effects. Our findings suggest that Table 7). In the case of Ag on March 2017 and September 2017, cell characterization of microbiomes of additional groundwater sites— counts of diverse CPR and DPANN genomes (Extended Data Fig. 7) using 0.1 µm filters rather than 0.22 µm filters and binning of MAGs were several orders of magnitude higher in the 0.65–2.5 µm fraction to capture maximum CPR/DPANN diversity—is likely to reveal than in the 0.1–0.2 µm fractions (Fig. 6a). These cell count distribu- further diversity in the CPR and DPANN radiations. tions suggest that a host-attached lifestyle is common across diverse The pristine sites that we sampled serve as sources of local drink- CPR and DPANN lineages and across groundwater sites. ing water. Notably, at the time of sampling, the Pr2 site (Rattlesnake For most CPR and DPANN lineages, estimated cell counts were Spring), which has been a popular source of public drinking water significantly higher in the 0.1–0.2 µm fractions compared with the for over a century, contained more than 30% CPR bacteria and 3% 2.5+ µm fractions, whereas the other bacterial and archaeal lineages DPANN archaea, raising the possibility that CPR/DPANN organ- exhibited the reverse trend (paired t-test; Fig. 6b). Two notable isms in groundwater are the source for human-associated mem- exceptions are the CPR lineage ‘Candidatus Kerfeldbacteria’ and the bers. CPR bacteria have been detected in multiple human body DPANN lineage ‘Candidatus Pacearchaeota’, which were enriched sites and correlated with inflammatory bowel disease28, vaginosis54, in the 2.5+ µm fraction relative to the 0.1–0.2 µm fraction (paired periodontitis55–57 and herpes viral titres30,58, and DPANN archaea t-test, P = 0.027 and 0.021; Fig. 6b), indicating that a high frac- have been detected in lung fluids31. However, few genomes of tion of these populations is host-attached and/or the attachment is human-associated CPR or DPANN organisms exist, giving lim- more resistant to the disruptive effects of filtration compared with ited information about their role in human microbiomes and other CPR/DPANN lineages. Ca. Pacearchaeota genomes encode their relationship with environmental counterparts31. One recent especially minimal metabolic capacities among DPANN lineages study found remarkably low variation and high synteny between (Fig. 3b), suggesting a heavy dependence on host resources48. An human-associated and groundwater Saccharibacteria59, suggesting additional CPR lineage, ‘Candidatus Woesebacteria’, was found to the possibility that drinking water is a source of CPR bacteria in

360 Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

a b CPR bacteria DPANN archaea 8 Ag (Mar 2017) Ag (Sep 2017) Ag (Feb 2018) Ag (Jun 2018) CPR or DPANN , n = 6 , n = 294 Non-CPR, non-DPANN –2 40 6 –4 , n = 219 , n = 45 , n = 39 –2 36 –2 –2 , n = 117 –2

32 P = 1.8 × 10 P = 4.6 × 10 [cell count] 4 10 28 P = 1.6 × 10 , n = 33 P = 2.1 × 10 , n = 120 log , n = 87 , n = 21 , n = 105 , n = 42 P = 2.7 × 10 –2 –2 –3 –2 –3 –2 24 P = 4.7 × 10 t 2 Bulk Bulk Bulk Bulk 2.5+ 2.5+ 2.5+ P = 4.4 × 10 P = 3.1 × 10 P = 8.1 × 10 P = 2.6 × 10 P = 1.6 × 10 P = 3.3 × 10 0.1–0.2 0.1–0.2 0.1–0.2 0.1–0.2 0.65–2.5 0.2–0.65 0.65–2.5 0.2–0.65 0.65–2.5 0.2–0.65 0.65–2.5 0.2–0.65 0 Pr1 (Dec 2018) Pr3 (Sep 2018) Pr4 (Sep 2018) Pr7 (Nov 2018) 40 –2 36 32 [cell count]

10 –4 28 log 24 bacteria archaeota Bulk Bulk Bulk Bulk 2.5+ 2.5+ 2.5+ 2.5+ Firmicutes Chloroflexi Deltaproteo- Ca . Aenigm- Acidobacteria 0.65–2.5 0.2–0.65 0.65–2.5 0.2–0.65 0.65–2.5 0.2–0.65 0.65–2.5 Ca . Parcubacteria Ca . Liptonbacteria Ca . Niyogibacteria

Size fraction (µm) Ca . Kaiserbacteria Ca . Daviesbacteria Ca . Kerfeldbacteria Ca . Pacearchaeota c d t = 8.14 Phylum-level lineage 2.2 P = 1.3 × 10–8 0.65–2.5 µm 0.1–0.2 µm t = 2.79; P = 5.8 × 10–3 CPR bacteria All cells are making one t = 2.22; P = 2.8 × 10–2 2.0 Non-CPR copy of their genome 2.2 bacteria

1.8 All cells are making 2.0 one copy of their t = 4.09 t = 4.00 genome –3 –3 P = 2.7 × 10 P = 7.1 × 10 1.8 1.6 iRep

iRep 1.6 1.4 1.4

1.2 1.2 n = 81 n = 226 n = 17 n = 21 n = 24 n = 23 n = 101 n = 26 n = 27 n = 129 No cells are replicating No cells are 1.0 1.0 replicating Mar 2017 Sep 2017 Feb 2018 Jun 2018 Mar 2017 Sep 2017 Nov 2017 Feb 2018 Jun 2018 Ag time point Ag time point

Fig. 6 | Analysis of host attachment and growth rates of CPR/DPANN organisms. a, Estimated cell counts (log transformed) for all size fraction data collected in this study. Each size fraction shown corresponds to a single sampling event. It was logistically infeasible to perform size filtration at some sites, and some filters collected did not contain enough biomass for DNA sequencing. b, Results from a two-sided paired t-test on estimated cell counts of genomes in the largest (2.5+ µm) and smallest (0.1–0.2 µm) size fractions after serial size filtration of Ag groundwater. A positive t statistic indicates enrichment of cells in the 2.5+ µm compared with the 0.1–0.2 µm fraction. Values listed above each bar are the calculated P value and sample size (n, number of genomes tested) for each phylum-level lineage. c, Calculated iRep values for CPR bacteria genomes in the 0.65–2.5 µm fraction versus the 0.1–0.2 µm fraction, across all Ag sampling points. n = 28 (March 2017), n = 8 (September 2017), n = 11 (February 2018) and n = 8 (June 2018) genomes tested. Note that iRep values represent the average replication state of the cell population represented by a genome. An iRep value of 1.0 indicates that, on average, no cells in the population are actively replicating, whereas an iRep value of 2.0 indicates that, on average, every cell is actively creating one copy of its genome. The statistically significant results (P < 0.05) of a two-sided paired t-test on iRep values between the two size fractions are shown above the box plots. Note that the November 2017 time point was excluded because only bulk filtration (no size filtration) was performed. d, Calculated iRep values for Ag bacteria caught in the bulk 0.1 µm filter (whole-community filtration). The statistically significant results (P < 0.05) of independent two-sided t-tests on iRep values of CPR bacteria between all possible pairs of sampling points are shown above the box plots. For the box plot, the centre line is the median; the top and bottom lines are the first and third quartiles, respectively; and the whiskers show 1.5× the interquartile range; individual dots are outliers; n values (number of genomes tested) are indicated on the plot. the human oral cavity25–27. However, all of these human-associated bacteria, it will be necessary to sequence groundwater sites together Saccharibacteria genomes derive from people who use tap water with the microbiomes of specific humans who use the groundwater (although CPR bacteria can persist in drinking water after treat- versus tap water as their primary source of drinking water. ment25–27) rather than groundwater as a drinking source. To inves- The Ag groundwater microbial community, which includes tigate whether groundwater is a source of human-associated CPR organisms from most CPR and DPANN lineages, is extremely stable

Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology 361 Articles NATuRE MICRObIOlOGy at the strain level (>99% ANI). This stability may be due to consis- To extract DNA, either a quarter or a half of a filter was placed in PowerBead solution tent, heavy input of carbon and nitrogen from agricultural waste from the Qiagen DNeasy PowerSoil kit (no bead-beating was performed), then vortexed for 10 min with massaging to remove cells from the entire filter surface. (cow manure) collected on-site in lagoons, dried piles and used to After vortexing, the filter was removed, solution C1 (Qiagen DNeasy PowerSoil Kit) 33 fertilize the on-site corn field . Although the Ag community com- was added to the PowerBead solution and the solution was placed in a 65 °C water position is stable, increases in community metabolic capacities and bath for 30 min. The rest of the DNA extraction procedure was performed according CPR bacterial replication rates occur during rainy seasons. Several to the Qiagen DNeasy PowerSoil kit manufacturer’s instructions, beginning with the factors may contribute to these changes during the rainy season: the addition of solution C2. Ethanol precipitation was performed to concentrate and purify the extracted DNA before sequencing. Genomic DNA was quantified using onset of more anoxic conditions in the groundwater, greater runoff the Qubit dsDNA High Sensitivity assay and, when quantity permitted, DNA quality from agricultural waste piles, increased volume of and changes in was assessed using agarose gel electrophoresis. Library preparation and sequencing microbial composition of the cow manure after calves are born in were performed at the California Institute for Quantitative Biosciences’ (QB3) the spring, and soil changes associated with the adjacent corn field genomics facility and the Chan Zuckerberg BioHub’s sequencing facility. Libraries that supplies much of the recharge to the sampled Ag well33. Analysis were prepared with target insert sizes of 400–600 bp. Samples were sequenced using 150 bp paired-end reads on either a HiSeq 4000 platform or a NovaSeq 6000 of coabundance patterns over time indicated potential parasitic as platform, with a read depth of ~10 Gbp per sample except for Ag March 2017 well as commensal/mutualistic relationships between several CPR samples, which were sequenced at 150 Gbp. lineages and Planctomycetes, Ignavibacteria and Betaproteobacteria hosts, although more investigation is required to directly connect Metagenomic assembly. BBTools (v.38.78) was used to remove Illumina adapters 60 these observations to symbiotic relationships. These observations as well as PhiX and other Illumina trace contaminants . Reads were trimmed using Sickle61 (v.1.33) using the default quality threshold of 20 (quality type set to sanger, provide a starting point for targeted cultivation of CPR and DPANN which is CASAVA v.1.8 or higher). Each physical filter was considered to be an organisms based on conditions favourable to growth of puta- independent sample, that is, metagenomic reads from a single filter were assembled tive hosts. The recovery of diverse DPANN but few non-DPANN together, rather than coassembing total reads from all filters/size fractions. archaeal genomes from Ag groundwater poses the intriguing ques- Assembly was performed using MEGAHIT (v.1.2.9) with the default parameters62. tion of whether bacteria may serve as hosts for DPANN archaea. Assembled contigs were then scaffolded using the scaffolding function from IDBA-UD63 (v.1.1.3). Scaffold coverage values were calculated as the ratio of total An important aspect of our study was the use of cryo-TEM, a length of mapped reads to the total length of the scaffold, using bowtie2 (v.2.3.5.1)64 technique that has only rarely been applied to study environmental for mapping. Only scaffolds of >1 kb in length were considered for gene prediction communities in a near-native state15,44,45, to observe physical attach- and genome binning. Gene prediction was performed using Prodigal (v.2.6.3) using ment between the ultrasmall cells and hosts in Ag groundwater. the ‘meta’ option65 and genes were annotated using USEARCH66 (v.10.0.240) against 67,68 69 70 Combined with genomic analysis of CPR and DPANN cell counts the KEGG , Uniref100 (ref. ) and UniProt databases. 16S rRNA genes were identified using a custom HMM2 (16SfromHMM.py, available at GitHub (https:// in serial size fractions, our data suggest that physical attachment to github.com/christophertbrown/bioscripts)) and insertions of 10 bp or greater were host organisms is a common lifestyle in both radiations, with the removed. Prediction of tRNA genes was performed using tRNAscan-SE71 (v.1.3.1). lineages Ca. Kerfeldbacteria in the CPR and Ca. Pacearchaeota within DPANN exhibiting particularly strong physical attachment Genome binning, curation, dereplication and coverage calculation. Scaffolds to hosts relative to other CPR and DPANN organisms. On the basis longer than 1 kb only were considered for protein annotation and binning. Scaffolds were binned on the basis of GC content, coverage, presence/copies of of replication-rate analysis and with cryo-TEM imaging of dividing ribosomal proteins and single-copy genes, taxonomic profile, tetranucleotide host-attached ultrasmall cells, higher CPR instantaneous replication frequency and patterns of coverage across samples. On ggKbase (https://ggkbase. rates are associated with physical attachment to hosts, suggesting berkeley.edu/), protein annotations were performed using USEARCH (v.10.0.240) that the availability of host-supplied resources may stimulate repli- against the KEGG, UniRef100 and UniProt databases as well as against an internal cation of CPR organisms. One recent study instead concluded that database comprised of publicly available genomes from NCBI. Scaffold taxonomic profiles were then determined on the basis of a voting scheme, whereby the there is no widespread attachment of CPR bacteria to hosts on the winning taxonomic profile had to have more than 50% of protein ‘votes’ for each basis of failure to detect co-occurring CPR and host genomic sig- taxonomic rank on the basis of protein annotations. A combination of manual natures in SAGs47. However, we believe the incompleteness of the binning on ggKbase (https://ggkbase.berkeley.edu/) and automated binning using reported SAGs and the small absolute number of organisms anal- CONCOCT72 (v.1.1.0), Maxbin273 (v.2.2.7) and Abawaca2 (v.1.07) was used to ysed per site render the results inconclusive. Our study highlights generate candidate bins for each sample. The best bins were determined using DASTool 34 (v.1.1.1) and manually checked using ggKbase to remove incorrectly the need for high-quality MAGs and high-resolution microscopy to assigned scaffolds according to the criteria listed above. Bacterial genomes were assess interactions among community members in a robust fashion. then filtered for completeness (>70%) using a set of 43 single copy genes previously used for the CPR2,19, and archaeal genomes were filtered using 48 single-copy genes Methods for DPANN. Contamination was assessed using checkM74 (<10%; Supplementary Groundwater sampling, chemistry measurements and surface geology Table 2). The program dRep35 (v.2.5.3) was used to dereplicate genomes from determination. All groundwater sites were sampled at shallow depths (<100 m below each site at 99% ANI (strain level), resulting in a representative set of 2,007 the surface). Groundwater was pumped from each well using a submersible pump genomes across all sites. The median estimated genome completeness of each site’s (Geotech Environmental Equipment) into a sterile container, and then pumped using representative set is over 90%, with 18–58% of each site’s raw reads mapping back a peristaltic pump into an apparatus that was custom built for fltering high volumes to the representative set (Supplementary Table 1). Singlefold coverage values for of water (Harrington Industrial Plastics) at a rate of 3.8–7.6 l min−1. Before fltration, genomes were calculated as the ratio of the total length of mapped reads (bowtie2 at least 100 l of water was pumped to purge the well volume and to fush the system. v.2.3.5.1) to the total length of the genome. Polyethersulfone membrane flter cartridges designed for high-volume fltration (Graver Technologies) from the ZTEC G series (0.1 µm and 0.2 µm), ZTEC B series Phylogenetic classification. Genomes with a clear taxonomic classification on (0.65 µm) and PMA series (2.5 µm) were used. When a sufcient volume of water had the basis of the internal ggKbase database (>50% of the genome sequence had a been fltered (400 l for bulk fltration and an additional 800 l for serial size fltration), clear scaffold-level taxonomic winner, based on best matches of protein sequences flters were removed and stored on dry ice. Filters were stored in a −80 °C freezer to those in genomes of a taxonomically comprehensive database) were classified until processed. Te surface geology of each sampling site was determined from the according to their predicted ggKbase . For genomes without a clear California Department of Conservation’s 2010 geological map of California (https:// predicted ggKbase taxonomy, phylogenetic analysis was performed using several maps.conservation.ca.gov/cgs/gmc/), rock fragments recovered during drilling (Pr4) marker sets as follows: concatenated ribosomal proteins (encoded by a syntenic and by on-site geological surveys. Pumped groundwater was shipped on dry ice to block of genes and selected to avoid binning error chimaeras), rpS3 proteins and the UC Davis Analytical Laboratory for water chemistry measurements of electrical 16S rRNA genes (for CPR bacteria). Reference sequences for all of the phylogenetic conductivity (EC), sodium adsorption ratio (SAR), total organic carbon (TOC), trees were taken from previously published studies that recovered many dissolved organic carbon (DOC), NH4-N, NO3-N, SO4-S (soluble S), HCO3, CO3, high-quality CPR and DPANN genomes2,3,19,22. soluble Zn, Cu, Mn, Fe, Cd, Cr, Pb, Ni, K, Ca, Mg, Na, Cl and B. Water chemistry The concatenated ribosomal protein set for bacteria includes 15 proteins measurements are shown in Supplementary Table 5. (L2, L3, L4, L5, L6, L14, L15, L18, L22, L24, S3, S8, S10, S17 and S19), whereas the archaeal set includes 14 proteins (the bacterial set without S10, which is missing DNA extraction and sequencing. The plastic housing was removed from the filter from many archaeal genomes). Ribosomal proteins were identified by searching cartridges under sterile conditions and the filters were retained for DNA extraction. predicted open reading frames (ORFs) against ribosomal protein databases using

362 Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

USEARCH66. For each individual ribosomal protein, hits and reference sequences 1. Te presence/absence of relevant genes (for example, either the large or small were aligned to the Pfam HMM model using hmmalign from HMMer75 (v.3.3), RuBisCo subunit, phosphoribulokinase, phosphoglycerate kinase) was deter- alignments were converted from the Stockholm format to FASTA and insertions mined by profling against a custom set of HMMs, utilizing Kofam-suggested added by hmmalign were stripped. All individual ribosomal protein alignments cut-of values for Kofam HMMs and custom cut-of values for TIGRfam, were concatenated together, and concatenated sequences with an ungapped Pfam and custom HMMs. Custom cut-ofs were chosen by adjusting noise length of greater than 1,100 residues were combined with reference cut-ofs and trusted cut-ofs to avoid potential false-positive hits19. sequences to build a maximum-likelihood tree using IQ-Tree (v.1.6.12; iqtree -s 2. Te presence/absence of each reaction in the relevant KEGG module was -st AA -nt 48 -bb 1000 -m LG+G4+FO+I). determined by combinations of key genes (as defned by the KEGG database). For rpS3 gene phylogenetic analysis, rpS3 genes were identified using a For example, the KEGG reaction R00024 (the carboxylation of RuBP by custom HMM with an HMM alignment score cut-off of 40 (ref. 36). Identified RuBisCo) in the KEGG module M00165 (the Calvin–Benson–Bassham cycle) rpS3 genes were aligned with rpS3 reference sequences using mafft76 (using the is considered present only if the genome contains a hit for either the large or default parameters) and columns with >95% gaps were removed with trimal77. The small subunits of RuBisCo (KEGG entries K01601–K01602). alignment was used to build a maximum likelihood tree using IQ-Tree (iqtree -s 3. A given KEGG module was considered to be present if genes identifed for -st AA -nt 48 -bb 1000 -m LG+G4+FO+I). >75% of the reactions in the module were present. Tis 75% cut-of was cho- For 16S rRNA gene phylogenetic analysis of CPR bacterial genomes, 16S rRNA sen to refect the fact that MAGs, which are in most cases neither complete genes were identified using a custom HMM2 (using 16SfromHMM.py, available nor circularized (in our case, we have a 70% cut-of for genome complete- at GitHub (https://github.com/christophertbrown/bioscripts)) and insertions of ness), will have incomplete metabolic pathways. 10 bp or greater were removed (using strip_masked.py from https://github.com/ 4. Finally, a genome was considered to have broad metabolic capacity (carbon christophertbrown/bioscripts). Sequences with lengths of >800 bp were used for fxation) if any relevant KEGG module was present (CBB pathway, 3HP cycle, phylogenetic analysis. SSU-align was used to align 16S sequences from this study 3HP/4HB cycle, Wood Ljungdahl pathway or reverse tricarboxylic acid cycle). with reference sequences from the previous studies mentioned above as well as Te results from METABOLIC for each site are provided in Supplementary CPR bacteria sequences from SILVA database78. The resulting alignment was used Tables 8–15. to build a maximum-likelihood tree using RAxML-HPC BlackBox79 (v.8.2.12) on the CIPRES Science Gateway80 with the general time reversible model of nucleotide substitution (raxmlHPC-HYBRID -T 4 -s infile -N autoMRE -n result -f a -p 12345 Cryo-TEM sample preparation in the field. Cryo-TEM samples were prepared -x 12345 -m GTRCAT). onsite at the Ag dairy farm on 5 February 2018. Approximately 30 l of pumped Ag Genomes forming the new phylum-level lineages ‘Candidatus Genascibacteria’ groundwater was concentrated to a final volume of ~5 ml, using TFF (Millipore and ‘Candidatus Montesolbacteria’ were identified on the basis of the following Pellicon Cassette Standard Acrylic Holder) with a 30 kDa ultrafiltration cassette criteria: (1) they formed a monophyletic group in the 16S rRNA gene phylogeny; (Millipore Pellicon 2 Biomax). Aliquots of 5 μl were taken directly from the (2) 16S rRNA genes shared less than 76% sequence identity to the closest suspensions and deposited onto 300 mesh lacey carbon coated Cu-grids (Ted representatives; (3) they were also supported by the concatenated ribosomal Pella, 01895) that had been treated by glow discharge within 24 h. Grids were blotted with filter paper and plunged into liquid ethane held at liquid nitrogen protein phylogeny; and (4) more than one representative draft genome was 85 available. A list of ‘Candidatus Genascibacteria’ and ‘Candidatus Montesolbacteria’ temperatures using a portable, custom-built cryo-plunging device . Plunged grids genomes and ANI with closest 16S rRNA hits from SILVA78 is provided in were stored in liquid nitrogen before transfer to the microscope and maintained at Supplementary Table 3. 80 K during acquisition of all datasets.

Ordination analysis. Principal component analysis was performed on rpS3 Cryo-TEM imaging. Imaging was performed using a JEOL–3100-FFC electron relative coverage values and on the metabolic capacities of whole communities microscope (JEOL) equipped with a FEG electron source operating at 300 kV. (the summed relative coverage values of genomes encoding a particular metabolic An Omega energy filter (JEOL) attenuated electrons with energy losses that transformation). Principal component analysis was performed using the exceeded 30 eV of the zero-loss peak before detection by a Gatan K2 Summit FactoMineR package81 and visualized using factoextra82. Relative abundance values direct electron detector. Dose-fractionated images were acquired with a pixel size 1 2 were scaled to unit variance before the calculation of the principal components. of 3.41 Å px− using a dose of 7.27 e− Å− per frame. Data were collected using the NMDS analysis was performed on normalized read counts (reads per million total Gatan Microscopy Suite (v.3.4.1) and SerialEM (v.3.7). Up to 30 frames per image 86 reads) for all genomes from Ag groundwater, based on read mapping with BBMap60. were aligned and averaged using IMOD (v.4.9) and image contrast was adjusted NMDS analysis was performed using the metaMDS function in the Vegan package in ImageJ (v.2.0.0). for R83, using the default parameters. In brief, the data were transformed using Wisconsin double standardization of the square root of the matrix, followed by Analysis of cell distribution across serial size filters. To analyse the distribution construction of a Bray–Curtis dissimilarity matrix, then an NMDS with 20 random of Ag cells across size fractions, we needed to estimate total cell counts, whereas starts. Finally, the results were scaled to maximize variation to the first principal sequencing data can generate only relative abundance values (in the absence of component. Results were visualized using the ggplot2 package for R84. an internal standard). We began with the general equation: total cell count of a genome = relative abundance from sequencing × microbial load, an approach that Assessing index hopping between Pr1 and Pr7. Pr1 and Pr7 were the only pair of has been discussed and tested in depth previously87. Our method takes the form: analysed sites that shared more than a few strains (44 pairs of genomes with >99% c = x × l × m where c is the total cell count of a genome; x is the relative coverage of ANI). These two sites are separated by Putah Creek, a major hydrological feature, a genome; l is the total cell counts of all community members per ng of DNA in the and so are unlikely to be fed from the same aquifer. Although DNA from Pr1 and community; and m is the ng of DNA extracted from the size fraction. The term l × m Pr7 was sequenced on the same NovaSeq 6000 lane, we do not attribute this strain estimates microbial load, that is, the total cell count of all members in a community. overlap to index hopping, as dual indexing was used and reads with mismatched In our method, we utilize DNA yield (measured variable m in our equation) indices were not analysed, reducing the already low incidence of index hopping as an estimate of microbial load in a sample. DNA yield is an imperfect estimate (<2% of reads). Furthermore, although the 44 genome pairs share >99% ANI, they of true microbial load for a number of reasons, including potential ploidy88 and are not identical, differing in sequence by up to 10,000 bp per Mb of genome. bias in sequencing representation depending on the DNA extraction method89. However, there are also limitations and problems with other estimates of Genome and community-level metabolic predictions. To analyse the metabolic microbial load, such as flow cytometry-based cell counting90. Given that we capacity of the sampled groundwater communities at both the genome and extracted all samples in this study using the same DNA extraction kit and have community level, the program METABOLIC41 (v.4.0) was used to search predicted fluorometry-based measurements of DNA yield (Qubit dsDNA HS Assay), we ORFs against a curated set of KEGG, TIGRfam, Pfam and custom HMM profiles chose to use DNA yield as the best available measurement of microbial load. corresponding to key marker genes for biogeochemical cycling. For specific sets Fluorometry-based quantification of DNA yield measures DNA mass (that of proteins that are often misannotated due to high sequence similarity despite is, the number of double-stranded DNA base pairs). Meanwhile, the relative divergent function (for example, amoABC/pmoABC), an additional motif-validation abundance of a genome (relative coverage) is proportional to the relative fraction step was performed in which sequences were searched for conserved residue of total cells represented by the genome, rather than the relative fraction of total patterns indicative of either amoABC or pmoABC. On the basis of the presence/ DNA represented by the genome. For example, a CPR genome with a relative absence of this manually curated set of marker genes, the presence/absence of abundance of 1% will constitute less than 1% of the total DNA yield from a metabolic capacities encoded by each genome was determined, and the number and groundwater community, owing to its smaller genome size than other members of relative abundance of genomes in the community that encode a metabolic capacity the community. To account for genome-size-dependent DNA yield, we calculated were calculated. The biogeochemical cycling diagrams shown in Fig. 4 and Extended how many microbial cells would correspond to 1 ng of DNA on the basis of the Data Fig. 4 are based off this manually curated set of key marker genes. genome sizes of each genome recovered from the community (parameter l in our In addition to marker gene analysis, METABOLIC was also used to evaluate equation). The molecular weight of each genome calculated as number of base the completeness of KEGG modules for key biogeochemical cycling processes. In pairs × 650 Da per base pair. The relative coverage of a genome in a given size brief, the capacity of a genome for a broad metabolic function (for example, carbon fraction was calculated as the total length of reads mapping to the genome divided fixation) was determined using the following steps: by the total length of the genome (mapping was performed with bowtie2)64.

Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology 363 Articles NATuRE MICRObIOlOGy

To find significant differences in cell counts between two given size fractions 18. Gong, J., Qing, Y., Guo, X. & Warren, A. ‘Candidatus Sonnebornia yantaiensis’, (that is, 2.5+ µm versus 0.1–0.2 µm), paired t-tests were performed on cell counts a member of candidate division OD1, as intracellular bacteria of the ciliated from each phylum with more than 5 representative genomes and with cell count Paramecium bursaria (Ciliophora, Oligohymenophorea). Syst. Appl. distributions in each size fraction that did not deviate significantly from normality Microbiol. 37, 35–41 (2014). (assessed by plotting cell count distributions and performing a Shapiro–Wilks test). 19. Anantharaman, K. et al. Tousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. iRep analysis. Instantaneous replication rates were calculated for Ag bacterial 7, 13219 (2016). genomes using iRep49 (v.1.1.14) with a tolerance of three mismatches per read. Reads 20. Spang, A., Caceres, E. F. & Ettema, T. J. G. Genomic exploration of the from each size fraction were mapped to the bacterial genomes using bowtie2 (ref. 64). diversity, ecology, and evolution of the archaeal of . Science 357, eaaf3883 (2017). Reporting Summary. Further information on research design is available in the 21. Danczak, R. E. et al. Members of the candidate phyla radiation are Nature Research Reporting Summary linked to this article. functionally diferentiated by carbon- and nitrogen-cycling capabilities. 5, 112 (2017). Data availability 22. Probst, A. J. et al. Diferential depth distribution of microbial function and NCBI accession numbers for metagenome reads and metagenome assembled putative symbionts through sediment-hosted aquifers in the deep terrestrial genomes (BioProject: PRJNA640378) are provided in Supplementary Table 18. subsurface. Nat. Microbiol. 3, 328–336 (2018). Metagenome assembled genomes are also available online (http://ggkbase.berkeley. 23. Herrmann, M. et al. Predominance of Cand. Patescibacteria in groundwater is edu/all_nc_groundwater_genomes; please note that it is necessary to register for an caused by their preferential mobilization from soils and fourishing under account by provision of an email address before download). oligotrophic conditions. Front. Microbiol. 10, 1407 (2019). 24. Vigneron, A. et al. Ultra-small and abundant: candidate phyla radiation bacteria are potential catalysts of carbon transformation in a thermokarst lake Code availability ecosystem. Limnol. Oceanogr. 7, 13219 (2019). Identification of 16S rRNA genes and removal of insertions was performed using 25. Pinto, A. J., Schroeder, J., Lunn, M., Sloan, W. & Raskin, L. Spatial-temporal the custom scripts 16SfromHMM.py and strip_masked.py, which are available survey and occupancy-abundance modeling to predict bacterial community in the ctbBio Python package (https://github.com/christophertbrown/bioscripts/ dynamics in the drinking water microbiome. mBio 5, e01135-14 (2014). blob/master/ctbBio/16SfromHMM.py and https://github.com/christophertbrown/ 26. Bautista-de los, Q. et al. Emerging investigators series: microbial communities bioscripts/blob/master/ctbBio/strip_masked.py). Identification of rpS3 genes in full-scale drinking water distribution systems—a meta-analysis. Environ. was performed using an HMM trained as previously described36 on a published Sci. Water Res. Technol. 2, 631–644 (2016). alignment of rpS3 sequences from across the tree of life3. The custom HMMs 27. Bruno, A. et al. Exploring the under-investigated ‘microbial dark matter’ of used to identify key genes in metabolic cycling are available in the METABOLIC drinking water treatment . Sci. Rep. 7, 44350 (2017). program (https://github.com/AnantharamanLab/METABOLIC). 28. Kuehbacher, T. et al. Intestinal TM7 bacterial phylogenies in active infammatory bowel disease. J. Med. Microbiol. 57, 1569–1576 (2008). Received: 24 May 2020; Accepted: 20 November 2020; 29. Kowarsky, M. et al. Numerous uncharacterized and highly divergent microbes Published online: 25 January 2021 which colonize humans are revealed by circulating cell-free DNA. Proc. Natl Acad. Sci. USA 114, 9623–9628 (2017). 30. Urbaniak, C. et al. Te infuence of spacefight on the astronaut salivary References microbiome and the search for a microbiome biomarker for viral reactivation. 1. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial Microbiome 8, 56 (2020). dark matter. Nature 499, 431–437 (2013). 31. Koskinen, K. et al. First insights into the diverse human archaeome: specifc 2. Brown, C. T. et al. Unusual biology across a group comprising more than detection of archaea in the gastrointestinal tract, lung, and nose and on skin. 15% of domain Bacteria. Nature 523, 208–211 (2015). mBio 8, e00824-17 (2017). 3. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 32. Bar-On, Y. M., Phillips, R. & Milo, R. Te biomass distribution on Earth. 16048 (2016). Proc. Natl Acad. Sci. USA 115, 6506 (2018). 4. Castelle, C. J. & Banfeld, J. F. Major new microbial groups expand diversity 33. Ludington, W. B. et al. Assessing biosynthetic potential of agricultural and alter our understanding of the tree of life. Cell 172, 1181–1197 (2018). groundwater through metagenomic sequencing: A diverse anammox 5. Wrighton, K. C. et al. Fermentation, hydrogen, and sulfur metabolism in community dominates nitrate-rich groundwater. PLoS ONE 12, multiple uncultivated . Science 337, 1661–1665 (2012). e0174930 (2017). 6. Kantor, R. S. et al. Small genomes and sparse metabolisms of 34. Sieber, C. M. K. et al. Recovery of genomes from metagenomes via a sediment-associated bacteria from four candidate phyla. mBio 4, dereplication, aggregation and scoring strategy. Nat. Microbiol. 3, e00708-13 (2013). 836–843 (2018). 7. Wrighton, K. C. et al. Metabolic interdependencies between phylogenetically 35. Olm, M. R., Brown, C. T., Brooks, B. & Banfeld, J. F. dRep: a tool for fast and novel fermenters and respiratory organisms in an unconfned aquifer. ISME J. accurate genomic comparisons that enables improved genome recovery from 8, 1452–1463 (2014). metagenomes through de-replication. ISME J. 11, 2864–2868 (2017). 8. Nelson, W. C. & Stegen, J. C. Te reduced genomes of Parcubacteria (OD1) 36. Diamond, S. et al. Mediterranean grassland soil C–N compound turnover is contain signatures of a symbiotic lifestyle. Front. Microbiol. 6, 713 (2015). dependent on rainfall and depth, and is mediated by genomically divergent 9. Castelle, C. J. et al. Genomic expansion of domain archaea highlights roles for microorganisms. Nat. Microbiol. 1, 1356–1367 (2019). organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, 37. Yarza, P. et al. Uniting the classifcation of cultured and uncultured bacteria 690–701 (2015). and archaea using 16S rRNA gene sequences. Nat. Rev. Microbiol. 12, 10. He, X. et al. Cultivation of a human-associated TM7 phylotype reveals a 635–645 (2014). reduced genome and epibiotic parasitic lifestyle. Proc. Natl Acad. Sci. USA 38. Emerson, J. B., Tomas, B. C., Alvarez, W. & Banfeld, J. F. Metagenomic 112, 244–249 (2015). analysis of a high carbon dioxide subsurface microbial community populated 11. Cross, K. L. et al. Targeted isolation and cultivation of uncultivated bacteria by chemolithoautotrophs and bacteria and archaea from candidate phyla. by reverse genomics. Nat. Biotechnol. https://doi.org/10.1038/s41587-019- Environ. Microbiol. 18, 1686–1703 (2016). 0260-6 (2019). 39. Probst, A. J. et al. Genomic resolution of a cold subsurface aquifer 12. Huber, H. et al. A new phylum of Archaea represented by a nanosized community provides metabolic insights for novel microbes adapted to high

hyperthermophilic symbiont. Nature 417, 63–67 (2002). CO2 concentrations. Environ. Microbiol. 19, 459–474 (2017). 13. Wurch, L. et al. Genomics-informed isolation and characterization of a 40. Olm, M. R. et al. Consistent metagenome-derived metrics verify and symbiotic Nanoarchaeota system from a terrestrial geothermal environment. delineate bacterial species boundaries. mSystems 5, e00731-19 (2020). Nat. Commun. 7, 12115 (2016). 41. Zhou, Z., Tran, P., Liu, Y., Kief, K. & Anantharaman, K. METABOLIC: a 14. St John, E. et al. A new symbiotic nanoarchaeote (Candidatus Nanoclepta scalable high-throughput metabolic and biogeochemical functional trait minutus) and its host (Zestosphaera tikiterensis gen. nov., sp. nov.) from a profler based on microbial genomes. Preprint at bioRxiv https://doi. New Zealand hot spring. Syst. Appl. Microbiol. https://doi.org/10.1016/j. org/10.1101/76164 (2019). syapm.2018.08.005 (2018). 42. Speth, D. R., In’t Zandt, M. H., Guerrero-Cruz, S., Dutilh, B. E. & Jetten, M. 15. Baker, B. J. et al. Enigmatic, ultrasmall, uncultivated Archaea. Proc. Natl Acad. S. M. Genome-based microbial ecology of anammox granules in a full-scale Sci. USA 107, 8806–8811 (2010). wastewater treatment system. Nat. Commun. 7, 11172 (2016). 16. Golyshina, O. V. et al. ‘ARMAN’ archaea depend on association with 43. Historical Rainfall Data for Years 1888 to 2020 (Modesto Irrigation District, euryarchaeal host in culture and in situ. Nat. Commun. 8, 60 (2017). accessed March 2020); https://www.mid.org/weather/historical.jsp 17. Hamm, J. N. et al. Unexpected host dependency of Antarctic 44. Comolli, L. R. & Banfeld, J. F. Inter-species interconnections in acid mine Nanohaloarchaeota. Proc. Natl Acad. Sci. USA 116, 14661–14670 (2019). drainage microbial communities. Front. Microbiol. 5, 367 (2014).

364 Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

45. Luef, B. et al. Diverse uncultivated ultra-small bacterial cells in groundwater. 78. Quast, C. et al. Te SILVA ribosomal RNA gene database project: improved Nat. Commun. 6, 6372 (2015). data processing and web-based tools. Nucleic Acids Res. 41, D590–D596 (2013).

46. Probst, A. J. et al. Lipid analysis of CO2-rich subsurface aquifers suggests an 79. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and autotrophy-based deep biosphere with lysolipids enriched in CPR bacteria. post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014). ISME J. https://doi.org/10.1038/s41396-020-0624-4 (2020). 80. Miller, M. A., Pfeifer, W. & Schwartz, T. Creating the CIPRES science 47. Beam, J. P. et al. Ancestral absence of electron transport chains in gateway for inference of large phylogenetic trees. In Proc. 2010 Gateway patescibacteria and DPANN. Front. Microbiol. 11, 1848 (2020). Computing Environments Workshop (GCE) 1–8 (IEEE, 2010). 48. Castelle, C. J. et al. Biosynthetic capacity, metabolic variety and unusual 81. Lê, S., Josse, J. & Husson, F. FactoMineR: an R package for multivariate biology in the CPR and DPANN radiations. Nat. Rev. Microbiol. 16, 629–645 analysis. J. Stat. Sofw. 25, 1–18 (2008). (2018). 82. Kassambara, A. & Mundt, F. Factoextra: Extract and visualize the results of 49. Brown, C. T., Olm, M. R., Tomas, B. C. & Banfeld, J. F. Measurement of multivariate data analyses. R package version 1 (2017); https://www. bacterial replication rates in microbial communities. Nat. Biotechnol. 34, rdocumentation.org/packages/factoextra/versions/1.0.7 1256–1263 (2016). 83. Oksanen, J., Kindt, R., Legendre, P. & O’Hara, B. Te vegan package version 50. Tian, R. et al. Small and mighty: adaptation of superphylum Patescibacteria 2.4.2 (2017). to groundwater environment drives their genome simplicity. Microbiome 8, 84. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2016). 51 (2020). 85. Comolli, L. R. et al. A portable cryo-plunger for on-site intact cryogenic 51. Amalftano, S. et al. Groundwater geochemistry and microbial community microscopy sample preparation in natural environments. Microsc. Res. Tech. structure in the aquifer transition from volcanic to alluvial areas. Water Res. 75, 829–836 (2012). 65, 384–394 (2014). 86. Mastronarde, D. N. & Held, S. R. Automated tilt series alignment and 52. Ben Maamar, S. et al. Groundwater isolation governs chemistry and microbial tomographic reconstruction in IMOD. J. Struct. Biol. 197, 102–113 (2017). community structure along hydrologic fowpaths. Front. Microbiol. 6, 1457 87. Morton, J. T. et al. Establishing microbial composition measurement (2015). standards with reference frames. Nat. Commun. 10, 2719 (2019). 53. Magnabosco, C. et al. Te biomass and biodiversity of the continental 88. Soppa, J. Polyploidy in archaea and bacteria: about desiccation resistance, subsurface. Nat. Geosci. 11, 707–717 (2018). giant cell size, long-term survival, enforcement by a eukaryotic host and 54. Fredricks, D. N., Fiedler, T. L. & Marrazzo, J. M. Molecular identifcation additional aspects. J. Mol. Microbiol. Biotechnol. 24, 409–419 (2014). of bacteria associated with bacterial vaginosis. N. Engl. J. Med. 353, 89. Knudsen, B. E. et al. Impact of sample type and DNA isolation procedure on 1899–1911 (2005). genomic inference of microbiome composition. mSystems 1, e00095-16 (2016). 55. Paster, B. J. et al. Bacterial diversity in necrotizing ulcerative periodontitis in 90. Kunin, V., Copeland, A., Lapidus, A., Mavromatis, K. & Hugenholtz, P. A HIV-positive subjects. Ann. Periodontol. 7, 8–16 (2002). bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 56. Kumar, P. S. et al. New bacterial species associated with chronic periodontitis. 557–578 (2008). J. Dent. Res. 82, 338–344 (2003). 57. Liu, B. et al. Deep sequencing of the oral microbiome reveals signatures of Acknowledgements periodontal disease. PLoS ONE 7, e37919 (2012). We thank E. Genasci, A. Book, K. Kritikos, K. Fults, D. Fults, P. Smith and J. Smith for 58. Dewhirst, F. E. et al. Te human oral microbiome. J. Bacteriol. 192, access to their groundwater wells; C. J. Castelle, L. Valentin, B. Al-Shayeb, K. Lane, 5002–5017 (2010). M. Olm, N. Oberleitner, R. Méheust, A. Jaffe, J. West-Roberts, A. Probst and D. 59. McLean, J. S. et al. Acquisition and adaptation of ultra-small parasitic Geller-McGrath for their assistance with field work; C. J. Castelle and A. Jaffe for reduced genome bacteria to mammalian hosts. Cell Rep. 32, 107939 (2020). assistance with archaeal and CPR phylogenetic analysis, respectively; and E. Montabana 60. Bushnell, B. BBTools Sofware Package (2014); http://sourceforge.net/ for help with collection of cryo-TEM data. C.H. was supported by a Camille and Henry projects/bbmap Dreyfus Foundation Postdoctoral Fellowship in Environmental Chemistry. Funding 61. Joshi, N. A., Fass, J. N. Sickle: a sliding-window, adaptive, quality-based for groundwater sampling and sequencing was provided by the Innovative Genomics trimming tool for FastQ fles v.1.33 (2011). Institute, the Allen Foundation and the Chan Zuckerberg BioHub. 62. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly Author contributions via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015). C.H. and J.F.B. designed the study. C.H., R.K. and I.F.F. collected samples, extracted 63. Peng, Y., Leung, H. C. M., Yiu, S. M. & Chin, F. Y. L. IDBA-UD: a de novo DNA and performed manual binning. C.H. and R.K. performed on-site TFF filtration. assembler for single-cell and metagenomic sequencing data with highly C.H. performed automated binning, bin selection, bin curation and dereplication, uneven depth. Bioinformatics 28, 1420–1428 (2012). phylogenetic analysis, abundance and community comparison analysis, metabolic 64. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. analysis, cell count distribution analysis and on-site TFF filtration. M.L.W. prepared Nat. Methods 9, 357–359 (2012). cryo-TEM grids and collected cryo-TEM data. R.K. performed ordination analysis. 65. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation C.H. and J.F.B. drafted the manuscript. All of the authors reviewed the manuscript. initiation site identifcation. BMC Bioinform. 11, 119 (2010). 66. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Competing interests Bioinformatics 26, 2460–2461 (2010). The authors declare no competing interests. 67. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000). 68. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG Additional information as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, Extended data is available for this paper at https://doi.org/10.1038/s41564-020-00840-5. D457–D462 (2016). Supplementary information is available for this paper at https://doi.org/10.1038/ 69. Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for s41564-020-00840-5. improving sequence similarity searches. Bioinformatics 31, 926–932 (2015). Correspondence and requests for materials should be addressed to J.F.B. 70. Te UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45, D158–D169 (2017). Peer review information: Nature Microbiology thanks Kirsten Kusel and the other, 71. Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection anonymous, reviewer(s) for their contribution to the peer review of this work. Peer of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 reviewer reports are available. (1997). Reprints and permissions information is available at www.nature.com/reprints. 72. Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in Nat. Methods 11, 1144–1146 (2014). published maps and institutional affiliations. 73. Wu, Y.-W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Open Access This article is licensed under a Creative Commons Bioinformatics 32, 605–607 (2016). Attribution 4.0 International License, which permits use, sharing, adap- 74. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. tation, distribution and reproduction in any medium or format, as long CheckM: assessing the quality of microbial genomes recovered from isolates, as you give appropriate credit to the original author(s) and the source, provide a link to single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015). the Creative Commons license, and indicate if changes were made. The images or other 75. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive third party material in this article are included in the article’s Creative Commons license, sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011). unless indicated otherwise in a credit line to the material. If material is not included in 76. Katoh, K. & Toh, H. Recent developments in the MAFFT multiple sequence the article’s Creative Commons license and your intended use is not permitted by statu- alignment program. Brief. Bioinform. 9, 286–298 (2008). tory regulation or exceeds the permitted use, you will need to obtain permission directly 77. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for from the copyright holder. To view a copy of this license, visit http://creativecommons. automated alignment trimming in large-scale phylogenetic analyses. org/licenses/by/4.0/. Bioinformatics 25, 1972–1973 (2009). © The Author(s) 2021

Nature Microbiology | VOL 6 | March 2021 | 354–365 | www.nature.com/naturemicrobiology 365 Articles NATuRE MICRObIOlOGy

PCA − Biplot

10

Nitrospirae

Betaproteobacteria Pr3(09/18) Pr2 Pr3(12/17)

Pr4

5 Pr6

Deltaproteobacteria

Gammaproteobacteria Color Pr7 Pr5 Actinobacteria Chloroflexi Bacteria Alphaproteobacteria CPR Acidobacteria Ignavibacteria DPANN

0 Pr1 Contribution

Dim2 (21.2%) Rokubacteria Levybacteria 5 Uhrbacteria Lloydbacteria Omnitrophica 10 Aenigmarchaeota Wolfebacteria Ag (06/18) 15 Daviesbacteria Niyogibacteria Ag (03/17)

−5 Planctomycetes Ag (02/18)

Ag (11/17) Ag (09/17)

Woesearchaeota

0 5 10 Dim1 (43.1%)

Extended Data Fig. 1 | A biplot representation of phylum-level lineages in ordination space. Arrows show the direction of greatest gradient change according to site. The transparency of the points reflects the contribution of the phylum to the principal components.

Nature Microbiology | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

Extended Data Fig. 2 | Distribution of CPR (top) and DPANN (bottom) phylum-level lineages across groundwater sites. Color/legend indicate relative coverage values (percentages).

Nature Microbiology | www.nature.com/naturemicrobiology Articles NATuRE MICRObIOlOGy

Extended Data Fig. 3 | Genome similarity at the strain (>99% ANI) and species (>95% ANI) level between Pr1 and Pr7 genomes. Blue bars indicate non-CPR bacteria, aqua bars indicate CPR bacteria, and green bars indicate archaea. The magnitude of the y-axis indicates the number of genomes shared according between the two sites according to the ANI threshold.

Nature Microbiology | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

NO - NO - 3 Organic carbon 3 Organic carbon 23% (199) 6.9% (169) 12% (13) 15% (9) 91% 28% (240) 2- 52% (43) 2- 53% (592) 37% (21) - SO4 9.3% (70) - SO4 (72) NO 14% (160) 3% (44) NO 23% (11) 12% (7) 2 Acetate Ethanol 2 Acetate Ethanol SO 2- Pr4 SO 2- NO 3 - NO 3 2- 4% (45) 2 22% (18) 37% (30) Ag (09/17) S O H2 2.8% (64) 0.12 mg/L NH4-N S O H2 0 18% (225) 2 3 19% (214) 73% (859) 0 35% (26) 2 3 65% (53) 94% (77) 0 42% (34) 0 <0.05 mg/L NH4-N 18% (230) H O <0.05 mg/L NO3-N H O N O 2 N O 2 165 mg/L NO3-N 2 63.7 mg/L SO4-S 2 33% (359) 0 CH 75% (55) 0 CH 4.0% (65) 4 0 12% (6) 13% 4 0 69 mg/L SO4-s 6.6% (7) S 2.5% 79 genomes 0 0 S + 0.1% (2) + (9) NH N 21% (276) (46) NH N 58% (42) 1108 genomes 4 2 4 2 H S 0.2% (25) 25% (14) 1.0% (23) 2 CO2 27% (18) H2S CO2 - - NO3 NO Organic carbon 3 Organic carbon 14% (34) 11% (50) 27% (78) 2- 0 9.3% (7) 10% (9) 76% SO 53% (182) 11% (39) 2- - 4 - SO4 (39) 11% (7) NO 9.0% (60) 0.2% (29) 29% (14) 2 Acetate Ethanol Pr5 NO2 12% (9) Pr1 <0.1% (1) 2- Acetate Ethanol SO 2- NO 3 - 0.07 mg/L NH4-N NO SO <0.05 mg/L NH4-N 0.9% (6) 2 3 2- S O H2 6.3% (39) 0 H 17% (8) 33% (69) 2 3 7.0% (73) 79% (223) <0.05 mg/L NO3-N S2O3 15% (11) 2 0.22 mvg/L NO3-N 0 0 37% (12) 95% (47) 34% (93) 39% (20) 0 H2O 3.7 mg/L SO4-S 226 mg/L SO4-S N O H2O 2 N2O 37% (120) 0 CH 47 genomes 248 genomes 5.3% (12) 4 0 45% (28) 0 CH 0 0 S 5.1% 8.1% (4) 19% 4 0 0 0 S + 37% (127) (17) + (9) NH4 N NH 68% (35) 2 4 N2 0.02% (19) 0.9% (39) H S CO2 4.9% (4) 2 1.6% (3) H2S CO2 - NO - 3 Organic carbon NO 3 Organic carbon 7.2% (43) 48% (116) 56% (153) 2- 87% 19% (4) 18% (4) 81% - SO4 (363) 27% (101) 17% (3) 2- 37% (100) SO (14) 3.8% (2) NO2 17% (37) - 4 Pr3 (11/17) 0.1% (1) Acetate Ethanol Pr6 NO 8.4% (3) 2.7% (2) 2- 2 Acetate Ethanol NO SO 2- 0.07 mg/L NH4-N 3 2- 38% (127) <0.05 mg/L NH4-N SO 7.5% (59) S O H2 NO 3 - <0.05 mg/L NO3-N 29% (107) 2 3 57% (200) 93% (399) 4.7% (2) 2 H 14% (3) 0 <0.05 mg/L NO3-N S2O3 46% (7) 2 64% (199) H O 0 21% (5) 98% (20) 65.1 mg/L SO4-S N O 2 1.6 mg/L SO4-S 58% (9) 0 H O 2 N O 2 62% (203) 0 CH 2 439 genomes 11% (55) S 6.3% 4 0 20 genomes 74% (12) 1.1% (5) 0.7% (4) 0 32% CH4 0 + (25) 14% (3) 0 S NH N 66% (229) + 0 (5) 4 2 NH N 74% (12) H S 42% (82) CO 4 2 32% (88) 2 2 0 0 H2S CO2 - - NO3 NO Organic carbon 3 Organic carbon 16% (43) 41% (116) 52% (153) 2- 95% 20% (11) 21% (12) 94% 25% (101) 37% (24) 2- - SO4 (363) SO (51) 22% (10) 41% (100) Pr7 - 4 NO2 13% (37) 20% (12) Acetate Ethanol NO2 3.2% (5) Pr3 (09/18) 2.6% (1) 2- Acetate Ethanol SO <0.05 mg/L NH4-N 2- 0.07 mg/L NH4-N NO 3 2- SO 12% (59) H 49% (127) <0.05 mg/L NO3-N NO 3 2- S2O3 65% (200) 2 1.6% (3) H 16% (6) <0.05 mg/L NO3-N 38% (107) 98% (399) S2O3 20% (13) 2 0 0.7 mg/L SO4-2 0 59% (27) 99% (58) 69% (199) H O 0 65.1 mg/L SO4-S 2 28% (22) H O N2O 2 63 genomes N2O 62% (203) 0 6.5% CH4 0 62% (33) 0 439 genomes 22% (55) S 11% CH4 0 0.5% (5) 0.1% (4) 7.0% (8) 0 S + (25) 0 (6) NH 74% (229) + 51% (32) 4 N2 NH 4 N2 28% (82) 41% (88) H2S CO2 8.6% (5) 0.8% (4) H2S CO2

Thiosulfate oxidation to sulfate Thiosulfate oxidation to sulfate Ammonia oxidation Sulfur oxidation Thiosulfate disproportionation Ammonia oxidation Sulfur oxidation Thiosulfate disproportionation Nitrite oxidation Anammox Nitrite oxidation Anammox Nitrate reduction sulfate reduction Nitrate reduction sulfate reduction

Hydrogen oxidation Methanotrophy Hydrogen oxidation Methanotrophy Fermentation Ethanol oxidation Acetate oxidation Fermentation Ethanol oxidation Acetate oxidation Hydrogen generation Methanogenesis Organic carbon oxidation Hydrogen generation Methanogenesis Organic carbon oxidation

Extended Data Fig. 4 | Community-level cycling of nitrogen, sulfur, and carbon in the eight groundwater communities sampled in this study. Listed next to each metabolic step are the total relative abundance of all genomes capable of carrying out the step, and the number of genomes containing the capacity for that step. Arrow sizes are drawn proportional to the total relative abundance of genomes capable of carrying out the metabolic step.

Nature Microbiology | www.nature.com/naturemicrobiology Articles NATuRE MICRObIOlOGy

Acetate oxidation Hydrogen oxidation Nitrite Arsenate reduction ammoni- Pr3 (09/18) 1 Pr3 (11/17) Carbon Pr2 Pr4 Selenate Nitrate Thiosulfate Methano- Pr5 reduction reduction oxidation trophy Pr6 Arsenite Hydrogen oxidation generation Sulfur oxidation 0 Pr7 Methano- Nitrous oxide reduction Pr1 genesis Ag (06/18) Fermentation Nitrite reduction Ag (03/17)

Dim2 (5.6%) Chlorate Metal reduction Ag (09/17) reduction −1 Nitric oxide Ag (11/17) Organic carbon Anammox reduction Thiosulfate Ag (02/18) oxidation disproportionation sulfate reduction Nitrite oxidation −2

0 5 10 Dim1 (86.9%)

Metabolism type Mean relative abundance across sites Carbon Nitrogen Sulfur Metal, chlorate, 0.2 0.4 0.6 0.8 arsenate, selenate

Extended Data Fig. 5 | Depiction of total relative coverages of different metabolic capacities, in principal component space. Principal component analysis was performed on the relative abundance of organisms with specific metabolic capacities across all sites.

Nature Microbiology | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

200 nm

200 nm

Extended Data Fig. 6 | Zoomed in view of cryo-TEM images of ultra-small cells connected to host cells with pili-like appendages in Ag groundwater (Fig. 5) concentrated by tangential flow filtration. White arrows indicate pili-like appendages extending into the host from the ultra-small cell.

Nature Microbiology | www.nature.com/naturemicrobiology Articles NATuRE MICRObIOlOGy

Extended Data Fig. 7 | Relative coverage for CPR bacteria, non-CPR bacteria, DPANN archaea, and non-DPANN archaea genomes in all size fractions sequenced in this study.

Nature Microbiology | www.nature.com/naturemicrobiology NATuRE MICRObIOlOGy Articles

Paired t test on cell counts in 2.5+ μm fraction vs. 0.2-0.65 μm fraction t

CPR bacteria Non-CPR bacteria DPANN archaea Non-DPANN archaea

Phylum-level lineage

Extended Data Fig. 8 | Results from a paired t-test (two-tailed) on estimated cell counts of genomes in the 2.5+ µm versus the 0.2–0.65 µm size fractions after serial size filtration. A positive t statistic indicates enrichment on the 2.5+ µm compared to the 0.2–0.65 µm size fraction. Values listed by each bar are the calculated p value (top value) and sample size (bottom value) for each phylum-level lineage.

Nature Microbiology | www.nature.com/naturemicrobiology nature research | reporting summary

Corresponding author(s): Jillian Banfield

Last updated by author(s): Nov 17, 2020

Reporting Summary Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see our Editorial Policies and the Editorial Policy Checklist.

Statistics For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section. n/a Confirmed The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one- or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section. A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)

For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted Give P values as exact values whenever suitable.

For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code Policy information about availability of computer code Data collection Gatan Microscopy Suite (v 3.4.1), SerialEM (v 3.7)

Data analysis BBTools (v 38.78); Sickle (v 1.33); MEGAHIT (v 1.2.9); IDBA-UD (v 1.1.3); bowtie2 (v 2.3.5.1); Prodigal (v 2.6.3); usearch (v 10.0.240); 16SfromHMM.py (available at https://github.com/christophertbrown/bioscripts); tRNAscan-SE (v 1.3.1); CONCOCT (v 1.1.0); Maxbin2 (v 2.2.7); Abawaca (v 1.07); DASTool (v 1.1.1); ggkbase (https://ggkbase.berkeley.edu/); dRep (v 2.5.3); METABOLIC (v 1.3); IMOD (v 4.9); ImageJ (v 2.0.0); iRep (v 1.1.14); For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data Policy information about availability of data

All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: April 2020 - Accession codes, unique identifiers, or web links for publicly available datasets - A list of figures that have associated raw data - A description of any restrictions on data availability

SRA accession numbers for metagenome reads are in SI Table 18. All metagenome-assembled genomes from this study are deposited in NCBI under Bioproject PRJNA640378. The genomes are also available at: http://ggkbase.berkeley.edu/all_nc_groundwater_genomes (please note that it is necessary to register for an account by provision of an email address prior to download).

1 nature research | reporting summary Field-specific reporting Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection. Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Ecological, evolutionary & environmental sciences study design All studies must disclose on these points even when the disclosure is negative. Study description This study performs genome-resolved metagenomics analysis and cryo-electron microscopy on bacterial and archaeal communities in 8 groundwater sites.

Research sample For the metagenomics portion of the study, the research samples are metagenomes sequenced from 8 groundwater sites in northern California. For the cryo-electron microscopy, the sample is groundwater from one of these sites concentrated by tangential flow filtration.

Sampling strategy At each site, 400-1200 L of groundwater (planktonic portion) was pumped onto filters from which DNA was extracted. For cryo- electron microscopy, 20 L of groundwater was pumped and concentrated to <5 mL using tangential flow filtration.

Data collection Extracted DNA was sequenced on either HiSeq 4000 or NovaSeq 6000 platforms, at either the California Institute for Quantitative Biosciences’ (QB3) genomics facility or the Chan Zuckerberg Biohub’s sequencing facility.

Timing and spatial scale Groundwater sampling dates by site: Ag (03/17, 09/17, 11/17, 02/18, 06/18), Pr1 (12/18), Pr2 (05/19), Pr3 (11/17, 09/18), Pr4 (09/18), Pr5 (05/19), Pr6 (05/19), Pr7 (11/18).

Data exclusions No data were excluded from analysis.

Reproducibility No explicit measures were taken to ensure reproducibility of assembled genomes from each site. Time series sampling of Ag groundwater show that similar genomes are recovered from each time point.

Randomization Genomes were taxonomically classified based on a phylogenetic tree of concatenated ribosomal proteins, allowing us to categorize genomes as CPR bacteria, non-CPR bacteria, DPANN archaea, and non-DPANN archaea.

Blinding Blinding was not relevant to our study. Did the study involve field work? Yes No

Field work, collection and transport Field conditions All groundwater was pumped from shallow wells (<100 m deep).

Location Sites Pr1 through Pr7 are located in Lake/Napa County, California, while site Ag is located in Modesto, California.

Access & import/export Private sites were sampled with explicit permission from the property owner.

Disturbance To our knowledge, our groundwater sampling did not cause any disturbance.

Reporting for specific materials, systems and methods We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material, system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response. April 2020

2 Materials & experimental systems Methods nature research | reporting summary n/a Involved in the study n/a Involved in the study Antibodies ChIP-seq Eukaryotic cell lines Flow cytometry Palaeontology and archaeology MRI-based neuroimaging and other organisms Human research participants Clinical data Dual use research of concern April 2020

3