Extended Data Figure 1 | Phylogeny of Phylum Saccharibacteria based on SSU rRNA (see also Supplemental Data File 1) Full length sequences of publically available genomes and those from this study along with the HOMD representative sequences were searched against the Silva Database (nearest neighbors references (n-20; 89% cutoff) and extracted for each sequence following alignment and trimming. The resulting 400 sequences in the SILVA alignment were masked with >10% gaps removed. Trees were inferred by Maximum likelihood (RAXML) with 100 bootstrap resamplings and metadata included to highlight sources. CPR genomes from SR-1, Kazan and Berkelbacteria were used as outgroups.

Extended Data Figure 2 | Distribution of CPR Phylum across body sites from the HMP datasets. (a) curation of available oligotyping data of HMP data to show the relative percentages of the phylum in each body site derived using 16S rRNA V1-V3 datasets (~250 nucleotides) from Eren et al. {Eren, 2014 #5}) . (b) percentage of hits to 16S rRNA gene from each Saccharibacteria group in 1129 HMP body site assemblies (cutoff ; > 299 bp and > 79% pairwise identity). (c) percentage of hits to 16S rRNA gene from each GN02 and SR-1 groups in 1129 HMP body site assemblies (cutoff ; > 299 bp and > 79% pairwise identity). Extended Data Figure 3 | Core gene Alignment Tree and Pan genome analysis of 15 Saccharibacteria genomes and genome bins. (a) Core gene alignment of the 121 concatenated shared genes within 15 Saccharibacteria genomes (with 10% gaps stripped resulting in 1,998 columns and tree inferred with Fasttree) agrees with the G3, G5, G6 groups being the most distant with some variation in the G1 clades showing the environmental derived metagenome bins more closely related compared to the 16S rRNA gene and concatenated ribosomal tree (Figure1). (b) Reconstruction of gene content evolution in Saccharibacteria. COUNT software was used to infer gene gains and losses on the concentrated marker tree in Figure 1 from the matrix of phyletic patterns (containing a presence/absence information for each genome in an orthologous group out of 2905 groups) using Dollo parsimony. (c) the BLAST percentage identity and number of pairwise BLAST results highlights the overall low percentage identities across the compared genomes. (d) number of new and unique genes as genomes are added to the Saccharibacteria pangenome. (e) comparing the percentage identities of the different genomes to the genes found in the cultivated G1 oral type strain TM7x highlighting the overall distribution of amino acid identities across the genome with higher average percent identities with the environmental G1 genomes than other oral derived groups outside of G1.

Extended Data Figure 4 | Comparative analyses of predicted gene functions within the Saccharibacteria phylum and across various based on lifestyles. (a) PCA clustering of data presented in Figure 4 for COG distribution percentages averaged across oral and environmental Saccharibacteria groups for functionally annotated genes within free-living, facultative host-associated and host dependent as well as obligate intracellular (mutualistic and parasitic bacteria) (derived from Merhej et al. (20)). (b) Clustered heatmap of percentages functions within each assembled genome by genome type (column) and functions (rows). Oral groups do cluster together for most G1 and outside of the G1 cluster by ecotype. Unique functions within the groups is visible. (c) PCA of data in b.

Extended Data Figure 5 | Maximum likelihood phylogenetic gene tree of Anaerobic ribonucleoside-triphosphate reductase nrdD genes acquired in human associated groups. The nrdD was identified as being a unique gene among the oral genomes that was not in the environmental genomes. The taxonomic best hits (Figure 4) and the gene trees shown here indicated that diverent phylogenetic relatedness between the acquired genes supporting independent acquisition from different bacteria. This Anaerobic version of the ribonucleoside-triphosphate reductase would enable function under host conditions in the oral cavity. In addition, the G3 (oral and rumen) and G5 groups have nrdG which enables activation of anaerobic ribonucleoside-triphosphate reductase under anaerobic conditions by generation of an organic free radical, using S-adenosylmethionine and reduced flavodoxin as cosubstrates to produce 5'-deoxy-adenosine.

Extended Data Figure 6 | Maximum likelihood phylogenetic gene tree of Lactate dehydrogenase genes acquired in human associated groups. The Ldh gene was identified as being a unique gene among the oral genomes that was also not in the environmental genomes. The taxonomic best hits (Figure 4) and the gene trees for this gene indicate the divergent phylogenetic relatedness between the acquired genes amongst the different groups supporting independent acquisition from different bacteria.

Extended Data Figure 7 | Global taxonomic profiles of Saccharibacteria proteome. All protein sequences from the 19 representative genomes were assigned putative based on best hits to the Kegg non-redundant database using GhostKOALA metagenomic annotation server. (a) Relative proportions of the full protein taxonomy highlight the overall similarities using this method (b) subset of , Unclassified Archaea and hits. (c) Principal coordinate plot of the taxonomic data highlighting the clustering of the G1 groups in general agreement with the phylogeny and G5 whereas the G3 and G6 are distinct.

Extended Data Figure 8| Diversity of genes, repeats, spacers and operon structure of CRISPR regions in Saccharibacteria. (a) CRISPR regions for Saccharibacteria were not reported in previous published genomes and were thought to lack this phage immunity mechanism. A CRISPR system was only found in the oral genomes belonging to G1 (S7, S2 and 12Alb) and the G5_1 but thus far lacking in environmental representatives. No overlaps in the direct repeat sequences and the spacer sequences were found between genomes. Uniquely, a cas6 gene was found within the G1_TM7S7 which was also missing the cas9 gene. The variability between G1 and G5 as well as some variability within the G1 also support independent acquisition. (b) gene tree for the csn2 gene unique to G1 12Alb genome that clusters with host associated bacteria. (c) gene tree for the G1 and G5 cas2 protein gene sequences supporting independent acquisition from unrelated bacterial groups.

Extended Data Figure 9| VIRB2 type IV secretion system shared among genomes. We found part of the syntenic region of the G1 group includes a novel region that has an array of small hypothetical proteins (4-6) all containing N- terminal signal peptides. This unique array of proteins was also found in G3, G5 and G5 genomes as well as other Saccharibacteria and the amoebae symbiont belonging to the lineage TM6 ((29) proposed Phylum name Dependentiae). The closest homology of the proteins in this array to any known sequence indicates it may be related to VIRB2 type IV secretion system.