Supporting Information

Campbell et al. 10.1073/pnas.1303090110

SI Results
Human Oral SR1 Bacterial Genome and the Oral SR1 Pangenome. A healthy human oral subgingival plaque sample was used for bacterial single-cell sorting by flow cytometry. Following genomic multiple displacement amplification (MDA) using phi29 DNA polymerase and characterization of single amplified genomes (SAGs) by small subunit (SSU) rRNA gene amplification and sequencing, we identified one SAG corresponding to an SR1 bacterium (SR1-OR1). The SR1-OR1 SSU rRNA sequence is 99–100% identical to that of several other human oral and skin SR1 clones from GenBank (Fig. 1). The SAG DNA was sequenced using both 454 Titanium and Illumina HiSeq platforms, generating 124.8 Mbp and 21 Gbp of sequence data, respectively.

Computational normalization and hybrid assembly resulted in 56 contigs, totaling 0.46 Mbp of genomic sequence. Initial gene calls and annotation were performed using the Integrated Microbial Genomes (IMG)/M system. The SR1-OR1 predicted protein sequences were used to search for close homologs in the 415 oral Human Microbiome Project (HMP) metagenomic assemblies. Metagenomic scaffolds containing regions syntenic with the SAG and with high amino acid and nucleotide similarity levels (>95%) were used for a hybrid, pangenomic superassembly. Over 70 scaffolds from 13 metagenomes (supragingival plaque and tongue) were used in the superassembly (Table S1). Even though the metagenomic scaffolds originate from samples collected from different individuals and therefore do not represent a clonal population of SR1 bacteria or even a single phylotype, we applied the superassembly not only to improve our SR1-OR1 single-cell assembly but also to assess host-dependent microheterogeneity within SR1. First, the metagenomic scaffolds that had the highest similarity level (identical protein sequence, synonymous DNA substitutions allowed on overlapping regions) to the SAG dataset were used to expand the SAG assembly by bridging the single-cell data into larger contigs. PCR walking from SAG DNA and remapping of Illumina reads were also applied to minimize the heterogeneity introduced by the different metagenomic scaffolds. This process resulted in an expanded SR1-OR1 assembly of 1.1 Mbp in 49 scaffolds, with an average G+C content of 36%. Second, overlapping independent metagenomic regions, which in some cases included >10 scaffolds, allowed analyses of variability between microbiota samples and human hosts, including distinguishing oral SR1 strains or species present in the human population.

TGA Use in RubisCO Alleles Across the Pangenome. For the SR1 ribulose-1,5-bisphosphate carboxylase (RubisCO) gene, we identified 65 alleles in the HMP oral metagenomes and mapped four sites with alternative use of TGA vs. GGA codons (Fig. 4 and Dataset S1). PCR amplification and sequencing of a fragment of the RubisCO gene from six individuals (healthy and with periodontitis) confirmed these polymorphisms in the oral microbiota. Phylogenetic analysis of 30 full-length alleles from HMP metagenomes and the SR1-OR1 copy grouped the SR1 RubisCO genes in two main clades that are congruent with the groupings in the 16S rRNA phylogeny (Fig. S6). For several metagenomes, alleles representing both clades were present, indicating that an individual can harbor several SR1 phylotypes. The average nucleotide similarity level ranges from 96% within clades to 84% between the two clades. Only one of the four reprogrammed TGA locations appears correlated with the phylogenetic grouping of the RubisCO genes (Fig. 4 and Fig. S6). The gene is under strong natural selection, as the ratio of nonsynonymous to synonymous substitution rates (dN/dS) across the RubisCO tree is low (0.023, SD = 0.005). Separate analysis of another gene (thymidine phosphorylase, deoA) across the SR1 pangenome space (10 alleles) yielded a similarly low value (dN/dS = 0.05, SD = 0.017). This finding suggests that the low G+C content of the SR1 genome and the reprogramming are not the result of neutral mutational drift under relaxed selection, which would be signaled by high dN/dS values, as in the genomes of clonal bacterial pathogens (1).

SI Discussion
UGA-decoding tRNAGly suppressors isolated from selection experiments (reviewed in ref. 2) were found to contain full or partial modification of A37, which is normally unmodified in canonical tRNAGly, to 2-methylthio-N6-isopentenyladenosine (ms2i6A). The modification enhances UGA translation (3) and increases mistranslation (4). Clear homologs of the gene (miaA) encoding the ms2i6A modification enzyme are present in the SR1, ACD78, and ACD80 genomes. Although glycyl-tRNA synthetase (GlyRS) activity is typically not sensitive to modified bases (5), it is possible that modification at position 37 or at other locations may improve UGA decoding for the SR1 and ACD78 tRNAs in their native cellular context.

SI Experimental Procedures
Sample Collection. Subgingival samples (crevicular fluid and biofilm) were collected from a healthy volunteer using medium paperpoints. Signed informed consent was obtained and the Oak Ridge National Laboratory Institutional Review Board approved the human subject protocol. The paperpoints were pooled and homogenized by vortexing in sterile PBS to resuspend the cells, followed by filtration through a CellTrics 30-μm disposable filter (Partec). The bacterial suspension was fixed with an equal volume of absolute ethanol at −20 °C overnight. Fixed cells were centrifuged at 3,000 × g for 5 min at 4 °C and washed once with 1× PBS before sorting.

Single-Cell Sorting, MDA, and Taxonomic Characterization. A bacterial cell sample was stained with the nucleic acid dyes SYTO 9 (green) and SYTO 62 (red) (Life Technologies), each at 5 μM for 15 min. Flow cytometry cell sorting was performed using a Cytopeia Influx cell sorter (BD), which was cleaned as previously described (6, 7). Several thousand cells were sorted individually, based on green and red fluorescence, into 96-well plates containing 3 μL TE per well.

MDA was performed in 20-μL reactions essentially as described in ref. 6, using a dedicated DNA-free hood, reagents, and plasticware (8). Briefly, cells were lysed by addition of 3 μL of 0.13 M KOH, 3.3 mM EDTA pH 8.0, and 27.7 mM DTT, heated to 95 °C for 30 s, and placed on ice for 10 min. Neutralization buffer was added (3 μL of 0.13 M HCl, 0.42 M Tris pH 7.0, 0.18 M Tris pH 8.0) followed by 11 μL of amplification mix that contained 90.9 μM of 3′-end phosphorothioate-protected random hexamers (Integrated DNA Technologies), 1.09 mM dNTPs (Roche), 1.8× phi29 DNA polymerase buffer (New England BioLabs), 4 mM DTT (Roche), and 100 U phi29 DNA polymerase (9). Amplification was for 10 h at 30 °C followed by inactivation at 80 °C for 20 min. To increase the amount of available DNA, secondary MDAs were performed on 0.5 μL of primary MDA product using the same protocols used for primary MDAs, with the exception of amplification duration (6 h). MDA products were purified using
standard phenol:chloroform:isoamyl alcohol extraction and alcohol precipitation.

The amplified products (SAGs) were screened for the presence of bacterial SSU rRNA genes using PCR amplification with the primers 27Fm (5′-AGAGTTTGATYMTGGCTCAG-3′) and 1492R (5′-TACCTTGTTACGACTT-3′) followed by direct Sanger sequencing of the products. The sequences were classified using the Ribosomal Database Project (10) and compared with previously known oral bacteria using CORE (11). Over 200 SAGs representing a large diversity of oral bacteria were obtained from this healthy oral sample, including one representing candidate SR1, previously described in oral microbial rRNA diversity analyses as “candidate division SR1 bacterium taxon 345” (12).

Analysis of SR1 Body Site Distribution and Phylogenetic Reconstructions. SSU rRNA sequences (454 pyrosequences of the V3–5 hypervariable region) corresponding to SR1 bacteria that colonize different human body niches (13) were obtained from the HMP Data Analysis and Coordination Center (www.hmpdacc.org). The sequences were aligned, preclustered (14), and clustered into operational taxonomic units (OTUs) at 97% sequence identity using mothur (15). The resulting OTU table was structured with body sites as samples. Raw OTU counts were converted to percentage within sample, square-root transformed, used to calculate a Bray–Curtis resemblance matrix, and visualized with nonmetric multidimensional scaling and hierarchical clustering in PRIMER-E (16, 17). Representative sequences from each OTU were also added to an alignment containing SR1 SSU rRNA sequences available in GenBank and representing a wide range of environments (18). Phylogenetic reconstructions were calculated by both neighbor joining (using Jukes Cantor-corrected distances) in Geneious v5.6 (19) and maximum likelihood (using a general time reversible model with parameters estimated from the data) in PhyML (20).

SR1 SAG Sequencing, Assembly, and Annotation. The SR1 SAG DNA was sequenced using both 454 and Illumina HiSeq. For 454 sequencing, 3 μg DNA was sheared and used to construct a 3-kb insert library according to the manufacturer’s instructions. One-half of a 454 Titanium sequencing plate was used, generating >985,000 reads that totaled 123.8 Mbp of genomic sequence. The average library insert length, calculated based on >262,000 paired reads, was 2.5 kb.

For Illumina HiSeq sequencing, 1 μg DNA was sheared at 400-nt average fragment size and used to construct a paired-end library (Hudson Alpha Institute for Biotechnology, Huntsville, AL). Following sequencing, 212 million paired-end reads (100-nt length) were obtained, representing >21 Gbp of genomic sequence. All sequences were subjected to screening for non-SR1 contamination using the Joint Genome Institute single-cell genomic analysis pipeline. Reads corresponding to human DNA represented 2.9% of the sequences, but bacterial contamination was essentially absent (four reads matching to E. coli), and all those were removed from the dataset. The resulting raw Illumina sequence data were passed through a filtering program developed at the Joint Genome Institute, which filters out known Illumina sequencing and library preparation artifacts. Specifically, all reads containing sequencing adapters, low complexity reads, and reads containing short tandem repeats were removed. Duplicated read pairs derived from PCR amplification during library preparation were identified and consolidated into a single consensus read pair. The artifact-filtered sequence data were screened and trimmed according to the k-mers present in the dataset. High-depth k-mers, presumably derived from MDA amplification bias, cause problems in the assembly, especially if the k-mer depth varies by several orders of magnitude for different regions of the genome. We therefore removed reads representing high-abundance k-mers (>64× k-mer coverage, k = 21) and trimmed reads that contain unique k-mers. These filtering steps reduced the dataset to 0.6 million reads.

Assembly was performed in several steps: (i) Filtered Illumina reads were assembled using Velvet v1.1.04 (21). The VelvetOptimizer script (v2.1.7) was used with default optimization functions (n50 for k-mer choice, total number of base pairs in large contigs for cov_cutoff optimization). (ii) The Velvet contigs were used to simulate reads from long-insert libraries, which were used together with the filtered reads as input for Allpaths-LG (22) assembly. (iii) Next, Allpaths contigs larger than 1 kb were shredded into 1-kb pieces with 200-bp overlaps. (iv) Finally, the Allpaths shreds and raw 454 pyrosequence reads were assembled using the 454 Newbler assembler v2.4 (Roche/454 Life Sciences). This process resulted in a total assembly size of 460,033 bp (56 contigs, N50: 14,267 bp). Further inspection of the draft 454-Illumina hybrid assembly and contig refinement was performed using Geneious v.5.6 (19).

Metagenome-Assisted SR1 Pangenome Assembly. Predicted protein sequences from various regions of the SR1 contigs were used to search for close homologs in the 415 HMP metagenomic datasets deposited in IMG_HMP (23) using blastp. Top hits that were part of relatively large contigs (>1 kb), with significant e values (<10e-20) and with DNA composition (G+C content) similar to that of the SR1 genomic assembly (35–38%), were retrieved for further analysis. Recruitment of the HMP metagenomic contigs on the SR1 SAG assembly was performed using Geneious (high-sensitivity assembly setting), with GenBank formatted sequences (i.e., with ORFs and complete annotations), and followed several steps: (i) The HMP contigs were assembled with the SR1 reference contig that served as blast query, any contigs that did not assemble with the reference being eliminated. (ii) The assembled contigs were analyzed in terms of gene content and synteny with the SR1 reference. The contigs that were highly dissimilar and that only had the query gene as a common, well-assembled region were eliminated. (iii) The sequence similarity (protein and DNA level) between the SR1 contigs and the HMP metagenomic contigs was inspected along the assembled region. Contigs with the highest similarity to the SR1 SAG over the overlap (generally >95% at DNA level) were retained as part of the pangenomic assembly and analysis and are listed in Tables S1 and S3. Further recruitment of Illumina reads from the SR1 SAG on the metagenomic scaffolding and PCR walking using primers designed based on the SR1 SAG contigs joined by metagenomic data were used to confirm their presence in the original MDA product and to further reduce the number of SNPs. The final SR1 assembly contained 49 scaffolds totaling 1.1 Mbp.

For annotation we used the IMG platform developed by the Joint Genome Institute (Walnut Creek, CA) (24). To accommodate TGA reassignment, the ORF identification software Prodigal (25) was compiled with a custom translation table to implement TGA = Gly. Gene annotations were inferred from searches against the National Center for Biotechnology Information (NCBI) nonredundant database, UniProt, TIGRFam, Pfam, KEGG, COG, and InterPro databases. tRNAscan-SE (26) was used to find tRNA genes, whereas ribosomal RNA genes were found by searches against models of the ribosomal RNA genes built from SILVA (27). Other noncoding RNAs, such as the RNA components of the protein secretion complex and the RNase P, were identified by searching the genome for the corresponding Rfam profiles using INFERNAL (http://infernal.janelia.org).

The SR1-OR1 final assembly sequence data were submitted to NCBI GenBank under BioProject accession number PRJNA189303. The sequence data are readily available with full annotations under the public IMG portal at http://img.jgi.doe.gov/cgi-bin/w/main.cgi (taxon ID: 2517572135).

Genomic Annotation and Evolutionary Analyses. Protein encoding genes were identified using a Prodigal version that recoded TGA =
Gly (25). RNA gene prediction and automated and manual functional annotation were performed within the IMG (28) platform. Phylogenies were calculated with PhyML (20) and RAxML (29); codon use and nucleotide substitution rates were determined using codonw (codonw.sourceforge.net), CAT (30), and PAML (31).

Protein-Based Phylogenies and Codon Composition Analyses. Protein sequences were aligned in Geneious using the MUSCLE algorithm, followed by visual inspection and masking of regions containing high variability. For RubisCO and the RNA polymerase subunits we also performed structure-informed alignments using 3D-Coffee (32). Phylogenetic trees and bootstrap support were calculated using both PhyML (20) and RAxML (29), with the best evolutionary model selected based on ProtTest (33) for each gene. To amplify a 0.8-kb fragment of the SR1 RubisCO gene from oral microbiota samples, PCR primers were designed manually (RbcF: 5′-CAGGRAATGTCTTTGGAATG-3′; RbcR: 5′-GCTCTCCARCTATCATC-3′).

Codon use was calculated using CodonW (http://sourceforge.net/projects/codonw) and CAT v1.0 (30), where we modified the tables to accommodate TGA = Gly. To calculate the synonymous and nonsynonymous substitution rates, codon-based alignments of RubisCO and DeoA were analyzed in PAML v4.6 (31) using codeml, with parameters: model = 2, NSsites = 0, icode = 4, Mgene = 0. Pairwise dN/dS between the SR1 SAG allele and those from selected HMP metagenomic contigs were calculated and are displayed in Fig. 4.

Genome Completeness Estimation. SR1-OR1 genome size and completeness were estimated using a conserved single copy gene (CSCG) set that has been determined from all 1,516 finished bacterial genome sequences in the IMG database (34). The set consists of 139 bacterial specific CSCGs (Table S2) that were found to occur only once in at least 90% of all genomes by analysis of an abundance matrix based on hits to the protein family (Pfam) database (35). Hidden Markov models of the identified Pfams were used to search both the final hybrid assembly and the metagenome-enhanced assembly by means of the HMMER3 software (36). Resulting best hits above precalculated cut-offs were counted and the completeness was estimated as the ratio of found CSCGs to total CSCGs in the set after normalization to 90%. The estimated complete genome size was calculated by dividing the total assembly size by the estimated genome completeness.

In Vivo UGA Translation Assay. Escherichia coli TOP10 (Invitrogen) cells were cotransformed with pKTS (empty or with SR1 glyS) and pTECH vectors (empty or containing a tRNAGly(UCA) variant and lacZ with a cognate TGA codon at M3) constructed as described previously (37). β-Galactosidase activity was measured as in ref. 38. Inoculated (1:10 dilution) from fresh overnight cultures, cells were grown in LB (37 °C) to midlog phase (OD600 ∼0.6) in the presence of 0.1 mM isopropyl β-D-1-thiogalactopyranoside; 20 μL of culture was used for each replicate and six independent cultures were measured in duplicate reactions.

In Vitro Aminoacylation Assay. The SR1 glyS gene was codon optimized for expression in E. coli, synthesized (Invitrogen), and cloned into pET-15b (NdeI/BamHI). The SR1 GlyRS was overproduced by auto-induction in BL21(DE3) E. coli cells grown to OD600 = 0.8 at 37 °C, then incubated at 16 °C for 16 h. The His6-tagged enzyme was purified over TALON affinity resin according to the manufacturer’s instructions (Clontech). The SR1, ACD78, and ACD80 tRNAGly(UCA) genes were cloned into separate pUC18 vectors (BamHI/XbaI) for in vitro transcription as described previously (39). Aminoacylation assays were performed as described previously (39), with the following modifications. The reaction was performed at 37 °C in 60 mM Tris•HCl pH 7.5, 10 mM MgCl2, 30 mM KCl, 5 mM DTT, 10 mM ATP (pH 7.0), and 10 mM Gly. All reactions contained 10 μM SR1 GlyRS and 0.2 μM α-[32P]-labeled tRNA, which was prepared as described previously (39). The reaction was stopped with 1.5 volumes of 0.66 μg/μL nuclease P1 (Sigma) in 100 mM sodium citrate, pH 5.49. Because the tRNA is labeled only at A76, the radiolabeled reaction products after digestion are AMP (representing unaminoacylated tRNA) and aminoacyl-AMP (representing aminoacylated tRNA). The reaction products are separated on PEI-cellulose plates in 0.1 M ammonium acetate and 5% (wt/vol) acetic acid, then visualized and quantified using phosphorimaging.
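
The arithmetic described under Genome Completeness Estimation can be illustrated with a short sketch. This is a minimal example under stated assumptions, not the pipeline used in the study: the function names are hypothetical, "normalization to 90%" is interpreted here as dividing the found fraction by 0.90 and capping the result at 1.0, and the count of 100 found CSCGs is a made-up placeholder. Only the CSCG set size (139) and the assembly size (1,111,604 bp, from the genome statistics in Fig. S4) come from the text.

```python
def estimate_completeness(found_cscg: int,
                          total_cscg: int = 139,
                          normalization: float = 0.90) -> float:
    """Estimate genome completeness from conserved single copy gene counts.

    ASSUMPTION: 'normalization to 90%' is read as dividing the found
    fraction by 0.90, because each CSCG occurs once in at least 90% of
    reference genomes. The result is capped at 1.0.
    """
    if total_cscg <= 0:
        raise ValueError("total_cscg must be positive")
    if not 0.0 < normalization <= 1.0:
        raise ValueError("normalization must be in (0, 1]")
    if not 0 <= found_cscg <= total_cscg:
        raise ValueError("found_cscg must be between 0 and total_cscg")
    return min(1.0, (found_cscg / total_cscg) / normalization)


def estimate_genome_size(assembly_bp: int, completeness: float) -> float:
    """Estimated complete genome size = assembly size / completeness."""
    if completeness <= 0.0:
        raise ValueError("completeness must be positive")
    return assembly_bp / completeness


if __name__ == "__main__":
    # 100 found CSCGs is a hypothetical placeholder, not a reported value.
    completeness = estimate_completeness(found_cscg=100)
    size_bp = estimate_genome_size(1_111_604, completeness)
    print(f"completeness = {completeness:.3f}, estimated size = {size_bp:,.0f} bp")
```

Note the direction of the division: an incomplete assembly (completeness below 1) implies a genome larger than the assembly, so the assembly size is the numerator.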

1. Hershberg R, Petrov DA (2010) Evidence that mutation is universally biased towards AT in bacteria. PLoS Genet 6(9):e1001115.
2. Murgola EJ (1994) Translational suppression: When two wrongs DO make a right. tRNA: Structure, Biosynthesis and Function, eds Söll D, RajBhandary UL (ASM, Washington, DC), pp 491–510.
3. Petrullo LA, Gallagher PJ, Elseviers D (1983) The role of 2-methylthio-N6-isopentenyladenosine in readthrough and suppression of nonsense codons in Escherichia coli. Mol Gen Genet 190(2):289–294.
4. Díaz I, Ehrenberg M (1991) ms2i6A deficiency enhances proofreading in translation. J Mol Biol 222(4):1161–1171.
5. Mazauric MH, et al. (1996) An example of non-conservation of oligomeric structure in prokaryotic aminoacyl-tRNA synthetases. Biochemical and structural properties of glycyl-tRNA synthetase from . Eur J Biochem 241(3):814–826.
6. Rodrigue S, et al. (2009) Whole genome amplification and de novo assembly of single bacterial cells. PLoS ONE 4(9):e6864.
7. Stepanauskas R, Sieracki ME (2007) Matching phylogeny and metabolism in the uncultured marine bacteria, one cell at a time. Proc Natl Acad Sci USA 104(21):9052–9057.
8. Woyke T, et al. (2011) Decontamination of MDA reagents for single cell whole genome amplification. PLoS ONE 6(10):e26161.
9. Blainey PC, Quake SR (2011) Digital MDA for enumeration of total nucleic acid contamination. Nucleic Acids Res 39(4):e19.
10. Cole JR, et al. (2009) The Ribosomal Database Project: Improved alignments and new tools for rRNA analysis. Nucleic Acids Res 37(Database issue):D141–D145.
11. Griffen AL, et al. (2011) CORE: A phylogenetically-curated 16S rDNA database of the core oral microbiome. PLoS ONE 6(4):e19051.
12. Dewhirst FE, et al. (2010) The human oral microbiome. J Bacteriol 192(19):5002–5017.
13. Zhou Y, et al. (2013) Biogeography of the ecosystems of the healthy human body. Genome Biol 14(1):R1.
14. Huse SM, Welch DM, Morrison HG, Sogin ML (2010) Ironing out the wrinkles in the rare biosphere through improved OTU clustering. Environ Microbiol 12(7):1889–1898.
15. Schloss PD, et al. (2009) Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol 75(23):7537–7541.
16. Clarke K, Warwick R (2001) Change in marine communities: An approach to statistical analysis and interpretation (PRIMER-E, Plymouth, United Kingdom), 2nd Ed.
17. Clarke KR, Gorley RN (2006) PRIMER v6: User Manual/Tutorial (PRIMER-E, Plymouth, United Kingdom).
18. Davis JP, Youssef NH, Elshahed MS (2009) Assessment of the diversity, abundance, and ecological distribution of members of candidate division SR1 reveals a high level of phylogenetic diversity but limited morphotypic diversity. Appl Environ Microbiol 75(12):4139–4148.
19. Drummond AJ, et al. (2011) Geneious Pro (v5.6.3). Available at http://www.geneious.com.
20. Guindon S, Delsuc F, Dufayard JF, Gascuel O (2009) Estimating maximum likelihood phylogenies with PhyML. Methods Mol Biol 537:113–137.
21. Zerbino DR, Birney E (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18(5):821–829.
22. Gnerre S, et al. (2011) High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc Natl Acad Sci USA 108(4):1513–1518.
23. Markowitz VM, et al. (2012) IMG/M-HMP: A metagenome comparative analysis system for the Human Microbiome Project. PLoS ONE 7(7):e40151.
24. Markowitz VM, et al. (2012) IMG: The Integrated Microbial Genomes database and comparative analysis system. Nucleic Acids Res 40(Database issue, D1):D115–D122.
25. Hyatt D, et al. (2010) Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11(119):1–11.
26. Lowe TM, Eddy SR (1997) tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25(5):955–964.
27. Pruesse E, et al. (2007) SILVA: A comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res 35(21):7188–7196.
28. Markowitz VM, et al. (2008) The integrated microbial genomes (IMG) system in 2007: Data content and analysis tool extensions. Nucleic Acids Res 36(Database issue):D528–D533.
29. Stamatakis A (2006) RAxML-VI-HPC: Maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22(21):2688–2690.
30. Zhang Z, et al. (2012) Codon deviation coefficient: A novel measure for estimating codon usage bias and its statistical significance. BMC Bioinformatics 13:43.
31. Yang Z (1997) PAML: A program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci 13(5):555–556.
32. Armougom F, et al. (2006) Expresso: Automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res 34(Web Server issue):W604–W608.
33. Darriba D, Taboada GL, Doallo R, Posada D (2011) ProtTest 3: Fast selection of best-fit models of protein evolution. Bioinformatics 27(8):1164–1165.
34. Markowitz VM, et al. (2012) IMG/M: The integrated metagenome data management and comparative analysis system. Nucleic Acids Res 40(Database issue):D123–D129.
35. Punta M, et al. (2012) The Pfam protein families database. Nucleic Acids Res 40(Database issue, D1):D290–D301.
36. Eddy SR (2011) Accelerated profile HMM searches. PLOS Comput Biol 7(10):e1002195.
37. O’Donoghue P, et al. (2012) Near-cognate suppression of amber, opal and quadruplet codons competes with aminoacyl-tRNAPyl for genetic code expansion. FEBS Lett 586(21):3931–3937.
38. Zhang X, Bremer H (1995) Control of the Escherichia coli rrnB P1 promoter strength by ppGpp. J Biol Chem 270(19):11181–11189.
39. O’Donoghue P, Sheppard K, Nureki O, Söll D (2011) Rational design of an evolutionary precursor of glutaminyl-tRNA synthetase. Proc Natl Acad Sci USA 108(51):20485–20490.

[Fig. S1 image: SSU rRNA tree. Tip labels are GenBank accession numbers with their source environments (e.g., bovine rumen, human oral cavity, Lake Pavin water column, Sulfur River, geothermal springs), together with the HMP pyrosequence clades and SR1-OR1. Clades are labeled S. I, S. III, S. IV, S. VII, S. VIII, S. IX, Group BD2-14, Group BH1, and Candidate Division TM7. Scale bar, 0.03.]

Fig. S1. Phylogeny of candidate phylum SR1 based on SSU rRNA (neighbor joining, Jukes Cantor distances), using candidate phylum TM7 as outgroup. Subgroup definitions are based on ref. 18. All HMP SR1 pyrosequences clustered as two clades within subgroup III, which also includes the SR1-OR1 single cell. Bootstrap support values for major clades are shown at nodes (* is <50).
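
For readers unfamiliar with the Jukes Cantor correction named in this legend, the sketch below computes the standard corrected distance, d = −(3/4) ln(1 − 4p/3), where p is the proportion of differing sites between two aligned sequences. It is an illustrative sketch only: the function name is hypothetical, the handling of gaps and ambiguity codes (skipped here) is an assumption, and the implementation in Geneious may differ in such details.

```python
import math


def jukes_cantor_distance(seq_a: str, seq_b: str) -> float:
    """Jukes-Cantor corrected distance between two aligned DNA sequences.

    ASSUMPTION: alignment columns where either sequence has a character
    other than A, C, G, or T (gaps, ambiguity codes) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    # p: observed proportion of sites that differ
    p = sum(a != b for a, b in pairs) / len(pairs)
    if p >= 0.75:
        # The correction is undefined at or beyond saturation.
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # Toy sequences differing at 1 of 10 sites (p = 0.1).
    print(jukes_cantor_distance("ACGTACGTAC", "ACGTACGTAA"))
```

The corrected distance is always at least as large as the raw proportion p, because it accounts for multiple substitutions at the same site.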

[Fig. S2 image. (A) % ORFs vs. ORF size (codons) for SR1-OR1 recoded, SR1-OR1 standard code, and E. coli K12. (B) Number of UGA per ORF vs. ORF size for genes with UGA and genes without UGA (R2 = 0.58). (C) % proteins vs. % recoded for SR1-OR1 and M. capricolum.]

Fig. S2. ORF size distribution (in number of codons) in SR1-OR1 (A) according to the standard code (TGA, stop, red curve) or with TGA as a sense codon (blue). ORF size distribution in E. coli K12 is also shown (green). (B) Plot showing the frequency of TGA codons per ORF in SR1-OR1 vs. ORF size. Genes containing at least one TGA (red) and ORFs lacking in-frame TGAs (blue) are indicated. (C) To measure the extent of TGA reassignment in SR1-OR1 genes, the distribution (percent of proteins) of the percentage of Gly codons encoded by UGA (percent recoded) from individual protein coding genes is plotted. The distribution of the reassigned UGA to Trp codons in Mycoplasma capricolum is shown for comparison.
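
The two quantities plotted in this figure (ORF length under each genetic code, and the percentage of Gly codons encoded by UGA) can be illustrated with a toy calculation. This is a minimal sketch, not the analysis code used for the figure: the function names and the toy sequence are hypothetical, and reading starts at the first codon of the supplied sequence rather than searching for start codons.

```python
STOPS_STANDARD = frozenset({"TAA", "TAG", "TGA"})
STOPS_RECODED = frozenset({"TAA", "TAG"})          # TGA read as Gly
GLY_CODONS_RECODED = frozenset({"GGA", "GGC", "GGG", "GGT", "TGA"})


def _codons(seq: str) -> list:
    """Split a sequence into complete in-frame codons (frame 0)."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def orf_length_codons(seq: str, stops: frozenset) -> int:
    """Count sense codons from the first codon to the first in-frame stop."""
    length = 0
    for codon in _codons(seq):
        if codon in stops:
            break
        length += 1
    return length


def percent_gly_recoded(seq: str) -> float:
    """Percent of Gly codons (GGN plus TGA) that are TGA, with TGA = Gly."""
    gly = [c for c in _codons(seq) if c in GLY_CODONS_RECODED]
    if not gly:
        return 0.0
    return 100.0 * sum(c == "TGA" for c in gly) / len(gly)


if __name__ == "__main__":
    toy = "ATGGGATGAGGCAAATAA"  # hypothetical: ATG GGA TGA GGC AAA TAA
    print(orf_length_codons(toy, STOPS_STANDARD))  # truncated at TGA
    print(orf_length_codons(toy, STOPS_RECODED))   # reads through TGA
    print(percent_gly_recoded(toy))
```

Under the standard code the toy ORF terminates at the internal TGA, whereas with TGA as a sense codon it extends to the TAA, which is the effect that shifts the ORF size distribution in panel A.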

Fig. S3. Sequence and secondary structure comparison of tRNAGly variants. Secondary structures (Upper) are shown for the canonical tRNAGly(UCC) from SR1 (Left), SR1 tRNAGly(UCA) (Center), and SR1 tRNATrp (Right). GlyRS identity elements are highlighted in green, and TrpRS identity elements are highlighted in magenta. The two bases (G15, U47) that differ between SR1 and ACD80 tRNAGly(UCA) are circled in red. The D-arm and anticodon stem of tRNAGly(UCA) are divergent from canonical tRNAGly species (orange highlight, Upper and Lower). A DNA sequence alignment (Lower, colored according to sequence identity) of tRNAGly genes (i.e., tDNAGly) from SR1, ACD78, and ACD80 also includes canonical tDNAGly genes from related bacterium Rhodothermus marinus (Rm).

Feature                                                          Number     % of Total
DNA, total number of bases                                       1111604    100.00%
  DNA protein coding number of bases                             991242     89.17%
  DNA G+C number of bases                                        396380     35.71%¹
DNA scaffolds                                                    49         100.00%
Genes total number                                               1009       100.00%
  Protein coding genes                                           994        98.5%
  RNA genes                                                      15         1.5%
    5S rRNA                                                      1
    16S rRNA                                                     2
    23S rRNA                                                     1
    tRNA genes                                                   10
    Other RNA genes                                              1
Protein coding genes with function prediction                    572        57.55%
Protein coding genes without function prediction                 422        42.45%
Protein coding genes connected to Transporter Classification     69         6.94%
Protein coding genes connected to KEGG pathways                  178        17.91%
Protein coding genes connected to KEGG Orthology (KO)            314        31.59%
Protein coding genes connected to MetaCyc pathways               159        16.00%
Protein coding genes with COGs                                   555        55.84%
Protein coding genes with KOGs                                   235        23.64%
Protein coding genes with Pfam                                   577        58.05%
Protein coding genes with TIGRfam                                251        25.25%
Protein coding genes with IMG Terms                              141        14.19%
Protein coding genes in paralog clusters                         221        22.23%
Protein coding genes coding signal peptides                      307        30.89%
Protein coding genes coding transmembrane proteins               300        30.18%

[Fig. S4 (Lower) image: bar chart of the % of SR1 ORFs with top blast hits in the indicated taxa (y axis, 0–40%). Taxon labels recovered from the figure include PER, OD1, OP11, OD1-i, BD1-5, ACD-80, Aquificae, Unknown, Eukaryota, Caldiserica, Tenericutes, Euryarchaeota, Nanoarchaeota, Thermodesulfobacteria, and Deinococcus-Thermus.]

Fig. S4. General features of the annotated SR1-OR1 genome (Upper) and taxonomic distribution of protein homologs (top BLASTp hits) (Lower).

[Fig. S5 image: phylogenetic tree of Bacteria with species-level tip labels. Phylum-level labels include Proteobacteria, Deferribacteres, Chrysiogenetes, Acidobacteria, Chlorobi, Bacteroidetes, Caldithrix, WWE1, Verrucomicrobia, Planctomycetes, Cyanobacteria, Actinobacteria, Firmicutes, Chloroflexi, OP11, OD1, TM7, PER, BD1-5, SR1, ACD80, Fusobacteria, Thermi-Deinococci, Thermotogae, Dictyoglomi, Spirochaetes, and Aquificae. Scale bar, 0.3.]

Fig. S5. Maximum-likelihood phylogenetic classification of Bacteria using RNA polymerase protein sequences (β-β′ subunits). Node labels denote branch support. The cluster of candidate phyla that includes SR1 is highlighted.

[Fig. S6 image: radial phylogenetic tree. Family labels: I-A, I-B, I-CD, II, III, IV-YkrW, IV-DeepYkr, IV photosynthetic, IV non-photosynthetic. Scale bar, 0.4.]

Fig. S6. Maximum-likelihood phylogenetic tree of RubisCO and RubisCO-like protein families as defined by Tabita et al. (1). Bootstrap support is indicated for the major clades. The family containing the SR1-OR1 gene is highlighted.

1. Tabita FR, et al. (2007) Function, structure, and evolution of the RubisCO-like proteins and their RubisCO homologs. Microbiol Mol Biol Rev 71(4):576–599.

1 10 20 30 40 50 60 70 80 90 100
1. Rhodobacter capsulatus MDQSN------RYARL D LKEADLIAGGRHVL CAYIMKPKAGYGYL ETAAHFAAESSTGTNVEVSTTDDFTRGVDALVYEIDPEKEIMKIAYPVEL FDRNII
2. Methanohalophilus mahii MSSKIEELVQSLNPKQQGYVNL E L PDPTNGEYLLTVFRLVPGGEMNML QAAAEIAAESSTGTNFRVNTETKFSKVMNALVYQMDL ERELVWIAYPWRL FDRGG-
3. Methanomethylovorans hollandica MNPIYADLVASLNPKQQSYVNL Q L PDPYNGEYLLSVFHLVPEGKLNIL QAAAETAAESSTGTNFKVNTETPFSKTMNALVYKL D L EQNLVWIAYPWRL FDRGG-
4. Methanosalsum zhilinae MNSI-DDLVQSLNSKQQAYVNLN L EDPENGEYLLGVFHLIPGGKMNIL QAASEVAAESSTGTNFKVNTETAFSRTMNALVYKL D L DKNLVWIAYPWRL FDRGG-
5. Methanococcoides burtonii MSLIYEDLVKSL DSKQQAYVDLK L PDPTNGEFLLAVFHMIPGGDLNVL QAAAEIAAESSTGTNIKVSTETAFSRTMNARVYQL D L ERELVWIAYPWRL FDRGG-
6. SR1-OR1 MSNV-KDLLPTLNDHQLAYVNL D L PNPKNGQYMLIAAHFQPGKSMNIL QAACEVAAESSTGTNFLVETETAFSKEMNALVYKVDVEKELIWIAYPWRL FDRGG-
7. ACD80 MDIKELRATLNSHQAAYVNL D L PNPQNGEYML CAFHLVPGEGLNFL QAACEVSAESSTGTNFLVRTETPFSREMNSLVYKIDL ERNLVWIAYPRRL FDRGG-
8. PER_ACD51 MQKEYLK LGFDPIAAGNYMLVVFHLVPGEGRDLLDAASEVAAESSTGSNL TIGTATEFSKSMDALVYKIDEAKNLVWIAYPVDIFDRGG-
9. PER_ACD65 MQKEYINLK LNPLKGGKYMLAVFHLVPKPGEDFLSCASEVASESSTGSNLRVGTATKFSDNLNAIVYKIDKKKNLVWIAFPWKIFDRGG-

110 120 130 140 150 160 170 180 190 200 210
1. Rhodobacter capsulatus DGGAML CSFL T L TIGNNQGMGDVEYAKMHDFYVPPCYLR L FDGPSMNIADMWRVLGRPVVDGGMVVGTIIKPKLG LRPKPFADACYEFWLGG- DFIKNDEPQGNQ
2. Methanohalophilus mahii ----NVQNIL TYIVGNVLGMKEISALK LLDVWFPPSML EQYDGPGFTVDDMRSYLG--VYDRP-ILGTIVKPKMGL TSAEYAEVCYDFWAGGGDFVKNDEPQANQ
3. Methanomethylovorans hollandica ----NIQNIL TYIVGNILGIKEIKALK LLDVWFPSSML EQYDGPSYTL DDMRKYLG--VYDRP-ILGTIVKPKMGL TSAEYAEVCYDFWAGGGDFVKNDEPQANQ
4. Methanosalsum zhilinae ----NVQNIL TYIVGNILGMKEISALK LLDVWFPPAML EQYDGPSYTL DYMRQYLG--VYDRP-ILGTIVKPKMGL TSAEYAEVCYDFWTGGGDFVKNDEPQANQ
5. Methanococcoides burtonii ----NVQNIL TYIIGNILGMKEIQALK LMDIWFPPSML EQYDGPSYTVDDMRKYL D--VYDRP-ILGTIVKPKMGL TSAEYAEVCYDFWVGGGDFVKNDEPQANQ
6. SR1-OR1 ----NIQNIL TYIAGNVFGMKEVKALKIL DVWFPPAML EQYDGPSYTLADMRKYLN--VYNRP-ILGTIIKPKMGL TSAEYAEVCYDFWVGGGDFVKNDEPQANQ
7. ACD80 ----NVQNIL TYIVGNVLG MKQINALK LLDVRFPSSML EQYDGPSYTL DDMRKYL D--VYERP-ILGTIIKPKMGL TSAEYAEVAYDFRVGGGDFVKNDEPQANQ
8. PER_ACD51 ----NVQNIL TYIVGNVFGMADVKAIKAL DCWFPPEMLKNYDGPYTTIGDMKKYLGIDGDARP-VLGTIIKPKIGLKTDEFADVCYRFWKGGGDFVKFDEPQADQ
9. PER_ACD65 ----NVQNIL TYVVGNVFGMGDLSALKAL DCWFPKEML EHYDGPATTIHDLKKYLG--VKGRP-VLGTIVKPKIGLKPKQFADVCYKFWKGGGDFVKFDEPQADQ

220 230 240 250 260 270 280 290 300 310
1. Rhodobacter capsulatus TFAPLKETIRLVADAMKRAQDETGEAKL FSANITADDHYEMVARGEYIL ETFGNADHVAFLVDGYVTGPAAITTARRSFPRQFL HYHRAGHGAVTSPQSMRGYTA
2. Methanohalophilus mahii DFCPYDKMVKHVKEAMDKAVKETGKKKVHSFNVSAPDFDTMIERCEMIRNAGFEPGSYAFLIDGITAGWMAVQTIRRRYPDVFL HFHRAAHGAFTRPENPIGFSV
3. Methanomethylovorans hollandica DFCPYDKMVKHVKEAMDKAVKETGRNKVHSFNVSAADFDTMIERCEMIVNAGFEPGSYAFLIDGITAGWMAVQTLRRRYPGVFIHFHRAAHGAFTRPENPIGFSV
4. Methanosalsum zhilinae DFCPYEKMVMHVKEAMDKAVRETGQKKVHSFNVSAADFDIMIQRCEMIRNAGFEPGSYAFLIDGITAGWMAVQTLRRRYPDVFIHYHRAGHGGFTRPENPIGFSV
5. Methanococcoides burtonii DFCPYEKMVAHVKEAMDKAVKETG QKKVHSFNVSAADFDTMIERCEMITNAGFEPGSYAFLIDGITAGWMAVQTLRRRYPDVFL HFHRAAHGAFTRQENPIGFSV
6. SR1-OR1 DFCPYDKMVKHVKEAMDKAVKETG QKKVHSFNVSAADYDTMIERCEMIRNAGFEPGSYAFLIDGTTAGWMAVQTLRRKYPDVFIHFHRAGHGAFTRPENPIGYSV
7. ACD80 DFCPYDKMVKHIAEAMAKAVKETGHKKVHSFNVSAADYDTMISRCEMIKNSG MEAGSYAFLIDGTTAGWMAVQTLRRKYPDVFIHFHRAGHGAFTRPENPIGFTV
8. PER_ACD51 VFCPFEDAVKAIAKKMEQVRKETGKNKVMSFNISAADFMTMQKRAEIVMK-YMEKGSYAFLVDGL TAGWTAVQTARRMWPDVFL HFHRAGHGAMTREENPIGYTV
9. PER_ACD65 EFCPFKEAIDEIVKAMAKVEKETGKKKVMSINISAADFMTMQKRAEYVIK-KMKKGSYAFLVDGL TAGWTAVQTARRMWPGVFL HFHRAGHGAMTRPENPIGYTV

320 330 340 350 360 370 380 390 400 410 420
1. Rhodobacter capsulatus FVLSKMSRL QGASGIHTGTMGYGKMEGDAS-DKIMAYMLNDEAAQGPFYHQDWL------GMKATTPIISGGMNALR L PGFF
2. Methanohalophilus mahii LV LSKFARLAGASGIHTGTAGVGKMKGTPEEDVVAAHGIQYLSSHGHFFDQSWAKIMETDKDAIELVNEDIAHHVIL EKDSWRGMKKCCPIVSGGLNPVRLKPFI
3. Methanomethylovorans hollandica LV LSKFARLAGVSGIHTGTAGVGKMKGTPEEDVVAAHGIL YMRSKGHFFEQTWTKIPENDKDAMHLVSEDSAHHVIL EEDSWRGVKKCCPIVSGGLNPIRLKPFI
4. Methanosalsum zhilinae LV LSKFARLAGASGIHTGTAGVGKMKGTPEEDVVAAHGIL YFKSKGHFFEQIWTKIMESDKDAVNIVNEDMSRHIIL EDDSWRGVKKCCPIVSGGLNPVRLKPFI
5. Methanococcoides burtonii LV LSKFARLAGASGIHTGTAGIGKMKGTPAEDVVAAHSIQYLKSPGHFFEQTWSKIMDTDKDVINLVNEDLAHHVIL EDDSWRAMKKCCPIVSGGLNPVKLKPFI
6. SR1-OR1 LV LSKFARLAGASGIHTGTAGVGKMAGDKAEDVTAAEGIRYMKKTGHIFEQSWGTIPETDKDFVALVQKDEDNEEVL TDDSWRAIKTCCPIISGGLNPTLLKPFI
7. ACD80 LV LSKFARLAGASGIHTGTAGVGKMAGSPEEDITAAHNILN LVAEGHIFHQSRGTIPETDDDFIRDIQEDIAHHTVL QDDSRRAVKKCCPIISGGLNPTLLKPFI
8. PER_ACD51 EVL TKFGRLAGASGMHTGTAGIGKMAGDGDTDVRAAHLA L DKVASGPFFEQDWG------DMKPMCPIASGGLNPVLLKPFA
9. PER_ACD65 PFMTKMGRLAGASGMHTGTAGIGKMEGSAKEDVMAAHHAL FAKSEGDFFDQDWY------GMKPMCPIASGGLNPILLKPFA

430 440 450 460 470 480 490 500 503
1. Rhodobacter capsulatus DNLGHSNVIQTSGGGAFGHL DGATAGAKSLRQSCDAWKAGVDLVTYAKSHRELARAFESFPNDADKL YPGWRVALGVN
2. Methanohalophilus mahii DVMGNVDFITTMGSGVHAHPEGTRSGAKALIQACDAYL QGIDIKDYAKNHREL EQAIEFFPEK
3. Methanomethylovorans hollandica DVMGNVDFITTMGSGVHAHPGGTKDGAKALVQACDAYLNKMDIAEYAREHSELAQAIDHFTKAQKDAM
4. Methanosalsum zhilinae DVMGNVDFITTMGSGVHSHPEGTKAGAKALVQACDAYLKGIDIEEYAKDHNELAQSL EYFSKAKEGV
5. Methanococcoides burtonii DVMENVDFITTMGSGVHSHPGGTQSGAKALVQACDAYL QGMDIEEYAKDHKELAEAIEFYLNR
6. SR1-OR1 D LMGNVDFITTMGAGCHAHPKGTQSGAKALVQACEAYQKGVDIHEYAKNKPELAEAIEFFEKPSTKEKMKTLRIAAC
7. ACD80 DVMGNVDFITTMGAG CHAHPGGTQKGATALVQSCEAYKAG IDIHEYAKTHKELAQAIEFFEKNLNKDIKHAERNEIS
8. PER_ACD51 DVIGTVDFITTMGGGVHSHPSGTEKGAMALVQACEAWKQKIDMNEYAKTHAELGQAVEFYKEHVEYTKKYAGK
9. PER_ACD65 DVVGTTDFITTMGGGVHSHPGGTEKGAMALVQACDAWKKGISIKEYAKNHKELAQAIGFYKEKVGYSKKYL

Fig. S7. Alignment of family IIB RubisCO proteins compared with the Rhodobacter capsulatus sequence (family 2). The four sites where SR1 genes encode Gly by TGA are indicated (red star). Based on the activity of the ACD80 tRNA(UCA) (Fig. 5), we assume TGA also encodes Gly in ACD80. The loop region characteristic of the methanogen sequences and of the SR1 and ACD80 RubisCOs is indicated by a rectangle.


Fig. S8. Maximum-likelihood phylogenies of DeoA-encoded AMP phosphorylase (A) and eIF-2B-encoded ribose-1,5-bisphosphate isomerase (B) homologs of SR1-OR1.

Table S1. HMP metagenomes and scaffolds representing uncultured SR1 bacteria

Metagenome no. | Scaffold name | Metagenome IMG ID | Metagenome name | Scaffold names | Sample source | HMP human subject
1 | C2262299 | 7000000416 | Human supragingival plaque microbiome from visit no. 1 of subject 158944319 | Supragingival plaque microbiome from human subject 700015552: C2262299 | Supragingival plaque | 700015552
1 | SRS011343_Baylor_scaffold_14821_Supragingival | | | Supragingival plaque microbiome from human subject 700015552: SRS011343_Baylor_scaffold_14821 | |
1 | SRS011343_Baylor_scaffold_70725 | | | Supragingival plaque microbiome from human subject 700015552: SRS011343_Baylor_scaffold_70725 | |
1 | SRS011343_Baylor_scaffold_70851_Supragingival | | | Supragingival plaque microbiome from human subject 700015552: SRS011343_Baylor_scaffold_70851 | |
1 | SRS011343_Baylor_scaffold_70917_Supragingival | | | Supragingival plaque microbiome from human subject 700015552: SRS011343_Baylor_scaffold_70917 | |
1 | SRS011343_Baylor_scaffold_70982_Supragingival | | | Supragingival plaque microbiome from human subject 700015552: SRS011343_Baylor_scaffold_70982 | |
1 | SRS011343_Baylor_scaffold_71184_Supragingival | | | Supragingival plaque microbiome from human subject 700015552: SRS011343_Baylor_scaffold_71184 | |
1 | SRS011343_Baylor_scaffold_71185 | | | Supragingival plaque microbiome from human subject 700015552: SRS011343_Baylor_scaffold_71185 | |
2 | C3005194_Supragingival | 7000000132 | Human supragingival plaque microbiome from visit no. 1 of subject 158337416 | Supragingival plaque microbiome from human subject 700013727: C3005194 | Supragingival plaque | 700013727
2 | C3005288_Supragingival | | | Supragingival plaque microbiome from human subject 700013727: C3005288 | |
2 | C3005539 | | | Supragingival plaque microbiome from human subject 700013727: C3005539 | |
3 | C3070136_Supragingival | 7000000692 | Human supragingival plaque microbiome from visit no. 1 of subject 160643649 | Supragingival plaque microbiome from human subject 700095225: C3070136 | Supragingival plaque | 700095225
3 | C3070294 | | | Supragingival plaque microbiome from human subject 700095225: C3070294 | |
3 | C3070382 | | | Supragingival plaque microbiome from human subject 700095225: C3070382 | |
3 | SRS019980_Baylor_scaffold_61030 | | | Supragingival plaque microbiome from human subject 700095225: SRS019980_Baylor_scaffold_61030 | |
3 | SRS019980_Baylor_scaffold_61505_Supragingival | | | Supragingival plaque microbiome from human subject 700095225: SRS019980_Baylor_scaffold_61505 | |
3 | SRS019980_Baylor_scaffold_61572 | | | Supragingival plaque microbiome from human subject 700095225: SRS019980_Baylor_scaffold_61572 | |
3 | SRS019980_Baylor_scaffold_61591 | | | Supragingival plaque microbiome from human subject 700095225: SRS019980_Baylor_scaffold_61591 | |
4 | C3705446 | 7000000518 | Human supragingival plaque microbiome from visit no. 2 of subject 159611913 | Supragingival plaque microbiome from human subject 700101940: C3705446 | Supragingival plaque | 700101940
4 | C3705542 | | | Supragingival plaque microbiome from human subject 700101940: C3705542 | |
5 | SRS013252_Baylor_scaffold_64588 | 7000000046 | Human Supragingival plaque microbiome from visit no. 1 of subject 159591683 | Supragingival plaque microbiome from human subject 700015959: SRS013252_Baylor_scaffold_64588 | Supragingival plaque | 700015959
5 | SRS013252_Baylor_scaffold_65261_Supragingival | | | Supragingival plaque microbiome from human subject 700015959: SRS013252_Baylor_scaffold_65261 | |
5 | SRS013252_Baylor_scaffold_65403_Supragingival | | | Supragingival plaque microbiome from human subject 700015959: SRS013252_Baylor_scaffold_65403 | |
6 | SRS015044_WUGC_scaffold_29101 | 7000000742 | Human supragingival plaque microbiome from visit no. 1 of subject 763901136 | Supragingival plaque microbiome from human subject 700023699: SRS015044_WUGC_scaffold_29101 | Supragingival plaque | 700023699
6 | SRS015044_WUGC_scaffold_68722 | | | Supragingival plaque microbiome from human subject 700023699: SRS015044_WUGC_scaffold_68722 | |
6 | SRS015044_WUGC_scaffold_71866_Supragingival | | | Supragingival plaque microbiome from human subject 700023699: SRS015044_WUGC_scaffold_71866 | |
6 | SRS015044_WUGC_scaffold_71908 | | | Supragingival plaque microbiome from human subject 700023699: SRS015044_WUGC_scaffold_71908 | |
7 | SRS017209_Baylor_scaffold_9706_Tongue | 7000000195 | Human tongue dorsum microbiome from visit no. 1 of subject 159814214 | Tongue dorsum microbiome from human subject 700033815: SRS017209_Baylor_scaffold_9706 | Tongue (dorsum) | 700033815
7 | SRS017209_Baylor_scaffold_70572 | | | Tongue dorsum microbiome from human subject 700033815: SRS017209_Baylor_scaffold_70572 | |
7 | SRS017209_Baylor_scaffold_54029_Tongue | | | Tongue dorsum microbiome from human subject 700033815: SRS017209_Baylor_scaffold_54029 | |
7 | SRS050669_LANL_scaffold_43937 | 7000000737 | Human tongue dorsum microbiome from visit no. 2 of subject 159814214 | Tongue dorsum microbiome from human subject 700107046: SRS050669_LANL_scaffold_43937 | Tongue (dorsum) | 700107046
7 | SRS050669_LANL_scaffold_68248_Tongue | | | Tongue dorsum microbiome from human subject 700107046: SRS050669_LANL_scaffold_68248 | |
8 | SRS022536_LANL_scaffold_8504 | 7000000005 | Human supragingival plaque microbiome from visit no. 1 of subject 809635352 | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_8504 | Supragingival plaque | 700098441
8 | SRS022536_LANL_scaffold_78601_Supragingival | | | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_78601 | |
8 | SRS022536_LANL_scaffold_67292 | | | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_67292 | |
8 | SRS022536_LANL_scaffold_61331 | | | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_61331 | |
8 | SRS022536_LANL_scaffold_48442_Supragingival | | | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_48442 | |
8 | SRS022536_LANL_scaffold_126111_Supragingival | | | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_126111 | |
8 | SRS022536_LANL_scaffold_117075 | | | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_117075 | |
8 | SRS022536_LANL_scaffold_113207 | | | Supragingival plaque microbiome from human subject 700098441: SRS022536_LANL_scaffold_113207 | |
8 | SRS065099_LANL_scaffold_62287 | 7000000074 | Human supragingival plaque microbiome from visit no. 2 of subject 809635352 | Supragingival plaque microbiome from human subject 700110824: SRS065099_LANL_scaffold_62287 | Supragingival plaque | 700110824
8 | SRS065099_LANL_scaffold_100854 | | | Supragingival plaque microbiome from human subject 700110824: SRS065099_LANL_scaffold_100854 | |
9 | SRS022725_LANL_scaffold_9371 | 7000000018 | Human supragingival plaque microbiome from visit no. 1 of subject 370425937 | Supragingival plaque microbiome from human subject 700098681: SRS022725_LANL_scaffold_9371 | Supragingival plaque | 700098681
9 | SRS022725_LANL_scaffold_110555_Supragingival | | | Supragingival plaque microbiome from human subject 700098681: SRS022725_LANL_scaffold_110555 | |
9 | SRS022725_LANL_scaffold_114175_Supragingival | | | Supragingival plaque microbiome from human subject 700098681: SRS022725_LANL_scaffold_114175 | |
9 | SRS022725_LANL_scaffold_12269_Supragingival | | | Supragingival plaque microbiome from human subject 700098681: SRS022725_LANL_scaffold_12269 | |
9 | SRS022725_LANL_scaffold_13241 | | | Supragingival plaque microbiome from human subject 700098681: SRS022725_LANL_scaffold_13241 | |
9 | SRS022725_LANL_scaffold_9371 | | | Supragingival plaque microbiome from human subject 700098681: SRS022725_LANL_scaffold_9371 | |
9 | SRS053917_LANL_scaffold_2585 | 7000000510 | Human supragingival plaque microbiome from visit no. 2 of subject 370425937 | Supragingival plaque microbiome from human subject 700109395: SRS053917_LANL_scaffold_2585 | Supragingival plaque | 700109395
9 | SRS053917_LANL_scaffold_56053_Supragingival | | | Supragingival plaque microbiome from human subject 700109395: SRS053917_LANL_scaffold_56053 | |
9 | SRS053917_LANL_scaffold_92278_Supragingival | | | Supragingival plaque microbiome from human subject 700109395: SRS053917_LANL_scaffold_92278 | |
9 | SRS053917_LANL_scaffold_9773_Supragingival | | | Supragingival plaque microbiome from human subject 700109395: SRS053917_LANL_scaffold_9773 | |
10 | SRS024087_LANL_scaffold_38636_Supragingival | 7000000288 | Human supragingival plaque microbiome from visit no. 2 of subject 159510762 | Supragingival plaque microbiome from human subject 700100552: SRS024087_LANL_scaffold_38636 | Supragingival plaque | 700100552
10 | SRS024087_LANL_scaffold_42263 | | | Supragingival plaque microbiome from human subject 700100552: SRS024087_LANL_scaffold_42263 | |
11 | SRS052876_LANL_scaffold_2021 | 7000000274 | Human supragingival plaque microbiome from visit no. 1 of subject 737052003 | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_2021 | Supragingival plaque | 700106074
11 | SRS052876_LANL_scaffold_9886_Supragingival | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_9886 | |
11 | SRS052876_LANL_scaffold_7735 | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_7735 | |
11 | SRS052876_LANL_scaffold_35366_Supragingival | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_35366 | |
11 | SRS052876_LANL_scaffold_27609 | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_27609 | |
11 | SRS052876_LANL_scaffold_27256_Supragingival | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_27256 | |
11 | SRS052876_LANL_scaffold_24581 | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_24581 | |
11 | SRS052876_LANL_scaffold_23188_Supragingival | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_23188 | |
11 | SRS052876_LANL_scaffold_20649_Supragingival | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_20649 | |
11 | SRS052876_LANL_scaffold_17767_Supragingival | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_17767 | |
11 | SRS052876_LANL_scaffold_17591_Supragingival | | | Supragingival plaque microbiome from human subject 700106074: SRS052876_LANL_scaffold_17591 | |
12 | SRS055378_LANL_scaffold_14252 | 7000000115 | Human supragingival plaque microbiome from visit no. 2 of subject 763860675 | Supragingival plaque microbiome from human subject 700103439: SRS055378_LANL_scaffold_14252 | Supragingival plaque | 700103439
12 | SRS055378_LANL_scaffold_95017_Supragingival | | | Supragingival plaque microbiome from human subject 700103439: SRS055378_LANL_scaffold_95017 | |
12 | SRS055378_LANL_scaffold_93022_Supragingival | | | Supragingival plaque microbiome from human subject 700103439: SRS055378_LANL_scaffold_93022 | |
12 | SRS055378_LANL_scaffold_14252 | | | Supragingival plaque microbiome from human subject 700103439: SRS055378_LANL_scaffold_14252 | |
13 | SRS063288_LANL_scaffold_28214_Tongue | 7000000159 | Human tongue dorsum microbiome from visit no. 2 of subject 809635352 | Tongue dorsum microbiome from human subject 700110818: SRS063288_LANL_scaffold_28214 | Tongue (dorsum) | 700110818
13 | SRS063288_LANL_scaffold_54583 | | | Tongue dorsum microbiome from human subject 700110818: SRS063288_LANL_scaffold_54583 | |
13 | SRS063288_LANL_scaffold_63707 | | | Tongue dorsum microbiome from human subject 700110818: SRS063288_LANL_scaffold_63707 | |

Table S2. Pfams of 139 conserved single-copy genes used for genome completeness estimation

Pfam ID | HMM name | HMM cut-off | Description
PF03485.11 | Arg_tRNA_synt_N | 32.55 | Arginyl tRNA synthetase N-terminal
PF03484.10 | B5 | 26.55 | tRNA synthetase B5 domain
PF01121.15 | CoaE | 84.65 | Dephospho-CoA kinase
PF03772.11 | Competence | 82.7 | Competence protein
PF03602.10 | Cons_hypoth95 | 38.8 | Conserved hypothetical protein 95
PF06418.9 | CTP_synth_N | 208 | CTP synthase N terminus
PF02224.13 | Cytidylate_kin | 79.35 | Cytidylate kinase
PF00712.14 | DNA_pol3_beta | 48.7 | DNA polymerase III β-subunit, N-terminal domain
PF02767.11 | DNA_pol3_beta_2 | 49.65 | DNA polymerase III β-subunit, central domain
PF02768.10 | DNA_pol3_beta_3 | 44.5 | DNA polymerase III β-subunit, C-terminal domain
PF00035.20 | dsrm | 20.2 | Double-stranded RNA binding motif
PF00889.14 | EF_TS | 108.35 | Elongation factor TS
PF01176.14 | eIF-1a | 39.25 | Translation initiation factor 1A/IF-1
PF00113.17 | Enolase_C | 206.35 | Enolase, C-terminal TIM barrel domain
PF03952.11 | Enolase_N | 92.45 | Enolase, N-terminal domain
PF06574.7 | FAD_syn | 73.25 | FAD synthetase
PF03147.9 | FDX-ACB | 39.75 | Ferredoxin-fold anticodon binding domain
PF01687.12 | Flavokinase | 59.3 | Riboflavin kinase
PF02938.9 | GAD | 40.05 | GAD domain
PF02527.10 | GidB | 77.85 | rRNA small subunit methyltransferase G
PF00958.17 | GMP_synt_C | 61.3 | GMP synthase C-terminal domain
PF01025.14 | GrpE | 72.25 | GrpE
PF01018.17 | GTP1_OBG | 98.9 | GTP1/OBG
PF11987.3 | IF-2 | 61.7 | Translation-initiation factor 2
PF04760.10 | IF2_N | 29.95 | Translation initiation factor IF-2, N-terminal region
PF00707.17 | IF3_C | 51.55 | Translation initiation factor IF-3, C-terminal domain
PF05198.11 | IF3_N | 47.95 | Translation initiation factor IF-3, N-terminal domain
PF01715.12 | IPPT | 129.15 | IPP transferase
PF06421.7 | LepA_C | 79.7 | GTP-binding protein LepA C terminus
PF01795.14 | Methyltransf_5 | 174.05 | MraW methylase family
PF02873.11 | MurB_C | 46.15 | UDP-N-acetylenolpyruvoylglucosamine reductase, C-terminal domain
PF08529.6 | NusA_N | 61.55 | NusA N-terminal domain
PF02410.10 | Oligomerisation | 44.9 | Oligomerisation domain
PF01195.14 | Pept_tRNA_hydro | 99.15 | Peptidyl-tRNA hydrolase
PF01252.13 | Peptidase_A8 | 61.35 | Signal peptidase (SPase) II
PF00162.14 | PGK | 236 | Phosphoglycerate kinase
PF02912.13 | Phe_tRNA-synt_N | 33.55 | Aminoacyl tRNA synthetase class II, N-terminal domain
PF03726.9 | PNPase | 25.6 | Polyribonucleotide nucleotidyltransferase, RNA binding domain
PF01416.15 | PseudoU_synth_1 | 42.45 | tRNA pseudouridine synthase
PF02033.13 | RBFA | 44.9 | Ribosome-binding factor A
PF00154.16 | RecA | 276.55 | recA bacterial DNA recombination protein
PF02132.10 | RecR | 21.25 | RecR protein
PF00825.13 | Ribonuclease_P | 40.55 | Ribonuclease P
PF00687.16 | Ribosomal_L1 | 75.1 | Ribosomal protein L1p/L10e family
PF00466.15 | Ribosomal_L10 | 37.3 | Ribosomal protein L10
PF00298.14 | Ribosomal_L11 | 42.65 | Ribosomal protein L11, RNA binding domain
PF03946.9 | Ribosomal_L11_N | 45.9 | Ribosomal protein L11, N-terminal domain
PF00542.14 | Ribosomal_L12 | 44 | Ribosomal protein L7/L12 C-terminal domain
PF00572.13 | Ribosomal_L13 | 81.3 | Ribosomal protein L13
PF00238.14 | Ribosomal_L14 | 79.55 | Ribosomal protein L14p/L23e
PF00252.13 | Ribosomal_L16 | 77.45 | Ribosomal protein L16p/L10e
PF01196.14 | Ribosomal_L17 | 55.75 | Ribosomal protein L17
PF00828.14 | Ribosomal_L18e | 37.55 | Ribosomal protein L18e/L15
PF00861.17 | Ribosomal_L18p | 54.85 | Ribosomal L18p/L5e family
PF01245.15 | Ribosomal_L19 | 71.7 | Ribosomal protein L19
PF00181.18 | Ribosomal_L2 | 52.45 | Ribosomal Proteins L2, RNA binding domain
PF03947.13 | Ribosomal_L2_C | 87.3 | Ribosomal Proteins L2, C-terminal domain
PF00453.13 | Ribosomal_L20 | 70.5 | Ribosomal protein L20
PF00829.16 | Ribosomal_L21p | 53.45 | Ribosomal prokaryotic L21 protein
PF00237.14 | Ribosomal_L22 | 57.6 | Ribosomal protein L22p/L17e
PF00276.15 | Ribosomal_L23 | 37.45 | Ribosomal protein L23
PF01016.14 | Ribosomal_L27 | 55.4 | Ribosomal L27 protein
PF00830.14 | Ribosomal_L28 | 36.8 | Ribosomal L28 family
PF00831.18 | Ribosomal_L29 | 33.7 | Ribosomal L29 protein
PF00297.17 | Ribosomal_L3 | 66.35 | Ribosomal protein L3
PF01783.18 | Ribosomal_L32p | 28.2 | Ribosomal L32p protein family
PF01632.14 | Ribosomal_L35p | 29.5 | Ribosomal protein L35
PF00573.17 | Ribosomal_L4 | 99.95 | Ribosomal protein L4/L1 family
PF00281.14 | Ribosomal_L5 | 34.55 | Ribosomal protein L5
PF00673.16 | Ribosomal_L5_C | 61.25 | Ribosomal L5P family C terminus
PF00347.18 | Ribosomal_L6 | 53.25 | Ribosomal protein L6
PF03948.9 | Ribosomal_L9_C | 37.95 | Ribosomal protein L9, C-terminal domain
PF01281.14 | Ribosomal_L9_N | 30.55 | Ribosomal protein L9, N-terminal domain
PF00338.17 | Ribosomal_S10 | 58.05 | Ribosomal protein S10p/S20e
PF00411.14 | Ribosomal_S11 | 73.45 | Ribosomal protein S11
PF00164.20 | Ribosomal_S12 | 84.3 | Ribosomal protein S12
PF00416.17 | Ribosomal_S13 | 56.2 | Ribosomal protein S13/S18
PF00312.17 | Ribosomal_S15 | 41.7 | Ribosomal protein S15
PF00886.14 | Ribosomal_S16 | 36.25 | Ribosomal protein S16
PF00366.15 | Ribosomal_S17 | 41.3 | Ribosomal protein S17
PF01084.15 | Ribosomal_S18 | 36.3 | Ribosomal protein S18
PF00203.16 | Ribosomal_S19 | 54.9 | Ribosomal protein S19
PF00318.15 | Ribosomal_S2 | 133.15 | Ribosomal protein S2
PF01649.13 | Ribosomal_S20p | 35.95 | Ribosomal protein S20
PF00189.15 | Ribosomal_S3_C | 49.5 | Ribosomal protein S3, C-terminal domain
PF00163.14 | Ribosomal_S4 | 32.9 | Ribosomal protein S4/S9 N-terminal domain
PF00333.15 | Ribosomal_S5 | 42.55 | Ribosomal protein S5, N-terminal domain
PF03719.10 | Ribosomal_S5_C | 44.9 | Ribosomal protein S5, C-terminal domain
PF01250.12 | Ribosomal_S6 | 45.2 | Ribosomal protein S6
PF00177.16 | Ribosomal_S7 | 95.55 | Ribosomal protein S7p/S5e
PF00410.14 | Ribosomal_S8 | 72.35 | Ribosomal protein S8
PF00380.14 | Ribosomal_S9 | 70.4 | Ribosomal protein S9/S16
PF01782.13 | RimM | 30.65 | RimM N-terminal domain
PF01000.21 | RNA_pol_A_bac | 34.3 | RNA polymerase Rpb3/RpoA insert domain
PF03118.10 | RNA_pol_A_CTD | 38.9 | Bacterial RNA polymerase, α-chain C-terminal domain
PF01193.19 | RNA_pol_L | 34.35 | RNA polymerase Rpb3/Rpb11 dimerization domain
PF04997.7 | RNA_pol_Rpb1_1 | 157.25 | RNA polymerase Rpb1, domain 1
PF00623.15 | RNA_pol_Rpb1_2 | 79.8 | RNA polymerase Rpb1, domain 2
PF04983.13 | RNA_pol_Rpb1_3 | 41.85 | RNA polymerase Rpb1, domain 3
PF05000.12 | RNA_pol_Rpb1_4 | 24.6 | RNA polymerase Rpb1, domain 4
PF04998.12 | RNA_pol_Rpb1_5 | 119.4 | RNA polymerase Rpb1, domain 5
PF04563.10 | RNA_pol_Rpb2_1 | 39.95 | RNA polymerase beta subunit
PF04561.9 | RNA_pol_Rpb2_2 | 37.15 | RNA polymerase Rpb2, domain 2
PF04565.11 | RNA_pol_Rpb2_3 | 45.6 | RNA polymerase Rpb2, domain 3
PF10385.4 | RNA_pol_Rpb2_45 | 36.6 | RNA polymerase beta subunit external 1 domain
PF00562.23 | RNA_pol_Rpb2_6 | 258.45 | RNA polymerase Rpb2, domain 6
PF04560.15 | RNA_pol_Rpb2_7 | 46.45 | RNA polymerase Rpb2, domain 7
PF01765.14 | RRF | 101.65 | Ribosome recycling factor
PF07499.8 | RuvA_C | 16.6 | RuvA, C-terminal domain
PF01330.16 | RuvA_N | 26.85 | RuvA N-terminal domain
PF05491.8 | RuvB_C | 49.5 | Holliday junction DNA helicase ruvB C terminus
PF02773.11 | S-AdoMet_synt_C | 108.9 | S-adenosylmethionine synthetase, C-terminal domain
PF02772.11 | S-AdoMet_synt_M | 77.05 | S-adenosylmethionine synthetase, central domain
PF00584.15 | SecE | 27.5 | SecE/Sec61-γ subunits of protein translocation complex
PF03840.9 | SecG | 29.2 | Preprotein translocase SecG subunit
PF00344.15 | SecY | 180.7 | SecY translocase
PF02403.17 | Seryl_tRNA_N | 43.15 | Seryl-tRNA synthetase N-terminal domain
PF01668.13 | SmpB | 43.4 | SmpB protein
PF02978.14 | SRP_SPB | 52.8 | Signal peptide binding domain
PF00763.18 | THF_DHG_CYH | 62.2 | Tetrahydrofolate dehydrogenase/cyclohydrolase, catalytic domain
PF02882.14 | THF_DHG_CYH_C | 103.7 | Tetrahydrofolate dehydrogenase/cyclohydrolase, NAD(P)-binding domain
PF00121.13 | TIM | 140.2 | Triosephosphate isomerase
PF08275.6 | Toprim_N | 58.6 | DNA primase catalytic core, N-terminal domain
PF03461.10 | TRCF | 40.05 | TRCF domain
PF05698.9 | Trigger_C | 55.15 | Bacterial trigger factor protein (TF) C terminus
PF05697.8 | Trigger_N | 61.8 | Bacterial trigger factor protein (TF)
PF01746.16 | tRNA_m1G_MT | 80.75 | tRNA (Guanine-1)-methyltransferase
PF00750.14 | tRNA-synt_1d | 120.85 | tRNA synthetases class I (R)
PF01409.15 | tRNA-synt_2d | 161.15 | tRNA synthetases class II core domain (F)
PF01509.13 | TruB_N | 83.55 | TruB family pseudouridylate synthase (N-terminal domain)
PF00627.26 | UBA | 13.3 | UBA/TS-N domain
PF02130.12 | UPF0054 | 59.35 | Uncharacterized protein family UPF0054
PF02367.12 | UPF0079 | 46.1 | Uncharacterized P-loop hydrolase UPF0079
PF03652.10 | UPF0081 | 60.75 | Uncharacterized protein family (UPF0081)
PF12344.3 | UvrB | 33.75 | Ultra-violet resistance protein B
PF08459.6 | UvrC_HhH_N | 81.9 | UvrC Helix-hairpin-helix N-terminal
PF10458.4 | Val_tRNA-synt_C | 26.5 | Valyl tRNA synthetase tRNA binding arm
PF06071.8 | YchF-GTPase_C | 61.9 | Protein of unknown function (DUF933)
PF06689.8 | zf-C4_ClpX | 31.7 | ClpX C4-type zinc finger
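
The way the marker list in Table S2 feeds a completeness estimate can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes completeness is the fraction of marker Pfams with at least one HMM hit scoring at or above the listed cut-off. The function names and the `(pfam_id, bit_score)` hit format are hypothetical; only the three cut-off values are taken from Table S2.

```python
from typing import Dict, Iterable, Set, Tuple

# Three cut-offs copied from Table S2; a real run would load all markers.
EXAMPLE_CUTOFFS: Dict[str, float] = {
    "PF03485.11": 32.55,  # Arg_tRNA_synt_N
    "PF03484.10": 26.55,  # B5
    "PF01121.15": 84.65,  # CoaE
}


def detected_markers(
    hits: Iterable[Tuple[str, float]], cutoffs: Dict[str, float]
) -> Set[str]:
    """Return marker Pfam IDs with at least one hit at or above their cut-off.

    `hits` is an assumed format: (pfam_id, bit_score) pairs from an HMM search.
    Hits to Pfams that are not in the marker list are ignored.
    """
    found: Set[str] = set()
    for pfam_id, score in hits:
        cutoff = cutoffs.get(pfam_id)
        if cutoff is not None and score >= cutoff:
            found.add(pfam_id)
    return found


def estimate_completeness(
    hits: Iterable[Tuple[str, float]], cutoffs: Dict[str, float]
) -> float:
    """Fraction of marker Pfams detected (assumed definition of completeness)."""
    if not cutoffs:
        raise ValueError("marker list is empty")
    return len(detected_markers(hits, cutoffs)) / len(cutoffs)
```

Counting each marker once, however many hits it receives, keeps duplicated genes or fragmented gene calls from inflating the estimate.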

Table S3. Pangenomic variation in selected SR1 genes

Columns 2 and 3 describe the alignment, columns 4 to 7 the internal TGA codons, and columns 8 to 12 the SNPs (excluding TGA codons).

Protein | N | S | TGA total | TGA conserved | TGA variable | TGA amino acid variations | SNPs total | Transitions | Transversions | Transition, transversion* | Amino acid substitutions†
LepA | 1,458 | 9 | 11 | 9 | 2 | GGA (Gly), TGG (Trp) | 126 | 97 | 19 | 10 | 2
RNA pol. β subunit | 3,699 | 3 | 6 | 4 | 2 | GGA (Gly) | 65 | 55 | 10 | 0 | 3
RNA pol. β′ subunit | 3,813 | 6 | 6 | 5 | 1 | GGA (Gly) | 138 | 114 | 23 | 1 | 6
Threonine synthetase | 1,887 | 11 | 5 | 4 | 1 | GGA (Gly) | 144 | 111 | 27 | 6 | 16
eIF-2B homolog | 960 | 15 | 3 | 2 | 1 | GGA (Gly) | 62 | 45 | 14 | 3 | 12
DeoA homolog | 1,504 | 20 | 13 | 9 | 4 | GGA (Gly), GGT (Gly), GGG (Gly) | 213 | 148 | 37 | 28 | 31
RuBisCo | 1,461 | 71 | 4 | 0 | 4 | GGA (Gly), GGT (Gly), GGC (Gly) | 465 | 251 | 85 | 129 | 71

N is the length of the nucleotide sequence alignment including gaps; S is the total number of scaffolds used for comparison. Most scaffolds used did not span the entire length of alignment.
*SNPs that were represented by at least three bases.
†Total amino acid substitutions found along the alignment, irrespective of the number of SNPs contributing to the substitution.
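
The SNP categories in Table S3 can be illustrated with a short sketch. It assumes equal-length, gap-padded nucleotide sequences and classifies each alignment column as a transition (two purines or two pyrimidines), a transversion (one purine and one pyrimidine), or a mixed site carrying at least three bases, which corresponds to the "Transition, transversion" column. The function names are hypothetical, and the exclusion of TGA codons applied in the table is not reproduced here.

```python
from collections import Counter
from typing import Sequence

PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_site(column: str) -> str:
    """Classify one alignment column by the distinct bases it contains.

    Gaps and ambiguous characters are ignored.
    """
    bases = {b for b in column.upper() if b in "ACGT"}
    if len(bases) < 2:
        return "invariant"
    if len(bases) >= 3:
        # Three or more bases always include a same-class pair (transition)
        # and a cross-class pair (transversion).
        return "mixed"
    if bases <= PURINES or bases <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_snps(sequences: Sequence[str]) -> Counter:
    """Count variable columns per category in an aligned set of sequences."""
    if len({len(s) for s in sequences}) > 1:
        raise ValueError("aligned sequences must have equal length")
    counts: Counter = Counter()
    for column in zip(*sequences):
        kind = classify_site("".join(column))
        if kind != "invariant":
            counts[kind] += 1
    return counts
```

Under this scheme the three categories sum to the total SNP count, as they do in each row of the table (for LepA, 97 + 19 + 10 = 126).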

Table S4. Genomic-level glycine codon use in SR1-OR1 and HMP SR1 metagenomic scaffolds

The last five columns give the use of each glycine codon (%).

HMP subject* | IMG ID | Sample source | Scaffolds | Genes surveyed | TGA | GGA | GGT | GGC | GGG
SR1-OR1 | 10947 | Supragingival plaque | 49 | 426 | 23.9 | 40.4 | 16.2 | 5.5 | 14.1
158944319 (visit 1) | 7000000416 | Supragingival plaque | 8 | 144 | 23.4 | 41.6 | 16.2 | 4.0 | 14.7
158337416 (visit 1) | 7000000132 | Supragingival plaque | 3 | 24 | 24.1 | 44.1 | 13.3 | 2.9 | 15.6
160643649 (visit 1) | 7000000692 | Supragingival plaque | 7 | 156 | 24.4 | 42.6 | 15.7 | 2.8 | 14.5
159611913 (visit 2) | 7000000518 | Supragingival plaque | 2 | 21 | 22.7 | 48.7 | 12.6 | 2.4 | 13.6
159591683 (visit 1) | 7000000046 | Supragingival plaque | 3 | 23 | 21.5 | 47.4 | 16.1 | 1.9 | 13.1
763901136 (visit 1) | 7000000742 | Supragingival plaque | 4 | 37 | 19.1 | 47.0 | 18.6 | 1.8 | 13.5
159814214 (visit 1) | 7000000195 | Tongue (dorsum) | 5 | 34 | 31.6 | 40.4 | 13.3 | 2.0 | 12.8
809635352 (visit 1) | 7000000005 | Supragingival plaque | 10 | 946 | 25.7 | 41.6 | 16.1 | 3.3 | 13.3
809635352 (visit 2) | 7000000159 | Tongue (dorsum) | 3 | 20 | 26.0 | 39.1 | 15.7 | 3.5 | 15.7
370425937 (visit 1) | 7000000018 | Supragingival plaque | 10 | 580 | 25.7 | 42.1 | 15.1 | 3.2 | 14.0
159510762 (visit 2) | 7000000288 | Supragingival plaque | 2 | 11 | 37.4 | 38.4 | 16.8 | 1.3 | 6.1
737052003 (visit 1) | 7000000274 | Supragingival plaque | 11 | 73 | 22.8 | 42.9 | 15.9 | 3.5 | 15.0
763860675 (visit 2) | 7000000115 | Supragingival plaque | 4 | 14 | 28.5 | 40.5 | 11.3 | 4.9 | 14.8

*Each metagenome was derived from a single human donor.
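
Percentages of the kind reported in Table S4 could be derived along the following lines. This is a sketch under stated assumptions, not the authors' procedure: input sequences are taken to be in-frame coding sequences, every in-frame TGA is counted as a glycine codon (as the paper infers for SR1), and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

# Glycine codons considered in Table S4, with TGA read as Gly.
GLY_CODONS = ("TGA", "GGA", "GGT", "GGC", "GGG")


def glycine_codon_usage(cds_list: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of each glycine codon across in-frame CDSs."""
    counts: Counter = Counter()
    for cds in cds_list:
        seq = cds.upper()
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in GLY_CODONS:
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in GLY_CODONS}
    return {codon: 100.0 * counts[codon] / total for codon in GLY_CODONS}
```

Only in-frame triplets are counted, so a TGA that spans two codons does not contribute to the total.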

Dataset S1. Alignment (FASTA format) of RubisCO proteins from SR1-OR1 and close relatives from HMP metagenomes or amplified and sequenced from human oral samples
