Supplementary information

CaptureSeq: Capture-based enrichment of cpn60 gene fragments empowers pan-Domain profiling of microbial communities without universal PCR.

Matthew G. Links 1,2, Tim J. Dumonceaux 3,4, Luke McCarthy 5, Sean M. Hemmingsen 5, Edward Topp 6, Alexia Comte 3, Jennifer R. Town 3*

1 Department of Animal and Poultry Science, University of Saskatchewan, Saskatoon, SK, Canada
2 Department of Computer Science, University of Saskatchewan, Saskatoon, SK, Canada
3 Agriculture and Agri-Food Canada, Saskatoon Research and Development Centre, Saskatoon, SK, Canada
4 Department of Veterinary Microbiology, University of Saskatchewan, Saskatoon, SK, Canada
5 National Research Council of Canada, Saskatoon, SK, Canada
6 Agriculture and Agri-Food Canada, London Research and Development Centre, London, ON, Canada

*author for correspondence: [email protected]

Supplemental Figure S1: Principal coordinate analysis of Bray-Curtis dissimilarity between soil samples profiled using CaptureSeq (square), shotgun metagenomic (triangle), or amplicon (circle) approaches and clustered using reference type I chaperonin sequences. Both inter-technique (A) and intra-technique (B, C, D) distances were evaluated.

Supplemental Figure S2: Alpha diversity metrics for samples profiled using amplicon (red), CaptureSeq (blue), or shotgun metagenomic (green) approaches. Metrics were calculated using libraries that were downsampled from 250-2750 reads and were averaged across 100 bootstrapped datasets. The shaded area corresponds to the standard deviation of the three replicate soil plots for each antibiotic treatment condition.

Supplemental Figure S3: The number of cpn60 gene copies for selected organisms from the synthetic community was quantified using -specific quantitative PCR assays before (A) and after (B) CaptureSeq hybridization.

Supplemental Figure S4: The number of mapped reads for each microorganism relative to the mean within each spiked synthetic community sample. Plasmids containing cpn60 UT sequences for all 20 microorganisms were mixed at an equimolar concentration for sequencing library preparation. A smaller deviation from the mean reflects reduced bias in the profiling method.

Supplemental Information: Pseudocode describing the procedure used to retrieve reference cpn60 sequences from GenBank.

GetNucleotideSequence($protein):
    for each $feature in $protein->get_features_by_tag("CDS"):
        if $feature->has_tag("coded_by"):
            return $feature->get_tag_value("coded_by")
    for each $record in $protein->get_dblinks("GenBank"):
        for each $feature in $record->get_features_by_tag("CDS"):
            if $feature->get_tag_value("protein_id") == $protein->id
               or $feature->get_tag_value("locus_tag") == $protein->locus_tag
               or $feature->get_tag_value("gene") == $protein->gene_names:
                return $feature->get_sequence()
    for each $record in $protein->get_dblinks("GenPept"):
        GetNucleotideSequence($record)

Supplemental Table S1: Summary of sequencing output for soil sample libraries processed using amplicon, CaptureSeq and metagenomic profiling methods.

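| Profiling method | Treatment (mg kg⁻¹) | Plot | Sequencing reads | Reference mapped reads |
|---|---|---|---|---|
| Amplicon | 0 | 1 | 254,145 | 94.9% |
| Amplicon | 0 | 7 | 34,221 | 95.8% |
| Amplicon | 0 | 12 | 39,603 | 94.5% |
| Amplicon | 10 | 4 | 161,181 | 94.3% |
| Amplicon | 10 | 8 | 65,299 | 95.1% |
| Amplicon | 10 | 11 | 29,014 | 94.4% |
| CaptureSeq | 0 | 1 | 1,226,019 | 16.4% |
| CaptureSeq | 0 | 7 | 1,172,125 | 18.1% |
| CaptureSeq | 0 | 12 | 1,158,656 | 15.9% |
| CaptureSeq | 10 | 4 | 1,025,680 | 17.0% |
| CaptureSeq | 10 | 8 | 827,129 | 16.4% |
| CaptureSeq | 10 | 11 | 765,101 | 16.7% |
| Shotgun metagenomic | 0 | 1 | 6,116,268 | 0.070% |
| Shotgun metagenomic | 0 | 7 | 7,306,386 | 0.071% |
| Shotgun metagenomic | 0 | 12 | 6,953,220 | 0.070% |
| Shotgun metagenomic | 10 | 4 | 4,232,651 | 0.067% |
| Shotgun metagenomic | 10 | 8 | 4,794,243 | 0.067% |
| Shotgun metagenomic | 10 | 11 | 4,287,559 | 0.069% |

Supplemental Table S2: Sequencing read abundance for seed wash samples spiked with a synthetic community of 20 bacteria in 10-fold decreasing dilutions, compared between UT amplification and CaptureSeq profiling methods. Amplicon and CaptureSeq libraries were downsampled to 30,091 and 506,247 sequencing reads, respectively, and mapped to the reference chaperonin UT sequences for the 20 bacteria in the panel.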

| Organism | Amplicon High | Amplicon Medium | Amplicon Low | CaptureSeq High | CaptureSeq Medium | CaptureSeq Low |
|---|---|---|---|---|---|---|
| Atopobium vaginae | 171 | 133 | 10 | 943 | 87 | 10 |
| Bifidobacterium bifidum | 10 | 9 | 14 | 827 | 15 | 29 |
| Bifidobacterium infantis | 95 | 57 | 20 | 527 | 35 | 4 |
| Gardnerella vaginalis | 1,096 | 974 | 120 | 1,369 | 92 | 11 |
| Lactobacillus sp. N27 | 193 | 86 | 11 | 295 | 15 | 4 |
| Lactobacillus iners | 859 | 616 | 94 | 741 | 66 | 8 |
| Lactobacillus crispatus | 684 | 356 | 60 | 731 | 55 | 15 |
| Lactobacillus jensenii | 2,090 | 1,373 | 143 | 1,020 | 92 | 11 |
| Lactobacillus vaginalis | 529 | 281 | 33 | 946 | 79 | 17 |
| Lactobacillus gasseri | 2,330 | 1,079 | 135 | 827 | 69 | 12 |
| Lactobacillus salivarius | 1,554 | 1,254 | 126 | 874 | 63 | 2 |
| Lactobacillus sp. L6 | 652 | 324 | 39 | 803 | 65 | 10 |
| Lactobacillus sp. N19 | 2,756 | 1,864 | 229 | 889 | 83 | 9 |
| Lactobacillus johnsonii | 426 | 317 | 43 | 297 | 20 | 3 |
| Lactobacillus sp. N20 | 4,142 | 3,382 | 416 | 2,466 | 244 | 29 |
| Peptostreptococcus anaerobius | 2,468 | 1,681 | 107 | 1,416 | 124 | 18 |
| Prevotella bivia | 817 | 357 | 50 | 959 | 83 | 11 |
| Streptococcus gallolyticus | 3,789 | 2,153 | 173 | 1,010 | 89 | 12 |
| Streptococcus lutetiensis | 1,048 | 623 | 46 | 769 | 66 | 10 |
| Lactobacillus plantarum | 1,291 | 729 | 126 | 1,332 | 120 | 18 |

Supplemental Table S3: Sequencing libraries prepared using amplicon or CaptureSeq approaches were down-sampled to 27,389 mapped reads, and taxonomic clusters that were at least 10-fold less abundant in the amplicon libraries compared to the CaptureSeq libraries were considered. Of the top 25 most under-represented Bacteria in the amplicon libraries, 22 were identified as high G/C gram positive and all had a cpn60 G/C content ≥ 64%.

| GenBank Accession | G/C (%) | Name | Amplicon mapped reads | CaptureSeq mapped reads |
|---|---|---|---|---|
| WP_028636473.1 | 68 | Nocardioides sp. URHA0032 | 25 | 658 |
| ACC65036.1 | 66 | uncultured soil bacterium | 26 | 614 |
| ACC65181.1 | 65 | uncultured soil bacterium | 31 | 597 |
| WP_028651308.1 | 70 | Nocardioides halotolerans | 18 | 609 |
| WP_030528625.1 | 68 | Phycicoccus jejuensis | 15 | 444 |
| WP_027861271.1 | 67 | Marmoricola sp. URHB0036 | 24 | 375 |
| WP_028707293.1 | 69 | Propionicicella superfundia | 31 | 320 |
| EON25274.1 | 68 | Nocardioides sp. CF8 | 12 | 325 |
| AEO79019.1 | 68 | Tetrasphaera duodecadis | 6 | 305 |
| ACC65055.1 | 66 | uncultured soil bacterium | 21 | 284 |
| WP_028929632.1 | 70 | Pseudonocardia asaccharolytica | 19 | 282 |
| WP_028644730.1 | 67 | Nocardioides sp. URHA0020 | 18 | 245 |
| WP_022889903.1 | 68 | Agromyces italicus | 12 | 249 |
| WP_030482905.1 | 70 | Marmoricola aequoreus | 7 | 242 |
| AEO79020.1 | 66 | Tetrasphaera elongata | 11 | 234 |
| WP_028635614.1 | 69 | Nocardioides sp. URHA0032 | 20 | 217 |
| WP_027860415.1 | 69 | Marmoricola sp. URHB0036 | 2 | 226 |
| WP_028925406.1 | 68 | Pseudonocardia acaciae | 15 | 210 |
| WP_028643070.1 | 69 | Nocardioides sp. URHA0020 | 7 | 179 |
| WP_028660685.1 | 68 | Nocardioides insulae | 6 | 145 |
| WP_019180227.1 | 67 | Microbacterium yannicii | 11 | 123 |
| WP_036332750.1 | 69 | Microbispora sp. PTA-5024 | 6 | 125 |
| WP_026857066.1 | 71 | Geodermatophilaceae | 2 | 121 |
| WP_024366553.1 | 64 | Arthrobacter sp. TB 26 | 10 | 102 |
| WP_028472535.1 | 69 | Nocardioides alkalitolerans | 4 | 87 |

Supplemental Table S4: ddPCR primer sets and conditions

| OTU | Domain | cpnDB nearest neighbor | primer/probe name | sequence (5'-3') | product size, bp | ddPCR amplification conditions |
|---|---|---|---|---|---|---|
| XP002901426 DN2_c0_g1_i1 | Eukarya (type I) | Phytophthora infestans | D0498 | CATCCCCATGTTGGAAAC | 108 | 94°C, 10 min; 50x 94°C, 30 sec; 55°C, 1:00 (ramp 2°C/sec); 98°C, 10 min |
| | | | D0499 | CCACGGATCTTGTTGATG | | |
| | | | Pinf-cpn60 | 5'-FAM-CGTCCGCTT/ZEN/CTCATCATCGC-3'-IB | | |
| WP036300323 DN4_c3_g1_i2 | Bacteria (type I) | Microbacterium sp. C448 | D0321 | ACCATTGGTGACCTGATC | 124 | 94°C, 10 min; 50x 94°C, 30 sec; 55°C, 1:00 (ramp 2°C/sec); 98°C, 10 min |
| | | | D0322 | CCTTGTCGAAACGCATAC | | |
| | | | C448 | 5'-FAM-AACACGTTC/ZEN/GGCACCGAGCT-3'-IB | | |
| KUL05486 DN0_c0_g1_i1 | Archaea (type II) | Methanoculleus marisnigri | D0516 | AGGATGGTCACACCGTCGTT | 173 | 94°C, 10 min; 50x 94°C, 30 sec; 59°C, 1:00 (ramp 2°C/sec); 98°C, 10 min |
| | | | D0517 | CTGAAAGAGGGTAGCCAGCG | | |
| | | | Methanoculleus_c0g1i1 | 5'-FAM-ATGTTGCCC/ZEN/GACTGAGCGTCACGA-3'-IB | | |
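
The GenBank retrieval pseudocode given earlier in this supplement can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Feature` and `Record` classes are hypothetical in-memory stand-ins for database records, not the API of any sequence library. Three details are assumptions added for the sketch: the "gene" comparison is treated as membership in the protein's gene names, the recursive GenPept step returns the first sequence it finds, and a set of visited identifiers guards against circular cross-references.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    """Hypothetical stand-in for a sequence feature such as a CDS."""
    kind: str
    tags: dict = field(default_factory=dict)
    sequence: Optional[str] = None


@dataclass
class Record:
    """Hypothetical stand-in for a protein or nucleotide record."""
    id: str
    locus_tag: Optional[str] = None
    gene_names: tuple = ()
    features: list = field(default_factory=list)
    dblinks: dict = field(default_factory=dict)  # e.g. {"GenBank": [Record, ...]}


def cds_features(record):
    """Return only the CDS features of a record."""
    return [f for f in record.features if f.kind == "CDS"]


def get_nucleotide_sequence(protein, _seen=None):
    """Follow the three lookup steps of the pseudocode; None if nothing matches."""
    seen = _seen if _seen is not None else set()
    if protein.id in seen:  # assumption: guard against circular links
        return None
    seen.add(protein.id)

    # Step 1: a CDS feature on the protein record names its coding region.
    for feature in cds_features(protein):
        if "coded_by" in feature.tags:
            return feature.tags["coded_by"]

    # Step 2: search linked GenBank records for a CDS matching this protein.
    for record in protein.dblinks.get("GenBank", []):
        for feature in cds_features(record):
            tags = feature.tags
            if (
                tags.get("protein_id") == protein.id
                or (protein.locus_tag is not None
                    and tags.get("locus_tag") == protein.locus_tag)
                or tags.get("gene") in protein.gene_names  # assumption: membership
            ):
                return feature.sequence

    # Step 3: recurse into linked GenPept records (assumption: first hit wins).
    for record in protein.dblinks.get("GenPept", []):
        result = get_nucleotide_sequence(record, seen)
        if result is not None:
            return result
    return None
```

For example, a protein record whose only cross-reference is a GenPept record, which in turn links to a GenBank record carrying a CDS with a matching `protein_id`, resolves to that CDS sequence through step 3 followed by step 2.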