bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

1 Soil bacterial populations are shaped

2 by recombination and gene-specific

3 selection across a meadow

1 1 1 4 Alexander Crits-Christoph , Matthew Olm , Spencer Diamond , Keith 1 2,3,4,* 5 Bouma-Gregson , Jillian Banfield

*For correspondence: jbanfi[email protected] 1Department of Plant and Microbial Biology, University of California, Berkeley; 6 (JB) 2Department of Earth and Planetary Sciences, University of California, Berkeley; 7 3Department of Environmental Science, Policy, and Management, University of California, 8 Berkeley, Berkeley, CA, USA; 4Earth Sciences Division, Lawrence Berkeley National 9 Laboratory, Berkeley, CA, USA 10

11

12 Abstract Soil microbial diversity is often studied from the perspective of community 13 composition, but less is known about genetic heterogeneity within species and how population 14 structures are affected by dispersal, recombination, and selection. Genomic inferences about 15 population structure can be made using the millions of sequencing reads that are assembled de 16 novo into consensus genomes from metagenomes, as each read pair describes a short genomic 17 sequence from a cell in the population. Here we track genome-wide population genetic variation 18 for 19 highly abundant bacterial species sampled from across a grassland meadow. Genomic 19 nucleotide identity of assembled genomes was significantly associated with local geography for 20 half of the populations studied, and for a majority of populations within-sample nucleotide 21 diversity could often be as high as meadow-wide nucleotide diversity. Genes involved in specialized 22 metabolite biosynthesis and extracellular transport were characterized by elevated genetic diversity 23 in multiple species. Microbial populations displayed varying degrees of 24 and recombinant variants were often detected at 7-36% of loci genome-wide. Within multiple 25 populations we identified genes with unusually high site-specific differentiation of alleles, fewer 26 recombinant events, and lower nucleotide diversity, suggesting recent selective sweeps for gene 27 variants. Taken together, these results indicate that recombination and gene-specific selection 28 commonly shape local soil bacterial genetic variation.

29

30 Introduction 31 Soil microbial communities play key biogeochemical roles in terrestrial ecosystems (Fierer 2017). 32 Within a single hectare of temperate grassland soil, there can be over 1,000 kg of microbial biomass 33 and corresponding large microbial population sizes (Fierer et al. 2009). However, many of the 34 most abundant soil microorganisms are underrepresented or absent from culture collections and 35 genomic databases, even at the level of class or phylum (Lloyd et al. 2018). Progress has been 36 made in cataloging the diversity of 16S rRNA genes in soils (Thompson et al. 2017), which is useful 37 for understanding microbial community composition, but this technique is incapable of discerning 38 genetic variation within populations. On the other hand, genome-resolved metagenomics studies, 39 in which shotgun sequencing of metagenomic DNA is assembled and binned into draft genomes, 40 have resulted in whole genome characterization of several highly abundant soil (Hultman

1 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

41 et al. 2015; Butterfield et al. 2016; White et al. 2016; Ji et al. 2017; Woodcroft et al. 2018; Diamond 42 et al. 2019). 43 Sequence variation within genome-resolved metagenomic datasets can also be used to track 44 changes in allele frequencies, and to infer the operation of evolutionary forces of genetic drift, 45 natural selection, and homologous recombination (Whitaker and Banfield 2006; Garud et al. 2019). 46 Homologous recombination, in particular, can vary dramatically in its importance relative to other 47 processes for different microbial species. Analyses of reference genomes have shown that homolo- 48 gous recombination frequently occurs in bacteria populations, both globally and locally (González- 49 Torres et al. 2019; Lin and Kussell 2019; Sakoparnig et al. 2019). For example, certain populations 50 of hotspring Cyanobacteria approach panmixia, where recombination is so frequent that individual 51 cells are unlinked random mixtures of alleles (Rosen et al. 2015). In other species like oceanic Vibrio, 52 recombination is high but large blocks of alleles important for ecological niche differentiation are 53 co-inherited and may remain linked due to selection (Cui et al. 2015) (Shapiro et al. 2012). In soils, 54 Streptomyces flavogriseus isolates were also found to approach a freely recombining panmixia 55 (Doroghazi and Buckley 2010). In contrast, Myxoccocus xanthus isolates recovered from a series 56 of soil samples were distinct but highly clonal, implying recombination between strains was low 57 (Wielgoss et al. 2016). The degree of recombination in most abundant soil bacterial lineages has 58 not been investigated, but it may be widespread, as high cell densities could promote the sharing 59 of genetic material via transformation (Thomas and Nielsen 2005), conjugation (Rocha et al. 2005), 60 or the uptake of extracellular vesicles (Tran and Boedicker 2019). 61 When recombination rates are low or selection is extremely strong, several clonal strains can 62 compete until one or more beneficial alleles is highly selected for, resulting in a single clonal geno- 63 type sweeping to fixation. However, when recombination unlinks gene variants within a population, 64 beneficial alleles can sweep through a population independent of genomic context in a selective 65 sweep (Smith and Haigh 1974) with gene/locus-specific effects. One genome-resolved metagenomic 66 study observed a single clonal sweep over a nine year period for one Chlorobium population 67 in a freshwater lake (Bendall et al. 2016), while most of the other bacterial populations studied 68 possessed genomic loci with unusually few SNPs (single nucleotide polymorphisms), an observation 69 interpreted as evidence for gene-specific selective sweeps. However, there are additional signals 70 within genomic data that could further constrain evolutionary history, as positive selection acting on 71 a genomic locus in a recombining population can leave locus-specific signals relative to the genomic 72 average. These include higher linkage disequilibrium (a strong association between alleles) and 73 different allele frequencies between populations (Krause and Whitaker 2015) (Shapiro et al. 2009). 74 In soil microbial populations, comparative frequencies of gene-specific sweeps versus genome-wide 75 clonal expansions are largely unknown, and it is generally uncertain whether selection or dispersal 76 constraints play a larger role in shaping allele frequencies. 77 Previously, we conducted a large scale genome-resolved metagenomics study of soils from a 78 grassland meadow in the Angelo Coast Range Reserve in northern California that established a 79 dataset of 896 dereplicated phylogenetically diverse microbial genomes (Diamond et al. 2019). 80 The soil at the site is a sandy loam with pH values in the range of 4.6 to 4.9, and the grassland is 81 dominated by annual Mediterranean grasses and forbs (Suttle et al. 2007; Butterfield et al. 2016). 82 The meadow has been part of a rainfall amendment climate change study ongoing for over 17 83 years and has also been studied in the context of plant diversity and productivity (Sullivan et al. 84 2016), invertebrate herbivores and predators (Suttle et al. 2007), fungal communities (Hawkes et 85 al. 2011), soil organic matter (Berhe et al. 2012), metabolomics, and metaproteomics (Butterfield 86 et al. 2016). By analyzing the population genomics of 19 highly abundant bacterial species across 87 this meadow, we found high nucleotide diversity within samples, intrapopulation genetic structure 88 often explained by local spatial scales, varying degrees of homologous recombination for different 89 species, and gene-specific population differentiation partially driven by selection.

2 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

90 Results and Discussion

91

92 Bacterial allele frequencies within populations are spatially organized across a 93 grassland meadow 94 Previously, 60 soil samples were collected at depths of 10-20, 20-30, and 30-40 cm from within six 95 10 m diameter plots (see Methods). Samples were collected over a period of three months (Fig 1a). 96 Sampling plots were arranged in blocks of two plots, and one of the two plots in each block received 97 extended spring rainfall ((Suttle et al. 2007).

(a) (b)

100

90

80

70

60 Alignment Coverage (%)

50 86 88 90 92 94 96 98 100 ANI (%) (c) Gammaproteobacteria_ANG_884 Gammaproteobacteria_ANG_56 Betaproteobacteria_ANG_5885 Deltaproteobacteria_ANG-DLTX2_2798 Rokubacteria_ANG_2807 Gemmatimonadetes_ANG_3801 Gemmatimonadetes_ANG_2150 Gemmatimonadetes_ANG_3800 Gemmatimonadetes_ANG_3684 Gemmatimonadetes_ANG_4339 Dormibacteraeota_ANG_750 Verrucomicrobia_ANG_3696 Verrucomicrobia_ANG_3696 Verrucomicrobia_ANG_4402 Verrucomicrobia_ANG_7383 Verrucomicrobia_ANG_7383 Verrucomicrobia_ANG_1653 Deltaproteobacteria_ANG_2457 Deltaproteobacteria_ANG_2457 Armatimonadetes_ANG_6837 ANGP1_ANG_10556 Acidobacteria_ANG_477

Figure 1. Meadow overview and population relative abundances. (a) A bird’s eye view of the grassland meadow ◦ ¨ ¨¨ ◦ ¨ ¨¨ located at the Angelo Coast Range Reserve in Mendocino County, California (39 44 17.7 N, 123 37 48.4 W). (b) Histograms of average nucleotide identity (ANI) and alignment coverage (an approximation of % shared gene content) values for all-vs-all comparisons between genomes assembled from the meadow soils. (c) Relative abundances of the 19 bacterial populations analyzed in this study in each sample, organized by experimental plot where sample was collected.

98 Starting with the dataset of 3,215 draft quality genomes assembled from samples collected 99 from across the meadow (Diamond et al. 2019), we calculated all pairwise genome-wide average 100 nucleotide identities (ANI) and alignment coverages (roughly analogous to shared gene content; 101 Fig 1b). We observed a sharp decrease in pairwise ANI for all of the genomes from the meadow 102 around 96.5-97%, similar to the threshold for bacterial species delineation reported recently (Jain et 103 al. 2018). We observe a more gradual decline in shared gene content, indicating that within species, 104 even within a local environment, shared gene content can often be more variable than average 105 nucleotide identity of conserved regions. We used the 97% ANI cutoff to cluster genomes into 106 groups of species-like populations. Some species-like groups contained dozens of near-complete 107 draft genomes, each assembled from a different sample independently, and we selected species

3 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

108 clusters with at least 12 genomes estimated to be >80% complete with <10% contamination for 109 population genetics analysis. The final set comprised 467 genomes from 19 widespread populations 110 (312 genomes are estimated to be >90% complete; Supplementary Table S1). 111 The bacterial species in this set included many commonly reported highly abundant soil bacteria 112 from phyla including Chloroflexi, Acidobacteria, Verrucomicrobia, and Rokubacteria, which are 113 known to be abundant globally in soils (Fierer 2017; Bergmann 2011), but remain understudied. 114 Most of the species in this set were likely novel at the taxonomic rank of class, and one likely 115 represents a novel candidate phylum tentatively designated ANGP1 (Diamond et al. 2019). Based 116 on measurement of the relative abundance of each population across the entire meadow, these bac- 117 teria are some of the most abundant species in the soil, although no individual species contributed 118 >1% of the DNA in a sample (Figure 1c). 119 For each of the 19 meadow-wide populations, we tested to see if the proximity of sampling 120 sites within meadow plots predicted genetic similarity of the assembled genomes (PERMANOVA; 121 FDR<=5%; adjusted p<=0.05). Further, we tested if samples collected from the same soil depth were 122 more similar than those collected from different depths. We found that the genetic variation of 123 genomes from 12 of the 19 populations were significantly associated with sampling plot, and that 124 genetic variation within 5 of the 19 populations were significantly associated with sampling depth 125 (Fig 2a). Non-metric multidimensional scalings of the nucleotide identity matrices of genomes from 126 each population showed clear associations with both plot of origin and depth (Fig 2b). Because 127 the genome assembly from each sample reflects the most abundant sequence variant in each 128 population, this implies that major allele frequencies varied across the meadow. While local spatial 129 heterogeneity has been shown to highly explain microbial community composition in soils (O’Brien 130 et al. 2016), here we demonstrate that there are also spatial patterns within the genetic variation of individual species.

(a) (b) Acidobacteria ANGP1 Deltaproteobacteria Gammaproteobacteria Verrucomicrobia ANG 7383 Plot ANG 10556 1.0 ANG 6837 ANG 2457 ANG 56 Verrucomicrobia ANG 6721 * * Depth ● 0.5 ● 0.5 * 1 0.5 ● ● ● Verrucomicrobia ANG 6527 ● ●● ● ● ● ● 0.0 ● * ● ● ● 0.0 ● 0.0 Verrucomicrobia ANG 4402 0 ●● ● ● ● ● ● ● ● ● ● Verrucomicrobia ANG 3696 ● −0.5 ● −0.5 ● −0.5 ● ● ● ● * −1 ● ● ● Rokubacteria ANG 2807 ● ● ● ● −1.0 −1.0 −1 0 1 −1 0 1 −1.0 −0.5 0.0 0.5 Gemmatimonadetes ANG 4339 * −0.5 0.0 0.5 * Rokubacteria Gemmatimonadetes ANG 3801 * Gemmatimonadetes Gemmatimonadetes Gammaproteobacteria ANG 2807 ANG 4439 ANG 3684 ANG 884 Gemmatimonadetes ANG 3800 1.0 ● ● ● ● Plot 2 0.5 ● Plot 5 0.5 1 ● 0.5 Gemmatimonadetes ANG 3684 ● ● ● Plot 9 ● 0.0 ● * 0.0 ●● ● ● Gemmatimonadetes ANG 2150 ● Plot 12 ● ● 0.0 ● ● ●● 0 ●●● * ● ● Plot 13 ● ● −0.5 ● ● Gammaproteobacteria ANG 884 ● ● −0.5 ● −0.5 Plot 16 ● ● Gammaproteobacteria ANG 56 * −1.0 −1 * ● ● ● ● −1.0 −0.5 0.0 0.5 1.0 20 cm * −1.5 −1.0 −0.5 0.0 0.5 1.0 −1 0 1 −1.0 −0.5 0.0 0.5 Deltaproteobacteria ANG 2457 30 cm Verrucomicrobia Chloroflexi ANG 2135 * Verrucomicrobia Verrucomicrobia Verrucomicrobia 40 cm ANG 7383 ANG 6721 ANG 6527 ANG 3696 ● 1.0 ● 1.0 ANGP1 ANG 6837 1.0 1.0 ● * 0.5 Dormibacteraeota ANG 750 0.5 * 0.5 ● 0.5 ● ● ● ● ● ● ● Acidobacteria ANG 477 0.0 ● 0.0 ● ● ● ● ● 0.0 ● 0.0 ● ● ● ● ● ● Acidobacteria ANG 10556 −0.5 ● ● −0.5 −0.5 ● −0.5 * −1.0 ● 0.0 0.1 0.2 0.3 0.4 ● ● −1.0 R−squared −1.5 −1.0 −0.5 0.0 0.5 −0.5 0.0 0.5 1.0 −1.0 −0.5 0.0 0.5 −1.0 −0.5 0.0 0.5 1.0

Figure 2. Spatial variation in genetic differences within species. (a) The percentage of variation in genetic similarity (ANI) of consensus genomes explained by plot of origin (red) and sampling depth (blue). (b) NMDS of genetic dissimilarities between genomes within each species, for all 12 populations for which sampling plot explained a significant fraction of the variation in genetic dissimilarity (PERMANOVA; FDR=5%; p<0.05). There is a single point for each genome independently assembled for a population, and genomes are colored by sample plot of origin and the point shape indicates the sample depth.

131

4 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

132 High population nucleotide diversity is observed meadow-wide and within soil 133 samples 134 Each genome was assembled from >100,000 short reads, and millions of reads could be assigned 135 to each population from all 60 samples. Because microbial populations within the 10 g soil samples 136 used for DNA extraction are comprised of orders of magnitude more cells than were sequenced, 137 most read pairs are likely from unique cells (or DNA molecules) in the population. Each genome 138 assembled from each sample was sequenced at around 10x coverage, but meadow-wide coverages 139 for each population ranged from 224x to 908x (Supplementary Table S2). 140 We implemented stringent requirements for read mapping of read pairs that were at least 141 96% combined sequence identity to a representative reference for each genome (see Methods). 142 Replicate genomes from each species were used to eliminate possible contaminating contigs within 143 each metagenome-assembled genome assembly, and SNPs were then called at frequencies >5% 144 in each population using reads from all samples, using a simple null model that assumes a false ~ −6 145 discovery rate < 10 . Compared to a previously published study on freshwater lake microbial 146 populations (Bendall et al. 2016), there were far more polymorphisms per population (4,721 to 147 43,225 SNPs / Mbp, Supplementary Table S2), despite having a minimum allele frequency cutoff of 148 5%. Most populations also had higher rates of SNPs / Mbp than a similar analysis of metagenome 149 assembled genomes found for microbial populations in deep-sea hydrothermal vents (Anderson et 150 al. 2017). 151 To assess the genetic variability within each population across the meadow beyond the SNPs / 152 Mbp metric, we calculated the per-site nucleotide diversity of the sequencing reads (probability that 153 any two sampled reads share the same base at a site) at each locus for that population. Previously, 154 metagenomic studies have used either the average similarity of reads to a reference or the total 155 number of SNPs / Mbp as metrics of genetic diversity (Bendall et al. 2016) (Anderson et al. 2017). 156 We chose to measure nucleotide diversity because (1) it is less sensitive to large changes in coverage 157 (Fig S1), (2) it can be calculated both for a single site and averaged over genes or windows, and 158 (3) it considers not only the number of SNPs but also their frequencies in the population. We 159 found a wide range of per-gene nucleotide diversity values for the 19 different populations (Fig 160 3a). Because nucleotide diversity is less sensitive to changes in coverage, we could use it to track 161 how genetic diversity changes with distance across the meadow. We calculated nucleotide diversity 162 in pooled mapped reads for each population sampled from the same location and soil depth, 163 within plot, within block (pairs of plots), and across the entire meadow for each species (Fig 3a). 164 Although nucleotide diversity tends to be higher at the meadow scale compared to the sample 165 scale, the genetic diversity within some samples was comparable to that across the entire meadow 166 for many populations, indicating that in some cases high genetic diversity persists within soils even 167 at the centimeter scale. This implies that although there are shifts in allele frequencies across the 168 plots within the meadow, within each 10 g sample of soil many alternative alleles are frequently 169 encountered, and extensive gene variant dispersal occurs across the meadow. 170 In almost all populations, the ribosomal genes consistently had lower genetic diversity than 171 the average genes in the genome, consistent with these genes being strongly conserved and 172 under higher purifying selection (Jordan et al. 2002) (Fig 3a). Conversely, biosynthetic genes 173 involved in the production of small molecules, annotated with antiSMASH (Blin et al. 2017), were 174 found to have significantly higher genetic diversity than the genomic average in Chloroflexi, two 175 Gemmatimonadetes species and one Acidobacteria species (Fig 3a). Examining the most genetically 176 diverse genes within each population (Fig 3b), we find that genes for extracellular transport were 177 highly represented, along with polyketide cyclases, transposases, proteases, and a type 2 secretion 178 system pilus gene. Therefore, these genes involved in extracellular uptake, sensing, and interaction 179 may likely be experiencing local selective pressures for diversification across soil microbial species.

5 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

(a) ● 10556 ● 477 ● 6837● 2135● 2457● 750 ● 56 (b)

0.020 ●

● ● ● ● ●

● ● ● Protease PfpI ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ISXO2-like transposase ● ● ● 0.015 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Glycosyl transferase ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.010 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● 0.005 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.000 ●

Alpha/beta hydrolase●

● Polyketide cyclase ● 884 ● 2150● 3684● 3800● 3801● 4339● 2807 ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.020 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● T2SS pilus ● ● ● ● ● ● ● 0.015 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.010 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ABC Transporter ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.005 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Nucleotide Diversity Protease inhibitor

0.000 ● ● ● ● 3696● 4402● 6527● 6721● 7383 plot plot ● block block

Ferritin● sample sample ● meadow meadow replicate replicate Chromate transporter 0.020 ●

Chromate● resistance ● ● ● ● ● ● ● ● ● ● ● ● ● Acidobacteria ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.015 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ANGP1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Beta-lactamase ● ● ● ● ● Chloroflexi 0.010 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Deltaproteobacteria ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.005 ● ● ● ● ● ● ● ● ● Dormibacteraeota ● ● ● ● ● ● ● ● ● Gammaproteobacteria ● 0.000 Polyketide● cyclase ● ● Gemmatimonadetes Transposase IS66 family plot plot plot plot plot

block block block block block ●Rokubacteria sample sample sample sample sample meadow meadow meadow meadow meadow replicate replicate replicate replicate replicate ● Verrucomicrobia 0.01 0.02 0.03 0.04 0.05 0.06

● All genes ● Biosynthetic genes ● Ribosomal genes Nucleotide Diversity

Figure 3. Genetic diversity of 19 highly abundant bacterial populations. (a) Distributions of average per-gene nucleotide diversity for each population, measured across increasing scales of sampling. Separated by all genes (red), ribosomal genes (blue), and biosynthetic genes (green). Lines connect the means of each distribution of points across scales. (b) Nucleotide diversity of the three most genetically diverse genes in each population. Point size scales with the gene’s length, labels indicate PFAM annotations of individual genes.

180 Homologous recombination is common, but populations are not at panmixia 181 Measuring the impact of homologous recombination on the observed genetic diversity in a popu- 182 lation can be accomplished with metagenomic data by measuring linkage disequilibrium of SNPs 183 spanned by paired reads (Rosen et al. 2015; Lin and Kussell 2019). As two sites are further apart on 184 the genome, the chance for a recombination event to occur between them increases, resulting in a 185 characteristic signal known as linkage decay. Given 200 bp reads and 300-600 bp insert sizes, we 186 could reliably assess genomic linkage of SNPs up to 800-1000 bp apart in each population. Consis- 187 tent with the expectation that natural bacterial populations can undergo extensive homologous 2 188 recombination, we observed the r metric of linkage disequilibrium decay as the genomic distance 189 between two polymorphisms increased (Fig S2). In the less genetically diverse populations (Fig 4a), 2 190 there was a noticeable higher r of synonymous variants linked to other synonymous variants than 191 for nonsynonymous variants linked to other nonsynonymous variants. This has been previously 192 observed in hot-spring Cyanobacteria, and was explained as a decrease in coupling linkage for 193 slightly deleterious nonsynonymous variants, where recombinants inheriting the doubly deleterious 194 haplotype (of a pair of variants) are selected against (Rosen et al. 2018). In more genetically diverse 2 2 195 populations, this ratio shifts towards nonsynonymous polymorphisms (r N ) having higher r than 2 196 synonymous (r S )(Fig 4b). One recent study reported a similar positive rN -rS ratio as a signature of 197 positive balancing selection in Neisseria gonorrhoeae (Arnold et al.); for six of the 19 populations 2 2 198 analyzed here, the genomic average r N /r S ratio was greater than 1 (Fig 4c), possibly also indicating 2 2 199 recent balancing selection or admixture in this subset of populations. The increase in the r N /r S 2 200 ratio with nucleotide diversity (linear regression; R =0.29; p=0.009) suggests a population shift from 201 mostly slightly deleterious nonsynonymous SNPs to a higher proportion of nonsynonymous SNPs 202 under positive selection as diversity increases.

6 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

(a) Acidobacteria ANG 10556 (b) Deltaproteobacteria ANG 2457 0.35 0.6 N-N 1,000 pairs of sites N-N 20,000 pairs of sites 0.30 N-S 2,000 N-S 40,000 S-S 3,000 S-S 80,000 0.5 0.25 r 2 r 2 0.29 0.4 0.15 0.3 0 200 400 600 0 200 400 600 800 Genomic Distance (bp) Genomic Distance (bp)

Acidobacteria ANG 477 Gemmatimonadetes ANG 2150 0.5 0.28 N-N 20,000 pairs of sites N-N 2,000 pairs of sites 0.4 N-S 4,000 N-S 40,000 r 2 S-S 6,000 r 2 0.24 S-S 80,000 0.3 0.20 0.2 0 200 400 600 0 200 400 600 800 Genomic Distance (bp) Genomic Distance (bp)

0.4 (c) (d) Nucleotide 884 4402 ● ● 56 Diversity 10556 ● ● 1.1 0.004 0.008 0.012 ● Acidobacteria 2807 2150 ● ● ● ANGP1 4339 884 3800 ● 0.3 ● ● Chloroflexi ● 2457 ● 1.0 2135 3801 6721 Deltaproteobacteria ● ● ● ● ●750 ● 750 ● Dormibacteraeota 2 2 3684 2 ● ● r / r 6837 r 3800 6527 477 N S ● 3684 ● ● ● Gammaproteobacteria ● ● ● ●3801 7383 ● Gemmatimonadetes ●4339 6837 ● 0.9 2135 3696 7383 0.2 ● ● Rokubacteria ● 477 ● 6527 ●3696 ● 2807 2150 ● Verrucomicrobia 10556 ● ● ● D' 2457 56 0.87 ● ● 6721 0.90 ● 0.8 4402 0.93 ● 0.96 0.1 0.004 0.008 0.012 0.016 0.87 0.90 0.93 0.96 Nucleotide Diversity D'

2 Figure 4. Varying rates of linkage disequilibrium within populations. (a, b) Linkage decay of r for pairs of loci within the two populations with (a) the lowest nucleotide diversity, and (b) the highest nucleotide diversity. Each square is an average of pairs of biallelic sites at that distance, with the area of the square point proportional to the number of pairs of biallelic sites that went into the mean. Haplotypes (site pairs) are binned by the predicted function of the mutations of each of the paired SNPs (nonsynonymous: N, synonymous: S). (c) 2 2 The relationship between nucleotide diversity and r N /r S . The mean nucleotide diversity and the mean ratio of the linkage of nonsynonymous-nonsynonymous vs synonymous-synoymous pairs of mutations across species is shown. The size of each point represents the mean D’ value for that species. A linear regression is shown 2 2 (linear regression; R =0.29; p=0.009). (d) The relationship between mean r and mean D’ across the 19 bacterial populations studied. Genomes with evidence for multiple operonic competence related genes are labeled in 2 green. A linear regression model is shown (F-statistic: 11.9, Adjusted R : 0.38, p=0.003).

2 2 203 Although r is often used as a signal to identify the presence or absence of recombination, r 204 values < 1 can occur with or without recombination. For example, three of four possible haplotypes 205 (pairs of variants) for two biallelic sites could occur due to lineage divergence prior to mutation 206 occurring at one of the sites. D’, an alternative metric of linkage equilibrium, is only <1 if all possible 207 combinations of a pair of biallelic sites are observed (VanLiere and Rosenberg 2008), which can 208 only occur in the presence of recombination or recurrent mutation. Generally, we found that 2 209 average D’ for a population linearly correlated with mean r (Fig 4d). Deviation from a linear 2 210 relationship between r and D’ (observed for three populations) could also be explained by rates of 211 recombination being high and clonal diversity being low or vice versa. Inspection of the distribution ~ 212 of pairs of SNPs separated by < 1 kbp revealed that 3 of 4 possible haplotypes is most common,

7 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

213 but there was detection of all four possible biallelic haplotype combinations (D’ < 1), at 7% to 36% 214 of all site pairs in each population (Fig S3). The observed frequencies of the least common of the 215 four SNP combinations also were higher than expected based on sequencing error, although much 216 less frequent than expected from linkage equilibrium, as evidenced by mean D’ values above 0.8. 217 Nonetheless, the extensive appearance of D’ <1 at a significant fraction of loci, along with a signal 218 of linkage decay with genomic distance across all populations, provides firm evidence for ongoing 219 processes of within population homologous recombination, albeit to different degrees between 220 organisms. 221 Given evidence for recent homologous recombination, we searched the genomes for genes that 222 could confer natural competence, such as homologs of the ComEC gene with all three functional 223 domains (Pimentel and Zhang 2018), and identified loci with additional genes involved in DNA 224 uptake and recombination (Cassier-Chauvat et al. 2016) (Supplementary Table S3). We found that 225 presence of a ComEC homolog with multiple adjacent operonic recombination-related genes was 226 strongly associated with the lowest values of D’ (Fig 4d). Thus, it is likely that natural competence is 227 a common mechanism that facilitates homologous recombination for abundant soil bacteria.

228 Gene-specific selective sweeps lead to divergence of allele frequencies across the 229 meadow 230 In neutrally evolving local populations, nucleotide diversity will increase monotonically with popula- 231 tion size. Because we see no such relationship between genetic diversity and relative abundance, 232 purely neutral growth and processes cannot explain the observed differences in genetic diversity 233 between these populations (Fig S4; linear regression; p=0.88). Except for populations with the 234 lowest genetic diversity, ratios of non-synonymous to synonymous polymorphisms for each popula- 235 tion are consistently low. This indicates that purifying selection has eliminated slightly deleterious 236 mutations in the populations that have accrued more genetic diversity over time (Fig S5; linear 2 237 regression; R =0.25; p=0.018). In all species except the least diverse population (an acidobacterium), 238 non-synonymous variants also had consistently higher values of D’ than synonymous variants. 239 Further, as genome wide D’ decreased (more recombination), the degree to which non-synonymous 240 variants were more linked than synonymous (D’N /D’S ) increased (Fig S6-S7). As recombination, 241 nonsynonymous linkage increased in comparison to synonymous linkage. This effect is consistent 242 with stronger selection on non-synonymous variants: both purifying selection and positive selection 243 would increase D’ for nonsynonymous SNPs. 244 Soil ecosystems are exceptionally heterogeneous, and environmental factors can change over 245 millimeter distances, potentially due to changes in aboveground plant productivity, soil geochem- 246 istry, plant litter composition, and soil particulate structure (Vos et al. 2013). While it is difficult to 247 tease apart the effects of changing abiotic parameters over spatial scales, it is possible to examine 248 how allele frequencies change over these scales. We calculated pairwise fixation indexes FST (a 249 measure of differences in allele frequencies between two populations) for each gene between allele 250 frequencies from the three meadow blocks for every species-group (Fig 1b). For most populations, 251 mean gene FST values were low, <5%, consistent with dispersal of most alleles between blocks (Fig 252 S8). For only a minority of populations, mean gene FST values were consistently >10%, indicating 253 that there was significant geographic organization of genetic structure at most loci across the 254 genome. Therefore, while the total variation in genome-wide major consensus alleles is often well 255 explained by meadow geography, most individual alleles have a high chance of being found at fairly 256 similar frequencies across the meadow. 257 When specific loci are characterized by significantly higher FST than the background average for 258 the genome, it is characteristic of geographic location-specific selective pressures acting on that 259 locus. To identify genomic regions of unexpectedly high FST , we scanned over a moving 5 gene 260 window and tested if the mean genetic diversity of that region had an FST greater than 2.5 standard 261 deviations above the genomic mean (Fig S9). We removed genes with abnormally low coverage in 262 either block from this analysis. To define the length of the genomic region with elevated FST values,

8 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

(a) Deltaproteobacteria 2457: blocks 16_13 vs 12_9 Acidobacteria 10556: blocks 16_13 vs 12_9 Verrucomicrobia 7383: blocks 16_13 vs 12_9 1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 0.6 FST FST FST 0.4 0.4 0.4 locus 1 locus 1 0.2 0.2 locus 4 0.2

0.0 0.0 0.0 1000 3000 5000 1000 2000 1000 2000 Verrucomicrobia 6721: blocks 16_13 vs 12_9 Verrucomicrobia 6527: blocks 5_2 vs 12_9 Rokubacteria 2807: blocks 5_2 vs 12_9 1.0 1.0 1.0

0.8 0.8 0.8

0.6 0.6 locus 1 0.6 FST FST FST 0.4 locus 3 0.4 0.4 locus 1

0.2 0.2 0.2

0.0 0.0 0.0 1000 2000 800 1600 1000 2000 3000 4000 Environmental sensing (b) (c) ABC transporters Carbohydrate-related Genomic Average Genomic Average High FST locus High FST locus Hopene biosynthesis Cell wall biosynthesis Acidobacteria_ANG_10556: locus 4 21 Kb Verrucomicrobia_ANG_7383: locus 1 6 Kb ANGP1_ANG_6837: locus 3 9.5 Kb Verrucomicrobia_ANG_6721: locus 3

5.5 Kb Rokubacteria_ANG_2807: locus 1 5.3 Kb Deltaproteobacteria_ANG_2457: locus 1 15 Kb Verrucomicrobia_ANG_6527: locus 1 12 Kb Chloroflexi_ANG_2135: locus 2 6 Kb 0.003 0.006 0.009 0.00 0.25 0.50 0.75 1.00 Nucleotide Diversity r 2

Figure 5. Highly differentiated genomic loci between sites within a meadow. (a) Values of FST for genes across the genomes of 6 bacterial populations. Each point is a gene, and the size of the point is determined by the number of SNPs within that gene. Plotted is the mean FST for that gene. Loci with significantly higher FST than the background are highlighted in red, and labeled by locus numbers used in part (b). (b) Left: Nucleotide diversity at highly differentiated loci (red circle) compared to the average (empty circle) for each population. Right: The extent of linkage disequilibrium at highly differentiated loci (red) compared to the genomic average (black) for each population. (c) Gene diagrams and annotations of highly differentiated loci (genome and loci identities given in panel b). Each block indicates an open reading frame, and blocks are colored by a subset of predicted functions.

263 we extended successful windows until the mean FST fell below this cutoff. This test identified a 264 number of regions of high FST within some microbial genomes, despite those genomes having low 265 average FST (Fig 5a). To further test for evidence of recent selection at these loci we looked for a 266 statistically significant average increase in linkage and a significant change in nucleotide diversity 267 compared to the genomic average in one or both of the blocks. We noticed that both signals of 268 purifying selection (characterized by low N:S ratios) and a reduction in genomic coverage (potentially 269 indicating gene loss in some portion of the population) often correlated with low nucleotide diversity 270 in genomic regions, and based on this, caution against identifying gene-specific selective sweeps in 271 metagenomic data based solely on a reduction in nucleotide diversity or SNP frequency. While we 272 found many loci with unusual FST or strong changes in nucleotide diversity, our stringent criteria 273 narrowed those sets down to 8 high FST loci with significantly increased rates of linkage compared 274 to the genomic background (Fig 5c). These loci also had significant changes in nucleotide diversity 275 within blocks when compared to their genomic averages (Fig 5b and 5c). All of these loci showed 276 decreases in genetic diversity, consistent with selective sweep events in one or multiple meadow

9 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

277 blocks. These loci also genes had higher N:S ratios than genomic averages, possibly consistent with 278 either recent selection acting on beneficial nonsynonymous mutations or a local accumulation of 279 slightly deleterious nonsynonymous genetic hitchhikers. 280 The loci with evidence of recent differential selection across the meadow commonly contained 281 transporter genes (Fig 5c, Supplementary Table S4), which could indicate selective pressures for 282 uptake of different compounds between sites. A Verrucomicrobia population also showed evidence 283 of a selective sweep occurring at a putative hopene biosynthesis operon, often involved in regulating 284 membrane stability. Within a Deltaproteobacteria population, a highly differentiated locus encoded 285 for numerous genes involved in two-component systems and histidine kinases, potentially related 286 to environmental sensing and response. Taken together, these multiple genomic signals suggest 287 that gene-specific selection drives differences in population genetic structure across meadow soils.

288 Conclusions 289 Developing a cohesive picture of the genetic structures of soil bacterial populations is crucial 290 for understanding the evolution and distribution of genes that can play critical ecosystem and 291 niche-specific functions. Doing so has been difficult because the most abundant soil bacteria are 292 difficult to cultivate, and when cultivation is possible, it is uncertain if microbial isolates are truly 293 random samples of a population, considering the intense selective pressure put on cells during the 294 isolation process. Here, we show that recent advances in sequencing technologies and software 295 tools enable the monitoring of heterogeneity within genomically-defined bacterial populations 296 needed to address these questions in soil. 297 Population genetic diversity in abundant soil microbes was higher than that reported for pop- 298 ulations in other environments, mirroring the high species diversity in soils. It is possible that 299 the spatial and biogeochemical complexity of soils promotes niche formation that leads to both 300 increased species diversity and in turn more genetic diversity within those species. This possibility 301 has direct implications for reductionist approaches in soil microbial ecology using synthetic commu- 302 nities of clonal isolates. Synthetic communities are qualitatively different from actual soil microbial 303 communities: in real soil communities, each species contains allelic and gene content diversity that 304 is highly variable, while recombination generates a continuum of genetic distributions of alleles. 305 Therefore, population-level diversity in natural soil communities can expand the available biotic 306 niches into which genotypic variants adapt, which can produce qualitatively different community 307 dynamics and outcomes than synthetic mixtures of genetically identical strains, even if species 308 composition is consistent. 309 The results of this study suggest that recombination and gene-specific selection are important 310 modes of evolution in soil microbes. Even in a single meadow we predicted that there are likely 311 thousands of combinatorial genetic mixtures for each species, with recombination resulting in no 312 easily measurable numbers of irreducible strain-like lineages. Thus, the dynamics of dominant soil 313 bacterial populations may be partially described using ‘quasi-sexual’ models (Rosen et al. 2015; 314 Garud et al. 2019). Although we only observed the importance of these dynamics in structuring 315 populations at the 1-10 meter scale, it is still uncertain how evolutionary dynamics in soil bacterial 316 populations develop on other spatial and temporal scales. We conclude that future work on soil 317 microbial ecology would benefit by considering the role of unlinked and substantial allelic diversity 318 within species in shaping local gene content and allele frequencies.

319 Materials and Methods

320

321 Sampling, genome sequencing, and metagenomic assembly 322 The sampling scheme, local soil characteristics, and study design that were utilized in this analysis 323 have been previously described (Diamond et al. 2019). The 60 soil cores were sampled within the 6 324 plots, each 10 meters in diameter, over a period of two months before and following autumn rainfall.

10 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

325 Every plot was directly adjacent to a paired plot (forming a “block”). Briefly, DNA was extracted from 326 10 g of soil for each sample using the PowerMax Soil DNA isolation kit (MoBio Laboratories) from 327 individual soil cores. Metagenomic libraries were prepared and sequenced with 2x250 paired read 328 sequencing on the Illumina HiSeq2500 platform at the Joint Genome Institute. Reads were quality 329 filtered to a maximimum 200 bp in length using BBduk, and the target inter-read spacing was 500 330 bp. Metagenomes were assembled using IDBA_UD (Peng et al. 2012) and individual genomes were 331 binned using differential coverage beginning and a suite of metagenomic binners as previously 332 described (Diamond et al. 2019).

333 Genome dereplication, filtering, and comparison 334 The 10,538 genome bins previously described were dereplicated using dRep (Olm et al. 2017) 335 with the secondary clustering threshold -sa 0.97. CheckM (Parks et al. 2015) was used to evaluate 336 genome completeness and contamination. For this analysis, we then analyzed species with at least 337 12 dereplicated genomes that were estimated to be >80% complete with <10% contamination, 338 independently assembled and binned out of different samples. Open reading frames were predicted 339 using Prodigal (Hyatt et al. 2010) and annotated using (1) USEARCH against UniProt (UniProt 340 Consortium 2018), Uniref90, and KEGG (Kanehisa 2000), and (2) HMM-based annotation of proteins 341 using PFAM (Finn 2005), and (3) antiSMASH (Blin et al. 2017) for biosynthetic gene prediction. 342 PERMANOVA tests for association of ANI matrices with environmental data were run using the 343 vegan package (Dixon 2003) in R (R Core Team 2018). 344 Possible contamination was stringently removed from representative genomes using the as- 345 sembled replicate genomes for each species population. The pan-genomic analysis pipeline Roary 346 (Page et al. 2015) was run with default settings on the set of genomes for each species. Contigs 347 with at least 50% of their gene products found in less than 25% of each genome set were then 348 discarded as potential contamination (generally fewer than 20 contigs, often small, were removed 349 per genome). Therefore, the final set of contigs used in this analysis contained only contigs that 350 reliably assembled and binned independently for a species in multiple samples. Organism relative 351 abundances were calculated using the total number of reads that mapped to each genome in each 352 sample and dividing by the total number of reads per sample.

353 Read mapping, SNP calling, and nucleotide diversity 354 All metagenomic reads were mapped using Bowtie 2 (Langmead and Salzberg 2012) with default 355 parameters except for the insert size parameter -X 1000 to an indexed database of the 664 derepli- 356 cated genomes obtained from the environment. Read filtering of the resulting BAM files was 357 performed using a custom script, filter_reads.py, available at the link under code availability. Map- 358 ping files were then filtered for reads that meet the following criteria: (1) both reads in a pair map 359 within 1500 bp of each other to the same scaffold (counting end to end insert size), (2) the combined 360 read pair maps with a percent identity of at least 96% to the reference, (3) at least one of the read 361 pairs has a mapq score > 1, indicating that this is a uniquely best mapping for this read pair in 362 the index. Manual inspection of representative mapping files indicated that these filters greatly 363 decreased rates of read mismapping and erroneous read pileups. 364 For all population diversity metrics, we used all filtered cross-sample read mappings that passed 365 a cutoff of at least 50% of the genome being covered with at least 5x coverage. 586 out of 1140 366 sample genome comparisons (19 genomes*60 samples) passed this minimum requirement. Base 367 pairs of reads with less than Phred scores of 30 were not used in SNP or linkage analyses. Nucleotide 368 diversity was calculated on all genomic positions with at least 5x coverage within each sample. 369 Sample read mappings were pooled by replicates, by plot of origin, by block and origin, and finally 370 by all samples in the meadow, and nucleotide diversity was recalculated on each pooled set of 371 samples. For downstream analyses, nucleotide diversity for all samples (meadow-wide, Fig 3) and 372 nucleotide diversity within each of the three sampling blocks (block-wide, Fig 5) was analyzed. 373 SNPs were called on the combined meadow-wide population set of filtered reads. We con-

11 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

374 structed a simple null model based on an error rate of 0.01% (the Phred 30 cutoff) and simulated 375 simple sampling with replacement to construct estimated rates of erroneous SNP counts at pileups 376 of varying coverages. Positions with an alternative allele that occurred with counts that had a false ~ −6 377 positive rate of 10 with the given coverage of that site in the null model and a minimum allele 378 frequency (MAF) of 5% were called as SNPs. Because meadow-wide coverages were at least 224x 379 (and ranged up to 908x) the MAF cutoff alone would likely be a stringent cutoff for Phred 30 error 380 rates. SNPs were assigned as synonymous or non-synonymous using a custom BioPython script 381 and the gene calls annotated by Prodigal.

382 Linkage disequilibrium, FST, and tests for selection 383 Linkage was calculated for all pairs of segregating sites that were spanned by at least 30 read pairs 2 384 with high quality base pair data. r and D’ linkages were calculated using formulas described by 385 (VanLiere and Rosenberg 2008). FST was calculated on sites segregating across both blocks being 386 compared (for all three block comparisons) using the Hudson method (Hudson et al. 1992) as 387 recommended by (Bhatia et al. 2013), as implemented in the scikit-allel package (Miles et al. 2019). 388 A site had to have a coverage of at least 20x in each block in order to calculate FST , and genes which 389 had coverages in a block outside of the range of two standard deviations were excluded from the 390 analysis. A ratio of averages was then used to determine mean FST for each gene. A two-sample 391 Wilcoxon test was used to determine if average linkage of highly differentiated loci differed from 392 the genomic average for each species, and two-sample t-tests in R (R Core Team 2018) were used to 393 determine if average nucleotide diversity of highly differentiated loci differed from the genomic 394 average. Both sets of tests were corrected for multiple hypotheses using the Benjamini-Hochberg 395 (Benjamini and Hochberg 1995) method.

396 Code and data availability 397 The published genomes and raw sequencing reads for this study are available under NCBI BioProject 398 number PRJNA449266. Reproducible notebooks, analyses and additional data tables used in this 399 analysis are available at https://github.com/alexcritschristoph/soil_popgen/. Large data tables and 400 genome sequences are available at: 401 https://figshare.com/projects/Soil_bacterial_populations_are_shaped_by_recombination_and_g 402 ene-specific_selection_across_a_meadow/65789

403 Acknowledgments 404 The authors thank S. Spaulding for assistance with fieldwork and B. Good for helpful discussions on 405 data analysis and theory. The authors thank K.B. Suttle for establishing the field site’s grassland 406 plots and maintaining experimental precipitation regimes and for the aerial photograph in Figure 1. 407 Sampling was performed at the Angelo Coast Range Reserve and made possible by the University 408 of California Natural Reserve System and the National Science Foundation CZP EAR-1331940 for the 409 Eel River Critical Zone Observatory. Sequencing was carried out under a Community Sequencing 410 Project at the Joint Genome Institute. Funding was provided by the Office of Science, Office of 411 Biological and Environmental Research, of the US Department of Energy (grant DOE-SC10010566).

412 References 413 Anderson RE, Reveillaud J, Reddington E, Delmont TO, Eren AM, McDermott JM, Seewald JS, Huber 414 JA. 2017. Genomic variation in microbial populations inhabiting the marine subseafloor at deep-sea 415 hydrothermal vents. Nat. Commun. 8:1114. 416 Arnold B, Sohail M, Wadsworth C, Corander J, Hanage WP, Sunyaev S, Grad YH. Fine-scale 417 haplotype structure reveals strong signatures of positive selection in a recombining bacterial 418 pathogen.

12 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

419 Bendall ML, Stevens SL, Chan L-K, Malfatti S, Schwientek P, Tremblay J, Schackwitz W, Martin J, 420 Pati A, Bushnell B, et al. 2016. Genome-wide selective sweeps and gene-specific sweeps in natural 421 bacterial populations. ISME J. 10:1589–1601. 422 Benjamini Y, Hochberg Y. 1995. Controlling the False Discovery Rate: A Practical and Powerful 423 Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological) 424 57:289–300. 425 Bergmann, Gaddy T., Scott T. Bates, Kathryn G. Eilers, Christian L. Lauber, J. Gregory Caporaso, 426 William A. Walters, Rob Knight, and Noah Fierer. 2011. The under-recognized dominance of 427 Verrucomicrobia in soil bacterial communities. Soil Biology and Biochemistry 43, no. 7: 1450-1455. 428 Berhe AA, Blake Suttle K, Burton SD, Banfield JF. 2012. Contingency in the direction and mechan- 429 ics of soil organic matter responses to increased rainfall. Plant and Soil 358:371–383. 430 Bhatia G, Patterson N, Sankararaman S, Price AL. 2013. Estimating and interpreting FST: the 431 impact of rare variants. Genome Res. 23:1514–1521. 432 Blin K, Wolf T, Chevrette MG, Lu X, Schwalen CJ, Kautsar SA, Suarez Duran HG, de los Santos 433 ELC, Kim HU, Nave M, et al. 2017. antiSMASH 4.0—improvements in chemistry prediction and gene 434 cluster boundary identification. Nucleic Acids Research 45:W36–W41. 435 Butterfield CN, Li Z, Andeer PF, Spaulding S, Thomas BC, Singh A, Hettich RL, Suttle KB, Probst 436 AJ, Tringe SG, et al. 2016. Proteogenomic analyses indicate bacterial methylotrophy and archaeal 437 heterotrophy are prevalent below the grass root zone. PeerJ 4:e2687. 438 Cassier-Chauvat C, Veaudor T, Chauvat F. 2016. Comparative Genomics of DNA Recombination 439 and Repair in Cyanobacteria: Biotechnological Implications. Front. Microbiol. 7:1809. 440 Cui, Y., Yang, X., Didelot, X., Guo, C., Li, D., Yan, Y., Zhang, Y., Yuan, Y., Yang, H., Wang, J. and Wang, 441 J., 2015. Epidemic clones, oceanic gene pools, and eco-LD in the free living marine pathogen Vibrio 442 parahaemolyticus. Molecular biology and evolution, 32(6), pp.1396-1410. 443 Diamond S, Andeer PF, Li Z, Crits-Christoph A, Burstein D, Anantharaman K, Lane KR, Thomas 444 BC, Pan C, Northen TR, et al. 2019. Mediterranean grassland soil C–N compound turnover is 445 dependent on rainfall and depth, and is mediated by genomically divergent microorganisms. 446 Nature Microbiology . 447 Dixon P. 2003. VEGAN, a package of R functions for community ecology. Journal of Vegetation 448 Science 14:927. 449 Doroghazi JR, Buckley DH. 2010. Widespread homologous recombination within and between 450 Streptomyces species. ISME J. 4:1136–1143. 451 Fierer N. 2017. Embracing the unknown: disentangling the complexities of the soil microbiome. 452 Nat. Rev. Microbiol. 15:579–590. 453 Fierer N, Strickland MS, Liptzin D, Bradford MA, Cleveland CC. 2009. Global patterns in below- 454 ground communities. Ecol. Lett. 12:1238–1249. 455 Finn RD. 2005. Pfam: the protein families database. Encyclopedia of Genetics, Genomics, 456 Proteomics and Bioinformatics . 457 Garud NR, Good BH, Hallatschek O, Pollard KS. 2019. Evolutionary dynamics of bacteria in the 458 gut microbiome within and across hosts. PLoS Biol. 17:e3000102. 459 González-Torres P, Rodríguez-Mateos F, Antón J, Gabaldón T. 2019. Impact of Homologous 460 Recombination on the Evolution of Prokaryotic Core Genomes. MBio 10:e02494–18. 461 Hawkes CV, Kivlin SN, Rocca JD, Huguet V, Thomsen MA, Suttle KB. 2011. Fungal community 462 responses to precipitation. Global Change Biology 17:1637–1645. A 463 Hudson RR, Slatkin M, Maddison WP. 1992. Estimation of levels of gene flow from DNA sequence 464 data. Genetics 132:583–589. 465 Hultman J, Waldrop MP, Mackelprang R, David MM, McFarland J, Blazewicz SJ, Harden J, Turetsky 466 MR, McGuire AD, Shah MB, et al. 2015. Multi-omics of permafrost, active layer and thermokarst bog 467 soil microbiomes. Nature 521:208–212. 468 Hyatt D, Chen G-L, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. Prodigal: prokaryotic 469 gene recognition and translation initiation site identification. BMC Bioinformatics 11:119.

13 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

470 Jain C, Rodriguez-R LM, Phillippy AM, Konstantinidis KT, Aluru S. 2018. High throughput ANI 471 analysis of 90K prokaryotic genomes reveals clear species boundaries. Nat. Commun. 9:5114. 472 Shapiro BJ, Friedman J, Cordero OX, Preheim SP, Timberlake SC, Szabó G, Polz MF, Alm EJ. 2012. 473 Population Genomics of Early Events in the Ecological Differentiation of Bacteria. Science 336:48–51. 474 Ji M, Greening C, Vanwonterghem I, Carere CR, Bay SK, Steen JA, Montgomery K, Lines T, Beardall 475 J, van Dorst J, et al. 2017. Atmospheric trace gases support primary production in Antarctic desert 476 surface soil. Nature 552:400. 477 Jordan, I. K., Rogozin, I. B., Wolf, Y. I., & Koonin, E. V. 2002. Essential genes are more evolutionarily 478 conserved than are nonessential genes in bacteria. Genome Research, 12(6), 962-968. 479 Kanehisa M. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 480 28:27–30. 481 Krause DJ, Whitaker RJ. 2015. Inferring Speciation Processes from Patterns of Natural Variation 482 in Microbial Genomes. Syst. Biol. 64:926–935. 483 Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods 484 9:357–359. 485 Lin M, Kussell E. 2019. Inferring bacterial recombination rates from large-scale sequencing 486 datasets. Nat. Methods 16:199. 487 Lloyd KG, Steen AD, Ladau J, Yin J, Crosby L. 2018. Phylogenetically Novel Uncultured Microbial 488 Cells Dominate Earth Microbiomes. mSystems 3:e00055–18. 489 Miles A, Ralph P, Rae S, Pisupati R. 2019. cggh/scikit-allel: v1.2.1. Available from: https://zenodo. 490 org/record/3238280 491 O’Brien SL, Gibbons SM, Owens SM, Hampton-Marcell J, Johnston ER, Jastrow JD, Gilbert JA, Meyer 492 F, Antonopoulos DA. 2016. Spatial scale drives patterns in soil bacterial diversity. Environ. Microbiol. 493 18:2039–2051. 494 Olm MR, Brown CT, Brooks B, Banfield JF. 2017. dRep: a tool for fast and accurate genomic 495 comparisons that enables improved genome recovery from metagenomes through de-replication. 496 ISME J. 11:2864–2868. 497 Oksanen, Jari, Roeland Kindt, Pierre Legendre, Bob O’Hara, M. Henry H. Stevens, Maintainer Jari 498 Oksanen, and M. A. S. S. Suggests. 2007. The vegan package. Community ecology package. 10: 499 631-637. 500 Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane 501 JA, Parkhill J. 2015. Roary: rapid large-scale pan genome analysis. Bioinformatics 502 31:3691–3693. 503 Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. 2015. CheckM: assessing the 504 quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 505 25:1043–1055. 506 Peng Y, Leung HCM, Yiu SM, Chin FYL. 2012. IDBA-UD: a de novo assembler for single-cell and 507 metagenomic sequencing data with highly uneven depth. Bioinformatics 28:1420–1428. 508 Pimentel ZT, Zhang Y. 2018. Evolution of the Natural Transformation Protein, ComEC, in Bacteria. 509 Front. Microbiol. 9:2980. 510 R. Core Team. 2013. R: A language and environment for statistical computing: 201. 511 Rocha EPC, Cornet E, Michel B. 2005. Comparative and Evolutionary Analysis of the Bacterial 512 Homologous Recombination Systems. PLoS Genet. 1:e15. 513 Rosen MJ, Davison M, Bhaya D, Fisher DS. 2015. Fine-scale diversity and extensive recombination 514 in a quasisexual bacterial population occupying a broad niche. Science 348:1019–1023. 515 Rosen MJ, Davison M, Fisher DS, Bhaya D. 2018. Probing the ecological and evolutionary history 516 of a thermophilic cyanobacterial population via statistical properties of its microdiversity. PLoS One 517 13:e0205396. 518 Sakoparnig T, Field C, van Nimwegen E. 2019. Whole genome phylogenies reflect long-tailed 519 distributions of recombination rates in many bacterial species. bioRxiv :601914. Available from: htt 520 ps://www.biorxiv.org/content/10.1101/601914v1.abstract

14 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

521 Shapiro BJ, David LA, Friedman J, Alm EJ. 2009. Looking for Darwin’s footprints in the microbial 522 world. Trends in Microbiology 17:196–204. 523 Smith, John Maynard, and John Haigh 1974. The hitch-hiking effect of a favourable gene. Genetics 524 Research 23, no. 1: 23-35. 525 Sullivan MJP, Thomsen MA, Suttle KB. 2016. Grassland responses to increased rainfall depend 526 on the timescale of forcing. Glob. Chang. Biol. 22:1655–1665. 527 Suttle KB, Thomsen MA, Power ME. 2007. Species interactions reverse grassland responses to 528 changing climate. Science 315:640–642. 529 Thomas CM, Nielsen KM. 2005. Mechanisms of, and Barriers to, Horizontal Gene Transfer 530 between Bacteria. Nat. Rev. Microbiol. 3:711. 531 Thompson LR, Sanders JG, McDonald D, Amir A, Ladau J, Locey KJ, Prill RJ, Tripathi A, Gibbons 532 SM, Ackermann G, et al. 2017. A communal catalogue reveals Earth’s multiscale microbial diversity. 533 Nature 551:457. 534 Tran F, Boedicker JQ. 2019. Plasmid Characteristics Modulate the Propensity of Gene Exchange 535 in Bacterial Vesicles. J. Bacteriol. 201:e00430–18. 536 UniProt Consortium T. 2018. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 537 46:2699. 538 VanLiere, J.M. and Rosenberg, N.A., 2008. Mathematical properties of the r2 measure of linkage 539 disequilibrium. Theoretical population biology, 74(1), pp.130-137. 540 Vos M, Wolf AB, Jennings SJ, Kowalchuk GA. 2013. Micro-scale determinants of bacterial diversity 541 in soil. FEMS Microbiol. Rev. 37:936–954. 542 Whitaker RJ, Banfield JF. 2006. Population genomics in natural microbial communities. Trends in 543 Ecology & Evolution 21:508–516. 544 White RA 3rd, Bottos EM, Roy Chowdhury T, Zucker JD, Brislawn CJ, Nicora CD, Fansler SJ, 545 Glaesemann KR, Glass K, Jansson JK. 2016. Moleculo Long-Read Sequencing Facilitates Assembly 546 and Genomic Binning from Complex Soil Metagenomes. mSystems 1. 547 Wielgoss S, Didelot X, Chaudhuri RR, Liu X, Weedall GD, Velicer GJ, Vos M. 2016. A barrier to ho- 548 mologous recombination between sympatric strains of the cooperative soil bacterium Myxococcus 549 xanthus. ISME J. 10:2468–2477. 550 Woodcroft BJ, Singleton CM, Boyd JA, Evans PN, Emerson JB, Zayed AAF, Hoelzle RD, Lamberton 551 TO, McCalley CK, Hodgkins SB, et al. 2018. Genome-centric view of carbon processing in thawing 552 permafrost. Nature 560:49.

553

15 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

554 Supplemental Figures 555

● ●

● ● 90 90

● ●

● ● ● ● ● ● ● ● ● ● ● ● 60 ● 60 ● ● ● ● ● ● ●

● ● ● ●

Coverage ● Coverage ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ●● ● ●● ● 30 ● ●● ● ● ● ●● 30 ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ●● ● ● ● ● ● ●● ● ● ● ● ● ●●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●●● ● ● ● ●● ●● ●●●●●●● ● ●● ● ● ●● ●●●● ● ●● ● ● ● ● ● ● ● ●● ●●● ●● ●● ●● ●●● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ●● ● ● ● ● ● ●●●●●●●●● ●● ●● ● ● ●● ●●● ●● ● ●● ● ● ●●● ● ● ● ●●●●●●●●●●●●●●● ●● ● ● ● ● ●●● ●●●● ●●● ● ●●● ●●● ● ●●●●● ● ● ● ●●●●●●●●●●●●●●●●●●●● ● ●●●● ●●● ● ● ●●●●●●●●● ●●●●● ● ●● ●● ● ●●●●●●●●●●●●●●●●●●●●●●● ● ●●● ●●●●●● ● ●●●●● ●●●●● ●●● ●●●● ●●● ●● ● ● ●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●●●●● ● ●●●●● ●●●●●●●●●●●●●●● ●●●●● ●●●●●●●●●●●●●●●●● ●●● ●● ● ● ●●●●●●●●●●●●●●●● ●●●●●●●●●●●●● ●●●●●●●●●●●●●● ● ●●●●●●●●●●● ● ●●●●●●● ●●●●●●●●●● ●●●●●●●●●●●●●●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ●●

0 10000 20000 30000 40000 0.005 0.010 0.015 0.020 SNPs / Mbp Nucleotide Diversity 2.0 ● ● ● ● 0.020 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● 1.5 ● ● ● ● ● 0.015 ● ●● ● ● ● ●● ● ● ● ● ● ●● ● ● ● ● ●●●● ●●● ● ● ●●● ●●●● ● ● ● ● ● ●●● ● ● ● ●●● ●● ● ● ● ● ● ●● ●●●●● ● ● ● ● ● ● ● ●●●●●●● ● ●●● ● ● ● ●●●●●● ● ●● 1.0 ●●● ●●● ● ● ● ● ● ● ●●●●●●●●● ● ● ● ●●●● ●● ● ● ●● 0.010 ●●●●● ●●●● ●●● ● ●●● ● ● ● ● ● ●●●●● ●●●●● ● ●●●●●● ●● ●● ● ● ●●●● ●●●●● ● ● ● ● ●● ●●●●●● ● ●●● ●●●●●●●●● ●●● ● ● ● ● ●●●● ● ● ● ● ●●●●●● ●● ●● ● ● ● ●● ●●●●● ●● ●●●●●● ● ● ● ●●●●●●●●●●● ● ● Nucleotide Diversity ●●●● ● ● ● ●● ● ● ● ● ●●●● ● ● ●● 0.5 ● ● ● ●● ● ●●●●●● ●● ● ●● ●● ●●●●●●● ●● ●● ●● ● ●● ●●●●●●●● ● ● ● ●●● ●● ● ● ●●● ●● 0.005 ●●●●●● ● ● ● ●● ● ● ●●●●●●●● ● ● ● ●●●●●● ● ●●●●●●●●●●● ● ●● ●● ● ●●●●●●● Nucleotide Diversity*coverage ● ●●●●●● ● ●●●●●●●● ● ●● ●●●● ●●●● ●●●●●● ● ●●●●●●●●●●●● ●● ●●●●●●●●●●●● ●●●● ● ●●●●●●●●●●●●●●●● ●●● ●●●●●●●●●●●●●●●●●●●●● ●●●●●● ●●●●●●●●●●●●●●●●●●●●● ●●●● ● ●●●●●●●●●●●●●●●●●● ●●●●●●● ●●●●●●●●●●●●● ●●●● 0.0 ●●●●●●● 0 10000 20000 30000 40000 0 10000 20000 30000 40000 SNPs SNPs / Mbp 556

557

558 Supplementary Figure S1 The impact of coverage on the number of SNPs called in each sample. 559 Each point is a sample-genome mapping pairing, with the number of SNPs called per sample plotted 560 against the mean coverage of the genome in that sample, the mean nucleotide diversity of the 561 genome in that sample, and the mean nucleotide diversity times the coverage, demonstrating that 562 at low coverages the number of SNPs called is a function of nucleotide diversity and coverage.

16 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

Acidobacteria_ANG_10556 Acidobacteria_ANG_477 Gemmatimonadetes_ANG_3684 Verrucomicrobia_ANG_4402 0.6

0.6 0.5 0.6 0.5 N−N 1000 2000 N−N N−S 2000 4000 N−S S−S 0.5 3000 0.4 6000 0.4 0.5 S−S r2 r2 r2 r2

0.3 2500 0.3 0.4 3000 0.4 N−N N−N 5000 6000 N−S N−S 7500 S−S S−S 0.2 9000 10000 0.2 0.3 0.3 0 200 400 600 0 200 400 600 0 200 400 600 800 0 200 400 600 800 Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp)

Gemmatimonadetes_ANG_4339 Rokubacteria_ANG_2807 Chloroflexi_ANG_2135 Verrucomicrobia_ANG_4402 0.4 0.6 0.5 0.4 5000 5000 5000 5000 10000 10000 0.5 10000 10000 0.3 15000 15000 15000 15000 0.4 20000 20000 0.3 0.4 r2 r2 r2 r2

0.3 N−N 0.2 N−N 0.3 N−N 0.2 N−N N−S N−S N−S N−S S−S S−S S−S S−S 0.2

0.2 0.1 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp)

Gemmatimonadetes_ANG_3800 Gemmatimonadetes_ANG_3801 ANGP1_ANG_6837 Verrucomicrobia_ANG_7383

0.40 0.45

0.40 5000 5000 10000 0.4 10000 0.35 0.4 10000 10000 15000 15000 0.35 20000 20000 20000 20000 25000 r2 r2 0.30 r2 0.30 r2 0.3 0.3 N−N N−N 0.25 N−N N−S N−S 0.25 N−N N−S S−S S−S N−S S−S 0.20 0.2 0.2 S−S 0.20 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp)

Gammaproteobacteria_ANG_884 Verrucomicrobia_ANG_3696 Verrucomicrobia_ANG_6721 Dormibacteraeota_ANG_750 0.5 0.35 0.45 0.40

0.30 10000 10000 10000 10000 0.40 0.4 20000 0.35 20000 20000 20000 0.25 30000 30000 r2 r2 r2 r2 0.35 0.30 0.3 0.20 N−N N−N N−N N−N N−S N−S N−S N−S 0.30 S−S S−S 0.15 S−S 0.25 S−S 0.2 0.10 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp)

Gammaproteobacteria_ANG_56 Deltaproteobacteria_ANG_2457 Gemmatimonadetes_ANG_2150 0.35

0.250 0.30 20000 0.28 20000 40000 20000 40000 60000 40000 0.225 60000 0.25 80000 60000 r2 r2 r2 0.24 80000 0.200 0.20 N−N N−N N−S N−N N−S 0.175 S−S N−S S−S 0.20 0.15 S−S 0.150 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 Genomic Distance (bp) Genomic Distance (bp) Genomic Distance (bp) 563

564

565 Supplementary Figure S2 Linkage disequilibrium decay over genomic distance for all 19 species 566 in this study. Each block point represents a mean of linkage for SNPs at that genomic distance, 567 divided into non synonymous-nonsynonymous linkages, nonsynonymous-synonymous linkages, 568 and synonymous-synonymous linkages. The size of each block point represents the number of 569 SNPs that went into calculating the mean.

17 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

N-N mutations S-S mutations Deltaproteobacteria_ANG_2457 27 67 6 36 59 5

Chloroflexi_ANG_2135 25 67 8 35 60 5

Verrucomicrobia_ANG_6721 21 73 6 28 66 6

Acidobacteria_ANG_477 14 74 12 22 65 13

Dormibacteraeota_ANG_750 17 74 9 25 69 7

Verrucomicrobia_ANG_3696 15 75 10 22 67 11

Rokubacteria_ANG_2807 21 70 8 27 67 6

16 74 10 21 71 8 Gemmatimonadetes_ANG_3801 2/4 haplotypes observed Gemmatimonadetes_ANG_3800 20 70 10 26 67 8 3/4 haplotypes observed Verrucomicrobia_ANG_6527 20 69 11 24 62 14 4/4 haplotypes observed Gemmatimonadetes_ANG_2150 14 73 12 20 71 9

ANGP1_ANG_6837 25 61 14 31 58 11 AB aB Ab ab Gemmatimonadetes_ANG_3684 26 62 12 33 57 9

Gammaproteobacteria_ANG_884 14 71 15 17 71 12

Verrucomicrobia_ANG_7383 21 66 12 25 63 12

Gemmatimonadetes_ANG_4339 11 73 16 15 71 14

Verrucomicrobia_ANG_4402 7 70 23 10 63 27

Gammaproteobacteria_ANG_56 15 77 8 15 78 7

Acidobacteria_ANG_10556 18 64 18 18 61 21

0 25 50 75 100 0 25 50 75 100 Percentage of Biallelic Haplotypes (%) 570

571

572 Supplementary Figure S3 Biallelic haplotype counts for each species. The percentage of ~ 573 haplotype counts for each pair of biallelic sites within 1 Kb of each other is shown for both 574 nonsynonymous-nonsynonymous and synonymous-synoymous pairs of segregating sites.

575

18 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

2150 ● 10556 0.20 ● r2 0.15 0.20 0.25 0.30 2457 0.15 2807 ● 0.35 ● 56 ● ● Acidobacteria ● ANGP1 3684 6837 ● ● ● Chloroflexi 7383 477 ● ● Deltaproteobacteria Relative Abundance(%) Relative 0.10 ● 4402 ● Dormibacteraeota ● 4339 ● ● Gammaproteobacteria ● Gemmatimonadetes 3696 ● Rokubacteria 3801 884 ● ● 2135 ●● Verrucomicrobia ● 3800 6721 0.05 ● ● ● 6527 750 ●

0.004 0.008 0.012 0.016 Nucleotide Diversity 576

577

578 Supplementary Figure S4 Relationship between nucleotide diversity and relative abundance 579 across species.

580

19 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

10556 ● 1.0

477 ● ● Acidobacteria 0.8 ● ANGP1 ● Chloroflexi 6527 ● ● Deltaproteobacteria 4402 56 ● 3696 ● ● Dormibacteraeota ● ● Gammaproteobacteria 0.6 2807 884 ● Gemmatimonadetes

N:S SNP Ratio ● ● 2457 4339 ● 7383 ● ● Rokubacteria ● 6721 3684 ● ● Verrucomicrobia ● 3800 ● 3801 ● 2135 6837 0.4 ● ● 750 ● ● 2150

0.004 0.008 0.012 0.016 Nucleotide Diversity 581

582

583 Supplementary Figure S5 Relationship between nucleotide diversity and the N:S SNP ratio 584 across species. A linear regression is shown (R2=0.25; p=0.018).

585

20 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

2457 ● 2135 ●

6721 Nucleotide Diversity ● 477 ● 0.004 750 3696 1.04 ● ● 0.008 0.012 2807● 3801 3800 ● 2150 Phylum 3684 ● ● 6527 D'N /D'S ● 6837 ● ● Acidobacteria ● ANGP1 1.02 ● Chloroflexi

7383 884 4339 ● Deltaproteobacteria ● ● ● ● Dormibacteraeota 4402 ● Gammaproteobacteria ● ● Gemmatimonadetes 56 ● Rokubacteria ● ● Verrucomicrobia 1.00 10556 ●

0.87 0.90 0.93 0.96 D' 586

587

588 Supplementary Figure S6 Relationship between D’ and D’N /D’S . The ratio of non-synonymous 589 polymorphism linkage to synonymous polymorphism linkage (D’N/D’S) versus mean D’ for all 590 bacterial populations studied. A linear regression model is shown (F-statistic: 15.6; Adjusted 591 R-squared: 0.45, p=0.001).

592

21 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. S S / MAF S 2 N /D' /r N N 2 2 r r Abundance D ' Genome size SNPs / Mbp SNPs N/S ratio MAF D' Nucleotide diversity 1 D'N /D'S ● ● ● ● ● ● −0.64 −0.59 −0.69 0.8 r2 /r2 ● ● N S ● 0.58 0.65 −0.47 ● −0.6 ● 0.6 Nucleotide diversity ● 0.58 0.98 ● −0.54 ● ● −0.66 ● 0.4 SNPs / Mbp ● 0.65 0.98 ● −0.56 ● ● −0.66 ● 0.2 Abundance ● ● ● ● ● 0.63 ● ● ● 0 N/S ratio ● −0.47 −0.54 −0.56 ● 0.69 ● 0.46● ● −0.2 Genome size ● ● ● ● 0.63 0.69 ● ● ● −0.4 MAF / MAF ● ● ● ● N S −0.64 ● ● ● 0.46● −0.6 r2 −0.59 −0.6 −0.66 −0.66 ● 0.46● ● ● 0.64 −0.8 D' −0.69 ● ● ● ● ● ● 0.46● 0.64 −1

593

594

595 Supplementary Figure S7 Cross-population correlations between means of all population 596 genetics summary statistics in this study. In each box, the pearson correlation coefficient between 597 two metrics is shown. Only correlation coefficients with a p-value <0.01 are shown.

22 of 25 certified bypeerreview)istheauthor/funder,whohasgrantedbioRxivalicensetodisplaypreprintinperpetuity.Itmadeavailableunder bioRxiv preprint doi: https://doi.org/10.1101/695478 601 600 599 598 e-eeFTvle o ahseisbtenec ftetrebokcmaiosi shown. is comparisons block three the of each between species each for values FST per-gene Gammaproteobacteria_ANG_884 Gemmatimonadetes_ANG_4339 Gemmatimonadetes_ANG_3801 Gemmatimonadetes_ANG_3800 Gemmatimonadetes_ANG_3684 Gemmatimonadetes_ANG_2150 Gammaproteobacteria_ANG_56 Deltaproteobacteria_ANG_2457 Dormibacteraeota_ANG_750 upeetr iueS8 Figure Supplementary Verrucomicrobia_ANG_7383 Verrucomicrobia_ANG_6721 Verrucomicrobia_ANG_6527 Verrucomicrobia_ANG_4402 Verrucomicrobia_ANG_3696 Acidobacteria_ANG_10556 Rokubacteria_ANG_2807 Acidobacteria_ANG_477 Chloroflexi_ANG_2135 ANGP1_ANG_6837 a CC-BY-NC-ND 4.0Internationallicense ; this versionpostedJuly8,2019. 0.0072 0.016 0.028 0.096 0.044 .9 . 0.071 0.2 0.094 0.029 0.0077 0.03 0.019 0.085 0.015 0.019 0.008 0.074 0.16 0.035 .9 .50.03 0.15 0.097 0.18 .40040.034 0.094 0.04 0.12 0.11 0.38 0.24 0.18 16_13:12_9 blocks .9 0.079 0.093 .30.038 0.03 .40.014 0.14 .50.28 0.55 0.33 0.25 5_2:12_9 0.083 0.035 0.046 0.21 0.05 0.07 5_2:16_13 enFTvle o vr pce costemao.Temean The meadow. the across species every for values FST Mean 0.1 0.2 0.3 0.4 0.5

fst 0.0 0.1 0.2 0.3 0.4 0.5 The copyrightholderforthispreprint(whichwasnot . 5_2:12_9 5_2:16_13 blocks 16_13:12_9 3o 25 of 23 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

602

603

604 Supplementary Figure S9 Gene FST along the genome for every species across the meadow. 605 Per-gene FST values are plotted in ORF order on contigs. The title of each plot shows the species 606 name and the block vs block comparison values being plotted. Each point is a gene, and the size of 607 the point increases with the number of segregating sites in that each. Genes are colored in red if 608 they are part of a locus with mean FST greater than 2.5 times the standard deviation of the genomic 609 average.

24 of 25 bioRxiv preprint doi: https://doi.org/10.1101/695478; this version posted July 8, 2019. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license.

610 Supplementary Table S1 Completeness and contamination statistics for all population replicate 611 genomes included in this study. 612 Supplementary Table S2 Key population genetics summary statistics in this study. 613 Supplementary Table S3 ComEC annotations for each species. 614 Supplementary Table S4 Annotations of genes within highly differentiated loci.

25 of 25