Evolution of the rapidly mutating human salivary agglutinin (DMBT1) and population subsistence strategy

Shamik Polley^a, Sandra Louzada^b, Diego Forni^c, Manuela Sironi^c, Theodosius Balaskas^a, David S. Hains^d, Fengtang Yang^b, and Edward J. Hollox^a,1

^a Department of Genetics, University of Leicester, Leicester LE1 7RH, United Kingdom; ^b Molecular Cytogenetics Facility, Wellcome Trust Sanger Institute, Hinxton, Cambridge CB10 1SA, United Kingdom; ^c Bioinformatics, Scientific Institute IRCCS E. Medea, 23842 Bosisio Parini, Italy; and ^d Division of Pediatric Nephrology, University of Tennessee Health Science Center, Le Bonheur Children’s Hospital, Memphis, TN 38103

GENETICS

Edited by Huntington F. Willard, The Marine Biological Laboratory, Woods Hole, MA, and approved March 4, 2015 (received for review August 27, 2014)

The dietary change resulting from the domestication of plant and animal species and development of agriculture at different locations across the world was one of the most significant changes in human evolution. An increase in dietary carbohydrates caused an increase in dental caries following the development of agriculture, mediated by the cariogenic oral bacterium Streptococcus mutans. Salivary agglutinin [SAG, encoded by the deleted in malignant brain tumors 1 (DMBT1) gene] is an innate immune receptor glycoprotein that binds a variety of bacteria and viruses, and mediates attachment of S. mutans to hydroxyapatite on the surface of the tooth. In this study we show that multiallelic copy number variation (CNV) within DMBT1 is extensive across all populations and is predicted to result in between 7 and 20 scavenger-receptor cysteine-rich (SRCR) domains within each SAG molecule. Direct observation of de novo mutation in multigeneration families suggests these CNVs have a very high mutation rate for a protein-coding locus, with a mutation rate of up to 5% per gamete. Given that the SRCR domains bind S. mutans and hydroxyapatite in the tooth, we investigated the association of sequence diversity at the SAG-binding gene of S. mutans, and DMBT1 CNV. Furthermore, we show that DMBT1 CNV is also associated with a history of agriculture across global populations, suggesting that dietary change as a result of agriculture has shaped the pattern of CNV at DMBT1, and that the DMBT1-S. mutans interaction is a promising model of host-pathogen-culture coevolution in humans.

copy number variation | agriculture | DMBT1 | mutation | structural variation

Significance

Humans have undergone an evolutionarily very recent change in environment of their own making. The development of agriculture profoundly altered diet and exposure to pathogens, and yet the evolutionary response to this is still poorly understood. Here, we characterize extensive copy number variation (CNV) of the gene encoding salivary agglutinin (deleted in malignant brain tumors 1, DMBT1). Salivary agglutinin comprises 10% of salivary protein and binds bacteria, including mediating the attachment of Streptococcus mutans, the causative agent of dental caries, to teeth. We show that DMBT1 is a very fast-mutating protein-coding locus, and DMBT1 CNV correlates with a population history of agriculture. Furthermore, we examine the relationship between variation of the S. mutans region that binds salivary agglutinin and CNV of the DMBT1 gene.

Author contributions: S.P. and E.J.H. designed research; S.P., S.L., T.B., F.Y., and E.J.H. performed research; D.F., M.S., and D.S.H. contributed new reagents/analytic tools; S.P., D.F., M.S., T.B., and E.J.H. analyzed data; and S.P. and E.J.H. wrote the paper.
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
Freely available online through the PNAS open access option.
^1 To whom correspondence should be addressed. Email: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1416531112/-/DCSupplemental.

The effect of the agricultural transition on human genetic variation has been extensive (1). In addition to the indirect effect of an exponential increase in population size, direct effects on particular genes have occurred, most notably the evolution, at multiple locations through multiple alleles, of lactase persistence at the LCT gene, enabling adults to drink milk generated from domesticated mammals (2). The agricultural transition is also thought to have had an impact on the oral commensal microbiota, in particular Streptococcus mutans, the causative agent of dental caries, which is the most common chronic infectious disease in humans. Analyses of ancient skeletal remains (3) and modern genomic diversity (4) have suggested that S. mutans became a major oral pathogen only after the development of agriculture and the concomitant increase in availability of sugars consumed directly or derived from starchy foods. The increased level of caries in individuals from agricultural societies is observed in both modern and prehistoric populations (5–7). This increase in caries was likely to have profound consequences for the health of the individuals concerned before the development of modern dental treatment (8). Caries left untreated leads to tooth loss, potential severe infections, and a decrease in masticatory efficiency potentially leading to a reduction in access of enzymes to the food bolus (9). It is unclear whether human genetic variation has responded to this change in dental health via natural selection.

We analyzed the variation of the deleted in malignant brain tumors 1 (DMBT1) gene encoding a major salivary glycoprotein, salivary agglutinin, also known as gp-340, hensin or muclin, and hereafter referred to as DMBT1SAG (10). This comprises ∼10% of total salivary protein in children and 5% in adults (11), and is also present at other mucosal surfaces (12). DMBT1SAG is a component of innate immunity, acting as a pattern recognition receptor interacting with bacteria such as S. mutans and Helicobacter pylori and viruses such as HIV-1 and influenza (12). Variation between host saliva affects the adhesion of S. mutans (13), and protein variants of DMBT1SAG have been suggested to affect caries susceptibility in children (14).

Copy number variation (CNV) describes a difference in DNA dosage between different individuals, and includes simple deletion and duplications as well as more complex multicopy and multiallelic variation (15). CNV can affect gene expression by altering the total number of copies of individual genes and therefore gene dosage, by changing tissue-specific enhancers or by varying the number of exons within a gene, potentially altering the number of protein-coding subunits, for example (16). CNV can also show a germ-line mutation rate at least an order of magnitude higher than single nucleotide substitutions, because of the distinct mutational processes that underlie copy number change (16). Genome wide, CNVs are enriched for genes that
encode proteins that interact with the environment, particularly those in host defense (17), and a high mutation rate of these loci may contribute to immunological individuality of the host. Whether selection or a relaxation of functional constraint is responsible for this bias in genome-wide distribution remains unresolved, although there are strong arguments for the role of gene duplication in evolution (18). There is convincing evidence that CNV in humans can affect the host’s susceptibility to infectious diseases, including the well established effect of α-globin deletion on malaria susceptibility (19). Furthermore, it has been suggested that the frequency of high copy number alleles of the salivary amylase gene AMY1 has increased by natural selection in populations that eat a carbohydrate-rich diet (20).

DMBT1SAG mostly consists of an array of scavenger receptor cysteine-rich (SRCR) domains which bind bacteria, including S. mutans (21), and promote their adherence to hydroxyapatite of the tooth (22, 23), which is critical for the cariogenic activity of the bacteria. The canonical DMBT1 gene annotated in the hg19 human genome assembly has 13 repeats each containing a SRCR domain (Fig. 1A). The repeats containing the SRCR domain, hereafter known as SRCR repeats, within the DMBT1 gene are distinct at the DNA level but share ∼80% identity at the protein level. Within the SRCR domain, smaller regions that bind to S. mutans and hydroxyapatite have been identified (Fig. 1D), although bacterial binding is inhibited by sialidases, showing glycosyl groups are also important in bacterial binding (24). Two polymorphic deletions within the DMBT1 gene, involving variable numbers of SRCR repeats, have been partially described previously but the nature and extent of CNV within this gene remained incompletely characterized. For example, a polymorphic deletion involving SRCR3–SRCR6 was associated with Crohn’s disease. Furthermore, a polymorphic deletion involving at least repeats SRCR9–SRCR11 has been described (25, 26). Genome-wide arrayCGH (aCGH) analysis has identified two CNVs consistent with the known polymorphic deletions, but showing extensive complex loss and gain of copy number (17) (Fig. 1B).

We aimed to fully characterize the CNV involving DMBT1, investigate its mutation rate, question whether it has adapted to different environments across human populations, and dissect the variation in the context of its interaction with S. mutans.

www.pnas.org/cgi/doi/10.1073/pnas.1416531112 PNAS | April 21, 2015 | vol. 112 | no. 16 | 5105–5110

Results

Characterization of Copy Number Variation at DMBT1. Genome-wide analysis using high-resolution tiling arrayCGH (aCGH) identified two CNVs whose location was consistent with the known polymorphic deletions described in the literature, but showing a much more extensive and complex loss and gain of copy number (17) (Fig. 1B). We used these two CNV regions as a starting point for our analysis, by interrogating the two CNVs (CNV1 involving SRCR3–SRCR6 and CNV2 involving SRCR9–SRCR11, Fig. 1B) separately. To this end, we designed several paralog ratio tests (PRTs), a form of quantitative PCR that is particularly robust in accurately calling CNVs (27). These PRTs were used to estimate copy number at each CNV in 270 individuals from HapMap phase 1. To verify our PRTs, we used concordance between PRT assays, clustering of PRT copy number estimates into distinct groups reflecting integer copy number, and comparison with aCGH probe intensities (Fig. 1C and Fig. S1). By typing samples in duplicate, we estimated an upper 95% confidence limit for the error rate in determining CNV1 and CNV2 copy number to be 0.37% (366 samples, no discrepancies) and 0.33% (407 samples, no discrepancies), respectively. In addition, we further validated a subset of samples using long-PCR and fiber-FISH (Figs. S2 and S3). We show that CNV1 is a multiallelic CNV with a copy number varying between 0 and 5 per diploid genome, and the copy number variable unit includes four SRCR domains (Fig. S1). For CNV1, zero, one, and two or more copies reflect homozygous deletion, heterozygous deletion, and normal genotype of the deletion described previously (26). Sanger sequencing of homozygous deleted samples suggests that nonallelic homologous recombination (NAHR) between the 98% identical SRCR repeats carrying SRCR2 and SRCR6 is responsible for CNV1 (Fig. 1D and SI Methods). It is also clear that CNV2 is considerably more complex than the small deletion described previously, being a multiallelic CNV ranging between 1 and 11 copies per diploid genome with each repeat unit carrying a single SRCR domain.

Analysis of further samples from the CEPH-Human Genome Diversity Project (HGDP) panel of 971 individuals from 52 populations worldwide (28) showed rare individuals with a CNV2 copy number of zero. Sanger sequencing of PCR products from these individuals showed that all of the zero-copy CNV2 alleles had a breakpoint within 33 bp of sequence identical between SRCR8 and SRCR11 (Fig. S4), just upstream of the exon encoding the SRCR domain, suggesting that this allele was generated by NAHR between these repeats (SI Methods). This finding suggests that other larger CNV2 alleles have also been generated by NAHR between any of the repeats carrying SRCR domains 8–11.

DMBT1 Copy Number Variation Has a High Mutation Rate. The extensive allelic diversity and repetitive genomic structure of DMBT1, together with the knowledge that NAHR is likely to have mediated generation of new alleles, led us to consider whether CNV1 and CNV2 had a high mutation rate. To study this directly, we used our validated PRT assays to call copy number of DMBT1 CNV1 and CNV2 on 522 samples from 40 large multigenerational families from the Centre d’Etude de Polymorphisme Humain (CEPH) collection. We robustly identified de novo copy number mutations at both loci (Fig. S5 and Datasets S1 and S2). The mutation rate at CNV1 is estimated to be 1.4% per gamete (9 out of 632 meioses, 95% CI 0.7–2.7%) and the mutation rate at CNV2 is 3.3% per gamete (21 out of 632 meioses, 95% CI 2.1–5.0%). These mutation rate estimates place both loci among the most highly mutating loci known, with comparable rates seen only for noncoding minisatellites (29) or for the mitochondrial D-loop (30). Error rates for CNV1 and CNV2 of 0.37% and 0.33%, respectively, are below the lower 95% CI bound of both mutation rates, showing that these high mutation rates are not due to errors in copy number calling. Furthermore, examination of the copy number calls of the individuals showing de novo mutations indicates a high posterior probability of that copy number call (Fig. S5). All mutations were of a loss or gain of one CNV repeat unit, with no evidence of a bias toward loss or gain. Analysis of our data showed that two CNV1 mutational events and one CNV2 mutational event were associated with a crossover involving flanking marker exchange at the correct position in that individual. This observation suggests that although NAHR events involving homologous chromosomes do contribute to CNV mutation rate, most events are likely to be inter- or intrachromatidal.

Global Diversity of DMBT1 and Agriculture. To examine global diversity of DMBT1, we determined diploid copy number of CNV1 and CNV2 on the CEPH-HGDP panel (Dataset S2). We observed a similar range for CNV1 (0–5 copies per diploid genome) and for CNV2 (0–11 copies per diploid genome) as in the HapMap samples (Fig. S5). Although there was no linkage disequilibrium between copy number alleles at CNV1 and CNV2 (r² = 0 for CEU parents, r² = 0.01 for YRI parents, r² = 0.01 for all HGDP, Fig. 2A), there was a clear negative relationship between average copy number at CNV1 and CNV2 at the population level (r² = 0.11, Fig. 2B) and at the continental level (r² = 0.43). Because, across populations, the increase in CNV2 is mirrored in part by a decrease in CNV1, the total predicted number of the SRCR domains for the two copies of DMBT1 on homologous chromosomes does not mirror this trend. Nevertheless the total number of SRCR domains per diploid DMBT1 is highly variable, and the number of SRCR domains in a given DMBT1SAG molecule is predicted to range between 7 and 20, at least (Fig. S5G).

Given the fact that the SRCR domain is known to bind S. mutans, we considered that the development of agriculture and consequent increase in carbohydrate-rich foods, oral S. mutans and dental caries might be a selective pressure influencing the
frequency distributions of CNV1 and CNV2. To test this, we correlated CNV1 and CNV2 mean copy number for each HGDP population with an index of extent of agricultural practice for each population, as published (31). We used two statistical approaches. First, a regression analysis corrected for population effects by using distance from East Africa as a covariate. Secondly, a partial Mantel analysis corrected using a population pairwise geographical distance matrix, a population pairwise genetic distance matrix, or distance from East Africa as covariates (32). We found a negative relationship between agricultural populations and the mean CNV1 copy number and a positive relationship between agricultural populations and CNV2 (Table 1). By resampling from an empirical distribution of partial Mantel r correlations with agriculture, we estimated the genome-wide significance of this

Fig. 1. Analysis of DMBT1 structure and CNV. (A) Dotplot of the DMBT1 gene (exons and introns) aligned against itself. Lines indicate high similarity and emphasize the repeated nature of the structure of the gene. Individual SRCR domains are indicated and numbered. Note that the canonical DMBT1 gene sequence has one fewer SRCR domain than that predicted by the genome assembly, and the extra SRCR domain is labeled 9’. (B) Exon–intron structure of the DMBT1 gene with CNV signals. Three DMBT1 gene annotations derived from different transcripts are shown. CNV signals from the Database of Genome Variants (dgv.tcag.ca/dgv/app/home) are shown below, with red indicating loss of signal compared with a reference genome, blue gain of signal, and brown both loss and gain of signal observed in different samples. Note that these annotations are often larger than the actual CNV, because they can represent large insert clones that detect a CNV, but with the CNV boundaries unknown. CNV1 and CNV2 are annotated, with the reference genomic sequence showing one copy of the CNV1 region and four copies of the CNV2 region. Figure is based on a UCSC Genome Browser screenshot of the hg19 assembly. (C) Comparison of copy number calling methods for CNV1 (Left) and CNV2 (Right). Each point on the scatterplot represents an individual sample, with different symbols reflecting the final copy number call. The x axes indicate the copy number value estimated from paralog ratio tests (PRTs), and the y axes indicate the first principal component of probe intensity data for probes spanning the CNV in array comparative genomic hybridization (data from ref. 17). (D) Sequence relationship of SRCR repeats. A maximum-likelihood tree shows the relationship of the SRCR repeats (between 3 and 4 kb) carrying the SRCR-coding-domain exon. Scale bar indicates 0.1 substitutions per site. Amino acid sequences of the SRCR domains corresponding to the S. mutans and hydroxyapatite-binding regions are arranged alongside the tree, ordered according to the order of SRCR domains on the tree. Note that the divergent SRCR14 domain does not bind bacteria (54), is not coded by a repeated DNA region (A), and is located at the C-terminal end of the DMBT1SAG protein.
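
Panel C of Fig. 1 depends on turning unrounded PRT estimates into integer copy number calls. The snippet below is a minimal sketch of one simple way to do that: round to the nearest integer and refuse to call estimates that fall too far from any integer. It is not the authors' procedure, which relied on clustering and posterior probabilities. The function names, the `margin` of 0.3, and the example values are all hypothetical.

```python
from typing import Optional


def call_copy_number(estimate: float, max_copies: int, margin: float = 0.3) -> Optional[int]:
    """Return the nearest integer copy number, or None if the call is ambiguous.

    estimate   -- unrounded diploid copy number from a PRT (hypothetical input)
    max_copies -- largest copy number considered plausible for the locus
    margin     -- assumed maximum distance from an integer for a confident call
    """
    nearest = round(estimate)
    if nearest < 0 or nearest > max_copies:
        return None  # outside the plausible range for this locus
    if abs(estimate - nearest) > margin:
        return None  # too far from any integer to call confidently
    return nearest


def concordant_call(rep_a: float, rep_b: float, max_copies: int) -> Optional[int]:
    """Accept a call only when two replicate estimates give the same integer."""
    call_a = call_copy_number(rep_a, max_copies)
    call_b = call_copy_number(rep_b, max_copies)
    return call_a if call_a is not None and call_a == call_b else None


# Hypothetical duplicate estimates for a CNV1-like locus (0-5 copies per diploid genome).
print(concordant_call(2.1, 1.9, max_copies=5))  # both replicates round to 2
print(concordant_call(2.1, 2.6, max_copies=5))  # second replicate is ambiguous
```

A fixed margin is the crudest possible rule; a mixture model fitted to the observed clusters would give per-sample posterior probabilities of the kind the text mentions.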

observation to be P = 0.0467, using the geographic distance matrix as a covariate, and P = 0.0410 using the genetic distance matrix as a covariate. We also tested the association between mean copy number of a population and a subsistence strategy based on carbohydrate-rich foods, such as cereals, roots and tubers, as defined previously (Fig. 2C and ref. 33). We found an association for CNV1 and a weak association for CNV2, both in the expected directions (Table 1). This finding suggests that the subsistence history of a population has affected the frequency distribution of both CNVs within DMBT1.

Fig. 2. Distribution of DMBT1 copy number values in the CEPH-HGDP panel. (A) Across individuals. Each point represents the mean unrounded PRT copy number of an individual, with the histogram on each axis representing the distribution of CNV1 copy numbers (x axis) and CNV2 copy numbers (y axis). (B) Across populations. The means of CNV1 and CNV2 in each population are plotted, colored according to continent of origin. The red dashed lines represent the value above which 99.5% of mean copy numbers of simulated populations fall. (C) Average CNV1 and CNV2 copy number in agricultural and nonagricultural populations. Populations are colored according to region (legend in B), thick line indicates median value and thin lines are 25th and 75th centiles, and P values are from logistic regression, with distance from Africa as a covariate (Table 1). Agricultural population definition is according to ref. 33.

To investigate this association further, we called DMBT1 CNV1 and CNV2 copy number genotype using sequence read depth analysis on three published ancient DNA samples (refs. 34–36, Table S1, and Fig. S5 E and F). Both Denisova and Neanderthal hominins show a high CNV1 copy number of 3 and a low CNV2 copy number of 3, within the range of hunter–gatherer human populations. Analysis of an 8,000-y-old hunter–gatherer from Loschbour in Luxembourg provides a more recent directly ancestral calibration point. He had a CNV1 genotype of 1, which is common in modern Europeans, and a CNV2 genotype of 4, which is less common but still present in modern Europeans. This observation tentatively suggests that a reduction in CNV1 copy number had occurred or was occurring, but an increase in CNV2 was yet to occur; however, further samples are required before any firm conclusions can be drawn, as the observed genotype is consistent with a copy number distribution that is unchanged from modern Europeans.

To provide further support for natural selection in shaping frequencies of CNV1 and CNV2 copy number, we used forward simulations to model CNV1 and CNV2 in situations of population expansion but no selection. Using a mutation rate derived from our pedigree analysis as well as initial allele frequency distributions based on our largest hunter-gatherer population sample (Biaka), we simulated 1,000 populations at CNV1 and CNV2 using a stepwise mutation model and a realistic demographic model (SI Methods). The resulting distribution of mean copy number for both CNVs provides an empirical test of the departure of our observed populations from this neutral stepwise mutation model. We interpreted a departure from the model as evidence of natural selection occurring on the locus on ancestors of individuals in that population. Using an estimate of mutation rate of the lower 95% confidence limit for both CNV1 and CNV2, 99.5% of simulated populations had a CNV1 mean copy number above 2.4 and 99.5% of populations had a CNV2 mean copy number above 7.4 (Fig. 2B). This observation shows that CNV2 copy numbers are lower than expected given a neutral stepwise mutation model for all populations, probably reflecting selective constraints on DMBT1SAG protein length. For CNV1, however, six populations show mean copy number consistent with a neutral stepwise mutation model. Four of those populations are from Africa, one from East Asia and one from South America, yet all six have been classified as nonagricultural (33).

The increase in CNV2 copy number and decrease in CNV1 copy number might be due to selection for a particular phenotype more favored following the transition to agriculture. Our favored hypothesis is that a particular SRCR domain binds S. mutans or hydroxyapatite of the tooth more weakly, thereby reducing the likelihood of caries. Analysis of the S. mutans binding domain sequence shows that there is no difference between SRCR domains in CNV1 and CNV2 (Fig. 1D). However, when manually inspecting the human GRCh37/UCSC hg19 reference sequence but also 10 HGDP samples sequenced to high depth (ref. 34 and Table S1), the SRCR domains in CNV2 share a serine to tyrosine change which disrupts the hydroxyapatite-binding domain (37) and abolishes a strong potential mucin-type O-linked glycosylation site (Fig. 1D). Replacement of CNV1-type SRCR domains with CNV2-type SRCR domains has therefore allowed this tyrosine substitution to propagate rapidly through the DMBT1SAG molecule. This observation suggests that the transition to agriculture has been accompanied by a partial replacement of canonical SRCR domains with SRCR domains that either bind the tooth less strongly or, because glycosylation is important for binding, bind S. mutans less strongly.

Evolutionary Relationship Between DMBT1 and its Ligand in S. mutans. If the interaction between DMBT1SAG and S. mutans is coevolving, we might expect to see a relationship between variation of

Table 1. DMBT1 CNV and population subsistence strategy

Cereals, roots, tubers populations (as ref. 33)
Locus   Odds ratio (95% CI)*
CNV1    0.14 (0.02–0.84), P = 0.041
CNV2    2.00 (1.01–4.33), P = 0.057

Percentage of time spent on agriculture (as ref. 31)
Locus   Beta value from regression (95% CI)*    Partial Mantel r*    Partial Mantel r†    Partial Mantel r‡
CNV1    −0.010 (−0.017 to −0.003), P = 0.009    0.286, P = 0.018     0.263, P = 0.004     0.111, P = 0.102
CNV2    0.027 (0.010–0.045), P = 0.003          0.293, P = 0.003     0.283, P = 0.002     0.193, P = 0.030

*Distance from Africa as covariate.
†Pairwise geographical distance matrix as covariate.
‡Pairwise genetic distance matrix as covariate.
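
The partial Mantel correlations in Table 1 relate two pairwise distance matrices while controlling for a third. The sketch below shows the general technique in plain Python: a partial Pearson correlation on the upper triangles, with a one-sided P value from permuting the rows and columns of the first matrix. It is an illustration under stated assumptions, not the authors' analysis pipeline. All function names are hypothetical, and the six-population example values are invented for demonstration.

```python
import math
import random
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def distance_matrix(values: Sequence[float]) -> Matrix:
    """Pairwise absolute differences between per-population values."""
    return [[abs(a - b) for b in values] for a in values]


def upper_triangle(m: Matrix) -> List[float]:
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x: Sequence[float], y: Sequence[float], z: Sequence[float]) -> float:
    """Correlation of x and y after removing the linear effect of z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def permuted(m: Matrix, order: Sequence[int]) -> Matrix:
    """Apply the same permutation to the rows and columns of a matrix."""
    return [[m[i][j] for j in order] for i in order]


def partial_mantel(a: Matrix, b: Matrix, c: Matrix,
                   permutations: int = 999, seed: int = 0) -> Tuple[float, float]:
    """Return (partial Mantel r of a and b given c, one-sided permutation P)."""
    rng = random.Random(seed)
    y, z = upper_triangle(b), upper_triangle(c)
    observed = partial_r(upper_triangle(a), y, z)
    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(len(a)))
    for _ in range(permutations):
        rng.shuffle(order)
        if partial_r(upper_triangle(permuted(a, order)), y, z) >= observed:
            hits += 1
    return observed, hits / (permutations + 1)


# Invented values for six populations: mean copy number, agriculture index, distance.
copy_dist = distance_matrix([2.1, 2.4, 1.8, 1.5, 1.6, 2.3])
agri_dist = distance_matrix([10, 0, 60, 90, 80, 5])
geo_dist = distance_matrix([1.0, 2.5, 5.0, 8.0, 9.5, 12.0])
r, p = partial_mantel(copy_dist, agri_dist, geo_dist, permutations=199, seed=1)
print(f"partial Mantel r = {r:.3f}, one-sided P = {p:.3f}")
```

Permuting whole rows and columns together, rather than individual cells, keeps each population's distances intact, which is what makes the test valid for non-independent pairwise data.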

S. mutans and CNV1 and CNV2 of DMBT1 across different individuals. Adaptation of S. mutans to the DMBT1SAG phenotype of different mouths reflects a more recent evolutionary time scale than adaptation of DMBT1 in humans. However, given that S. mutans colonizes the mouth in early childhood (38) and the doubling time of biofilm-attached S. mutans is in the order of hours (39), it is likely that it can adapt genetically to the oral environment. We genotyped 125 adult individuals resident in Leicester, U.K., 92 of European origin, for CNV1 and CNV2, and sequenced part of the S. mutans gene spaP from DNA isolated from matched saliva. spaP encodes AgI/II, which is the ligand for human DMBT1SAG (40). We focused on a 1-kb region of the spaP gene encoding 336 amino acids from the C-terminal region known to contain two binding domains for human DMBT1SAG, namely Ad1 and Ad2 (41). For 98 of our cohort (78%), only one S. mutans strain (as defined by homozygosity of the sequenced region) was found. This observation suggests very low within-mouth diversity of S. mutans, and that most people are colonized by only one strain. However, alignment of sequences showed 136 single-nucleotide polymorphisms, 82 of which altered amino acid sequence, reflecting very high levels of diversity between individuals (Fig. S6).

A difference between the allele frequency spectra (AFS) of nonsynonymous and synonymous polymorphisms can indicate selection, if synonymous polymorphisms are assumed to be neutral. Comparison of the AFS of the S. mutans spaP gene shows a difference in both the total and European-only cohort (P = 0.015 and P = 0.018, respectively), with an enrichment of polymorphisms with rare alleles (minor allele frequency < 1%, P = 0.003 for total cohort, P = 0.022 for European-only cohort, Fig. S6 E and F). This result indicates that weak negative selection is acting on spaP, consistent with previous genome-wide approaches (4).

If a particular spaP allele was adapted to a particular DMBT1SAG phenotype, we might expect the spaP allele and the DMBT1 genotype to be associated across a number of individuals. Across our cohort, we identified 13 polymorphisms that changed amino acid sequence in spaP Ad1 or Ad2. Given the low derived allele frequencies of these polymorphisms, we had limited power to detect an association with CNV1 or CNV2 copy number. However, two of these polymorphisms affect an amino acid highly conserved across oral Streptococci (Fig. S6 G and H), and in one the derived allele (A1090D, PDB P11657) was associated with lower copy number at CNV1 (nominal P = 0.0416) and CNV2 (nominal P = 0.0169) in the European-origin cohort, which leads to an overall association with low DMBT1 copy number (P = 0.046, corrected for multiple comparisons). Only CNV2 remains associated with low copy number (nominal P = 0.0196) in the full cohort, suggesting an effect of ethnicity on this interaction.

Discussion

We have shown that DMBT1 shows extensive CNV, with two distinct regions (termed CNV1 and CNV2) showing extensive copy number polymorphism across a wide range of populations. This polymorphism is predicted to underlie the variable number of tandemly repeated SRCR domains of the DMBT1SAG protein observed in different individuals. These SRCR domains have both S. mutans- and hydroxyapatite-binding activities (13, 22, 26, 42). Direct observation of de novo mutations in pedigrees shows that both CNV1 and CNV2 have exceptionally high mutation rates, 1.4% and 3.3% per gamete per generation, respectively; to our knowledge the fastest mutation rates affecting coding sequence yet described in humans. Analysis of breakpoints suggests that, at least for CNV2, NAHR drives the mutation process and that most, but not all, NAHR events are inter- or intrachromatidal, rather than between homologous chromosomes. Such a bias has also been observed at the tandemly repeated DEFA1A3 locus (43) and at alpha-satellite DNA (44), suggesting a shared mechanism.

Populations which practice agriculture generally show a low copy number of CNV1 and high copy number of CNV2, distinct from hunter-gatherer populations and ancient hominins. This pattern of CNV increases the number of copies of a particular type of SRCR domain containing an amino acid change predicted to disrupt binding to hydroxyapatite in the tooth and to abolish a mucin-like O-glycosylation site. Given that cariogenic S. mutans became prevalent after the development of agriculture, our data suggest that DMBT1SAG has evolved to modulate its binding to the tooth surface or S. mutans (or both) by rapidly mutating SRCR domain units carrying the appropriate binding motifs. This scenario presupposes that caries was an agent of natural selection before the development of modern dentistry (8, 45). We think that this is not unlikely, given the known acute consequences of caries, such as the increased risk of abscess (46), and chronic consequences, such as increased difficulty eating, particularly in children, and reduced weight/height gain (47, 48). Nevertheless, hypotheses about the agents of evolutionary change in humans are very difficult to prove, and we note that, given DMBT1SAG protein is expressed on other mucosal epithelia and interacts with other microbes, other evolutionary scenarios are possible, such as adaptation to an altered microbiome of the gut. Indeed, CNV1 of DMBT1 corresponds to a previously described deletion (26), with zero copies reflecting a homozygous deletion and one copy reflecting a heterozygous deletion. This deletion has previously been associated, in a small case-control study, with increased susceptibility to Crohn’s disease (25), an intestinal inflammatory disease. If this association is confirmed, this would represent an interesting case of pathogen/culture-driven selection increasing the allele frequency of an autoimmune susceptibility allele.

We also investigated variation of S. mutans in the context of DMBT1SAG CNV. The overall pattern of variation of the DMBT1SAG-binding region of AgI/II is that of weak negative selection across the population, where new amino acid changes are selected against when transferred from host to host. Our data also support the lack of geographical structure of S. mutans, as our sampling of a restricted population effectively captures the global diversity of sequences analyzed elsewhere, at least for this particular region, and again argues for weak negative selection and background selection being the dominant force shaping diversity in S. mutans. There is weak evidence that there is a relationship between sequence variation at AgI/II and CNV at DMBT1. As minor alleles at AgI/II are generally rare, a much larger cohort and functional analysis are needed to tease apart the natural variation modulating this interaction.

Our results now provide a framework for understanding the full nature and functional effect of sequence variation at this locus, which will have an unclear linkage disequilibrium relationship with neighboring SNPs. One study has highlighted DMBT1 to be a
Polley et al. PNAS | April 21, 2015 | vol. 112 | no. 16 | 5109 Downloaded by guest on September 28, 2021 strong candidate for ancient balancing selection in the genome (49). for test and reference loci to minimize amplification bias (27, 51, 52). PRTs Another recent study has identified the region upstream of DMBT1 were validated both by examining assay concordance and by testing a subset to show unusually negative Tajima’s D (in the fifth percentile ge- of samples using long PCR, array CGH, fiber-FISH, and next-generation nome wide) in Europeans, supporting our model of selection (50). sequencing (Tables S2 and S3). However, the effect on SNP diversity of a rapidly mutating CNV S. mutans sequences were derived by Sanger sequencing following PCR undergoing fluctuating geographically structured selection remains amplification of a region of the spaP gene directly from DNA insulated from unclear. The relationship between the CNV we describe here and human mouthwash samples. Population simulations were conducted using diseases should also be studied further, in particular those diseases simuPOP (53). Correlation of population subsistence with mean population with an infectious or immune component to their etiology. Taken copy number was performed as previously described (31, 33). further we hope that the rapidly mutating DMBT1 gene will become a paradigm of host-pathogen evolutionary study leading to impor- ACKNOWLEDGMENTS. We thank Gurdeep Lall, Jenny Bowdrey, and Seijal tant insights in understanding the process of caries formation, and Patel for help and support and Rita Neumann, Alec Jeffreys, Mark Jobling, other host-microbe interactions, in humans. and Jan Mollenhauer for DNA samples. This work was funded by a Govern- ment of India Ministry of Social Justice and Empowerment PhD studentship Methods (to S.P. and E.J.H.). E.J.H. was supported in part by a Medical Research Council New Investigator Grant (GO801123). S.L. and F.Y. 
were supported Full details on the methods used, and details of the samples, are described in by the Wellcome Trust (WT098051). D.S.H. was supported by NIH Grant SI Methods. CNV was characterized and typed using multiple paralog ratio RC4DK090937-01. This research used the ALICE and SPECTRE High Perfor- tests (PRT), a form of quantitative PCR that uses the same primer sequences mance Computing Facilities at the University of Leicester.
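The paralog ratio test summarized in Methods estimates copy number from the ratio of test-locus to reference-locus product amplified with a single primer pair. The sketch below illustrates that idea only and is not the authors' pipeline. It assumes the ratio is calibrated by a least-squares line fitted to control samples of known copy number; the function names, signal values, and control data are all hypothetical.

```python
def fit_calibration(ratios, known_copies):
    """Fit copy_number = slope * ratio + intercept by ordinary least squares."""
    n = len(ratios)
    if n != len(known_copies) or n < 2:
        raise ValueError("need at least two paired calibration points")
    mean_x = sum(ratios) / n
    mean_y = sum(known_copies) / n
    sxx = sum((x - mean_x) ** 2 for x in ratios)
    if sxx == 0:
        raise ValueError("calibration ratios must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ratios, known_copies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_copy_number(test_signal, reference_signal, slope, intercept):
    """Return (raw estimate, nearest non-negative integer call) for one sample."""
    if reference_signal <= 0:
        raise ValueError("reference signal must be positive")
    ratio = test_signal / reference_signal  # test product relative to reference
    raw = slope * ratio + intercept
    return raw, max(0, round(raw))


# Hypothetical control samples: observed test/reference ratio and known copy number.
CONTROL_RATIOS = [0.5, 1.0, 1.5, 2.0]
CONTROL_COPIES = [1, 2, 3, 4]

slope, intercept = fit_calibration(CONTROL_RATIOS, CONTROL_COPIES)
raw, call = estimate_copy_number(1260.0, 1000.0, slope, intercept)
print(f"raw estimate {raw:.2f}, integer call {call}")
```

Returning both the raw value and the integer call makes it easy to spot samples that fall midway between two copy-number classes. Such samples are natural candidates for the kind of cross-validation with independent methods that Methods describes.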

1. Jobling M, Hollox E, Hurles M, Kivisild T, Tyler-Smith C (2013) Human Evolutionary Genetics (Garland Science, New York).
2. Gerbault P, et al. (2011) Evolution of lactase persistence: An example of human niche construction. Philos Trans R Soc Lond B Biol Sci 366(1566):863–877.
3. Adler CJ, et al. (2013) Sequencing ancient calcified dental plaque shows changes in oral microbiota with dietary shifts of the Neolithic and Industrial revolutions. Nat Genet 45(4):450–455, e1.
4. Cornejo OE, et al. (2013) Evolutionary and population genomics of the cavity causing bacteria Streptococcus mutans. Mol Biol Evol 30(4):881–893.
5. Larsen CS (1995) Biological changes in human populations with agriculture. Annu Rev Anthropol 24:185–213.
6. Cohen MN, Crane-Kramer GMM (2007) Ancient Health: Skeletal Indicators of Agricultural and Economic Intensification (University Press of Florida, Gainesville).
7. Lukacs JR (2008) Fertility and agriculture accentuate sex differences in dental caries rates. Curr Anthropol 49(5):901–914.
8. Clarke JH (1999) Toothaches and death. J Hist Dent 47(1):11–13.
9. Sheiham A (2006) Dental caries affects body weight, growth and quality of life in pre-school children. Br Dent J 201(10):625–626.
10. Madsen J, Mollenhauer J, Holmskov U (2010) Review: Gp-340/DMBT1 in mucosal innate immunity. Innate Immun 16(3):160–167.
11. Sonesson M, Ericson D, Kinnby B, Wickström C (2011) Glycoprotein 340 and sialic acid in minor-gland and whole saliva of children, adolescents, and adults. Eur J Oral Sci 119(6):435–440.
12. Ligtenberg AJ, Karlsson NG, Veerman EC (2010) Deleted in malignant brain tumors-1 protein (DMBT1): A pattern recognition receptor with multiple binding sites. Int J Mol Sci 11(12):5212–5233.
13. Esberg A, Löfgren-Burström A, Öhman U, Strömberg N (2012) Host and bacterial phenotype variation in adhesion of Streptococcus mutans to matched human hosts. Infect Immun 80(11):3869–3879.
14. Jonasson A, et al. (2007) Innate immunity glycoprotein gp-340 variants may modulate human susceptibility to dental caries. BMC Infect Dis 7(1):57.
15. Wain LV, Armour JAL, Tobin MD (2009) Genomic copy number variation, human health, and disease. Lancet 374(9686):340–350.
16. Zhang F, Gu W, Hurles ME, Lupski JR (2009) Copy number variation in human health, disease, and evolution. Annu Rev Genomics Hum Genet 10:451–481.
17. Conrad DF, et al.; Wellcome Trust Case Control Consortium (2010) Origins and functional impact of copy number variation in the human genome. Nature 464(7289):704–712.
18. Long M, VanKuren NW, Chen S, Vibranovski MD (2013) New gene evolution: Little did we know. Annu Rev Genet 47:307–333.
19. Hollox EJ, Hoh B-P (2014) Human gene copy number variation and infectious disease. Hum Genet 133(10):1217–1233.
20. Perry GH, et al. (2007) Diet and the evolution of human amylase gene copy number variation. Nat Genet 39(10):1256–1260.
21. Bikker FJ, et al. (2002) Identification of the bacteria-binding peptide domain on salivary agglutinin (gp-340/DMBT1), a member of the scavenger receptor cysteine-rich superfamily. J Biol Chem 277(35):32109–32115.
22. Kishimoto E, Hay DI, Gibbons RJ (1989) A human salivary protein which promotes adhesion of Streptococcus mutans serotype c strains to hydroxyapatite. Infect Immun 57(12):3702–3707.
23. Lamont RJ, Demuth DR, Davis CA, Malamud D, Rosan B (1991) Salivary-agglutinin-mediated adherence of Streptococcus mutans to early plaque bacteria. Infect Immun 59(10):3446–3450.
24. Loimaranta V, et al. (2005) Fluid- or surface-phase human salivary scavenger protein gp340 exposes different bacterial recognition properties. Infect Immun 73(4):2245–2252.
25. Renner M, et al. (2007) DMBT1 confers mucosal protection in vivo and a deletion variant is associated with Crohn's disease. Gastroenterology 133(5):1499–1509.
26. Sasaki H, Betensky RA, Cairncross JG, Louis DN (2002) DMBT1 polymorphisms: Relationship to malignant glioma tumorigenesis. Cancer Res 62(6):1790–1796.
27. Armour JAL, et al. (2007) Accurate, high-throughput typing of copy number variation using paralogue ratios from dispersed repeats. Nucleic Acids Res 35(3):e19–e19.
28. Cann HM, et al. (2002) A human genome diversity cell line panel. Science 296(5566):261–262.
29. Jeffreys AJ, Royle NJ, Wilson V, Wong Z (1988) Spontaneous mutation rates to new length alleles at tandem-repetitive hypervariable loci in human DNA. Nature 332(6161):278–281.
30. Howell N, et al. (2003) The pedigree rate of sequence divergence in the human mitochondrial genome: There is a difference between phylogenetic and pedigree rates. Am J Hum Genet 72(3):659–670.
31. Fumagalli M, et al. (2011) Signatures of environmental genetic adaptation pinpoint pathogens as the main selective pressure through human evolution. PLoS Genet 7(11):e1002355.
32. Pozzoli U, et al. (2010) The role of protozoa-driven selection in shaping human genetic variability. Trends Genet 26(3):95–99.
33. Hancock AM, et al. (2010) Colloquium paper: Human adaptations to diet, subsistence, and ecoregion are due to subtle shifts in allele frequency. Proc Natl Acad Sci USA 107(Suppl 2):8924–8930.
34. Meyer M, et al. (2012) A high-coverage genome sequence from an archaic Denisovan individual. Science 338(6104):222–226.
35. Prüfer K, et al. (2014) The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505(7481):43–49.
36. Lazaridis I, et al. (2014) Ancient human genomes suggest three ancestral populations for present-day Europeans. Nature 513(7518):409–413.
37. Bikker FJ, Cukkemane N, Nazmi K, Veerman EC (2013) Identification of the hydroxyapatite-binding domain of salivary agglutinin. Eur J Oral Sci 121(1):7–12.
38. Wan AK, et al. (2003) A longitudinal study of Streptococcus mutans colonization in infants after tooth eruption. J Dent Res 82(7):504–508.
39. Liu J, Ling JQ, Zhang K, Wu CD (2013) Physiological properties of Streptococcus mutans UA159 biofilm-detached cells. FEMS Microbiol Lett 340(1):11–18.
40. Kelly C, et al. (1989) Sequence analysis of the cloned streptococcal cell surface antigen I/II. FEBS Lett 258(1):127–132.
41. Kelly CG, et al. (1999) A synthetic peptide adhesion epitope as a novel antimicrobial agent. Nat Biotechnol 17(1):42–47.
42. Mollenhauer J, et al. (1997) DMBT1, a new member of the SRCR superfamily, on chromosome 10q25.3-26.1 is deleted in malignant brain tumours. Nat Genet 17(1):32–39.
43. Black HA, Khan FF, Tyson J, Armour JAL (2014) Inferring mechanisms of copy number change from haplotype structures at the human DEFA1A3 locus. BMC Genomics 15:614.
44. Warburton PE, Willard HF (1995) Interhomologue sequence variation of alpha satellite DNA from human chromosome 17: Evidence for concerted evolution along haplotypic lineages. J Mol Evol 41(6):1006–1015.
45. Caselitz P (1998) Caries - ancient plague of humankind. Dental Anthropology: Fundamentals, Limits and Prospects, eds Alt KW, Rosing FW, Teschler-Nicola M (Springer, Vienna), pp 203–226.
46. Gendron R, Grenier D, Maheu-Robert L (2000) The oral cavity as a reservoir of bacterial pathogens for focal infections. Microbes Infect 2(8):897–906.
47. Schroth RJ, Harrison RL, Moffatt ME (2009) Oral health of indigenous children and the influence of early childhood caries on childhood health and well-being. Pediatr Clin North Am 56(6):1481–1499.
48. Alkarimi HA, Watt RG, Pikhart H, Sheiham A, Tsakos G (2014) Dental caries and growth in school-age children. Pediatrics 133(3):e616–e623.
49. DeGiorgio M, Lohmueller KE, Nielsen R (2014) A model-based approach for identifying signatures of ancient balancing selection in genetic data. PLoS Genet 10(8):e1004561.
50. Lin Y-L, Pavlidis P, Karakoc E, Ajay J, Gokcumen O (2015) The evolution and functional impact of human deletion variants shared with archaic hominin genomes. Mol Biol Evol, 10.1093/molbev/msu405.
51. Cantsilieris S, Western PS, Baird PN, White SJ (2014) Technical considerations for genotyping multi-allelic copy number variation (CNV), in regions of segmental duplication. BMC Genomics 15(1):329.
52. Veal CD, et al. (2013) Automated design of paralogue ratio test assays for the accurate and rapid typing of copy number variation. Bioinformatics 29(16):1997–2003.
53. Peng B, Kimmel M (2005) simuPOP: A forward-time population genetics simulation environment. Bioinformatics 21(18):3686–3687.
54. Bikker FJ, et al. (2004) Bacteria binding by DMBT1/SAG/gp-340 is confined to the VEVLXXXXW motif in its scavenger receptor cysteine-rich domains. J Biol Chem 279(46):47699–47703.

5110 | www.pnas.org/cgi/doi/10.1073/pnas.1416531112 Polley et al.