Journal of Human Evolution 79 (2015) 64e72
Contents lists available at ScienceDirect
Journal of Human Evolution
journal homepage: www.elsevier.com/locate/jhevol
The utility of ancient human DNA for improving allele age estimates, with implications for demographic models and tests of natural selection
* Aaron J. Sams a, b, , John Hawks a, 1, Alon Keinan b, 1 a Department of Anthropology, University of Wisconsin-Madison, Madison, WI, USA b Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, USA article info abstract
Article history: The age of polymorphic alleles in humans is often estimated from population genetic patterns in extant Received 14 January 2014 human populations, such as allele frequencies, linkage disequilibrium, and rate of mutations. Ancient Accepted 22 October 2014 DNA can improve the accuracy of such estimates, as well as facilitate testing the validity of demographic Available online 15 November 2014 models underlying many population genetic methods. Specifically, the presence of an allele in a genome derived from an ancient sample testifies that the allele is at least as old as that sample. In this study, we Keywords: € consider a common method for estimating allele age based on allele frequency as applied to variants Otzi from the US National Institutes of Health (NIH) Heart, Lung, and Blood Institute (NHLBI) Exome Neandertals Sequencing Project. We view these estimates in the context of the presence or absence of each allele in Europe € the genomes of the 5300 year old Tyrolean Iceman, Otzi, and of the 50,000 year old Altai Neandertal. Our results illuminate the accuracy of these estimates and their sensitivity to demographic events that were not included in the model underlying age estimation. Specifically, allele presence in the Iceman genome provides a good fit of allele age estimates to the expectation based on the age of that specimen. The equivalent based on the Neandertal genome leads to a poorer fit. This is likely due in part to the older age of the Neandertal and the older time of the split between modern humans and Neandertals, but also due to gene flow from Neandertals to modern humans not being considered in the underlying demographic model. Thus, the incorporation of ancient DNA can improve allele age estimation, demographic modeling, and tests of natural selection. Our results also point to the importance of considering a more diverse set of ancient samples for understanding the geographic and temporal range of individual alleles. © 2014 Elsevier Ltd. All rights reserved.
Introduction timing of natural selection that has shaped present-day human populations. Herein, we illustrate the importance of aDNA for The quality of ancient human DNA (aDNA) sequencing has addressing questions of allele age and timing of selection by first steadily improved during the past decade, culminating recently in briefly reviewing related literature, and by analyzing allele age in the high-coverage sequencing of two archaic hominin genomes the context of two ancient genomes from different time periods from human skeletal remains pre-dating 40,000 years ago (Meyer and different phylogenetic distance to present-day Europeans. By et al., 2012; Prüfer et al., 2014). Additionally, Meyer et al. (2013) comparing with allele age estimates based on allele frequency in an recently sequenced a mitochondrial genome from the Middle extant population, we conclude that consideration of whether an Pleistocene site of Sima de los Huesos in Spain, pushing the oldest allele is present or absent in aDNA can provide important infor- sequenced human DNA beyond 300,000 years. In addition to mation about allele age and improve on the former type of providing information about human demographic history, the estimates. growing sample of aDNA is useful for understanding the age of mutations that segregate in extant populations and, therefore, the Presence of recently selected alleles in ancient European specimens
A straightforward means of testing selective hypotheses, e.g., * Corresponding author. E-mail address: [email protected] (A.J. Sams). based on long-range haplotype methods (Sabeti et al., 2002; Voight 1 Co-project leaders. et al., 2006), is to analyze putatively selected haplotypes and http://dx.doi.org/10.1016/j.jhevol.2014.10.009 0047-2484/© 2014 Elsevier Ltd. All rights reserved. A.J. Sams et al. / Journal of Human Evolution 79 (2015) 64e72 65 single-nucleotide variants (SNVs) in ancient human specimens. agricultural lifestyles from the Near East. The earlier and more Perhaps the most successful application to date of aDNA to a po- popular viewpoint argues that this transition is characterized by a tential case of recent natural selection pertains to the genetic vari- large amount of demic diffusion involving a large influx of agricul- ants underlying lactase persistence. The ability to digest lactose, a tural populations across Europe (Childe, 1958). Others view this disaccharide found in milk, typically slows in most mammals around transition as being dominated by cultural diffusion, such that the time of weaning. In many humans the ability to produce the Mesolithic hunter-gatherers in Europe embraced Neolithic life- enzyme lactase, which digests lactose into glucose and galactose styles with little genetic input from Near Eastern farmers (Zvelebil sugars, continues into adulthood, a phenotype known as ‘lactase and Dolukhanov, 1991). persistence’. The timing and evolution of lactase persistence was Considering the two viewpoints, using phylogeographic ana- documented first in Europeans, where the genomic region sur- lyses of present human DNA samples has led to contrasting results rounding LCT, the gene encoding lactase, has been consistently re- (Sokal and Menozzi, 1982; Ammerman and Cavalli-Sforza, 1984; ported as one of the most extreme long-range haplotype based Sokal et al., 1991; Puit et al., 1994; Richards et al., 2000; Rosser et al., examples of recent selection in Europe (Enattah et al., 2002, 2007, 2000; Simoni et al., 2000; Belle et al., 2006; Balaresque et al., 2010). 2008; Bersaglieri et al., 2004; Tishkoff et al., 2006). The selection It is important to note that estimates based solely on phylogeo- has been shown to be in the nearby MCM6 gene and results in graphic analysis of genetic variation in extant humans can be biased downregulation of the cessation of lactase production after weaning by several factors. Importantly, recent demographic events, such as (Enattah et al., 2002). Similar disruptive changes to the MCM6 gene gene-flow and back-migration between Europe and the Near East have convergently evolved in both African (Tishkoff et al., 2006) and can result in the geography of extant human populations mis- Middle Eastern (Enattah et al., 2007) populations. Selection for representing that of their perceived ancestors (Richards et al., lactase persistence shows the importance of comparing genetic data 2000; Balaresque et al., 2010). Additionally, when considering a to known cultural changes in the past, such as the timing and limited number of genetic loci, clines of genetic variation can be geographic distribution of cattle and camel pastoralism and milk sensitive to stochastic processes such as isolation by distance consumption (Holden and Mace,1997; Gerbault et al., 2009). The age (Novembre and Stephens, 2008). of the mutation and subsequent beginning of the selective sweep Ancient DNA is an ideal data source to circumvent these issues underlying lactase persistence in Europeans (C/T-13910) has been and settle this debate because genetic differences between skeletal estimated between 3000 and 12,000 years, which seems to coincide samples associated with Mesolithic and Neolithic technologies can with the presence of domesticated cattle (Bollongino et al., 2006) be directly assayed and correlated with cultural differences. A and a record of increasing pastoralism and dairying in several hu- model of cultural diffusion predicts little to no major genetic dif- man populations, particularly in northern Europe. For example, ferences associated with culture change in Holocene samples, Tishkoff and colleagues (2006) estimated the age, using a coalescent while the demic diffusion model predicts substantial influx of novel simulation model that incorporated selection and recombination, at genetic variation. The majority of aDNA studies of Europe have approximately 8000 to 9000 years depending on the degree of involved mtDNA analysis, primarily from hunter-gatherer and dominance assumed for the allele. While consistent with the farming populations of north and central Europe. These studies anthropological record, the confidence intervals spanning 2000 to have revealed that approximately ~83% of pre-Neolithic peoples of 19,000 years points to the large uncertainty in the estimates. This is Europe carried mtDNA haplogroup U and none belong to hap- consistent with the large range of variation in coalescence times logroup H, a composition that is markedly different from present (Slatkin and Rannala, 2000). samples in which haplogroup H is dominant (Haak et al., 2005, Estimates of allele age and timing of selection based on popu- 2008, 2010; Bramanti, 2008; Bramanti et al., 2009; Guba et al., lation genetic patterns observed in extant humans depend heavily 2011; Fu et al., 2012; Lee et al., 2012; Nikitin et al., 2012). On the on assumptions about the demographic history of human pop- other hand, ~12% from early farming populations belong to hap- ulations and are often associated with large ranges of error (as logroup U, while haplogroup H is present in between 25 and 37% of illustrated above for the timing of C/T-13910). By testing whether mtDNA from early farming samples, both consistent with the specific genetic variants were absent or present in an ancient haplotype frequencies of extant Europeans. Combined, these re- sample, aDNA can be used to test hypotheses about the timing of sults from ancient mtDNA analyses suggest that pre-Neolithic selective changes in past human populations (Burger et al., 2007; hunter-gatherers contributed at most 20% to the mtDNA genetic Malmstrom€ et al., 2010; Plantinga et al., 2012). This can lead to composition of present European populations (Fu et al., 2012). much more precise time estimates, though these depend on the These ancient mtDNA results were recently combined with a ability to accurately date ancient skeletal materials. For example, dataset of 1151 complete mtDNAs from across Europe (Fu et al., the derived allele (-13910*T) that underlies lactase persistence in 2012). Fu and colleagues (2012) found evidence for a population Northern Europeans was found in only one copy out of 20 (~5%) in a expansion between 15,000 and 10,000 years ago in mtDNA typical 5000 year old skeletal sample from Sweden (Malmstrom€ et al., of pre-Neolithic hunter-gatherers and a subsequent contraction of 2010), in ~27% of a sample of 26 Basque individuals dating be- these haplotypes between 10,000 and 5000 years ago, consistent tween 4500 and 5000 years ago (Plantinga et al., 2012), and was with the expansion of mtDNA from agricultural populations from completely absent from a skeletal sample of nine individuals from the Near East. eastern Europe dating between 5000 and 5800 years ago (Burger In addition to ancient mtDNA data, results from whole genome et al., 2007). sequencing of ancient Holocene-aged individuals have been applied to the question of Neolithic population replacement in Holocene demography of Europe and ancient DNA Europe. Within the past two years, substantial amounts of auto- somal DNA have been reported for seven ancient individuals in € Archaeological evidence suggests that the transition from a Europe. These include Otzi (the Tyrolean Iceman) dating to roughly hunting and gathering lifestyle to a more sedentary agricultural 5300 years ago (Keller et al., 2012), three hunter-gatherers associ- ‘Neolithic’ lifestyle, which began in the Near East by 10,000 years ated with the Pitted Ware culture and one farmer associated with ago, spread across Europe between 8000 and 4000 years ago (Price, the Funnel Beaker culture from Scandinavia dating to approxi- 2000). Archaeological and genetic evidence has traditionally been mately 5000 years ago (Skoglund et al., 2012), and two 7000 year divided between two viewpoints regarding the spread of old Iberian hunter-gatherers (Sanchez-Quinto et al., 2012). 66 A.J. Sams et al. / Journal of Human Evolution 79 (2015) 64e72
Genome-wide analysis of the Iceman's genome revealed that of (Rannala and Reeve, 2001), or shared haplotypes (Genin et al., extant populations, the Iceman appears most closely related to 2004). We chose to utilize a set of allele age estimates generated Sardinians, which has been argued to reflect the common ancestry from allele frequencies (using the approach developed by Griffiths between the Sardinian and Alpine populations of ~5000 years ago and Tavare, 1998) as part of the US National Institutes of Health (Keller et al., 2012). The study of the four ancient Scandinavians (NIH) Heart, Lung, and Blood Institute (NHLBI)-sponsored Exome surprisingly revealed that the hunter-gatherers of this sample are Sequencing Project (ESP). We focus specifically on the set of allele more similar to (though still distinct from) Northern Europeans, ages for approximately 1.15 million exonic SNVs estimated from a while the farmer is more similar to present-day Southern Euro- sample of 4298 European Americans (Fu et al., 2013). peans, with a high degree of genetic similarity to present-day in- We considered two whole-genomes of skeletal samples with € dividuals in Greece and Cyprus. Finally, analysis of the two Iberian known geographic and temporal provenience, namely Otzi, the individuals showed no close relationship between these individuals Tyrolean Iceman, estimated to be approximately 5300 years old and present-day populations in Iberia or southern Europe. The (Keller et al., 2012) and an approximately 50,000 year old Nean- general picture emerging from analysis of aDNA is that the spread dertal from the Altai Mountains of Siberia (Prüfer et al., 2014). The of farming technology during the Neolithic was associated with accuracy of population genetic estimates of derived allele age de- significant population movements from the Near East across pends both on the details of the method applied for their estima- Europe. This population replacement should be taken under tion, as well as on selection operating on the allele and the global consideration in utilizing data from aDNA to address questions effects of demography. For example, an allele has been previously € about recent natural selection in Europe, particularly in studies of a reported (Sams and Hawks, 2013) that is present in Otzi but esti- single locus. mated to be younger than 5000 using an LD-based method. How- ever, we hypothesize that derived allele ages estimated from the Ancient DNA, demography, and selection extant European American population will be much older on average when they are present in an ancient genome, compared To what extent should the evidence of demographic population with derived alleles that are not observed in the ancient individual. replacement affect the application of aDNA in testing hypotheses of Empirical testing and quantification of this hypothesis can weigh recent natural selection? As a hypothetical example, consider the into how commonly-used methods for timing selection and age of analysis by Malmstrom€ et al. (2010), which revealed that the mutations can be biased by demographic factors (such as popula- -13910*T lactase persistence allele was relatively rare (5%) in a tion replacements, bottlenecks, and population growth) as well as middle Neolithic sample of hunter-gatherers associated with the by other factors, and how accurate they are when considering Pitted Ware culture. This result seemingly strengthens the argu- variability across loci in coalescence times, mutation and recom- ment for the recent origin and rapid increase in frequency of the bination rates, etc. lactase persistence allele in populations of Europe. However, given that these mid-Neolithic hunter-gatherer populations may not be Materials and methods closely related to present day populations of northern Europe, the rapid shift in allele frequency may reflect the demographic shift Allele-ages rather than natural selection. In light of this, it is important that applications of aDNA to questions related to selection consider a We extracted text-based (.txt) Exome-Sequencing Project data wider geographic range of aDNA samples. For example, the fre- from the NHLBI Exome Sequencing Project's Exome Variant Server quency of the -13910*T allele in ancient populations of the Near (http://evs.gs.washington.edu/EVS/, accessed December 2013). We East can shed important additional light on the timing and geog- filtered the data to exclude duplicate SNVs, SNVs representing raphy of selection for lactase persistence. transition polymorphisms, and any SNVs that did not include an allele age estimate for the European sub-sample of the ESP. Estimating allele age The ESP data files do not include the project's calls for the ancestral state used to determine derived allele-age. Therefore, we One of the exciting outcomes of the development of aDNA used our own inferred ancestral state (described below) to polarize technologies is the use of ancient samples to improve allele age the frequency at each resulting SNV. During this process, we estimation from modern samples, as recently demonstrated by noticed that a small proportion (~0.04%) of SNVs had listed ages Malaspinas et al. (2012). To examine uncertaintiesddue to de- clearly corresponding to the frequency of our inferred ancestral mographic eventsdassociated with using aDNA to assess allele age, allele, which likely results from differences in calling the ancestral be it a neutral allele or one under recent selection, we aim to first state. To alleviate this problem, we removed these miscalled sites determine how closely estimates of allele age from extant pop- from our analysis. In total, we analyzed a sample of 119,421 poly- ulations conform to ancient specimens with known geographic and morphic sites. temporal provenience. Lactase persistence provides one example where the absence of the derived allele in ancient samples is Inference of ancestral state consistent with a recent mutation. However, discovery of the allele in additional ancient populations could potentially push back the We inferred ancestral state by alignment of the human reference timing of the allele. We focus here on the information that can be (hg19) to three primates [chimpanzee (pantro2), orangutan gleaned from the presence or absence of derived alleles in ancient (ponabe1), and rhesus macaque (rhemac2)]. Alignment gaps and samples on the genome-wide distribution of allele age. We further non-syntenic regions were removed. Ancestral state was called by compare the information gleaned from aDNA with estimates of the chimpanzee allele as long as that allele was also present in a allele age based on extant population genetic information, which second primate outgroup. can provide insight into the accuracy of the latter. Several methods have been developed to estimate allele age PhyloP data using allele frequency (Kimura and Ohta, 1973; Griffiths and Tavare, 1998), intra-allelic variation (Serre et al., 1990; Risch et al., 1995; PhyloP scores generated from the EPO 36 eutherian mammal Slatkin and Rannala, 2000), patterns of linkage disequilibrium alignment with human sequence masked were generated and A.J. Sams et al. / Journal of Human Evolution 79 (2015) 64e72 67 provided by Wenqing Fu and Joshua Akey. We used the 95th Ascertainment in the Iceman genome shows reasonable fit to allele percentile to differentiate highly conserved sites (top 5%) and not age estimation highly conserved sites. The distribution of ESP age estimates for derived alleles that are Iceman genome data present in the Iceman genome exhibits a significant shift towards older ages compared with that for derived alleles present in the We obtained aligned genome reads from the Tyrolean Iceman Iceman genome (Fig. 1; median difference, p < 10 10). This result € (Otzi) from the European Short Read Archive. At the time of our holds also after removing the low frequency variants (derived allele € download, the Otzi data were provided as three BAM files; we used count below 10), which make up a majority of the dataset (data not samtools (http://samtools.org) to merge these into a single dataset shown). The overall proportion of derived alleles estimated to be and extracted the SNV sites for each allele with a valid age estimate younger than 5000 years old and present in the Iceman genome is € from the European portion of the ESP data. In order to align the Otzi very small (Table 1; 0.024%). Stated differently, alleles estimated to reads (hg18) to the HapMap/1000 Genomes data (hg19), we first be younger than 5000 years old are associated with 67-fold had to perform a genome build liftover using the UCSC Genome decreased odds (OR ¼ 0.0015) of being observed in the Iceman Browser Batch Coordinate Conversion utility (http://genome.ucsc. genome relative to alleles estimated to be older than 5000 years. edu/cgi-bin/hgLiftOver). While only 0.024% of alleles were estimated as younger than The extracted set of reads were filtered as in previous studies to 5000 years and present in the Iceman, this represents a significant include only reads with base quality >40 and mapping quality >37. deviation from the expectation of no such alleles (p ¼ 3.5 10 6). Additionally, we filtered out sites with collapsed coverage greater Therefore, we must explain the presence of the small fraction of than 15. We chose not to use a minimum read depth for the Iceman young alleles discovered in the Iceman. Assuming unbiased age genome in order to maximize the number of available sites repre- estimations, the problem must lie either in statistical error of age sented. Because of this, we chose a single random read from each estimation or with error in the aDNA, including deamidation of € SNV and joined the data from Otzi with the ESP allele ages. cytosine residues, sequencing error, and contamination. The esti- mated fraction of the Iceman sequence subject to these sorts of Neandertal genome data errors has been estimated to be approximately 2.5% of the genome (Keller et al., 2012). Our analysis should be confounded by these to a The Altai Neandertal VCF files were downloaded from the Max lesser extent since we only considered sites with known poly- Planck Institute's Department of Evolutionary Genetics server morphisms in present-day populations and removed transition (http://cdna.eva.mpg.de/neandertal/altai/AltaiNeandertal/VCF/). polymorphisms (C-T, A-G) to reduce the impact of cytosine dea- We filtered the data using minimum quality filters, which included midation errors. Only 0.86% of derived alleles that are present in the the following filters: removal of repeat sites from the Tandem Iceman are estimated to be younger than 5000 years, and they Repeat Finder, removal of sites with mapping quality <30, inclusion show a different distribution, shifted towards older ages, compared of only sites that pass a genome alignability filter that requires that with absent alleles that are also estimated to be younger than 5000 50% of 35-mers overlap a position map uniquely allowing up to one years (Fig. 1). Combined, these results support that the vast ma- mismatch, and removal of sites with read depth in the most jority of derived alleles observed in the Iceman are genuine. extreme 5% of the genome-wide coverage distribution. In order to Another piece of evidence supporting these conclusions stems compare the presence-absence distributions of the Iceman and from the frequency spectrum of present alleles: Under neutrality, Neandertal, we extracted only SNVs that were also ascertained in the probability that an allele is observed in a single ancient genome the Iceman. Rather than choosing a single read as in the Iceman, we from an ancestral population can be approximated by the present- simply scored (as 0 or 1) each genotype for the presence or absence day frequency of the allele. Fig. 2A shows the correlation of of the derived allele.
Block bootstrap tests
We used a block bootstrap approach (Lahiri, 2003; Keinan et al., 2007) to assess the significance of the median difference in allele age between presence/absence categories from zero and to assess the significance of deviations of odds-ratios from one (after dividing the sample into four young/old, present/absent bins). We resampled the ESP dataset (with replacement) in blocks of 200 Kb a total of 10,000 times and used the mean and standard error of these bootstrap distributions to calculate Z-scores and p-values.
Results and discussion
We analyzed, in total, 119,421 derived allele age estimates as estimated by the ESP for exonic single-nucleotide variants (Fu et al., 2013). These estimates are based on coalescent simulations using a model developed by Tennessen et al. (2012) and allele frequencies Figure 1. Empirical cumulative distributions of mean allele age divided by its presence from a sample of 4298 European Americans. For each SNV, we or absence in the Iceman and the Altai Neandertal. Empirical distributions of derived considered whether the derived allele is observed (present) or not allele age estimates from ESP for each subset of alleles based on their presence/absence (absent) in each of two ancient genomes, the approximately 5300 in the Iceman and the Altai Neandertal. As expected, alleles observed in ancient ge- € nomes are older than those not observed, and the distribution of the latter conforms to year old Tyrolean Iceman (Otzi) (Keller et al., 2012) and an observations from extant allele frequencies of most alleles being young. Inset bar chart approximately 50,000 year old Neandertal from the Altai Moun- summarizes the distributions at several time points (in ka). Note in particular the tains of Siberia (Prüfer et al., 2014). excess proportion of ‘young present’ alleles in the Altai genome. 68 A.J. Sams et al. / Journal of Human Evolution 79 (2015) 64e72
Table 1 Number of alleles for each combination of presence (P) or absence (A) in the Iceman and older (O) or younger (Y) age (in ka) compared to the threshold indicated in each row.
Meana SDa Y þ PYþ AOþ POþ AORb SE Z p-valuec