Genomic Distribution and Estimation of Nucleotide Diversity in Natural Populations: Perspectives from the Collared flycatcher (Ficedula Albicollis) Genome
Total Page:16
File Type:pdf, Size:1020Kb
Molecular Ecology Resources (2017) 17, 586–597 doi: 10.1111/1755-0998.12602 FROM THE COVER Genomic distribution and estimation of nucleotide diversity in natural populations: perspectives from the collared flycatcher (Ficedula albicollis) genome LUDOVIC DUTOIT, RETO BURRI, ALEXANDER NATER, CARINA F. MUGAL and HANS ELLEGREN Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Norbyv€agen 18D, SE-752 36 Uppsala, Sweden Abstract Properly estimating genetic diversity in populations of nonmodel species requires a basic understanding of how diversity is distributed across the genome and among individuals. To this end, we analysed whole-genome rese- quencing data from 20 collared flycatchers (genome size 1.1 Gb; 10.13 million single nucleotide polymorphisms detected). Genomewide nucleotide diversity was almost identical among individuals (mean = 0.00394, range = 0.00384–0.00401), but diversity levels varied extensively across the genome (95% confidence interval for 200- kb windows = 0.0013–0.0053). Diversity was related to selective constraint such that in comparison with intergenic DNA, diversity at fourfold degenerate sites was reduced to 85%, 30 UTRs to 82%, 50 UTRs to 70% and nondegenerate sites to 12%. There was a strong positive correlation between diversity and chromosome size, probably driven by a higher density of targets for selection on smaller chromosomes increasing the diversity-reducing effect of linked selection. Simulations exploring the ability of sequence data from a small number of genetic markers to capture the observed diversity clearly demonstrated that diversity estimation from finite sampling of such data is bound to be associated with large confidence intervals. Nevertheless, we show that precision in diversity estimation in large out- bred population benefits from increasing the number of loci rather than the number of individuals. Simulations mimicking RAD sequencing showed that this approach gives accurate estimates of genomewide diversity. Based on the patterns of observed diversity and the performed simulations, we provide broad recommendations for how genetic diversity should be estimated in natural populations. Keywords: genetic markers, nucleotide diversity, population genomics, recombination Received 18 February 2016; revision received 2 September 2016; accepted 19 September 2016 accurately estimate genetic diversity is essential to the Introduction study of evolutionary phenomena. Genetic diversity is a key parameter in evolutionary biol- In population genetic terms, genetic diversity reflects ogy and population genetics. It relates to the evolvability the interplay of mutation, genetic drift, selection, recom- of populations (Fisher 1930), is important in the contexts bination and gene flow on DNA sequence variation. of adaptation (Barrett & Schluter 2008), inference of pop- Intuitively, we expect large populations to harbour more ulation structure (Charlesworth 2010) and speciation genetic diversity than small populations and, in princi- (Coyne & Orr 2004), and is also relevant to conservation ple, this is defined by the population mutation rate (Θ), l and management (Frankham 1995; Reed & Frankham which equals 4Ne where Ne is the effective population 2003). Moreover, explaining the genetic diversity under- size and l is the rate of mutation. However, the determi- lying phenotypic variation has long been a challenge to nants of genetic diversity are complex and it has long evolutionary biologists because directional as well as sta- appeared mysterious that variation in levels of genetic bilizing selection should deplete this diversity (Barton & diversity among species is relatively limited despite Turelli 1989; Barton & Keightley 2002). Knowledge about huge variation in population sizes (Leffler et al. 2012), an the levels and character of genetic diversity is important observation known as Lewontin’s paradox (Lewontin to questions like these, and consequently, the ability to 1974). One possible explanation for this paradox is that the high genetic diversity expected for large populations Correspondence: Hans Ellegren, is counteracted by genetic draft (Gillespie 2000, 2001), E-mail: [email protected] with selection being more efficient in large populations © 2016 The Authors. Molecular Ecology Resources Published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made. ESTIMATING NUCLEOTIDE DIVERSITY 587 and thereby reinforcing the diversity-reducing effect of distributed across the genome in natural populations. An linked selection on neutral diversity (Corbett-Detig et al. alternative way of analysing the extent to which genomic 2015). Both pervasive positive selection (Maynard-Smith diversity is reflected in finite sampling from a population & Haigh 1974) and extensive purifying selection (Char- is to first empirically determine the genomewide land- lesworth 2012) will reduce neutral diversity at linked scape of diversity and then simulate sampling from the sites. Moreover, recent evidence suggests that life-history observed distribution. As this requires vast amounts of traits such as fecundity (Romiguier et al. 2014) rather genomic data, it has up until now not been a viable than ecological disturbance and historical contingency option. However, this is bound to change given current (short-term variation in Ne; (Banks et al. 2013)) may progress in genome sequencing and resequencing of explain variation in diversity levels among species. diverse groups of organisms (Ellegren 2014). Besides variation in diversity levels among popula- At this point, the collared flycatcher (Ficedula albicollis) tions and species, it is important to note that a single Θ is one of the ‘ecological model organisms’ in which the cannot describe genomic diversity because selection genome has been mapped in most detail. Following the locally reduces Ne to a different extent in different parts generation of a draft sequence assembly of the 1.1 Gb of the genome (Gossmann et al. 2011). Such effects are collared flycatcher genome (Ellegren et al. 2012), the con- expected to be most pronounced in regions of low struction of a high-density linkage map using a 50-k sin- recombination where linked selection will affect the fre- gle nucleotide polymorphism (SNP) array (Kawakami quency of neutral variants over longer physical distances et al. 2014a) allowed the development of a second-gen- than in regions of high recombination. This prediction is eration assembly version with unusually high assembly supported by observations in several organisms of a pos- continuity and with scaffolds ordered and oriented along itive correlation between recombination rate and nucleo- chromosomes (Kawakami et al. 2014b; Smeds et al. 2014, tide diversity (reviewed in (Cutter & Payseur 2013)). 2015). In addition, whole-genome resequencing data Additional factors that may contribute to within-genome from multiple populations and from related species are variation in diversity levels include introgression, which available (Burri et al. 2015; Nater et al. 2015; Kardos et al. may not be uniformly distributed across the genome 2016). This system therefore serves as a most useful (Wu & Ting 2004), and local or regional variation in the model and offers excellent opportunities for studies of rate of mutation (Hodgkinson & Eyre-Walker 2011). the landscape of genetic diversity in a eukaryotic gen- Moreover, even under an idealized scenario of constant ome. Here, we use whole-genome resequencing data selection and mutation, stochastic variation in the coales- from a population of collared flycatchers to address the cence process for individual genomic regions will render following questions: How is genetic diversity distributed the amount of diversity variable across the genome across chromosomes and the genome? Is the distribution (Wakeley 2009). On top of this, individuals within at of diversity more heterogeneous than expected by least small populations can vary in their overall degree chance? To what extent does genomewide diversity vary of genetic diversity due to different levels of inbreeding among individuals and among functional categories of (Bensch et al. 2006) and this may also apply when there sequences? Then, based on real data, we use simulations is selfing or frequent introgression. Yet, the ability to to examine how different sampling schemes would affect obtain a single value of the diversity parameter is impor- estimates of genetic diversity and how sampling schemes tant when broadly considering the effectiveness of selec- can be optimized to capture most of the variation in tion and drift and in among-population comparisons. genetic diversity within populations. For the reasons mentioned above, estimating genetic diversity is inherently sensitive to the number of sam- Material and methods pled loci (because of heterogeneity in diversity levels across the genome) and individuals (because of potential Sequence analysis variation in inbreeding among individuals), and also to the type of sequence analysed (e.g. neutrally evolving We used whole-genome resequencing data from 20 (10 loci vs. sequences subject to selection). Theoretical work males, 10 females) collared flycatchers from a single pop- (Pluzhnikov & Donnelly 1996; Felsenstein 2006) and sim- ulation in the Apennine