Molecular Ecology Resources (2017) 17, 586–597 doi: 10.1111/1755-0998.12602

FROM THE COVER Genomic distribution and estimation of diversity in natural populations: perspectives from the collared flycatcher (Ficedula albicollis) genome

LUDOVIC DUTOIT, RETO BURRI, ALEXANDER NATER, CARINA F. MUGAL and HANS ELLEGREN
Department of Evolutionary Biology, Evolutionary Biology Centre, Uppsala University, Norbyvägen 18D, SE-752 36 Uppsala, Sweden

Abstract

Properly estimating genetic diversity in populations of nonmodel species requires a basic understanding of how diversity is distributed across the genome and among individuals. To this end, we analysed whole-genome resequencing data from 20 collared flycatchers (genome size 1.1 Gb; 10.13 million single nucleotide polymorphisms detected). Genomewide nucleotide diversity was almost identical among individuals (mean = 0.00394, range = 0.00384–0.00401), but diversity levels varied extensively across the genome (95% confidence interval for 200-kb windows = 0.0013–0.0053). Diversity was related to selective constraint such that, in comparison with intergenic DNA, diversity at fourfold degenerate sites was reduced to 85%, 3′ UTRs to 82%, 5′ UTRs to 70% and nondegenerate sites to 12%. There was a strong positive correlation between diversity and chromosome size, probably driven by a higher density of targets for selection on smaller chromosomes increasing the diversity-reducing effect of linked selection. Simulations exploring the ability of sequence data from a small number of genetic markers to capture the observed diversity clearly demonstrated that diversity estimation from finite sampling of such data is bound to be associated with large confidence intervals. Nevertheless, we show that precision in diversity estimation in large outbred populations benefits from increasing the number of loci rather than the number of individuals. Simulations mimicking RAD sequencing showed that this approach gives accurate estimates of genomewide diversity. Based on the patterns of observed diversity and the performed simulations, we provide broad recommendations for how genetic diversity should be estimated in natural populations.

Keywords: genetic markers, nucleotide diversity, population genomics, recombination

Received 18 February 2016; revision received 2 September 2016; accepted 19 September 2016

Correspondence: Hans Ellegren, E-mail: [email protected]

© 2016 The Authors. Molecular Ecology Resources Published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.

Introduction

Genetic diversity is a key parameter in evolutionary biology and population genetics. It relates to the evolvability of populations (Fisher 1930), is important in the contexts of adaptation (Barrett & Schluter 2008), inference of population structure (Charlesworth 2010) and speciation (Coyne & Orr 2004), and is also relevant to conservation and management (Frankham 1995; Reed & Frankham 2003). Moreover, explaining the genetic diversity underlying phenotypic variation has long been a challenge to evolutionary biologists because directional as well as stabilizing selection should deplete this diversity (Barton & Turelli 1989; Barton & Keightley 2002). Knowledge about the levels and character of genetic diversity is important to questions like these, and consequently, the ability to accurately estimate genetic diversity is essential to the study of evolutionary phenomena.

In population genetic terms, genetic diversity reflects the interplay of mutation, genetic drift, selection, recombination and gene flow on DNA sequence variation. Intuitively, we expect large populations to harbour more genetic diversity than small populations and, in principle, this is defined by the population mutation rate (Θ), which equals 4Neμ where Ne is the effective population size and μ is the rate of mutation. However, the determinants of genetic diversity are complex and it has long appeared mysterious that variation in levels of genetic diversity among species is relatively limited despite huge variation in population sizes (Leffler et al. 2012), an observation known as Lewontin’s paradox (Lewontin 1974). One possible explanation for this paradox is that the high genetic diversity expected for large populations is counteracted by genetic draft (Gillespie 2000, 2001), with selection being more efficient in large populations and thereby reinforcing the diversity-reducing effect of linked selection on neutral diversity (Corbett-Detig et al. 2015). Both pervasive positive selection (Maynard-Smith & Haigh 1974) and extensive purifying selection (Charlesworth 2012) will reduce neutral diversity at linked sites. Moreover, recent evidence suggests that life-history traits such as fecundity (Romiguier et al. 2014) rather than ecological disturbance and historical contingency (short-term variation in Ne; Banks et al. 2013) may explain variation in diversity levels among species.

Besides variation in diversity levels among populations and species, it is important to note that a single Θ cannot describe genomic diversity because selection locally reduces Ne to a different extent in different parts of the genome (Gossmann et al. 2011). Such effects are expected to be most pronounced in regions of low recombination where linked selection will affect the frequency of neutral variants over longer physical distances than in regions of high recombination. This prediction is supported by observations in several organisms of a positive correlation between recombination rate and nucleotide diversity (reviewed in Cutter & Payseur 2013). Additional factors that may contribute to within-genome variation in diversity levels include introgression, which may not be uniformly distributed across the genome (Wu & Ting 2004), and local or regional variation in the rate of mutation (Hodgkinson & Eyre-Walker 2011). Moreover, even under an idealized scenario of constant selection and mutation, stochastic variation in the coalescence process for individual genomic regions will render the amount of diversity variable across the genome (Wakeley 2009). On top of this, individuals within at least small populations can vary in their overall degree of genetic diversity due to different levels of inbreeding (Bensch et al. 2006) and this may also apply when there is selfing or frequent introgression. Yet, the ability to obtain a single value of the diversity parameter is important when broadly considering the effectiveness of selection and drift and in among-population comparisons.

For the reasons mentioned above, estimating genetic diversity is inherently sensitive to the number of sampled loci (because of heterogeneity in diversity levels across the genome) and individuals (because of potential variation in inbreeding among individuals), and also to the type of sequence analysed (e.g. neutrally evolving loci vs. sequences subject to selection). Theoretical work (Pluzhnikov & Donnelly 1996; Felsenstein 2006) and simulations (Carling & Brumfield 2007) have been used to describe the expected distribution of diversity levels across the genome and to determine how well finite sampling of loci and individuals is able to capture the variation in genetic diversity. However, it remained unclear to what extent these studies captured biologically relevant scenarios in terms of how genetic diversity is distributed across the genome in natural populations. An alternative way of analysing the extent to which genomic diversity is reflected in finite sampling from a population is to first empirically determine the genomewide landscape of diversity and then simulate sampling from the observed distribution. As this requires vast amounts of genomic data, it has up until now not been a viable option. However, this is bound to change given current progress in genome sequencing and resequencing of diverse groups of organisms (Ellegren 2014).

At this point, the collared flycatcher (Ficedula albicollis) is one of the ‘ecological model organisms’ in which the genome has been mapped in most detail. Following the generation of a draft sequence assembly of the 1.1 Gb collared flycatcher genome (Ellegren et al. 2012), the construction of a high-density linkage map using a 50-k single nucleotide polymorphism (SNP) array (Kawakami et al. 2014a) allowed the development of a second-generation assembly version with unusually high assembly continuity and with scaffolds ordered and oriented along chromosomes (Kawakami et al. 2014b; Smeds et al. 2014, 2015). In addition, whole-genome resequencing data from multiple populations and from related species are available (Burri et al. 2015; Nater et al. 2015; Kardos et al. 2016). This system therefore serves as a most useful model and offers excellent opportunities for studies of the landscape of genetic diversity in a eukaryotic genome. Here, we use whole-genome resequencing data from a population of collared flycatchers to address the following questions: How is genetic diversity distributed across chromosomes and the genome? Is the distribution of diversity more heterogeneous than expected by chance? To what extent does genomewide diversity vary among individuals and among functional categories of sequences? Then, based on real data, we use simulations to examine how different sampling schemes would affect estimates of genetic diversity and how sampling schemes can be optimized to capture most of the variation in genetic diversity within populations.

Material and methods

Sequence analysis

We used whole-genome resequencing data from 20 (10 males, 10 females) collared flycatchers from a single population in the Apennine mountain range in central Italy (available on the European Nucleotide Archive (ENA) under Accession Number PRJEB7359). Detailed information about sampling, DNA analyses and bioinformatic methods used for generating these data is given in Burri et al. (2015). Briefly, birds were sequenced with paired-end technology on an Illumina HiSeq 2000 instrument. Reads (2 × 100 bp, from libraries with insert sizes of approximately 450 bp) were then mapped to a repeat-masked version of the collared flycatcher genome assembly FicAlb1.5 (GenBank Accession GCA_000247815.2) using BWA 0.7.5a (Li & Durbin 2009) and local realignment with GATK 3.2.2 (McKenna et al. 2010; DePristo et al. 2011). Variant calling was performed using a combination of UnifiedGenotyper in GATK, Samtools (Li et al. 2009), and FreeBayes (Garrison & Marth 2012) with default settings. Base quality score recalibration was then applied in two rounds. In addition to the procedures for variant calling described in Burri et al. (2015), we applied a final filtering step by only including sites at which every individual was covered by at least five reads (with the exception of when comparing different individuals within windows).

Annotation

We downloaded ENSEMBL annotations for version FicAlb1.4 of the collared flycatcher genome assembly, translated them to the FicAlb1.5 assembly version and then categorized sequences as either representing intergenic regions, introns, coding sequences (CDS), or 5′ or 3′ untranslated regions (UTR). Alternative transcripts of the same gene were concatenated. Within coding sequences, fourfold and 0-fold degenerate sites were distinguished. In doing so, codons with coding sequence on both strands and where more than one site was variable were removed from the data set.

Estimation of nucleotide diversity

As an estimator of the population mutation rate (Θ), we estimated nucleotide diversity (π), the average pairwise number of differences per site among the chromosomes in a population (Nei & Li 1979).

$$\pi = \sum_{ij} x_i x_j \pi_{ij} \qquad (1)$$

where i, j ∈ {1, ..., n}, and n is the number of sequences in the sample. Further, x_i and x_j denote the respective frequencies of the ith and jth sequences; π_ij denotes the number of nucleotide differences per nucleotide site between the ith and jth sequences.

We estimated nucleotide diversity along the genome in 200-kb nonoverlapping windows for every individual and for the population as a whole using in-house scripts and the Python package PYVCF 0.4.0 (https://pyvcf.readthedocs.org). Windows with <10 000 sites remaining after filtering were discarded. The chosen window size was a trade-off between capturing fine-scale variation in genetic diversity and reducing measurement error. In addition, separate estimates of genomewide nucleotide diversity were made for concatenated sequences of each of the sequence categories described above. Reported estimates of nucleotide diversity are at the population level. Estimates of individual nucleotide diversities are specified as such. For estimating Z chromosome diversity, we only used data from the 10 males; in birds, males are the homogametic sex.

To obtain a null expectation for the variation in genetic diversity along the autosomal genome, we randomly reassigned the location of the total number of observed SNPs. We performed 10 resamplings and extracted the distribution of nucleotide diversity within windows under this random distribution of segregating sites for every replicate.

To examine whether any of the individuals were closely related, we assessed relatedness among individuals by calculating the unadjusted Ajk statistic (Yang et al. 2010), and inbreeding was assessed by FIS using the Hierfstat (Goudet 2005) package for R (R Development Core Team 2014).

Simulation of diversity estimation

To simulate the precision of nucleotide diversity estimates from a given number of markers and individuals, we subsampled the empirical data under different sampling schemes designed to correspond to amplicon sequencing and RAD sequencing, respectively.

In the schemes mimicking amplicon sequencing, we defined a marker as 500 sites for which all the sampled individuals had a coverage of at least five reads in a physical region of maximum 1500 bp. Sampling schemes consisted of 1–20 individuals sampled for 1–50 ‘markers’ (in this case regions of the genome), done separately for coding sequences, intergenic regions, introns and 5′ UTRs, and 1, 5, 10 and 20 individuals for 50–200 markers in intergenic regions. The schemes mimicking RAD sequencing consisted of 10 or 20 individuals sampled for 500–5000 100-bp markers of random sequence for which each bp had a coverage of at least five reads for all the sampled individuals.

For all sampling schemes, subsampling was repeated 1000 times and nucleotide diversity estimated for each sampled set of individuals and markers. The interval covering 95% of the distribution of nucleotide diversity estimates was recorded. Previous studies examining the variance in estimates of nucleotide diversity from the use of a finite number of markers have referred to this as accuracy or precision in diversity estimation (Pluzhnikov & Donnelly 1996; Felsenstein 2006; Carling & Brumfield 2007). Here, we refer to the width of the 95% confidence intervals (CIs) as precision in diversity estimation as it is a measure of the consistency of π across subsamples for a given sampling scheme.
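The estimator in Eq. (1) and the marker subsampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the in-house scripts used in the study: the function names, the toy inputs and the optional n/(n − 1) sample-size correction are all hypothetical, each sampled sequence is given frequency 1/n, and only markers (not individuals) are subsampled. Because all markers are assumed to have the same length, averaging per-marker π values is equivalent to estimating π on the concatenated markers.

```python
import random
from itertools import combinations


def nucleotide_diversity(seqs, sample_correction=True):
    """Nucleotide diversity following Eq. (1), with each sequence at frequency 1/n.

    seqs: equal-length strings (one per sampled chromosome).
    sample_correction: multiply by n/(n-1) to average over distinct pairs only
    (an assumption of this sketch, not stated in the text).
    """
    n = len(seqs)
    n_sites = len(seqs[0])
    if n < 2 or n_sites == 0:
        raise ValueError("need at least two sequences with at least one site")
    # Per-site differences summed over unordered pairs i < j.
    pair_sum = sum(
        sum(a != b for a, b in zip(s, t)) / n_sites
        for s, t in combinations(seqs, 2)
    )
    # Eq. (1) sums over ordered pairs, so every unordered pair counts twice.
    pi = 2.0 * pair_sum / (n * n)
    if sample_correction:
        pi *= n / (n - 1)
    return pi


def subsample_interval(per_marker_pi, n_markers, n_reps=1000, seed=0):
    """Central 95% interval of pi estimates from repeated marker subsampling.

    per_marker_pi: pi values of equal-length markers (hypothetical input).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_reps):
        chosen = rng.sample(per_marker_pi, n_markers)  # without replacement
        estimates.append(sum(chosen) / n_markers)
    estimates.sort()
    lower = estimates[int(0.025 * (n_reps - 1))]
    upper = estimates[int(0.975 * (n_reps - 1))]
    return lower, upper


if __name__ == "__main__":
    toy_pi = [0.001 * k for k in range(1, 11)]  # made-up per-marker values
    print(subsample_interval(toy_pi, n_markers=1))
    print(subsample_interval(toy_pi, n_markers=5))
```

With the made-up values in the usage example, the interval obtained from five markers is narrower than that obtained from a single marker, which is the quantity referred to above as precision.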


Results and discussion

Overall levels of genetic diversity

Whole-genome resequencing of 20 collared flycatchers resulted in a mean per-site coverage of 13.9× across individuals, with a range of 7.6–21.4×. 575.64 Mb out of the 1.047 Gb total autosomal assembly met the criteria of having a read coverage of at least 5× in all 20 individuals, and 10.13 million of these sites were variable in our sample. The large number of SNPs provides an excellent opportunity to assess genomewide diversity at the population as well as at the individual level. None of the individuals were closely related (maximum pairwise Ajk value: 0.09; Fig. S1, Supporting information), and there was limited evidence for inbreeding (FIS = 0.0037). This suggests that the analysed individuals constitute a random sample of birds from the population. Estimates of genomewide, autosomal nucleotide diversity per individual were almost identical among the 20 individuals (mean π = 0.00386, range = 0.00376–0.00393; Fig. 1A, Table S1, Supporting information) and corresponded to a mean of one heterozygous position every 259 bp in an individual’s genome (Table 1). With all 20 individuals taken together, the mean density of segregating sites in the sample was one every 103 bp. A survey of diversity data from 167 species representing 14 phyla found that the majority of species have π in the range of 0.0005–0.05, with a median of 0.0065 (Leffler et al. 2012). Collared flycatcher thus shows intermediate levels of genetic diversity seen in this broader context. As expected from the lower effective population size, nucleotide diversity along the Z chromosome was lower than on autosomes (mean π = 0.00288). As the evolutionary forces affecting diversity levels of autosomal and sex-linked sequences differ, and addressing these differences was beyond the scope of the study, we will not further deal with sex-linked sequences.

Fig. 1 Representations of collared flycatcher genetic diversity along its main axes of variation. (A) Mean genomewide nucleotide diversity per individual. (B) Density distribution of genetic diversity in 200-kb windows with (solid line) or without (dashed line) coding sequences included. Grey lines represent 10 simulations of the estimated density distribution of genetic diversity assuming random distribution of SNPs across the genome. (C) Autocorrelation analysis of nucleotide diversity among neighbouring windows. Partitions refer to 200-kb windows, such that, for example, 10 partitions correspond to a distance of 2 Mb. The solid line shows Spearman’s rho, and the dotted line refers to P-values. Dashed lines show rho = 0 (lower) and the significance threshold of P < 0.05 (upper).

Table 1 Average (SD) number of variable sites per individual for each type of annotated sequence category extrapolated for the whole genome

Sequence category      Number of variable sites
5′ UTR                 15 698 (225)
CDS                    14 343 (193)
3′ UTR                 865 (29)
Introns                754 081 (8020)
Intergenic regions     1 482 708 (17 170)
Total                  2 267 696 (25 208)

Genetic diversity in different sequence categories

Selection will generally reduce diversity levels. Overall, nucleotide diversity was strongly dependent on the annotated type of sequence and was directly related to the expected selective pressure (Table 2), with coding sequences being the least variable category (mean π = 0.0014). Using intergenic sequence as a putatively neutral reference (mean π = 0.0040), diversity at fourfold degenerate sites was reduced to 85% (0.0034), 3′ UTRs to 82% (0.0033), 5′ UTRs to 70% (0.0028) and nondegenerate sites to 12% (0.0005). Diversity in introns (mean π = 0.0039) was similar to that in intergenic regions. As these two categories of sequences are by far most common in avian genomes, they largely determined the genomewide average.

The lower diversity at fourfold degenerate (silent) sites compared to intergenic and intronic sites may seem surprising at first given that all three sequence categories traditionally have been considered to evolve neutrally. However, it supports observations in molecular evolutionary studies of birds (Künstner et al. 2011) and several other organisms (Eöry et al. 2010; Pollard et al. 2010; Clemente & Vogl 2012; Lawrie et al. 2013) indicating that silent sites are constrained by purifying selection. For example, they might be functionally involved with splicing or mRNA stability (Chamary et al. 2006). An alternative but not mutually exclusive explanation is that close linkage to nearby nonsynonymous sites under selection might reduce diversity at synonymous sites due to hitchhiking effects (Cutter & Payseur 2013; Corbett-Detig et al. 2015). However, if linked selection is pervasive, this should also to some extent affect introns.

It is noteworthy that regulatory sequences subject to purifying selection reside within introns and intergenic regions (Ponting 2008), illustrating the difficulty in defining a truly neutral category of sequences. One way of handling this in attempts to estimate the neutral substitution rate (and thereby getting a proxy for the mutation rate) is to mask noncoding regions that are highly conserved in alignments to sequences from distantly related species (Siepel et al. 2005). The downside of this approach is that it is difficult to distinguish between weak selective constraints and locally reduced mutation rates as the cause of sequence conservation. Moreover, functional elements in noncoding DNA may be lineage-specific due to rapid turnover (Smith et al. 2004; Meader et al. 2010) and therefore not detectable as conserved regions in distant alignments.

Finally, we note that many molecular evolutionary studies, including in birds (Künstner et al. 2011), have revealed evolutionary constraints in UTRs. This is consistent with the reduced diversity in these regions that we observe. Among other functions, UTRs are likely to contain regulatory sequences affecting gene expression including binding sites for micro-RNAs (Bartel 2009).

Table 2 Nucleotide diversity (π) in annotated regions of the collared flycatcher genome, and their fraction of the whole genome

Sequence category                 π         Proportion of genome (%)
Intergenic                        0.0040    65.9
Intronic                          0.0039    30.7
Fourfold degenerated sites        0.0034    1.3
Nondegenerated (0-fold) sites     0.0005    0.4
5′ UTR                            0.0028    1.0
3′ UTR                            0.0033    0.1
Total                             0.0039

Data are for autosomes.

Variation in genetic diversity across the genome

A comparison of the simulated distribution of π estimates among nonoverlapping 200-kb windows assuming random location of segregating sites in the genome and the observed distribution of π estimates clearly showed that genetic diversity is heterogeneously distributed across the genome (Fig. 1B). This was also evident from a significant autocorrelation in π estimates between neighbouring 200-kb windows (Fig. 1C), demonstrating regional similarity in diversity levels. On average, there was a statistically significant correlation in π estimates between windows located up to 1 Mb apart (Fig. 1C). The similarity in diversity levels on a regional scale, but also the extensive variation on a larger scale, can readily be seen from the distribution of π estimates along a chromosome (Fig. 2A). Diversity estimates differed by nearly two orders of magnitude between the least and the most variable window, with a 95% CI of 0.0013–0.0053.

The variation in π among windows was not merely a consequence of different proportions of less variable coding sequences, as excluding coding sequences from the estimate did not change the density distribution of nucleotide diversity (Fig. 1B). Moreover, as coding sequences only constitute 2% of the flycatcher genome, and UTRs another 1%, the proportion of coding sequence per window has limited influence on the variation in diversity among 200-kb windows (while it would have had higher influence at smaller window sizes). However, as shown below, the density of coding sequence may still be important in determining diversity levels via linked selection. Although window-based π estimates of different individuals were highly correlated (Fig. 2B; Spearman’s rho, range = 0.70–0.86), there was far more variation among individuals at the level of windows than at the level of the whole genome (Fig. 2A). This may of course be a statistical effect of the much larger number of sites involved when the whole genome is considered, but variation in coalescence times among haplotypes, for example due to selection, could add another layer of variation among individuals in different genomic regions. The mean difference between the highest and lowest individual π estimate per window was 0.0018 ± 0.0012.

Fig. 2 (A) Example of variation in nucleotide diversity of a chromosomal segment shown for 200-kb windows within position 30–60 Mb of chromosome 1. The solid line represents the population median, and the grey lines delimit the maximum and minimum individual nucleotide diversity among 20 birds. (B) Example of the correlation between individual nucleotide diversity in 200-kb windows of two birds (Pearson’s r = 0.87).

To address the causes of the broad-scale variation in genetic diversity across the genome, we analysed the relationship between chromosome size and diversity. This analysis was motivated by the fact that chromosome size and recombination rate are negatively correlated in collared flycatchers (Kawakami et al. 2014b) as well as in many other species of birds (e.g. ICGSC 2004) and other organisms (e.g. Farre et al. 2012; Jensen-Seaman et al. 2004). In flycatchers, the variation in mean recombination rate per chromosome is extensive, from 2 cM/Mb for chromosomes >100 Mb to >10 cM/Mb for chromosomes <10 Mb (Kawakami et al. 2014b). High rates of recombination in small chromosomes should be associated with high levels of genetic diversity (i.e. there should be a negative correlation between genetic diversity and chromosome size) because recombination uncouples selected and neutral loci, thereby reducing the effect of linked selection on neutral diversity. Studies in many species of animals and plants (reviewed in Cutter & Payseur 2013), including flycatchers (Burri et al. 2015), have indeed demonstrated a positive correlation between rate of recombination and genetic diversity. However, we found a positive relationship between chromosome size and π, with diversity increasing from a chromosome-average π of 0.0025 for the smallest chromosomes (<5 Mb) to reach a plateau at 0.0040–0.0043 for chromosomes in the size range of 50–150 Mb (Fig. 3A; Spearman’s rho = 0.803, P < 0.0001). A positive correlation was seen for all sequence categories (coding sequences, fourfold degenerate sites, nondegenerate sites, 5′ UTR, 3′ UTR, introns, intergenic DNA) when analysed separately (Fig. S3, Supporting information).
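The rank correlation used for the relationship between chromosome size and π can be illustrated with a small, dependency-free sketch. Spearman's rho is computed here as the Pearson correlation of average ranks, which handles ties; the function names and the example values are hypothetical and are not data from this study.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation between the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Made-up chromosome sizes (Mb) and per-chromosome diversity values.
    size_mb = [3, 8, 20, 60, 120]
    pi = [0.0025, 0.0031, 0.0036, 0.0041, 0.0042]
    print(spearman_rho(size_mb, pi))
```

Because rho depends only on ranks, putting chromosome size on a log10 scale, as in Fig. 3, does not change its value.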


Fig. 3 Correlation between nucleotide diversity and chromosome size (log10 scale) in (A) collared flycatcher (this study), (B) brown creeper (Certhia americana) with data from (Manthey et al. 2015) and (C) Hawaii amakihi (Hemignathus virens) with data from Callicrate et al. (2014).


The positive relationship between chromosome size and nucleotide diversity indicates that factors other than recombination are related to chromosome size and affect diversity. One such factor is the density of targets for selection, which can be approximated by the density of coding sequence. There was a strong negative correlation between the density of coding sequence and chromosome size in flycatchers (Fig. 4), consistent with findings in other avian genomes (e.g. ICGSC 2004). Everything else being equal, the higher the density of sequences subject to selection, the stronger the diversity-reducing effect of linked selection, which is observed in this system (Burri et al. 2015). The opposing effects of high rates of recombination and high density of targets of selection on levels of nucleotide diversity in small avian chromosomes suggest that the direction of the correlation between nucleotide diversity and chromosome size will be governed by the relative strength of the effects of these two causative factors. Without detailed investigation, it is hard to predict which of the effects would dominate and it might very well be that the dominating factor varies among avian species. For example, as the diversity-reducing role of linked selection should be more prevalent when Ne is large (Corbett-Detig et al. 2015), it might be that the importance of the density of targets for selection increases with increasing Ne. In our study system, it seems that the effect from density of targets of selection prevails because we found a positive correlation between nucleotide diversity and chromosome size. Analyses using the pairwise sequentially Markovian coalescent (PSMC) indicate that the long-term Ne of the investigated collared flycatcher population has been 500 000 (Nadachowska-Brzyska et al. 2016).

To further investigate the relationship between chromosome size and diversity, we searched the literature for data on levels of nucleotide diversity per chromosome in other bird species. Figure 3B (Certhia americana; Manthey et al. 2015) and C (Hemignathus virens; Callicrate et al. 2014) show the relationship between nucleotide diversity and chromosome size for two species of the order Passeriformes. In Certhia americana, in contrast to flycatchers, the correlation is negative, while in Hemignathus virens there is no significant relationship. In chicken, one study found no significant relationship between SNP rate and chromosome size [figure 2 in (ICGPC 2004)], but using data from dbSNP (http://www.ncbi.nlm.nih.gov/snp/), we actually find a negative correlation between the number of SNPs per bp and chromosome size in chicken (Spearman’s rho = −0.723, P = 6 × 10^-5). Using the same statistic for flycatchers (the number of SNPs per bp), we still find a strong positive correlation between diversity and chromosome size (rho = 0.905; P < 0.001; Fig. S2, Supporting information). Thus, the data available so far suggest that the relationship between nucleotide diversity and chromosome size in birds is inconsistent among species.

The allele frequency spectrum

With complete sequence data from 20 diploid individuals (40 chromosomes), we could obtain a detailed picture of the genomewide allele frequency spectrum, both overall and for specific sequence categories (Fig. 5). Awareness of the shape of the frequency spectrum is important when selecting informative SNP markers for inclusion in genotyping arrays, but can also inform about selection and demography. The spectrum was strongly shifted towards low frequency alleles with about 25% of all segregating sites in intergenic and intronic sequences being singletons (i.e. corresponding to a minor allele frequency of 0.025 in our sample of 40 chromosomes). As expected for sequences evolving under purifying selection, coding sequence alleles (and especially so for nondegenerate sites) segregated at lower frequencies than alleles in intergenic and intronic DNA (Fig. 5A). A slight shift towards low frequency alleles for fourfold degenerate sites (compared to intergenic and intronic DNA) suggests higher selective constraint on these sites, in line with the observation of reduced nucleotide diversity level at fourfold degenerate sites (Fig. 5B). Alternatively, close linkage to nonsynonymous sites under selection might affect the allele frequency spectrum also of silent sites.
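A folded allele frequency spectrum of the kind described above can be tabulated from per-site allele counts. The sketch below assumes that the count of one allele among the sampled chromosomes is available for each site. The function names and the toy counts are hypothetical, and nothing here reproduces the study's actual pipeline. With 40 chromosomes, a singleton has a minor allele count of 1 and therefore a minor allele frequency of 1/40 = 0.025.

```python
from collections import Counter


def folded_sfs(allele_counts, n_chrom):
    """Count segregating sites by minor allele count (folded spectrum).

    allele_counts: per-site count of one allele among n_chrom chromosomes.
    Sites where that count is 0 or n_chrom are monomorphic and are skipped.
    """
    sfs = Counter()
    for count in allele_counts:
        if 0 < count < n_chrom:
            sfs[min(count, n_chrom - count)] += 1
    return sfs


def singleton_fraction(sfs):
    """Fraction of segregating sites whose minor allele is seen exactly once."""
    total = sum(sfs.values())
    return sfs[1] / total if total else 0.0


if __name__ == "__main__":
    n_chrom = 40  # 20 diploid individuals
    toy_counts = [1, 39, 20, 5, 0, 40, 1]  # made-up per-site allele counts
    sfs = folded_sfs(toy_counts, n_chrom)
    print(dict(sfs), singleton_fraction(sfs), 1 / n_chrom)
```

Folding by the minor allele avoids the need to know which allele is ancestral, which is why a count of 39 out of 40 is treated as a singleton in the example.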

Fig. 4 Density of coding sequence (per bp) in relation to chromosome size in the collared flycatcher genome.

Simulation of π estimation with genetic markers

Having characterized the landscape of genetic diversity in the flycatcher genome by whole-genome analysis, we set out to test how well analyses of a finite number of loci, such as often applied in ecological or evolutionary studies, would be able to capture the observed

© 2016 The Authors. Molecular Ecology Resources Published by John Wiley & Sons Ltd. ESTIMATING NUCLEOTIDE DIVERSITY 593

(A) An overall and indeed important observation from the simulation of sampling variance was that the 95% CI of p estimation was relatively broad for many sam- pling schemes, in particular when the number of markers was limited. This reflects that there is signifi- cant variation in diversity levels across the genome (Fig. 6A,B). Moreover, our results show that the preci- sion in diversity estimation increased more by adding markers than by adding individuals (Figs 6 and 7), consistent with the observation of higher variation in diversity levels along the genome as compared to vari- ation among individuals. For example, with 10 inter- genic markers, the 95% CI for p estimation was similar for sequence data from only one individual (0.0018–0.0068; a ratio of 3.7 between upper and lower 2.5% tail) and from 10 individuals (0.0027–0.0055; 2.9). In contrast, with sequence data from one marker, the 95% CI for 10 individuals was very much wider (0.0004–0.009; 22.5) than that for 10 markers (0.0027– p (B) 0.0055; 2.9). The precision in the estimation of showed a steep increase (narrower CI) in the range of 1–5 markers, while further addition of markers only slowly led to improved precision (Fig. 6A,B). How- ever, even with as much as 50 markers (i.e. a total of 25 kb) sequenced for 10 individuals, the ratio between the upper and lower 95% CI tails of p estimation was 1.3, and for 200 markers 1.2, giving an indication of how many markers are needed to accurately estimate genomewide diversity. The same trends were seen for all sequence categories (Fig. 7), but it should be noted that the precision in the estimation of p was higher for selectively constrained markers (in particular for coding sequences) than for intergenic and intronic DNA (Fig. 7; compare the size of CIs towards the lower left corner of the different plots). This is proba- bly an effect of the lower genetic diversity of coding sequences leading to lower variance. 
Simulations meant to mimic RAD sequencing (500– Fig. 5 Folded allelic frequency spectra for different sequence 5000 markers) gave narrow confidence intervals of p esti- categories based on data from sites covered in all 20 individuals. mates throughout the whole marker range (Fig. 6C). The (A) Spectrum including coding sequences (orange), introns p 0 0 ratio between the upper and lower CI tails of estima- (brown), intergenic DNA (yellow), 5 UTR (purple) and 3 UTR tion was only 1.2 for 500 markers (0.0035–0.0042) and 1.1 (blue). (B) Refined spectrum for coding sequences including 0- for 5000 markers (0.0037–0.00409). It is noteworthy that fold degenerate sites (red), fourfold degenerate sites (green), and, as a reference, introns (brown). high precision was obtained despite only 100 bp were analysed per sampled marker, as is commonly the case in next-generation sequencing. Again this is consistent genomewide diversity in the population. To mimic a sit- with the higher precision provided by the sampling of uation of analysis of genetic diversity based on ampli- many markers. It could also be noted that as the RAD cons (PCR-amplified markers), we used different sequencing simulations sampled random sequences sampling schemes of 1–20 individuals sequenced for up across the genome and thus included data from all to 200–500-bp regions randomly placed in the genome. sequence categories, p estimates converged towards We refer to these regions as ‘markers’. We ran separate somewhat lower values than was the case when ampli- simulations for intergenic DNA, introns, coding con sequencing of intergenic sequences was simulated sequences and UTRs, respectively. (Fig. 6).
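The subsampling logic behind these intervals can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' simulation code: it draws random subsets of markers from a pool of per-marker diversity values and takes the 2.5% and 97.5% quantiles of the subset means. The pool here is synthetic (gamma-distributed around a mean of 0.004, a hypothetical choice), the function names are made up for this example, and the sketch ignores the sampling of individuals entirely.

```python
import random


def make_toy_pool(n_markers=5000, mean_pi=0.004, shape=4.0, seed=0):
    """Draw hypothetical per-marker diversity values (synthetic, not flycatcher data)."""
    rng = random.Random(seed)
    scale = mean_pi / shape  # a gamma distribution has mean = shape * scale
    return [rng.gammavariate(shape, scale) for _ in range(n_markers)]


def subsampling_ci(per_marker_pi, n_markers, n_reps=2000, seed=1):
    """Return the (2.5%, 97.5%) quantiles of mean diversity over random marker subsets."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_reps):
        subset = rng.sample(per_marker_pi, n_markers)  # markers drawn without replacement
        means.append(sum(subset) / n_markers)
    means.sort()
    lower = means[int(0.025 * n_reps)]
    upper = means[int(0.975 * n_reps) - 1]
    return lower, upper


def tail_ratio(ci):
    """Ratio between the upper and lower interval bounds, as reported in the text."""
    lower, upper = ci
    return upper / lower


if __name__ == "__main__":
    pool = make_toy_pool()
    for m in (1, 10, 50, 200):
        ci = subsampling_ci(pool, m)
        print(m, round(ci[0], 5), round(ci[1], 5), round(tail_ratio(ci), 2))
```

Under these assumptions the interval narrows quickly over the first few markers and more slowly afterwards, which is the qualitative pattern described for Fig. 6. The absolute widths depend on the assumed spread of the pool and should not be compared with the reported values.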


Fig. 6 About 95% subsampling confidence intervals of simulated estimated nucleotide diversity in intergenic regions in relation to (A) the number of individuals and (B) the number of 500-bp markers. (C) shows intervals for simulations mimicking RAD sequencing of 100-bp markers (x-axis in log10 scale).

[Fig. 7 panels: Intergenic, Intronic, CDS and 5′ UTR; x-axis: number of markers; y-axis: number of individuals; colour scale: size of the 95% CI.]

Fig. 7 Size of 95% subsampling confidence intervals from simulations of estimated nucleotide diversity for different sampling schemes, with varying number of 500-bp markers and individuals. Separate results from intergenic regions, introns, coding sequences and 5′ UTRs are shown. Colour coding for interval size is shown to the right.

Recommendations for sampling design

Under the conditions of our study system (an outbred population with low levels of linkage disequilibrium; Backström et al. 2006) and simulation design, the recommendation for diversity estimation would be to put more emphasis on the number of loci analysed than on the number of individuals sequenced. This recommendation should be relatively insensitive to the precise length of individual markers used and we see no reason that it would not be broadly applicable across organisms and populations. Variation in Ne across the genome, beyond stochastic variation in the coalescence process, is likely to be a common feature of many organisms, and generating sequence data from a large number of individuals from the same loci is essentially adding nonindependent data because they share a common evolutionary history (Pluzhnikov & Donnelly 1996). Moreover, we argue that significant heterogeneity in diversity levels across the genome implies that, given a fixed total amount of sequence obtained, increasing the amount of analysed sequence per locus does not generally make up for reducing the number of loci. The simulations mimicking RAD sequencing clearly showed that even with only 100 bp sequenced per locus, precision in π estimation was high when a large number of loci were sampled. In practice, our results demonstrate that RAD sequencing data will accurately reflect genomewide diversity and that little is to be gained in this respect by whole-genome sequencing as far as mean levels of diversity are concerned.

Our recommendations concord with the conclusions from theoretical work by Pluzhnikov & Donnelly (1996) and Felsenstein (2006). Carling & Brumfield (2007) used simulations to test the influence of the number and length of markers, and the number of individuals, on accuracy of estimates of genetic diversity. Overall, their work also supported the use of more markers over more individuals, but in contrast to our study, their simulations were based on simulated sequence data, not empirical data. The strong and heterogeneous effects of linked selection across the genome mean that it can be difficult to simulate sequence evolution in a way that the extent and character of polymorphism mirror biologically meaningful scenarios.

One situation in which the recommendation could potentially be compromised is for populations with significant inbreeding and, in particular, variation in the extent of inbreeding among individuals. In such cases, the importance of analysing more individuals in order to get a representative estimate of the genetic diversity in the population increases. Another aspect of inbred populations is the increased proportion of the genome that is identical by descent (IBD), meaning that the distribution of levels of genetic diversity within the genome starts to become detectably bimodal, with a certain fraction devoid of variability due to runs of homozygosity in IBD tracts (Pemberton et al. 2012). We also note that for several other purposes related to population genetic (e.g. high resolution estimation of the allele frequency spectrum, and statistical tests based on such spectra) and/or molecular evolutionary analyses (e.g. the McDonald–Kreitman test), it is more critical to sample a large number of individuals than in the case of estimating nucleotide diversity (Robinson et al. 2014).

Our simulations were based on sequence data from markers randomly distributed across the genome. With considerable variation in diversity levels among genomic regions and chromosomes, a nonrandom selection of markers for characterization of genetic diversity in a population could easily bias diversity estimates, or at least make them less comparable to estimates based on different sets of markers in other species. A correlation between chromosome size and genetic diversity seems particularly relevant to consider in this respect because the direction of correlation is apparently not consistent among species. For this reason, the recommendation would be to sample loci from as wide a range of chromosomes as possible.

Comparisons of diversity levels among species could also be hampered if the markers used are heterogeneous with respect to the type of annotated sequence, for example, if they contain both coding sequence and UTRs, or both coding and intronic sequence. If the goal is to estimate neutral genetic diversity, noncoding loci well distributed across the genome should therefore be targeted. This speaks in favour of using RAD sequencing or whole-genome resequencing for this purpose. It may be particularly relevant to emphasize that our results demonstrate that synonymous sites are generally less variable than intergenic regions and introns, indicating that they are evolving under the influence of purifying selection. Thus, such sites do not represent an ideal category of sequences for estimating neutral diversity. Finally, we stress that the analyses and simulations presented here, and the recommendations based on them, are for autosomal sequences. Many of the evolutionary forces affecting sequence diversity differ between sex chromosomes and autosomes.

Conclusions

We have shown that there is extensive variation in the level of nucleotide diversity across the genome of an avian species. This variation is seen in autosomal sequence and is thus unrelated to the well-known effects of sex linkage on genetic diversity (Hedrick 2007; Frankham 2012). Linked selection is likely to play a strong role in governing within-genome heterogeneity in diversity levels, with (variation in) recombination rate and density of targets of selection being primary determinants of the extent of linked selection. As far as we are aware, this study is the first to characterize genomewide nucleotide diversity through whole-genome resequencing of a large population sample and then use these real data to simulate how well genetic diversity would be captured by the use of genetic markers. We find that diversity estimation by sequencing a small number of amplicons is bound to be associated with large confidence intervals. Given the heterogeneity in diversity levels across the genome, gathering sequence data from many loci will increase the precision in diversity estimation. Naturally, one could ask whether molecular ecological studies will continue to be based on sequence data from a limited number of loci when genotyping-by-sequencing and whole-genome resequencing become increasingly feasible in many projects. However, even with the use of next-generation sequencing technologies, target capture approaches are cost-effective and can be used for a wide range of applications (Jones & Good 2016).

The ability to reliably estimate genetic diversity of different populations is critical for making conclusions about evolutionary processes. To end with an example, Gohli et al. (2013) recently reported an association between genetic diversity and female promiscuity in 18 passerine bird species based on sequence data from five introns (mean length 400 bp). One possible explanation for this would be that species with strong sexual selection for compatible genes (i.e. negative frequency-dependent selection for rare or dissimilar alleles) have relatively high levels of genetic diversity. The validity of this result was criticized by Spurgin (2013) on several methodological grounds, including the precision of diversity estimates and the inference of species level diversity from sampling of individual populations (see also the response to the criticism by Lifjeld et al. 2013). Based on experiences from the present study, we note that firmer conclusions should have been possible to reach with more extensive sampling of genomic data, either confirming or disproving the idea of a relationship between genetic diversity and female promiscuity.

Acknowledgements

This study was funded by the Swedish Research Council (grant numbers 2010-565 and 2013-8271) and the Knut and Alice Wallenberg Foundation (KAW Scholar grant). Sequencing was performed by the SNP&SEQ Technology Platform at Uppsala University, as part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory. The SNP&SEQ Platform is also supported by the Swedish Research Council and the Knut and Alice Wallenberg Foundation. Computations were performed on resources provided by the Swedish National Infrastructure for Computing (SNIC) through Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). We thank Rory Craig for comments on the manuscript.

References

Backström N, Qvarnström A, Gustafsson L, Ellegren H (2006) Levels of linkage disequilibrium in a wild bird population. Biology Letters, 2, 435–438.
Banks SC, Cary GJ, Smith AL et al. (2013) How does ecological disturbance influence genetic diversity? Trends in Ecology & Evolution, 28, 670–679.
Barrett RDH, Schluter D (2008) Adaptation from standing genetic variation. Trends in Ecology & Evolution, 23, 38–44.
Bartel DP (2009) MicroRNAs: target recognition and regulatory functions. Cell, 136, 215–233.
Barton NH, Keightley PD (2002) Multifactorial genetics: understanding quantitative genetic variation. Nature Reviews Genetics, 3, 11–21.
Barton NH, Turelli M (1989) Evolutionary quantitative genetics: How little do we know? Annual Review of Genetics, 23, 337–370.
Bensch S, Andren H, Hansson B et al. (2006) Heterozygosity gives hope to a wild population of inbred wolves. PLoS ONE, 1, e72.
Burri R, Nater A, Kawakami T et al. (2015) Linked selection and recombination rate variation drive the evolution of the genomic landscape of differentiation across the speciation continuum of Ficedula flycatchers. Genome Research, 25, 1656–1665.
Callicrate T, Dikow R, Thomas JW et al. (2014) Genomic resources for the endangered Hawaiian honeycreepers. BMC Genomics, 15, 1–13.
Carling MD, Brumfield RT (2007) Gene sampling strategies for multi-locus population estimates of genetic diversity (θ). PLoS ONE, 2, e160.
Chamary JV, Parmley JL, Hurst LD (2006) Hearing silence: non-neutral evolution at synonymous sites in mammals. Nature Reviews Genetics, 7, 98–108.
Charlesworth B (2010) Elements of Evolutionary Genetics. Roberts Publishers, Englewood, Colorado.
Charlesworth B (2012) The effects of deleterious mutations on evolution at linked sites. Genetics, 190, 5–22.
Clemente F, Vogl C (2012) Evidence for complex selection on four-fold degenerate sites in Drosophila melanogaster. Journal of Evolutionary Biology, 25, 2582–2595.
Corbett-Detig RB, Hartl DL, Sackton TB (2015) Natural selection constrains neutral diversity across a wide range of species. PLoS Biology, 13, e1002112.
Coyne JA, Orr HA (2004) Speciation. Sinauer Associates Incorporated, Sunderland, Massachusetts.
Cutter AD, Payseur BA (2013) Genomic signatures of selection at linked sites: unifying the disparity among species. Nature Reviews Genetics, 14, 262–274.
DePristo MA, Banks E, Poplin R et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics, 43, 491–498.
Ellegren H (2014) Genome sequencing and population genomics in non-model organisms. Trends in Ecology & Evolution, 29, 51–63.
Ellegren H, Smeds L, Burri R et al. (2012) The genomic landscape of species divergence in Ficedula flycatchers. Nature, 491, 756–760.
Eőry L, Halligan DL, Keightley PD (2010) Distributions of selectively constrained sites and deleterious mutation rates in the hominid and murid genomes. Molecular Biology and Evolution, 27, 177–192.
Farre M, Micheletti D, Ruiz-Herrera A (2012) Recombination rates and genomic shuffling in human and chimpanzee—a new twist in the chromosomal speciation theory. Molecular Biology and Evolution, 30, 853–864.
Felsenstein J (2006) Accuracy of coalescent likelihood estimates: Do we need more sites, more sequences, or more loci? Molecular Biology and Evolution, 23, 691–700.
Fisher RA (1930) The Genetical Theory of Natural Selection. The Clarendon Press, Oxford.
Frankham R (1995) Conservation genetics. Annual Review of Genetics, 29, 305–327.
Frankham R (2012) How closely does genetic diversity in finite populations conform to predictions of neutral theory? Large deficits in regions of low recombination. Heredity, 108, 167–178.
Garrison E, Marth G (2012) Haplotype-Based Variant Detection from Short-Read Sequencing. ArXiv e-prints q-bio.GN.
Gillespie JH (2000) Genetic drift in an infinite population: the pseudohitchhiking model. Genetics, 155, 909–919.
Gillespie JH (2001) Is the population size of a species relevant to its evolution. Evolution, 55, 2161–2169.
Gohli J, Anmarkrud JA, Johnsen A et al. (2013) Female promiscuity is positively associated with neutral and selected genetic diversity in passerine birds. Evolution, 67, 1406–1419.
Gossmann TI, Woolfit M, Eyre-Walker A (2011) Quantifying the variation in the effective population size within a genome. Genetics, 189, 1389–1402.
Goudet J (2005) HIERFSTAT, a package for R to compute and test hierarchical F-statistics. Molecular Ecology Notes, 5, 184–186.
Hedrick PW (2007) Sex: differences in mutation, recombination, selection, gene flow, and genetic drift. Evolution, 61, 2750–2771.
Hodgkinson A, Eyre-Walker A (2011) Variation in the mutation rate across mammalian genomes. Nature Reviews Genetics, 12, 756–766.
ICGPC (2004) A genetic variation map for chicken with 2.8 million single-nucleotide polymorphisms. Nature, 432, 717–722.
ICGSC (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature, 432, 695–716.
Jensen-Seaman MI, Furey TS, Payseur BA et al. (2004) Comparative recombination rates in the rat, mouse, and human genomes. Genome Research, 14, 528–538.
Jones MR, Good JM (2016) Targeted capture in evolutionary and ecological genomics. Molecular Ecology, 25, 185–202.
Kardos M, Husby A, McFarlane SE, Qvarnström A, Ellegren H (2016) Whole-genome resequencing of extreme phenotypes in collared flycatchers highlights the difficulty of detecting quantitative trait loci in natural populations. Molecular Ecology Resources, 16, 727–741.
Kawakami T, Backström N, Burri R et al. (2014a) Estimation of linkage disequilibrium and interspecific gene flow in Ficedula flycatchers by a newly developed 50k single-nucleotide polymorphism array. Molecular Ecology Resources, 14, 1248–1260.
Kawakami T, Smeds L, Backström N et al. (2014b) A high-density linkage map enables a second-generation collared flycatcher genome assembly and reveals the patterns of avian recombination rate variation and chromosomal evolution. Molecular Ecology, 23, 4035–4058.
Künstner A, Nabholz B, Ellegren H (2011) Evolutionary constraint in flanking regions of avian genes. Molecular Biology and Evolution, 28, 2481–2489.
Künstner A, Nabholz B, Ellegren H (2011) Significant selective constraint at 4-fold degenerate sites in the avian genome and its consequence for detection of positive selection. Genome Biology and Evolution, 3, 1381–1389.
Lawrie DS, Messer PW, Hershberg R, Petrov DA (2013) Strong purifying selection at synonymous sites in D. melanogaster. PLoS Genetics, 9, e1003527.
Leffler EM, Bullaughey K, Matute DR et al. (2012) Revisiting an old riddle: What determines genetic diversity levels within species? PLoS Biology, 10, e1001388.
Lewontin R (1974) The Genetic Basis of Evolutionary Change. Columbia University Press, New York and London.
Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics, 25, 1754–1760.
Li H, Handsaker B, Wysoker A et al. (2009) The sequence alignment/map format and SAMtools. Bioinformatics, 25, 2078–2079.
Lifjeld JT, Gohli J, Johnsen A (2013) Promiscuity, sexual selection, and genetic diversity: a reply to Spurgin. Evolution, 67, 3073–3074.
Manthey JD, Klicka J, Spellman GM (2015) Chromosomal patterns of diversity and differentiation in creepers: a next-gen phylogeographic investigation of Certhia americana. Heredity, 115, 165–172.
Maynard-Smith J, Haigh J (1974) The hitch-hiking effect of a favourable gene. Genetical Research, 23, 23–35.
McKenna A, Hanna M, Banks E et al. (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research, 20, 1297–1303.
Meader S, Ponting CP, Lunter G (2010) Massive turnover of functional sequence in human and other mammalian genomes. Genome Research, 20, 1335–1343.
Nadachowska-Brzyska K, Burri R, Smeds L, Ellegren H (2016) PSMC-analysis of effective population sizes in molecular ecology and its application to black-and-white Ficedula flycatchers. Molecular Ecology, 25, 1058–1072.
Nater A, Burri R, Kawakami T, Smeds L, Ellegren H (2015) Resolving evolutionary relationships in closely related species with whole-genome sequencing data. Systematic Biology, 64, 1000–1017.
Nei M, Li WH (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences USA, 76, 5269–5273.
Pemberton TJ, Absher D, Feldman MW et al. (2012) Genomic patterns of homozygosity in worldwide human populations. The American Journal of Human Genetics, 91, 275–292.
Pluzhnikov A, Donnelly P (1996) Optimal sequencing strategies for surveying molecular genetic diversity. Genetics, 144, 1247–1262.
Pollard KS, Hubisz MJ, Rosenbloom KR, Siepel A (2010) Detection of nonneutral substitution rates on mammalian phylogenies. Genome Research, 20, 110–121.
Ponting CP (2008) The functional repertoires of metazoan genomes. Nature Reviews Genetics, 9, 689–698.
R Development Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
Reed DH, Frankham R (2003) Correlation between fitness and genetic diversity. Conservation Biology, 17, 230–237.
Robinson JD, Coffman AJ, Hickerson MJ, Gutenkunst RN (2014) Sampling strategies for frequency spectrum-based population genomic inference. BMC Evolutionary Biology, 14, 1–16.
Romiguier J, Gayral P, Ballenghien M et al. (2014) Comparative population genomics in animals uncovers the determinants of genetic diversity. Nature, 515, 261–263.
Siepel A, Bejerano G, Pedersen JS et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Research, 15, 1034–1050.
Smeds L, Kawakami T, Burri R et al. (2014) Genomic identification and characterization of the pseudoautosomal region in highly differentiated avian sex chromosomes. Nature Communications, 5, 5448.
Smeds L, Warmuth V, Bolivar P et al. (2015) Evolutionary analysis of the female-specific avian W chromosome. Nature Communications, 6, 7330.
Smith NGC, Brandström M, Ellegren H (2004) Evidence for turnover of functional noncoding DNA in mammalian genome evolution. Genomics, 84, 806–813.
Spurgin LG (2013) Comment on Gohli et al. (2013): “Does promiscuity explain differences in levels of genetic diversity across passerine birds?” Evolution, 67, 3071–3072.
Wakeley J (2009) Coalescent Theory. Roberts Publishers, Englewood, Colorado.
Wu C-I, Ting C-T (2004) Genes and speciation. Nature Reviews Genetics, 5, 114–122.
Yang J, Benyamin B, McEvoy BP et al. (2010) Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42, 565–569.

L.D. performed all population genomic analyses. R.B. provided access to data, A.N. and C.F.M. provided input on study design, data interpretation and statistical analyses. H.E. conceived of the study and supervised the work. L.D. and H.E. initiated the study and wrote the paper.

Data accessibility

Sequences are available in the European Nucleotide Archive (ENA), Accession Number PRJEB7359. Scripts and details of the data are made available at Dryad: doi:10.5061/dryad.1n84v.

Supporting Information

Additional Supporting Information may be found in the online version of this article:

Table S1 Genome-wide nucleotide diversity per individual

Fig. S1 Pairwise relatedness among individuals in the studied collared flycatcher population

Fig. S2 Number of SNPs per bp in relation to chromosome size in the collared flycatcher genome

Fig. S3 Relationship between chromosome size (log10) and nucleotide diversity for different sequence categories. All correlations were significant (P = 0.013, or less). The strengths of the correlations were: coding sequences (R² = 0.68), 0-fold degenerate sites (R² = 0.62), fourfold degenerate sites (R² = 0.69), 5′ UTR sites (R² = 0.44), 3′ UTR sites (R² = 0.80), intergenic DNA (R² = 0.18) and introns (R² = 0.29).

© 2016 The Authors. Molecular Ecology Resources Published by John Wiley & Sons Ltd.