Refining the Use of Linkage Disequilibrium As A
HIGHLIGHTED ARTICLE | INVESTIGATION

Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps

Guy S. Jacobs,*,†,1 Timothy J. Sluckin,* and Toomas Kivisild‡

*Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom, †Complexity Institute, Nanyang Technological University, Singapore 637723, and ‡Department of Biological Anthropology, University of Cambridge, Cambridge CB2 1QH, United Kingdom

ORCID IDs: 0000-0002-4698-7758 (G.S.J.); 0000-0002-9163-0061 (T.J.S.); 0000-0002-6297-7808 (T.K.)

ABSTRACT During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly’s ZnS and ωmax) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly’s ZnS offers high power, but is outperformed by a novel statistic that we test, which we call Zα. We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics (ωmax, Kelly’s ZnS, and Zα) are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum.
While ωmax replicates most candidates when recombination map data are not available, the ZnS and Zα statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available.

KEYWORDS linkage disequilibrium; positive selection; recombination rate; genetic map

GENOME-WIDE selection scans now form part of the standard repertoire of techniques through which to probe the evolutionary past of a population. These attempt to identify genomic regions showing evidence of nonneutral evolution by iteratively calculating a test statistic at different locations for a sample of genetic data. As the availability of genetic data and computational power has increased, so too have the number of scans performed and the range of statistics used. In the case of human genetics, thousands of putative selection signals have been suggested (Akey 2009). For the vast majority of these, it is unclear what phenotypic association may have allowed selection to operate, and a relatively small number of signals are replicated between studies. Given the large number of hypotheses tested in genome scans for selection, many signals are likely to be false positives (Kelley et al. 2006; Akey 2009; Nei et al. 2010). This concern is compounded by systematic biases related to ascertainment (Thornton and Jensen 2007) or data quality (Mallick et al. 2009). Distinguishing true signatures of selection is challenging, requiring a multifaceted approach (Barrett and Hoekstra 2011), but improving statistics used to infer selection is an important step. This work is an attempt to test statistics calculated from patterns of linkage disequilibrium (LD) and enhance their ability to detect selective sweeps.

Distortions in the LD pattern surrounding a selected variant
are one of several population genetic effects of natural selection related to genetic hitchhiking (Smith and Haigh 1974). In the case of positive selection, a hard selective sweep is expected to initially increase local LD. As the selected variant reaches high frequency, LD between SNPs located on opposite sides of the variant drops, as described by Kim and Nielsen (2004) and illustrated in Figure 1. Statistics designed to detect both the first [Kelly’s ZnS (Kelly 1997)] and second [ω (Kim and Nielsen 2004)] patterns have been suggested, and the theoretical dynamics of LD given positive selection have been explored (Stephan et al. 2006; McVean 2007; Pfaffelhuber et al. 2008). It has been argued that one of these statistics, ω, is relatively robust to nonequilibrium demographic history of the target population, which can vastly reduce the performance of other statistics (Jensen et al. 2007; Crisci et al. 2013).

Copyright © 2016 by the Genetics Society of America
doi: 10.1534/genetics.115.185900
Manuscript received December 10, 2015; accepted for publication April 5, 2016.
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.185900/-/DC1.
1Corresponding author: Complexity Institute, Block 2 Innovation Centre, Level 2 Unit 245, 18 Nanyang Dr., Singapore 637723. E-mail: [email protected]
Genetics, Vol. 203, 1807–1825 August 2016

Figure 1 The average pattern of pairwise linkage disequilibrium (r²; lower triangle of each matrix) and SNP diversity (represented by average number of LD measurements at a given pairwise distance, upper triangle) created by a selective sweep, based on 2000 simulations. The human demographic model of Gravel et al. (2011) was used in the simulations (see Appendix: Extended Methods for details), with 40 chromosomes sampled from the European population. When selection is simulated, the sweep begins (forward in time) at time t1 generations before the present, using an initial selected allele frequency of 0.0005 and an additive selection model with the homozygous state corresponding to s = 0.04. The frequency of the selected allele in the present is, on average, 0.7 and > 0.99 when selection started 400 and 1600 generations ago, respectively. In the rightmost plot, the along (α) and over (β) regions used to calculate test statistics are indicated. As described by Kim and Nielsen (2004), this plot displays an increase in LD in the along region, but a decrease in the over region, compared to the central plot, which shows the increased LD in both regions associated with an intermediate-frequency partial sweep.

In humans (Daly et al. 2001) and other species (e.g., Petes 2001; Mezard 2006; Paigen and Petkov 2010) the local recombination rate is known to be highly variable. Genetic maps, which provide estimates of the recombination rate along a genome, are created to describe these fluctuations, either by using patterns of linkage disequilibrium in a population genetic sample (e.g., McVean et al. 2004) or by inferring recombination events in pedigrees (e.g., Kong et al. 2010). As an indication of the extent of recombination rate variation, human genetic data suggest that 60% of recombination events happen in 6% of the genome (Frazer et al. 2007). The portion of the genome with a high recombination rate manifests as highly local extreme peaks in the recombination rate, known as recombination hotspots. The implications of recombination rate variation on methods to detect selective sweeps based on pairwise LD have not yet been thoroughly explored.

In this article, we use simulations to assess the power of Kelly’s ZnS and ω to detect selective sweeps and compare them to a barrage of alternative LD-based selection statistics, some of which attempt to control for recombination rate variation. We focus on a hard sweep model of selection, but also consider selection acting on standing genetic variation, which can lead to soft sweeps (Hermisson and Pennings 2005). We then test several of the best-performing statistics on well-studied HapMap phase II data (Frazer et al. 2007), assessing their ability to replicate selection candidates identified using site frequency spectrum distortions and/or high

2002)] have been proposed: Kelly’s ZnS (Kelly 1997), Rozas’ ZA and ZZ (Rozas et al. 2001), and the ω statistic (Kim and Nielsen 2004). A separate approach [Ped/Pop (O’Reilly et al. 2008)] compares local recombination rate estimated using LD distortions in population genetic data (McVean et al. 2004) with estimates based on pedigree data to detect unexpected LD patterns and hence detect selection. We focus on two of these methods, Kelly’s ZnS and ω, which detect very different qualities of the LD distortions expected given a selective sweep.

Kelly’s ZnS is simply the average pairwise LD between all SNPs over a fixed region of the genome (Kelly 1997),

$$Z_{nS} = \frac{2}{S(S-1)} \sum_{i=1}^{S-1} \sum_{j=i+1}^{S} r^2_{i,j}, \tag{1}$$

where S corresponds to a list of polymorphic sites in the genomic region numbered [1, ..., Smax]; i and j are indicators referring to loci in the list S, and $r^2_{i,j}$ is a standardized measure of LD corresponding to the squared correlation of allelic identity between loci i and j (Hill and Robertson 1968). The normalization ensures that ZnS = 1 when all loci in S are in maximal LD. Visually, an example calculation of this statistic would be the average r² among all SNPs contained in the window x in Figure 2. Given the dynamics of LD driven by a hard selective sweep, this statistic is expected to be most effective when a selected variant has reached a moderate to high frequency, but is not nearing fixation. The approach also has relatively high power to detect soft sweeps in which a locus experiences recurrent beneficial mutations (Pennings and Hermisson 2006).

The ω statistic tries to identify a characteristic LD pattern that emerges toward the end of a hard selective sweep, represented by an increase in LD between SNPs downstream or upstream of a selected variant (along the chromosome), but a reduction in LD between those on either side of it (over the selected locus).
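
As a rough illustration of Equation 1, the following Python sketch computes Kelly's ZnS for a small set of sites. It is not the authors' implementation. It assumes phased, biallelic data with no missing genotypes, where each site is a list of 0/1 alleles with one entry per sampled chromosome. The function names `r_squared` and `kelly_zns` are hypothetical and chosen only for this example.

```python
from itertools import combinations


def r_squared(site_a, site_b):
    """Squared correlation of allelic identity between two biallelic sites.

    Assumption: each site is a sequence of 0/1 alleles, one per chromosome,
    listed in the same chromosome order for both sites.
    """
    n = len(site_a)
    p_a = sum(site_a) / n  # frequency of allele 1 at the first site
    p_b = sum(site_b) / n  # frequency of allele 1 at the second site
    # Frequency of chromosomes carrying allele 1 at both sites.
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    return d * d / denom


def kelly_zns(sites):
    """Average r^2 over all S(S-1)/2 pairs of sites, as in Equation 1."""
    s = len(sites)
    if s < 2:
        raise ValueError("need at least two polymorphic sites")
    total = sum(
        r_squared(sites[i], sites[j]) for i, j in combinations(range(s), 2)
    )
    return 2.0 * total / (s * (s - 1))


# Toy data: four chromosomes, three sites.
perfect_ld = [[0, 0, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
mixed_ld = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
zns_perfect = kelly_zns(perfect_ld)
zns_mixed = kelly_zns(mixed_ld)
```

In the first toy window every pair of sites is in maximal LD, so the statistic reaches its upper bound of 1, consistent with the normalization described above. In the second window only one of the three pairs is in LD, so the average falls to one third.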