HIGHLIGHTED ARTICLE | INVESTIGATION

Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps

Guy S. Jacobs,*,†,1 Timothy J. Sluckin,* and Toomas Kivisild‡ *Mathematical Sciences, University of Southampton, Southampton SO17 1BJ, United Kingdom, †Complexity Institute, Nanyang Technological University, Singapore 637723, and ‡Department of Biological Anthropology, University of Cambridge, Cambridge CB2 1QH, United Kingdom ORCID IDs: 0000-0002-4698-7758 (G.S.J.); 0000-0002-9163-0061 (T.J.S.); 0000-0002-6297-7808 (T.K.)

ABSTRACT During a selective sweep, characteristic patterns of linkage disequilibrium can arise in the genomic region surrounding a selected locus. These have been used to infer past selective sweeps. However, the recombination rate is known to vary substantially along the genome for many species. We here investigate the effectiveness of current (Kelly’s $Z_{nS}$ and $\omega_{\max}$) and novel statistics at inferring hard selective sweeps based on linkage disequilibrium distortions under different conditions, including a human-realistic demographic model and recombination rate variation. When the recombination rate is constant, Kelly’s $Z_{nS}$ offers high power, but is outperformed by a novel statistic that we test, which we call $Z_\alpha$. We also find this statistic to be effective at detecting sweeps from standing variation. When recombination rate fluctuations are included, there is a considerable reduction in power for all linkage disequilibrium-based statistics. However, this can largely be reversed by appropriately controlling for expected linkage disequilibrium using a genetic map. To further test these different methods, we perform selection scans on well-characterized HapMap data, finding that all three statistics ($\omega_{\max}$, Kelly’s $Z_{nS}$, and $Z_\alpha$) are able to replicate signals at regions previously identified as selection candidates based on population differentiation or the site frequency spectrum. While $\omega_{\max}$ replicates most candidates when recombination map data are not available, the $Z_{nS}$ and $Z_\alpha$ statistics are more successful when recombination rate variation is controlled for. Given both this and their higher power in simulations of selective sweeps, these statistics are preferred when information on local recombination rate variation is available.

KEYWORDS linkage disequilibrium; positive selection; recombination rate; genetic map

Copyright © 2016 by the Genetics Society of America
doi: 10.1534/genetics.115.185900
Manuscript received December 10, 2015; accepted for publication April 5, 2016.
Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.185900/-/DC1.
1Corresponding author: Complexity Institute, Block 2 Innovation Centre, Level 2 Unit 245, 18 Nanyang Dr., Singapore 637723. E-mail: [email protected]

GENOME-WIDE selection scans now form part of the standard repertoire of techniques through which to probe the evolutionary past of a population. These attempt to identify genomic regions showing evidence of nonneutral evolution by iteratively calculating a test statistic at different locations for a sample of genetic data. As the availability of genetic data and computational power has increased, so too have the number of scans performed and the range of statistics used. In humans, thousands of putative selection signals have been suggested (Akey 2009). For the vast majority of these, it is unclear what phenotypic association may have allowed selection to operate, and a relatively small number of signals are replicated between studies.

Given the large number of hypotheses tested in genome scans for selection, many signals are likely to be false positives (Kelley et al. 2006; Akey 2009; Nei et al. 2010). This concern is compounded by systematic biases related to ascertainment (Thornton and Jensen 2007) or data quality (Mallick et al. 2009). Distinguishing true signatures of selection is challenging, requiring a multifaceted approach (Barrett and Hoekstra 2011), but improving statistics used to infer selection is an important step. This work is an attempt to test statistics calculated from patterns of linkage disequilibrium (LD) and enhance their ability to detect selective sweeps.

Distortions in the LD pattern surrounding a selected variant are one of several population genetic effects of natural selection related to genetic hitchhiking (Smith and Haigh 1974). In the case of positive selection, a hard selective sweep is expected to initially increase local LD. As the selected variant reaches high frequency, LD between SNPs located on opposite sides of the variant drops, as described by Kim and Nielsen (2004) and illustrated in Figure 1. Statistics designed to

Genetics, Vol. 203, 1807–1825, August 2016

Figure 1 The average pattern of pairwise linkage disequilibrium ($r^2$; lower triangle of each matrix) and SNP diversity (represented by average number of LD measurements at a given pairwise distance, upper triangle) created by a selective sweep, based on 2000 simulations. The human demographic model of Gravel et al. (2011) was used in the simulations (see Appendix: Extended Methods for details), with a sample of $n = 40$ drawn from the European population. When selection is simulated, the sweep begins (forward in time) at time $t_1$ generations before the present, using an initial selected allele frequency of 0.0005 and an additive selection model with the homozygous state corresponding to $s = 0.04$. The frequency of the selected allele in the present is, on average, 0.7 and > 0.99 when selection started 400 and 1600 generations ago, respectively. In the rightmost plot, the along ($\alpha$) and over ($\beta$) regions used to calculate test statistics are indicated. As described by Kim and Nielsen (2004), this plot displays an increase in LD in the along region, but a decrease in the over region, compared to the central plot, which shows the increased LD in both regions associated with an intermediate-frequency partial sweep.

detect both the first [Kelly’s $Z_{nS}$ (Kelly 1997)] and second [$\omega$ (Kim and Nielsen 2004)] patterns have been suggested, and the theoretical dynamics of LD given positive selection have been explored (Stephan et al. 2006; McVean 2007; Pfaffelhuber et al. 2008). It has been argued that one of these statistics, $\omega$, is relatively robust to nonequilibrium demographic history of the target population, which can vastly reduce the performance of other statistics (Jensen et al. 2007; Crisci et al. 2013).

In humans (Daly et al. 2001) and other species (e.g., Petes 2001; Mezard 2006; Paigen and Petkov 2010) the local recombination rate is known to be highly variable. Genetic maps, which provide estimates of the recombination rate along a genome, are created to describe these fluctuations, either by using patterns of linkage disequilibrium in a population genetic sample (e.g., McVean et al. 2004) or by inferring recombination events in pedigrees (e.g., Kong et al. 2010). As an indication of the extent of recombination rate variation, human genetic data suggest that 60% of recombination events happen in 6% of the genome (Frazer et al. 2007). The portion of the genome with a high recombination rate manifests as highly local extreme peaks in the recombination rate, known as recombination hotspots. The implications of recombination rate variation on methods to detect selective sweeps based on pairwise LD have not yet been thoroughly explored.

In this article, we use simulations to assess the power of Kelly’s $Z_{nS}$ and $\omega$ to detect selective sweeps and compare them to a barrage of alternative LD-based selection statistics, some of which attempt to control for recombination rate variation. We focus on a hard sweep model of selection, but also consider selection acting on standing genetic variation, which can lead to soft sweeps (Hermisson and Pennings 2005). We then test several of the best-performing statistics on well-studied HapMap phase II data (Frazer et al. 2007), assessing their ability to replicate selection candidates identified using site frequency spectrum distortions and/or high population differentiation.

Selection Statistics Based on LD

Several methods that use pairwise LD patterns alone [as opposed to, for example, haplotype structure (Sabeti et al. 2002)] have been proposed: Kelly’s $Z_{nS}$ (Kelly 1997), Rozas’ $ZA$ and $ZZ$ (Rozas et al. 2001), and the $\omega$ statistic (Kim and Nielsen 2004). A separate approach [Ped/Pop (O’Reilly et al. 2008)] compares local recombination rate estimated using LD distortions in population genetic data (McVean et al. 2004) with estimates based on pedigree data to detect unexpected LD patterns and hence detect selection. We focus on two of these methods, Kelly’s $Z_{nS}$ and $\omega$, which detect very different qualities of the LD distortions expected given a selective sweep.

Kelly’s $Z_{nS}$ is simply the average pairwise LD between all SNPs over a fixed region of the genome (Kelly 1997),

$$Z_{nS} = \frac{2}{S(S-1)} \sum_{i=1}^{S-1} \sum_{j=i+1}^{S} r_{i,j}^2, \qquad (1)$$

where $S$ corresponds to a list of polymorphic sites in the genomic region numbered $[1, \ldots, S_{\max}]$; $i$ and $j$ are indices referring to loci in the list $S$, and $r_{i,j}^2$ is a standardized measure of LD corresponding to the squared correlation of allelic identity between loci $i$ and $j$ (Hill and Robertson 1968). The normalization ensures that $Z_{nS} = 1$ when all loci in $S$ are in maximal LD. Visually, an example calculation of this statistic would be the average $r^2$ among all SNPs contained in the window $x$ in Figure 2. Given the dynamics of LD driven by a hard selective sweep, this statistic is expected to be most effective when a selected variant has reached a moderate to high frequency, but is not nearing fixation. The approach also has relatively high power to detect soft sweeps in which a locus experiences recurrent beneficial mutation (Pennings and Hermisson 2006).

The $\omega$ statistic tries to identify a characteristic LD pattern that emerges toward the end of a hard selective sweep, represented by an increase in LD between SNPs downstream or upstream of a selected variant (along the chromosome), but a reduction in LD between those on either side of it (over the selected locus). Following the notation and definition of Kim and Nielsen (2004), $\omega$ is calculated by taking the list of $S$ polymorphic sites in a window and dividing it into two contiguous groups, $L$ and $R$, which are located on either side of the $l$th locus. $L$ and $R$ contain $l$ and $S - l$ SNPs, respectively. Given these definitions, $\omega$ is defined as

$$\omega = \frac{\left[\binom{l}{2} + \binom{S-l}{2}\right]^{-1} \left(\sum_{i,j \in L} r_{i,j}^2 + \sum_{i,j \in R} r_{i,j}^2\right)}{\left(l(S-l)\right)^{-1} \sum_{i \in L,\, j \in R} r_{i,j}^2}, \qquad (2)$$

where the sum is taken over independent pairs $i, j$ with $i \neq j$. The position of $l$ is iteratively moved along the chromosome to obtain $\omega_{\max}$, the maximum value of $\omega$. Referring again to Figure 2, an example calculation of this statistic would be the average value of $r^2$ measurements indicated by blue circles divided by the average value of $r^2$ measurements indicated by pink circles. Large values of $\omega_{\max}$ may indicate nonneutral evolution. The pattern detected by $\omega$ is apparent in the rightmost plot of Figure 1.

Figure 2 Schematic illustration of a test statistic calculation over a region of length $X$. The dark gray line indicates a section of the chromosome and the rectangles intermittently spaced along it are SNPs. To obtain the test statistic value, the selection statistic is calculated for each SNP in the orange region, and either the maximum or the minimum of these values is retrieved as specified. The selection statistic itself is calculated over a window of length $x$ centered on the target $l$th SNP. For example, to calculate the selection statistic for the first SNP in the region (colored light blue), SNPs within distance $x/2$ are divided into sets $L$ (red) and $R$ (green). LD measurements contributing to this selection statistic value are indicated as circles above the chromosome and are colored dark blue if in the along set (used to calculate $Z_\alpha$) or pink if in the over set (used to calculate $Z_\beta$).

The OmegaPlus program (Alachiotis et al. 2012b) implements a genome-scan variant of $\omega$. Instead of locating the divisor on SNPs, a regularly sampled grid is defined over the region to be scanned. At each point on the grid, the value of $\omega_{\max}$ is determined by varying the window size and hence the SNPs in sets $L$ and $R$. In the original $\omega$ formula, the window size was limited by the length of sequence data available. As this would imply whole-chromosome windows in a genome-scan situation, the OmegaPlus implementation of $\omega$ differs in defining constraints on window size for each $\omega$ calculation, using the flags “-minwin” and “-maxwin” (supplemental material in Alachiotis et al. 2012b; first applied in Pavlidis et al. 2010). Normalization occurs as above and depends on the number of LD evaluations made in each window. The implementation of the statistic is highly optimized (Alachiotis et al. 2012a), such that large amounts of genetic data can be rapidly scanned, even on consumer-grade computers. $\omega_{\max}$ has shown promise in identifying selective sweeps in simulation studies (Jensen et al. 2007; Crisci et al. 2013).

In our work, we are interested in assessing and developing LD-based summary statistics for use in genome scans. We therefore use the OmegaPlus program to calculate $\omega_{\max}$ values rather than using the standard $\omega$-statistic construction. Henceforth we use the notation $\omega_{\max}^{\mathrm{minwin},\mathrm{maxwin}}$ to denote an individual run of OmegaPlus with window size flags -minwin and -maxwin set as the specified number of kilobases.

Challenges faced by Kelly’s $Z_{nS}$ and the $\omega$ statistic

Both Kelly’s $Z_{nS}$ and $\omega$ have been used to infer selection in population genetic data (e.g., Catalán et al. 2012; Lee et al. 2013; Renzette et al. 2015). However, both the construction of statistics and properties of the genome may complicate the interpretation of results. We discuss two potential factors affecting these LD-based statistics: variable recombination rate, which affects both Kelly’s $Z_{nS}$ and $\omega$, and the variable window size of $\omega$.

Variable recombination rate

As previously noted, the recombination rate shows high levels of genome-wide variation for a number of species. This clearly has significant implications for statistics searching for unusual patterns of LD to infer nonneutral evolution (O’Reilly et al. 2008). Recombination hotspots cause local LD to plummet, which mimics the pattern of reduced LD expected at the end of the selective sweep and can lead to large values of the $\omega$ statistic. Conversely, certain regions of the genome have low recombination rates (Petes 2001; McVean et al. 2004), and these coldspots, with correspondingly high LD, will raise Kelly’s $Z_{nS}$. If the null distribution of $\omega$ or Kelly’s $Z_{nS}$ is based on simulations that do not accurately represent recombination rate variation, recombination hot- and coldspots may lead to false positives. When the null distribution is based on empirical genome-wide data, the signal of variable recombination rate will be captured, but we nevertheless expect a reduction in statistical power compared to the constant-recombination rate case due to outliers associated with recombination hotspots and coldspots.

Variable statistic window size

Many selection scans calculate statistics based on a sliding window of fixed size. The OmegaPlus algorithm takes a different approach, calculating the statistic at static positions $l$ on a regular grid while changing the size of the windows used to define $R$ and $L$ on each side of the target locus to find $\omega_{\max}$. This method has several implications. First, we note that the spatial extent of LD distortions in the genome caused by a selective sweep will depend on recombination rate, selection strength, and the age of the sweep (Smith and Haigh 1974). Given this, it is possible that a single scan with OmegaPlus using variable window size can identify a broader range of selection scenarios than a fixed-window approach. Second, the expected value of $\omega$ under neutrality depends on the sizes of $R$ and $L$, which makes interpreting the statistic less intuitive. A highly simplified illustrative example is provided in Supplemental Material, File S1. Third, the changing size of the window means that the number of SNPs used to obtain LD values when calculating $\omega$ may vary considerably (although

Table 1 The power (FPR = 0.05, with $n = 40$) of several test statistics given a constant population size ($N_e = 10^4$) demographic model and a hard sweep (selection $s = 0.01$) to various allele frequencies $q_{t_0}$

Unless otherwise stated, $t_0 = 0$ and window size $x = 200$ kb; all test statistics are calculated as the maximum selection statistic value in an $X = 200$-kb region surrounding the selected site, apart from the diversity-based $|L||R|$ statistic where the minimum is used. The top section indicates power given a constant recombination rate and the bottom section power given recombination rate variation, with recombination rate sampled from a HapMap genetic map; see Appendix: Extended Methods.

the “-minsnps” flag allows the specification of a minimum number of SNPs in a window and hence some control of this). The variance in the $\omega$ statistic is likely to be larger under neutrality for windows containing fewer SNPs, with Equation 2 evaluating to very large values when the denominator is randomly small. A similar effect may also be important in determining the distribution of other test statistics, particularly given the impact of selection (Smith and Haigh 1974) and neutral processes such as variable mutation rate (Hodgkinson and Eyre-Walker 2011) on diversity.

The neutral distribution of statistics designed to capture local patterns of LD depends, then, on features such as the window size used and neutral variation in SNP diversity. Furthermore, variable recombination rate substantially affects observed LD values. The implication is that some control for these phenomena may improve the ability of these statistics to detect positive selection. In this work, we focus on developing and testing LD statistics calculated over a fixed-size genomic window that control for variable recombination rate.

Methods

Our approach to improving LD-based statistics is a pragmatic one. In essence, we use simulations to explore a large number of possible LD-based statistics and assess their potential to detect selective sweeps by comparing their power to Kelly’s $Z_{nS}$ and $\omega_{\max}$. The details of this process are described in Appendix: Extended Methods.

Designing selection statistics

When designing statistics, we begin by identifying a region of length $X$ bp to be assessed for evidence of selection in a genetic sample. A selection statistic, such as $Z_{nS}$, is calculated centered on each polymorphic site in the region in turn; the test statistic for the region can be either the maximum or the minimum of these values.

The selection statistic is itself calculated over a window of fixed length $x$, centered on the target locus and containing polymorphic sites forming a list $S$. Unlike in the calculation of $\omega_{\max}$, the target locus is fixed at the $l$th site in the list $S$, with the window stretching $x/2$ bp upstream and downstream of it. We then follow the approach taken by the $\omega$ statistic in dividing SNPs within the statistic window into two sets: those that are to the left of the target locus, $L$, and those to the right, $R$; see Figure 2. The $\omega$ statistic, Equation 2, averages LD measurements between SNPs in the same set (the along region) and divides this by the average LD between SNPs in different sets (the over region). A measure of the average value of LD in the along region is

$$Z_\alpha = \frac{\binom{l}{2}^{-1} \sum_{i,j \in L} r_{i,j}^2 + \binom{S-l}{2}^{-1} \sum_{i,j \in R} r_{i,j}^2}{2} \qquad (3)$$

and in the over region is

$$Z_\beta = \frac{\sum_{i \in L,\, j \in R} r_{i,j}^2}{l(S-l)}; \qquad (4)$$

see Figure 1. Assuming the number of SNPs in $L$ and $R$ are similar, Kelly’s $Z_{nS}$ will be approximately the average of $Z_\alpha$ and $Z_\beta$, while $\omega$, calculated when the divisor $l$th locus is centrally located among the list of $S$ polymorphic sites, is $Z_\alpha / Z_\beta$.

The selection statistics we investigate take a similar approach in calculating a measure of average LD in the over and

Table 2 The power (FPR = 0.05, with $n = 40$) of several test statistics (see Table 1 legend) given an out-of-Africa demographic model and selection $s = 0.01$ beginning at the time indicated, $t_1$ in generations, with initial allele frequency $q_{t_1} = 0.0005$

The average frequency of the selected allele at sampling time was 3%, 40%, and 93% for $t_1 = 400$, 800, and 1600, with $q_0 > 0.99$ when $t_1 \geq 2400$.

along regions, with the possibility of some simple operation (such as addition and division, as above) then performed on these. Unlike $\omega$ and Kelly’s $Z_{nS}$, we include the option of subtracting or otherwise controlling for the expected value of the selection statistic, based on expected LD between SNPs under neutrality. In total, we test 39 different statistics, 29 of which control for expected LD; see Appendix: Extended Methods.

Assessing the power of statistics through coalescent simulation

We assess the performance of different statistics using coalescent simulations generated with the programs MSMS (Ewing and Hermisson 2010) and Cosi2 (Shlyakhter et al. 2014). We first approximate the distribution of each statistic under neutrality by simulating 1000 samples for each demographic and recombination model. We consider the identification of a selection candidate within 100 kb of the true focus of selection to be practically useful and therefore calculate test statistics using a region size $X = 200$ kb. The 1000 values thus generated describe the null distributions for a test statistic. To approximate the distribution of the test statistics given a positive selective sweep, we simulate at least 300 samples with positive selection acting on a single SNP located in the middle of the chromosome, using various selection scenarios (e.g., different selection strengths and conditioning on different initial and final selected allele frequencies). The distribution of each test statistic under neutrality and given positive selection is then used to calculate power and receiver operating characteristic (ROC) curves (Metz 1978).

We use two demographic scenarios, one with constant population size of $N_e = 10^4$ and the other following an out-of-Africa (OOA) human demographic model suggested by Gravel et al. (2011), with samples taken from the European population. Details can be found in Appendix: Extended Methods; also see Figure A1. The recombination rate is either constant (both models) or variable (constant population size model only). For the variable recombination rate model, the rate is sampled from the HapMap human chromosome 2 (b36) genetic map, estimated from HapMap phase II populations (Frazer et al. 2007), excluding regions close to the centromere.

To control for expected LD, we generate an “LD profile” describing the expected properties of LD at a given genetic map distance, such as the average or standard deviation of $r^2$. We approximate the LD profile by simulating $3 \times 10^9$ bp of neutral genetic data according to the appropriate demographic scenario and sample size and then binning LD measurements according to their genetic map distance. For all simulations involving a variable recombination rate, we consider three scenarios. We assume that no genetic map is available, in which case physical distance serves as a proxy for genetic map distance (PhysMap); or that the true genetic map is available (TrueMap); or that a lower-resolution genetic map is available (LowResMap).

The algorithm that MSMS uses to simulate selection involves conditioning a coalescent model on a stochastically generated allele frequency trajectory. The allele frequency trajectory is created using the selection coefficient $s$, the demographic model, and several of four possible parameters describing features of the selection scenario: the time at which the selection phase begins (pastward), $t_0$ generations; the time at which selection stops, $t_1$; and the frequency of the selected variant at $t_0$ and $t_1$, $q_{t_0}$ and $q_{t_1}$ (see Figure A1).

We focus our attention on a hard sweep model and compare test statistic performance based on three selection scenario categories: low frequency, where the final frequency of the selected allele $q_{t_0=0} = [0.3, 0.5, 0.7]$; high frequency, with $q_{t_0=0} = [0.9, 0.99]$ or $q_{t_0=800} = [0.99]$; and selOOA, denoting the out-of-Africa demographic scenarios. We also briefly explore the performance of test statistics when selection acts on a variant at higher initial frequency, $q_{t_1} = [0.01, 0.05, 0.1]$. Performance is assessed using the power of test statistics (e.g., Table 1, Table 2, and Figure 4) or a measurement based on the partial area under the ROC curve [pAUC (McClish 1989)] between a false positive rate (FPR) of 0 and 0.05 (Figure 3) that assesses the best possible test statistic performance suggested by our simulations. Using the pAUC gives greater emphasis on performance when the FPR < 0.05, which is relevant for genome-wide selection scans. Again, further details on selection scenarios and the assessment of statistic performance are provided in Appendix: Extended Methods.
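As a concrete illustration of the along and over averages used throughout the Methods, the following is a minimal sketch of Equations 1 to 4 for a single window. It is not the authors' implementation: the function names, the treatment of monomorphic sites, and the convention that the first $l$ sites form $L$ (with no site excluded as the target) are assumptions made here for illustration.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared correlation of allelic identity between two biallelic sites.

    a and b are equal-length lists of 0/1 alleles, one entry per sampled chromosome.
    """
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:  # monomorphic site: r^2 is undefined, treated as 0 (assumption)
        return 0.0
    return (pab - pa * pb) ** 2 / denom


def mean(values):
    return sum(values) / len(values)


def ld_statistics(sites, l):
    """Return Z_nS, Z_alpha, Z_beta and omega for one window.

    sites: per-site allele lists ordered by position; the first l sites form L
    and the remaining S - l sites form R (hypothetical convention).
    """
    S = len(sites)
    if l < 2 or S - l < 2:
        raise ValueError("need at least two sites on each side of the divide")
    r2 = {pair: r_squared(sites[pair[0]], sites[pair[1]])
          for pair in combinations(range(S), 2)}
    along_left = [r2[p] for p in combinations(range(l), 2)]
    along_right = [r2[p] for p in combinations(range(l, S), 2)]
    over = [r2[(i, j)] for i in range(l) for j in range(l, S)]

    z_ns = mean(list(r2.values()))                        # Equation 1
    z_alpha = (mean(along_left) + mean(along_right)) / 2  # Equation 3
    z_beta = mean(over)                                   # Equation 4
    pooled_along = mean(along_left + along_right)         # numerator of Equation 2
    omega = pooled_along / z_beta if z_beta > 0 else float("inf")
    return {"z_ns": z_ns, "z_alpha": z_alpha, "z_beta": z_beta, "omega": omega}


# Toy window: two pairs of perfectly linked sites, weak LD across the divide.
window = [
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 1, 0],
]
stats = ld_statistics(window, l=2)
```

In this toy window $L$ and $R$ have the same size, so $\omega$ coincides with $Z_\alpha / Z_\beta$, as noted in the text for a centrally located divisor. With only four sites the along and over pair counts differ, so $Z_{nS}$ is not close to the average of $Z_\alpha$ and $Z_\beta$; that approximation is only expected for larger $S$.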

Figure 3 Performance of different categories of selection statistic tested according to our pAUC metric, under (A) a constant recombination rate, (B) a variable recombination rate and the true genetic map, and (C) a variable recombination rate using the physical map as an approximation of the genetic map. The statistic categories are indicated in the key, corresponding, in order, to those that control for expected LD, those that do not, Kelly’s $Z_{nS}$ statistic, the $\omega_{\max}$ statistic with various window sizes, and methods based on SNP diversity. The relationship between our pAUC metric and power is shown in D.
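Power at a fixed false positive rate and a partial-AUC summary can both be computed directly from simulated null and selection distributions of a test statistic. The sketch below is illustrative only: it assumes that larger values indicate selection (a minimum-based statistic such as $|L||R|$ would need its sign flipped first), it uses a simple step-function ROC, and it rescales the partial area by the FPR range so that a perfect test scores 1. That rescaling is an assumption made here; the exact pAUC-based metric is defined in Appendix: Extended Methods.

```python
def power_at_fpr(null, selected, fpr=0.05):
    """Fraction of selection simulations exceeding the null threshold.

    The threshold is chosen so that at most fpr of the null values exceed it.
    """
    ranked = sorted(null, reverse=True)
    k = int(fpr * len(ranked) + 1e-9)  # number of null values allowed above threshold
    threshold = ranked[k] if k < len(ranked) else float("-inf")
    return sum(v > threshold for v in selected) / len(selected)


def partial_auc(null, selected, max_fpr=0.05):
    """Area under the empirical ROC for FPR in [0, max_fpr].

    Rescaled by the FPR range so a perfect test gives 1 (assumed normalization).
    """
    ranked = sorted(null, reverse=True)
    n = len(ranked)
    k = int(max_fpr * n + 1e-9)
    if k == 0:
        return float("nan")  # too few null simulations to resolve this FPR range
    area = 0.0
    for i in range(k):
        # For FPR in [i/n, (i+1)/n) the threshold is the (i+1)th largest null value.
        tpr = sum(v > ranked[i] for v in selected) / len(selected)
        area += tpr / n
    return area / (k / n)


# Toy example with 20 null values and 4 selection simulations.
null_values = list(range(20))
selected_values = [25, 18.5, 10, 30]
power = power_at_fpr(null_values, selected_values, fpr=0.05)
pauc = partial_auc(null_values, selected_values, max_fpr=0.05)
```

With 1000 neutral simulations, as used in the article, an FPR of 0.05 corresponds to allowing 50 null values above the threshold, so the partial area is averaged over 50 ROC steps rather than the single step in this toy example.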

Replication of previously suggested selection candidates

We compare the performance of our selection statistics with Kelly’s $Z_{nS}$ and $\omega_{\max}$. Tests using several statistics, particularly the $Z_\alpha$ statistic, Equation 3, and Kelly’s $Z_{nS}$ when these control for expected LD, have high power given a variable recombination rate. To help further assess the utility of different statistics, we present selection scan results, using HapMap phase II [NCBI b36 (Frazer et al. 2007)] data for human chromosome 2 (populations of European ancestry, CEU, and East Asian ancestry, CHB+JPT) and chromosome 15 (CEU). Recombination rate is controlled for using either the HapMap genetic map [derived from LD patterns (Frazer et al. 2007)] or the deCODE genetic map [derived from inferred recombination events in a large Icelandic cohort (Kong et al. 2010)]. We assessed the ability of the various statistics to replicate selection signals previously identified based on the site frequency spectrum and/or population differentiation (Carlson et al. 2005; Nielsen et al. 2005; Oleksyk et al. 2008; Pickrell et al. 2009; Chen et al. 2010; Ronen et al. 2013; Colonna et al. 2014; Pagani et al. 2016), genetic features that should be relatively independent of local LD patterns and recombination rate variation under neutrality (although see, e.g., figure 3 of Ferrer-Admetlla et al. 2014 and Thornton 2005).

Data availability

The authors state that all information or data necessary for confirming the conclusions presented in the article are either publicly available (HapMap data) or represented fully within the article and Supplemental Material files.

Results

Our simulations identify certain test statistics (maximum Kelly’s $Z_{nS}$ or $Z_\alpha$ over the $X = 200$-kb regions we test) as particularly powerful. They also suggest that controlling for expected LD increases the power of these statistics when recombination rate is variable, but marginally reduces power when recombination rate is constant. We also found that controlling for expected LD increased the number of previously suggested selection candidates that these statistics replicated in HapMap phase II SNP data. Although the interpretation of signal replication is not trivial (signals may be false or true positives), the overall impression is that controlling for expected LD increases the performance of certain LD-based selection statistics.

Controlling for expected LD increases simulated power of statistics when recombination rate is variable

In total, we tested 29 methods that incorporated some form of control for expected linkage and 10 methods that did not. Comparing these as groups of similar statistics (for example, all those that divide average LD in the along region by that in the over region, like $\omega$, or all those that add or average LD in these two regions, like Kelly’s $Z_{nS}$), the average improvement, given a constant population size demographic model, variable recombination rate, and a hard sweep model, over methods that did not consider genetic map data was 79% (in absolute terms, 0.22) by our pAUC metric (see Figure 3). Focusing on the case of $n = 40$, window size $x = 200$ kb, and $s = 0.01$, this reflects an average increase in power at 1% FPR of 120% (0.07) for the low-frequency scenarios and 90% (0.08) for the high-frequency scenarios, over all (both high and low performance) statistics.

When controlling for variable recombination rate, we found that both the full genetic map (TrueMap) and the lower-resolution genetic map (LowResMap) yielded similar results, with a performance reduction according to the pAUC metric of just 10% and 4% for the lower-resolution map given low-frequency and high-frequency scenarios, respectively. However, trying to control for expected LD based on physical distance yielded poor results, and those statistics that did not take variable recombination rate into account showed equal or better performance (see Figure 3C).

The power to detect selection at different stages of a hard selective sweep (or, in the case of the out-of-Africa demographic model, selection acting on an initially very rare variant, $q_1 = 0.0005$) for some representative statistics is shown in Table 1 and Table 2. We note the generally high performance of $Z_\alpha$, compared to Kelly’s $Z_{nS}$ and especially $\omega_{\max}$, and the slower decay over time of reduced SNP diversity as a signal of a selective sweep compared to LD distortions (the $|L||R|$ statistic, describing the number of LD measurements in the over region). To avoid high variance in LD-based statistics, we did not calculate test statistics for simulations with very few SNPs; our reported power values for diversity-based statistics are hence slight underestimates (usually by <0.01 and always <0.07 in Table 1, Table 2, and Table 3).
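One way to picture the control for expected LD is to build an LD profile from neutral data, binned by genetic map distance, and then divide the observed along-region average by the value obtained when each observed $r^2$ is replaced by its profile expectation. The sketch below follows that idea under stated assumptions: the bin width, the nearest-bin fallback, the use of centimorgan positions, and all function names are hypothetical, and substituting profile means into Equation 3 is an assumed reading of the expected-LD control rather than the construction given in Appendix: Extended Methods.

```python
from collections import defaultdict
from itertools import combinations


def build_ld_profile(neutral_pairs, bin_width):
    """Mean r^2 per genetic-distance bin from neutral (distance, r2) measurements."""
    totals, counts = defaultdict(float), defaultdict(int)
    for distance, r2 in neutral_pairs:
        b = int(distance // bin_width)
        totals[b] += r2
        counts[b] += 1
    return {b: totals[b] / counts[b] for b in totals}


def expected_r2(profile, distance, bin_width):
    """Look up expected r^2, falling back to the nearest populated bin."""
    b = int(distance // bin_width)
    if b not in profile:
        b = min(profile, key=lambda known: abs(known - b))
    return profile[b]


def z_alpha(pair_value, n_sites, l):
    """Equation 3 evaluated with an arbitrary per-pair value function."""
    left = [pair_value(i, j) for i, j in combinations(range(l), 2)]
    right = [pair_value(i, j) for i, j in combinations(range(l, n_sites), 2)]
    return (sum(left) / len(left) + sum(right) / len(right)) / 2


def controlled_z_alpha(observed_r2, map_positions, l, profile, bin_width):
    """Observed Z_alpha divided by Z_alpha computed from profile expectations."""
    n = len(map_positions)
    observed = z_alpha(lambda i, j: observed_r2[(i, j)], n, l)
    expected = z_alpha(
        lambda i, j: expected_r2(
            profile, abs(map_positions[j] - map_positions[i]), bin_width),
        n, l)
    return observed / expected


# Toy data: four neutral measurements, then a four-SNP window with l = 2.
profile = build_ld_profile(
    [(0.01, 0.8), (0.02, 0.6), (0.11, 0.2), (0.13, 0.4)], bin_width=0.1)
positions_cm = [0.0, 0.05, 0.2, 0.25]      # hypothetical genetic map positions
observed = {(0, 1): 0.9, (2, 3): 0.9}      # observed r^2 for along-region pairs
ratio = controlled_z_alpha(observed, positions_cm, 2, profile, bin_width=0.1)
```

A ratio above 1 indicates more LD in the along region than the neutral profile predicts at those map distances. Using physical positions in place of `positions_cm` would correspond to the PhysMap scenario described in the Methods.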

Table 3 The power (FPR = 0.05, with $n = 40$) of several test statistics (see Table 1 legend) given a constant population size ($N_e = 10^4$) demographic model and selection ($s = 0.01$; see File S1, Table S3 for $s = 0.04$) acting on standing genetic variation

Selection is conditioned on the initial allele frequency $q_1$ and start of selection forward in time, $t_1$, with $t_0 = 0$ and average allele frequency at sampling, $q_0$, estimated based on >3500 simulated trajectories in which the selected allele did not go extinct.

The statistic $Z_\alpha$ also demonstrated relatively high power to detect a soft sweep acting on standing genetic variation (see Table 3 and File S1, Tables S2 and S3).

Identifying the "best" statistic from our set was not simple; full results included in File S1 show that many approaches to controlling for expected LD had similar performance. Statistics based on the average LD in the along region, like $Z_\alpha$, tended to be particularly successful, and we therefore focus on these. In Table 1 and Table 2 we controlled for the expected statistic value given genetic map distance by dividing the observed $Z_\alpha$ by that expected given the SNP distribution under neutrality, $Z_\alpha/Z_\alpha^{E[r^2]}$. While this is relatively intuitive, there are other possible approaches to controlling for expected LD, several of which are shown in Figure 4. Overall, the $Z_\alpha/Z_\alpha^{E[r^2]}$ statistic is simple to calculate and generally performed as well as or better than other LD-based statistics explored given the variable recombination rate.

Controlling for expected LD increases selection candidate signal replication in HapMap data

To further assess $Z_\alpha/Z_\alpha^{E[r^2]}$ and other selection statistics, we performed selection scans, using publicly available HapMap genotype data (Frazer et al. 2007). We chose to focus on chromosomes 2 (European descent and Asian populations, CEU and CHB+JPT) and 15 (CEU) as these chromosomes/populations include several well-characterized (e.g., Sabeti et al. 2007; Mathieson et al. 2015) selection signals: MCM6/LCT, SLC24A5, HERC2, and EDAR.

An example selection scan for chromosome 15 (CEU), using OmegaPlus ($\omega_{\max}^{50,400}$), Kelly's $Z_{nS}$, $Z_\alpha$, and $|L||R|$, as well as LD-controlled variants of Kelly's $Z_{nS}$ and $Z_\alpha$, is shown in Figure 5 [results for chromosome 2 (CEU and CHB+JPT) are included in File S1]. Selection candidate signals suggested by other studies that did not use statistics based on LD or haplotype patterns (see Appendix: Extended Methods) are indicated. Table 4 shows the average value of each test statistic in the 200-kb genomic region containing these selection signals, as well as the average rank of selection candidate regions among all 200-kb regions and an indication of how unusual the observed findings are. Results for scans using several other LD-based statistics that we investigated are included in File S1. Scans on the CEU population sample using the deCODE (Kong et al. 2010) rather than the HapMap genetic map to control for expected LD yielded similar results and are also included in File S1. While the main purpose of these scans was to assess the various statistics, we did find several novel peaks. As we are not aware of selection statistics based on pairwise LD being applied to these data (although see O'Reilly et al. 2008), we tabulate the strongest signals in File S1.

Interpreting the selection scans shown in Figure 5 and in File S1, Figures S2 and S3, is not simple. Our metrics show no signal at many selection candidates, and in some cases controlling for expected LD actually leads to a reduction in signal strength. This in itself is not concerning, in that some suggested selection targets may be false positives. For example, of the 65 selection candidate regions we identified in various studies that searched for selection in European samples on chromosome 2, only 5 were replicated between studies. Nevertheless, it is clear that controlling for expected LD does not always improve the signal even for well-accepted candidates. While the signal at SLC24A5 (CEU) increases (see Figure 5), that of MCM6/LCT (CEU) appears to fall (see File S1, Figure S2), for both Kelly's $Z_{nS}$ and $Z_\alpha$, which, unsurprisingly, are highly correlated.

Interestingly, the $\omega_{\max}$ statistic actually displayed an unusually low value for some selected regions (such as LCT in CEU) rather than the high value that is generally expected, using various window sizes (File S1 and File S2). This is presumably because the selective sweep is recent and incomplete (Itan et al. 2009; Mathieson et al. 2015), and the reduction in LD in the over region that drives high $\omega_{\max}$ values appears only in later stages of the sweep (Stephan et al. 2006). A similar pattern was sometimes observed in our simulations and was found to be strongly influenced by choice of window-size parameters.

Greater detail on the performance of the various statistics is given in Table 4. Over all selection candidate regions, Kelly's $Z_{nS}$ replicated 9 of 96 signals as 5% outliers, the same as $|L||R|$, but fewer than $Z_\alpha$ (12) and $\omega_{\max}^{50,400}$ (15 of 99). The estimate for $|L||R|$ is slightly pessimistic, as our pipeline removed regions with

Figure 4 Power (FPR = 0.01, $x = 200$ kb) of the selection statistic $Z_\alpha$ to detect a hard selective sweep compared with that of similar methods focusing on LD in the along region but controlling for expected LD; see Appendix: Extended Methods. A constant population size demographic model ($N_e = 10^4$) was used with variable recombination rate and (A) weaker, $s = 0.01$, or (B) stronger, $s = 0.04$, selection. Unless otherwise indicated, $t_0 = 0$.
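The statistics compared here can be illustrated with a small calculation. The sketch below assumes the following definitions, which are not restated in this part of the text: Kelly's $Z_{nS}$ is the mean pairwise $r^2$ over all SNP pairs; $Z_\alpha$ is the mean of the average within-left and within-right $r^2$ (the along region); $Z_\beta$ is the average $r^2$ between left and right SNPs (the over region, with $|L||R|$ measurements). The control for expected LD is sketched as the same $Z_\alpha$ average applied to a matrix of expected $r^2$ values; that matrix is a hypothetical placeholder here, since in practice it would depend on genetic map distance. All function names are illustrative.

```python
from itertools import combinations


def r_squared(a, b):
    """Squared correlation between two polymorphic biallelic sites coded 0/1."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    d = pab - pa * pb
    return d * d / (pa * (1 - pa) * pb * (1 - pb))


def mean_within(r2, sites):
    """Average r^2 over all unordered pairs drawn from `sites`."""
    pairs = list(combinations(sites, 2))
    return sum(r2[i][j] for i, j in pairs) / len(pairs)


def ld_statistics(r2, split):
    """Sites [0, split) form the left set L; sites [split, S) form R.
    Both sets need at least two sites."""
    total = len(r2)
    left, right = range(split), range(split, total)
    z_ns = mean_within(r2, range(total))
    z_alpha = 0.5 * (mean_within(r2, left) + mean_within(r2, right))
    z_beta = sum(r2[i][j] for i in left for j in right) / (len(left) * len(right))
    return z_ns, z_alpha, z_beta


# Toy data: 4 haplotypes (rows) by 4 SNPs (columns).
haplotypes = [
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
]
sites = list(zip(*haplotypes))
observed = [[r_squared(a, b) for b in sites] for a in sites]
# Hypothetical expected r^2 matrix; a real one would come from a genetic map.
expected = [[0.5] * 4 for _ in range(4)]

z_ns, z_alpha, z_beta = ld_statistics(observed, split=2)
_, z_alpha_expected, _ = ld_statistics(expected, split=2)
z_alpha_controlled = z_alpha / z_alpha_expected
```

In this toy example LD is complete within each side and absent between sides, so $Z_\alpha$ is high while $Z_\beta$ is zero, the qualitative pattern expected around a swept site.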

very few LD measurements, including two selection candidate regions [both in chromosome (Chr) 2, one Asian and the other European] that would otherwise have yielded positive results. Controlling for expected LD considerably improved the performance of both Kelly's $Z_{nS}$ and $Z_\alpha$, with both replicating 18 of 96 signals. In general, all statistics replicated considerably more signals than expected by chance, as strongly suggested by the resampling results (Table 4).

The purpose of this study has been to assess certain LD-based selection statistics rather than to search for selection signatures as such. We therefore tabulate novel hits in File S1 only. Often, several outlier regions occurred in succession, such that it is difficult to identify specific genes, variants, or features that might be driving a signal. Different approaches using pairwise LD (Clemente et al. 2014), other population genetic patterns [e.g., DIND (Barreiro et al. 2009; Pagani et al. 2016) and $\Delta$DAF (1000 Genomes Project Consortium et al. 2012; Colonna et al. 2014)], or especially biological information on the impact of variants can simplify this task. Certain novel 200-kb regions contained a single gene or a few genes, such as signals overlapping MKRN3 and ARHGAP11B (Europeans) and ABCA12 (Asians). ARHGAP11B is a human-lineage duplication that has recently been found to influence neocortex size (Florio et al. 2015), and its low variation supports a recent origin and possible selection. Mutations in MKRN3 can affect the onset of puberty (Abreu et al. 2013), and variants in this region have been associated with age at menarche (Perry et al. 2014). Finally, ABCA12 has been previously identified as a possible selection candidate based on population differentiation, possibly related to adaptation of the skin in response to ultraviolet light exposure (Colonna et al. 2014). We did not formally class this as a replication as our 200-kb region does not include the previously highlighted variant.

Dividing the 65 selection candidates identified in European chromosome 2 into 200-kb regions that did (39) and did not (26) contain protein-coding genes and calculating selection statistics did not support enrichment of signal replication in either set (File S1, Table S8). Although the naive expectation might suggest that relatively more true positive selection events, and hence signal replication, will have occurred in regions containing protein-coding genes, recent work has highlighted selection on regulation as especially important in recent human evolution (Enard et al. 2014).

Discussion

Our results suggest that incorporating information from the genetic map into calculations of selection statistics based on pairwise LD can substantially improve their performance, both in simulations and in replicating previously identified selection candidates. We have also suggested a modification of Kelly's $Z_{nS}$, $Z_\alpha$, that has higher power to detect the LD distortions caused by positive selection in simulations, and found that both these statistics outperform $\omega_{\max}$ in simulations.

We can begin to explain this by considering the distribution of different test statistics under neutrality and after a hard selective sweep (Figure 6). The maximum value of $Z_\alpha$ over the $X = 200$-kb region is often considerably larger after selection. The minimum value of $Z_\beta$ is only slightly reduced. These patterns (an increase in LD in the along region and a decrease in the over region) reflect prior simulations and theoretical expectations (Kim and Nielsen 2004; Stephan et al. 2006; McVean 2007; Pfaffelhuber et al. 2008) and are expected to be more pronounced toward the end of the sweep (leading to higher power at this stage, Table 1 and Table 2). Interestingly, the signal of reduced LD in the over region is not strong, and the maximum $Z_\alpha/Z_\beta$ test statistic demonstrates weaker power than $Z_\alpha$ and a more peaked and skewed neutral distribution.

Comparison between the distributions of maximum $Z_\alpha/Z_\beta$ and $\omega_{\max}^{10,400}$ is especially enlightening. The moderately high power of $Z_\alpha/Z_\beta$ demonstrates that the essential signal that the $\omega$ statistic tries to detect is indeed a reasonable signature of late-stage hard selective sweeps. However, the skew and kurtosis of the neutral distribution of $\omega_{\max}^{10,400}$ are far greater. This appears to be the primary cause of the low power of the $\omega$ statistic in our simulations. We suggest that the adaptive window may be extremely successful at finding exceptionally high values of $\omega_{\max}$ under neutrality, leading to a very fat-tailed distribution and difficulty in distinguishing outlier values due to selection.
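The comparison of neutral distributions described above rests on three simple quantities: statistics standardized by the neutral mean and standard deviation, and the skewness and excess kurtosis of the neutral distribution. A minimal sketch follows, assuming population (uncorrected) moment estimators; the function names and the toy values are hypothetical and only illustrate how a few extreme neutral values inflate excess kurtosis.

```python
import math


def neutral_summary(neutral):
    """Mean, standard deviation, skewness, and excess kurtosis of the
    neutral distribution (population moment estimators)."""
    n = len(neutral)
    mean = sum(neutral) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in neutral) / n)
    skew = sum(((x - mean) / sd) ** 3 for x in neutral) / n
    excess_kurtosis = sum(((x - mean) / sd) ** 4 for x in neutral) / n - 3.0
    return mean, sd, skew, excess_kurtosis


def standardize(values, mean, sd):
    """Subtract the neutral mean and divide by the neutral standard deviation."""
    return [(v - mean) / sd for v in values]


# A compact toy distribution versus one with two extreme values.
compact = [1, 2, 3, 4, 5]
fat_tailed = [0] * 98 + [-10, 10]

compact_summary = neutral_summary(compact)
fat_summary = neutral_summary(fat_tailed)
scaled = standardize([3, 5], compact_summary[0], compact_summary[1])
```

With a fat-tailed neutral distribution, the critical value at a fixed false positive rate moves far into the tail, which is one way to see why such a statistic can lose power.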

Figure 5 Selection scans (standardized to aid comparison) on HapMap phase II data (chromosome 15, CEU population), using a range of LD-based statistics and a diversity measure. From top to bottom, the scans represent Kelly's $Z_{nS}$; Kelly's $Z_{nS}$ controlled for expected LD; $Z_\alpha$; $Z_\alpha$ controlled for expected LD; the OmegaPlus program calculating $\omega_{\max}^{50,400}$; and the diversity measure $|L||R|$. Excluding OmegaPlus, all statistics were calculated with statistic window size $x = 400$ kb. The HapMap combined genetic map was used to estimate expected LD. Thick dashed lines indicate the genomic targets of two relatively well-established signatures of selection, while the light gray lines indicate signals found in a range of studies based on population differentiation and the site frequency spectrum; see Appendix: Extended Methods and File S2.
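Scans of this kind are summarized per 200-kb region, and candidate regions are then checked against the top 5% of regions. The sketch below illustrates one way this could be done, assuming the region test statistic is the maximum value within each fixed 200-kb bin and that the resampling compares the observed overlap with overlaps for randomly drawn sets of regions. The actual resampling procedure is described in the Appendix and may differ; all function names and numbers are hypothetical.

```python
import random


def region_maxima(positions, values, region_size=200_000):
    """Maximum statistic value within each fixed-size region (keyed by index)."""
    maxima = {}
    for pos, val in zip(positions, values):
        key = pos // region_size
        if key not in maxima or val > maxima[key]:
            maxima[key] = val
    return maxima


def top_fraction_regions(maxima, fraction=0.05):
    """Region indices whose test statistic lies in the top `fraction`."""
    ranked = sorted(maxima, key=maxima.get, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return set(ranked[:keep])


def replication_percentile(maxima, candidates, fraction=0.05, draws=1000, seed=0):
    """Number of candidate regions that are outliers, and the share of random
    region sets of the same size that replicate fewer."""
    outliers = top_fraction_regions(maxima, fraction)
    observed = len(outliers & set(candidates))
    rng = random.Random(seed)
    regions = sorted(maxima)
    below = 0
    for _ in range(draws):
        sample = rng.sample(regions, len(candidates))
        if len(outliers & set(sample)) < observed:
            below += 1
    return observed, below / draws


# Toy scan: 40 regions whose maximum statistic equals the region index.
positions = [i * 200_000 + 100 for i in range(40)]
values = [float(i) for i in range(40)]
maxima = region_maxima(positions, values)
replicated, percentile = replication_percentile(maxima, candidates=[39, 38, 0])
```

For a diversity measure such as $|L||R|$, where low values are of interest, the values could be negated before ranking.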

Although the $\omega_{\max}$ statistic implemented in OmegaPlus performed relatively poorly in simulations, it was effective in replicating selection candidate signals. Indeed, of the LD-based statistics that we tested that did not incorporate information from the genetic map, this statistic reproduced signals at the greatest number of previously suggested selection candidates at the 5% significance level. The apparent contradiction between power suggested by simulations and the ability to reproduce selection candidate signals in real data deserves consideration, as simulations are often used to justify the use of specific selection statistics. There are several possible confounding factors, and we begin by discussing our simulation modeling before turning to the implications of selection candidate replication.

The use of coalescent simulations to represent complex demographic scenarios (e.g., Schaffner et al. 2005) and selection (Wakeley 2010) is well established. We used two coalescent programs, MSMS (Ewing and Hermisson 2010), which has been widely used and cited, and Cosi2 (Shlyakhter et al. 2014), a development of the well-known Cosi program (Schaffner et al. 2005). In both cases, selection is incorporated by dividing the population according to allelic state at the selected site (Hudson and Kaplan 1988) and conditioning the coalescent process on an independently generated allele frequency trajectory describing the size of the two subpopulations over time. The bifurcating tree generated is expected to be least accurate when the sample size is a relatively large proportion of the population size and selection is very strong (Fu 2006; Kim and Wiehe 2009). Importantly, we did not observe any anomalous power results associated with our largest sample size ($n = 80$) and selection strength ($s = 0.08$), and simulations were also found to yield qualitatively expected patterns of LD (Kim and Nielsen 2004; Stephan et al. 2006) and reduced diversity (see Figure 1). We therefore assume that genealogies under both neutrality and positive selection are approximated with reasonable accuracy by the coalescent simulation programs we employed.

The question to ask of our simulation results, then, is whether the correct selection, demographic, and recombination scenarios were modeled. In the case of recombination, the variable rate was sampled from a genetic map (Frazer et al. 2007) estimated using the HapMap data we investigate, which in turn appears to be broadly consistent with other information on recombination rate variation (e.g., Kong et al. 2010; Wang et al. 2012). The demographic model we apply was estimated based on the joint site frequency spectrum, using low-coverage data from the 1000 Genomes project (Gravel et al. 2011). This method involves fitting the site frequency spectrum calculated using a diffusion approximation of a Wright–Fisher population that incorporates drift, selection, and migration to observed data (Gutenkunst et al. 2009). Kingman's coalescent can accurately approximate genealogies generated by a Wright–Fisher model (Kingman 1982; Fu 2006). As such, even though any demographic model is a vastly idealized version of a population's history, the parameterization will capture qualities of human genetic data that in turn should be recapitulated in simulated genealogies. The model of human demography used is broadly consistent with understanding the out-of-Africa

Table 4 The performance of test statistics in replicating a signal in candidate selected regions

From top to bottom, the sections correspond to CHB+JPT (Chr 2), CEU (Chr 2), and CEU (Chr 15). For these population/chromosome combinations, we identified (and could test using the HapMap data) 17 (15), 65 (64), and 17 (17) previously reported selection candidates, respectively. OmegaPlus calculated $\omega_{\max}^{50,400}$ across a grid rather than at SNPs and was able to test all selection candidates (17, 65, 17); for the CHB+JPT (Chr 2) data, one of the replicated candidates was not tested by the other statistics. The test statistic for a 200-kb region corresponds to the maximum value of the chosen selection statistic in that region or the minimum value for the diversity metric $|L||R|$. The average test statistic across all selection candidate regions is reported (observed value) and can be compared to the average across all 200-kb regions (expected value). The expected value of those statistics that control for expected LD is >1 due to the imbalanced impact of positive and negative fluctuations in the denominator (a randomly small denominator greatly increasing the statistic value). The average rank of selection candidate regions is also reported (observed rank), as well as the number of selection candidate regions that are top-5% outliers (or bottom 5% for $|L||R|$). A resampling approach yielded percentiles of the three assessment metrics compared to their chromosome-wide distribution; see Appendix: Extended Methods.

dispersals and is similar, in terms of divergence times and population size estimates, to a model estimated based on haplotype sharing (Harris and Nielsen 2013).

Our choice of selection scenarios is more constrained, with a focus on hard sweeps and soft sweeps from standing genetic variation. The evolution of real populations, and consequently the selection candidate list we use, will reflect a much wider range of selection phenomena. For example, soft sweeps, in which the ultimate fixation of an allele is caused by selection acting on more than one copy of that allele (Hermisson and Pennings 2005; also figure 1 of Messer and Petrov 2013), may also occur due to multiple advantageous mutations occurring at a single locus, a scenario we do not explore. Various factors contribute to the relative frequency of different types of selective sweep (Messer and Petrov 2013); both explicitly hard and soft sweeps have been inferred from genetic data (e.g., Peter et al. 2012; Garud et al. 2015; Schlamp et al. 2016).

More broadly, the frequency at which selective sweeps occurred in recent human evolution remains a subject of debate (Hernandez et al. 2011; Enard et al. 2014). Some signals may be driven by other selective processes that are known to have affected human genetic variation, in particular purifying (e.g., Bustamante et al. 2005) and balancing (e.g., Andrés et al. 2009) selection. The population genetic signatures that these create can be similar to those generated by selective sweeps, a simple example being the reduction in diversity usually expected given both hard selective sweeps and purifying selection. The details of signal replication depend on whether these processes generate outlier values for multiple selection statistics. Regarding LD-based statistics, Kelly's $Z_{nS}$ may be high in cases of [...] between linked loci (Kelly 1997) and soft sweeps involving recurrent mutation (Pennings and Hermisson 2006), while selection on loci with epistatic interactions can also affect LD patterns (Phillips 2008).

Inevitably, the patterns of LD generated by our hard and soft sweep simulations will only closely approximate patterns observed at a subset of candidate selection signals generated by selective processes. The possible range of demographic and selection scenarios that might be simulated is large such that comprehensive exploration is impractical. Meanwhile, the

Figure 6 The distribution of test statistics in constant population size ($N_e = 10^4$) neutral and hard selective sweep simulations ($s = 0.01$; $q_0 = 0.99$; $t_0 = 0$) calculated using an $x = 100$-kb statistic window (unless otherwise indicated) and $X = 200$-kb region. Maximum values of $Z_\alpha$, $Z_\alpha/Z_\beta$, and $\omega_{\max}^{10,400}$ and the minimum value of $Z_\beta$ in the region were used as test statistics. To aid comparison, statistic distributions were standardized by subtracting the mean value under neutrality and dividing by the standard deviation under neutrality. Summary metrics of the neutral distribution, skewness and excess kurtosis, are indicated above the plots, as is the power to detect selection given the hard sweep model (FPR = 0.01).

relative role of different forms of selection in different species and populations is not fully understood. Consequently, characterizing the performance of selection statistics is an iterative process. An indication of some conditions under which LD-based statistics appear effective, and of the utility of genetic maps in controlling for recombination rate variation, provides a useful baseline for further research.

In our work, we have assumed that our selection candidate set was enriched for positively selected regions and hence considered replication of signals by the LD-based statistics a useful indicator of their utility. Reproduction of a signal, however, has unclear implications. The critical property is the extent to which two statistics correlate positively under neutrality and under positive selection. To try to avoid strong correlations under neutrality, we did not use selection scans based on measures of allelic associations such as pairwise linkage disequilibrium or haplotype homozygosity statistics when generating our selection candidate signal list. We did include selection scans based on the site frequency spectrum; the pattern of high-frequency derived alleles following a selective sweep is expected to correlate with high LD due to common genealogical structure (Kim and Nielsen 2004). Summary statistics describing the site frequency spectrum can also be affected by recombination rate variation (Thornton 2005), such that the expected relationship between observed LD and such statistics appears complex.

Of broader relevance to the practice of detecting selection signals, only the reproduction of signals by statistics that show little correlation under neutrality can be considered independent evidence for selection at a locus; if this is not the case, replication can incorrectly give the impression of a robust selection candidate. Developing a detailed understanding of the correlation between different statistics under neutrality and selection, and the impact of complex demography, variable recombination, and variable mutation rate on this, is an important future direction in the interpretation of selection signatures.

A slightly different problem is caused by characteristics of molecular evolution that increase the variance of many different selection statistics, such as regionally low mutation or recombination rates. For example, infrequent mutation leads to low genetic diversity, sometimes used as an indicator of positive or purifying selection. However, it also reduces the number of available pairwise LD measurements in a region and leads to a coarser estimate of haplotype blocks and the site frequency spectrum. Increased variance in statistics at certain points in the genome will lead to more selection statistic outliers (both positive and negative) in these regions and hence a greater probability of overlapping outliers even when the neutral correlation between statistics is low.

These general points are highly relevant to the signal overlap between the LD-based statistics we tested, for example, which correlate strongly (Figure 5), but are not especially informative concerning the unexpectedly high number of selection candidates replicated by OmegaPlus. The three signals replicated only by the $\omega_{\max}$ statistic were detected in three different articles using three different methods (Table 5), such that they cannot simply be attributed to neutral correlation between OmegaPlus and another selection statistic. While the $\omega_{\max}$ statistic has greatest performance at the end of a selective sweep (e.g., Table 1), coinciding with greatest population differentiation and larger distortions to the site frequency spectrum, its power at this stage was nevertheless lower than that of other LD-based statistics in simulations. Ultimately, further work will be needed to precisely clarify the relationship between signal replication and power assessed through simulations.

When controlling for expected LD given genetic map distance, we used two different genetic maps. We focus on results using the combined HapMap genetic map (Frazer et al. 2007) in the main text, which is based on LD patterns in Europeans, Asians, and Africans. As such, there is a danger of underestimating the recombination rate in regions with high LD due to [...] and of incorrectly inferring hotspots when LD is low over a high-frequency selected site. Where recombination rate is underestimated, expected LD is correspondingly overestimated, and controlling for recombination rate is thus likely to degrade the signal of selection based on unusually high LD. Despite this, we found that results were not substantially different when using the deCODE genetic map (Kong et al. 2010) (see File S1 and File S2). This is interesting, in that differences between the genetic map

Table 5 Replication of selection candidate regions at the more stringent $P < 0.01$ level

| Population | Chr | Mb | Signals | Source | Genes | Target |
| --- | --- | --- | --- | --- | --- | --- |
| CHB+JPT | 2 | 72–73 | 2–6 | Chen et al. (2010); Colonna et al. (2014) | CYP26B1, EXOC6B, SPR, EMX1 | |
| CEU | 2 | 74.2–75 | 2, 4 | Nielsen et al. (2005); Oleksyk et al. (2008); Chen et al. (2010) | 30 genes | |
| CEU | 2 | 84.2–85 | 2, 6 | Nielsen et al. (2005); Oleksyk et al. (2008); Pagani et al. (2016) | FUNDC2P2, SUCLG1, DNAH6, TRABD2A, TMSB10 | |
| CEU | 2 | 87.2–87.8 | 5 | Colonna et al. (2014) | 5 noncoding | |
| CHB+JPT | 2 | 108.4–108.8 | 2 | Pagani et al. (2016) | GCC2, LIMS1, RANBP2, CCDC138 | EDAR |
| CEU/CHB+JPT | 2 | 121.2–121.4/121.4–121.6 | 6/6 | Chen et al. (2010)/Chen et al. (2010) | GLI2 | |
| CEU | 2 | 135.8–136.2 | 1, 3 | Nielsen et al. (2005) | ZRANB3, R3HDM1, MIR128-1 | MCM6 |
| CHB+JPT | 2 | 177–177.8 | 2, 4, 6 | Carlson et al. (2005); Pickrell et al. (2009); Chen et al. (2010) | HNRNPA3 and 5 noncoding | |
| CEU | 2 | 182.2–182.4 | 6 | Pagani et al. (2016) | CERKL, NEUROD1 | |
| CEU/CHB+JPT | 2 | 194.2–195/194.4–195 | 0–3/1–3, 5 | Pagani et al. (2016)/Carlson et al. (2005); Pickrell et al. (2009); Chen et al. (2010) | LOC101927406 | |
| CEU | 15 | 26.2–27 | 3–6 | a | 12 genes, including HERC2 | HERC2 |
| CEU | 15 | 41.4–42.2 | 1, 3 | Ronen et al. (2013); Pagani et al. (2016) | 23 genes | |
| CEU | 15 | 46–46.6 | 2, 4, 6 | Chen et al. (2010); Colonna et al. (2014) | SLC24A5, MYEF2, CTXN2, SLC12A1, DUT, FBN1 | SLC24A5 |
| CEU | 15 | 88.2–88.4 | 6 | Colonna et al. (2014) | AP3S2, C15orf38-AP3S2, ARPIN, ZNF710, MIR3174 | |

Selection statistic key: 1, Kelly's $Z_{nS}$; 2, Kelly's $Z_{nS}$/Kelly's $Z_{nS}^{E[r^2]}$; 3, $Z_\alpha$; 4, $Z_\alpha/Z_\alpha^{E[r^2]}$; 5, $|L||R|$; 6, $\omega_{\max}^{50,400}$. Genes are listed in full with resampled $P$-values in File S2.
a Several studies identified selection candidates marginally downstream or upstream of this signal (Oleksyk et al. 2008; Chen et al. 2010; Ronen et al. 2013).

estimated from LD patterns and that based on observed recombination events have been suggested as indicative of selection (O'Reilly et al. 2008). We speculate that the method of combining recombination rate estimates from multiple populations used in the HapMap genetic map substantially mitigates this effect. This is because LD-based methods tend to identify signatures of recent selection, which often postdate population divergence and will affect different genomic regions in the different populations.

We finally note that our method of controlling for expected LD is simple and that more complex alternatives may improve the power of these statistics further. For example, widely separated alleles that are always found on the same haplotype show an $r^2$ value of 1, but such distant cosegregating alleles are far less likely under neutrality if the alleles are both derived and at high frequency [the signal exploited by methods derived from EHH (Sabeti et al. 2002)]. Incorporating information about the derived allele frequency of each allele in a pairwise LD measurement in addition to genetic map distance may give a better indication of how unusual the observed LD pattern is and hence increase statistical power. The question, ultimately, would be whether this closer approximation of haplotype-based methods has advantages over the range of well-developed haplotype methods currently used.

Conclusion

Our work has confirmed that the power of selection statistics based on LD can often be improved by controlling for variable recombination rate. Doing so is likely to reduce the number of false positive selection candidates and give a clearer indication of the relative strength of selection signals. Of the methods we tested, the $Z_\alpha$ statistic and $Z_\alpha/Z_\alpha^{E[r^2]}$ showed highest power in simulations. In the absence of information on the genetic map, OmegaPlus was most successful at replicating selection candidates identified by other selection scan studies, despite often demonstrating low power in both soft and hard selective sweep simulations. Simulations are often used to test the performance of selection statistics, and this pattern creates an intriguing contradiction. Focusing on this problem, we conclude pragmatically that, without a greater understanding of the correlation between signals, under neutrality especially, it is difficult to interpret precisely what signal replication implies. Thus, in our study the OmegaPlus statistic was effective at finding signals that have already been identified, but we are unable to suggest the precise evolutionary meaning of these signals or whether this reflects shared true or false positive results. Based on both simulation and replication, incorporating information on expected LD using a genetic map can substantially improve the performance of selection statistics.

Acknowledgments

We acknowledge useful discussions with Florian Clemente, Gereon Kaiping, and Mircea Iliescu at various stages of this work and helpful comments from Andy Collins. We also thank two anonymous reviewers for their comments. This work was supported by an Engineering and Physical Sciences Research Council Doctoral Training Centre grant (EP/G03690X/1; G.S.J.) and a European Starting Investigator grant (FP7-26123) (to T.K.).

Literature Cited

Abreu, A. P., A. Dauber, D. B. Macedo, S. D. Noel, V. N. Brito et al., 2013 Central precocious puberty caused by mutations in the imprinted gene MKRN3. N. Engl. J. Med. 368: 2467–2475.
Akey, J. M., 2009 Constructing genomic maps of positive selection in humans: Where do we go from here? Genome Res. 19: 711–722.
Alachiotis, N., P. Pavlidis, and A. Stamatakis, 2012a Exploiting multi-grain parallelism for efficient selective sweep detection, pp. 56–68 in Algorithms and Architectures for Parallel Processing. Springer-Verlag, Berlin/Heidelberg, Germany/New York.
Alachiotis, N., A. Stamatakis, and P. Pavlidis, 2012b OmegaPlus: a scalable tool for rapid detection of selective sweeps in whole-genome datasets. Bioinformatics 28: 2274–2275.
Andrés, A. M., M. J. Hubisz, A. Indap, D. G. Torgerson, J. D. Degenhardt et al., 2009 Targets of balancing selection in the human genome. Mol. Biol. Evol. 26: 2755–2764.
Barreiro, L. B., M. Ben-Ali, H. Quach, G. Laval, E. Patin et al., 2009 Evolutionary dynamics of human Toll-like receptors and their different contributions to host defense. PLoS Genet. 5: e1000562.
Barrett, R. D., and H. E. Hoekstra, 2011 Molecular spandrels: tests of adaptation at the genetic level. Nat. Rev. Genet. 12: 767–780.
Bustamante, C. D., A. Fledel-Alon, S. Williamson, R. Nielsen, M. T. Hubisz et al., 2005 Natural selection on protein-coding genes in the human genome. Nature 437: 1153–1157.
Carlson, C. S., D. J. Thomas, M. A. Eberle, J. E. Swanson, R. J. Livingston et al., 2005 Genomic regions exhibiting positive selection identified from dense genotype data. Genome Res. 15: 1553–1565.
Catalán, A., S. Hutter, and J. Parsch, 2012 Population and sex differences in Drosophila melanogaster brain gene expression. BMC Genomics 13: 654.
Chen, H., N. Patterson, and D. Reich, 2010 Population differentiation as a test for selective sweeps. Genome Res. 20: 393–402.
Clemente, F. J., A. Cardona, C. E. Inchley, B. M. Peter, G. Jacobs et al., 2014 A selective sweep on a deleterious mutation in CPT1A in arctic populations. Am. J. Hum. Genet. 95: 584–589.
Colonna, V., Q. Ayub, Y. Chen, L. Pagani, P. Luisi et al., 2014 Human genomic regions with exceptionally high levels of population differentiation identified from 911 whole-genome sequences. Genome Biol. 15: R88.
Crisci, J. L., Y.-P. Poh, S. Mahajan, and J. D. Jensen, 2013 The impact of equilibrium assumptions on tests of selection. Front. Genet. 4: 235.
Daly, M. J., J. D. Rioux, S. F. Schaffner, T. J. Hudson, and E. S. Lander, 2001 High-resolution haplotype structure in the human genome. Nat. Genet. 29: 229–232.
Enard, D., P. W. Messer, and D. A. Petrov, 2014 Genome-wide signals of positive selection in human evolution. Genome Res. 24: 885–895.
Ewing, G., and J. Hermisson, 2010 MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26: 2064–2065.
Ewing, G. B., P. A. Reiff, and J. D. Jensen, 2015 PopPlanner: visually constructing demographic models for simulation. Front. Genet. 6: 150.
Ferrer-Admetlla, A., M. Liang, T. Korneliussen, and R. Nielsen, 2014 On detecting incomplete soft or hard selective sweeps using haplotype structure. Mol. Biol. Evol. 31: 1275–1291.
Florio, M., M. Albert, E. Taverna, T. Namba, H. Brandl et al., 2015 Human-specific gene ARHGAP11B promotes basal progenitor amplification and neocortex expansion. Science 347: 1465–1470.
Frazer, K. A., D. G. Ballinger, D. R. Cox, D. A. Hinds, L. L. Stuve et al., 2007 A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851–861.
Fu, Y.-X., 2006 Exact coalescent for the Wright–Fisher model. Theor. Popul. Biol. 69: 385–394.
Garud, N. R., P. W. Messer, E. O. Buzbas, and D. A. Petrov, 2015 Recent selective sweeps in North American Drosophila melanogaster show signatures of soft sweeps. PLoS Genet. 11: e1005004.
Gravel, S., B. M. Henn, R. N. Gutenkunst, A. R. Indap, G. T. Marth et al., 2011 Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. USA 108: 11983–11988.
Gutenkunst, R. N., R. D. Hernandez, S. H. Williamson, and C. D. Bustamante, 2009 Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genet. 5: e1000695.
Harris, K., and R. Nielsen, 2013 Inferring demographic history from a spectrum of shared haplotype lengths. PLoS Genet. 9: e1003521.
Hermisson, J., and P. S. Pennings, 2005 Soft sweeps: molecular population genetics of adaptation from standing genetic variation. Genetics 169: 2335–2352.
Hernandez, R. D., J. L. Kelley, E. Elyashiv, S. C. Melton, A. Auton et al., 2011 Classic selective sweeps were rare in recent human evolution. Science 331: 920–924.
Hill, W., and A. Robertson, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38: 226–231.
Hodgkinson, A., and A. Eyre-Walker, 2011 Variation in the mutation rate across mammalian genomes. Nat. Rev. Genet. 12: 756–766.
Hudson, R. R., and N. L. Kaplan, 1988 The coalescent process in models with selection and recombination. Genetics 120: 831–840.
Itan, Y., A. Powell, M. A. Beaumont, J. Burger, M. G. Thomas et al., 2009 The origins of lactase persistence in Europe. PLoS Comput. Biol. 5: e1000491.
Jensen, J. D., K. R. Thornton, C. D. Bustamante, and C. F. Aquadro, 2007 On the utility of linkage disequilibrium as a statistic for identifying targets of positive selection in nonequilibrium populations. Genetics 176: 2371–2379.
Jones, E., T. Oliphant, and P. Peterson, 2001 SciPy: open source scientific tools for Python. Available at: http://www.scipy.org. Accessed August 19, 2015.
Kelley, J. L., J. Madeoy, J. C. Calhoun, W. Swanson, and J. M. Akey, 2006 Genomic signatures of positive selection in humans and the limits of outlier approaches. Genome Res. 16: 980–989.
Kelly, J. K., 1997 A test of neutrality based on interlocus associations. Genetics 146: 1197–1206.
Kim, Y., and R. Nielsen, 2004 Linkage disequilibrium as a signature of selective sweeps. Genetics 167: 1513–1524.
Kim, Y., and T. Wiehe, 2009 Simulation of DNA sequence evolution under models of recent directional selection. Brief. Bioinform. 10: 84–96.
Kingman, J. F., 1982 On the genealogy of large populations. J. Appl. Probab. 19: 27–43.
Kong, A., G. Thorleifsson, D. F. Gudbjartsson, G. Masson, A. Sigurdsson et al., 2010 Fine-scale recombination rate differences between sexes, populations and individuals. Nature 467: 1099–1103.
Lee, T., S. Cho, K. S. Seo, J. Chang, H. Kim et al., 2013 Genetic variants and signatures of selective sweep of Hanwoo population (Korean native cattle). BMB Rep. 46: 346–351.
Mallick, S., S. Gnerre, P. Muller, and D. Reich, 2009 The difficulty of avoiding false positives in genome scans for natural selection. Genome Res. 19: 922–933.
Mathieson, I., I. Lazaridis, N. Rohland, S. Mallick, N. Patterson et al., 2015 Genome-wide patterns of selection in 230 ancient Eurasians. Nature 528: 499–503.
McClish, D. K., 1989 Analyzing a portion of the ROC curve. Med. Decis. Making 9: 190–195.

LD as a Signature of Selective Sweeps 1819 McVean, G., 2007 The structure of linkage disequilibrium around Phillips, P. C., 2008 Epistasis - the essential role of gene interac- a selective sweep. Genetics 175: 1395–1406. tions in the structure and evolution of genetic systems. Nat. Rev. McVean, G. A., S. R. Myers, S. Hunt, P. Deloukas, D. R. Bentley Genet. 9: 855–867. et al., 2004 The fine-scale structure of recombination rate var- Pickrell, J. K., G. Coop, J. Novembre, S. Kudaravalli, J. Z. Li et al., iation in the human genome. Science 304: 581–584. 2009 Signals of recent positive selection in a worldwide sam- Messer, P. W., and D. A. Petrov, 2013 Population genomics of ple of human populations. Genome Res. 19: 826–837. rapid adaptation by soft selective sweeps. Trends Ecol. Evol. Renzette, N., T. F. Kowalik, and J. D. Jensen, 2016 On the relative 28: 659–669. roles of background selection and genetic hitchhiking in shaping Metz, C. E., 1978 Basic principles of ROC analysis. Semin. Nucl. human cytomegalovirus genetic diversity. Mol. Ecol. 25: 403–413. Med. 8: 283–298. Ronen, R., N. Udpa, E. Halperin, and V. Bafna, 2013 Learning Mezard, C., 2006 Meiotic recombination hotspots in plants. Bio- natural selection from the site frequency spectrum. Genetics chem. Soc. Trans. 34: 531–534. 195: 181–193. Nei, M., Y. Suzuki, and M. Nozawa, 2010 The neutral theory of Rozas, J., M. Gullaud, G. Blandin, and M. Aguadé, 2001 DNA molecular evolution in the genomic era. Annu. Rev. Genomics variation at the rp49 gene region of Drosophila simulans: evo- Hum. Genet. 11: 265–289. lutionary inferences from an unusual haplotype structure. Ge- Nielsen, R., S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark et al., netics 158: 1147–1155. 2005 Genomic scans for selective sweeps using SNP data. Ge- Sabeti, P. C., D. E. Reich, J. M. Higgins, H. Z. Levine, D. J. Richter nome Res. 15: 1566–1575. et al., 2002 Detecting recent positive selection in the human Oleksyk, T. K., K. Zhao, M. 
Francisco, D. A. Gilbert, S. J. O’Brien genome from haplotype structure. Nature 419: 832–837. et al., 2008 Identifying selected regions from heterozygosity Sabeti, P. C., P. Varilly, B. Fry, J. Lohmueller, E. Hostetter et al., and divergence using a light-coverage genomic dataset from 2007 Genome-wide detection and characterization of positive two human populations. PLoS One 3: e1712. selection in human populations. Nature 449: 913–918. 1000 Genomes Project Consortium, G. R. Abecasis,A.Auton,L.D. Schaffner, S. F., C. Foo, S. Gabriel, D. Reich, M. J. Daly et al., Brooks,M.A.DePristo et al., 2012 An integrated map of ge- 2005 Calibrating a coalescent simulation of human genome netic variation from 1,092 human genomes. Nature 491: 56–65. sequence variation. Genome Res. 15: 1576–1583. O’Reilly, P. F., E. Birney, and D. J. Balding, 2008 Confounding Schlamp, F., J. Made, R. Stambler, L. Chesebrough, A. R. Boyko between recombination and selection, and the ped/pop method et al., 2016 Evaluating the performance of selection scans to for detecting selection. Genome Res. 18: 1304–1313. detect selective sweeps in domestic dogs. Mol. Ecol. 25: 342–356. Pagani, L., D. J. Lawson, E. Jagoda, A. Mörseburg, A. Eriksson et al., Shlyakhter, I., P. C. Sabeti, and S. F. Schaffner, 2014 Cosi2: an 2016 Genomic analyses inform on migration events during the efficient simulator of exact and approximate coalescent with peopling of Eurasia. Nature (in press). selection. Bioinformatics 30: 3427–3429. Paigen, K., and P. Petkov, 2010 Mammalian recombination hot Smith, J. M., and J. Haigh, 1974 The hitch-hiking effect of a spots: properties, control and evolution. Nat. Rev. Genet. 11: favourable gene. Genet. Res. 23: 23–35. 221–233. Stephan, W., Y. S. Song, and C. H. Langley, 2006 The hitchhiking Pavlidis, P., J. D. Jensen, and W. Stephan, 2010 Searching for effect on linkage disequilibrium between linked neutral loci. 
footprints of positive selection in whole-genome SNP data from Genetics 172: 2647–2663. nonequilibrium populations. Genetics 185: 907–922. Thornton, K., 2005 Recombination and the properties of Tajima’s Pennings, P. S., and J. Hermisson, 2006 Soft sweeps III: the sig- D in the context of approximate-likelihood calculation. Genetics nature of positive selection from recurrent mutation. PLoS 171: 2143–2148. Genet. 2: e186. Thornton, K. R., and J. D. Jensen, 2007 Controlling the false- Perry, J. R., F. Day, C. E. Elks, P. Sulem, D. J. Thompson et al., positive rate in multilocus genome scans for selection. Genetics 2014 Parent-of-origin-specific allelic associations among 175: 737–750. 106 genomic loci for age at menarche. Nature 514: 92–97. Wakeley, J., 2010 Natural selection and coalescent theory, pp. Peter, B. M., E. Huerta-Sanchez, and R. Nielsen, 2012 Distinguishing 119–149 in Evolution Since Darwin: The First 150 Years, edited between selective sweeps from standing variation and from a de by M. A. Bell, D. J. Futuyma, W. F. Eanes, and J. S. Levinton. novo mutation. PLoS Genet. 8: e1003011. Sinauer Associates, Sunderland, MA. Petes, T. D., 2001 Meiotic recombination hot spots and cold spots. Wang, J., H. C. Fan, B. Behr, and S. R. Quake, 2012 Genome-wide Nat. Rev. Genet. 2: 360–369. single-cell analysis of recombination activity and de novo muta- Pfaffelhuber, P., A. Lehnert, and W. Stephan, 2008 Linkage dis- tion rates in human sperm. Cell 150: 402–412. equilibrium under genetic hitchhiking in finite populations. Ge- netics 179: 527–537. Communicating editor: R. Nielsen

Appendix: Extended Methods

Designing Simple Selection Statistics

As indicated in the main text, we defined a series of possible selection statistics based on an average measurement of LD ($r^2$) in the along region and/or the over region. For example, two of the simplest statistics we explored were $Z_\alpha$ (Equation 3) and $Z_\beta$ (Equation 4). However, it was often useful to use an alternative to the observed $r^2$ value to control for expected LD. A simple example would be the statistic $Z_\alpha - Z_\alpha^{E[r^2]}$, corresponding to the average $r^2$ in the along region minus the average expected $r^2$ in the along region,

$$Z_\alpha^{E[r^2]} = 0.5\left(\binom{l}{2}^{-1}\sum_{i\in L,\,j\in L} E\left[r^2_{i,j}\right] + \binom{S-l}{2}^{-1}\sum_{i\in R,\,j\in R} E\left[r^2_{i,j}\right]\right), \tag{A1a}$$

where $E[r^2_{i,j}]$ is estimated based on the generated LD profile, described in detail below, and the genetic map distance between loci $i$ and $j$. As a rule, we use superscripts to $Z_\alpha$ and $Z_\beta$ to indicate cases where we are calculating a measure of LD in the along or the over region, respectively, in a manner analogous to $Z_\alpha$ and $Z_\beta$ but based on a quantity other than the observed $r^2$ between loci. We used four other approaches to controlling for expected LD within the along and over calculations,

$$Z_\alpha^{r^2/E[r^2]} = 0.5\left(\binom{l}{2}^{-1}\sum_{i\in L,\,j\in L} \frac{r^2_{i,j}}{E\left[r^2_{i,j}\right]} + \binom{S-l}{2}^{-1}\sum_{i\in R,\,j\in R} \frac{r^2_{i,j}}{E\left[r^2_{i,j}\right]}\right) \tag{A1b}$$

$$Z_\alpha^{\log(r^2/E[r^2])} = 0.5\left(\binom{l}{2}^{-1}\sum_{i\in L,\,j\in L} \log\frac{r^2_{i,j}}{E\left[r^2_{i,j}\right]} + \binom{S-l}{2}^{-1}\sum_{i\in R,\,j\in R} \log\frac{r^2_{i,j}}{E\left[r^2_{i,j}\right]}\right) \tag{A1c}$$

$$Z_\alpha^{ZScore} = 0.5\left(\binom{l}{2}^{-1}\sum_{i\in L,\,j\in L} \frac{r^2_{i,j} - E\left[r^2_{i,j}\right]}{\sigma\left[r^2_{i,j}\right]} + \binom{S-l}{2}^{-1}\sum_{i\in R,\,j\in R} \frac{r^2_{i,j} - E\left[r^2_{i,j}\right]}{\sigma\left[r^2_{i,j}\right]}\right) \tag{A1d}$$

$$Z_\alpha^{BetaCDF} = 0.5\left(\binom{l}{2}^{-1}\sum_{i\in L,\,j\in L} F\left(r^2_{i,j}; a, b\right) + \binom{S-l}{2}^{-1}\sum_{i\in R,\,j\in R} F\left(r^2_{i,j}; a, b\right)\right), \tag{A1e}$$

where, in Equation A1e, $F(r^2_{i,j}; a, b)$ denotes the value, at $r^2_{i,j}$, of the cumulative distribution function of a Beta distribution with parameters $a$ and $b$ [$F(r^2_{i,j}; a, b) = B(r^2_{i,j}; a, b)/B(a, b)$, with $B(r^2_{i,j}; a, b)$ the incomplete beta function and $B(a, b)$ the complete beta function], fitted by maximum likelihood to $r^2$ measurements in the appropriate genetic map distance bin, generated when creating the LD profile. This final approach essentially estimates a P-value for each observed $r^2$ measurement and averages these (which is far more conservative than using Fisher’s method to combine P-values and likely more representative given the strong correlation between LD at nearby pairs of loci).

Analogous quantities are defined when calculating variants of $Z_\beta$, while Kelly’s $Z_{nS}^{E[r^2]}$ is the average expected $r^2$ between all pairs of loci in a window.

The selection statistics we assessed involved simple operations on estimates of LD in the along and over regions and consisted of the following:

1. LD and deviation from expected LD in the along region:

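$$Z_\alpha,\quad Z_\alpha^{r^2/E[r^2]},\quad Z_\alpha^{\log(r^2/E[r^2])},\quad Z_\alpha^{ZScore},\quad Z_\alpha^{BetaCDF},\quad Z_\alpha - Z_\alpha^{E[r^2]},\quad \frac{Z_\alpha}{Z_\alpha^{E[r^2]}}.$$

2. LD and deviation from expected LD in the over region:

$$Z_\beta,\quad Z_\beta^{r^2/E[r^2]},\quad Z_\beta^{\log(r^2/E[r^2])},\quad Z_\beta^{ZScore},\quad Z_\beta^{BetaCDF},\quad Z_\beta - Z_\beta^{E[r^2]},\quad \frac{Z_\beta}{Z_\beta^{E[r^2]}}.$$

3. Kelly’s $Z_{nS}$ and deviation from expected Kelly’s $Z_{nS}$:

$$Z_{nS},\quad \frac{Z_{nS}}{Z_{nS}^{E[r^2]}},\quad Z_{nS} - Z_{nS}^{E[r^2]}.$$

4. The $\omega_{\max}$ statistic and similar constant window-size alternatives that can control for their expected value:

$$\omega_{\max},\quad \frac{Z_\alpha}{Z_\beta},\quad \frac{Z_\alpha - Z_\alpha^{E[r^2]}}{Z_\beta - Z_\beta^{E[r^2]}},\quad \frac{Z_\alpha^{BetaCDF}}{Z_\beta^{BetaCDF}}.$$

5. Methods similar to Kelly’s $Z_{nS}$ with more diverse approaches to controlling for expected LD:

$$Z_\alpha + Z_\beta,\quad Z_\alpha^{r^2/E[r^2]} + Z_\beta^{r^2/E[r^2]},\quad Z_\alpha^{\log(r^2/E[r^2])} + Z_\beta^{\log(r^2/E[r^2])},\quad Z_\alpha^{ZScore} + Z_\beta^{ZScore},\quad Z_\alpha^{BetaCDF} + Z_\beta^{BetaCDF},$$

$$\left(Z_\alpha - Z_\alpha^{E[r^2]}\right) + \left(Z_\beta - Z_\beta^{E[r^2]}\right),\quad \frac{Z_\alpha}{Z_\alpha^{E[r^2]}} + \frac{Z_\beta}{Z_\beta^{E[r^2]}}.$$

6. Alternatives to $\omega_{\max}$ that instead use the difference between LD in the over and along regions:

$$Z_\alpha - Z_\beta,\quad Z_\alpha^{r^2/E[r^2]} - Z_\beta^{r^2/E[r^2]},\quad Z_\alpha^{\log(r^2/E[r^2])} - Z_\beta^{\log(r^2/E[r^2])},\quad Z_\alpha^{ZScore} - Z_\beta^{ZScore},\quad Z_\alpha^{BetaCDF} - Z_\beta^{BetaCDF},$$

$$\left(Z_\alpha - Z_\alpha^{E[r^2]}\right) - \left(Z_\beta - Z_\beta^{E[r^2]}\right),\quad \frac{Z_\alpha}{Z_\alpha^{E[r^2]}} - \frac{Z_\beta}{Z_\beta^{E[r^2]}}.$$

7. The product of average LD in over and along:

$$Z_\alpha Z_\beta,\quad Z_\alpha^{BetaCDF} Z_\beta^{BetaCDF}.$$

8. The number of SNPs in the along and over region, where $|R|$ indicates the cardinality of set $R$:

$$\binom{|L|}{2} + \binom{|R|}{2},\quad |L||R|.$$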
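To make the family of statistics above more concrete, the following minimal Python sketch computes the along-region average of Equations A1a to A1d for an arbitrary per-pair quantity, together with an over-region average. It is an illustration under stated assumptions, not the in-house scripts used for the analyses: the function names, the toy $r^2$ values, and the stand-in values for $E[r^2]$ and $\sigma[r^2]$ are hypothetical, and the form used for $Z_\beta$ (the mean over pairs spanning $L$ and $R$) is assumed to match Equation 4 of the main text.

```python
from itertools import combinations

def within_region_mean(value, loci):
    """Mean of value(i, j) over all unordered pairs of loci within one flank."""
    pairs = list(combinations(loci, 2))
    return sum(value(i, j) for i, j in pairs) / len(pairs)

def z_alpha(value, left, right):
    """0.5 * (mean over pairs in L + mean over pairs in R), the shape shared by A1a-A1e."""
    return 0.5 * (within_region_mean(value, left) + within_region_mean(value, right))

def z_beta(value, left, right):
    """Mean of value(i, j) over pairs with i in L and j in R (assumed form of Equation 4)."""
    return sum(value(i, j) for i in left for j in right) / (len(left) * len(right))

# Toy data (hypothetical): loci 0-1 form L, loci 2-3 form R.
r2 = {(0, 1): 0.8, (2, 3): 0.6, (0, 2): 0.1, (0, 3): 0.1, (1, 2): 0.1, (1, 3): 0.1}
exp_r2 = {pair: 0.2 for pair in r2}  # stand-in for E[r^2] looked up in an LD profile
sd_r2 = {pair: 0.1 for pair in r2}   # stand-in for sigma[r^2] from the same profile
left, right = [0, 1], [2, 3]

observed = lambda i, j: r2[(i, j)]
expected = lambda i, j: exp_r2[(i, j)]
ratio = lambda i, j: r2[(i, j)] / exp_r2[(i, j)]
zscore = lambda i, j: (r2[(i, j)] - exp_r2[(i, j)]) / sd_r2[(i, j)]

za = z_alpha(observed, left, right)                      # Z_alpha
za_minus_expected = za - z_alpha(expected, left, right)  # Z_alpha - Z_alpha^{E[r^2]}
za_ratio = z_alpha(ratio, left, right)                   # Equation A1b
za_zscore = z_alpha(zscore, left, right)                 # Equation A1d
zb = z_beta(observed, left, right)                       # Z_beta
za_over_zb = za / zb                                     # constant-window analogue in item 4
```

Passing the per-pair quantity as a function keeps one averaging routine for every superscripted variant; the Beta-CDF variant of Equation A1e would slot in the same way once a fitted distribution per distance bin is available.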

Many of these statistics rely on the creation of an LD profile, which we now describe.

The LD Profile

The LD profile consists of descriptive statistics of LD measurements between loci separated by a given genetic map distance. Creating the LD profile using simulated data involved generating 1000 samples of n chromosomes, of length 3 Mb, under neutrality and according to the relevant model of recombination and demography. As in our power simulations, loci with MAF < 0.05 were removed, followed by the removal of random loci until the average spacing between polymorphic sites was 2500 bp. $r^2$ values were calculated for all pairs of loci up to a distance of 2 cM, with the genetic map distance between loci based on the true genetic map, the low-resolution map, or physical distance. These LD measurements were assigned to 20,000 bins according to the genetic map distance between the two loci, such that each bin represents 0.00001 cM. Finally, the average LD, $E[r^2]$, for each bin and the standard deviation, $\sigma[r^2]$, were calculated, and a maximum-likelihood fitting of the Beta distribution was performed using the SciPy module (Jones et al. 2001) (scipy.stats.beta.fit), giving values of a and b for each bin. A different LD profile was generated for each combination of sample size, demographic model, recombination model, and assumed known genetic map. LD profiles were constructed for the HapMap data individually for each chromosome and population, using the two different genetic maps (Frazer et al. 2007; Kong et al. 2010).

Genetic Maps

In our simulations with variable recombination rate, we considered three possible scenarios concerning the genetic map. Two of these simply involved using the physical map as a proxy (PhysMap) or providing the real section of the HapMap genetic map according to which the data were simulated (TrueMap). The third one used a lower-resolution version of the true genetic map (LowResMap). This was generated by downsampling the HapMap map by a factor of 15, reducing the average distance between reported map positions from 817 bp to 12,260 bp. Note that the genetic map is still accurate in the sense that it was generated using all loci, but that hotspots will be considerably smoothed out.

Simulations

Simulations were performed using MSMS (Ewing and Hermisson 2010) and Cosi2 (Shlyakhter et al. 2014). Each simulation replicate involved simulating a sample size of n = [20, 40, 80] chromosomes of length 1.5 Mb, which may or may not have been subject to selection at a site located in the middle of the chromosome. The mutation rate was $\mu = 1.7 \times 10^{-8}$ in the constant-size demographic model and followed Gravel et al. (2011), $\mu = 2.38 \times 10^{-8}$, in the out-of-Africa demographic model. When recombination rate was constant this was set to $r = 1.1 \times 10^{-8}$; the variable recombination rate was retrieved as described in Methods. The generation time was taken to be 25 years, as in Gravel et al. (2011).

To approximate SNP panel data and avoid LD measurements involving singletons, loci with MAF < 0.05 in the sample were removed before randomly removing loci until the average spacing between SNPs was 2500 bp. An X = 200-kb region was then defined from positions 650 to 850 kb. We calculated the value of all selection statistics other than $\omega_{\max}$ using in-house scripts at each SNP, using three statistic window sizes (x = [100 kb, 200 kb, 400 kb]). Selection statistics were not calculated if $|R|$ or $|L|$ were under 4 or if $|R||L| < 25$. $\omega_{\max}$ was calculated on a grid using OmegaPlus with a resolution of 2500 bp, using window-size flags -minwin = [1000, 10,000], -maxwin = [100,000, 400,000], and -minsnps = 5. When reporting power, we focus on results for $\omega_{\max}^{10,400}$ in the main text, which tended to perform relatively well.

Test statistics were retrieved as either the maximum or the minimum value of each selection statistic in the 200-kb region, unless SNP diversity was too low to obtain any selection statistic calculations, in which case the replicate was not used in power calculation. Note that the removal of very SNP-sparse replicates will make our power estimate for diversity-based statistics conservative, because most windows removed would have been true positives. We estimate this distortion to generally be of the order of 0.01 in the tables presented in the main text (Table 1, Table 2, and Table 3), rising to a maximum of 0.07 for the OOA scenario with selection starting at $t_1 = 1600$ generations (Table 2).
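The SNP thinning step and the binned LD profile described under "The LD Profile" can be sketched in a few lines of Python. This is a minimal illustration under several assumptions rather than the authors' pipeline: the function names and toy inputs are hypothetical, "average spacing" is taken to mean region length divided by the number of retained SNPs, the population standard deviation is used for $\sigma[r^2]$, and the bin width is derived from the maximum distance and the number of bins passed in.

```python
import random
from statistics import mean, pstdev

def thin_snps(positions, mafs, region_len_bp, rng, min_maf=0.05, spacing_bp=2500):
    """Drop loci with MAF < min_maf, then randomly drop loci until the
    average spacing (region length / SNP count) is at least spacing_bp."""
    kept = [p for p, f in zip(positions, mafs) if f >= min_maf]
    target = region_len_bp // spacing_bp
    if len(kept) > target:
        kept = rng.sample(kept, target)
    return sorted(kept)

def build_ld_profile(pairs, max_dist_cm=2.0, n_bins=20000):
    """Bin (genetic distance in cM, r^2) pairs and summarise each bin."""
    width = max_dist_cm / n_bins
    binned = {}
    for dist_cm, r2 in pairs:
        if dist_cm < max_dist_cm:  # only pairs up to the maximum distance
            binned.setdefault(int(dist_cm / width), []).append(r2)
    # The appendix also fits a Beta distribution per bin with
    # scipy.stats.beta.fit; that step is omitted here to stay dependency-free.
    return {b: {"mean": mean(v), "sd": pstdev(v), "n": len(v)}
            for b, v in binned.items()}

# Toy usage (hypothetical data).
rng = random.Random(1)
positions = list(range(0, 100_000, 100))
mafs = [0.2 if i % 2 == 0 else 0.01 for i in range(len(positions))]
thinned = thin_snps(positions, mafs, 100_000, rng)

toy_pairs = [(0.00005, 0.4), (0.00006, 0.6), (0.00025, 0.1), (2.5, 0.9)]
profile = build_ld_profile(toy_pairs)
```

A dictionary keyed by bin index keeps the profile sparse, which is convenient when many of the bins receive no pairs in a small data set; the expected LD for a pair of loci is then looked up from the bin containing their genetic map distance.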
Two demographic scenarios were used, one with constant population size ($N_e = 10^4$) and one following an out-of-Africa model (Gravel et al. 2011) with samples taken from the European population. To obtain selective sweep trajectories under both demographic models in MSMS, we used two types of selection scenarios. In the first one, selection of strength s begins (pastward in time) at $t_0$ generations with an allele frequency of $q_{t_0}$. The time at which the selection phase of the model ends, $t_1$ generations, corresponds to the time at which the de novo selected allele first appears and is determined stochastically when MSMS generates the selected allele frequency trajectory on which later coalescent simulations are conditioned. This approach is used when the population size is constant. The second method involves specifying s and times $t_0$ and $t_1$, as well as the frequency of the selected allele when the selection phase ends, $q_{t_1}$. This allows for selection on standing variation, $q_{t_1} > 1/(2N_e)$.

This time, $q_{t_0}$ is determined by the generated selected allele frequency trajectory. We use this approach when applying the OOA demographic model and when exploring selection on standing variation.

The selection scenarios investigated used an additive selection model and a selective advantage of s = [0.01, 0.02, 0.04, 0.08] for the homozygote. For the constant population size demographic model, we conditioned our hard sweep selection simulations on final allele frequency, $q_{t_0}$ = [0.3, 0.5, 0.7, 0.9, 0.99] with $t_0 = 0$ or $q_{t_0 = 800}$ = [0.99]. For the out-of-Africa model, we conditioned selection simulations on the starting allele frequency $q_{t_1} = 0.0005$ and with $t_1$ = [200, 400, 800, 1600] and $t_0 = 0$ and further removed simulations in which the selected allele became extinct such that $q_0 \neq 0$. The ROC curves from these scenarios were used to calculate the selOOA pAUC performance measure (Figure 3), with otherwise similar supplementary runs using $t_1$ = [2400, 3200, 4000] performed to assess the decay of the LD signal (Table 2).

In the OOA model, even the low initial frequency chosen, $q_{t_1} = 0.0005$, led to some soft sweeps (in which not all lineages with the selected site have coalesced at the start of selection), especially when n and s were large and $t_1$ small. Using the “-oOC” flag in MSMS and further simulations, we estimated that 8% of selective sweeps used to calculate OOA pAUC metrics shown in File S1, Figure S1 and Figure 3A were soft sweeps; <1.5% of those used to obtain the power calculations shown in Table 3 were soft.

We also explored, in less detail, primarily soft sweeps acting on standing genetic variation. Here, sample size was n = 40, window length x = 200 kb (in the case of the $\omega$ statistic, we calculated $\omega_{\max}^{10,400}$), and selection strength s = [0.01, 0.04]. Initial allele frequency was $q_{t_1}$ = [0.01, 0.05, 0.1], with the selection phase ending, pastward, at times $t_1$ = [200, 400, 800, 1600].
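The pAUC performance measure referred to above is the area under a ROC curve restricted to low false-positive rates (FPR between 0 and 0.05 in this appendix). The sketch below shows one way such a value could be obtained from per-replicate test statistics; it is an illustrative assumption rather than the authors' script. The function names and toy scores are hypothetical, large values are treated as evidence for selection, and the area is left unnormalised, so its maximum is 0.05.

```python
def roc_points(neutral, selected):
    """Empirical ROC curve: one (FPR, TPR) point per distinct threshold,
    calling a replicate 'selected' when its score is >= the threshold."""
    thresholds = sorted(set(neutral) | set(selected), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        fpr = sum(x >= t for x in neutral) / len(neutral)
        tpr = sum(x >= t for x in selected) / len(selected)
        points.append((fpr, tpr))
    return points

def partial_auc(points, max_fpr=0.05):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Cut the final segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Toy usage (hypothetical scores): 100 neutral replicates, 10 with selection.
neutral_scores = list(range(100))
strong_signal = [200] * 10  # every selected replicate exceeds all neutral ones
no_signal = [-1] * 10       # every selected replicate falls below all neutral ones

pauc_strong = partial_auc(roc_points(neutral_scores, strong_signal))
pauc_none = partial_auc(roc_points(neutral_scores, no_signal))
```

For statistics where small values indicate selection, the same functions can be applied to negated scores, which corresponds to using bottom rather than top outliers.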
The program MSMS was used for all simulations in which the recombination rate did not vary, while Cosi2 was used for simulations with a variable recombination rate. The allele frequency trajectories used to simulate selection in Cosi2 were generated using MSMS. Example MSMS scripts [checked using PopPlanner (Ewing et al. 2015)] are shown in File S1, Table S4, while Figure A1 gives a schematic of the OOA demographic model, indicating the populations that may be under selection (depending on $t_1$).

ROC curves were calculated for each statistic by comparing 1000 neutral replicates with at least 300 replicates involving selection. As the statistics we used employed different signatures to detect selection, four ROC curves were calculated, based on top or bottom outliers of the maximum or minimum statistic value in the 200-kb region indicating selection. These were used to determine the pAUC between an FPR of 0 and 0.05.

When assessing the performance of statistics, we did not want to make assumptions about the direction of deviation indicating selection or the statistic window size used in the selection scan. We therefore chose the maximum pAUC for each statistic at a sample size of n = 40 (usually three window sizes with four pAUC values each, so the maximum of 12 pAUC values) as its measure of performance under a given selection, recombination rate, and demographic scenario.

To summarize the performance of statistics under different selection models, we separated the selection scenarios into three groups, low frequency, high frequency, and selOOA, as detailed in Methods. We averaged the maximum pAUC across scenarios in these groups to give an overall indication of average statistic performance. Note that we implicitly give equal weight to each selection coefficient; often power was extremely high when s = 0.08 and low when s = 0.01. In the case of selOOA, we did not include those scenarios for which the average frequency of the selected allele at sampling was low, $q_0 < 0.05$, corresponding to s = 0.01 with $t_1$ = [200, 400] and s = [0.02, 0.04] with $t_1 = 200$. When calculating pAUC given selection on standing variation, we excluded only the $t_1 = 200$, $q_{200} = 0.01$, s = 0.01 scenario.

Defining Selection Candidates and Assessing Signal Replication

To compile a list of previously suggested selection candidates, we searched for a range of studies performing selection scans based on the site frequency spectrum (SFS) or population differentiation. We identified eight appropriate selection scans, summarized in Table A1. We converted the suggested signal locations into single 200-kb regions by identifying which 200-kb region overlapped the central point of the signal or the SNP reported. Very few regions were identified in multiple studies (the 72 signals identified for European Chr 2 corresponded to 66 unique regions, with corresponding numbers 17/21 and 18/21 in European Chr 15 and Asian Chr 2, respectively), although there were obvious clusters of signals. A complete table of signals is included in File S1.

We classed 200-kb selection candidate regions that were in the top (bottom for the $|L||R|$ statistic) 5% of 200-kb regions for a given statistic to be successful in replications. To give an indication of how unexpected the statistic values of selection candidate regions were, we resampled the same number of randomly located 200-kb regions 10,000 times and compared the observed statistic value, rank of candidate windows, and number of outliers to this set. We noted that signals were often clustered and that peaks in several of the LD-based statistics often included several consecutive 200-kb regions. We therefore approximated this clustering in our resampling regime, copying the distribution of consecutive regions seen in the candidate signal data. For example, in the European chromosome 15 candidate data there were nine solitary regions, two runs of two consecutive regions, and a single run of four consecutive regions, and when resampling we followed this pattern.

When assessing the ability of different statistics to replicate the 200-kb selection candidate regions, selection statistics were calculated for each SNP in the region, using an x = 400-kb statistic window for all selection scans apart from $\omega$, for which $\omega_{\max}^{50,400}$ was used. These statistic window sizes were chosen for two reasons. First, large regions of the chromosomes in the HapMap data were SNP sparse and could not yield statistic scores for small windows. Second, a 400-kb window appeared to capture selection signals around LCT and SLC24A5 more effectively (results not shown). However, the x = 400-kb statistic window size differs from that used to calculate power results shown in Table 1, Table 2, and Figure 4. Simulations using an x = 400-kb window size were conducted and are incorporated into the pAUC calculations. The different window sizes for $\omega$ were not observed to strongly affect replication results; see Table S12 in File S1.

To assess whether 200-kb regions representing signals containing protein-coding genes were preferentially replicated, we focused on regions suggested based on European chromosome 2 data. These were split into two groups, those that contained protein-coding genes and those that did not, based on the hg18 RefSeq Genes track refGene table accessed through the UCSC table browser (genome.ucsc.edu/cgi-bin/hgTables), before being assessed for signal overlap with our LD-based methods as usual.
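The clustered resampling scheme described above (drawing random 200-kb regions while copying the observed pattern of consecutive regions) can be sketched as follows. This is one possible reading of the procedure, not the authors' code: the function names, the total number of regions, and the seed are hypothetical, and we additionally assume that resampled runs must not overlap or sit directly next to each other, so that the run-length pattern is preserved exactly.

```python
import random

def resample_runs(n_regions, run_lengths, rng, max_tries=10_000):
    """Place runs of consecutive region indices at random positions,
    keeping at least one free region between any two runs."""
    occupied = set()
    runs = []
    for length in sorted(run_lengths, reverse=True):  # place long runs first
        for _ in range(max_tries):
            start = rng.randrange(n_regions - length + 1)
            # One region of padding on each side stops runs from merging.
            if not occupied.intersection(range(start - 1, start + length + 1)):
                occupied.update(range(start, start + length))
                runs.append(list(range(start, start + length)))
                break
        else:
            raise RuntimeError("could not place all runs; too few regions")
    return sorted(runs)

def run_lengths_of(indices):
    """Lengths of maximal runs of consecutive integers in a sorted list."""
    lengths, current = [], 1
    for prev, nxt in zip(indices, indices[1:]):
        if nxt == prev + 1:
            current += 1
        else:
            lengths.append(current)
            current = 1
    return lengths + [current] if indices else []

# Pattern quoted in the text for European chromosome 15:
# nine solitary regions, two runs of two, and one run of four.
pattern = [1] * 9 + [2] * 2 + [4]
rng = random.Random(42)
sample = resample_runs(500, pattern, rng)  # 500 regions is a hypothetical count
flat = sorted(i for run in sample for i in run)
```

Repeating the call many times (10,000 resamples in the text) and recording the statistic values, ranks, or outlier counts of each resampled set gives the reference distribution against which the candidate regions are compared.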

Figure A1 Schematic diagram of the out-of-Africa model of Gravel et al. (2011), with a simulated frequency trace of a positively selected allele shown at the top. The horizontal axis represents time, running pastward from right (the present) to left. For population size and migration parameters, see table 2 in Gravel et al. (2011). Three demographic time events are indicated: the time of an ancient African bottleneck, $T_{AF}$; the time of the out-of-Africa bottleneck, $T_B$, involving the Eurasian population splitting from the African population, with modern Yoruba (YRI) descended from the latter; and the time of the split of the Eurasian population into Europeans (CEU) and East Asians (CHB+JPT), $T_{EuAs}$. Two selection time events are indicated: the time at which selection begins, pastward in time, $t_0$; and the time at which it ends, $t_1$. Depending on the values of $t_0$ and $t_1$, selection acts in different populations; the populations that would be subject to selection if $t_1 > 5920$ (148,000 years) are colored orange. Note that the exact implementation of the model when $t_1 > 2040$ (51,000 years) is modified slightly; see File S1, Table S4. In the frequency trace, s = 0.005, $t_1 = 3200$, and $t_0 = 300$; in this simulation, the frequency of the allele at $t_1$ is $q_{3200} = 0.0005$; at $t_0$ it is $q_{300} \approx 0.75$; and at present, after 300 generations of random drift, $q_0 \approx 0.73$.

Table A1 Selection scan studies used to define the selection candidate signal set

| Reference | Data | Relevant population | Method | Class | Outliers reported | European Chr 2 | Asian Chr 2 | European Chr 15 |
|---|---|---|---|---|---|---|---|---|
| Nielsen et al. (2005) | HapMap SNP data | CEU | CLR | SFS | 23 (Chr 2) | 23 | | |
| Carlson et al. (2005) | Perlegen SNP data | European and Asian | Tajima’s D | SFS | 23 (European), 29 (Asian) | 2 | 5 | |
| Oleksyk et al. (2008) | Low coverage (184,000 SNPs) | European | S2Fst | Differentiation | 162 | 13 | | 5 |
| Pickrell et al. (2009) | CEPH SNP data | European and East Asian | CLR | SFS | 10 (European), 10 (East Asian) | 1 | 1 | 1 |
| Chen et al. (2010) | HapMap phase II SNP data | European | XP-CLR | Differentiation, SFS | 40 (CEU vs. YRI), 40 (CHBJPT vs. YRI) | 3 | 3 | 4 |
| Ronen et al. (2013) | 1000 Genomes Project | European | XP-SFselect | Differentiation, SFS | 40 (Europeans vs. Africans) | 5 | | 2 |
| Colonna et al. (2014) | 1000 Genomes Project | European and Asian | ΔDAF | Differentiation | 110 SNPs (Europe), 110 SNPs (Asian), 73 SNPs (CEU) | 19 | 12 | 5 |
| Pagani et al. (2016) | Full genome data | European | Tajima’s D | SFS | 65 (top 0.5%, SW Europeans) | 6 | | 4 |
| Total | | | | | | 72 | 21 | 21 |

GENETICS

Supporting Information www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.185900/-/DC1

Refining the Use of Linkage Disequilibrium as a Robust Signature of Selective Sweeps

Guy S. Jacobs, Tim J. Sluckin, and Toomas Kivisild

Copyright © 2016 by the Genetics Society of America DOI: 10.1534/genetics.115.185900  

 !"#$$  % &'()0123456789@10)A'3B9C2CD844@26E9F969AD

GHIPQRSTUQVWTXYUSRT`abQcd`cXSaeQcQWYURWd`cRQWRQV WRSRTWRTeW 8fghgip4i@2C3g@gqi93D2q2'6gDq43ig2qgr9st96hig6gCqgD9C 825Ag)u08fg69@9A2ihgip4i@2C3g4p6q2q96q936qf2q34Cqi4Ap4i gvhg3qgDwx9Cqfgw4y€g62h2CD8i'g2hF2i925Agig34@59‚ C2q94Ci2qg63gC2i94696gF9DgCq08fg34C696qgC3(4p6q2q96q936qf2q D4C4q34Cqi4Ap4iig34@59C2q94Ci2qgƒg0s0„ 7„†7EgAA(‡6„ˆ‰7 @gs2‘A'6gq30’4Fgiqfgqfiggig34@59C2q94Ci2qg63gC2i946 s9Fg2C9CD932q94C4pqfg34C696qgC3(4pqfghgip4i@2C3g@gq‚ i93726qfg6gh4ygi266g66@gCq6ygigƒ'CCg3g662i9A(’ighg2qgD0 “f9Ag4CA(qy4Dg@4si2hf9363gC2i946ygig266g66gD7yg34C‚ 69DgigD@2C(F2i92q94C64C69@hAgwx‚526gD6q2q96q9360”66'3f7 p'AAh4ygiig6'Aq62igC4qhig6gCqgDfgig5'q2ig2F29A25Agpi4@ qfg2'qf4i64Cig•'g6q0 ‘gip4i@2C3g@gqi93632A3'A2qgDs9FgC6gAg3q94C4C6q2CD9Cs F2i92q94C2ig6f4yC9C825Ag)–08fg6g2igC4qD9ig3qA(34@h2‚ i25Agq4F2A'g6s9FgCp4i4'i69@'A2q94C64pƒhi9@2i9A(’f2iD 6yggh69C825Ag)u26qfgi2Csg4p6gAg3q94C63gC2i94669@'A2qgD y266@2AAgi0—4ygFgi7qfg(2ig'6gp'Ap4i@2B9CsigA2q9Fg6q2qg‚ @gCq6254'qqfghgip4i@2C3g4pwx‚526gD6q2q96q9360r9i6qA(7qfg C4FgA6q2q96q93qf2qyggvhA4ig@46qqf4i4'sfA(9Cqfg@29Cqgvq7 „ 7Dg@4C6qi2qg6igA2q9FgA(f9sfhgip4i@2C3gƒ2A646gg23q'2A h4ygi32A3'A2q94C69C825Ag6t2CD)t’0)g34CDA(72Aqf4'sfqfg6g ƒhi9@2i9A(’64pq6yggh6Ag2Dq4D9ppgigCq6gAg3qgD2AAgAgpig•'gC3( 2q62@hA9Csq9@gƒ6gg825Ag6t2CD)tp4ig6q9@2qg64p2Fgi2sg ˜™’9qD4g62hhg2iqf2qqfg6g6q2q96q9362ig34C69Dgi25A(Ag66h4y‚ gip'AyfgC6gAg3q94C23q64C6q2CD9CssgCgq93F2i92q94C2CDqfg 6gAg3qgD2AAgAg962qigA2q9FgA(f9sfpig•'gC3(0 ”69CD932qgD9CqfgdvqgCDgDgqf4D6ƒ@29Cqgvq”hhgCD9v’7 qfg6ghgip4i@2C3g@g26'ig6g•'2AA(yg9sfqqfgD9ppgigCq6gAg3‚ q94C63gC2i946ƒ9Cqfg326g4p825Ag)u72DD9q9Fgh469q9Fg6gAg3q94C 4Cqfgf4@4e(s4qgfg h™0™ui™0™–i™0™ji™0™klip4i825Ag)–7 fgh™0™ui™0™jl’0”6@9sfq5ggvhg3qgD7@46q6q2q96q936D96hA2(gD Fgi(f9sfh4ygip4i63gC2i9469CF4AF9CsFgi(6qi4Cs6gAg3q94C qf2q964pqgC'Cig2A96q9376'3fqf2qqfghgip4i@2C3gg6q9@2qg96 @46q'6gp'Ap4i34@h2i9Cs6q2q96q936i2qfgiqf2C2626qi93q9C‚ D932q94C4pqfg9ih4ygiyfgC2hhA9gDq4ig2AD2q20m4qg2A64 qf2q6gAg3q94C96632AgD5(h4h'A2q94C69eg9C342Ag63gCq69@'A2‚ 
q94C676'3fqf2qqfgh2i2@gqgi–nofDg63i95g66gAg3q94C6qigCsqf 2CDC4qf9q6gAp08f'679pno96A2isgiqf2Cqf2q4pf'@2C69C2 C2q'i2Ah4h'A2q94Cƒpqrfrstuvwxovwˆrywfzoqp4igv2@hAg’qfgC 6yggh632'6gD5(yg2Bgih469q9Fg6gAg3q94C2ig@4igA9BgA(q4 5g9DgCq9{gD0

–™ &'()0123456ozwv|   !" #$"%&'$"'(#$) !"((####$#$"%)(( 0'$"! " $!#$!#"!$12'#'"(3 !45'678%49&@78!42378$!$"# !"&'!)$( " $!#$!#&($!#'"$A0!#$" &3 #)$(("("3(#$!0##$#$"1B3 $ $(##$#$"CD (%#'#'# 3$)$#6#'#'# E$!### ##"!#(D"#9F$! GHI$!$"#P#'A#QR##$#$"#'#'#!#"!#()$(" $!#$!#1S #$(!##$#$" A!$#$!!$ 3(#$!"!$7#'!#'!$D1

9$!E0$T3$($$3 $0!#3("#$! UQ STEI(PUVW3A3CRI$#EX1 4357'## ##$#$"(T'2"#$!56$7!"!#!#%'#$!$& 4 ()01235 6E$" '!'"#$!(Y13A345"#$!6 #!$!66!#$"7$#$!AQ'"#$!$"!$#$!!#E $!$#$''''`%!"H8 !##'"#$!I$!#$ R   !" #$"%"!#!# 2  9RI$#E9 1 3!76'''`%!"H# '$!6R8R $&%'#$! '() 123456$7!'"#$!!#!$!6 2 3 3 0 #$ #!a4C33$ %'##b"#$$!IE$"E#E 6!#$"7$#$!(8 1 @3A32B3A3CB3A2D5B#E $!#F# 92 '"#'''$!#6F#$!"#A !$F#$'!"!#$%#$!6'"#$!"!$AG# #E##E7'%!#$"#'H" '##EEI! $!P$6%Q2R!I!6'"#$!"!$I F'$!#E$!'H$

cc d%HQAe"09fgh   !"#$ h12€76241Rfn oRw()0)015@9126)(265FF15(9356)(2(8‰ƒ "%& H„377)15F)(IR716F354)(1226)3CB)6)3B)3Cw106h3(@2ƒ r7)0(@40)01B)1xqC06541F1C6€35’“@673)6)3h17€F3Bq '()0)01(23435678(29@76)3(5(8)01AB)6)3B)3CDE3965FG317B15 C@BB1F6w(h13B5()6x62)3C@7627€4((F6xx2(y396)3(5(8)01 HIIPQR F1C6€(8’“„3)0F3B)65C1R)011BB15)367x(35)q)06)„35F(„B3 1 C(5B)2@C)3(53B73•17€)(B)2(547€39x6C))01h67@1(8‰ƒDHQq3B U XYU Y` H H TTHVWTH VV TabRcdefbRcWabRcdgfbRcV C71627€377@B)26)1Fƒ AS RDH21h3B3)1FQ Y` H TUTXYUVV abdeRcdgfbRc pq" rsrs!" 65F)01i9146p7@B415(91qBC65h12B3(5Dr76C03()3BstuUvHI`HwQ ”(63F21q39x71915)6)3(5(8(@291)0(FBR„1x2(h3F11y69x71 6))19x))(96y393B1)01h67@1(8Aw€h62€354)01415(93C21q t‚t‚C(9965F7351B35”6w71‚Pƒ 43(5@B1F)(F151)01B1)B(8‚GpBe65Fgƒi9146p7@B67B(67q 7(„B8(2h6236w71(h12677B)6)3B)3C„35F(„B3 1BR)01)()67415(93C 2143(5C(5B3F121Fw€)01B)6)3B)3C)06)C(5)635B)01X‚GpB)06) u!!"" 621x62)3)3(51F35)(e65FgR@B354)01C(9965F†64B‡q935„35ˆ 65F‡q96y„35ˆƒ‰6C06xx2(6C035h(7h1BC(9x62354B1h12671h67q v(9x71)173B)B(8)01B171C)3(5C65F3F6)1B3F15)31F82(9()012 @6)3(5B(8‰ƒH)(5FA96yƒ‘10121C(5B3F12)011yx1C)1F B)@F31BDG317B15stuUvHIIwgv627B(5stuUvHIIwgi71•B€•stuUv h67@1(8)01B11h67@6)3(5B@5F126B39x731F21x21B15)6)3(5(8 HIIxgp3C•2177stuUvHIIygv015stuUvHI`Igz(515stuUvHI`{g 51@)267’“x6))125BR65FB@441B))06))011yx1C)1Fh67@1(8A v(7(556stuUvHI`Pgp64653stuUvHI`|Q62173B)1F35”6w71B‚wR „377(8)15h62€F1x15F354(5„35F(„B3 1ƒ”03B96•1B)01h67@1 ‚|65F‚}ƒj(216C0C65F3F6)1R6C(21x(B3)3(5„6BF151F6B (8A96y9(21F38C@7))(35)12x21)65Fx21C7@F1BC(9x623B(5(8 13)012)0121x(2)1F‚Gpx(B3)3(5(2)01C15)21(8)0121x(2)1F A96yh67@1B(w)6351F@B354F3881215)„35F(„B3 1x62691)12Bƒ 2143(5ƒ~851C1BB62€R)03BC(21x(B3)3(5„6BC(5h12)1F)(Gv'~ w{|C((2F356)1B@B354)01v‚v’38)(h12)((7ƒ”01B171C)3(5 C65F3F6)12143(5„6BF151F6B)01HII•w2143(535C7@F354 )03B‚Gpƒ€151B„0(B1)265BC23x)B(h1276x16C02143(5„121 3F15)31F@B354)01v‚v”6w71'2(„B12D)26C•z18‚1€151BQƒ

‚uƒUs„X †X‡uˆ‰XŠufsbˆeU‹‰s‰bˆX‹ŒŒUssˆtuUŽutsfbuUbUsX

–—˜‚C0196)3C(8)01B)1xqF1C6€35’“F1BC23w1F35)01 )1y)R65F)„(1y69x71B)6)3B)3C„35F(„B3 1C(5B)2@C)3(5B 716F354)(F3881215)1yx1C)1Fh67@1B(8Aƒ”010(23 (5)67w62 D012142115Q35F3C6)1B)01C02(9(B(91R„3)0)01€177(„65F x@2x71)236547121x21B15)354)0196)23y(81yx1C)1Fx632„3B1 ’“h67@1Bw1)„1157(C367(54)01C02(9(B(91ƒ™177(„B0(„B x(B3)3h1’“w1)„115‚GpBB1x626)1Fw€6F3B)65C1de(271BBR „0371x@2x7135F3C6)1B5(’“ƒ‘015)01„35F(„B3 13BfIR )01h67@1(8A3B1yx1C)1F)(w1`35)03BB39x71377@B)26)3h1 BC15623(g„015)01„35F(„B3 13Bf`S deRA3B1yx1C)1F)( w1Hƒ”01‡r7(54ˆ65F‡ih12ˆ2143(5B62135F3C6)1F6Bh65Fi 21Bx1C)3h17€ƒ

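An intuitive understanding can be built by considering simple representations of the decay of LD with distance between loci. The most basic case is diagrammatically illustrated in Figure S1. We consider a model by which $r^2$ is a constant value, greater than 0 and at most 1, between all loci separated by a threshold distance or less, and 0 for all loci separated by a distance greater than the threshold. When the statistic window size is small, no larger than the threshold distance, all LD measurements will equal this constant and Eq. 2 will evaluate to 1. When the window size is moderately large, between one and two times the threshold distance, the numerator of Eq. 2 will evaluate to 1 while the denominator is less than 1; $\omega$ is correspondingly greater than 1. When the window size is twice the threshold distance, shown in Fig. S1, $\omega$ will be 2. Finally, as the window size is allowed to become very large, far exceeding the threshold distance, both the numerator and denominator of Eq. 2 will tend to 0, leading to erratic statistic behaviour. Although the step-change decay in LD qualitatively discussed above is not a particularly good approximation of the decay of LD with distance, the essential point, that window size construction is likely to strongly impact the value of Eq. (2), is clearly illustrated.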
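The window-size dependence described above can be checked numerically with a short Python sketch. It evaluates the Kim and Nielsen ratio (mean pairwise LD within the left and right SNP sets, divided by mean pairwise LD between them) under the step-LD model. The helper names, the evenly spaced integer SNP positions, the split at the window centre, and the LD constant of 1 are all assumptions made for illustration.

```python
from itertools import combinations


def step_r2(x, y, threshold, level=1.0):
    """Step-LD model: constant LD up to the threshold distance, 0 beyond."""
    return level if abs(x - y) <= threshold else 0.0


def omega(left, right, threshold):
    """Ratio of mean within-set LD to mean between-set LD (Eq. 2 form)."""
    within = [step_r2(a, b, threshold) for a, b in combinations(left, 2)]
    within += [step_r2(a, b, threshold) for a, b in combinations(right, 2)]
    between = [step_r2(a, b, threshold) for a in left for b in right]
    return (sum(within) / len(within)) / (sum(between) / len(between))


def window(n_side, spacing=2):
    """Evenly spaced loci on each side of the window centre (position 0).
    The window size is 2 * n_side * spacing."""
    right = [spacing * k + spacing // 2 for k in range(n_side)]
    left = [-x for x in right]
    return left, right


threshold = 400
small = omega(*window(100), threshold)     # window size 400 = threshold
moderate = omega(*window(150), threshold)  # window size 600
double = omega(*window(200), threshold)    # window size 800 = 2 * threshold
```

With these settings the small window gives a ratio of exactly 1, the intermediate window a value between 1 and 2, and the window of twice the threshold distance a value close to 2 (slightly below, because the loci are discrete).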



GHIPQRSTU##$&0&$&0$VV$$&&&W&X2 b6 VY`"5"a" 5A&&'$$VVc0&'de`" W@0V1V b6 b6 $de`"44C4f0&"5Ca" $(5B3a" &$&&c 5)&VV&& $$&&0&&W&'XV0& X $0$X0W0$$ &0&&0X0&Xc058@@1&&0$&"($B0&0 &$1V T0&gT0$T&1&5h&XXV@@1XE9" &WcW2$ &0&00&"($B&$&2@@11V2@@11VgT0&gT0$T&1& &c W@@1XE$X &V&V&2@@10i"p2@@1 0i(p$cT0&0i(p21V0i(pX &&VVc0 &'5F&&&V@@1$c$VVc0&V"&W85C


In the main text, we only presented replication results for a subset of the selection candidates used, and their resampled P-values for selection statistics are provided in Supplemental Material, File S2. We here compare several more scans and present further results on the ability of different statistics to replicate selection candidate signals.

Selection scans for HapMap Phase II Chromosome 2

Figures S2 and S3 supplement the main text Figure 5 in showing selection scans of HapMap Phase II Chromosome 2 (CEU and CHB+JPT respectively). Note the reduction in the signal around MCM6/LCT when recombination rate is controlled for in Chr2 (CEU) data; the signal at EDAR in Chr2 (CHB+JPT) data does not appear to change radically, but is never identified as an extreme outlier.

Comparing signal replication between candidate regions containing and not containing protein-coding genes

Our selection candidate region list was longest for European Chromosome 2, consisting of 65 regions of which 39 contained protein-coding genes and 26 did not. We considered it plausible that more selection candidates would be replicated for the set of regions that contained protein-coding genes, as the mechanisms through which variation in these regions can have a phenotypic effect are better known. We did not find support for this pattern, as shown in Table S8. An exception was the diversity measure, with candidate regions containing protein-coding genes containing fewer SNPs, potentially reflecting the longer-term signal of repeated selective sweeps or of purifying selection.

Comparing the deCODE and HapMap genetic maps

Some studies have emphasised the potential for positive selection to distort estimates of the genetic map based on linkage disequilibrium (O'Reilly et al. 2008). We therefore expected selection statistics that controlled for expected LD based on the HapMap genetic map (Frazer et al. 2007) to lose signals that might be recovered when using the deCODE map (Kong et al. 2010). As illustrated in Tables S9 and S13, this was not obviously the case. While the value of each statistic was larger when using the deCODE map, the expected maximum 200kb region value also increased and the number of replicated signals (5% outliers) usually remained stable or decreased. We often observed a slight increase in statistics focussing on the Along region and controlling for LD when using the deCODE map. We attribute this to the lower resolution of the deCODE map. When the genetic profile was being estimated, many pairs of SNPs separated by a relatively large physical distance, and with relatively low pairwise LD, were assigned a very small genetic map distance. This leads to substantially reduced estimates of average LD in short-range bins, see Fig. S4, which tended to create the impression of inflated deviations from expected LD for some statistics.

Example selection scans using the $Z_{\alpha}^{r^{2}/E[r^{2}]}$ statistic and both the deCODE and HapMap genetic maps are shown in Fig. S5. As expected, the two scans are highly correlated. There is no clear effect on the strength of the high-confidence selection signals (MCM6, HERC2, SLC24A5); for empirical P-values of statistics using the different genetic maps to control for expected LD, see Table S13.

Figure S4 LD profiles showing average $r^2$ given genetic distance, generated for Chromosome 2 (CEU population) using the HapMap and deCODE genetic maps.
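The effect of map resolution on the LD profile can be illustrated with a small Python sketch. It bins SNP pairs by genetic distance and averages $r^2$ within each bin; the function name, the bin width, the units (cM) and the toy data are all hypothetical and are not taken from the study.

```python
def ld_profile(pairs, bin_width):
    """Average r^2 per genetic-distance bin.

    pairs: iterable of (genetic_distance, r2) tuples.
    Returns {bin_index: mean r2}, where bin i covers
    [i * bin_width, (i + 1) * bin_width)."""
    sums, counts = {}, {}
    for dist, r2 in pairs:
        b = int(dist // bin_width)
        sums[b] = sums.get(b, 0.0) + r2
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sorted(sums)}


# Toy data: two close, high-LD pairs and two distant, low-LD pairs.
fine_map = [(0.001, 0.9), (0.002, 0.8), (0.052, 0.1), (0.057, 0.1)]
# A coarser map (assumed here) assigns the distant pairs a tiny distance.
coarse_map = [(0.001, 0.9), (0.002, 0.8), (0.003, 0.1), (0.004, 0.1)]

fine_profile = ld_profile(fine_map, bin_width=0.01)
coarse_profile = ld_profile(coarse_map, bin_width=0.01)
```

In this toy case the shortest-range bin averages about 0.85 with the finer map but about 0.475 with the coarser one, which mirrors the reduced short-range estimates of average LD described above.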


FGHIPQA 5 A4  R6    5)' !! 0)23#' '   "  955  CQ' S 66 4!@ 6 5 ' 1' " @@B6   @ 5ET 4 ' '' 4!@ 6ET4 'SC


9H7  I6B 6!76 8   0P !    " ()012234(565 78  9@A4 94  9 BCAD9E8F8 #$%&' !"   G84  9H

IPQRST 8685 UV9 W 9  99 X 565 78  9@A4 9 E8F89 9 BCAD  G8HY 88 54  X5 65B 5GG &`BDab9 X 54 B 5GG cd`BDabH

As detailed in the main text Appendix, we investigated a range of different approaches to modifying selection statistics such that expected LD was controlled for. We present results on the ability of different approaches to replicate signals at selection candidates in Tables S10 (based on LD in the "Along" region) and S11 (based on LD in the "Over" region). The most interesting result is the extent to which different approaches differ in the number of signals they replicate, despite being based on similar data. The selection scans themselves are shown in Figures S6 and S7 respectively, which give an indication of the extent of correlation between the different statistics.

The ability of OmegaPlus to replicate previously identified selection candidates using different window sizes is shown in Table S12, as is the performance of the $Z_{\alpha}^{r^{2}/E[r^{2}]}/Z_{\beta}^{r^{2}/E[r^{2}]}$ statistic, which uses a fixed window size to detect a similar signal but attempts to control for expected LD. The different maximum and minimum window sizes have a strong impact on the raw maximum value of $\omega_{\max}$ in a 200kb region. Larger values occur when the minimum window size is smaller, which likely reflects both observations made in this Supplemental Material, section S2, and the potential for greater variance in the denominator of $\omega$, see Eq. (2), to generate spikes in its value. Nevertheless, the ability of the OmegaPlus algorithm to replicate signals was not greatly affected.

C&(u%6#"&!#€B&'&G7&B0%!%!&6(%$B7#89!#'#@$&8( f‚    !"#!$!%&%!&'!%!()$0' "#!& && 1 !%2&3$ ! 45 6778 9@&3' 3($ &&'! ABCDEFGHA$&IP(AQRHA$&IP! AQR HA$&STP9


"7&82!26& 2   9@      !" #$%&&'  &(    ' )0 #1(!" &&&2 '%& 3# 45 67789@&3(3)#&&( ABCDEFGHA#&IP)AQRHA#&IP AQR HA#&STP9


"Q'W2!26' 2   UU   !"#$%& &!&%'!$'($ (!$#$( )&(0()0$1$2!()&$!$'! $!&'$&!$!&$ (! #'(&$&3%')45600()0$173)!899@ A)$BC&% 2! $ '&&$%()DEFGH"I6D&PA2DQR6D&PA!()DQR6D&STAB

UVWXY`a!()!&)$)$'($'!($#$( b!3!()!c!&c!&! !()73)0()0$1)B


69!@ A '! 5  B#   fghipqrstuvwxyw€p‚w€ƒ€„v€wvyvuwvu„w  !" " #$" %&!' #"0'0!!"!$#!  09 $$#G†&‡! ()%!!!"!!!01  2%34567$$#89)@A! ‰ †&!ˆW ˆ“ •CDRCC ˆWD E Dˆ“D E DHIHHPH ”98 QD$$#89) 1#% %&#!9$ B4CDCCC# 9ECCF%#B !'# ‰ ‘’ ‘’ †&!ˆW ˆ“ 0#@ "0#9!9 $$A D %  B0 @A!1##9 '#"0ECCF%#B '#90–$—$ 989A9G#9 9A9'#HIHHPHQ!!"@A"096 %&#!9$ B0989A9G#9 9A9'#HIHHPHQ! 20!)@A!1"##!$ "!&0!A!10 !!" 4CDCCCECCF%#B !'#90#@ "0#9!9 " !#A" B2%!R 3SD3TD34CD344 34E%A1   $$A 6˜B !10$$#89)@A!™C6C41# %8"&0!96 " !#!" " !G 0!B0''# "%) 1 0!$$#"0 d  BA#!!0$6B6•e UVWXY` aWabcdWe A#!9 BECCF%1 1!D!A! '&#$" !B !Q6 #!!"!0" #'#8$"(D0#0 –$—$B "9$G #f#gECChQ#ijkB " 9$G† BgEC4CQ1#A!6l !'#10"00# ) !"#$@#$$#B 1###@'#90mi3i2% n#1!#G#"Fo˜'3pQ6 A#!A!#!01  2%!34RD 34• 34q6 20$$#899$#")@A!#$#9#&   "9$#! %1 0''# ()%!!!"!  0B0B0#B !!01 B$#!! (!# ! 9A$ !!"!D#0#0  ##1& "018"$  !" " !6

5q lA&36r"%!g   !"#   !"# )01200    $ %&%%'% ( 567869@ A@BCD986@796@EF    $ $ 34           $ $ )000000 0GH2 0GHIP 0GQ)0 0GH0R 0GR2R 0GQ)P 0GS0R 0G00P

P)Q00000 0GPI 0G00) 0G00Q 0G0SH 0G00S 0G00S 0GHH0 0G0P) TUVW1""XP P)H00000 0GIP 0G002 0G00R 0GQ) 0G0PI 0G0PH 0GR0R 0GPQ Y T PQ000000 0GSRS 0GIRS 0GHI 0GSQ2 0G2H 0GIRR 0GRI0 0G00Q Y T

QQ00000 0GI0 0G0I 0G0I 0GP0H 0G00S 0G0P2 0GPR 0G0)H !`!a1 9DbR01 cU21 TI QH00000 0GI 0G0PR 0G0PH 0G0S2 0G00R 0G0PI 0GPQH 0G)I TI1Wd I)aQ1 #TCV1"CYWd)

RI00000 0G0)S 0G0P0 0G0) 0G02P 0G0I0 0G0)R 0GPS 0G0RP T!`!1WY")1C5V" P1 `R1aUX!2P1#A52 RI200000 0G0Q 0G00P 0G02S 0G0IS 0G0H 0G0)R 0GPR 0GPQR VdWP RIQ00000 0G0)0 0G00I 0G02I 0G0)S 0G0PR 0G0QP 0GI0) 0GSI VdWP1TV`H1TV`HU RIH00000 0G0)R 0G00) 0G022 0G0Q) 0G0PR 0G0Q 0GS2 0GPS2 `C5U1"eWCPP1 9DbRH1W`VUC1V `A R200000 0GP) 0G0PP 0G0I0 0G0)R 0G00Q 0G0PH 0G0SQ 0GP)2 U!dVI1!UPV1` a"1Wd 2V) Wd 2V)1" `TP1 9DbHP1r"5)215`T1 R2200000 0GPI2 0G0P0 0G0I 0G0Q0 0G00Q 0G0P 0GP0I 0GH22 f@ghipqB0P0F cT!H0U1rUCP1!AW15Cd)I1 " P21 `` IP1dUX1C AaP1`dX1"sXP "sXP1VeCP1 `5V1d!XdI1"!P1PVC1 R2Q00000 0GR2 0G0S 0G0)P 0G0)H 0G002 0G0P) 0GPQH 0GQQ T6@ghipqB00)Ft!uughipqB00HF W#V2a R2H00000 0GIQ2 0G0QR 0G0)R 0G0Q) 0G00Q 0G0PH 0G)S 0GP)2 

H200000 0GPIS 0G02Q 0G0HQ 0G0HS 0G0IS 0GP 0GQPP 0G00I H2200000 0G0H) 0G00S 0G0R2 0G0R2 0G00 0GP00 0G)22 0G))) CE@6ghipqB0PQF We dAP1"TV Q H2Q00000 0G0R 0G00R 0G0Q2 0G0R) 0G0II 0G0Q) 0GIRQ 0GSPQ T6@ghipqB00)F "TV Q H2H00000 0G0HP 0G00H 0G0SS 0GP0S 0G0PH 0GPPH 0GIQ) 0G022 !uughipqB00HF "TV Q1`5VU"V1`WUP0

HR00000 0GPS 0G0)P 0G00P 0GIP 0GPR 0G0P0 0G00H 0GSP HR200000 0G2PR 0GPS 0G02 0GIS) 0G2)S 0G02 0G00R 0GQ0P HRQ00000 0GIS 0GPS 0G0P2 0GHQ 0G2S 0G0IR 0G002 0G0I 99@@ghipqB0P2F HS00000 0G0I 0GPR2 0GQ0 0G02 0G0RP 0G20 0G00S 0GPQP HSH00000 0G0IQ 0GRPH 0G)SP 0G00P 0G0Q2 0G020 0G00 0GHHI

S0H00000 0G0I2 0GP0 0GH00 0G00P 0G0P2 0G020 0G00P 0GSSH SP000000 0G0IH 0G0HQ 0GHI 0G0S 0G0HP 0G20Q 0G00I 0GH)

S)000000 0G002 0G0) 0G0 0G0P0 0GPI) 0G022 0G00Q 0GSHR Vd15CW)1vTa)P21vTa S)00000 0G00P 0G022 0G0II 0G0P2 0GPQ0 0G0)0 0G02 0GHSI vTa1C5!1 TcCI S)Q00000 0G00H 0GIS 0G0PI 0G0I 0G2) 0G02S 0G0PR 0GS22 `5c2I S)H00000 0G00H 0GIS 0G0PI 0G02 0G2IQ 0G02H 0G0PH 0GSHS VT5"IQ

P0I00000 0G0) 0G0I 0GPSP 0G0Q 0G0)Q 0GPIS 0GP)0 0G00S PP000000 0G00Q 0GPP 0G2PQ 0G00H 0GPS 0GPH 0G00I 0GR)R Vdd1TC CP

PPP000000 0G00Q 0GPP 0G0)) 0G000 0G000 0G00P 0G00I 0G0PS 5AC"Q15AC")1UeUP PPP00000 0GP2R 0G2RS 0G0SP 0G0P 0G00 0G00P 0G0IP 0G2Q V !Xd

PP00000 0GS0) 0G0SI 0GPI 0GSI0 0G0SI 0GP2R 0G2R0 0G00R f@ghipqB0P0F Adc

PI0Q00000 0G0P0 0G00) 0G00P 0G0) 0G0P 0G00Q 0G0PP 0GH)P C!`#a1 " R2U1WC"21v`U1`eUVI# PI0H00000 0G0P0 0G00R 0G000 0G0PR 0G00I 0G00 0G0P0 0GP) " PP)1cC21C`CTPH1C!`#c1 a PU PIP000000 0G00H a PU1 a P1C!`#w PIP00000 0GIR 0GP 0G0SR 0G0QH 0G0P0 0G002 0G0P) 0G00H AC5P2H1V#5I1V5 A#a2 PIP200000 0GH) 0G0QQ 0G0RS 0GPPH 0G00H 0G0)I 0G02 0G0IR V5 A#a21aVPQHU1Cd# U

PI)H00000 0G00R 0GPIH 0G0IP 0G0P) 0GP)Q 0GPH 0G0QH 0GSSQ v5VTUI PIQ000000 0G0P) 0G0HS 0G0SS 0G0P0 0GPI2 0G0HI 0G0R0 0GSQP T6@ghipqB00)F v5VTUI15I "P

PH00000 0G2SQ 0G2RP 0GISI 0G2HQ 0GI2Q 0GQH 0GIIQ 0G00 CE@6ghipqB0PQF #5d1T#e5!"P

PSI200000 0G0P 0G0H) 0G00 0G00 0G0I 0G00) 0G0Q2 0GRII PSIH00000 0G0PS 0GP2 0G0P 0G00R 0G02S 0G00Q 0GP0 0G)2) PS200000 0G00) 0G0I2 0G0P) 0G002 0G00) 0G0PQ 0G0) 0GHSP PS2200000 0G00 0G0P 0G0)I 0G00) 0G00Q 0G0PS 0G0IS 0GRH CE@6ghipqB0PQF PS2Q00000 0G0IR 0G000 0G0RR 0G02 0G00P 0G0) 0G02H 0G)0 PS2H00000 0GPIH 0G00 0G0SI 0G0S 0G00 0G0H 0G020 0GQ2R

PSH000000 0G00S 0G0IQ 0G02 0G0PH 0GP00 0G0)H 0G0H0 0GSS2 WaIUP1 !sP0U1 WC"P1 WC#Px!U21 WC#P1!U215a`T PSH00000 0G000 0G0PP 0G0IH 0G00Q 0G0R0 0G0R) 0G0QQ 0GSSR 5a`T1V5W1U!dd1Cd dP

0H00000 0GQI 0GQRQ 0GIIQ 0G)S) 0GRPP 0G2I) 0GHH 0G00) TYdP1av")1Cd# I IS200000 0GSPR 0G)SH 0GS 0GQ2 0G)R0 0G00I `rcW`

y€‚ƒ„ †W7869@7@686@86‡6@8f68ˆ1 #e fD93993

d6@uE6‰ˆ66D6ˆ36E@8ˆD9b7869@ IR  $%&'(  $%&'( 456755        ) 010020 3 @ABC(D$E FGBE$CH    ) ) 89 " "   !#   !#  !"#  !"#     ) ) IP"55555 5QR"S 5Q55P 5QR"T 5QI77 5Q57T 5Q47" &U(VT6&U(V"6&U(V IP755555 5QRTT 5Q55W 5Q4T4 5Q"PS 5Q574 5QWR5

"I"55555 5QST5 5Q"7S 5Q5P" 5Q55S 5Q5"7 5QI4T F&XFYSX"6`aT "S"55555 5Q"5S 5QIT" 5Q5"W 5Q55I 5Q55I 5Q55T 5Q557 5Q545 (`%"

"SS55555 5Q5T4 5Q5T4 5Q55W 5Q55I 5Q55I 5Q55T 5Q55I 5Q54S F&XFYR "SR55555 5Q5TW 5Q5T7 5Q55R 5QIRS 5QISI 5Q4"T 5Q5"7 5Q55R

"RS55555 5Q5WS 5QRTP 5Q54S 5Q5I5 5Q5"I 5Q554 5Q55S 5Q5I4 F&XFYR6Y`FYIIV

X%U"6Y'YX6b@%Ya"P6UcVF%76 7I755555 5Q554 5Q"TT 5Q5II 5Q5"W 5QTS" 5Q5"" 5QI55 5QPWP U4TVI6YIY YIY6d4I6%UIV6@U`%6 7IS55555 5Q5I5 5QII" 5Q5IP 5Q5TT 5Q""W 5Q577 5QI"S 5QPPI %YU@(`"6%UIY 'dYT6(XXT6@(`e"6@(`da%76f 7IR55555 5Q5"5 5Q575 5Q"R4 5Q55R 5Q5S7 5Q57P 5QITR 5QR75 eYI6g'`WS6e`'4 7"555555 5Q5"P 5Q57W 5Q75" 5Q55S 5Q574 5Q5TW 5QII4 5QR"T `hipqrG"5ITHsChipqrG"5ISH e`'4

7S555555 5Q7IS 5Q5S4 5QT"R 5Q"SS 5Q5II 5Q5TT 5QTWW 5Q55" %hipqrG"5I7H 7S"55555 5Q5RW 5Q557 5Q557 5QI7S 5Q55P 5Q5T5 5Q"45 5Q55I %hipqrG"5I7Hs%thipqrG"5I5H @X%"7Y46f(e"6%Uua"6@X%I"YI 7S755555 5Q5R7 5Q55" 5Q55I 5QIIT 5Q55T 5Q5II 5Q"TR 5QSST 'cU6eVaI

S"755555 5Q55R 5Q5"S 5Q5SS 5Q5TI 5QI"T 5Q5PI 5Q5RW 5QPS7 %@aIFI6dYY5I5I6U`d76baeS5P W7S55555 5Q55" 5QI"" 5QIII 5Q5T4 5Q7RP 5QIPW 5QISR 5QPP4 @%Y(` RI555555 5QI4" 5Q4II 5QW44 5Q557 5Q7IR 5QI"4 5Q55R 5QP4W `@IW6%(VI6YTV" RR"55555 5Q4PP 5Q7"7 5QIPR 5QS5R 5Q75P 5Q""W 5Q77W 5Q557 %hipqrG"5I7H YT@"6%I4BvTRwYT@"6Y`da6baeWI5

xy€‚ƒ„ @EE$$$†$tA$ 6%(c%tB88I4

TR FA @Q‡Eˆhipqr   $%&'%%    ! " # 01234567895@6A B6CD32456A2@567E    ()       %%%%% %FG%H %F''G %FI%P %FIIH %F%QG %F%%H H%%%%%% %F%Q %F%'% %F%'$ %F%HH %F%RG %F%%% STGU&VWXI%

BUV&U23YIH&`aT$I&UUbUII& PH%%%%% %F%%R %F%RR %F%%' %F%G' %F%$P %FH% BDaI&0cDdPe&0eU'WIWD PQ%%%%% %F%I %F%$R %F%%Q %F%HG %F%HQ %FQIP VDeGG&VX0&XV8

GIQ%%%%% %F%%H %F%P$ %F%I %FIHG %F%$Q %FRQR 8fI G%%%%%% %F%% %F%P' %F%I% %FI% %F%'P %FR'% 8fI&bDgG%&0DW0d

P%%%%%% %F'RR %F%HG %FI$ %F%% %F%% %F$H P%%%%% %FI %F%IG %F%G' %F%%I %F%IG %FI$G UgDHXI&8hfUHX P'%%%%% %F%$$ %F%%$ %F%%H %F%G %F%I %F%%I 8hfUHX PH%%%%% %F%% U2266ipqrsC%I'EtUu6ipqrsC%I%E 8hfUHX PQ%%%%% %F%R %F% %F%%P %F%'% %F%I% %F%%' 8hfUHX&0DV&8hI

P'%%%%%% %FGG %F%GQ %FGGH %F%$G %FIHQ %F%%G WUdB&bBcf&d8dG

QP'%%%%% %FG'' %FGP %F$%P %FGP' %F%%Q %FPG QPH%%%%% %FQH %FI$ %FRG %FIQG %F%%H %F%PQ

QRQ%%%%% %F%IQ %F$'G %F%IQ %F$P' %F%%I %FRRG

R%Q%%%%% %F%%P %F%$ %F%%I %F%%P %F%%% %FRRG RI%%%%%% %F%I %F%$P %F%G$ %FI' %F%%$ %FI%$

R$%%%%%% %F%'% %F%QH %F%' %FGH %F%%' %FQQH We&VD0$&`aT$I'&`aT

RP%%%%% %F%%Q %F%H %F%GG %FI'$ %F%% %FRQI WaVbGH RP'%%%%% %F%I% %F%GR %F%GR %FG' %F%G %FRRH WaVbGHX

I%G%%%%% %F%R %F%%R %FI% %F%$$ %F%H% %FIG'

I%Q'%%%%% %F%Q% %F%%R %FI%% %F%$P %FGG %FRQI BUU&eS0I I%QH%%%%% %F%Q$ %F%%H %FI%Q %F%HH %F'' %FRQP U326ipqrsC%%$E eS0I&VWaXD&UUbUIGQ

II%%%%%% %F%IH %FPI %F%IR %F''G %F%% %FPH' Wee&aD DI III%%%%%% %F%IH %FPI %F%%I %F%I$ %F%% %F%QH VBDbH&VBDb$&XcXI II'%%%%% %F$P$ %FI%I %F$RH %F%R %FGQ %F%%Q Uu6ipqrsC%I%E BeS IHQ%%%%% %FGP %F%'I %F%QH %F%%' %FPH %F'H'

UUbUII$&SD'&DdDaIQ&Dfd8S& IG%Q%%%%% %F%%$ %F%G$ %F%'I %F%I %F%% %F%IP UTUIX IGI%%%%% %FGG$ %FGI% %F%QG %F%%% %F%%G %F%$% BDVI'Q&W8VG&WV B8T' IGI'%%%%% %F% %F%IG %FI'H %F%%$ %F%G$ %FI'R WV B8T'&TWIHQX&De8 X IGIH%%%%% %FIH %F%%$ %FI$H %F%%G %F%$I %FR'P De8 X&Dfd88 IGIQ%%%%% %F%RG %F%%H %F%PG %F%%R %F%' %FHI% vd GbS&dcXWGb&`dW

IPP%%%%%% %F'IQ %F%%P %FGHH %F%I% %F''R %FQ IPP%%%%% %F'%' %F%%G %FGQI %F%%$ %F'Q %F%I% IPP'%%%%% %F'P %F%%I %F'$ %F%I %FG$$ %F%%R Uu6ipqrsC%I%EtD5Aw3ipqrsC%%REtU326ipqrsC%%$E IPPH%%%%% %FI'Q %F%%G %FI'$ %F%I$ %FQH %F%'H aVaDWG

IRG'%%%%% %F%%% %F%I$ %F%% %F%' %F%PG %FR$ IRGH%%%%% %F%%R %F%$G %F%%Q %F%HQ %FI'% %F$R' IRGQ%%%%% %F%IP %FI'' %F%%G %F%G$ %FIIH %F$PQ

IR''%%%%% %F%G' %F%G %F%%R %F%% %FI$G %F$% IR'H%%%%% %F%$P %F%%% %F%$% %F%% %FI'R %FH%I U326ipqrsC%%$E IR'Q%%%%% %FI%I %F%% %F%$G %F%% %FIGI %F%%Q

IRQ%%%%% %F%%I %FIQR %F%I$ %FG% %F%PH %FRQI VTda&WV0&Xfee&DeUeI IRQ'%%%%% %F%I% %FI' %F%$Q %FH$ %F%R' %FRR' DeUeI

%GH%%%%% %F%%G %F'HH %F%'$ %FQ%% %F%$% %FRRR aX8WeI I$H%%%%% %F'%' %F%$P %F'$Q %F%Q$ %FQ %F%%$ WXUWI d WD'&WdB'X&bdg&SaB$& '%%%%% %F%%R b Bb &BWeG0d&a8c'

xy€‚ƒ„ 0A4526A6@5@45@645†@564u541@&U X‡ˆDdUu32(22(

e56w7@5‰155351(5764132YA4526 GR File S2. Supplementary tables S5, S6, S7 and S13. Tables S5-7 give the selection candidate lists used, and Table S13 gives selection statistic scores for corresponding genomic windows in HapMap data. (.xlsx, 45 KB)

Available for download as a .xlsx file at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.185900/-/DC1/FileS2.xlsx