Identifying and Classifying Shared Selective Sweeps from Multilocus Data

bioRxiv preprint doi: https://doi.org/10.1101/446005; this version posted October 17, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Article: Methods Identifying and classifying shared selective sweeps from multilocus data Alexandre M. Harris1;2 and Michael DeGiorgio1;3;4;∗ October 17, 2018 1Department of Biology, Pennsylvania State University, University Park, PA 16802, USA 2Molecular, Cellular, and Integrative Biosciences at the Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, PA 16802, USA 3Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA 4Institute for CyberScience, Pennsylvania State University, University Park, PA 16802,USA ∗Corresponding author: [email protected] Keywords: Expected haplotype homozygosity, multilocus genotype, ancestral sweep, convergent sweep Running title: Detecting shared sweeps 1 bioRxiv preprint doi: https://doi.org/10.1101/446005; this version posted October 17, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Abstract Positive selection causes beneficial alleles to rise to high frequency, resulting in a selective sweep of the diversity surrounding the selected sites. Accordingly, the signature of a selective sweep in an ancestral population may still remain in its descendants. Identifying genomic regions under selection in the ancestor is important to contextualize the timing of a sweep, but few methods exist for this purpose. To uncover genomic regions under shared positive selection across populations, we apply the theory of the expected haplotype homozygosity statistic H12, which detects recent hard and soft sweeps from the presence of high-frequency haplotypes. Our statistic, SS-H12, is distinct from other statistics that detect shared sweeps because it requires a minimum of only two populations, and properly identifies independent convergent sweeps and true ancestral sweeps, with high power. Furthermore, we can apply SS-H12 in conjunction with the ratio of a different set of expected haplotype homozygosity statistics to further classify identified shared sweeps as hard or soft. Finally, we identified both previously-reported and novel shared sweep candidates from whole-genome sequences of global human populations. Previously- reported candidates include the well-characterized ancestral sweeps at LCT and SLC24A5 in Indo-European populations, as well as GPHN worldwide. Novel candidates include an ancestral sweep at RGS18 in sub-Saharan African populations involved in regulating the platelet response and implicated in sudden cardiac death, and a convergent sweep at C2CD5 between European and East Asian populations that may explain their different insulin responses. 2 bioRxiv preprint doi: https://doi.org/10.1101/446005; this version posted October 17, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. Introduction Alleles under positive selection increase in frequency in a population toward fixation, causing nearby linked neutral variants to also rise to high frequency. This process results in selective sweeps of the diversity surrounding the sites of selection, which can be hard or soft [Hermisson and Pennings, 2005, Pennings and Hermisson, 2006a,b, Hermisson and Pennings, 2017]. Under hard sweeps, beneficial alleles exist on a single haplotypic background at the time of selection, and a single haplotype rises to high frequency with the selected variants. In contrast, soft sweeps occur when beneficial alleles are present on multiple haplotypic backgrounds, each of which increases in frequency with the selected variants. Thus, individuals carrying the selected alleles do not all share a common haplotypic background. The signature of a selective sweep, hard or soft, is characterized by elevated levels of linkage disequilibrium (LD) and expected haplotype homozygosity [Maynard Smith and Haigh, 1974, Sabeti et al., 2002, Schweinsberg and Durrett, 2005]. Thus, the signature of a selective sweep decays with distance from the selected site as mutation and recombination erode tracts of sequence identity produced by the sweep, returning expected haplotype homozygosity and LD to their neutral levels [Messer and Petrov, 2013]. Various approaches exist to detect signatures of selective sweeps in single populations, but few methods can identify sweep regions shared across populations, and these methods primarily rely on allele frequency data as input. Identifying sweeps common to multiple populations provides an important layer of context that specifies the branch of a genealogy on which a sweep is likely to have occurred. Existing methods to identify shared sweeps [Bonhomme et al., 2010, Fariello et al., 2013, Racimo, 2016, Librado et al., 2017, Peyrégneet al., 2017, Cheng et al., 2017, Johnson and Voight, 2018] leverage the observation that study populations sharing similar patterns of genetic diversity at a putative site under selection descend from a common ancestor in which the sweep occurred. Such approaches therefore infer a sweep ancestral to the study populations from what may be coincidental (i.e., independent) signals. Moreover, many of these methods require data from at least one reference population in addition to the study populations, and of these, most may be misled by sweeps in their set of reference populations. These constraints may therefore impede the application of these methods to study systems that do not fit their model assumptions or data requirements. Here, we develop an expected haplotype homozygosity-based statistic, denoted SS-H12, that addresses the aforementioned constraints. SS-H12 detects shared selective sweeps from a minimum of two sampled populations using haplotype data (Figure 1), and classifies sweep candidates as either ancestral (shared through common ancestry) or convergent (occurring independently). We base our approach on the theory of the H12 statistic [Garud et al., 2015, Garud and Rosenberg, 2015], which applies a measure of expected homozygosity to haplotype data from a single population. This single-population statistic has high power 3 bioRxiv preprint doi: https://doi.org/10.1101/446005; this version posted October 17, 2018. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission. to detect recent selective sweeps, and identifies hard and soft sweeps with similar power due to its unique formulation. For a genomic window in which there are I distinct haplotypes, H12 [Garud et al., 2015] is defined as I 2 X 2 H12 = (p1 + p2) + pi ; (1) i=3 where pi is the frequency of the ith most frequent haplotype, with p1 ≥ p2 ≥ · · · ≥ pI . The two most common haplotype frequencies are pooled into a single value to reflect the presence of at least two high- frequency haplotypes in the population under a soft sweep. This pooling yields similar values of H12 for hard and soft sweeps. In addition, the framework of the single-population statistic distinguishes hard and soft PI 2 sweeps using the H2/H1 ratio [Garud et al., 2015, Garud and Rosenberg, 2015], where H1 = i=1 pi is the 2 expected haplotype homozygosity, and where H2 = H1−p1 is the expected haplotype homozygosity omitting the most frequent haplotype. The value of H2/H1 is small under hard sweeps, because the second through Ith frequencies are small, as the beneficial alleles exist only on a single haplotypic background. Accordingly, H2/H1 is larger for soft sweeps [Garud et al., 2015], and can therefore be used to classify sweeps as hard or soft, conditioning on an elevated value of H12. We now define SS-H12, which provides information about the location of a shared sweep on the sample population phylogeny, and describe its application to a sample consisting of individuals from two populations. The SS-H12 approach measures the diversity within each population, as well as within the pool of the populations, and is a modification of the single-population expected homozygosity-based approach. Consider a pooled sample consisting of haplotypes from two populations, in which a fraction γ of the haplotypes derives from population 1 and a fraction 1−γ derives from population 2. For a pooled sample consisting of individuals from two populations, we define the total-sample expected haplotype homozygosity statistic H12Tot within a genomic window containing I distinct haplotypes as I 2 X 2 H12Tot = (x1 + x2) + xi ; (2) i=3 where xi = γp1i + (1 − γ)p2i, x1 ≥ x2 ≥ · · · ≥ xI , is the frequency of the ith most frequent haplotype in the pooled population, and where p1i and p2i are the frequencies of this haplotype in populations 1 and 2, respectively. H12Tot therefore has elevated values at the sites of shared sweeps because the pooled genetic diversity at a site under selection in each sampled population remains small. Next, we seek to define a statistic that classifies the putative shared sweep as ancestral or convergent PI 2 between the pair of populations. To do this, we define a statistic fDiff = i=1(p1i − p2i) , which measures the sum of the squared difference in the frequency of each haplotype between both populations. fDiff takes on values between 0, for population pairs with identical haplotype frequencies, and 2, for populations

Identifying and Classifying Shared Selective Sweeps from Multilocus Data

Approximating Selective Sweeps

A New Approach for Using Genome Scans to Detect Recent Positive Selection in the Human Genome

Refining the Use of Linkage Disequilibrium As A

Identification of Selective Sweeps, Major Genes, and Genotype by Diet Interactions Melanie D

Rapid Evolution of a Skin-Lightening Allele in Southern African Khoesan

Identifying the Favored Mutation in a Positive Selective Sweep

Selective Sweep Biotechnology Intelligence Unit Medical Intelligence Unit Molecular Biology Intelligence Unit

On the Unfounded Enthusiasm for Soft Selective Sweeps III: the Supervised Machine Learning Algorithm That Isn’T

A Spatially Aware Likelihood Test to Detect Sweeps from Haplotype Distributions

Genomic Insights Into Positive Selection

Human Pigmentation Variation: Evolution, Genetic Basis, and Implications for Public Health

Diseases Pardis Sabeti’S Search for Signals of Selection in the Human Genome