Supplementary Information for
Rapid Evolution of a Skin Lightening Allele in Southern African KhoeSan

Meng Lin, Rebecca L. Siford, Alicia R. Martin, Shigeki Nakagome, Marlo Möller, Eileen G. Hoal, Carlos D. Bustamante, Christopher R. Gignoux, Brenna M. Henn

Corresponding authors:
Meng Lin
Email: [email protected]
Brenna Henn
Email: [email protected]

This PDF file includes:
Supplementary text
Figs. S1 to S8
Tables S1 to S3
References for SI reference citations

www.pnas.org/cgi/doi/10.1073/pnas.1801948115

Supplementary Text

SI Materials and Methods

Phenotype measurement. Individuals’ corresponding baseline pigmentation was measured with a reflectance spectrophotometer (DermaSpectrometer DSMII ColorMeter, Cortex Technology, Hadsund, Denmark), as described in (1). Quantitative melanin units (log10 ratio of inverse reflectance read from the device) were recorded in five measurements on each of left and right upper inner arms. We took the trimmed means, after removing the highest and lowest values of the five measurements, and then averaged the two arms.

Quality control of targeted sequence. Because there were not enough variants to run VQSR, we applied hard filters for quality control. We removed samples with <10x mean coverage, samples with a ≥8% contamination rate measured by verifyBamID, and highly discordant samples (concordance with genotyping array <95%). The SLC24A5 resequencing region had below average GC content (0.33) compared to other targeted regions. On average, individuals had 47x coverage in the capture region.

Global ancestry estimation. Out of all 430 samples, we estimated the ancestry proportions from 402 individuals that had genome-wide array data (see (1) for genotyping platform information), using a block relaxation method to update ancestral allele frequency and ancestry fraction in ADMIXTURE (2).
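As an illustration of the trimmed-mean step described under Phenotype measurement, the following minimal Python sketch drops the highest and lowest of five readings per arm and then averages the two arms. The function names and the example readings are hypothetical and are not taken from the study's pipeline or data.

```python
from statistics import mean


def trimmed_mean(readings):
    """Mean of five readings after dropping the single highest and lowest."""
    if len(readings) != 5:
        raise ValueError("expected five readings per arm")
    ordered = sorted(readings)
    # Keep the three middle values only.
    return mean(ordered[1:-1])


def baseline_melanin_index(left_arm, right_arm):
    """Average of the per-arm trimmed means (hypothetical helper)."""
    return (trimmed_mean(left_arm) + trimmed_mean(right_arm)) / 2


# Hypothetical readings, for illustration only.
left = [52.1, 53.4, 51.8, 60.2, 52.9]
right = [54.0, 53.1, 49.5, 53.7, 54.4]
print(baseline_melanin_index(left, right))
```

Trimming one value from each end limits the influence of a single aberrant reading on either arm.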
First, we ran an analysis on 361 individuals genotyped on Illumina Omni2.5 and Illumina MEGA, in order to maximize the number of loci overlapping among platforms. We included French (N = 28), French Basque (N = 23), Maasai trio parents from Kenya (N = 36), Namibian Ju/’hoansi San (N = 6), Druze from Near East (N = 42), and Yoruba from Nigeria (N = 22) genotyped on the Illumina MEGA platform from the Population Architecture using Genomics and Epidemiology (PAGE) dataset (http://www.pagestudy.org); and /Gui and //Ghana from central Botswana (N = 7), Nama from Namibia (N = 7), Ju/’hoansi from Namibia (N = 17), and Herero from Namibia (N = 8) genotyped on Illumina Omni2.5 from Schlebusch et al. (3). We removed recently admixed individuals as identified by Schlebusch et al. All datasets were oriented to Human Genome Build hg19 and dbSNP v144, and only biallelic loci with the same allele codes as phase 3 of the 1000 Genomes Project were kept. After merging the datasets, we further removed singletons and variants with > 5% genotyping missingness, resulting in 316,820 SNPs. Due to relatedness in both KhoeSan groups, we separated related ≠Khomani and Nama individuals into 19 running groups, and then added the remaining individuals, who were unrelated to anyone else in the population, to every running group (named “common individuals” as shown in Figure 2). Each running group contained 118 to 120 maximally unrelated samples, and the ancestries of the “common individuals” were averaged across the 19 repeated runs. The threshold of unrelatedness here was defined as pairwise identity-by-descent (IBD) < 1500 cM, as this value set apart clusters of reported relationships when matched with ethnographic information (see Methods in (1)). We ran ADMIXTURE from k = 2 to 7, with 5 iterations per running group at each k. By matching different ancestry clusters on repeating individuals and reference panels across different running groups, we chose k = 6 as giving the most stable estimates.
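The averaging of "common individuals" across running groups can be sketched as follows. This is a minimal illustration, assuming each run is available as a mapping from sample ID to a list of ancestry fractions whose cluster columns have already been matched across runs; the data structure and sample IDs are hypothetical and do not reflect ADMIXTURE's output format.

```python
from collections import defaultdict


def average_across_runs(runs):
    """Average each sample's ancestry fractions over every run it appears in.

    runs: list of dicts mapping sample ID -> list of K ancestry fractions.
    Assumes cluster columns are already aligned across runs.
    """
    sums = {}
    counts = defaultdict(int)
    for run in runs:
        for sample, fractions in run.items():
            if sample not in sums:
                sums[sample] = [0.0] * len(fractions)
            for k, value in enumerate(fractions):
                sums[sample][k] += value
            counts[sample] += 1
    return {s: [v / counts[s] for v in total] for s, total in sums.items()}


# Hypothetical two-cluster example: "common_1" is present in both runs,
# while each related individual appears in only one running group.
runs = [
    {"common_1": [0.6, 0.4], "related_a": [1.0, 0.0]},
    {"common_1": [0.8, 0.2], "related_b": [0.0, 1.0]},
]
print(average_across_runs(runs))
```

Samples that appear in a single running group keep their single-run estimate, while repeated samples receive the mean over all runs that include them.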
Estimates of different running groups were further merged, with repeating individuals and reference panels remaining constant. Then, at k = 6, we assigned ancestries in the KhoeSan samples to four general categories: European, Bantu, San, and East African, by combining southern and northern San proportions into an integrated San ancestry, and ancestries predominantly represented in both Maasai and Druze into East African. We assigned the same categories to another 41 KhoeSan individuals, genotyped on different platforms, whose ancestries were estimated in (1) at k = 7, by combining southern and northern San proportions into integrated San ancestry, and ancestries predominantly represented in Hadza, Sandawe, and Mozabite into East African. We then integrated estimates from the 361 samples described above and the 41 samples from (1) based on the four categories using pong (4).

Tested demographic and evolutionary models.

Model 1: Neutral allele derived from the European migration (“European Source”)
The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya, and was under selection with a selection coefficient of 0.08 (5), while this derived allele remained absent in KhoeSan (Ne = 20,000) and Bantu-speaking agropastoralists (Ne = 17,000). Nama underwent a bottleneck 20 generations ago, with the effective population size decreasing to 10,000. 14 generations ago (6), KhoeSan received gene flow from a group of Bantu-speakers at 13% in the ≠Khomani San and 2% in Nama. 7 to 10 generations ago, one pulse of gene flow from Europeans into KhoeSan happened at a migration rate of 12% for ≠Khomani San and 17% for Nama. The introduced derived allele of interest drifted to the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.
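To make the starting point of the drift in Model 1 concrete, the sketch below applies the standard single-pulse admixture update, p' = (1 - m) p + m p_source, to the two gene-flow events. It assumes, for illustration only, that the derived allele was at frequency 1.0 in the European source at the time of the pulse; that value is an assumption of this sketch, not a parameter stated in the model description, and the function names are hypothetical.

```python
def pulse(p_recipient, p_source, m):
    """Allele frequency in the recipient after one admixture pulse of size m."""
    return (1.0 - m) * p_recipient + m * p_source


def expected_frequency(m_bantu, m_european, p_european=1.0):
    """Expected derived-allele frequency right after the European pulse.

    p_european=1.0 is an illustrative assumption, not a value from the source.
    """
    p = 0.0                               # allele absent in KhoeSan before contact
    p = pulse(p, 0.0, m_bantu)            # allele also absent in Bantu-speakers
    p = pulse(p, p_european, m_european)  # European pulse introduces the allele
    return p


# Migration rates and observed frequencies as given for Model 1.
populations = {
    "khomani_san": (0.13, 0.12, 0.325),
    "nama": (0.02, 0.17, 0.535),
}
for name, (m_bantu, m_european, observed) in populations.items():
    print(name, expected_frequency(m_bantu, m_european), observed)
```

Under this assumption the post-admixture frequency equals the European migration rate, so the neutral model requires drift alone to carry the allele from that level to the observed frequency within 7 to 10 generations.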

Model 2: Selected allele derived from the European migration (“European Source”)
The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya, and was under selection with a selection coefficient of 0.08 (5), while this derived allele remained absent in KhoeSan (Ne = 20,000) and Bantu-speaking agropastoralists (Ne = 17,000). Nama underwent a bottleneck 20 generations ago, with the effective population size decreasing to 10,000. 14 generations ago (6), KhoeSan received gene flow from a group of Bantu-speakers at 13% in ≠Khomani San and 2% in Nama. 7 to 10 generations ago, we model a pulse of gene flow from Europeans into the KhoeSan at a rate m = 12% for ≠Khomani San and 17% for Nama. The derived allele was under selection in KhoeSan, with a selection coefficient s, reaching the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.

Model 3: Neutral allele derived from eastern African pastoralists (“East African Source”)
The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya, and was under selection with a selection coefficient of 0.08 (5), while the derived allele was absent in KhoeSan (Ne = 20,000), Bantu-speaking agropastoralists (Ne = 17,000), and eastern African pastoralists (Ne = 17,000). About 5 to 10 kya, gene flow from Europeans to eastern African pastoralists happened at 30% (relative to the size of eastern African pastoralists). The introduced allele was not subject to selection in eastern African pastoralists. 900 – 3,000 years ago (7), a group of eastern African pastoralists migrated southward and introduced gene flow into KhoeSan at 2% to 10% (Supplementary Text). Nama underwent a bottleneck 20 generations ago, with the effective population size decreasing to 10,000. 14 generations ago (6), KhoeSan received gene flow from a group of Bantu-speakers at 13% in ≠Khomani San and 2% in Nama. 7 to 10 generations ago (6), one pulse of migration from Europeans into KhoeSan happened at a rate of 12% for ≠Khomani San and 17% for Nama.
The derived allele was neutral in KhoeSan, and drifted to the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.

Model 4: Selected allele derived from eastern African pastoralists (“East African Source”)
The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya, and was under selection with a selection coefficient of 0.08 (5), while the derived allele was absent in KhoeSan (Ne = 20,000), Bantu-speaking agropastoralists (Ne = 17,000), and eastern African pastoralists (Ne = 17,000). About 5 to 10 kya, gene flow from Europeans to eastern African pastoralists happened at 30% (relative to the size of eastern African pastoralists). The introduced allele was not subject to selection in eastern African pastoralists. 900 – 3,000 years ago (7), a group of eastern African pastoralists migrated southward and introduced gene flow into KhoeSan at 2% to 10%. Nama underwent a bottleneck 20 generations ago, with the effective population size decreasing to 10,000. 14 generations ago (6), KhoeSan received gene flow from a group of Bantu-speakers at 13% in ≠Khomani San and 2% in Nama. 7 generations ago (6), one pulse of migration from Europeans into KhoeSan happened at a rate of 12% for ≠Khomani San and 17% for Nama. The derived allele was under selection in KhoeSan, and reached the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.

We obtained the demographic parameter values in each model from published data and from the estimates of global ancestries in this study (Supplementary Table 2, Supplementary Notes), either by adopting the point estimate, or by drawing from a distribution when the estimate carried wider uncertainty. The prior of the selection coefficient in Model 2 was drawn from a larger range of a uniform distribution, U(0.01, 1), and that in Model 4 from U(0.01, 0.2), in consideration of possibly stronger selection if the allele was introduced more recently.

Coalescent simulations. We performed coalescent simulations in a two-step framework.
First, we simulated forward-in-time allele trajectories under a Wright-Fisher model, where each trajectory recorded the frequency change of a focal allele in the population(s) over time. For each trajectory, the selection coefficient of the allele was sampled from a prior distribution, and the demographic parameters with a range of possible values were drawn anew (Table S2). In the neutral models, we set the selection coefficient to 1e-6. Note that our trajectory generator takes as input the selective advantage of the homozygote for the focal allele, which under the additive model is twice the selection coefficient of the allele drawn from the described distribution.
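The relationship between the drawn allele-level coefficient and the homozygote advantage can be illustrated with a minimal single-population Wright-Fisher sketch. This is not the authors' trajectory generator: it omits migration pulses and bottlenecks, the function names are hypothetical, and the default prior bounds simply reuse the U(0.01, 0.2) range quoted for Model 4. Genotype fitnesses are taken as 1, 1 + s, and 1 + 2s under the additive model.

```python
import random


def draw_selection_coefficient(rng, low=0.01, high=0.2):
    """Draw the allele-level selection coefficient from a uniform prior."""
    return rng.uniform(low, high)


def wright_fisher_trajectory(p0, s_allele, n_diploid, generations, rng):
    """Forward-in-time frequency trajectory of a focal allele.

    The homozygote advantage is 2 * s_allele and the heterozygote
    advantage is s_allele (additive model).
    """
    s_hom = 2.0 * s_allele
    s_het = s_allele
    copies = 2 * n_diploid
    p = p0
    trajectory = [p]
    for _ in range(generations):
        q = 1.0 - p
        mean_fitness = 1.0 + s_hom * p * p + s_het * 2.0 * p * q
        # Deterministic change due to selection.
        p_selected = (p * p * (1.0 + s_hom) + p * q * (1.0 + s_het)) / mean_fitness
        # Binomial sampling of 2N gene copies models genetic drift.
        count = sum(rng.random() < p_selected for _ in range(copies))
        p = count / copies
        trajectory.append(p)
    return trajectory


rng = random.Random(1)  # fixed seed so the sketch is repeatable
s = draw_selection_coefficient(rng)
# Small hypothetical population and starting frequency, for speed only.
print(wright_fisher_trajectory(p0=0.12, s_allele=s, n_diploid=200,
                               generations=10, rng=rng))
```

Passing s_allele = 1e-6 reproduces the effectively neutral setting described for the neutral models.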