Supplementary Information for

Rapid Evolution of a Skin Lightening Allele in Southern African KhoeSan

Meng Lin, Rebecca L. Siford, Alicia R. Martin, Shigeki Nakagome, Marlo Möller, Eileen G. Hoal, Carlos D. Bustamante, Christopher R. Gignoux, Brenna M. Henn

Corresponding authors: Meng Lin Email: [email protected]

Brenna Henn Email: [email protected]

This PDF file includes:

Supplementary text
Figs. S1 to S8
Tables S1 to S3
References for SI reference citations

www.pnas.org/cgi/doi/10.1073/pnas.1801948115

Supplementary Text

SI Materials and Methods

Phenotype measurement. Each individual’s baseline pigmentation was measured with a reflectance spectrophotometer (DermaSpectrometer DSMII ColorMeter, Cortex Technology, Hadsund, Denmark), as described in (1). Quantitative melanin units (log10 ratio of inverse reflectance read from the device) were recorded in five measurements on each of the left and right upper inner arms. For each arm we took the trimmed mean, after removing the highest and lowest of the five measurements, and then averaged the two arms.

Quality control of targeted sequence. Because there were not enough variants to run VQSR, we applied hard filters for quality control. We removed samples with <10x mean coverage, samples with a ≥8% contamination rate as measured by verifyBamID, and highly discordant samples (concordance with genotyping array <95%). The SLC24A5 resequencing region had below-average GC content (0.33) compared to other targeted regions. On average, individuals had 47x coverage in the capture region.
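The per-individual phenotype summary described under Phenotype measurement can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the readings are hypothetical, and it assumes exactly one highest and one lowest reading are dropped per arm before averaging.

```python
from statistics import mean


def trimmed_mean(readings):
    """Average the readings after dropping the single highest and lowest value."""
    if len(readings) < 3:
        raise ValueError("need at least three readings to trim both extremes")
    ordered = sorted(readings)
    return mean(ordered[1:-1])


def baseline_melanin_index(left_arm, right_arm):
    """Average the per-arm trimmed means into one value per individual."""
    return (trimmed_mean(left_arm) + trimmed_mean(right_arm)) / 2


# Hypothetical melanin-unit readings, five per arm (not real data).
left = [52.1, 53.0, 51.8, 60.2, 52.4]
right = [50.9, 51.5, 51.1, 51.3, 45.0]
print(round(baseline_melanin_index(left, right), 2))
```

Trimming before averaging keeps a single aberrant reading on one arm (here 60.2 or 45.0) from shifting the individual's value.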

Global ancestry estimation. Out of all 430 samples, we estimated ancestry proportions for the 402 individuals that had genome-wide array data (see (1) for genotyping platform information), using the block relaxation method in ADMIXTURE (2) to update ancestral allele frequencies and ancestry fractions. First, we ran an analysis on 361 individuals genotyped on Illumina Omni2.5 and Illumina MEGA, in order to maximize the number of loci overlapping among platforms. We included French (N = 28), French Basque (N = 23), Maasai trio parents from Kenya (N = 36), Namibian Ju/’hoansi San (N = 6), Druze from the Near East (N = 42), and Yoruba from Nigeria (N = 22) genotyped on the Illumina MEGA platform from the Population Architecture using Genomics and Epidemiology (PAGE) dataset (http://www.pagestudy.org); and /Gui and //Ghana from central Botswana (N = 7), Nama from (N = 7), Ju/’hoansi from Namibia (N = 17), and Herero from Namibia (N = 8) genotyped on Illumina Omni2.5 from Schlebusch et al. (3). We removed recently admixed individuals as identified by Schlebusch et al. All datasets were oriented to Human Genome Build hg19 and dbSNP v144, and only biallelic loci with the same allele codes as the 1000 Genomes Project phase 3 were kept. After merging the datasets, we further removed singletons and variants with > 5% genotype missingness, resulting in 316,820 SNPs. Due to relatedness in both KhoeSan groups, we separated related ≠Khomani and Nama individuals into 19 running groups, and then added the remaining individuals, who were unrelated to anyone else in the population, to every running group (named “common individuals” as shown in Figure 2). Every running group contained 118 to 120 maximally unrelated samples, and the ancestries of the “common individuals” were averaged across the 19 repeated runs. Unrelatedness was defined as pairwise identity-by-descent (IBD) sharing < 1500 cM, as this value set apart clusters of reported relationships when matched with ethnographic information (see Methods in (1)).

We ran ADMIXTURE from k = 2 to 7, with 5 iterations per running group at each k. By matching ancestry clusters in the repeated individuals and reference panels across running groups, we chose k = 6 as giving the most stable estimates. Estimates from different running groups were then merged, with repeated individuals and reference panels remaining constant. At k = 6, we assigned ancestries in the KhoeSan samples to four general categories: European, Bantu, San, and East African, by combining southern and northern San proportions into an integrated San ancestry, and ancestries predominantly represented in both Maasai and Druze into East African. We made the same category assignment for another 41 KhoeSan individuals genotyped on different platforms, whose ancestries were estimated in (1) at k = 7, by combining southern and northern San proportions into an integrated San ancestry, and ancestries predominantly represented in Hadza, Sandawe, and Mozabite into East African. We then integrated the estimates from the 361 samples described above and the 41 samples from (1) based on the four categories using pong (4).

Tested demographic and evolutionary models.

Model 1: Neutral allele derived from the European migration (“European Source”). The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya and was under selection with a selection coefficient of 0.08 (5); meanwhile, this derived allele remained absent in KhoeSan (Ne = 20,000) and Bantu-speaking agropastoralists (Ne = 17,000). Nama underwent a bottleneck 20 generations ago, with the effective population size reduced to 10,000. KhoeSan received gene flow from a group of Bantu-speakers 14 generations ago (6), at 13% in ≠Khomani San and 2% in Nama. One pulse of gene flow from Europeans into KhoeSan occurred 7 to 10 generations ago, at a migration rate of 12% for ≠Khomani San and 17% for Nama. The introduced derived allele then drifted to the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.

Model 2: Selected allele derived from the European migration (“European Source”). The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya and was under selection with a selection coefficient of 0.08 (5); meanwhile, this derived allele remained absent in KhoeSan (Ne = 20,000) and Bantu-speaking agropastoralists (Ne = 17,000). Nama underwent a bottleneck 20 generations ago, with the effective population size reduced to 10,000. KhoeSan received gene flow from a group of Bantu-speakers 14 generations ago (6), at 13% in ≠Khomani San and 2% in Nama. We modeled a pulse of gene flow from Europeans into KhoeSan 7 to 10 generations ago, at a rate of m = 12% for ≠Khomani San and 17% for Nama. The derived allele was under selection in KhoeSan, with a selection coefficient s, reaching the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.

Model 3: Neutral allele derived from eastern African pastoralists (“East African Source”). The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya and was under selection with a selection coefficient of 0.08 (5), while the derived allele was absent in KhoeSan (Ne = 20,000), Bantu-speaking agropastoralists (Ne = 17,000), and eastern African pastoralists (Ne = 17,000). About 5-10 kya, gene flow from Europeans to eastern African pastoralists occurred at 30% (relative to the size of the eastern African pastoralist population). The introduced allele was not subject to selection in eastern African pastoralists. Between 900 and 3,000 years ago (7), a group of eastern African pastoralists migrated southward and introduced gene flow into KhoeSan at 2%-10% (Supplementary Text). Nama underwent a bottleneck 20 generations ago, with the effective population size reduced to 10,000. KhoeSan received gene flow from a group of Bantu-speakers 14 generations ago (6), at 13% in ≠Khomani San and 2% in Nama. One pulse of migration from Europeans into KhoeSan occurred 7 to 10 generations ago (6), at a rate of 12% for ≠Khomani San and 17% for Nama. The derived allele was neutral in KhoeSan and drifted to the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.

Model 4: Selected allele derived from eastern African pastoralists (“East African Source”). The beneficial allele arose in Europeans (Ne = 10,000) about 13 kya and was under selection with a selection coefficient of 0.08 (5), while the derived allele was absent in KhoeSan (Ne = 20,000), Bantu-speaking agropastoralists (Ne = 17,000), and eastern African pastoralists (Ne = 17,000). About 5-10 kya, gene flow from Europeans to eastern African pastoralists occurred at 30% (relative to the size of the eastern African pastoralist population). The introduced allele was not subject to selection in eastern African pastoralists. Between 900 and 3,000 years ago (7), a group of eastern African pastoralists migrated southward and introduced gene flow into KhoeSan at 2%-10%. Nama underwent a bottleneck 20 generations ago, with the effective population size reduced to 10,000. KhoeSan received gene flow from a group of Bantu-speakers 14 generations ago (6), at 13% in ≠Khomani San and 2% in Nama. One pulse of migration from Europeans into KhoeSan occurred 7 to 10 generations ago (6), at a rate of 12% for ≠Khomani San and 17% for Nama. The derived allele was under selection in KhoeSan and reached the observed frequency of 32.5% in ≠Khomani San and 53.5% in Nama.

We obtained the demographic parameter values in each model from published data and from the estimates of global ancestry in this study (Supplementary Table 2, Supplementary Notes), either by adopting the point estimate or, where the estimate carries wider uncertainty, by drawing from a distribution. The prior on the selection coefficient in Model 2 was a uniform distribution over a wider range, U(0.01, 1), and that in Model 4 was U(0.01, 0.2), to allow for possibly stronger selection if the allele was introduced more recently.

Coalescent simulations. We performed coalescent simulations in a two-step framework. First, we simulated forward-in-time allele trajectories under a Wright-Fisher model, where each trajectory recorded the frequency change of a focal allele in the population(s) over time. A selection coefficient of the allele was sampled from a prior distribution per trajectory, and the demographic parameters with a range of possible values were drawn each time (Table S2). In the neutral models, we set the selection coefficient to be 1e-6. Note that the input of our trajectory generator takes the selection advantage of the homozygote of the focal allele, which is twice as large as the selection coefficient of the allele drawn from the described distribution under the additive model.

We accepted a trajectory only if its final allele frequency at present matched that of our observed data, with acceptance probability proportional to the binomial probability of the observed current frequency. Sampling of parameters was repeated until > 50,000 trajectories had been generated and kept for each model. We considered a model rejected at the allele frequency level if not a single trajectory could be accepted in 100,000 simulations (i.e. acceptance rate < 1e-5), where within each simulation the number of independent attempts reached a maximum of 10,000.
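The forward step and the frequency-based acceptance described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the trajectory generator used in the study: it assumes additive fitnesses of 1, 1+s, and 1+2s (so the homozygote advantage is twice the per-allele coefficient), a single constant-size population, and an acceptance probability scaled so that an exact frequency match is always accepted. All function names, the starting frequency, the population size, and the observed counts are hypothetical.

```python
import math
import random


def expected_next_freq(p, s):
    """Deterministic change under additive selection (fitnesses 1, 1+s, 1+2s)."""
    return p * (1 + s + s * p) / (1 + 2 * s * p)


def wright_fisher_trajectory(p0, s, n_diploid, generations, rng):
    """Forward-simulate the focal allele frequency with binomial drift each generation."""
    n_chrom = 2 * n_diploid
    freqs = [p0]
    for _ in range(generations):
        p_sel = expected_next_freq(freqs[-1], s)
        copies = sum(rng.random() < p_sel for _ in range(n_chrom))
        freqs.append(copies / n_chrom)
    return freqs


def log_binom_pmf(k, n, p):
    """Log binomial probability of k successes in n draws."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)


def acceptance_probability(p_final, k_obs, n_obs):
    """Binomial probability of the observed count, scaled so a perfect match gives 1."""
    best = log_binom_pmf(k_obs, n_obs, k_obs / n_obs)
    return math.exp(log_binom_pmf(k_obs, n_obs, p_final) - best)


def sample_accepted(n_wanted, prior, p0, n_diploid, generations,
                    k_obs, n_obs, rng, max_attempts):
    """Draw s from a uniform prior and keep trajectories that pass the acceptance step."""
    accepted = []
    for _ in range(max_attempts):
        s = rng.uniform(*prior)
        final = wright_fisher_trajectory(p0, s, n_diploid, generations, rng)[-1]
        if rng.random() < acceptance_probability(final, k_obs, n_obs):
            accepted.append((s, final))
            if len(accepted) == n_wanted:
                break
    return accepted


# Toy values for illustration only (small population so the example runs quickly).
rng = random.Random(1)
kept = sample_accepted(n_wanted=5, prior=(0.01, 1.0), p0=0.12, n_diploid=500,
                       generations=10, k_obs=175, n_obs=538, rng=rng,
                       max_attempts=200)
print(len(kept), "trajectories accepted")
```

The selection coefficients attached to the accepted trajectories form the distribution "conditional on allele frequency" that is later narrowed by the haplotype-based summary statistics.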

Following that, we performed coalescent simulations of the whole SLC24A5 gene using the simulator mssel (from R. Hudson upon request). We randomly sampled 50,000 trajectories from those that have been accepted on the allele frequency level for each model, to guarantee an equal number of simulations per model. Conditional on a trajectory, we repeated the simulation 10 times to emulate stochasticity. We ended up with 500,000 simulations under each model.

We assumed the mutation rate in the 31,700 bp region of SLC24A5 to be 1.9e-8 per bp per generation (8). The recombination rate was calculated from the African American recombination map (9), in which a relative hotspot is present against a generally low recombination rate background (Fig. S3). Because mssel takes a uniform recombination rate, and omitting the hotspot could affect the haplotype-based summary statistic, we modeled the recombination rate as a gamma distribution whose mode corresponds to the mean regional recombination rate and whose 99% quantile equals the hotspot recombination value (Fig. S3).
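A gamma distribution with shape k > 1 and scale theta has its mode at (k − 1)·theta, so fixing the mode at the mean regional rate gives theta = (mean rate)/(k − 1), the relation annotated in Fig. S3 with k = 1.1. The sketch below draws per-simulation recombination rates under that parameterization. The mean rate used here is a hypothetical placeholder, the function names are ours, and the empirical 99th percentile is printed only so it can be compared against a hotspot value; it is not tuned to one.

```python
import random


def gamma_scale_for_mode(mode, shape):
    """Scale theta such that a gamma(shape, theta) has the requested mode."""
    if shape <= 1:
        raise ValueError("a gamma distribution has an interior mode only for shape > 1")
    return mode / (shape - 1)


def draw_recombination_rates(mean_rate, shape, n_draws, rng):
    """Draw one recombination rate per simulation from the gamma model."""
    theta = gamma_scale_for_mode(mean_rate, shape)
    return [rng.gammavariate(shape, theta) for _ in range(n_draws)]


def empirical_quantile(values, q):
    """Simple order-statistic quantile, adequate for a sanity check."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


MEAN_RATE = 2e-9  # hypothetical mean regional rate, for illustration only
SHAPE = 1.1       # shape parameter k as annotated in Fig. S3

rng = random.Random(42)
rates = draw_recombination_rates(MEAN_RATE, SHAPE, 10_000, rng)
print("empirical 99% quantile:", empirical_quantile(rates, 0.99))
```

With a shape this close to 1 the distribution is strongly right-skewed, which is what lets occasional draws reach hotspot-like values while most draws stay near the regional background.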

Introducing sequencing errors to simulated sequences. High coverage next generation sequencing (NGS) can have around 4% false positive (FP) and 2% false negative (FN) errors when compared to more accurate Sanger sequencing, based on empirical observations (10). We emulated these error patterns in simulated San derived haplotypes by taking the number of variant sites multiplied by the FP and FN rates as the mean number of FP and FN errors introduced per haplotype. We then drew the actual number of errors per haplotype from a Poisson distribution with lambda equal to that mean. After orienting all simulated allele codes from ancestral / derived to reference / alternative, based on a random European haplotype to mimic the genome reference, we assigned the errors by randomly flipping reference alleles to alternative (FP) and alternative alleles to reference (FN).
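The error-injection step can be illustrated with the short sketch below. It assumes a haplotype is a list of 0/1 codes (reference/alternative) already oriented against a reference haplotype, and that the haplotype length equals the number of variant sites; the function names and the example haplotype are hypothetical, and the Poisson sampler is a standard textbook method rather than anything specified in the text.

```python
import math
import random


def poisson_draw(lam, rng):
    """Knuth's multiplication method; adequate for the small means used here."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_sequencing_errors(haplotype, fp_rate, fn_rate, rng):
    """Return a copy of a 0/1 (reference/alternative) haplotype with FP and FN flips."""
    n_sites = len(haplotype)
    ref_sites = [i for i, allele in enumerate(haplotype) if allele == 0]
    alt_sites = [i for i, allele in enumerate(haplotype) if allele == 1]
    # Mean errors per haplotype = number of variant sites x error rate.
    n_fp = min(poisson_draw(n_sites * fp_rate, rng), len(ref_sites))
    n_fn = min(poisson_draw(n_sites * fn_rate, rng), len(alt_sites))
    noisy = list(haplotype)
    for i in rng.sample(ref_sites, n_fp):
        noisy[i] = 1  # false positive: reference flipped to alternative
    for i in rng.sample(alt_sites, n_fn):
        noisy[i] = 0  # false negative: alternative flipped to reference
    return noisy


# Hypothetical haplotype with 200 variant sites, about 30% alternative alleles.
rng = random.Random(7)
haplotype = [1 if rng.random() < 0.3 else 0 for _ in range(200)]
noisy = add_sequencing_errors(haplotype, fp_rate=0.04, fn_rate=0.02, rng=rng)
print(sum(a != b for a, b in zip(haplotype, noisy)), "sites changed")
```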

Calculation of summary statistics. We chose summary statistics that reflect signatures of natural selection and patterns of phased data: pairwise nucleotide differences (π), number of segregating sites (S), and extended haplotype homozygosity (EHH). EHH was calculated separately for the flanking regions on either side of the focal allele using the R package ‘rehh’ (11), then integrated as the weighted mean from both upstream and downstream.

All summary statistics were only calculated on the derived haplotypes in KhoeSan, instead of on both ancestral and derived. Since the three summary statistics are supposed to reflect signatures of selection, ancestral haplotypes that contribute to more than half of total samples are likely to dominate the summary statistics, lowering our power to estimate strength of selection on 35-50% of the haplotypes in the observed data. This confounding impact from ancestral haplotypes might be even bigger given the great genetic diversity of KhoeSan in general.
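To make the restriction to derived haplotypes concrete, the sketch below computes pairwise nucleotide differences and the number of segregating sites only on haplotypes carrying the derived allele at a focal site. It assumes haplotypes are equal-length lists of 0/1 codes (ancestral/derived); the function names and the four toy haplotypes are hypothetical. EHH is not reproduced here, since it was computed with ‘rehh’ and its weighting is not specified in enough detail to re-implement.

```python
from itertools import combinations


def derived_haplotypes(haplotypes, focal_index):
    """Keep only haplotypes that carry the derived allele (1) at the focal site."""
    return [h for h in haplotypes if h[focal_index] == 1]


def segregating_sites(haplotypes):
    """Count sites at which the retained haplotypes are not all identical."""
    return sum(len(set(column)) > 1 for column in zip(*haplotypes))


def pairwise_differences(haplotypes):
    """Mean number of differing sites over all pairs of haplotypes."""
    pairs = list(combinations(haplotypes, 2))
    if not pairs:
        return 0.0
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


# Toy haplotypes; site 0 plays the role of the focal allele.
haps = [
    [1, 0, 0, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 1, 0],  # ancestral at the focal site, so excluded
]
derived = derived_haplotypes(haps, focal_index=0)
print(segregating_sites(derived), pairwise_differences(derived))
```

In this toy example the ancestral haplotype differs from the others at several sites, so including it would raise both statistics even though it says nothing about the selected background.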

Model comparison

We used the rejection algorithm in the ‘abc’ package (12) to compare the models. First, we set a 5% acceptance rate as the distance threshold from the observed data, where the distance is the normalized Euclidean distance between the simulated and the observed summary statistics; that is, only the 5% of simulations closest to the observed data were retained. Second, we calculated the posterior probability of each model using multinomial logistic regression, in which the model indicators are regressed on the simulated summary statistics.

Parameter inference. We estimated the posterior distribution of the selection coefficient under both scenarios, using a rejection algorithm from ‘abc’. Under each model, the top 1% of simulations closest to the observed data, based on the Euclidean distance on the summary statistics, were kept and their corresponding selection coefficient values form the posterior distribution.
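The rejection step shared by model comparison and parameter inference can be sketched as follows: scale each summary statistic, compute the Euclidean distance to the observed values, and keep the closest fraction of simulations. This is a minimal stand-alone sketch, not the ‘abc’ package itself. In particular, scaling by the standard deviation of the simulated statistics is our assumption for illustration, and the names and toy data are hypothetical.

```python
import math
from statistics import pstdev


def abc_rejection(observed, sim_stats, sim_params, accept_fraction):
    """Return the parameters of the simulations closest to the observed statistics."""
    # One scale per summary statistic so that no statistic dominates the distance.
    scales = [pstdev(column) or 1.0 for column in zip(*sim_stats)]
    distances = [
        math.sqrt(sum(((value - obs) / scale) ** 2
                      for value, obs, scale in zip(row, observed, scales)))
        for row in sim_stats
    ]
    n_keep = max(1, round(len(sim_stats) * accept_fraction))
    closest = sorted(range(len(distances)), key=distances.__getitem__)[:n_keep]
    return [sim_params[i] for i in closest]


# Toy example: 100 simulations whose two statistics depend linearly on s.
sim_params = [0.01 * i for i in range(1, 101)]
sim_stats = [[10 * s, 100 * s] for s in sim_params]
posterior = abc_rejection([5.0, 50.0], sim_stats, sim_params, accept_fraction=0.05)
print(sorted(round(s, 2) for s in posterior))
```

The retained parameter values are treated as draws from the approximate posterior; a smaller acceptance fraction gives a tighter but noisier approximation.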

Then, we conducted a posterior predictive check by sampling 1,000 s coefficients a posteriori. We then ran forward trajectory simulations and backward mssel using the sampled values, with the same procedure as described in the section Coalescent simulations. The distribution of summary statistics from the new simulated data was calculated and the observed data was compared to the 95% C.R. of the new distribution.
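The final comparison in the posterior predictive check amounts to asking whether the observed statistic falls inside the central 95% range of the statistics simulated from posterior draws. A minimal sketch, assuming a simple order-statistic definition of the range (the text does not state how the range was computed) and using hypothetical names and placeholder values:

```python
import math


def central_range(values, level=0.95):
    """Lower and upper bounds of the central `level` share of the simulated values."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    lower = ordered[math.floor(tail * (len(ordered) - 1))]
    upper = ordered[math.ceil((1 - tail) * (len(ordered) - 1))]
    return lower, upper


def observed_within_range(observed, simulated, level=0.95):
    """True when the observed statistic is covered by the simulated central range."""
    lower, upper = central_range(simulated, level)
    return lower <= observed <= upper


# Placeholder values standing in for one summary statistic from new simulations.
simulated = list(range(1001))
print(central_range(simulated), observed_within_range(500, simulated))
```

An observed value outside the range for any of the summary statistics would indicate that the fitted model does not reproduce that feature of the data.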

Demographic parameters for population history

We modeled major events in KhoeSan population history that affected the introduction of, and selection on, rs1426654*A. These included the timing of major migrations and the corresponding migration rates, population size changes, and divergence among populations (Table S2). In addition, we modeled selection on the allele in Europeans as a hard sweep, with a selection coefficient of 0.08 from Beleza et al. (5). We took most of these parameters from point estimates in previous literature or from the global ancestry estimates of this study (see “source” in Table S2). Some events, however, have greater uncertainty than others, and we detail below our choices for the most relevant migration events in the complex scenario.

We set a one-pulse migration event from western Eurasians to eastern African pastoralists at 30%, according to estimates of about 20-40% in Ethiopian Afro-Asiatic groups from Llorente et al. (13). The best-fitting origin of the western Eurasians in their study was identified as the same source as the Neolithic expansion into Europe, while an unadmixed ancient Ethiopian genome was set as the local source. The timing of this event has a slightly wider range: the mean estimates using genetic data across different populations could be > 3,000 years (14, 15), while the widely used ALDER method on admixture LD decay in these studies can be biased towards the younger event if migration happened more than once (16). The arrival of a large component of non-African ancestry (Ethio-Somali) in eastern Africans is estimated to predate local agriculture (~7 kya), with an upper bound at the Last Glacial Maximum (16). Yet, if this migration brought in the source of the derived SLC24A5 allele as we hypothesized, the event cannot predate the time when rs1426654*A reached prevalence in Europe; the allele age is dated to ~11 kya (5). We thus propose a prior on the timing of the migration of 5-10 kya.

Other key demographic parameters directly related to the introduction of the derived allele are the timing and rate of migration from eastern African pastoralists to southern African KhoeSan. The relevant range of timing was chosen based on estimates from both the tMRCA of the Y-chromosomal haplotype network between eastern and southern African groups (7) and admixture LD decay from autosomal data (15, 17). The proportion of East African / Eurasian ancestry in KhoeSan using Amhara as the reference is reported to be higher (9-30%) in (18). Yet the observed global ancestries in our KhoeSan samples contain only ~3% northern / eastern African ancestry (after excluding European, Bantu, and indigenous San components, Fig. 2). Ethiopian ancestry is also observed to be very low in San from Botswana in the ADMIXTURE analysis of Crawford et al. (19). Taken together, we adopted an intermediate range of 2-10% for the eastern African gene flow.

We also incorporated migration from Bantu-speakers, who have minimal levels of rs1426654*A, into both the “European Source” and “East African Source” scenarios. We modeled this in order to appropriately account for new recombinants of the rs1426654*A haplotype: recombination between a selected haplotype and a background Bantu haplotype could integrate fragments of the latter onto the former, resulting in a mosaic of selected haplotypes. Given that Bantu-speakers and KhoeSan are highly divergent (i.e., Fst = 0.115, with a divergence time of ~100-150 kya (3, 20-23)), these fragments from a vastly different haplotype background could affect the values of the chosen summary statistics on the derived haplotypes. Note that when modeling the Bantu gene flow described above, we set the derived allele frequency to 0 in Bantu for simplicity, which differs from the observed 7.6% in the present-day Luhya (Table S1). We consider this a valid simplification: with a maximum of m = 13%, the frequency of derived alleles is only 0.1%, which hardly changes the allele trajectory in the model. Moreover, the Luhya were unlikely to be the direct source of Bantu gene flow to KhoeSan, and the presence of the allele in contemporary eastern Bantu-speakers was likely due to contact with East Africans (Figure 1); thus the ancestral Bantu migrants of that time may not have carried the allele. The KhoeSan derive their western African ancestry from southern Bantu-speakers, but due to the absence of next-generation sequence data for these populations, we used the Luhya from the 1000 Genomes Project as the closest reference population, with the caveats mentioned above.

Lastly, we modeled a recent bottleneck in Nama 20 generations ago, with the effective population size reduced to 10,000. This is concordant with a higher inbreeding background in Nama than in ≠Khomani, observed as elevated cumulative runs of homozygosity (cROH) in the genome and a distinct ancestry identified across unrelated individuals at higher k in the global ancestry estimates (Fig. 2).

Test of neutral scenarios

We formally test neutral scenarios (i.e. no selection) in both “European Source” (with the allele from recent European gene flow) and “East African Source” (with the allele from eastern African pastoralists) models in our simulation pipeline (model description in Methods). The selection coefficient was set to 1e-6 to mimic neutrality. After 100,000 attempts to generate trajectories of the focal allele under each neutral model for either ≠Khomani or Nama, none resulted in a final allele frequency that matched the observed data (i.e. an acceptance rate < 1e-5). We thus conclude that a neutral scenario for either population is unlikely, hence there has been selection upon rs1426654*A, concordant with the result obtained from our deterministic model (Fig. 4).

Damara and Bosluis Baster ancestry

Damara are -speakers who live in Namibia and northern . Their language is mutually intelligible with Nama. However, their ancestry is distinct from KhoeSan and likely of central/west African origin, more similar to Herero (6, 24). Damara were historically hunter-gatherers who served as clients to the pastoralist Nama in Namibia. Gene flow into the Nama over the past few generations was inferred from self-reported ethnographic interviews in our sample. The “Bosluis Baster” community lives in the region of South Africa. Their oral history indicates that most individuals moved from Bushmanland (further east) to the Richtersveld in the 1940s and 1950s. We therefore excluded these Damara and Bosluis Baster individuals from our analysis to minimize recent ancestry heterogeneity.

Fig. S1 Median-joining haplotype network of SLC24A5 haplotypes carrying the derived allele at rs1426654. Colors denote populations: Europeans (CEU, red), Bantu-speaking Luhya (LWK, blue), and KhoeSan (SAN, yellow). Node size is proportional to the number of haplotypes, and branch length reflects the number of mutational differences between two nodes.

Fig. S2 Aligned haplotypes of SLC24A5, showing variable loci only. Each horizontal line represents a haplotype, with colors indicating whether the allele at each locus is ancestral (blue) or derived (red). Haplotypes are first divided by allele status at rs1426654 (position shown at bottom), then sorted by population. The other locus, rs2470102, has variation associated with eye color and is in complete LD with rs1426654 (25). Note that only variants in SLC24A5 are shown in the plot, hence the spacing between markers does not proportionally represent the true physical distance.

[Figure S3. Panel A, “Regional Recombination Rate”: recombination rate (cM/Mb) against position (bp) from 48,400,000 to 48,440,000, with SLC24A5 marked. Panel B, “Gamma Distribution”: density against recombination rate (per locus per generation), annotated with mode = mean RecRate, MaxRecRate at 99% C.I., k = 1.1, and theta = (mean RecRate)/(k−1).]

Fig. S3 Recombination rates in SLC24A5, (A) using the African American recombination map (9) and (B) modeled with a gamma distribution for coalescent simulations.

Fig. S4 Estimated selection coefficient under the deterministic Ohta and Kimura model. The contour plot reflects the expected selection coefficient of rs1426654*A under a deterministic model, given a combination of the time of migration (in generations, g) at which the derived allele was introduced (x-axis) and the initial allele frequency in the migrants (y-axis), and taking the current allele frequency to be 53.5% (i.e. observed in the Nama).

Fig. S5 The European Source model where rs1426654*A derives from recent European gene flow to KhoeSan. Arrows represent gene flow, with the relative timing and migration rate as annotated. Percentages at the bottom are the observed, current allele frequency in each population. In this model, eastern African pastoralists do not contribute to the introduction of derived SLC24A5 copies into KhoeSan. Detailed demographic parameters used in simulations are listed in Table S2.

[Figure S6. Six histogram panels with frequency on the y-axis: pi in Khomani, pi in Nama, segregating sites in Khomani, segregating sites in Nama, EHH in Khomani, and EHH in Nama. The x-axes are pi, number of segregating sites, and weighted mean EHH, respectively.]

Fig. S6 Histograms of summary statistics from posterior predictive checks for the East African Source model with selection (allele from eastern African pastoralists via gene flow), in ≠Khomani and Nama separately. Selection coefficients were drawn from the posterior distribution, based on which 5000 accepted trajectories were generated and went through coalescent simulations. Dashed lines represent 95% confidence intervals of the simulated summary statistics. Red or purple lines correspond to the values from the observed data in ≠Khomani or Nama, respectively.

[Figure S7. Two columns of panels, “European Source model with selection” and “Eastern African Source model with selection”, each with an upper panel “Conditional on allele freq” and a lower panel “Posterior”. The y-axis is count, the x-axis is s, and populations (Khomani, Nama) are distinguished by color.]

Fig. S7 Density of selection coefficients from accepted allele trajectories after forward simulations, and posterior distributions after ABC, under (A) European Source model with selection (i.e. allele from recent European gene-flow), and (B) East African Source model with selection (i.e. allele from eastern African pastoralists gene-flow).

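[Figure S8. Panels A and B: two-dimensional count plots of mPK (y-axis, 0.02 to 0.10) against tPK (x-axis, 40 to 100).]

Fig. S8 Joint distribution of posterior estimates of the timing in generations (tPK) and migration rate (mPK) from eastern African pastoralists to (A) ≠Khomani and (B) Nama.

Table S1 Allele frequency of rs1426654*A among African populations.

Population (N)                Frequency (%)   Source
KhoeSan (430)                 40.3            This study
≠Khomani San (269)            32.5            This study
Nama (161)                    53.5            This study
Luhya (99)                    7.6             1000 Genomes Project
Hadza (21)                    11.8            Henn 2011 (21)
Moroccans (48)                97.9            ALFRED (26)
Libya (66)                    95.5            ALFRED (26)
Tunisian (97)                 92.3            ALFRED (26)
Berber (30)                   91.7            ALFRED (26)
Algerian (34)                 91.2            ALFRED (26)
Mozabite (30)                 87.0            ALFRED (26)
Jews, Ethiopian (36)          55.6            ALFRED (26)
Somali (46)                   54.3            ALFRED (26)
Maasai (143)                  32.9            ALFRED (26)
Chagga (44)                   27.3            ALFRED (26)
Ju|’hoansi San (7)            21.0            ALFRED (26)
Mandenka (24)                 15.0            ALFRED (26)
Sandawe (37)                  12.2            ALFRED (26)
Hausa (38)                    10.5            ALFRED (26)
Zaramo (39)                   9.0             ALFRED (26)
Mende (85)                    8.8             ALFRED (26)
Lisongo (6)                   8.3             ALFRED (26)
Malinke (113)                 7.5             ALFRED (26)
Ibo (47)                      4.3             ALFRED (26)
Esan (99)                     2.5             ALFRED (26)
Yoruba (113)                  1.3             ALFRED (26)
Baika (68)                    0.7             ALFRED (26)
Mbuti (37)                    0               ALFRED (26)
Kenyan Bantu (12)             0               ALFRED (26)
Druze (100)                   100.0           ALFRED (26)
Samaritans (37)               100.0           ALFRED (26)
Turks (74)                    98.7            ALFRED (26)
Kurds (34)                    98.5            ALFRED (26)
Yemenite Jews (39)            97.4            ALFRED (26)
Palestinian Arabs (61)        96.7            ALFRED (26)
Iranian (42)                  96.4            ALFRED (26)
Kuwait (12)                   95.8            ALFRED (26)
Bedouin (49)                  93.0            ALFRED (26)
Saudi (97)                    89.2            ALFRED (26)
Arabs (69)                    88.5            ALFRED (26)
Nama (7)                      42.9            Schlebusch 2012 (3)
!Xun (13)                     26.9            Schlebusch 2012 (3)
South West Bantu (8)          12.5            Schlebusch 2012 (3)
Khwe (17)                     11.8            Schlebusch 2012 (3)
South East Bantu (19)         5.3             Schlebusch 2012 (3)
Karretjie (12)                4.2             Schlebusch 2012 (3)
Ju|’hoansi San (17)           2.9             Schlebusch 2012 (3)
GuiGhanaKgal (7)              0               Schlebusch 2012 (3)
Wolayta (8)                   62.5            Pagani 2012 (14)
Amhara (26)                   57.7            Pagani 2012 (14)
Tigray (21)                   57.1            Pagani 2012 (14)
Ethiopian Somali (17)         55.9            Pagani 2012 (14)
Oromo (42)                    38.1            Pagani 2012 (14)
AriCultivator (24)            22.9            Pagani 2012 (14)
Afar (12)                     25.0            Pagani 2012 (14)
AriBlacksmith (17)            11.8            Pagani 2012 (14)
Anuak (23)                    0               Pagani 2012 (14)
Copts (16)                    100.0           Hollfelder 2017 (27, 28)
Gaalien (15)                  63.3            Hollfelder 2017 (27, 28)
Halfawieen (11)               54.6            Hollfelder 2017 (27, 28)
Bataheen (16)                 53.1            Hollfelder 2017 (27, 28)
BeniAmer (16)                 46.9            Hollfelder 2017 (27, 28)
Danagla (15)                  43.3            Hollfelder 2017 (27, 28)
Shaigia (14)                  42.9            Hollfelder 2017 (27, 28)
Mahas (14)                    32.1            Hollfelder 2017 (27, 28)
Hadendowa (13)                26.9            Hollfelder 2017 (27, 28)
Messiria (8)                  25.0            Hollfelder 2017 (27, 28)
Nuba (16)                     12.5            Hollfelder 2017 (27, 28)
Hausa (7)                     7.1             Hollfelder 2017 (27, 28)
Gemar (7)                     7.1             Hollfelder 2017 (27, 28)
Dinka (16)                    0               Hollfelder 2017 (27, 28)
Nuer (16)                     0               Hollfelder 2017 (27, 28)
Shilluk (16)                  0               Hollfelder 2017 (27, 28)
Baria (5)                     0               Hollfelder 2017 (27, 28)
Zaghawa (16)                  0               Hollfelder 2017 (27, 28)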

Table S2 Demographic parameters used in coalescent simulations. Where two values are separated by a slash, the first applies to ≠Khomani San and the second to Nama; a single value applies to both.

Demographic parameter                                                         Value           Source

Divergence time (years)
  KhoeSan - other populations (T_KBPE)                                        110000          (20, 22, 23)
  Bantu agropastoralists - (Eastern African pastoralists, Europeans) (T_BPE)  75000           (23)
  Europeans - Eastern African pastoralists (T_PE)                             65000           (23)

Effective population size
  KhoeSan (N_K)                                                               20,000          (23)
  Bantu agropastoralists (N_B)                                                17,000          (23)
  Eastern African pastoralists (N_P)                                          17,000          (23)
  Europeans (N_E)                                                             10,000          (29)
  Ancestral Out of Africa population (N_OOA)                                  1,000           (30)
  Recent Bottleneck                                                           NA / 10,000     Sikora et al. unpublished

One pulse migration rate [1]
  West Eurasians -> Eastern African pastoralists                              30%             (13)
  Bantu agropastoralists -> KhoeSan                                           13% / 2%        Global ancestries
  Eastern African pastoralists -> KhoeSan                                     2% - 10%        Global ancestries, (15, 31)
  Europeans -> KhoeSan                                                        12% / 17%       Global ancestries

Migration time (years)
  West Eurasians -> Eastern African pastoralists                              10000-5000      (13, 16)
  Bantu agropastoralists -> KhoeSan                                           420             (6)
  Eastern African pastoralists -> KhoeSan                                     900-3000        (7, 15, 17)
  Europeans -> KhoeSan                                                        210 / 300       (6)

[1] Fraction of recipients replaced by migrants, scaled by Ne

Table S3 Summary of number of simulations.

                                                  European Source       European Source       East African Source   East African Source
                                                  under neutrality      under selection       under neutrality      under selection
                                                  ≠Khomani   Nama       ≠Khomani   Nama       ≠Khomani   Nama       ≠Khomani   Nama
Trajectories a priori [1]                         100,000    100,000    750,000    1,110,000  100,000    100,000    1,000,000  1,000,000
Accepted trajectories                             0          0          52,092     50,628     0          0          74,323     57,641
Accepted trajectories for coalescent simulations  0          0          50,000     50,000     0          0          50,000     50,000
Coalescent simulations                            0          0          500,000    500,000    0          0          500,000    500,000
Coalescent simulations a posteriori               0          0          5,000      5,000      0          0          5,000      5,000

[1] Models with no accepted trajectories after 100,000 simulations were rejected, with no input for coalescent simulations.

SI References

1. Martin AR, et al. (2017) An Unexpectedly Complex Architecture for Skin Pigmentation in Africans. Cell 171(6):1340–1353.e14.
2. Alexander DH, Novembre J, Lange K (2009) Fast model-based estimation of ancestry in unrelated individuals. Genome Research 19(9):1655–1664.
3. Schlebusch CM, et al. (2012) Genomic Variation in Seven Khoe-San Groups Reveals Adaptation and Complex African History. Science 338(6105):374–379.
4. Behr AA, Liu KZ, Liu-Fang G, Nakka P, Ramachandran S (2016) pong: fast analysis and visualization of latent clusters in population genetic data. Bioinformatics 32(18):2817–2823.
5. Beleza S, et al. (2012) The Timing of Pigmentation Lightening in Europeans. Molecular Biology and Evolution 30(1):24–35.
6. Uren C, et al. (2016) Fine-Scale Human Population Structure in Southern Africa Reflects Ecogeographic Boundaries. Genetics 204(1):303–314.
7. Henn BM, et al. (2008) Y-chromosomal evidence of a pastoralist migration through Tanzania to southern Africa. Proceedings of the National Academy of Sciences 105(31):10693–10698.
8. Nakagome S, et al. (2015) Estimating the Ages of Selection Signals from Different Epochs in Human History. Molecular Biology and Evolution 33(3):657–669.
9. Hinch AG, et al. (2011) The landscape of recombination in African Americans. Nature 476(7359):170–175.
10. Bobo D, Lipatov M, Rodriguez-Flores JL, Auton A, Henn BM (2016) False Negatives Are a Significant Feature of Next Generation Sequencing Callsets. bioRxiv:066043.
11. Gautier M, Klassmann A, Vitalis R (2017) rehh 2.0: a reimplementation of the R package rehh to detect positive selection from haplotype structure. Molecular Ecology Resources 17(1):78–90.
12. Csilléry K, François O, Blum MGB (2012) abc: an R package for approximate Bayesian computation (ABC). Methods in Ecology and Evolution 3(3):475–479.
13. Llorente MG, et al. (2015) Ancient Ethiopian genome reveals extensive Eurasian admixture in Eastern Africa. Science 350(6262):820–822.
14. Pagani L, et al. (2012) Ethiopian Genetic Diversity Reveals Linguistic Stratification and Complex Influences on the Ethiopian Gene Pool. The American Journal of Human Genetics 91(1):83–96.
15. Pickrell JK, et al. (2014) Ancient west Eurasian ancestry in southern and eastern Africa. Proceedings of the National Academy of Sciences 111(7):2632–2637.
16. Hodgson JA, Mulligan CJ, Al-Meeri A, Raaum RL (2014) Early Back-to-Africa Migration into the Horn of Africa. PLoS Genet 10(6):e1004393–18.
17. Busby GB, et al. (2016) Admixture into and within sub-Saharan Africa. Elife 5:e15266.
18. Schlebusch CM, et al. (2017) Southern African ancient genomes estimate modern human divergence to 350,000 to 260,000 years ago. Science 67:eaao6266–12.
19. Crawford NG, et al. (2017) Loci associated with skin pigmentation identified in African populations. Science 160:eaan8433–26.
20. Song S, Sliwerska E, Emery S, Kidd JM (2016) Modeling Human Population Separation History Using Physically Phased Genomes. Genetics 205(1):385–395.
21. Henn BM, et al. (2011) Hunter-gatherer genomic diversity suggests a southern African origin for modern humans. Proceedings of the National Academy of Sciences 108(13):5154–5162.
22. Gronau I, Hubisz MJ, Gulko B, Danko CG, Siepel A (2011) Bayesian inference of ancient human demography from individual genome sequences. Nat Genet 43(10):1031–1034.
23. Veeramah KR, et al. (2012) An Early Divergence of KhoeSan Ancestors from Those of Other Modern Humans Is Supported by an ABC-Based Analysis of Autosomal Resequencing Data. Molecular Biology and Evolution 29(2):617–630.
24. Barbieri C, et al. (2013) Ancient Substructure in Early mtDNA Lineages of Southern Africa. The American Journal of Human Genetics 92(2):285–292.
25. Beleza S, et al. (2013) Genetic Architecture of Skin and Eye Color in an African-European Admixed Population. PLoS Genet 9(3):e1003372–15.
26. Rajeevan H, et al. (2003) ALFRED: the ALelle FREquency Database. Update. Nucl Acids Res 31(1):270–271.
27. Hollfelder N, et al. (2017) Northeast African genomic variation shaped by the continuity of indigenous groups and Eurasian migrations. PLoS Genet 13(8):e1006976.
28. N H, et al. (2017) Data from: Northeast African genomic variation shaped by the continuity of indigenous groups and Eurasian migrations. PLOS Genetics.
29. Consortium 1GP (2010) A map of human genome variation from population-scale sequencing. Nature 467(7319):1061–1073.
30. Henn BM, Cavalli-Sforza LL, Feldman MW (2012) The great human expansion. Proceedings of the National Academy of Sciences 109(44):17758–17764.
31. Breton G, et al. (2014) Lactase Persistence Alleles Reveal Partial East African Ancestry of Southern African Khoe Pastoralists. Current Biology 24(8):852–858.