Genetic Ancestry of Hadza and Sandawe Peoples Reveals Ancient Population Structure in Africa
Total Page:16
File Type:pdf, Size:1020Kb
GBE Genetic Ancestry of Hadza and Sandawe Peoples Reveals Ancient Population Structure in Africa Daniel Shriner1, Fasil Tekola-Ayele2, Adebowale Adeyemo1,andCharlesN.Rotimi1,* 1Center for Research on Genomics and Global Health, National Human Genome Research Institute, Bethesda, Maryland 2Division of Intramural Population Health Research, Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, Maryland *Corresponding author: E-mail: [email protected]. Accepted: March 5, 2018 Abstract The Hadza and Sandawe populations in present-day Tanzania speak languages containing click sounds and therefore thought to be distantly related to southern African Khoisan languages. We analyzed genome-wide genotype data for individuals sampled from the Hadza and Sandawe populations in the context of a global data set of 3,528 individuals from 163 ethno-linguistic groups. We found that Hadza and Sandawe individuals share ancestry distinct from and most closely related to Omotic ancestry; share Khoisan ancestry with populations such as ¼6 Khomani, Karretjie, and Ju/’hoansi in southern Africa; share Niger-Congo ancestry with populations such as Yoruba from Nigeria and Luhya from Kenya, consistent with migration associated with the Bantu Expansion; and share Cushitic ancestry with Somali, multiple Ethiopian populations, the Maasai population in Kenya, and the Nama population in Namibia. We detected evidence for low levels of Arabian, Nilo-Saharan, and Pygmy ancestries in a minority of individuals. Our results indicate that west Eurasian ancestry in eastern Africa is more precisely the Arabian parent of Cushitic ancestry. Relative to the Out-of-Africa migrations, Hadza ancestry emerged early whereas Sandawe ancestry emerged late. Key words: Africa, ancestry, migration, population structure. Introduction Tishkoff et al. (2009) concluded that the Hadza population The Hadza and Sandawe populations in present-day Tanzania had 72% ancestry distantly related to Khoisan and Pygmy speak click languages thought to be distantly related to south- ancestries, with 22% Niger-Congo ancestry and 6% ern African Khoisan languages (Ehret 2000; Gu¨ldemannand Cushitic ancestry. Similarly, the Sandawe population had Vossen 2000; Heine and Nurse 2000). (Throughout, we use 73% ancestry distantly related to Khoisan and Pygmy “Khoe-San” to refer to people and “Khoisan” to refer to both ancestries, with 18% Niger-Congo ancestry and 9% language and ancestry, without implying identity.) Eastern Cushitic ancestry (Tishkoff et al. 2009). Henn et al. (2011) and southern African hunter-gatherer groups have been ge- concluded that 1) the Hadza and Sandawe populations share netically separated for at least 30,000 years (Tishkoff et al. ancestry with the South African ¼6 Khomani population but 2007). Herding and cultivating Cushitic speakers reached distinct from Pygmy ancestry, 2) the Hadza and Sandawe northern Tanzania 4,000 years ago, followed by pastoralist populations share substantial amounts of eastern African an- Nilo-Saharan speakers, and then followed by agricultural cestry with the Maasai population in Kenya, 3) the Hadza and Niger-Congo speakers 2,500 years ago (Newman 1995). Sandawe populations share ancestry with Niger-Congo- In the Hadza population, the distribution of Y chromo- speaking populations such as Yoruba from Nigeria and somes includes mostly B2 haplogroups, with a smaller num- Luhya from Kenya, and 4) the Sandawe population shares a ber of E1b1a haplogroups, which are common in Niger- small amount of ancestry with Europeans (represented by Congo-speaking populations, and E1b1b haplogroups, which Tuscans from Italy). Using whole-genome sequence data, are common in Cushitic populations (Tishkoff et al. 2007). In Lachance et al. (2012) concluded that Khoisan-speaking pop- the Sandawe population, E1b1a and E1b1b haplogroups are ulations diverged first, followed by divergence of Pygmies, more common, with lower frequencies of B2 and A3b2 hap- and then followed by divergence of the ancestors of the logroups (Tishkoff et al. 2007). Using autosomal data, Hadza and Sandawe populations. Pickrell et al. (2012) also Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution 2018. This work is written by US Government employees and is in the public domain in the US. This Open Access article contains public sector information licensed under the Open Government Licence v2.0 (http://www.nationalarchives.gov.uk/doc/open-government- licence/version/2/) Genome Biol. Evol. 10(3):875–882. doi:10.1093/gbe/evy051 Advance Access publication March 14, 2018 875 Shriner et al. GBE inferred that the Hadza and Sandawe populations shared an- Table 1 cestry with Khoisan-speaking populations, with gene flow Mean Ancestry Proportions in the Hadza and Sandawe Samples in Semi- around 3,000 years ago of west Eurasian ancestry into eastern Supervised Clustering Analysis, After Denoising and Rescaling Africa (Pickrell et al. 2014). Pseudo-Sample Hadza Sandawe The recent origin of modern humans in sub-Saharan Africa Amerindian 0 0 involves a basal divergence event such that one lineage Arabian 0 0.001 includes Khoisan ancestry in south Africa; Pygmy ancestry in Berber 0 0 central Africa; Niger-Congo ancestry across west, east, and Chinese 0 0 south Africa; and Cushitic, Nilo-Saharan, and Omotic ances- Cushitic 0 0.258 Indian 0 0 tries in east Africa (Shriner et al. 2014). The other lineage Japanese 0 0 includes Berber ancestry in north Africa; Indian and Kalash Kalash 0 0 ancestries in south Asia; Chinese, Japanese, and southeast Khoisan 0.073 0.088 Asian ancestries in east Asia; Siberian ancestry in north Asia; Levantine-Caucasian 0 0 Native American ancestry in the Americas; Melanesian ances- Niger-Congo 0 0.392 try in Oceania; southern and northern European ancestries; Nilo-Saharan 0 0 and Arabian and Levantine-Caucasian ancestries in the Northern European 0 0 Middle East and the Caucasus (Shriner et al. 2014). These Oceanian 0 0 ancestries reflect shared history at a scale bigger than tribes Omotic 0.862 0.261 or ethno-linguistic groups but smaller than continents. The Pygmy 0.066 0 divergence of ancestries is mainly due to random genetic drift Siberian 0 0 Southeastern Asian 0 0 following serial founder effects as modern humans peopled Southern European 0 0 the world (Li et al. 2008). A notable exception is Cushitic ancestry, which did not form by a splitting event but rather by a mixing event between Arabian ancestry and Nilo- Saharan or Omotic ancestry (Shriner et al. 2016). We previ- while allowing the samples to update the allele frequency ously described integration of genotype data from 12 human estimates. Thus, for each of the 19 ancestries we previously diversity projects, yielding 3,528 unrelated individuals from described (Shriner et al. 2014), we generated pseudo-samples around the world (Shriner et al. 2014). To more precisely by identifying which individuals had the highest percentage identify west Eurasian ancestry and to investigate the origins of that ancestry, regardless of the sample of origin. After and phylogenetic relationships of Hadza and Sandawe ances- denoising and renormalization, seven of the pseudo- tries in the global context, we merged samples from these samples were ancestrally homogeneous (supplementary two populations (Henn et al. 2011) into our data set. Using table S2, Supplemenatry Material online). We then estimated cluster analyses and analysis of ancestry-specific allele fre- the ancestral composition of the Hadza and Sandawe individ- quencies, we provide greater detail about the history of the uals using these pseudo-samples as reference training data. Hadza and Sandawe populations as well as novel insights into At the individual level, 12 Hadza individuals had ancestry cor- the ancestries of modern humans. responding to the Omotic pseudo-sample, nine had ancestry corresponding to the Khoisan pseudo-sample, eight had an- cestry corresponding to the Niger-Congo pseudo-sample, and Cluster Analyses in a Global Context one had contributions corresponding to the Cushitic, Nilo- We integrated 13 Hadza and 25 Sandawe individuals into our Saharan, and Pygmy pseudo-samples. Thus, the predominant global data set of 3,528 individuals. The process of merging ancestry of the Hadza individuals was most closely related to genotype data from different data sets and genotyping plat- Omotic ancestry. At the sample level, Niger-Congo ancestry forms left 19,206 SNPs. To address the possible effect of SNP was not significant because of large variance between indi- ascertainment bias on FST estimation, we compared pairwise viduals (table 1). There was no evidence for ancestry corre- estimates for the samples from the 1000 Genomes Project sponding to any of the Asian or European pseudo-samples. At (Auton et al. 2015) based on our panel of SNPs versus the the individual level, all 25 Sandawe individuals had contribu- whole genome sequences. The median difference was tions from Cushitic, Khoisan, Niger-Congo, and Omotic 0.0031 (95% confidence interval ½0:0002; 0:0177), indi- pseudo-samples. Additionally, one Sandawe individual had cating that FST estimation was not significantly biased by ancestry corresponding to the Arabian pseudo-sample and either SNP ascertainment or the number of SNPs (supplemen- one had ancestry corresponding to the Pygmy pseudo- tary table S1, Supplemenatry Material online). sample. Compared with the Hadza sample, the Sandawe We first performed semisupervised clustering. In this type sample had larger contributions from the Niger-Congo and of cluster analysis, the goal is to describe the Hadza and Cushitic pseudo-samples (table 1). As with the Hadza individ- Sandawe samples in terms of predefined allele frequencies uals, there was no evidence in the Sandawe individuals for 876 Genome Biol. Evol. 10(3):875–882 doi:10.1093/gbe/evy051 Advance Access publication March 14, 2018 Genetic Ancestry of Hadza and Sandawe Peoples GBE 2 3 4 5 6 7 8 9 10 11 12 FIG.1.—Unsupervised clustering analysis. The 47 samples are labeled alternating across the top and bottom. The numbers of ancestries are labeled in the left margin.