Y-Chromosome Variation in Southern African Khoe-San Populations Based on Whole-Genome Sequences
Thijessen Naidoo1,2,3,4,†, Jingzi Xu1,†, Mario Vicente1, Helena Malmström1,5, Himla Soodyall6,7,8, Mattias Jakobsson1,3,5, and Carina M. Schlebusch1,3,5,*
1Human Evolution, Department of Organismal Biology, Evolutionary Biology Centre, Uppsala University, Sweden
2Department of Archaeology and Classical Studies, Stockholm University, Sweden
3Science for Life Laboratory, Uppsala, Sweden
4Centre for Palaeogenetics, Stockholm, Sweden
5Palaeo-Research Institute, University of Johannesburg, Auckland Park, South Africa
6Division of Human Genetics, School of Pathology, Faculty of Health Sciences, University of the Witwatersrand, Johannesburg, South Africa
7National Health Laboratory Service, Johannesburg, South Africa
8Academy of Science of South Africa
*Corresponding author: E-mail: [email protected].
†These authors contributed equally to this work.
Accepted: 12 May 2020
Data deposition: The complete Y-chromosome sequences were deposited on the European Genome Phenome Archive (https://www.ebi.ac.uk/ega/), accession number EGAS00001004459, and are available for research use under controlled access policies.
Abstract
Although the human Y chromosome has effectively shown utility in uncovering facets of human evolution and population histories, the ascertainment bias present in early Y-chromosome variant data sets limited the accuracy of diversity and TMRCA estimates obtained from them. The advent of next-generation sequencing, however, has removed this bias and allowed for the discovery of thousands of new variants for use in improving the Y-chromosome phylogeny and computing estimates that are more accurate.
Here, we describe the high-coverage sequencing of the whole Y chromosome in a data set of 19 male Khoe-San individuals in comparison with existing whole Y-chromosome sequence data. Due to the increased resolution, we potentially resolve the source of haplogroup B-P70 in the Khoe-San, and reconcile recently published haplogroup A-M51 data with the most recent version of the ISOGG Y-chromosome phylogeny. Our results also improve the positioning of tentatively placed new branches of the ISOGG Y-chromosome phylogeny. The distribution of major Y-chromosome haplogroups in the Khoe-San and other African groups coincides with the emerging picture of African demographic history, with E-M2 linked to the agriculturalist Bantu expansion, E-M35 linked to pastoralist eastern African migrations, B-M112 linked to earlier east-south gene flow, A-M14 linked to shared ancestry with central African rainforest hunter-gatherers, and A-M51 potentially unique to the Khoe-San.
Key words: Y chromosome, next-generation sequencing, haplogroups, Khoe-San, southern Africa.
Introduction
The male-specific portion of the Y chromosome (MSY) has long been regarded as an effective tool in the study of human evolutionary history (Underhill and Kivisild 2007). It has proved useful mainly due to a lack of recombination along its length, which makes it the longest haplotypic block in the human genome (Scozzari et al. 2012), and to its paternal mode of inheritance. The transmission of an intact haplotype from father to son, changing only through mutation, preserves a simpler record of its history and allows us to study the male contribution to the shaping of humanity.
© The Author(s) 2020. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]
Genome Biol. Evol. 12(7):1031–1039. doi:10.1093/gbe/evaa098
The mutations found on Y chromosomes sourced from numerous human populations have been used to generate Y-chromosome phylogenies (Underhill et al. 2000; Hammer et al. 2001; Y-Chromosome Consortium 2002; Karafet et al. 2008), with well-defined and geographically informative haplogroups, that is, groups of haplotypes that share common ancestors and so present as clades in a phylogeny. While the first consensus phylogeny with standardized nomenclature was published in 2002 (Y-Chromosome Consortium 2002), with the last full version published in 2008 (Karafet et al. 2008) and a minimal reference version published in 2014 (van Oven et al. 2014), the International Society of Genetic Genealogy (ISOGG) maintains a comprehensive online version (https://isogg.org/tree/index.html) that is updated every year.
The utility of the Y-chromosome phylogeny notwithstanding, the ascertainment bias present when initially sourcing the mutations used to build it resulted in limitations. The use of predefined sets of variants limited the ability to generate unbiased estimates of global diversity and accurate estimates of the time to most recent common ancestor (TMRCA) (Poznik et al. 2013). As a solution, short tandem repeat polymorphisms (STRs) have long been used; however, STRs have their own biases and issues (see, e.g., Hallast et al. [2015] for a comparison between sequence-based and STR-based TMRCAs). The emergence of next-generation sequencing (NGS) technology resulted in the discovery of thousands of unbiased variants (Cruciani et al. 2011; Francalacci et al. 2013; Poznik et al. 2013; Scozzari et al. 2014; Karmin et al. 2015; Barbieri et al. 2016), which allowed for substantially more accurate estimates of the ages of the Y-chromosome phylogeny and its haplogroups.
Although studies usually attempted to balance the global representation of populations in their samples, there did appear to be an underrepresentation of African samples in the final results, with this being especially true of haplogroups A and B. Barbieri et al. (2016) addressed this discrepancy by sequencing a portion of the Y chromosome in 547 Khoe-San- and Bantu-speakers, resulting in much older estimates of the ages of haplogroups A and B and their subclades.
Following high-coverage sequencing of the full Y chromosome in a data set of 19 male Khoe-San individuals and comparison with existing full Y-chromosome sequence data, the present study describes the distribution of the haplogroups found in the wider African context. The high level of resolution afforded by a sequencing analysis allowed us to potentially resolve the source of haplogroup B-P70 in the Khoe-San. The use of two slightly differing phylogenies when assigning our haplogroups allowed for the reconciliation of the haplogroup A-M51 data from Barbieri et al. (2016) with the most recent version of the ISOGG Y-chromosome phylogeny.
Results
We performed high-coverage whole-genome sequencing of 25 Khoe-San individuals from five different populations (Schlebusch et al. 2020). In this study, we discuss the results of the Y-chromosome sequences of the 19 male individuals that were included in the study. After processing and filtering the sequence data (see Materials and Methods), we obtained 5,783 variants, with an average depth of 31×. Once merged with the comparative data from seven additional populations (Drmanac et al. 2010; Auton et al. 2015; Lachance et al. 2012), our total data set contained 7,878 variants from 8.8 Mb (8,800,463 bases) of Y-chromosome sequence in 48 individuals.
Khoe-San Haplogroups
The major haplogroups found in our sequenced individuals were A-M14 (A1b1a1), A-M51 (A1b1b2a), B-M112 (B2b), E-M2 (E1b1a1), and E-M35 (E1b1b1), and were thus strongly concordant with previous surveys of Y-chromosome variation in Khoe-San populations (Underhill et al. 2000; Naidoo et al. 2010; Batini et al. 2011; Barbieri et al. 2016). The additional resolution provided by sequencing, however, uncovered several new variants, especially supporting branches in haplogroups A and B, and allowed for greater clarity regarding the relationships of some subclades and markers (see table 1 and supplementary table S1, Supplementary Material online, for haplogroup assignments and population information).
The Internal Structure of A-M51
Haplogroup A1b1b2a (A-M51) has often been found to be the most common haplogroup A subclade in southern Africa (Barbieri et al. 2016), though usually found primarily in the Khoe-San or in populations with significant levels of Khoe-San admixture. Three previously reported subclades (Scozzari et al. 2012; Barbieri et al. 2016), haplogroups A1b1b2a1a (A-P71), A1b1b2a2 (A-V37), and A1b1b2a1b (A-V306), were found in our data set, and the branching structure was further refined and reconciled with the ISOGG Y phylogeny (ISOGG 2019-2020) (fig. 1 and table 1). Haplogroup A1b1b2a2 is the first to branch off, as reported by Barbieri et al. (2016), whereas haplogroups A1b1b2a1a and A1b1b2a1b are united by marker M239. Within haplogroup A1b1b2a1a, all three individuals were also derived at marker P102, with one of them ancestral at marker P291. This is at odds with the current ISOGG phylogeny, which has P291 basal to P102. The three subclades segregated independently among the populations, with A1b1b2a2 found in two !Xun individuals, A1b1b2a1a in three Nama individuals, and A1b1b2a1b in two Ju|'hoansi individuals.
Gene Flow into Khoe-San Populations
Haplogroup E1b1b1b2b2a1 (E-M293) was found in two Khoe-San: one Nama and one !Xun. This E-M35 subclade was previously linked to the movement of pastoralist groups from eastern Africa to southern Africa 2,000 years ago (Henn et al. 2008; Bajic et al. 2018). Moreover, the
Table 1 Population Information and Y-Chromosome Haplogroups of Individuals Used in the Study
Ref.